CN110349597A - Speech detection method and device (Google Patents)
- Publication number: CN110349597A (application CN201910594785.2A)
- Authority: CN (China)
- Prior art keywords: model, audio, training, voice, gmm
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/063—Training (under G10L15/06—Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/08—Speech classification or search
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/60—Speech or voice analysis techniques specially adapted for comparison or discrimination, for measuring the quality of voice signals
Abstract
Present disclose provides speech detection method and devices.Speech detection method includes building speech detection model;It is in series with RNN model again after being connected in parallel by the first GMM model, the second GMM model and LSTM model;The process of training speech detection model are as follows: respectively correspond the first GMM model of training, the second GMM model and LSTM model using voice data, non-speech data and voice and non-voice blended data, output phase should identify score value, a three-dimensional vector is formed, the vector as audio fragment characterizes;By audio fragment vector characterization one time series of composition at each moment, each moment previous moment and later moment in time, RNN model is trained as input quantity;The process of testing audio data are as follows: segmentation testing audio data are several audio fragments, it is input to the speech detection model of training completion one by one again, the audio fragment for obtaining the corresponding moment belongs to the probability value of voice, and audio fragment is determined as voice or non-voice by comparison probability value and given threshold.
Description
Technical field
The disclosure belongs to the field of speech detection, and more particularly relates to a speech detection method and device.
Background art
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Speech detection, one of the core tasks in the audio detection field, has attracted wide attention. It has broad application prospects: as a front-end preprocessing step for speech recognition, it extracts the speech data to be recognized from audio data and improves recognition efficiency; it can also locate a particular speaker's utterances in a meeting recording to build a meeting summary. With the rapid development of deep learning, deep neural networks are gradually replacing traditional machine learning classifiers in the speech detection field. Traditional machine learning models commonly used in audio detection include the Gaussian Mixture Model (GMM), the Hidden Markov Model (HMM), and the Support Vector Machine (SVM).
The inventors have found that traditional machine learning models suffer from the following problems:
1) The audio spectral features they operate on are high-dimensional, so the downstream neural network performs a large amount of computation; training and classification are time-consuming and operation efficiency is low.
2) The important information in the extracted audio samples is interfered with by redundant information, so the classification model cannot identify speech samples well, which reduces detection accuracy.
Summary of the invention
To solve the above problems, a first aspect of the present disclosure provides a speech detection method that effectively combines a GMM model, an LSTM model and an RNN model, so that the respective advantages of the three models are brought into full play and the overall classification and detection ability of the speech detection model is improved.
To achieve the above goal, the disclosure adopts the following technical scheme:
A speech detection method, comprising:
constructing a speech detection model, in which a first GMM model, a second GMM model and an LSTM model are connected in parallel and then connected in series with an RNN model;
training the speech detection model, as follows:
using speech data, non-speech data, and mixed speech/non-speech data to train the first GMM model, the second GMM model and the LSTM model respectively; each model outputs a recognition score, and the three scores form a three-dimensional vector that serves as the vector characterization of the audio clip;
composing the vector characterizations of each moment, its preceding moment and its following moment into a time series, and training the RNN model on these input sequences until the deviation of the average probability that the audio clips at all moments belong to speech meets a preset accuracy requirement;
testing audio data, as follows:
segmenting the test audio into several audio clips and feeding them one by one into the trained speech detection model, which outputs the probability that the clip at the corresponding moment belongs to speech.
Further, if the probability is greater than or equal to a set threshold, the audio clip at the corresponding moment is judged to be speech; otherwise it is judged to be non-speech.
The advantage of this technical solution is that comparing the speech probability of each audio clip against a set threshold to decide whether the clip at the corresponding moment belongs to speech makes the detection result more intuitive.
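As a minimal sketch of this decision rule (the function and variable names are illustrative, not from the patent), the threshold comparison can be written as:

```python
import numpy as np

def classify_clips(speech_probs, threshold=0.5):
    """Map per-clip speech probabilities to labels: 1 = speech, 0 = non-speech.

    A clip whose probability is greater than or equal to the threshold is
    judged to be speech, matching the rule described above.
    """
    probs = np.asarray(speech_probs, dtype=float)
    return (probs >= threshold).astype(int)

# Example: probabilities for five consecutive audio clips.
labels = classify_clips([0.92, 0.48, 0.50, 0.13, 0.77], threshold=0.5)
print(labels.tolist())  # [1, 0, 1, 0, 1]
```

Note that a probability exactly equal to the threshold is classified as speech, as the "greater than or equal to" rule above requires.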
Further, the first GMM model is trained as follows:
the audio clips containing only speech data are divided into frames, and the Mel-frequency cepstral coefficients (MFCCs) of a preset dimension are extracted from each frame as the audio features; each feature vector forms a training sample and is stored in a first training sample set;
the training samples in the first training sample set are input into the first GMM model, which outputs a speech recognition score for each audio frame; the scores of all frames in an audio clip are averaged to obtain the speech recognition score of that clip;
all parameters of the first GMM model are obtained by training on the first training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the first GMM model on audio clips containing only speech data, with all parameters estimated from the first training sample set by the expectation-maximization algorithm, reduces the training and classification time of the first GMM model and improves operation efficiency; the score of a sample on the speech model can be obtained more accurately, yielding a more accurate speech detection model and improving the detection accuracy of the entire speech detection model.
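A minimal sketch of this step using scikit-learn's EM-fitted `GaussianMixture` (random features stand in for real MFCCs; the 39-dimensional feature size and 5 mixture components follow the embodiment described later, and all names are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for MFCC features: one 39-dim vector per frame of speech-only audio.
speech_frames = rng.normal(loc=1.0, scale=0.5, size=(500, 39))

# The "first GMM": fitted with the EM algorithm on speech frames only
# (5 mixture components, as in the embodiment below).
speech_gmm = GaussianMixture(n_components=5, random_state=0).fit(speech_frames)

# Score one clip: per-frame log-likelihoods under the speech GMM, averaged
# over all frames of the clip, give the clip-level speech recognition score.
clip_frames = rng.normal(loc=1.0, scale=0.5, size=(8, 39))  # ~8 frames per clip
clip_score = speech_gmm.score_samples(clip_frames).mean()
```

The second GMM would be fitted the same way on non-speech frames; at test time a clip scoring high under one model and low under the other gives a clear contrast between the two components of the three-dimensional vector.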
Further, the second GMM model is trained as follows:
the audio clips containing only non-speech data are divided into frames, and the MFCCs of a preset dimension are extracted from each frame as the audio features; each feature vector forms a training sample and is stored in a second training sample set;
the training samples in the second training sample set are input into the second GMM model, which outputs a non-speech recognition score for each audio frame; the scores of all frames in an audio clip are averaged to obtain the non-speech recognition score of that clip;
all parameters of the second GMM model are obtained by training on the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the second GMM model on audio clips containing only non-speech data, with all parameters estimated from the second training sample set by the expectation-maximization algorithm, reduces the training and classification time of the second GMM model and improves operation efficiency; the score of a sample on the non-speech model can be obtained more accurately, yielding a more accurate non-speech detection model and improving the detection accuracy of the entire speech detection model.
Further, the LSTM model is trained as follows:
the audio clips containing both speech and non-speech data are divided into frames, and the MFCCs of a preset dimension are extracted from each frame as the audio features; these feature vectors are arranged in chronological order to form a time series;
the time series is input into the LSTM model, which outputs the speech recognition score of the corresponding audio clip;
the LSTM model is trained with the Adam optimization algorithm until its parameters are optimal.
The advantage of this technical solution is that training the LSTM model on audio clips containing both speech and non-speech data with the Adam optimization algorithm lets the contextual information of a sample be fully exploited during recognition; the probability score of a sample belonging to speech is obtained more accurately, yielding an accurate speech detection model and improving the detection accuracy of the entire speech detection model.
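The patent does not give the LSTM internals; as an illustrative sketch, a single LSTM layer stepping over a sequence of 39-dim MFCC frames can be written in plain NumPy (the weights here are random placeholders, not trained parameters; in practice a framework with an Adam optimizer would learn them):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_score(frames, hidden=16):
    """Run one LSTM layer over a (T, 39) MFCC sequence and squash the
    final hidden state into a scalar speech score in (0, 1)."""
    T, d = frames.shape
    # One weight block per gate: input, forget, cell candidate, output.
    W = rng.normal(scale=0.1, size=(4, hidden, d + hidden))
    b = np.zeros((4, hidden))
    w_out = rng.normal(scale=0.1, size=hidden)

    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in frames:
        z = np.concatenate([x, h])
        i = sigmoid(W[0] @ z + b[0])   # input gate
        f = sigmoid(W[1] @ z + b[1])   # forget gate
        g = np.tanh(W[2] @ z + b[2])   # candidate cell state
        o = sigmoid(W[3] @ z + b[3])   # output gate
        c = f * c + i * g              # cell state carries long-term context
        h = o * np.tanh(c)
    return sigmoid(w_out @ h)          # clip-level speech score

frames = rng.normal(size=(8, 39))      # stand-in for ~8 MFCC frames of a clip
score = lstm_score(frames)
```

The cell state `c` is what lets the LSTM exploit context across frames, which is the property the paragraph above relies on.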
Further, the RNN model is trained as follows:
the recognition scores output by the trained first GMM model, second GMM model and LSTM model are composed into a three-dimensional vector, which serves as the vector characterization of the audio clip;
the vector characterizations of the current moment, the preceding moment and the following moment are composed into a time series, which is used as the input for training the RNN model; the output is the probability that the audio clip at the current moment belongs to speech;
the RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio clips at all moments belong to speech meets a preset accuracy requirement.
The advantage of this technical solution is that training the RNN model with the Adam optimization algorithm on clips containing both speech and non-speech data lets the contextual information of a sample be fully exploited, improving recognition accuracy. In addition, the input of the RNN model is a three-dimensional vector; this low-dimensional feature representation reduces the training and classification time of the model, improves operation efficiency, and suppresses the interference of redundant information, so the probability that a sample belongs to speech is obtained more accurately, yielding an accurate speech detection model and improving the detection accuracy of the entire speech detection model.
A second aspect of the present disclosure provides a speech detection device that effectively combines a GMM model, an LSTM model and an RNN model, so that the respective advantages of the three models are brought into full play and the overall classification and detection ability of the speech detection model is improved.
To achieve the above goal, the disclosure adopts the following technical scheme:
A speech detection device, comprising:
a speech detection model construction module, configured to construct a speech detection model in which a first GMM model, a second GMM model and an LSTM model are connected in parallel and then connected in series with an RNN model;
a speech detection model training module, configured to train the speech detection model as follows:
using speech data, non-speech data, and mixed speech/non-speech data to train the first GMM model, the second GMM model and the LSTM model respectively; each model outputs a recognition score, and the three scores form a three-dimensional vector that serves as the vector characterization of the audio clip;
composing the vector characterizations of each moment, its preceding moment and its following moment into a time series, and training the RNN model on these input sequences until the deviation of the average probability that the audio clips at all moments belong to speech meets a preset accuracy requirement;
an audio data test module, configured to test audio data as follows:
segmenting the test audio into several audio clips and feeding them one by one into the trained speech detection model, which outputs the probability that the clip at the corresponding moment belongs to speech.
Further, in the audio data test module, if the probability is greater than or equal to a set threshold, the audio clip at the corresponding moment is judged to be speech; otherwise it is judged to be non-speech.
The advantage of this technical solution is that comparing the speech probability of each audio clip against a set threshold to decide whether the clip at the corresponding moment belongs to speech makes the detection result more intuitive.
Further, in the speech detection model training module, the first GMM model is trained as follows:
the audio clips containing only speech data are divided into frames, and the MFCCs of a preset dimension are extracted from each frame as the audio features; each feature vector forms a training sample and is stored in a first training sample set;
the training samples in the first training sample set are input into the first GMM model, which outputs a speech recognition score for each audio frame; the scores of all frames in an audio clip are averaged to obtain the speech recognition score of that clip;
all parameters of the first GMM model are obtained by training on the first training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the first GMM model on audio clips containing only speech data, with all parameters estimated from the first training sample set by the expectation-maximization algorithm, reduces the training and classification time of the first GMM model and improves operation efficiency; the score of a sample on the speech model can be obtained more accurately, yielding a more accurate speech detection model and improving the detection accuracy of the entire speech detection model.
Further, in the speech detection model training module, the second GMM model is trained as follows:
the audio clips containing only non-speech data are divided into frames, and the MFCCs of a preset dimension are extracted from each frame as the audio features; each feature vector forms a training sample and is stored in a second training sample set;
the training samples in the second training sample set are input into the second GMM model, which outputs a non-speech recognition score for each audio frame; the scores of all frames in an audio clip are averaged to obtain the non-speech recognition score of that clip;
all parameters of the second GMM model are obtained by training on the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the second GMM model on audio clips containing only non-speech data, with all parameters estimated from the second training sample set by the expectation-maximization algorithm, reduces the training and classification time of the second GMM model and improves operation efficiency; the score of a sample on the non-speech model can be obtained more accurately, yielding a more accurate non-speech detection model and improving the detection accuracy of the entire speech detection model.
Further, in the speech detection model training module, the LSTM model is trained as follows:
the audio clips containing both speech and non-speech data are divided into frames, and the MFCCs of a preset dimension are extracted from each frame as the audio features; these feature vectors are arranged in chronological order to form a time series;
the time series is input into the LSTM model, which outputs the speech recognition score of the corresponding audio clip;
the LSTM model is trained with the Adam optimization algorithm until its parameters are optimal.
The advantage of this technical solution is that training the LSTM model on audio clips containing both speech and non-speech data with the Adam optimization algorithm lets the contextual information of a sample be fully exploited during recognition; the probability score of a sample belonging to speech is obtained more accurately, yielding an accurate speech detection model and improving the detection accuracy of the entire speech detection model.
Further, in the speech detection model training module, the RNN model is trained as follows:
the recognition scores output by the trained first GMM model, second GMM model and LSTM model are composed into a three-dimensional vector, which serves as the vector characterization of the audio clip;
the vector characterizations of the current moment, the preceding moment and the following moment are composed into a time series, which is used as the input for training the RNN model; the output is the probability that the audio clip at the current moment belongs to speech;
the RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio clips at all moments belong to speech meets a preset accuracy requirement.
The advantage of this technical solution is that training the RNN model with the Adam optimization algorithm on clips containing both speech and non-speech data lets the contextual information of a sample be fully exploited, improving recognition accuracy. In addition, the input of the RNN model is a three-dimensional vector; this low-dimensional feature representation reduces the training and classification time of the model, improves operation efficiency, and suppresses the interference of redundant information, so the probability that a sample belongs to speech is obtained more accurately, yielding an accurate speech detection model and improving the detection accuracy of the entire speech detection model.
A third aspect of the present disclosure provides a computer-readable storage medium.
A computer-readable storage medium has a computer program stored thereon; when the program is executed by a processor, the steps of the speech detection method described above are implemented.
A fourth aspect of the present disclosure provides a computer device.
A computer device includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps of the speech detection method described above are implemented.
The beneficial effects of the present disclosure are:
(1) Through the GMM models and the LSTM model, the input of the RNN model is converted from a traditional audio spectrum into a low-dimensional feature representation; the low-dimensional features reduce the amount of computation of the RNN model, shorten its training and classification time, and improve operation efficiency.
(2) The disclosure uses the classification scores of different models as the feature representation of an audio sample. This representation efficiently extracts the important information in the sample and suppresses the interference of redundant information, so the classification model can better recognize speech samples and detection accuracy is improved.
(3) The disclosure realizes an effective combination of a traditional classification model (GMM) with the LSTM and RNN models: the GMM models can model the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification. The speech detection model of the disclosure therefore brings the respective advantages of the GMM, LSTM and RNN models into full play and improves overall classification and detection performance.
(4) The speech detection method of the disclosure also achieves good detection performance at low signal-to-noise ratios, and is therefore robust to noise.
(5) The design idea of this embodiment is to combine a traditional speech detection model with deep neural network models. The traditional model in such a scheme is not limited to GMM, and the deep neural network models are not limited to LSTM and RNN; the scheme extends well and provides a useful reference for combining traditional speech detection models with deep neural networks.
Brief description of the drawings
The accompanying drawings, which constitute a part of the present disclosure, are used to provide a further understanding of the disclosure. The illustrative embodiments of the disclosure and their descriptions are used to explain the disclosure and do not constitute an improper limitation of it.
Fig. 1 is a flow chart of a speech detection method of an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of the structure of the speech detection model of the embodiment of the present disclosure.
Fig. 3 is a diagram of the process of testing audio data of the embodiment of the present disclosure.
Fig. 4 is a schematic structural diagram of a speech detection device of an embodiment of the present disclosure.
Specific embodiment
The disclosure is further described below with reference to the accompanying drawings and embodiments.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It should be noted that the terms used herein are merely for describing specific embodiments and are not intended to limit the illustrative embodiments of the disclosure. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
Explanation of terms:
GMM model: a Gaussian Mixture Model quantifies a phenomenon precisely with Gaussian probability density functions (normal distribution curves), decomposing it into several components each based on a Gaussian probability density function.
LSTM model: Long Short-Term Memory, a type of recurrent neural network suitable for processing and predicting important events with relatively long intervals and delays in a time series.
RNN model: a Recurrent Neural Network is an artificial neural network in which node connections form cycles. The internal state of such a network can exhibit dynamic temporal behavior. Unlike a feedforward neural network, an RNN can use its internal memory to process input sequences of arbitrary length, which makes it well suited to tasks such as unsegmented handwriting recognition and speech recognition.
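To make the recurrence concrete, here is a minimal Elman-style forward pass in NumPy over one (previous, current, next) window of score triples, the input shape used later in this disclosure (weights are random placeholders and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def rnn_speech_prob(window, hidden=50):
    """Forward pass of a single-layer Elman RNN over one (prev, cur, next)
    window of 3-d score vectors; outputs a speech probability in (0, 1)."""
    W_xh = rng.normal(scale=0.1, size=(hidden, 3))       # input-to-hidden
    W_hh = rng.normal(scale=0.1, size=(hidden, hidden))  # hidden-to-hidden
    w_out = rng.normal(scale=0.1, size=hidden)           # hidden-to-output
    h = np.zeros(hidden)
    for x in window:                 # recurrence: h_t depends on h_{t-1}
        h = np.tanh(W_xh @ x + W_hh @ h)
    return 1.0 / (1.0 + np.exp(-(w_out @ h)))  # sigmoid speech probability

window = np.array([[0.9, 0.1, 0.8],  # previous clip's score triple
                   [0.7, 0.2, 0.6],  # current clip
                   [0.2, 0.9, 0.3]]) # next clip
prob = rnn_speech_prob(window)
```

The `W_hh @ h` term is the internal memory referred to above: the state at each step carries information from the preceding inputs.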
Embodiment one
Fig. 1 gives the flow chart of the speech detection method of this embodiment.
As shown in Fig. 1, the speech detection method of this embodiment comprises:
S101: constructing the speech detection model.
The speech detection model is formed by connecting a first GMM model, a second GMM model and an LSTM model in parallel and then connecting them in series with an RNN model, as shown in Fig. 2.
In this embodiment, the number of Gaussian mixture components in the first GMM model is set to 5, and the number of Gaussian mixture components in the second GMM model is also set to 5.
The LSTM model comprises an input layer, a hidden part composed of 2 LSTM layers, and an output layer. The number of input layer nodes is set to 39 and the number of output layer neurons to 1; the class label of speech is "1" and that of non-speech is "0". A dropout layer is added after each LSTM layer, with the dropout parameter set to 0.2.
The number of input layer nodes of the RNN model is set to 3; 2 hidden layers are used, with 50 neurons in each hidden layer; a dropout layer is added after each hidden layer, with the dropout parameter set to 0.2; the number of output layer neurons is set to 1, the class label of speech being "1" and that of non-speech "0". The output layer outputs the probability that each audio clip belongs to speech.
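Under the stated sizes (input 3, two hidden layers of 50, output 1), the dense parameter count of such a stacked Elman RNN can be tallied as a quick sanity check. The weight layout (feedforward plus square recurrent matrix plus bias per hidden layer) is an assumption; the patent does not specify it:

```python
def rnn_param_count(n_in=3, hidden=(50, 50), n_out=1):
    """Tally dense parameters of the stacked Elman RNN described above:
    per hidden layer, input weights + recurrent weights + bias; then the
    output layer weights + bias."""
    total, prev = 0, n_in
    for h in hidden:
        total += prev * h + h * h + h   # input, recurrent, bias
        prev = h
    total += prev * n_out + n_out       # output layer
    return total

print(rnn_param_count())  # 7801
```

The point of the tally is scale: a few thousand parameters, far fewer than a network fed raw spectral features would need, which is the efficiency argument made in the summary above.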
It can be understood that in other embodiments the numbers of Gaussian mixture components in the first and second GMM models may be set to other values; those skilled in the art can choose them according to the actual situation, and details are not repeated here. Likewise, the number of LSTM layers in the LSTM model, as well as the number of nodes in each layer of the RNN model and the number of its hidden layers, may also be set to other values, which those skilled in the art can choose according to the actual situation; details are not repeated here.
S102: training the speech detection model.
In a specific implementation, the training process of step S102 is as follows:
S1021: using speech data, non-speech data, and mixed speech/non-speech data to train the first GMM model, the second GMM model and the LSTM model respectively; each model outputs a recognition score, and the three scores form a three-dimensional vector that serves as the vector characterization of the audio clip.
Specifically, in step S1021, the first GMM model is trained as follows:
S1021-11: the audio clips containing only speech data are divided into frames, the MFCCs of a preset dimension are extracted from each frame as the audio features, and each feature vector forms a training sample that is stored in the first training sample set.
For example: each piece of training data in the training set containing only speech data is divided into a series of non-overlapping audio clips 100 milliseconds long; each 100-millisecond clip is divided into frames, with the frame length set to 30 milliseconds and the frame shift to 10 milliseconds; after framing, 39-dimensional MFCC features are extracted from each audio frame, and each training speech sample is expressed by these 39-dimensional MFCC features.
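With these numbers (100 ms clip, 30 ms frame, 10 ms shift), the framing step can be sketched in NumPy. At an assumed 16 kHz sampling rate (the patent does not state one), each clip yields 8 frames:

```python
import numpy as np

def frame_clip(clip, sr=16000, frame_ms=30, shift_ms=10):
    """Split one audio clip into overlapping frames: 30 ms frame length
    and 10 ms frame shift, as in the example above."""
    frame_len = sr * frame_ms // 1000    # 480 samples at 16 kHz
    shift = sr * shift_ms // 1000        # 160 samples at 16 kHz
    n_frames = 1 + (len(clip) - frame_len) // shift
    return np.stack([clip[i * shift:i * shift + frame_len]
                     for i in range(n_frames)])

clip = np.zeros(16000 // 10)             # one 100 ms clip at 16 kHz
frames = frame_clip(clip)
print(frames.shape)  # (8, 480)
```

Each of the 8 frames would then be converted to its 39-dimensional MFCC vector by a standard extraction routine.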
Here MFCC stands for Mel-Frequency Cepstral Coefficient. The Mel frequency scale is based on the characteristics of human hearing and has a nonlinear correspondence with the Hz frequency scale; Mel-frequency cepstral coefficients are the spectral features computed from the Hz spectrum using this relationship.
It should be noted that existing methods can be used to extract the Mel-frequency cepstral coefficients, and those skilled in the art can make a specific choice according to the actual situation.
S1021-12: the training samples in the first training sample set are input into the first GMM model, which outputs the speech recognition score of each audio frame; the scores of all frames in an audio clip are averaged to obtain the speech recognition score of that clip.
The larger the speech recognition score, the greater the probability that the corresponding audio clip belongs to speech.
S1021-13: all parameters of the first GMM model are obtained by training on the first training sample set with the expectation-maximization algorithm.
The expectation-maximization (EM) algorithm is an optimization algorithm that performs maximum-likelihood estimation by iteration. It is usually used as a substitute for Newton's iteration method to estimate the parameters of probabilistic models that contain latent variables or missing data, and it can provide the posterior of the latent variables, i.e. the missing data, so it is also applied to missing-data problems.
In this embodiment, audio segments containing only speech data are used, and all parameters of the first GMM model are obtained by training on the samples in the first training sample set with the expectation-maximization algorithm. This reduces the training and classification time of the first GMM model and improves operational efficiency; the score of a sample on the speech model can be obtained more accurately, yielding a more accurate speech detection model and improving the detection accuracy of the entire speech detection model.
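As an illustrative sketch only (the disclosure does not name a library), the first-GMM steps above can be reproduced with scikit-learn's `GaussianMixture`, which is fitted internally by expectation-maximization. The random arrays stand in for real 39-dimensional MFCC frames, and the component count and covariance type are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for 39-dimensional MFCC frames extracted from speech-only audio.
speech_frames = rng.normal(size=(500, 39))

# Fit the "first GMM"; scikit-learn trains it by expectation-maximization.
speech_gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
speech_gmm.fit(speech_frames)

# Score one 8-frame segment: a per-frame log-likelihood, averaged over the segment.
segment = rng.normal(size=(8, 39))               # one 100 ms segment = 8 frames
frame_scores = speech_gmm.score_samples(segment)  # one score per frame
segment_score = frame_scores.mean()               # segment-level speech score
print(round(float(segment_score), 2))
```

A larger averaged score indicates that the segment is more likely to be speech, matching step S1021-12.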
Specifically, in step S1021, the process of training the second GMM model is as follows:
S1021-21: the audio segments containing only non-speech data are divided into frames; the mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, forming training samples that are stored in the second training sample set.
For example, each piece of training data in the training set containing only non-speech data is divided into a series of non-overlapping 100-millisecond audio segments. Each 100-millisecond audio segment is divided into frames, with the frame length set to 30 milliseconds and the frame shift set to 10 milliseconds. After framing, 39-dimensional mel-frequency cepstral coefficients are extracted from each audio frame, and each non-speech training sample is represented by these 39-dimensional coefficients.
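The segment and frame arithmetic above (100 ms segments, 30 ms frames, 10 ms shift) can be checked with a short sketch; the 16 kHz sampling rate is an assumption, as the text does not state one.

```python
import numpy as np

SR = 16000                        # assumed sampling rate (not specified in the text)
seg_len = int(0.100 * SR)         # 100 ms segment -> 1600 samples
frame_len = int(0.030 * SR)       # 30 ms frame    -> 480 samples
frame_shift = int(0.010 * SR)     # 10 ms shift    -> 160 samples

segment = np.zeros(seg_len)       # stand-in for one audio segment

# Slice the segment into overlapping frames.
starts = range(0, seg_len - frame_len + 1, frame_shift)
frames = np.stack([segment[s:s + frame_len] for s in starts])
print(frames.shape)               # (8, 480): eight 30 ms frames per 100 ms segment
```

With these parameters every 100-millisecond segment yields exactly eight frames, consistent with the eight-frame time series used later as the LSTM input.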
It should be noted that existing methods can be used to extract the mel-frequency cepstral coefficients, and those skilled in the art can make a specific choice according to the actual situation.
S1021-22: the training samples in the second training sample set are input into the second GMM model, which outputs a non-speech recognition score for each audio frame; the non-speech recognition scores of all frames in an audio segment are averaged to obtain the non-speech recognition score of the corresponding audio segment.
Here, the larger the non-speech recognition score, the lower the probability that the corresponding audio segment is speech.
S1021-23: all parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that audio segments containing only non-speech data are used and all parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm. This reduces the training and classification time of the second GMM model, improves operational efficiency, and allows the score indicating that a sample is non-speech to be obtained more accurately, yielding a more accurate non-speech detection model and improving the detection accuracy of the entire speech detection model.
Specifically, in step S1021, the process of training the LSTM model is as follows:
S1021-31: the audio segments containing speech data and non-speech data are divided into frames; the mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, and these audio features are arranged in chronological order to form a time series.
Specifically, each piece of training data in the training set containing speech data and non-speech data is divided into a series of non-overlapping 100-millisecond audio segments. Each 100-millisecond audio segment is divided into frames, with the frame length set to 30 milliseconds and the frame shift set to 10 milliseconds, so that each 100-millisecond segment is divided into eight 30-millisecond audio frames, which constitute one time series. After framing, 39-dimensional MFCC features are extracted from each audio frame. The time series formed by framing each 100-millisecond audio segment is used as the input for training the LSTM model.
S1021-32: the above time series is input into the LSTM model, which outputs the recognition score of the corresponding audio segment.
Here, the larger the recognition score, the higher the probability that the corresponding audio segment is speech.
S1021-33: the LSTM model is trained with the Adam optimization algorithm until the parameters of the LSTM model are optimal.
The LSTM model is initialized with the Glorot uniform initialization method; the loss function is the cross-entropy loss; training uses the Adam optimization algorithm, with the learning rate set to 0.01, the batch_size parameter set to 128, and the epoch parameter set to 20. The LSTM model is configured to produce an output only after reading the last frame of the input time series.
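The disclosure does not name a deep-learning framework; as a sketch under that caveat, PyTorch is used below purely for illustration. The model reads an eight-frame sequence of 39-dimensional MFCC vectors and emits a single score from the last time step; the hidden size of 64 is an assumption, and the Glorot initialization and the full 20-epoch loop are omitted for brevity.

```python
import torch
import torch.nn as nn

class SegmentLSTM(nn.Module):
    def __init__(self, n_mfcc=39, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, 8 frames, 39 MFCCs)
        out, _ = self.lstm(x)
        last = out[:, -1, :]               # output only at the last input frame
        return torch.sigmoid(self.head(last)).squeeze(-1)

model = SegmentLSTM()
opt = torch.optim.Adam(model.parameters(), lr=0.01)  # Adam, learning rate 0.01
loss_fn = nn.BCELoss()                               # cross-entropy for binary labels

x = torch.randn(128, 8, 39)                # one batch of 128 segments (batch_size 128)
y = torch.randint(0, 2, (128,)).float()    # 1 = speech, 0 = non-speech
p = model(x)                               # per-segment recognition score in (0, 1)
loss = loss_fn(p, y)
loss.backward()
opt.step()                                 # one Adam update step
print(p.shape)                             # torch.Size([128])
```

A full training run would repeat the forward/backward/step cycle for the stated 20 epochs over the mixed speech and non-speech training set.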
Adam is a first-order optimization method that can replace the traditional stochastic gradient descent procedure; it iteratively updates neural network weights based on the training data. Its advantages are that it is straightforward to implement, computationally efficient, and requires little memory; its gradient updates are invariant to diagonal rescaling; it is suitable for optimization problems involving large-scale data and parameters, for non-stationary objectives, and for problems with very noisy or sparse gradients; and its hyperparameters are intuitively interpretable and generally require minimal tuning.
The advantage of this technical solution is that audio segments containing both speech and non-speech data are used and the LSTM model is trained with the Adam optimization algorithm; the contextual information of the samples can be fully exploited for recognition, the probability score that a sample is speech can be obtained more accurately, a more accurate speech detection model is obtained, and the detection accuracy of the entire speech detection model is improved.
S1022: the vector representations of the audio segments at each moment, the moment before it, and the moment after it are formed into a time series, which is used as the input to train the RNN model until the deviation of the average probability that the audio segments at all moments are speech meets a preset precision requirement.
Assume there are N training samples in the training set, and let x_i (i = 1, ..., N) denote the i-th training sample. Assume x_i contains M_i audio segments, and let x_ij (j = 1, ..., M_i) denote the j-th audio segment of x_i; x_ij takes the form of a three-dimensional vector composed of the corresponding recognition scores output by the first GMM model, the second GMM model, and the LSTM model. When representing an audio segment, in order to also incorporate its contextual information, x_ij is combined with the audio segment x_i(j-1) of the previous moment and the audio segment x_i(j+1) of the following moment to form a time series [x_i(j-1), x_ij, x_i(j+1)] (j = 2, ..., M_i - 1), which is used as the input for training the RNN network.
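The construction of the RNN input [x_i(j-1), x_ij, x_i(j+1)] can be sketched as follows; random numbers stand in for the actual GMM and LSTM scores, and the segment count M is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

M = 10                                  # number of audio segments in one training sample
# Each segment is a 3-dim vector of (speech-GMM, non-speech-GMM, LSTM) scores.
x = rng.normal(size=(M, 3))

# For each interior segment j, stack previous/current/next into a 3-step sequence.
sequences = np.stack([x[j - 1:j + 2] for j in range(1, M - 1)])
print(sequences.shape)                  # (M - 2) sequences, each 3 steps of 3 features
```

The first and last segments have no neighbor on one side, which is why the index j runs only over 2, ..., M_i - 1 in the text.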
Specifically, the process of training the RNN model is as follows:
The recognition scores output respectively by the trained first GMM model, second GMM model, and LSTM model are formed into a three-dimensional vector, which serves as the vector representation of an audio segment.
The vector representations of the audio segments at the current moment, the previous moment, and the following moment are formed into a time series, which is used as the input to train the RNN model; the output is the probability that the audio segment at the current moment is speech.
The RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all moments are speech meets the preset precision requirement.
The number of input-layer nodes of the RNN network is set to 3; the number of hidden layers is set to 2, with the number of neurons in each hidden layer set to 50; a dropout layer with parameter 0.2 is added after each hidden layer; and the number of output-layer neurons is set to 1. If the class label of speech is "1" and the class label of non-speech is "0", the output layer produces the probability that each audio segment is speech.
The RNN network is initialized with the Glorot uniform initialization method; the loss function is the cross-entropy loss; training uses the Adam optimization algorithm, with the learning rate set to 0.01, the batch_size parameter set to 128, and the epoch parameter set to 20.
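Under the same illustrative PyTorch assumption as before (the disclosure names no framework), the RNN configuration described above (3 inputs, two hidden layers of 50 units, dropout 0.2, one sigmoid output, Glorot uniform initialization) might look like:

```python
import torch
import torch.nn as nn

class ScoreRNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Two stacked recurrent layers of 50 units; dropout 0.2 between layers.
        self.rnn = nn.RNN(input_size=3, hidden_size=50, num_layers=2,
                          dropout=0.2, batch_first=True)
        self.drop = nn.Dropout(0.2)      # dropout after the last hidden layer
        self.head = nn.Linear(50, 1)     # single output neuron

    def forward(self, x):                # x: (batch, 3 time steps, 3 scores)
        out, _ = self.rnn(x)
        last = self.drop(out[:, -1, :])
        # Sigmoid output: probability that the middle segment is speech (label "1").
        return torch.sigmoid(self.head(last)).squeeze(-1)

def init_glorot(module):
    # Glorot (Xavier) uniform initialization for all weight matrices.
    for _, p in module.named_parameters():
        if p.dim() >= 2:
            nn.init.xavier_uniform_(p)

net = ScoreRNN()
init_glorot(net)
probs = net(torch.randn(4, 3, 3))        # a batch of four 3-step score sequences
print(probs.shape)                       # torch.Size([4])
```

Training would again use Adam with learning rate 0.01, batch size 128, and 20 epochs, with binary cross-entropy against the speech/non-speech labels.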
S103: testing audio data.
In a specific implementation, as shown in Fig. 3, the process of testing audio data in step S103 is as follows:
The audio data to be tested is divided into several audio segments, and the audio segments are then input one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding moment is speech.
Specifically, if the probability is greater than or equal to a set threshold, the audio segment at the corresponding moment is judged to be speech; otherwise, the audio segment at the corresponding moment is judged not to be speech.
Assume K audio segments are obtained after dividing the audio data to be tested, and let y_k denote the k-th audio segment, k = 1, ..., K. For each audio segment y_k, the recognition scores obtained from the first GMM model, the second GMM model, and the LSTM model are formed into a 3-dimensional vector, and y_k is represented by this 3-dimensional vector.
y_k is combined with the audio segment y_(k-1) of the previous moment and the audio segment y_(k+1) of the following moment to form a time series [y_(k-1), y_k, y_(k+1)] (k = 2, ..., K - 1); this time series is used as the input of the RNN network to obtain the probability that y_k is speech. With the threshold set to 0.5, y_k is judged to be speech if the probability is greater than 0.5 and non-speech if the probability is less than 0.5.
It should be noted that those skilled in the art can set the size of the threshold on the probability that an audio segment is speech according to the required accuracy.
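The thresholding rule above is straightforward; a minimal sketch with a configurable threshold (0.5 as in the example) follows. The function name and the sample probabilities are illustrative only.

```python
def label_segments(probs, threshold=0.5):
    """Map per-segment speech probabilities to 'speech'/'non-speech' labels."""
    return ["speech" if p >= threshold else "non-speech" for p in probs]

print(label_segments([0.91, 0.12, 0.50, 0.49]))
# ['speech', 'non-speech', 'speech', 'non-speech']
```

Raising the threshold trades missed speech for fewer false detections, which is the accuracy trade-off the text leaves to the practitioner.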
In this embodiment, the input to the RNN model is converted from a traditional audio spectrum into a low-dimensional feature representation by the GMM models and the LSTM model. The low-dimensional features reduce the amount of computation of the RNN model, shorten its training and classification time, and improve operational efficiency.
In this embodiment, the classification scores of the different models serve as the feature representation of an audio sample. This representation effectively extracts the important information in the audio sample and reduces the interference of redundant information, so that the classification model can better identify speech samples and the detection accuracy is improved.
This embodiment realizes an effective combination of a traditional classification model (the GMM) with the LSTM and RNN models: the GMM can model the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification. The speech detection model of the present disclosure can give full play to the respective advantages of the GMM, LSTM, and RNN models, improving the overall classification and detection performance.
This embodiment can also achieve good speech detection performance at low signal-to-noise ratios, and thus has good robustness to noise.
The design idea of this embodiment is to combine a traditional speech detection model with deep neural network models. The traditional speech detection model in the combined scheme is not limited to the GMM, and the deep neural network models are not limited to the LSTM and RNN; the combined scheme therefore has good extensibility and provides a useful reference for combining traditional speech detection models with deep neural network models.
Embodiment two
Fig. 4 is a structural schematic diagram of a speech detection device provided by an embodiment of the present disclosure.
As shown in Fig. 4, the speech detection device of this embodiment comprises:
(1) A speech detection model construction module, configured to construct a speech detection model, wherein the speech detection model is formed by connecting the first GMM model, the second GMM model, and the LSTM model in parallel and then in series with the RNN model;
(2) A speech detection model training module, configured to train the speech detection model; the process is as follows:
Speech data, non-speech data, and mixed speech and non-speech data are used to respectively train the first GMM model, the second GMM model, and the LSTM model; the corresponding recognition scores are output and formed into a three-dimensional vector, which serves as the vector representation of an audio segment.
The vector representations of the audio segments at each moment, the moment before it, and the moment after it are formed into a time series, which is used as the input to train the RNN model until the deviation of the average probability that the audio segments at all moments are speech meets a preset precision requirement.
Specifically, in the speech detection model training module, the process of training the first GMM model is as follows:
The audio segments containing only speech data are divided into frames; the mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, forming training samples that are stored in the first training sample set.
The training samples in the first training sample set are input into the first GMM model, which outputs a speech recognition score for each audio frame; the speech recognition scores of all frames in an audio segment are averaged to obtain the speech recognition score of the corresponding audio segment.
All parameters of the first GMM model are obtained by training on the samples in the first training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that audio segments containing only speech data are used and all parameters of the first GMM model are obtained by training on the samples in the first training sample set with the expectation-maximization algorithm. This reduces the training and classification time of the first GMM model, improves operational efficiency, and allows the score of a sample on the speech model to be obtained more accurately, yielding a more accurate speech detection model and improving the detection accuracy of the entire speech detection model.
In the speech detection model training module, the process of training the second GMM model is as follows:
The audio segments containing only non-speech data are divided into frames; the mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, forming training samples that are stored in the second training sample set.
The training samples in the second training sample set are input into the second GMM model, which outputs a non-speech recognition score for each audio frame; the non-speech recognition scores of all frames in an audio segment are averaged to obtain the non-speech recognition score of the corresponding audio segment.
All parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that audio segments containing only non-speech data are used and all parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm. This reduces the training and classification time of the second GMM model, improves operational efficiency, and allows the score of a sample on the non-speech model to be obtained more accurately, yielding a more accurate non-speech detection model and improving the detection accuracy of the entire speech detection model.
In the speech detection model training module, the process of training the LSTM model is as follows:
The audio segments containing speech data and non-speech data are divided into frames; the mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, and these audio features are arranged in chronological order to form a time series.
The time series is input into the LSTM model, which outputs the recognition score of the corresponding audio segment.
The LSTM model is trained with the Adam optimization algorithm until the parameters of the LSTM model are optimal.
The advantage of this technical solution is that audio segments containing both speech and non-speech data are used and the LSTM model is trained with the Adam optimization algorithm; the contextual information of the samples can be fully exploited for recognition, the probability score that a sample is speech can be obtained more accurately, a more accurate speech detection model is obtained, and the detection accuracy of the entire speech detection model is improved.
In the speech detection model training module, the process of training the RNN model is as follows:
The recognition scores output respectively by the trained first GMM model, second GMM model, and LSTM model are formed into a three-dimensional vector, which serves as the vector representation of an audio segment.
The vector representations of the audio segments at the current moment, the previous moment, and the following moment are formed into a time series, which is used as the input to train the RNN model; the output is the probability that the audio segment at the current moment is speech.
The RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all moments are speech meets the preset precision requirement.
The advantage of this technical solution is that audio segments containing both speech and non-speech data are used and the RNN model is trained with the Adam optimization algorithm; the contextual information of the samples can be fully exploited for recognition, improving recognition accuracy. In addition, the input to the RNN model is a three-dimensional vector; this low-dimensional feature representation reduces the training and classification time of the model, improves operational efficiency, and reduces the interference of redundant information, so that the probability that a sample is speech can be obtained more accurately, a more accurate speech detection model is obtained, and the detection accuracy of the entire speech detection model is improved.
(3) An audio data test module, configured to test audio data; the process is as follows:
The audio data to be tested is divided into several audio segments, and the audio segments are then input one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding moment is speech.
Specifically, in the audio data test module, if the probability is greater than or equal to a set threshold, the audio segment at the corresponding moment is judged to be speech; otherwise, the audio segment at the corresponding moment is judged not to be speech.
The advantage of this technical solution is that, by comparing the probability that an audio segment is speech with a set threshold, whether the audio segment at the corresponding moment is speech can be judged, making the detection result more intuitive.
In this embodiment, the input to the RNN model is converted from a traditional audio spectrum into a low-dimensional feature representation by the GMM models and the LSTM model. The low-dimensional features reduce the amount of computation of the RNN model, shorten its training and classification time, and improve operational efficiency.
In this embodiment, the classification scores of the different models serve as the feature representation of an audio sample. This representation effectively extracts the important information in the audio sample and reduces the interference of redundant information, so that the classification model can better identify speech samples and the detection accuracy is improved.
This embodiment realizes an effective combination of a traditional classification model (the GMM) with the LSTM and RNN models: the GMM can model the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification. The speech detection model of the present disclosure can give full play to the respective advantages of the GMM, LSTM, and RNN models, improving the overall classification and detection performance.
This embodiment can also achieve good speech detection performance at low signal-to-noise ratios, and thus has good robustness to noise.
Embodiment three
A third aspect of the present disclosure provides a computer-readable storage medium.
A computer-readable storage medium has a computer program stored thereon; when the program is executed by a processor, the steps in the speech detection method shown in Fig. 1 are implemented.
In this embodiment, the input to the RNN model is converted from a traditional audio spectrum into a low-dimensional feature representation by the GMM models and the LSTM model. The low-dimensional features reduce the amount of computation of the RNN model, shorten its training and classification time, and improve operational efficiency.
In this embodiment, the classification scores of the different models serve as the feature representation of an audio sample. This representation effectively extracts the important information in the audio sample and reduces the interference of redundant information, so that the classification model can better identify speech samples and the detection accuracy is improved.
This embodiment realizes an effective combination of a traditional classification model (the GMM) with the LSTM and RNN models: the GMM can model the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification. The speech detection model of the present disclosure can give full play to the respective advantages of the GMM, LSTM, and RNN models, improving the overall classification and detection performance.
This embodiment can also achieve good speech detection performance at low signal-to-noise ratios, and thus has good robustness to noise.
Embodiment four
A fourth aspect of the present disclosure provides a computer device.
A computer device comprises a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps in the speech detection method shown in Fig. 1 are implemented.
In this embodiment, the input to the RNN model is converted from a traditional audio spectrum into a low-dimensional feature representation by the GMM models and the LSTM model. The low-dimensional features reduce the amount of computation of the RNN model, shorten its training and classification time, and improve operational efficiency.
In this embodiment, the classification scores of the different models serve as the feature representation of an audio sample. This representation effectively extracts the important information in the audio sample and reduces the interference of redundant information, so that the classification model can better identify speech samples and the detection accuracy is improved.
This embodiment realizes an effective combination of a traditional classification model (the GMM) with the LSTM and RNN models: the GMM can model the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification. The speech detection model of the present disclosure can give full play to the respective advantages of the GMM, LSTM, and RNN models, improving the overall classification and detection performance.
This embodiment can also achieve good speech detection performance at low signal-to-noise ratios, and thus has good robustness to noise.
It should be understood by those skilled in the art that embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program; the program may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The foregoing is merely a preferred embodiment of the present disclosure and is not intended to limit the disclosure; for those skilled in the art, the disclosure may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present disclosure shall be included within the protection scope of the present disclosure.
Claims (10)
1. A speech detection method, characterized by comprising:
constructing a speech detection model, the speech detection model being formed by connecting a first GMM model, a second GMM model, and an LSTM model in parallel and then in series with an RNN model;
training the speech detection model, the process being as follows:
using speech data, non-speech data, and mixed speech and non-speech data to respectively train the first GMM model, the second GMM model, and the LSTM model, outputting the corresponding recognition scores, and forming them into a three-dimensional vector that serves as the vector representation of an audio segment;
forming the vector representations of the audio segments at each moment, the moment before it, and the moment after it into a time series, and training the RNN model with the time series as input until the deviation of the average probability that the audio segments at all moments are speech meets a preset precision requirement; and
testing audio data, the process being as follows:
dividing the audio data to be tested into several audio segments, inputting the audio segments one by one into the trained speech detection model, and obtaining the probability that the audio segment at the corresponding moment is speech.
2. The speech detection method according to claim 1, characterized in that, after the probability that an audio segment to be tested is speech is obtained, if the probability is greater than or equal to a set threshold, the audio segment at the corresponding moment is judged to be speech; otherwise, the audio segment at the corresponding moment is judged not to be speech.
3. The speech detection method according to claim 1, characterized in that the process of training the first GMM model is as follows:
dividing the audio segments containing only speech data into frames, extracting the mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a first training sample set;
inputting the training samples in the first training sample set into the first GMM model, outputting a speech recognition score for each audio frame, and averaging the speech recognition scores of all frames in an audio segment to obtain the speech recognition score of the corresponding audio segment; and
obtaining all parameters of the first GMM model by training on the samples in the first training sample set with an expectation-maximization algorithm;
or the process of training the second GMM model is as follows:
dividing the audio segments containing only non-speech data into frames, extracting the mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a second training sample set;
inputting the training samples in the second training sample set into the second GMM model, outputting a non-speech recognition score for each audio frame, and averaging the non-speech recognition scores of all frames in an audio segment to obtain the non-speech recognition score of the corresponding audio segment; and
obtaining all parameters of the second GMM model by training on the samples in the second training sample set with the expectation-maximization algorithm.
4. The speech detection method according to claim 1, characterized in that the process of training the LSTM model is as follows:
dividing the audio segments containing speech data and non-speech data into frames, extracting the mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and arranging these audio features in chronological order to form a time series;
inputting the time series into the LSTM model, and outputting the recognition score of the corresponding audio segment; and
training the LSTM model with the Adam optimization algorithm until the parameters of the LSTM model are optimal.
5. The speech detection method according to claim 1, characterized in that the process of training the RNN model is as follows:
forming a three-dimensional vector from the recognition scores output respectively by the trained first GMM model, second GMM model and LSTM model, the vector serving as the representation of an audio segment;
forming a time series from the vector representations of the audio segments at the current moment, the previous moment and the next moment, and using it as the input quantity to train the RNN model, which outputs the probability that the audio segment at the current moment is speech;
training the RNN model with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all moments are speech meets a preset accuracy requirement.
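The fusion stage of claim 5 can be sketched as follows: the three model scores at each moment form a 3-dimensional vector, and the vectors of the previous, current and next moments form the input sequence of a simple RNN whose output is the probability that the current segment is speech. The weights below are illustrative random values, not trained ones, and the score values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_speech_prob(window, n_hidden=4):
    """window: (3, 3) array of [previous, current, next] score vectors."""
    W_in = rng.normal(scale=0.5, size=(n_hidden, 3))
    W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
    w_out = rng.normal(scale=0.5, size=n_hidden)
    h = np.zeros(n_hidden)
    for v in window:                       # previous, current, next moment
        h = np.tanh(W_in @ v + W_h @ h)    # simple (Elman) RNN update
    return float(1.0 / (1.0 + np.exp(-(w_out @ h))))  # speech probability

# Per-moment scores from (speech GMM, non-speech GMM, LSTM), stacked in time.
scores = np.array([[0.90, 0.10, 0.80],   # previous moment
                   [0.80, 0.20, 0.90],   # current moment
                   [0.70, 0.30, 0.85]])  # next moment
p = rnn_speech_prob(scores)
print(0.0 < p < 1.0)  # True
```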
6. A speech detection device, characterized in that it comprises:
a speech detection model construction module, configured to construct a speech detection model in which a first GMM model, a second GMM model and an LSTM model are connected in parallel and then connected in series with an RNN model;
a speech detection model training module, configured to train the speech detection model as follows:
training the first GMM model, the second GMM model and the LSTM model with speech data, non-speech data, and mixed speech and non-speech data respectively, outputting the corresponding recognition scores, and forming them into a three-dimensional vector that serves as the representation of an audio segment;
forming a time series from the vector representations of the audio segments at each moment, its previous moment and its next moment, and using it as the input quantity to train the RNN model until the deviation of the average probability that the audio segments at all moments are speech meets a preset accuracy requirement;
an audio data test module, configured to test audio data as follows:
segmenting the test audio data into several audio segments, then inputting the audio segments one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding moment is speech.
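The segmentation step of the test module in claim 6 amounts to slicing the test audio into fixed-length segments before feeding them to the model one by one. A minimal sketch, where the sample array and segment length are made-up stand-ins (the claim does not fix a segment length):

```python
import numpy as np

# Stand-in for a decoded audio sample array; a real system would load
# PCM samples from the test audio file.
samples = np.arange(10)
seg_len = 4  # assumed segment length in samples (illustrative only)
segments = [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
print(len(segments))               # 3
print([len(s) for s in segments])  # [4, 4, 2]
```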
7. The speech detection device according to claim 6, characterized in that, in the audio data test module, if the probability is greater than or equal to a given threshold, the audio segment at the corresponding moment is judged to be speech; otherwise, the audio segment at the corresponding moment is judged not to be speech.
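The decision rule of claim 7 is a simple threshold test on each segment's speech probability. A sketch, where the 0.5 threshold and the probability values are assumed example numbers (the claim only requires "a given threshold"):

```python
probs = [0.92, 0.40, 0.65, 0.08]          # per-segment speech probabilities
THRESHOLD = 0.5                           # assumed example threshold
labels = [p >= THRESHOLD for p in probs]  # True = speech, False = non-speech
print(labels)  # [True, False, True, False]
```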
8. The speech detection device according to claim 6, characterized in that, in the speech detection model training module, the process of training the first GMM model is as follows:
dividing the audio segments containing only speech data into frames, extracting mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a first training sample set;
inputting the training samples in the first training sample set into the first GMM model, outputting a speech recognition score for each audio frame, and averaging the speech recognition scores of all frames in an audio segment to obtain the speech recognition score of the corresponding audio segment;
obtaining all parameters of the first GMM model by training on the training samples in the first training sample set with an expectation-maximization algorithm;
Or
in the speech detection model training module, the process of training the second GMM model is as follows:
dividing the audio segments containing only non-speech data into frames, extracting mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a second training sample set;
inputting the training samples in the second training sample set into the second GMM model, outputting a non-speech recognition score for each audio frame, and averaging the non-speech recognition scores of all frames in an audio segment to obtain the non-speech recognition score of the corresponding audio segment;
obtaining all parameters of the second GMM model by training on the training samples in the second training sample set with an expectation-maximization algorithm;
Or
in the speech detection model training module, the process of training the LSTM model is as follows:
dividing the audio segments containing both speech data and non-speech data into frames, extracting mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and arranging these audio features in chronological order to form a time series;
inputting the above time series into the LSTM model, which outputs the recognition score of the corresponding audio segment;
training the LSTM model with the Adam optimization algorithm until the parameters of the LSTM model are optimal;
Or
in the speech detection model training module, the process of training the RNN model is as follows:
forming a three-dimensional vector from the recognition scores output respectively by the trained first GMM model, second GMM model and LSTM model, the vector serving as the representation of an audio segment;
forming a time series from the vector representations of the audio segments at the current moment, the previous moment and the next moment, and using it as the input quantity to train the RNN model, which outputs the probability that the audio segment at the current moment is speech;
training the RNN model with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all moments are speech meets a preset accuracy requirement.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the speech detection method according to any one of claims 1 to 5.
10. A computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the program, implements the steps of the speech detection method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910594785.2A CN110349597B (en) | 2019-07-03 | 2019-07-03 | Voice detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110349597A true CN110349597A (en) | 2019-10-18 |
CN110349597B CN110349597B (en) | 2021-06-25 |
Family
ID=68177773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910594785.2A Active CN110349597B (en) | 2019-07-03 | 2019-07-03 | Voice detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110349597B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060111900A1 (en) * | 2004-11-25 | 2006-05-25 | Lg Electronics Inc. | Speech distinction method |
CN101548313A (en) * | 2006-11-16 | 2009-09-30 | 国际商业机器公司 | Voice activity detection system and method |
CN103226948A (en) * | 2013-04-22 | 2013-07-31 | 山东师范大学 | Audio scene recognition method based on acoustic events |
CN103646649A (en) * | 2013-12-30 | 2014-03-19 | 中国科学院自动化研究所 | High-efficiency voice detecting method |
US20150228277A1 (en) * | 2014-02-11 | 2015-08-13 | Malaspina Labs (Barbados), Inc. | Voiced Sound Pattern Detection |
US20170069313A1 (en) * | 2015-09-06 | 2017-03-09 | International Business Machines Corporation | Covariance matrix estimation with structural-based priors for speech processing |
US20170372725A1 (en) * | 2016-06-28 | 2017-12-28 | Pindrop Security, Inc. | System and method for cluster-based audio event detection |
US20180166067A1 (en) * | 2016-12-14 | 2018-06-14 | International Business Machines Corporation | Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier |
CN108305616A (en) * | 2018-01-16 | 2018-07-20 | 国家计算机网络与信息安全管理中心 | A kind of audio scene recognition method and device based on long feature extraction in short-term |
CN108492820A (en) * | 2018-03-20 | 2018-09-04 | 华南理工大学 | Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model |
CN108597539A (en) * | 2018-02-09 | 2018-09-28 | 桂林电子科技大学 | Speech-emotion recognition method based on parameter migration and sound spectrograph |
CN108831508A (en) * | 2018-06-13 | 2018-11-16 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, device and equipment |
CN109192210A (en) * | 2018-10-25 | 2019-01-11 | 腾讯科技(深圳)有限公司 | A kind of method of speech recognition, the method and device for waking up word detection |
CN109697982A (en) * | 2019-02-01 | 2019-04-30 | 北京清帆科技有限公司 | A kind of speaker speech recognition system in instruction scene |
Non-Patent Citations (7)
Title |
---|
EYBEN, F. , ET AL.: "Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies", 《ICASSP》 * |
GOSZTOLYA, G. , ET AL.: "DNN-Based Feature Extraction and Classifier Combination for Child-Directed Speech, Cold and Snoring Identification", 《INTERSPEECH》 * |
LENG, Y. , ET AL.: "Employing unlabeled data to improve the classification performance of SVM, and its application in audio event classification", 《KNOWLEDGE-BASED SYSTEMS》 * |
SOO HYUN BAE ET AL.: "Acoustic Scene Classification Using Parallel Combination of LSTM and CNN", 《DETECTION AND CLASSIFICATION OF ACOUSTIC SCENES AND EVENTS 2016》 *
YAN LENG ET AL.: "Classification of Overlapped Audio Events Based on AT, PLSA, and the Combination of Them", 《RADIO ENGINEERING》 * |
LIU, W., ET AL.: "Research status and progress of deep-learning-based speech separation techniques", 《ACTA AUTOMATICA SINICA》 *
SHEN, L., ET AL.: "Automatic tone recognition method for short Chinese speech based on fused features", 《TECHNICAL ACOUSTICS》 *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110827798A (en) * | 2019-11-12 | 2020-02-21 | 广州欢聊网络科技有限公司 | Audio signal processing method and device |
WO2021136029A1 (en) * | 2019-12-31 | 2021-07-08 | 百果园技术(新加坡)有限公司 | Training method and device for re-scoring model and method and device for speech recognition |
CN111341351A (en) * | 2020-02-25 | 2020-06-26 | 厦门亿联网络技术股份有限公司 | Voice activity detection method and device based on self-attention mechanism and storage medium |
CN111444379A (en) * | 2020-03-30 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Audio feature vector generation method and audio segment representation model training method |
CN111444379B (en) * | 2020-03-30 | 2023-08-08 | 腾讯科技(深圳)有限公司 | Audio feature vector generation method and audio fragment representation model training method |
CN112151072A (en) * | 2020-08-21 | 2020-12-29 | 北京搜狗科技发展有限公司 | Voice processing method, apparatus and medium |
CN112270933A (en) * | 2020-11-12 | 2021-01-26 | 北京猿力未来科技有限公司 | Audio identification method and device |
WO2022100691A1 (en) * | 2020-11-12 | 2022-05-19 | 北京猿力未来科技有限公司 | Audio recognition method and device |
CN112270933B (en) * | 2020-11-12 | 2024-03-12 | 北京猿力未来科技有限公司 | Audio identification method and device |
CN112885350A (en) * | 2021-02-25 | 2021-06-01 | 北京百度网讯科技有限公司 | Control method and device of network conference, electronic equipment and storage medium |
CN113724734A (en) * | 2021-08-31 | 2021-11-30 | 上海师范大学 | Sound event detection method and device, storage medium and electronic device |
CN113724734B (en) * | 2021-08-31 | 2023-07-25 | 上海师范大学 | Sound event detection method and device, storage medium and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN110349597B (en) | 2021-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110349597A (en) | A kind of speech detection method and device | |
Jiang et al. | Improving transformer-based speech recognition using unsupervised pre-training | |
JP6933264B2 (en) | Label generators, model learning devices, emotion recognition devices, their methods, programs, and recording media | |
CN104143327B (en) | A kind of acoustic training model method and apparatus | |
CN108831445A (en) | Sichuan dialect recognition methods, acoustic training model method, device and equipment | |
Tong et al. | A comparative study of robustness of deep learning approaches for VAD | |
CN108711421A (en) | A kind of voice recognition acoustic model method for building up and device and electronic equipment | |
CN108346436A (en) | Speech emotional detection method, device, computer equipment and storage medium | |
Yi et al. | Singing voice synthesis using deep autoregressive neural networks for acoustic modeling | |
Dinkel et al. | Voice activity detection in the wild via weakly supervised sound event detection | |
CN111899766B (en) | Speech emotion recognition method based on optimization fusion of depth features and acoustic features | |
CN111599339B (en) | Speech splicing synthesis method, system, equipment and medium with high naturalness | |
CN114330551A (en) | Multi-modal emotion analysis method based on multi-task learning and attention layer fusion | |
CN106448660B (en) | It is a kind of introduce big data analysis natural language smeared out boundary determine method | |
Jiang et al. | Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit. | |
JP3920749B2 (en) | Acoustic model creation method for speech recognition, apparatus thereof, program thereof and recording medium thereof, speech recognition apparatus using acoustic model | |
Yu et al. | Language Recognition Based on Unsupervised Pretrained Models. | |
Liu et al. | Hierarchical component-attention based speaker turn embedding for emotion recognition | |
Zhou et al. | Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis. | |
CN102237082B (en) | Self-adaption method of speech recognition system | |
Xu | Intelligent automobile auxiliary propagation system based on speech recognition and AI driven feature extraction techniques | |
Reshma et al. | A survey on speech emotion recognition | |
Harrag et al. | GA-based feature subset selection: Application to Arabic speaker recognition system | |
Yuan et al. | Vector quantization codebook design method for speech recognition based on genetic algorithm | |
Krishna et al. | Self supervised representation learning with deep clustering for acoustic unit discovery from raw speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||