CN110349597A - Speech detection method and device - Google Patents

Speech detection method and device

Info

Publication number
CN110349597A
CN110349597A
Authority
CN
China
Prior art keywords
model
audio
training
voice
gmm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910594785.2A
Other languages
Chinese (zh)
Other versions
CN110349597B (en)
Inventor
冷严
林蝉
赵玮玮
齐广慧
李登旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN201910594785.2A
Publication of CN110349597A
Application granted
Publication of CN110349597B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/24 - the extracted parameters being the cepstrum
    • G10L25/48 - specially adapted for particular use
    • G10L25/51 - for comparison or discrimination
    • G10L25/60 - for measuring the quality of voice signals

Abstract

The present disclosure provides a speech detection method and device. The speech detection method includes constructing a speech detection model in which a first GMM model, a second GMM model, and an LSTM model are connected in parallel and then connected in series with an RNN model. The speech detection model is trained as follows: the first GMM model, the second GMM model, and the LSTM model are trained with speech data, non-speech data, and mixed speech and non-speech data, respectively; the corresponding recognition scores they output are assembled into a three-dimensional vector that serves as the vector characterization of an audio segment; for each instant, the vector characterizations of the audio segments at that instant, the preceding instant, and the following instant form a time series, which is used as the input for training the RNN model. Audio data are tested as follows: the test audio data are divided into several audio segments, which are fed one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding instant belongs to speech, and each segment is judged to be speech or non-speech by comparing this probability with a preset threshold.

Description

Speech detection method and device
Technical field
The present disclosure relates to the field of speech detection, and in particular to a speech detection method and device.
Background art
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Speech detection, as one of the important topics in the field of audio detection, has attracted increasing attention. It has broad application prospects: it can serve as front-end preprocessing for speech recognition technology, detecting the speech data to be recognized from the audio data and thereby improving recognition efficiency; it can also detect a given person's utterances in a conference recording to form a conference summary. With the rapid development of deep learning technology, deep neural networks are gradually replacing traditional machine learning models for classification in the speech detection field. Traditional machine learning models commonly used in audio detection include the Gaussian Mixture Model (GMM), the Hidden Markov Model (HMM), and the Support Vector Machine (SVM).
The inventors have found that traditional machine learning models suffer from the following problems:
1) The audio spectral features they produce are high-dimensional, so the computational load of the neural network is large, its training and classification take considerable time, and operating efficiency is low;
2) The important information they extract from audio samples is subject to interference from redundant information, so the classification model cannot identify speech samples well, which reduces detection accuracy.
Summary of the invention
To solve the above problems, a first aspect of the present disclosure provides a speech detection method that effectively combines a GMM model, an LSTM model, and an RNN model, so that the respective advantages of the three models are brought into full play and the overall classification and detection capability of the speech detection model is improved.
To achieve the above goal, the present disclosure adopts the following technical solution:
A speech detection method, comprising:
constructing a speech detection model, in which a first GMM model, a second GMM model, and an LSTM model are connected in parallel and then connected in series with an RNN model;
training the speech detection model, the process being as follows:
the first GMM model, the second GMM model, and the LSTM model are trained with speech data, non-speech data, and mixed speech and non-speech data, respectively; the corresponding recognition scores they output are assembled into a three-dimensional vector that serves as the vector characterization of an audio segment;
for each instant, the vector characterizations of the audio segments at that instant, the preceding instant, and the following instant are composed into a time series, which is used as the input for training the RNN model until the deviation of the average probability that the audio segments at all instants belong to speech meets a preset accuracy requirement;
testing audio data, the process being as follows:
the test audio data are divided into several audio segments, which are then fed one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding instant belongs to speech.
Further, if the probability is greater than or equal to a preset threshold, the audio segment at the corresponding instant is judged to belong to speech; otherwise, it is judged not to belong to speech.
The advantage of this technical solution is that comparing the probability that an audio segment belongs to speech with a preset threshold to judge whether the segment at the corresponding instant belongs to speech makes the detection result more intuitive.
Further, the first GMM model is trained as follows:
audio segments containing only speech data are divided into frames, and the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, each forming one training sample that is stored in a first training sample set;
the training samples in the first training sample set are fed into the first GMM model, which outputs a speech recognition score for each audio frame; the speech recognition scores of all frames in an audio segment are averaged to obtain the speech recognition score of the corresponding segment;
all parameters of the first GMM model are obtained by training on the samples in the first training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the first GMM model on audio segments containing only speech data, with all parameters obtained from the first training sample set via the expectation-maximization algorithm, reduces the training and classification time of the first GMM model, improves operating efficiency, and yields more accurate scores for samples on the speech model, which results in a more accurate speech detection model and improves the detection accuracy of the entire speech detection model.
Further, the second GMM model is trained as follows:
audio segments containing only non-speech data are divided into frames, and the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, each forming one training sample that is stored in a second training sample set;
the training samples in the second training sample set are fed into the second GMM model, which outputs a non-speech recognition score for each audio frame; the non-speech recognition scores of all frames in an audio segment are averaged to obtain the non-speech recognition score of the corresponding segment;
all parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the second GMM model on audio segments containing only non-speech data, with all parameters obtained from the second training sample set via the expectation-maximization algorithm, reduces the training and classification time of the second GMM model, improves operating efficiency, and yields more accurate scores for samples on the non-speech model, which results in a more accurate non-speech detection model and improves the detection accuracy of the entire speech detection model.
Further, the LSTM model is trained as follows:
audio segments containing both speech data and non-speech data are divided into frames, the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, and these features are arranged in chronological order to form a time series;
the time series is fed into the LSTM model, which outputs the speech recognition score of the corresponding audio segment;
the LSTM model is trained with the Adam optimization algorithm until its parameters are optimal.
The advantage of this technical solution is that training the LSTM model on audio segments containing both speech and non-speech data with the Adam optimization algorithm makes full use of the contextual information of the samples for recognition, yields a more accurate probability score that a sample belongs to speech, produces an accurate speech detection model, and improves the detection accuracy of the entire speech detection model.
Further, the RNN model is trained as follows:
the recognition scores output by the trained first GMM model, second GMM model, and LSTM model are assembled into a three-dimensional vector that serves as the vector characterization of the audio segment;
the vector characterizations of the audio segments at the current instant, the preceding instant, and the following instant are composed into a time series, which is used as the input for training the RNN model; the output is the probability that the audio segment at the current instant belongs to speech;
the RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all instants belong to speech meets the preset accuracy requirement.
The advantage of this technical solution is that training the RNN model on audio segments containing both speech and non-speech data with the Adam optimization algorithm makes full use of the contextual information of the samples and improves recognition accuracy. In addition, the input of the RNN model is a three-dimensional vector: this low-dimensional feature representation reduces the training and classification time of the model, improves operating efficiency, and reduces interference from redundant information, so the probability that a sample belongs to speech is obtained more accurately, an accurate speech detection model is produced, and the detection accuracy of the entire speech detection model is improved.
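For orientation, the flow described above can be summarized in the following schematic sketch; the model objects and their score and predict_proba methods are hypothetical placeholders, not an implementation prescribed by the present disclosure:

```python
# Schematic sketch of the method: two GMMs and an LSTM produce a 3-dimensional
# score vector per audio segment; three consecutive vectors form the RNN input,
# and the RNN output probability is compared with a threshold.
# All model objects and their methods are hypothetical placeholders.

def characterize(segment, gmm_speech, gmm_nonspeech, lstm):
    """Three-dimensional vector characterization of one audio segment."""
    return [gmm_speech.score(segment),      # speech recognition score
            gmm_nonspeech.score(segment),   # non-speech recognition score
            lstm.score(segment)]            # LSTM recognition score

def detect(segments, gmm_speech, gmm_nonspeech, lstm, rnn, threshold=0.5):
    vecs = [characterize(s, gmm_speech, gmm_nonspeech, lstm) for s in segments]
    decisions = []
    for t in range(1, len(vecs) - 1):       # preceding, current, following instant
        p = rnn.predict_proba([vecs[t - 1], vecs[t], vecs[t + 1]])
        decisions.append(p >= threshold)    # True: speech, False: non-speech
    return decisions
```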
A second aspect of the present disclosure provides a speech detection device that effectively combines a GMM model, an LSTM model, and an RNN model, so that the respective advantages of the three models are brought into full play and the overall classification and detection capability of the speech detection model is improved.
To achieve the above goal, the present disclosure adopts the following technical solution:
A speech detection device, comprising:
a speech detection model construction module, configured to construct a speech detection model in which a first GMM model, a second GMM model, and an LSTM model are connected in parallel and then connected in series with an RNN model;
a speech detection model training module, configured to train the speech detection model as follows:
the first GMM model, the second GMM model, and the LSTM model are trained with speech data, non-speech data, and mixed speech and non-speech data, respectively; the corresponding recognition scores they output are assembled into a three-dimensional vector that serves as the vector characterization of an audio segment;
for each instant, the vector characterizations of the audio segments at that instant, the preceding instant, and the following instant are composed into a time series, which is used as the input for training the RNN model until the deviation of the average probability that the audio segments at all instants belong to speech meets a preset accuracy requirement;
an audio data test module, configured to test audio data as follows:
the test audio data are divided into several audio segments, which are then fed one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding instant belongs to speech.
Further, in the audio data test module, if the probability is greater than or equal to a preset threshold, the audio segment at the corresponding instant is judged to belong to speech; otherwise, it is judged not to belong to speech.
The advantage of this technical solution is that comparing the probability that an audio segment belongs to speech with a preset threshold to judge whether the segment at the corresponding instant belongs to speech makes the detection result more intuitive.
Further, in the speech detection model training module, the first GMM model is trained as follows:
audio segments containing only speech data are divided into frames, and the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, each forming one training sample that is stored in the first training sample set;
the training samples in the first training sample set are fed into the first GMM model, which outputs a speech recognition score for each audio frame; the speech recognition scores of all frames in an audio segment are averaged to obtain the speech recognition score of the corresponding segment;
all parameters of the first GMM model are obtained by training on the samples in the first training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the first GMM model on audio segments containing only speech data, with all parameters obtained from the first training sample set via the expectation-maximization algorithm, reduces the training and classification time of the first GMM model, improves operating efficiency, and yields more accurate scores for samples on the speech model, which results in a more accurate speech detection model and improves the detection accuracy of the entire speech detection model.
Further, in the speech detection model training module, the second GMM model is trained as follows:
audio segments containing only non-speech data are divided into frames, and the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, each forming one training sample that is stored in the second training sample set;
the training samples in the second training sample set are fed into the second GMM model, which outputs a non-speech recognition score for each audio frame; the non-speech recognition scores of all frames in an audio segment are averaged to obtain the non-speech recognition score of the corresponding segment;
all parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the second GMM model on audio segments containing only non-speech data, with all parameters obtained from the second training sample set via the expectation-maximization algorithm, reduces the training and classification time of the second GMM model, improves operating efficiency, and yields more accurate scores for samples on the non-speech model, which results in a more accurate non-speech detection model and improves the detection accuracy of the entire speech detection model.
Further, in the speech detection model training module, the LSTM model is trained as follows:
audio segments containing both speech data and non-speech data are divided into frames, the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, and these features are arranged in chronological order to form a time series;
the time series is fed into the LSTM model, which outputs the speech recognition score of the corresponding audio segment;
the LSTM model is trained with the Adam optimization algorithm until its parameters are optimal.
The advantage of this technical solution is that training the LSTM model on audio segments containing both speech and non-speech data with the Adam optimization algorithm makes full use of the contextual information of the samples for recognition, yields a more accurate probability score that a sample belongs to speech, produces an accurate speech detection model, and improves the detection accuracy of the entire speech detection model.
Further, in the speech detection model training module, the RNN model is trained as follows:
the recognition scores output by the trained first GMM model, second GMM model, and LSTM model are assembled into a three-dimensional vector that serves as the vector characterization of the audio segment;
the vector characterizations of the audio segments at the current instant, the preceding instant, and the following instant are composed into a time series, which is used as the input for training the RNN model; the output is the probability that the audio segment at the current instant belongs to speech;
the RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all instants belong to speech meets the preset accuracy requirement.
The advantage of this technical solution is that training the RNN model on audio segments containing both speech and non-speech data with the Adam optimization algorithm makes full use of the contextual information of the samples and improves recognition accuracy. In addition, the input of the RNN model is a three-dimensional vector: this low-dimensional feature representation reduces the training and classification time of the model, improves operating efficiency, and reduces interference from redundant information, so the probability that a sample belongs to speech is obtained more accurately, an accurate speech detection model is produced, and the detection accuracy of the entire speech detection model is improved.
A third aspect of the present disclosure provides a computer-readable storage medium.
A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the speech detection method described above.
A fourth aspect of the present disclosure provides a computer device.
A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the speech detection method described above.
The beneficial effects of the present disclosure are:
(1) Through the GMM models and the LSTM model, the present disclosure converts the input of the RNN model from a traditional audio spectrum into a low-dimensional feature representation; the low-dimensional features reduce the computational load of the RNN model, shorten its training and classification time, and improve operating efficiency.
(2) The present disclosure uses the classification scores of different models as the feature representation of an audio sample; this representation effectively extracts the important information in the audio sample and reduces interference from redundant information, so the classification model can better identify speech samples and detection accuracy is improved.
(3) The present disclosure realizes an effective combination of traditional classification models: GMM, LSTM, and RNN. The GMM models can fit the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification; the speech detection model of the present disclosure can bring the respective advantages of the GMM, LSTM, and RNN models into full play and improve the overall classification and detection performance.
(4) The speech detection method designed in the present disclosure achieves good speech detection performance even when the signal-to-noise ratio is low, and is therefore robust to noise.
(5) The design idea of the present disclosure is to combine a traditional, commonly used speech detection model with deep neural network models. The traditional speech detection model in the combined scheme is not limited to the GMM, and the deep neural network models are not limited to the LSTM and the RNN; the scheme therefore has good extensibility and provides a useful reference for combining traditional speech detection models with deep neural network models.
Brief description of the drawings
The accompanying drawings, which constitute a part of the present disclosure, are intended to provide a further understanding of the disclosure; the illustrative embodiments of the disclosure and their descriptions explain the disclosure and do not unduly limit it.
Fig. 1 is a flowchart of a speech detection method according to an embodiment of the present disclosure.
Fig. 2 is a schematic structural diagram of the speech detection model of the embodiment of the present disclosure.
Fig. 3 is a flow diagram of the audio data testing process of the embodiment of the present disclosure.
Fig. 4 is a schematic structural diagram of a speech detection device according to an embodiment of the present disclosure.
Detailed description
The present disclosure is further described below with reference to the accompanying drawings and embodiments.
It should be pointed out that the following detailed description is illustrative and intended to provide further explanation of the present disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art to which the present disclosure belongs.
It should be noted that the terms used herein merely describe specific embodiments and are not intended to limit the exemplary embodiments according to the present disclosure. As used herein, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms; furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
Explanation of terms:
GMM model: the Gaussian Mixture Model quantifies an entity precisely with Gaussian probability density functions (normal distribution curves), decomposing it into several components that are each based on a Gaussian probability density function.
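In formula form, a GMM models the probability density of a feature vector x as a weighted sum of K Gaussian components (K = 5 in the embodiment below):

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} w_k \,
\mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1,
```

where w_k, mu_k, and Sigma_k are the weight, mean, and covariance of the k-th component; the expectation-maximization algorithm mentioned below estimates these parameters.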
LSTM model: Long Short-Term Memory, a type of recurrent neural network suited to processing and predicting important events whose intervals and delays in a time series are relatively long.
RNN model: Recurrent Neural Network, a type of artificial neural network whose node connections form a directed cycle. The internal state of such a network can exhibit dynamic temporal behavior. Unlike a feed-forward neural network, an RNN can use its internal memory to process input sequences of arbitrary timing, which makes tasks such as unsegmented handwriting recognition and speech recognition easier to handle.
Embodiment one
Fig. 1 shows a flowchart of the speech detection method of this embodiment.
As shown in Fig. 1, the speech detection method of this embodiment comprises:
S101: constructing a speech detection model.
In the speech detection model, the first GMM model, the second GMM model, and the LSTM model are connected in parallel and then connected in series with the RNN model, as shown in Fig. 2.
In this embodiment, the number of Gaussian mixture components of the first GMM model is set to 5.
The number of Gaussian mixture components of the second GMM model is set to 5.
The LSTM model comprises an input layer, a hidden part consisting of 2 LSTM layers, and an output layer. The number of input-layer nodes is set to 39 and the number of output-layer neurons is set to 1; the class label of speech is "1" and the class label of non-speech is "0". A dropout layer is added after each LSTM layer, with its parameter set to 0.2.
The number of input-layer nodes of the RNN model is set to 3; there are 2 hidden layers with 50 neurons each, and a dropout layer with parameter 0.2 is added after each hidden layer; the number of output-layer neurons is set to 1, the class label of speech is "1", and the class label of non-speech is "0". The output layer outputs the probability that each audio segment belongs to speech.
It can be understood that in other embodiments the numbers of Gaussian mixture components of the first and second GMM models may be set to other values; those skilled in the art can set them according to the actual situation, and details are not repeated here.
In other embodiments, the number of LSTM layers in the LSTM model may also be set to another value; those skilled in the art can set it according to the actual situation, and details are not repeated here.
In other embodiments, the number of nodes in each layer of the RNN model and the number of its hidden layers may also be set to other values; those skilled in the art can set them according to the actual situation, and details are not repeated here.
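As a concrete illustration of these settings, the following is a minimal sketch of the two networks in Keras. The patent does not name a framework, and the width of the LSTM layers is not stated, so the framework choice and the width of 64 are assumptions; the layer counts, dropout parameters, and input and output sizes follow the text above:

```python
# Minimal sketch of the two networks described above (framework and LSTM layer
# width are assumptions; layer counts, dropout and output sizes follow the text).
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm_model(n_frames=8, n_mfcc=39):
    """LSTM branch: 39-dim MFCC input, 2 LSTM layers with dropout 0.2, sigmoid output."""
    return keras.Sequential([
        keras.Input(shape=(n_frames, n_mfcc)),   # 8 frames x 39 MFCCs per segment
        layers.LSTM(64, return_sequences=True),  # hidden width 64 is assumed
        layers.Dropout(0.2),
        layers.LSTM(64),                         # output after the last frame
        layers.Dropout(0.2),
        layers.Dense(1, activation="sigmoid"),   # speech = "1", non-speech = "0"
    ])

def build_rnn_model(n_steps=3, n_scores=3):
    """RNN head: 3-dim input over 3 instants, 2 hidden layers of 50, sigmoid output."""
    return keras.Sequential([
        keras.Input(shape=(n_steps, n_scores)),  # 3 instants x 3 model scores
        layers.SimpleRNN(50, return_sequences=True),
        layers.Dropout(0.2),
        layers.SimpleRNN(50),
        layers.Dropout(0.2),
        layers.Dense(1, activation="sigmoid"),   # P(segment belongs to speech)
    ])
```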
S102: training the speech detection model.
In a specific implementation, the speech detection model of step S102 is trained as follows:
S1021: the first GMM model, the second GMM model, and the LSTM model are trained with speech data, non-speech data, and mixed speech and non-speech data, respectively; the corresponding recognition scores they output are then assembled into a three-dimensional vector that serves as the vector characterization of the audio segment.
Specifically, in step S1021 the first GMM model is trained as follows:
S1021-11: audio segments containing only speech data are divided into frames, and the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, each forming one training sample that is stored in the first training sample set.
For example, each item of training data in the training set containing only speech data is divided into a series of non-overlapping audio segments 100 milliseconds long; each 100-millisecond segment is divided into frames with a frame length of 30 milliseconds and a frame shift of 10 milliseconds; after framing, 39-dimensional MFCC features are extracted from each audio frame, and each training speech sample is expressed by these 39-dimensional MFCC features.
MFCC stands for Mel-Frequency Cepstral Coefficient. The Mel frequency scale was proposed on the basis of the characteristics of human hearing and has a nonlinear correspondence with frequency in Hz; Mel-frequency cepstral coefficients are spectral features in Hz computed by exploiting this relationship.
It should be noted that existing methods can be used to extract the Mel-frequency cepstral coefficients; those skilled in the art can make a specific choice according to the actual situation.
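A minimal sketch of this segmentation and feature extraction with librosa follows. The patent states only the total dimension of 39, so the split into 13 MFCCs plus delta and delta-delta coefficients, and the delta window width of 5, are assumptions:

```python
# Sketch of the 100 ms segmentation and 39-dimensional MFCC extraction.
# Assumption: 39 dims = 13 MFCCs + deltas + delta-deltas (only the total is stated).
import numpy as np
import librosa

def segment_mfcc(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    seg_len = int(0.100 * sr)                  # 100 ms non-overlapping segments
    frame_len = int(0.030 * sr)                # 30 ms frame length
    hop = int(0.010 * sr)                      # 10 ms frame shift -> 8 frames
    segments = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13, n_fft=frame_len,
                                    hop_length=hop, center=False)
        feat = np.vstack([mfcc,
                          librosa.feature.delta(mfcc, width=5),
                          librosa.feature.delta(mfcc, width=5, order=2)])
        segments.append(feat.T)                # shape (8, 39): one row per frame
    return segments
```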
S1021-12: the training samples in the first training sample set are fed into the first GMM model, which outputs a speech recognition score for each audio frame; the speech recognition scores of all frames in an audio segment are averaged to obtain the speech recognition score of the corresponding segment.
The larger the speech recognition score, the larger the probability that the corresponding audio segment belongs to speech.
S1021-13: all parameters of the first GMM model are obtained by training on the samples in the first training sample set with the expectation-maximization algorithm.
The expectation-maximization (EM) algorithm is an optimization algorithm that performs maximum-likelihood estimation through iteration. It is usually used in place of the Newton iteration method to estimate the parameters of probabilistic models containing latent variables or missing data, and it can provide the posterior of the latent variables, i.e. of the missing data, so it is applied to missing-data problems.
In this embodiment, training the first GMM model on audio segments containing only speech data, with all parameters obtained from the first training sample set via the expectation-maximization algorithm, reduces the training and classification time of the first GMM model, improves operating efficiency, and yields more accurate scores for samples on the speech model, which results in a more accurate speech detection model and improves the detection accuracy of the entire speech detection model.
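A minimal sketch of this step with scikit-learn, whose GaussianMixture fits the weights, means, and covariances by EM; the diagonal covariance type is an assumption, and the same code trains the second GMM model when given non-speech frames instead:

```python
# Sketch of GMM parameter estimation by EM and of the per-segment average score.
from sklearn.mixture import GaussianMixture
import numpy as np

def train_gmm(frames_39d):
    """frames_39d: array of shape (n_frames, 39) from speech-only
    (or, for the second GMM model, non-speech-only) training audio."""
    gmm = GaussianMixture(n_components=5,         # 5 mixture components, per the text
                          covariance_type="diag",  # assumed covariance structure
                          max_iter=100)
    gmm.fit(frames_39d)                           # EM estimation of all parameters
    return gmm

def segment_score(gmm, segment_frames):
    """Average the per-frame log-likelihoods to get the segment's score."""
    return float(np.mean(gmm.score_samples(segment_frames)))
```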
Specifically, in step S1021 the second GMM model is trained as follows:
S1021-21: audio segments containing only non-speech data are divided into frames, and the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, each forming one training sample that is stored in the second training sample set.
For example, each item of training data in the training set containing only non-speech data is divided into a series of non-overlapping audio segments 100 milliseconds long; each 100-millisecond segment is divided into frames with a frame length of 30 milliseconds and a frame shift of 10 milliseconds; after framing, 39-dimensional Mel-frequency cepstral coefficients are extracted from each audio frame, and each training non-speech sample is expressed by these 39-dimensional coefficients.
It should be noted that existing methods can be used to extract the Mel-frequency cepstral coefficients; those skilled in the art can make a specific choice according to the actual situation.
S1021-22: the training samples in the second training sample set are fed into the second GMM model, which outputs a non-speech recognition score for each audio frame; the non-speech recognition scores of all frames in an audio segment are averaged to obtain the non-speech recognition score of the corresponding segment.
The larger the non-speech recognition score, the smaller the probability that the corresponding audio segment belongs to speech.
S1021-23: all parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm.
The expectation-maximization (EM) algorithm is the iterative maximum-likelihood optimization algorithm described above.
The advantage of this technical solution is that training the second GMM model on audio segments containing only non-speech data, with all parameters obtained from the second training sample set via the expectation-maximization algorithm, reduces the training and classification time of the second GMM model, improves operating efficiency, and yields more accurate recognition scores indicating that a sample belongs to non-speech, which results in a more accurate non-speech detection model and improves the detection accuracy of the entire speech detection model.
Specifically, in step S1021 the LSTM model is trained as follows:
S1021-31: audio segments containing speech data and non-speech data are divided into frames, the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, and these features are arranged in chronological order to form a time series.
Specifically, each item of training data in the training set containing speech and non-speech data is divided into a series of non-overlapping audio segments 100 milliseconds long; each 100-millisecond segment is divided into frames with a frame length of 30 milliseconds and a frame shift of 10 milliseconds, so each 100-millisecond segment is divided into eight 30-millisecond audio frames, and these 8 frames constitute one time series; after framing, 39-dimensional MFCC features are extracted from each audio frame.
The time series formed after framing each 100-millisecond segment is used as the input for training the LSTM model.
S1021-32: the above time series is fed into the LSTM model, which outputs the recognition score of the corresponding audio segment.
The larger the recognition score, the larger the probability that the corresponding audio segment belongs to speech.
S1021-33: the LSTM model is trained with the Adam optimization algorithm until its parameters are optimal.
The LSTM model is initialized with the Glorot uniform initialization method and uses the cross-entropy loss function; training uses the Adam optimization algorithm, with the learning rate set to 0.01, the batch_size parameter to 128, and the epoch parameter to 20. The LSTM model is set to produce an output only after reading the last frame of the input time series.
Adam is a first-order optimization method that can replace the traditional stochastic gradient descent procedure: it iteratively updates the neural network weights based on the training data. Its advantages are a straightforward implementation, computational efficiency, low memory requirements, invariance to diagonal rescaling of the gradients, suitability for optimization problems with large-scale data and parameters, suitability for non-stationary targets, suitability for problems with very noisy or sparse gradients, and hyperparameters that can be interpreted intuitively and generally require only minimal tuning.
The advantage of this technical solution is that training the LSTM model on audio segments containing both speech and non-speech data with the Adam optimization algorithm makes full use of the contextual information of the samples for recognition, yields a more accurate probability score that a sample belongs to speech, produces a more accurate speech detection model, and improves the detection accuracy of the entire speech detection model.
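A minimal sketch of this training configuration, reusing build_lstm_model from the architecture sketch above; Glorot uniform is the Keras default kernel initializer, and the data arrays here are placeholders:

```python
# Sketch of the LSTM training setup: Glorot-uniform init (Keras default),
# binary cross-entropy loss, Adam with learning rate 0.01, batch 128, 20 epochs.
import numpy as np
from tensorflow import keras

lstm = build_lstm_model()                        # from the sketch above
lstm.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
             loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder arrays: X holds (8, 39) MFCC time series, y holds 1/0 labels.
X = np.zeros((1024, 8, 39), dtype="float32")
y = np.zeros((1024,), dtype="float32")
lstm.fit(X, y, batch_size=128, epochs=20)
```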
S1022: for each instant, the vector characterizations of the audio segments at that instant, the preceding instant, and the following instant are composed into a time series, which is used as the input for training the RNN model until the deviation of the average probability that the audio segments at all instants belong to speech meets the preset accuracy requirement.
Suppose the training set contains N items of training data in total, and let x_i (i = 1, ..., N) denote the i-th item. Suppose x_i contains M_i audio segments in total, and let x_ij (j = 1, ..., M_i) denote the j-th audio segment of x_i; x_ij takes the form of a three-dimensional vector composed of the corresponding recognition scores output by the first GMM model, the second GMM model, and the LSTM model. When characterizing an audio segment, its contextual information is also incorporated: x_ij, the audio segment x_i(j-1) of the preceding instant, and the audio segment x_i(j+1) of the following instant together form a time series [x_i(j-1), x_ij, x_i(j+1)] (j = 2, ..., M_i - 1), and this time series is used as the input of the RNN neural network to train the RNN network.
Specifically, the RNN model is trained as follows:
the recognition scores output by the trained first GMM model, second GMM model, and LSTM model are assembled into a three-dimensional vector that serves as the vector characterization of the audio segment;
the vector characterizations of the audio segments at the current instant, the preceding instant, and the following instant are composed into a time series, which is used as the input for training the RNN model; the output is the probability that the audio segment at the current instant belongs to speech;
the RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all instants belong to speech meets the preset accuracy requirement.
The number of input-layer nodes of the RNN network is set to 3; there are 2 hidden layers with 50 neurons each, and a dropout layer with parameter 0.2 is added after each hidden layer; the number of output-layer neurons is set to 1, the class label of speech is "1", and the class label of non-speech is "0"; the output layer outputs the probability that each audio segment belongs to speech.
The RNN network is initialized with the Glorot uniform initialization method and uses the cross-entropy loss function; training uses the Adam optimization algorithm, with the learning rate set to 0.01, the batch_size parameter to 128, and the epoch parameter to 20.
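A minimal sketch of assembling the three-instant input sequences and training the RNN head, reusing build_rnn_model from above; the placeholder arrays stand in for the score vectors and labels of one recording:

```python
# Sketch: build [x_i(j-1), x_ij, x_i(j+1)] windows and train the RNN head with
# Adam (learning rate 0.01), binary cross-entropy, batch 128, 20 epochs.
import numpy as np
from tensorflow import keras

def make_windows(scores, labels):
    """scores: (M, 3) per-segment [GMM-speech, GMM-non-speech, LSTM] vectors of
    one recording; returns 3-step windows for j = 2..M-1 with the center labels."""
    X = np.stack([scores[j - 1:j + 2] for j in range(1, len(scores) - 1)])
    return X, labels[1:-1]                       # X shape: (M-2, 3, 3)

rnn = build_rnn_model()                          # from the sketch above
rnn.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
            loss="binary_crossentropy", metrics=["accuracy"])

scores = np.random.rand(500, 3).astype("float32")        # placeholder scores
labels = np.random.randint(0, 2, 500).astype("float32")  # placeholder labels
X, y = make_windows(scores, labels)
rnn.fit(X, y, batch_size=128, epochs=20)
```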
S103: testing the audio data.
In a specific implementation, as shown in Fig. 3, the audio data of step S103 are tested as follows:
the test audio data are divided into several audio segments, which are then fed one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding instant belongs to speech.
Specifically, if the probability is greater than or equal to a preset threshold, the audio segment at the corresponding instant is judged to belong to speech; otherwise, it is judged not to belong to speech.
Suppose K audio segments are obtained after dividing the test audio data, and let y_k denote the k-th segment, k = 1, ..., K. For each audio segment y_k, the recognition scores obtained from the first GMM model, the second GMM model, and the LSTM model are assembled into a 3-dimensional vector, and this 3-dimensional vector characterizes the audio segment y_k.
y_k, the audio segment y_(k-1) of the preceding instant, and the audio segment y_(k+1) of the following instant together form a time series [y_(k-1), y_k, y_(k+1)] (k = 2, ..., K - 1), which is used as the input of the RNN neural network to obtain the probability that the audio segment y_k belongs to speech. The threshold is set to 0.5: if the probability is greater than 0.5, y_k is judged to be speech; if it is less than 0.5, y_k is judged to be non-speech.
It should be noted that those skilled in the art can set the size of the threshold on the probability that an audio segment belongs to speech according to the accuracy requirements.
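A minimal sketch of this test-time decision, chaining the hypothetical helpers from the sketches above (segment_mfcc, segment_score, and the two trained networks):

```python
# Sketch of testing: score each 100 ms segment with the two GMMs and the LSTM,
# window the 3-d vectors over three instants, run the RNN head, threshold at 0.5.
import numpy as np

def detect_speech(segments_mfcc, gmm_speech, gmm_nonspeech, lstm, rnn, thr=0.5):
    """segments_mfcc: list of (8, 39) arrays, one per 100 ms test segment.
    Returns booleans for the interior segments k = 2..K-1 (True = speech)."""
    scores = np.array([[segment_score(gmm_speech, s),
                        segment_score(gmm_nonspeech, s),
                        float(lstm.predict(s[None, ...], verbose=0)[0, 0])]
                       for s in segments_mfcc], dtype="float32")
    windows = np.stack([scores[k - 1:k + 2]        # [y_(k-1), y_k, y_(k+1)]
                        for k in range(1, len(scores) - 1)])
    probs = rnn.predict(windows, verbose=0)[:, 0]  # P(y_k belongs to speech)
    return probs >= thr
```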
Through the GMM models and the LSTM model, this embodiment converts the input of the RNN model from a traditional audio spectrum into a low-dimensional feature representation; the low-dimensional features reduce the computational load of the RNN model, shorten its training and classification time, and improve operating efficiency.
This embodiment uses the classification scores of different models as the feature representation of an audio sample; this representation effectively extracts the important information in the audio sample and reduces interference from redundant information, so the classification model can better identify speech samples and detection accuracy is improved.
This embodiment realizes an effective combination of traditional classification models: GMM, LSTM, and RNN. The GMM models can fit the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification; the speech detection model of the present disclosure can bring the respective advantages of the GMM, LSTM, and RNN models into full play and improve the overall classification and detection performance.
This embodiment also achieves good speech detection performance when the signal-to-noise ratio is low, and is therefore robust to noise.
The design idea of this embodiment is to combine a traditional, commonly used speech detection model with deep neural network models. The traditional speech detection model in the combined scheme is not limited to the GMM, and the deep neural network models are not limited to the LSTM and the RNN; the scheme therefore has good extensibility and provides a useful reference for combining traditional speech detection models with deep neural network models.
Embodiment two
Fig. 4 is a schematic structural diagram of a speech detection device provided by an embodiment of the present disclosure.
As shown in Fig. 4, the speech detection device of this embodiment comprises:
(1) a speech detection model construction module, configured to construct a speech detection model in which the first GMM model, the second GMM model, and the LSTM model are connected in parallel and then connected in series with the RNN model;
(2) a speech detection model training module, configured to train the speech detection model as follows:
the first GMM model, the second GMM model, and the LSTM model are trained with speech data, non-speech data, and mixed speech and non-speech data, respectively; the corresponding recognition scores they output are assembled into a three-dimensional vector that serves as the vector characterization of an audio segment;
for each instant, the vector characterizations of the audio segments at that instant, the preceding instant, and the following instant are composed into a time series, which is used as the input for training the RNN model until the deviation of the average probability that the audio segments at all instants belong to speech meets a preset accuracy requirement.
Specifically, in the speech detection model training module, the first GMM model is trained as follows:
audio segments containing only speech data are divided into frames, and the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, each forming one training sample that is stored in the first training sample set;
the training samples in the first training sample set are fed into the first GMM model, which outputs a speech recognition score for each audio frame; the speech recognition scores of all frames in an audio segment are averaged to obtain the speech recognition score of the corresponding segment;
all parameters of the first GMM model are obtained by training on the samples in the first training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the first GMM model on audio segments containing only speech data, with all parameters obtained from the first training sample set via the expectation-maximization algorithm, reduces the training and classification time of the first GMM model, improves operating efficiency, and yields more accurate scores for samples on the speech model, which results in a more accurate speech detection model and improves the detection accuracy of the entire speech detection model.
In the speech detection model training module, the second GMM model is trained as follows:
audio segments containing only non-speech data are divided into frames, and the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, each forming one training sample that is stored in the second training sample set;
the training samples in the second training sample set are fed into the second GMM model, which outputs a non-speech recognition score for each audio frame; the non-speech recognition scores of all frames in an audio segment are averaged to obtain the non-speech recognition score of the corresponding segment;
all parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the second GMM model on audio segments containing only non-speech data, with all parameters obtained from the second training sample set via the expectation-maximization algorithm, reduces the training and classification time of the second GMM model, improves operating efficiency, and yields more accurate scores for samples on the non-speech model, which results in a more accurate non-speech detection model and improves the detection accuracy of the entire speech detection model.
In the speech detection model training module, the LSTM model is trained as follows:
audio segments containing speech data and non-speech data are divided into frames, the Mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, and these features are arranged in chronological order to form a time series;
the time series is fed into the LSTM model, which outputs the recognition score of the corresponding audio segment;
the LSTM model is trained with the Adam optimization algorithm until its parameters are optimal.
The advantage of this technical solution is that training the LSTM model on audio segments containing both speech and non-speech data with the Adam optimization algorithm makes full use of the contextual information of the samples for recognition, yields a more accurate probability score that a sample belongs to speech, produces a more accurate speech detection model, and improves the detection accuracy of the entire speech detection model.
In the speech detection model training module, the RNN model is trained as follows:
the recognition scores output by the trained first GMM model, second GMM model, and LSTM model are assembled into a three-dimensional vector that serves as the vector characterization of the audio segment;
the vector characterizations of the audio segments at the current instant, the preceding instant, and the following instant are composed into a time series, which is used as the input for training the RNN model; the output is the probability that the audio segment at the current instant belongs to speech;
the RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all instants belong to speech meets the preset accuracy requirement.
The advantage of this technical solution is that training the RNN model on audio segments containing both speech and non-speech data with the Adam optimization algorithm makes full use of the contextual information of the samples and improves recognition accuracy. In addition, the input of the RNN model is a three-dimensional vector: this low-dimensional feature representation reduces the training and classification time of the model, improves operating efficiency, and reduces interference from redundant information, so the probability that a sample belongs to speech is obtained more accurately, a more accurate speech detection model is produced, and the detection accuracy of the entire speech detection model is improved.
(3) an audio data test module, configured to test audio data as follows:
the test audio data are divided into several audio segments, which are then fed one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding instant belongs to speech.
Specifically, in the audio data test module, if the probability is greater than or equal to a preset threshold, the audio segment at the corresponding instant is judged to belong to speech; otherwise, it is judged not to belong to speech.
The advantage of this technical solution is that comparing the probability that an audio segment belongs to speech with a preset threshold to judge whether the segment at the corresponding instant belongs to speech makes the detection result more intuitive.
Through the GMM models and the LSTM model, this embodiment converts the input of the RNN model from a traditional audio spectrum into a low-dimensional feature representation; the low-dimensional features reduce the computational load of the RNN model, shorten its training and classification time, and improve operating efficiency.
This embodiment uses the classification scores of different models as the feature representation of an audio sample; this representation effectively extracts the important information in the audio sample and reduces interference from redundant information, so the classification model can better identify speech samples and detection accuracy is improved.
This embodiment realizes an effective combination of traditional classification models: GMM, LSTM, and RNN. The GMM models can fit the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification; the speech detection model of the present disclosure can bring the respective advantages of the GMM, LSTM, and RNN models into full play and improve the overall classification and detection performance.
This embodiment also achieves good speech detection performance when the signal-to-noise ratio is low, and is therefore robust to noise.
Embodiment three
A third aspect of the present disclosure provides a computer-readable storage medium.
A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the speech detection method shown in Fig. 1.
Through the GMM models and the LSTM model, this embodiment converts the input of the RNN model from a traditional audio spectrum into a low-dimensional feature representation; the low-dimensional features reduce the computational load of the RNN model, shorten its training and classification time, and improve operating efficiency.
This embodiment uses the classification scores of different models as the feature representation of an audio sample; this representation effectively extracts the important information in the audio sample and reduces interference from redundant information, so the classification model can better identify speech samples and detection accuracy is improved.
This embodiment realizes an effective combination of traditional classification models: GMM, LSTM, and RNN. The GMM models can fit the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification; the speech detection model of the present disclosure can bring the respective advantages of the GMM, LSTM, and RNN models into full play and improve the overall classification and detection performance.
This embodiment also achieves good speech detection performance when the signal-to-noise ratio is low, and is therefore robust to noise.
Example IV
4th aspect of the disclosure provides a kind of computer equipment.
A kind of computer equipment can be run on a memory and on a processor including memory, processor and storage Computer program, the processor realize the step in speech detection method as shown in Figure 1 when executing described program.
The present embodiment is converted to the input of RNN model by traditional audible spectrum by GMM model and LSTM model low The character representation of dimension, low-dimensional feature can reduce the operand of RNN model, reduce training and the classification time of RNN model, improve Operation efficiency.
The present embodiment uses the classification scores of the different models as the feature representation of an audio sample. This feature representation effectively extracts the important information in the audio sample and reduces the interference of redundant information, so that the classification model can better identify speech samples and the detection accuracy is improved.
The present embodiment realizes an effective combination of the traditional classification models: the GMM models, the LSTM model, and the RNN model. The GMM models can model the feature structure of audio samples well, while the LSTM model and the RNN model can effectively exploit the contextual information of audio samples for classification. The speech detection model of the disclosure thus gives full play to the respective advantages of the GMM, LSTM, and RNN models, improving the overall classification and detection performance.
The present embodiment also achieves good speech detection performance at low signal-to-noise ratios, and is therefore robust to noise.
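The embodiments above stress that the RNN receives only a low-dimensional input: a sliding window of three 3-D score vectors (previous, current, and subsequent moment) rather than a full spectrogram. Below is a minimal Keras-style sketch of such an RNN together with the windowing step; the layer sizes and function names are illustrative assumptions, not parameters fixed by the disclosure.

```python
import numpy as np
from tensorflow import keras

# Input: a window of 3 time steps (previous, current, next moment), each step
# being the 3-D score vector produced by the two GMMs and the LSTM.
rnn_model = keras.Sequential([
    keras.Input(shape=(3, 3)),
    keras.layers.SimpleRNN(16),                   # small: the input is already low-dimensional
    keras.layers.Dense(1, activation="sigmoid"),  # probability that the current segment is speech
])
rnn_model.compile(optimizer="adam", loss="binary_crossentropy")

def windows_from_vectors(vectors):
    """Stack each moment with its previous and next moment into a 3-step series.

    Edge moments without both neighbors are skipped in this sketch.
    """
    vectors = np.asarray(vectors)
    return np.stack([vectors[i - 1:i + 2] for i in range(1, len(vectors) - 1)])
```

Because each time step carries only three values, even a very small recurrent layer suffices, which is where the training-time and classification-time savings described above come from.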
It should be understood by those skilled in the art that embodiments of the disclosure may be provided as a method, a system, or a computer program product. Accordingly, the disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) containing computer-usable program code.
The disclosure is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device create a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Those of ordinary skill in the art will appreciate that all or part of the processes of the above embodiment methods can be accomplished by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The foregoing are merely preferred embodiments of the disclosure and are not intended to limit the disclosure; for those skilled in the art, the disclosure may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the disclosure shall fall within the protection scope of the disclosure.
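Before the claims, one concrete note on the GMM branch, which the embodiments above credit with modeling the feature structure of audio samples well: the sketch below fits a speech GMM on per-frame MFCC features with the expectation-maximization algorithm, mirroring the training procedure recited in claim 3 below. scikit-learn's GaussianMixture (whose fit method runs EM), librosa for MFCC extraction, and the component count are assumptions for illustration, not the patent's reference implementation.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def fit_speech_gmm(speech_segments, sr, n_mfcc=13, n_components=8):
    """Fit a GMM on MFCC frames drawn from speech-only audio segments.

    speech_segments: iterable of 1-D waveforms containing only speech data.
    n_components is an illustrative choice, not specified by the disclosure.
    """
    # First training sample set: per-frame MFCCs from all speech-only segments.
    frames = np.vstack([
        librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc).T
        for seg in speech_segments
    ])
    # fit() runs expectation-maximization to estimate all GMM parameters.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(frames)
```

A second GMM for non-speech would be fitted the same way on segments containing only non-speech data.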

Claims (10)

1. A speech detection method, characterized by comprising:
constructing a speech detection model, wherein the speech detection model connects a first GMM model, a second GMM model, and an LSTM model in parallel, followed in series by an RNN model;
training the speech detection model as follows:
training the first GMM model, the second GMM model, and the LSTM model with speech data, non-speech data, and mixed speech and non-speech data, respectively, each model outputting a corresponding recognition score; the three scores form a three-dimensional vector that serves as the vector characterization of an audio segment;
forming a time series from the audio-segment vector characterizations of each moment, the moment before it, and the moment after it, and training the RNN model with this series as input, until the deviation of the average probability that the audio segments of all moments belong to speech meets a preset precision requirement;
testing audio data as follows:
segmenting the test audio data into several audio segments, then inputting the audio segments one by one into the trained speech detection model, and obtaining the probability that the audio segment of the corresponding moment belongs to speech (a minimal sketch of this test-time flow is given after the claims below).
2. The speech detection method according to claim 1, characterized in that, after the probability that a test audio segment belongs to speech is obtained, if the probability is greater than or equal to a set threshold, the audio segment of the corresponding moment is judged to belong to speech; otherwise, the audio segment of the corresponding moment is judged not to belong to speech.
3. The speech detection method according to claim 1, characterized in that the first GMM model is trained as follows:
framing the audio segments that contain only speech data, extracting Mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a first training sample set;
inputting the training samples of the first training sample set into the first GMM model, outputting the speech recognition score of each audio frame, and averaging the speech recognition scores of all frames in an audio segment to obtain the speech recognition score of that segment;
obtaining all parameters of the first GMM model by training on the training samples of the first training sample set with the expectation-maximization algorithm;
or
the second GMM model is trained as follows:
framing the audio segments that contain only non-speech data, extracting Mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a second training sample set;
inputting the training samples of the second training sample set into the second GMM model, outputting the non-speech recognition score of each audio frame, and averaging the non-speech recognition scores of all frames in an audio segment to obtain the non-speech recognition score of that segment;
obtaining all parameters of the second GMM model by training on the training samples of the second training sample set with the expectation-maximization algorithm.
4. The speech detection method according to claim 1, characterized in that the LSTM model is trained as follows:
framing the audio segments that contain both speech data and non-speech data, extracting Mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and arranging these audio features in chronological order to form a time series;
inputting the above time series into the LSTM model, which outputs the recognition score of the corresponding audio segment;
training the LSTM model with the Adam optimization algorithm until the parameters of the LSTM model are optimal.
5. The speech detection method according to claim 1, characterized in that the RNN model is trained as follows:
forming a three-dimensional vector from the recognition scores output by the trained first GMM model, second GMM model, and LSTM model, the vector serving as the vector characterization of an audio segment;
forming a time series from the audio-segment vector characterizations of the current moment, the previous moment, and the subsequent moment, and training the RNN model with this series as input, which outputs the probability that the audio segment of the current moment belongs to speech;
training the RNN model with the Adam optimization algorithm until the deviation of the average probability that the audio segments of all moments belong to speech meets a preset precision requirement.
6. A speech detection device, characterized by comprising:
a speech detection model construction module, configured to construct a speech detection model, wherein the speech detection model connects a first GMM model, a second GMM model, and an LSTM model in parallel, followed in series by an RNN model;
a speech detection model training module, configured to train the speech detection model as follows:
training the first GMM model, the second GMM model, and the LSTM model with speech data, non-speech data, and mixed speech and non-speech data, respectively, each model outputting a corresponding recognition score; the three scores form a three-dimensional vector that serves as the vector characterization of an audio segment;
forming a time series from the audio-segment vector characterizations of each moment, the moment before it, and the moment after it, and training the RNN model with this series as input, until the deviation of the average probability that the audio segments of all moments belong to speech meets a preset precision requirement; and
an audio data test module, configured to test audio data as follows:
segmenting the test audio data into several audio segments, then inputting the audio segments one by one into the trained speech detection model, and obtaining the probability that the audio segment of the corresponding moment belongs to speech.
7. The speech detection device according to claim 6, characterized in that, in the audio data test module, if the probability is greater than or equal to a set threshold, the audio segment of the corresponding moment is judged to belong to speech; otherwise, the audio segment of the corresponding moment is judged not to belong to speech.
8. The speech detection device according to claim 6, characterized in that, in the speech detection model training module, the first GMM model is trained as follows:
framing the audio segments that contain only speech data, extracting Mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a first training sample set;
inputting the training samples of the first training sample set into the first GMM model, outputting the speech recognition score of each audio frame, and averaging the speech recognition scores of all frames in an audio segment to obtain the speech recognition score of that segment;
obtaining all parameters of the first GMM model by training on the training samples of the first training sample set with the expectation-maximization algorithm;
or
in the speech detection model training module, the second GMM model is trained as follows:
framing the audio segments that contain only non-speech data, extracting Mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a second training sample set;
inputting the training samples of the second training sample set into the second GMM model, outputting the non-speech recognition score of each audio frame, and averaging the non-speech recognition scores of all frames in an audio segment to obtain the non-speech recognition score of that segment;
obtaining all parameters of the second GMM model by training on the training samples of the second training sample set with the expectation-maximization algorithm;
or
in the speech detection model training module, the LSTM model is trained as follows:
framing the audio segments that contain both speech data and non-speech data, extracting Mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and arranging these audio features in chronological order to form a time series;
inputting the above time series into the LSTM model, which outputs the recognition score of the corresponding audio segment;
training the LSTM model with the Adam optimization algorithm until the parameters of the LSTM model are optimal;
or
in the speech detection model training module, the RNN model is trained as follows:
forming a three-dimensional vector from the recognition scores output by the trained first GMM model, second GMM model, and LSTM model, the vector serving as the vector characterization of an audio segment;
forming a time series from the audio-segment vector characterizations of the current moment, the previous moment, and the subsequent moment, and training the RNN model with this series as input, which outputs the probability that the audio segment of the current moment belongs to speech;
training the RNN model with the Adam optimization algorithm until the deviation of the average probability that the audio segments of all moments belong to speech meets a preset precision requirement.
9. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, it implements the steps of the speech detection method according to any one of claims 1 to 5.
10. A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that, when executing the program, the processor implements the steps of the speech detection method according to any one of claims 1 to 5.
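Read together, claims 1 and 2 amount to the following test-time flow: split the test audio into segments, characterize each segment with the three classifier scores, run the RNN over a sliding three-step window, and threshold the resulting speech probability. The sketch below reuses the hypothetical helpers from the earlier sketches (segment_score_vector, windows_from_vectors, rnn_model, lstm_model) and is a minimal illustration under those assumptions, not the patented reference implementation.

```python
import numpy as np

def detect_speech(audio, sr, seg_len, models, threshold=0.5):
    """Segment test audio and decide, per segment, whether it is speech.

    models: (gmm_speech, gmm_nonspeech, lstm_model, rnn_model), all pre-trained.
    seg_len: segment length in samples; threshold as in claim 2.
    """
    gmm_s, gmm_n, lstm, rnn = models

    # Claim 1: segment the test audio data into several audio segments
    # (any trailing partial segment is dropped in this sketch).
    segments = [audio[i:i + seg_len]
                for i in range(0, len(audio) - seg_len + 1, seg_len)]

    # Characterize each segment as a 3-D score vector (see earlier sketch).
    vectors = [segment_score_vector(s, sr, gmm_s, gmm_n, lstm) for s in segments]

    # Three-step windows (previous, current, next moment) feed the RNN, which
    # outputs the probability that the current segment belongs to speech.
    probs = rnn.predict(windows_from_vectors(vectors), verbose=0).ravel()

    # Claim 2: probability >= threshold means the segment is judged to be speech.
    # The returned mask aligns with segments[1:-1], since edge segments lack a
    # previous or next moment in this sketch.
    return probs >= threshold
```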
CN201910594785.2A 2019-07-03 2019-07-03 Voice detection method and device Active CN110349597B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910594785.2A CN110349597B (en) 2019-07-03 2019-07-03 Voice detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910594785.2A CN110349597B (en) 2019-07-03 2019-07-03 Voice detection method and device

Publications (2)

Publication Number Publication Date
CN110349597A true CN110349597A (en) 2019-10-18
CN110349597B CN110349597B (en) 2021-06-25

Family

ID=68177773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910594785.2A Active CN110349597B (en) 2019-07-03 2019-07-03 Voice detection method and device

Country Status (1)

Country Link
CN (1) CN110349597B (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060111900A1 (en) * 2004-11-25 2006-05-25 Lg Electronics Inc. Speech distinction method
CN101548313A (en) * 2006-11-16 2009-09-30 国际商业机器公司 Voice activity detection system and method
CN103226948A (en) * 2013-04-22 2013-07-31 山东师范大学 Audio scene recognition method based on acoustic events
CN103646649A (en) * 2013-12-30 2014-03-19 中国科学院自动化研究所 High-efficiency voice detecting method
US20150228277A1 (en) * 2014-02-11 2015-08-13 Malaspina Labs (Barbados), Inc. Voiced Sound Pattern Detection
US20170069313A1 (en) * 2015-09-06 2017-03-09 International Business Machines Corporation Covariance matrix estimation with structural-based priors for speech processing
US20170372725A1 (en) * 2016-06-28 2017-12-28 Pindrop Security, Inc. System and method for cluster-based audio event detection
US20180166067A1 (en) * 2016-12-14 2018-06-14 International Business Machines Corporation Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model
CN108831508A (en) * 2018-06-13 2018-11-16 百度在线网络技术(北京)有限公司 Voice activity detection method, device and equipment
CN109192210A (en) * 2018-10-25 2019-01-11 腾讯科技(深圳)有限公司 A kind of method of speech recognition, the method and device for waking up word detection
CN109697982A (en) * 2019-02-01 2019-04-30 北京清帆科技有限公司 A kind of speaker speech recognition system in instruction scene

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
EYBEN, F., ET AL.: "Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies", ICASSP *
GOSZTOLYA, G., ET AL.: "DNN-Based Feature Extraction and Classifier Combination for Child-Directed Speech, Cold and Snoring Identification", INTERSPEECH *
LENG, Y., ET AL.: "Employing unlabeled data to improve the classification performance of SVM, and its application in audio event classification", KNOWLEDGE-BASED SYSTEMS *
SOO HYUN BAE ET AL.: "Acoustic Scene Classification Using Parallel Combination of LSTM and CNN", DETECTION AND CLASSIFICATION OF ACOUSTIC SCENES AND EVENTS 2016 *
YAN LENG ET AL.: "Classification of Overlapped Audio Events Based on AT, PLSA, and the Combination of Them", RADIOENGINEERING *
LIU, WENJU, ET AL.: "Research status and progress of speech separation based on deep learning" (in Chinese), ACTA AUTOMATICA SINICA *
SHEN, LINGJIE, ET AL.: "Automatic tone recognition of short Chinese speech based on fused features" (in Chinese), TECHNICAL ACOUSTICS *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827798A (en) * 2019-11-12 2020-02-21 广州欢聊网络科技有限公司 Audio signal processing method and device
WO2021136029A1 (en) * 2019-12-31 2021-07-08 百果园技术(新加坡)有限公司 Training method and device for re-scoring model and method and device for speech recognition
CN111341351A (en) * 2020-02-25 2020-06-26 厦门亿联网络技术股份有限公司 Voice activity detection method and device based on self-attention mechanism and storage medium
CN111444379A (en) * 2020-03-30 2020-07-24 腾讯科技(深圳)有限公司 Audio feature vector generation method and audio segment representation model training method
CN111444379B (en) * 2020-03-30 2023-08-08 腾讯科技(深圳)有限公司 Audio feature vector generation method and audio fragment representation model training method
CN112151072A (en) * 2020-08-21 2020-12-29 北京搜狗科技发展有限公司 Voice processing method, apparatus and medium
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device
WO2022100691A1 (en) * 2020-11-12 2022-05-19 北京猿力未来科技有限公司 Audio recognition method and device
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN112885350A (en) * 2021-02-25 2021-06-01 北京百度网讯科技有限公司 Control method and device of network conference, electronic equipment and storage medium
CN113724734A (en) * 2021-08-31 2021-11-30 上海师范大学 Sound event detection method and device, storage medium and electronic device
CN113724734B (en) * 2021-08-31 2023-07-25 上海师范大学 Sound event detection method and device, storage medium and electronic device

Also Published As

Publication number Publication date
CN110349597B (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN110349597A (en) A kind of speech detection method and device
Jiang et al. Improving transformer-based speech recognition using unsupervised pre-training
JP6933264B2 (en) Label generators, model learning devices, emotion recognition devices, their methods, programs, and recording media
CN104143327B (en) A kind of acoustic training model method and apparatus
CN108831445A (en) Sichuan dialect recognition methods, acoustic training model method, device and equipment
Tong et al. A comparative study of robustness of deep learning approaches for VAD
CN108711421A (en) A kind of voice recognition acoustic model method for building up and device and electronic equipment
CN108346436A (en) Speech emotional detection method, device, computer equipment and storage medium
Yi et al. Singing voice synthesis using deep autoregressive neural networks for acoustic modeling
Dinkel et al. Voice activity detection in the wild via weakly supervised sound event detection
CN111899766B (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN114330551A (en) Multi-modal emotion analysis method based on multi-task learning and attention layer fusion
CN106448660B (en) It is a kind of introduce big data analysis natural language smeared out boundary determine method
Jiang et al. Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit.
JP3920749B2 (en) Acoustic model creation method for speech recognition, apparatus thereof, program thereof and recording medium thereof, speech recognition apparatus using acoustic model
Yu et al. Language Recognition Based on Unsupervised Pretrained Models.
Liu et al. Hierarchical component-attention based speaker turn embedding for emotion recognition
Zhou et al. Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis.
CN102237082B (en) Self-adaption method of speech recognition system
Xu Intelligent automobile auxiliary propagation system based on speech recognition and AI driven feature extraction techniques
Reshma et al. A survey on speech emotion recognition
Harrag et al. GA-based feature subset selection: Application to Arabic speaker recognition system
Yuan et al. Vector quantization codebook design method for speech recognition based on genetic algorithm
Krishna et al. Self supervised representation learning with deep clustering for acoustic unit discovery from raw speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant