CN110349597A - Speech detection method and device (Google Patents)
- Publication number: CN110349597A (application CN201910594785.2A)
- Authority: CN (China)
- Prior art keywords: model, audio, training, voice, gmm
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/063—Training (under G10L15/06—Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/08—Speech classification or search
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/60—Speech or voice analysis techniques specially adapted for comparison or discrimination, for measuring the quality of voice signals
Abstract
Present disclose provides speech detection method and devices.Speech detection method includes building speech detection model;It is in series with RNN model again after being connected in parallel by the first GMM model, the second GMM model and LSTM model;The process of training speech detection model are as follows: respectively correspond the first GMM model of training, the second GMM model and LSTM model using voice data, non-speech data and voice and non-voice blended data, output phase should identify score value, a three-dimensional vector is formed, the vector as audio fragment characterizes;By audio fragment vector characterization one time series of composition at each moment, each moment previous moment and later moment in time, RNN model is trained as input quantity;The process of testing audio data are as follows: segmentation testing audio data are several audio fragments, it is input to the speech detection model of training completion one by one again, the audio fragment for obtaining the corresponding moment belongs to the probability value of voice, and audio fragment is determined as voice or non-voice by comparison probability value and given threshold.
Description
Technical field
The disclosure belongs to the field of speech detection, and more particularly relates to a speech detection method and device.
Background art
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Speech detection, one of the core tasks in the audio detection field, has attracted wide attention. It has broad application prospects: as a front-end preprocessing step for speech recognition, it extracts the speech data to be recognized from audio data and improves recognition efficiency; it can also locate a particular speaker's utterances in a meeting recording to build a meeting summary. With the rapid development of deep learning, deep neural networks are gradually replacing traditional machine learning classifiers in the speech detection field. Traditional machine learning models commonly used in audio detection include the Gaussian Mixture Model (GMM), the Hidden Markov Model (HMM), and the Support Vector Machine (SVM).
The inventors have found that traditional machine learning models suffer from the following problems:
1) The audio spectral features they operate on are high-dimensional, so the downstream neural network performs a large amount of computation; training and classification are time-consuming and operation efficiency is low.
2) The important information in the extracted audio samples is interfered with by redundant information, so the classification model cannot identify speech samples well, which reduces detection accuracy.
Summary of the invention
To solve the above problems, a first aspect of the present disclosure provides a speech detection method that effectively combines a GMM model, an LSTM model and an RNN model, so that the respective advantages of the three models are brought into full play and the overall classification and detection ability of the speech detection model is improved.
To achieve the above goal, the disclosure adopts the following technical scheme:
A speech detection method, comprising:
constructing a speech detection model, in which a first GMM model, a second GMM model and an LSTM model are connected in parallel and then connected in series with an RNN model;
training the speech detection model, as follows:
using speech data, non-speech data, and mixed speech/non-speech data to train the first GMM model, the second GMM model and the LSTM model respectively; each model outputs a recognition score, and the three scores form a three-dimensional vector that serves as the vector characterization of the audio clip;
composing the vector characterizations of each moment, its preceding moment and its following moment into a time series, and training the RNN model on these input sequences until the deviation of the average probability that the audio clips at all moments belong to speech meets a preset accuracy requirement;
testing audio data, as follows:
segmenting the test audio into several audio clips and feeding them one by one into the trained speech detection model, which outputs the probability that the clip at the corresponding moment belongs to speech.
Further, if the probability is greater than or equal to a set threshold, the audio clip at the corresponding moment is judged to be speech; otherwise it is judged to be non-speech.
The advantage of this technical solution is that comparing the speech probability of each audio clip against a set threshold to decide whether the clip at the corresponding moment belongs to speech makes the detection result more intuitive.
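As a minimal sketch of this decision rule (the function and variable names are illustrative, not from the patent), the threshold comparison can be written as:

```python
import numpy as np

def classify_clips(speech_probs, threshold=0.5):
    """Map per-clip speech probabilities to labels: 1 = speech, 0 = non-speech.

    A clip whose probability is greater than or equal to the threshold is
    judged to be speech, matching the rule described above.
    """
    probs = np.asarray(speech_probs, dtype=float)
    return (probs >= threshold).astype(int)

# Example: probabilities for five consecutive audio clips.
labels = classify_clips([0.92, 0.48, 0.50, 0.13, 0.77], threshold=0.5)
print(labels.tolist())  # [1, 0, 1, 0, 1]
```

Note that a probability exactly equal to the threshold is classified as speech, as the "greater than or equal to" rule above requires.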
Further, the first GMM model is trained as follows:
the audio clips containing only speech data are divided into frames, and the Mel-frequency cepstral coefficients (MFCCs) of a preset dimension are extracted from each frame as the audio features; each feature vector forms a training sample and is stored in a first training sample set;
the training samples in the first training sample set are input into the first GMM model, which outputs a speech recognition score for each audio frame; the scores of all frames in an audio clip are averaged to obtain the speech recognition score of that clip;
all parameters of the first GMM model are obtained by training on the first training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the first GMM model on audio clips containing only speech data, with all parameters estimated from the first training sample set by the expectation-maximization algorithm, reduces the training and classification time of the first GMM model and improves operation efficiency; the score of a sample on the speech model can be obtained more accurately, yielding a more accurate speech detection model and improving the detection accuracy of the entire speech detection model.
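A minimal sketch of this step using scikit-learn's EM-fitted `GaussianMixture` (random features stand in for real MFCCs; the 39-dimensional feature size and 5 mixture components follow the embodiment described later, and all names are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for MFCC features: one 39-dim vector per frame of speech-only audio.
speech_frames = rng.normal(loc=1.0, scale=0.5, size=(500, 39))

# The "first GMM": fitted with the EM algorithm on speech frames only
# (5 mixture components, as in the embodiment below).
speech_gmm = GaussianMixture(n_components=5, random_state=0).fit(speech_frames)

# Score one clip: per-frame log-likelihoods under the speech GMM, averaged
# over all frames of the clip, give the clip-level speech recognition score.
clip_frames = rng.normal(loc=1.0, scale=0.5, size=(8, 39))  # ~8 frames per clip
clip_score = speech_gmm.score_samples(clip_frames).mean()
```

The second GMM would be fitted the same way on non-speech frames; at test time a clip scoring high under one model and low under the other gives a clear contrast between the two components of the three-dimensional vector.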
Further, the second GMM model is trained as follows:
the audio clips containing only non-speech data are divided into frames, and the MFCCs of a preset dimension are extracted from each frame as the audio features; each feature vector forms a training sample and is stored in a second training sample set;
the training samples in the second training sample set are input into the second GMM model, which outputs a non-speech recognition score for each audio frame; the scores of all frames in an audio clip are averaged to obtain the non-speech recognition score of that clip;
all parameters of the second GMM model are obtained by training on the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the second GMM model on audio clips containing only non-speech data, with all parameters estimated from the second training sample set by the expectation-maximization algorithm, reduces the training and classification time of the second GMM model and improves operation efficiency; the score of a sample on the non-speech model can be obtained more accurately, yielding a more accurate non-speech detection model and improving the detection accuracy of the entire speech detection model.
Further, the LSTM model is trained as follows:
the audio clips containing both speech and non-speech data are divided into frames, and the MFCCs of a preset dimension are extracted from each frame as the audio features; these feature vectors are arranged in chronological order to form a time series;
the time series is input into the LSTM model, which outputs the speech recognition score of the corresponding audio clip;
the LSTM model is trained with the Adam optimization algorithm until its parameters are optimal.
The advantage of this technical solution is that training the LSTM model on audio clips containing both speech and non-speech data with the Adam optimization algorithm lets the contextual information of a sample be fully exploited during recognition; the probability score of a sample belonging to speech is obtained more accurately, yielding an accurate speech detection model and improving the detection accuracy of the entire speech detection model.
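The patent does not give the LSTM internals; as an illustrative sketch, a single LSTM layer stepping over a sequence of 39-dim MFCC frames can be written in plain NumPy (the weights here are random placeholders, not trained parameters; in practice a framework with an Adam optimizer would learn them):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_score(frames, hidden=16):
    """Run one LSTM layer over a (T, 39) MFCC sequence and squash the
    final hidden state into a scalar speech score in (0, 1)."""
    T, d = frames.shape
    # One weight block per gate: input, forget, cell candidate, output.
    W = rng.normal(scale=0.1, size=(4, hidden, d + hidden))
    b = np.zeros((4, hidden))
    w_out = rng.normal(scale=0.1, size=hidden)

    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in frames:
        z = np.concatenate([x, h])
        i = sigmoid(W[0] @ z + b[0])   # input gate
        f = sigmoid(W[1] @ z + b[1])   # forget gate
        g = np.tanh(W[2] @ z + b[2])   # candidate cell state
        o = sigmoid(W[3] @ z + b[3])   # output gate
        c = f * c + i * g              # cell state carries long-term context
        h = o * np.tanh(c)
    return sigmoid(w_out @ h)          # clip-level speech score

frames = rng.normal(size=(8, 39))      # stand-in for ~8 MFCC frames of a clip
score = lstm_score(frames)
```

The cell state `c` is what lets the LSTM exploit context across frames, which is the property the paragraph above relies on.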
Further, the RNN model is trained as follows:
the recognition scores output by the trained first GMM model, second GMM model and LSTM model are composed into a three-dimensional vector, which serves as the vector characterization of the audio clip;
the vector characterizations of the current moment, the preceding moment and the following moment are composed into a time series, which is used as the input for training the RNN model; the output is the probability that the audio clip at the current moment belongs to speech;
the RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio clips at all moments belong to speech meets a preset accuracy requirement.
The advantage of this technical solution is that training the RNN model with the Adam optimization algorithm on clips containing both speech and non-speech data lets the contextual information of a sample be fully exploited, improving recognition accuracy. In addition, the input of the RNN model is a three-dimensional vector; this low-dimensional feature representation reduces the training and classification time of the model, improves operation efficiency, and suppresses the interference of redundant information, so the probability that a sample belongs to speech is obtained more accurately, yielding an accurate speech detection model and improving the detection accuracy of the entire speech detection model.
A second aspect of the present disclosure provides a speech detection device that effectively combines a GMM model, an LSTM model and an RNN model, so that the respective advantages of the three models are brought into full play and the overall classification and detection ability of the speech detection model is improved.
To achieve the above goal, the disclosure adopts the following technical scheme:
A speech detection device, comprising:
a speech detection model construction module, configured to construct a speech detection model in which a first GMM model, a second GMM model and an LSTM model are connected in parallel and then connected in series with an RNN model;
a speech detection model training module, configured to train the speech detection model as follows:
using speech data, non-speech data, and mixed speech/non-speech data to train the first GMM model, the second GMM model and the LSTM model respectively; each model outputs a recognition score, and the three scores form a three-dimensional vector that serves as the vector characterization of the audio clip;
composing the vector characterizations of each moment, its preceding moment and its following moment into a time series, and training the RNN model on these input sequences until the deviation of the average probability that the audio clips at all moments belong to speech meets a preset accuracy requirement;
an audio data test module, configured to test audio data as follows:
segmenting the test audio into several audio clips and feeding them one by one into the trained speech detection model, which outputs the probability that the clip at the corresponding moment belongs to speech.
Further, in the audio data test module, if the probability is greater than or equal to a set threshold, the audio clip at the corresponding moment is judged to be speech; otherwise it is judged to be non-speech.
The advantage of this technical solution is that comparing the speech probability of each audio clip against a set threshold to decide whether the clip at the corresponding moment belongs to speech makes the detection result more intuitive.
Further, in the speech detection model training module, the first GMM model is trained as follows:
the audio clips containing only speech data are divided into frames, and the MFCCs of a preset dimension are extracted from each frame as the audio features; each feature vector forms a training sample and is stored in a first training sample set;
the training samples in the first training sample set are input into the first GMM model, which outputs a speech recognition score for each audio frame; the scores of all frames in an audio clip are averaged to obtain the speech recognition score of that clip;
all parameters of the first GMM model are obtained by training on the first training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the first GMM model on audio clips containing only speech data, with all parameters estimated from the first training sample set by the expectation-maximization algorithm, reduces the training and classification time of the first GMM model and improves operation efficiency; the score of a sample on the speech model can be obtained more accurately, yielding a more accurate speech detection model and improving the detection accuracy of the entire speech detection model.
Further, in the speech detection model training module, the second GMM model is trained as follows:
the audio clips containing only non-speech data are divided into frames, and the MFCCs of a preset dimension are extracted from each frame as the audio features; each feature vector forms a training sample and is stored in a second training sample set;
the training samples in the second training sample set are input into the second GMM model, which outputs a non-speech recognition score for each audio frame; the scores of all frames in an audio clip are averaged to obtain the non-speech recognition score of that clip;
all parameters of the second GMM model are obtained by training on the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that training the second GMM model on audio clips containing only non-speech data, with all parameters estimated from the second training sample set by the expectation-maximization algorithm, reduces the training and classification time of the second GMM model and improves operation efficiency; the score of a sample on the non-speech model can be obtained more accurately, yielding a more accurate non-speech detection model and improving the detection accuracy of the entire speech detection model.
Further, in the speech detection model training module, the LSTM model is trained as follows:
the audio clips containing both speech and non-speech data are divided into frames, and the MFCCs of a preset dimension are extracted from each frame as the audio features; these feature vectors are arranged in chronological order to form a time series;
the time series is input into the LSTM model, which outputs the speech recognition score of the corresponding audio clip;
the LSTM model is trained with the Adam optimization algorithm until its parameters are optimal.
The advantage of this technical solution is that training the LSTM model on audio clips containing both speech and non-speech data with the Adam optimization algorithm lets the contextual information of a sample be fully exploited during recognition; the probability score of a sample belonging to speech is obtained more accurately, yielding an accurate speech detection model and improving the detection accuracy of the entire speech detection model.
Further, in the speech detection model training module, the RNN model is trained as follows:
the recognition scores output by the trained first GMM model, second GMM model and LSTM model are composed into a three-dimensional vector, which serves as the vector characterization of the audio clip;
the vector characterizations of the current moment, the preceding moment and the following moment are composed into a time series, which is used as the input for training the RNN model; the output is the probability that the audio clip at the current moment belongs to speech;
the RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio clips at all moments belong to speech meets a preset accuracy requirement.
The advantage of this technical solution is that training the RNN model with the Adam optimization algorithm on clips containing both speech and non-speech data lets the contextual information of a sample be fully exploited, improving recognition accuracy. In addition, the input of the RNN model is a three-dimensional vector; this low-dimensional feature representation reduces the training and classification time of the model, improves operation efficiency, and suppresses the interference of redundant information, so the probability that a sample belongs to speech is obtained more accurately, yielding an accurate speech detection model and improving the detection accuracy of the entire speech detection model.
A third aspect of the present disclosure provides a computer-readable storage medium.
A computer-readable storage medium has a computer program stored thereon; when the program is executed by a processor, the steps of the speech detection method described above are implemented.
A fourth aspect of the present disclosure provides a computer device.
A computer device includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps of the speech detection method described above are implemented.
The beneficial effects of the present disclosure are:
(1) Through the GMM models and the LSTM model, the input of the RNN model is converted from a traditional audio spectrum into a low-dimensional feature representation; the low-dimensional features reduce the amount of computation of the RNN model, shorten its training and classification time, and improve operation efficiency.
(2) The disclosure uses the classification scores of different models as the feature representation of an audio sample. This representation efficiently extracts the important information in the sample and suppresses the interference of redundant information, so the classification model can better recognize speech samples and detection accuracy is improved.
(3) The disclosure realizes an effective combination of a traditional classification model (GMM) with the LSTM and RNN models: the GMM models can model the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification. The speech detection model of the disclosure therefore brings the respective advantages of the GMM, LSTM and RNN models into full play and improves overall classification and detection performance.
(4) The speech detection method of the disclosure also achieves good detection performance at low signal-to-noise ratios, and is therefore robust to noise.
(5) The design idea of this embodiment is to combine a traditional speech detection model with deep neural network models. The traditional model in such a scheme is not limited to GMM, and the deep neural network models are not limited to LSTM and RNN; the scheme extends well and provides a useful reference for combining traditional speech detection models with deep neural networks.
Brief description of the drawings
The accompanying drawings, which constitute a part of the present disclosure, are used to provide a further understanding of the disclosure. The illustrative embodiments of the disclosure and their descriptions are used to explain the disclosure and do not constitute an improper limitation of it.
Fig. 1 is a flow chart of a speech detection method of an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of the structure of the speech detection model of the embodiment of the present disclosure.
Fig. 3 is a diagram of the process of testing audio data of the embodiment of the present disclosure.
Fig. 4 is a schematic structural diagram of a speech detection device of an embodiment of the present disclosure.
Specific embodiment
The disclosure is further described below with reference to the accompanying drawings and embodiments.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It should be noted that the terms used herein are merely for describing specific embodiments and are not intended to limit the illustrative embodiments of the disclosure. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
Explanation of terms:
GMM model: a Gaussian Mixture Model quantifies a phenomenon precisely with Gaussian probability density functions (normal distribution curves), decomposing it into several components each based on a Gaussian probability density function.
LSTM model: Long Short-Term Memory, a type of recurrent neural network suitable for processing and predicting important events with relatively long intervals and delays in a time series.
RNN model: a Recurrent Neural Network is an artificial neural network in which node connections form cycles. The internal state of such a network can exhibit dynamic temporal behavior. Unlike a feedforward neural network, an RNN can use its internal memory to process input sequences of arbitrary length, which makes it well suited to tasks such as unsegmented handwriting recognition and speech recognition.
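To make the recurrence concrete, here is a minimal Elman-style forward pass in NumPy over one (previous, current, next) window of score triples, the input shape used later in this disclosure (weights are random placeholders and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def rnn_speech_prob(window, hidden=50):
    """Forward pass of a single-layer Elman RNN over one (prev, cur, next)
    window of 3-d score vectors; outputs a speech probability in (0, 1)."""
    W_xh = rng.normal(scale=0.1, size=(hidden, 3))       # input-to-hidden
    W_hh = rng.normal(scale=0.1, size=(hidden, hidden))  # hidden-to-hidden
    w_out = rng.normal(scale=0.1, size=hidden)           # hidden-to-output
    h = np.zeros(hidden)
    for x in window:                 # recurrence: h_t depends on h_{t-1}
        h = np.tanh(W_xh @ x + W_hh @ h)
    return 1.0 / (1.0 + np.exp(-(w_out @ h)))  # sigmoid speech probability

window = np.array([[0.9, 0.1, 0.8],  # previous clip's score triple
                   [0.7, 0.2, 0.6],  # current clip
                   [0.2, 0.9, 0.3]]) # next clip
prob = rnn_speech_prob(window)
```

The `W_hh @ h` term is the internal memory referred to above: the state at each step carries information from the preceding inputs.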
Embodiment one
Fig. 1 gives the flow chart of the speech detection method of this embodiment.
As shown in Fig. 1, the speech detection method of this embodiment comprises:
S101: constructing the speech detection model.
The speech detection model is formed by connecting a first GMM model, a second GMM model and an LSTM model in parallel and then connecting them in series with an RNN model, as shown in Fig. 2.
In this embodiment, the number of Gaussian mixture components in the first GMM model is set to 5, and the number of Gaussian mixture components in the second GMM model is also set to 5.
The LSTM model comprises an input layer, a hidden part composed of 2 LSTM layers, and an output layer. The number of input layer nodes is set to 39 and the number of output layer neurons to 1; the class label of speech is "1" and that of non-speech is "0". A dropout layer is added after each LSTM layer, with the dropout parameter set to 0.2.
The number of input layer nodes of the RNN model is set to 3; 2 hidden layers are used, with 50 neurons in each hidden layer; a dropout layer is added after each hidden layer, with the dropout parameter set to 0.2; the number of output layer neurons is set to 1, the class label of speech being "1" and that of non-speech "0". The output layer outputs the probability that each audio clip belongs to speech.
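Under the stated sizes (input 3, two hidden layers of 50, output 1), the dense parameter count of such a stacked Elman RNN can be tallied as a quick sanity check. The weight layout (feedforward plus square recurrent matrix plus bias per hidden layer) is an assumption; the patent does not specify it:

```python
def rnn_param_count(n_in=3, hidden=(50, 50), n_out=1):
    """Tally dense parameters of the stacked Elman RNN described above:
    per hidden layer, input weights + recurrent weights + bias; then the
    output layer weights + bias."""
    total, prev = 0, n_in
    for h in hidden:
        total += prev * h + h * h + h   # input, recurrent, bias
        prev = h
    total += prev * n_out + n_out       # output layer
    return total

print(rnn_param_count())  # 7801
```

The point of the tally is scale: a few thousand parameters, far fewer than a network fed raw spectral features would need, which is the efficiency argument made in the summary above.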
It can be understood that in other embodiments the numbers of Gaussian mixture components in the first and second GMM models may be set to other values; those skilled in the art can choose them according to the actual situation, and details are not repeated here. Likewise, the number of LSTM layers in the LSTM model, as well as the number of nodes in each layer of the RNN model and the number of its hidden layers, may also be set to other values, which those skilled in the art can choose according to the actual situation; details are not repeated here.
S102: training the speech detection model.
In a specific implementation, the training process of step S102 is as follows:
S1021: using speech data, non-speech data, and mixed speech/non-speech data to train the first GMM model, the second GMM model and the LSTM model respectively; each model outputs a recognition score, and the three scores form a three-dimensional vector that serves as the vector characterization of the audio clip.
Specifically, in step S1021, the first GMM model is trained as follows:
S1021-11: the audio clips containing only speech data are divided into frames, the MFCCs of a preset dimension are extracted from each frame as the audio features, and each feature vector forms a training sample that is stored in the first training sample set.
For example: each piece of training data in the training set containing only speech data is divided into a series of non-overlapping audio clips 100 milliseconds long; each 100-millisecond clip is divided into frames, with the frame length set to 30 milliseconds and the frame shift to 10 milliseconds; after framing, 39-dimensional MFCC features are extracted from each audio frame, and each training speech sample is expressed by these 39-dimensional MFCC features.
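With these numbers (100 ms clip, 30 ms frame, 10 ms shift), the framing step can be sketched in NumPy. At an assumed 16 kHz sampling rate (the patent does not state one), each clip yields 8 frames:

```python
import numpy as np

def frame_clip(clip, sr=16000, frame_ms=30, shift_ms=10):
    """Split one audio clip into overlapping frames: 30 ms frame length
    and 10 ms frame shift, as in the example above."""
    frame_len = sr * frame_ms // 1000    # 480 samples at 16 kHz
    shift = sr * shift_ms // 1000        # 160 samples at 16 kHz
    n_frames = 1 + (len(clip) - frame_len) // shift
    return np.stack([clip[i * shift:i * shift + frame_len]
                     for i in range(n_frames)])

clip = np.zeros(16000 // 10)             # one 100 ms clip at 16 kHz
frames = frame_clip(clip)
print(frames.shape)  # (8, 480)
```

Each of the 8 frames would then be converted to its 39-dimensional MFCC vector by a standard extraction routine.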
Here MFCC stands for Mel-Frequency Cepstral Coefficient. The Mel frequency scale is based on the characteristics of human hearing and has a nonlinear correspondence with the Hz frequency scale; Mel-frequency cepstral coefficients are the spectral features computed from the Hz spectrum using this relationship.
It should be noted that existing methods can be used to extract the Mel-frequency cepstral coefficients, and those skilled in the art can make a specific choice according to the actual situation.
S1021-12: the training samples in the first training sample set are input into the first GMM model, which outputs the speech recognition score of each audio frame; the scores of all frames in an audio clip are averaged to obtain the speech recognition score of that clip.
The larger the speech recognition score, the greater the probability that the corresponding audio clip belongs to speech.
S1021-13: all parameters of the first GMM model are obtained by training on the first training sample set with the expectation-maximization algorithm.
The expectation-maximization (EM) algorithm is an optimization algorithm that performs maximum-likelihood estimation by iteration. It is usually used as a substitute for Newton's iteration method to estimate the parameters of probabilistic models that contain latent variables or missing data, and it can provide the posterior of the latent variables, i.e. the missing data, so it is also applied to missing-data problems.
In this embodiment, audio segments containing only speech data are used, and all parameters of the first GMM model are obtained by training on the samples in the first training sample set with the expectation-maximization algorithm. This reduces the training and classification time of the first GMM model and improves operational efficiency; the score of a sample on the speech model can be obtained more accurately, yielding a more accurate speech detection model and improving the detection accuracy of the entire speech detection model.
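As an illustrative sketch only (the disclosure does not name a library), the first-GMM steps above can be reproduced with scikit-learn's `GaussianMixture`, which is fitted internally by expectation-maximization. The random arrays stand in for real 39-dimensional MFCC frames, and the component count and covariance type are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for 39-dimensional MFCC frames extracted from speech-only audio.
speech_frames = rng.normal(size=(500, 39))

# Fit the "first GMM"; scikit-learn trains it by expectation-maximization.
speech_gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
speech_gmm.fit(speech_frames)

# Score one 8-frame segment: a per-frame log-likelihood, averaged over the segment.
segment = rng.normal(size=(8, 39))               # one 100 ms segment = 8 frames
frame_scores = speech_gmm.score_samples(segment)  # one score per frame
segment_score = frame_scores.mean()               # segment-level speech score
print(round(float(segment_score), 2))
```

A larger averaged score indicates that the segment is more likely to be speech, matching step S1021-12.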
Specifically, in step S1021, the process of training the second GMM model is as follows:
S1021-21: the audio segments containing only non-speech data are divided into frames; the mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, forming training samples that are stored in the second training sample set.
For example, each piece of training data in the training set containing only non-speech data is divided into a series of non-overlapping 100-millisecond audio segments. Each 100-millisecond audio segment is divided into frames, with the frame length set to 30 milliseconds and the frame shift set to 10 milliseconds. After framing, 39-dimensional mel-frequency cepstral coefficients are extracted from each audio frame, and each non-speech training sample is represented by these 39-dimensional coefficients.
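The segment and frame arithmetic above (100 ms segments, 30 ms frames, 10 ms shift) can be checked with a short sketch; the 16 kHz sampling rate is an assumption, as the text does not state one.

```python
import numpy as np

SR = 16000                        # assumed sampling rate (not specified in the text)
seg_len = int(0.100 * SR)         # 100 ms segment -> 1600 samples
frame_len = int(0.030 * SR)       # 30 ms frame    -> 480 samples
frame_shift = int(0.010 * SR)     # 10 ms shift    -> 160 samples

segment = np.zeros(seg_len)       # stand-in for one audio segment

# Slice the segment into overlapping frames.
starts = range(0, seg_len - frame_len + 1, frame_shift)
frames = np.stack([segment[s:s + frame_len] for s in starts])
print(frames.shape)               # (8, 480): eight 30 ms frames per 100 ms segment
```

With these parameters every 100-millisecond segment yields exactly eight frames, consistent with the eight-frame time series used later as the LSTM input.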
It should be noted that existing methods can be used to extract the mel-frequency cepstral coefficients, and those skilled in the art can make a specific choice according to the actual situation.
S1021-22: the training samples in the second training sample set are input into the second GMM model, which outputs a non-speech recognition score for each audio frame; the non-speech recognition scores of all frames in an audio segment are averaged to obtain the non-speech recognition score of the corresponding audio segment.
Here, the larger the non-speech recognition score, the lower the probability that the corresponding audio segment is speech.
S1021-23: all parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that audio segments containing only non-speech data are used and all parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm. This reduces the training and classification time of the second GMM model, improves operational efficiency, and allows the score indicating that a sample is non-speech to be obtained more accurately, yielding a more accurate non-speech detection model and improving the detection accuracy of the entire speech detection model.
Specifically, in step S1021, the process of training the LSTM model is as follows:
S1021-31: the audio segments containing speech data and non-speech data are divided into frames; the mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, and these audio features are arranged in chronological order to form a time series.
Specifically, each piece of training data in the training set containing speech data and non-speech data is divided into a series of non-overlapping 100-millisecond audio segments. Each 100-millisecond audio segment is divided into frames, with the frame length set to 30 milliseconds and the frame shift set to 10 milliseconds, so that each 100-millisecond segment is divided into eight 30-millisecond audio frames, which constitute one time series. After framing, 39-dimensional MFCC features are extracted from each audio frame. The time series formed by framing each 100-millisecond audio segment is used as the input for training the LSTM model.
S1021-32: the above time series is input into the LSTM model, which outputs the recognition score of the corresponding audio segment.
Here, the larger the recognition score, the higher the probability that the corresponding audio segment is speech.
S1021-33: the LSTM model is trained with the Adam optimization algorithm until the parameters of the LSTM model are optimal.
The LSTM model is initialized with the Glorot uniform initialization method; the loss function is the cross-entropy loss; training uses the Adam optimization algorithm, with the learning rate set to 0.01, the batch_size parameter set to 128, and the epoch parameter set to 20. The LSTM model is configured to produce an output only after reading the last frame of the input time series.
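The disclosure does not name a deep-learning framework; as a sketch under that caveat, PyTorch is used below purely for illustration. The model reads an eight-frame sequence of 39-dimensional MFCC vectors and emits a single score from the last time step; the hidden size of 64 is an assumption, and the Glorot initialization and the full 20-epoch loop are omitted for brevity.

```python
import torch
import torch.nn as nn

class SegmentLSTM(nn.Module):
    def __init__(self, n_mfcc=39, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, 8 frames, 39 MFCCs)
        out, _ = self.lstm(x)
        last = out[:, -1, :]               # output only at the last input frame
        return torch.sigmoid(self.head(last)).squeeze(-1)

model = SegmentLSTM()
opt = torch.optim.Adam(model.parameters(), lr=0.01)  # Adam, learning rate 0.01
loss_fn = nn.BCELoss()                               # cross-entropy for binary labels

x = torch.randn(128, 8, 39)                # one batch of 128 segments (batch_size 128)
y = torch.randint(0, 2, (128,)).float()    # 1 = speech, 0 = non-speech
p = model(x)                               # per-segment recognition score in (0, 1)
loss = loss_fn(p, y)
loss.backward()
opt.step()                                 # one Adam update step
print(p.shape)                             # torch.Size([128])
```

A full training run would repeat the forward/backward/step cycle for the stated 20 epochs over the mixed speech and non-speech training set.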
Adam is a first-order optimization method that can replace the traditional stochastic gradient descent procedure; it iteratively updates neural network weights based on the training data. Its advantages are that it is straightforward to implement, computationally efficient, and requires little memory; its gradient updates are invariant to diagonal rescaling; it is suitable for optimization problems involving large-scale data and parameters, for non-stationary objectives, and for problems with very noisy or sparse gradients; and its hyperparameters are intuitively interpretable and generally require minimal tuning.
The advantage of this technical solution is that audio segments containing both speech and non-speech data are used and the LSTM model is trained with the Adam optimization algorithm; the contextual information of the samples can be fully exploited for recognition, the probability score that a sample is speech can be obtained more accurately, a more accurate speech detection model is obtained, and the detection accuracy of the entire speech detection model is improved.
S1022: the vector representations of the audio segments at each moment, the moment before it, and the moment after it are formed into a time series, which is used as the input to train the RNN model until the deviation of the average probability that the audio segments at all moments are speech meets a preset precision requirement.
Assume there are N training samples in the training set, and let x_i (i = 1, ..., N) denote the i-th training sample. Assume x_i contains M_i audio segments, and let x_ij (j = 1, ..., M_i) denote the j-th audio segment of x_i; x_ij takes the form of a three-dimensional vector composed of the corresponding recognition scores output by the first GMM model, the second GMM model, and the LSTM model. When representing an audio segment, in order to also incorporate its contextual information, x_ij is combined with the audio segment x_i(j-1) of the previous moment and the audio segment x_i(j+1) of the following moment to form a time series [x_i(j-1), x_ij, x_i(j+1)] (j = 2, ..., M_i - 1), which is used as the input for training the RNN network.
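The construction of the RNN input [x_i(j-1), x_ij, x_i(j+1)] can be sketched as follows; random numbers stand in for the actual GMM and LSTM scores, and the segment count M is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

M = 10                                  # number of audio segments in one training sample
# Each segment is a 3-dim vector of (speech-GMM, non-speech-GMM, LSTM) scores.
x = rng.normal(size=(M, 3))

# For each interior segment j, stack previous/current/next into a 3-step sequence.
sequences = np.stack([x[j - 1:j + 2] for j in range(1, M - 1)])
print(sequences.shape)                  # (M - 2) sequences, each 3 steps of 3 features
```

The first and last segments have no neighbor on one side, which is why the index j runs only over 2, ..., M_i - 1 in the text.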
Specifically, the process of training the RNN model is as follows:
The recognition scores output respectively by the trained first GMM model, second GMM model, and LSTM model are formed into a three-dimensional vector, which serves as the vector representation of an audio segment.
The vector representations of the audio segments at the current moment, the previous moment, and the following moment are formed into a time series, which is used as the input to train the RNN model; the output is the probability that the audio segment at the current moment is speech.
The RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all moments are speech meets the preset precision requirement.
The number of input-layer nodes of the RNN network is set to 3; the number of hidden layers is set to 2, with the number of neurons in each hidden layer set to 50; a dropout layer with parameter 0.2 is added after each hidden layer; and the number of output-layer neurons is set to 1. If the class label of speech is "1" and the class label of non-speech is "0", the output layer produces the probability that each audio segment is speech.
The RNN network is initialized with the Glorot uniform initialization method; the loss function is the cross-entropy loss; training uses the Adam optimization algorithm, with the learning rate set to 0.01, the batch_size parameter set to 128, and the epoch parameter set to 20.
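Under the same illustrative PyTorch assumption as before (the disclosure names no framework), the RNN configuration described above (3 inputs, two hidden layers of 50 units, dropout 0.2, one sigmoid output, Glorot uniform initialization) might look like:

```python
import torch
import torch.nn as nn

class ScoreRNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Two stacked recurrent layers of 50 units; dropout 0.2 between layers.
        self.rnn = nn.RNN(input_size=3, hidden_size=50, num_layers=2,
                          dropout=0.2, batch_first=True)
        self.drop = nn.Dropout(0.2)      # dropout after the last hidden layer
        self.head = nn.Linear(50, 1)     # single output neuron

    def forward(self, x):                # x: (batch, 3 time steps, 3 scores)
        out, _ = self.rnn(x)
        last = self.drop(out[:, -1, :])
        # Sigmoid output: probability that the middle segment is speech (label "1").
        return torch.sigmoid(self.head(last)).squeeze(-1)

def init_glorot(module):
    # Glorot (Xavier) uniform initialization for all weight matrices.
    for _, p in module.named_parameters():
        if p.dim() >= 2:
            nn.init.xavier_uniform_(p)

net = ScoreRNN()
init_glorot(net)
probs = net(torch.randn(4, 3, 3))        # a batch of four 3-step score sequences
print(probs.shape)                       # torch.Size([4])
```

Training would again use Adam with learning rate 0.01, batch size 128, and 20 epochs, with binary cross-entropy against the speech/non-speech labels.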
S103: testing audio data.
In a specific implementation, as shown in Fig. 3, the process of testing audio data in step S103 is as follows:
The audio data to be tested is divided into several audio segments, and the audio segments are then input one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding moment is speech.
Specifically, if the probability is greater than or equal to a set threshold, the audio segment at the corresponding moment is judged to be speech; otherwise, the audio segment at the corresponding moment is judged not to be speech.
Assume K audio segments are obtained after dividing the audio data to be tested, and let y_k denote the k-th audio segment, k = 1, ..., K. For each audio segment y_k, the recognition scores obtained from the first GMM model, the second GMM model, and the LSTM model are formed into a 3-dimensional vector, and y_k is represented by this 3-dimensional vector.
y_k is combined with the audio segment y_(k-1) of the previous moment and the audio segment y_(k+1) of the following moment to form a time series [y_(k-1), y_k, y_(k+1)] (k = 2, ..., K - 1); this time series is used as the input of the RNN network to obtain the probability that y_k is speech. With the threshold set to 0.5, y_k is judged to be speech if the probability is greater than 0.5 and non-speech if the probability is less than 0.5.
It should be noted that those skilled in the art can set the size of the threshold on the probability that an audio segment is speech according to the required accuracy.
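The thresholding rule above is straightforward; a minimal sketch with a configurable threshold (0.5 as in the example) follows. The function name and the sample probabilities are illustrative only.

```python
def label_segments(probs, threshold=0.5):
    """Map per-segment speech probabilities to 'speech'/'non-speech' labels."""
    return ["speech" if p >= threshold else "non-speech" for p in probs]

print(label_segments([0.91, 0.12, 0.50, 0.49]))
# ['speech', 'non-speech', 'speech', 'non-speech']
```

Raising the threshold trades missed speech for fewer false detections, which is the accuracy trade-off the text leaves to the practitioner.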
In this embodiment, the input to the RNN model is converted from a traditional audio spectrum into a low-dimensional feature representation by the GMM models and the LSTM model. The low-dimensional features reduce the amount of computation of the RNN model, shorten its training and classification time, and improve operational efficiency.
In this embodiment, the classification scores of the different models serve as the feature representation of an audio sample. This representation effectively extracts the important information in the audio sample and reduces the interference of redundant information, so that the classification model can better identify speech samples and the detection accuracy is improved.
This embodiment realizes an effective combination of a traditional classification model (the GMM) with the LSTM and RNN models: the GMM can model the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification. The speech detection model of the present disclosure can give full play to the respective advantages of the GMM, LSTM, and RNN models, improving the overall classification and detection performance.
This embodiment can also achieve good speech detection performance at low signal-to-noise ratios, and thus has good robustness to noise.
The design idea of this embodiment is to combine a traditional speech detection model with deep neural network models. The traditional speech detection model in the combined scheme is not limited to the GMM, and the deep neural network models are not limited to the LSTM and RNN; the combined scheme therefore has good extensibility and provides a useful reference for combining traditional speech detection models with deep neural network models.
Embodiment two
Fig. 4 is a structural schematic diagram of a speech detection device provided by an embodiment of the present disclosure.
As shown in Fig. 4, the speech detection device of this embodiment comprises:
(1) A speech detection model construction module, configured to construct a speech detection model, wherein the speech detection model is formed by connecting the first GMM model, the second GMM model, and the LSTM model in parallel and then in series with the RNN model;
(2) A speech detection model training module, configured to train the speech detection model; the process is as follows:
Speech data, non-speech data, and mixed speech and non-speech data are used to respectively train the first GMM model, the second GMM model, and the LSTM model; the corresponding recognition scores are output and formed into a three-dimensional vector, which serves as the vector representation of an audio segment.
The vector representations of the audio segments at each moment, the moment before it, and the moment after it are formed into a time series, which is used as the input to train the RNN model until the deviation of the average probability that the audio segments at all moments are speech meets a preset precision requirement.
Specifically, in the speech detection model training module, the process of training the first GMM model is as follows:
The audio segments containing only speech data are divided into frames; the mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, forming training samples that are stored in the first training sample set.
The training samples in the first training sample set are input into the first GMM model, which outputs a speech recognition score for each audio frame; the speech recognition scores of all frames in an audio segment are averaged to obtain the speech recognition score of the corresponding audio segment.
All parameters of the first GMM model are obtained by training on the samples in the first training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that audio segments containing only speech data are used and all parameters of the first GMM model are obtained by training on the samples in the first training sample set with the expectation-maximization algorithm. This reduces the training and classification time of the first GMM model, improves operational efficiency, and allows the score of a sample on the speech model to be obtained more accurately, yielding a more accurate speech detection model and improving the detection accuracy of the entire speech detection model.
In the speech detection model training module, the process of training the second GMM model is as follows:
The audio segments containing only non-speech data are divided into frames; the mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, forming training samples that are stored in the second training sample set.
The training samples in the second training sample set are input into the second GMM model, which outputs a non-speech recognition score for each audio frame; the non-speech recognition scores of all frames in an audio segment are averaged to obtain the non-speech recognition score of the corresponding audio segment.
All parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm.
The advantage of this technical solution is that audio segments containing only non-speech data are used and all parameters of the second GMM model are obtained by training on the samples in the second training sample set with the expectation-maximization algorithm. This reduces the training and classification time of the second GMM model, improves operational efficiency, and allows the score of a sample on the non-speech model to be obtained more accurately, yielding a more accurate non-speech detection model and improving the detection accuracy of the entire speech detection model.
In the speech detection model training module, the process of training the LSTM model is as follows:
The audio segments containing speech data and non-speech data are divided into frames; the mel-frequency cepstral coefficients of a preset dimension are extracted from each audio frame as audio features, and these audio features are arranged in chronological order to form a time series.
The time series is input into the LSTM model, which outputs the recognition score of the corresponding audio segment.
The LSTM model is trained with the Adam optimization algorithm until the parameters of the LSTM model are optimal.
The advantage of this technical solution is that audio segments containing both speech and non-speech data are used and the LSTM model is trained with the Adam optimization algorithm; the contextual information of the samples can be fully exploited for recognition, the probability score that a sample is speech can be obtained more accurately, a more accurate speech detection model is obtained, and the detection accuracy of the entire speech detection model is improved.
In the speech detection model training module, the process of training the RNN model is as follows:
The recognition scores output respectively by the trained first GMM model, second GMM model, and LSTM model are formed into a three-dimensional vector, which serves as the vector representation of an audio segment.
The vector representations of the audio segments at the current moment, the previous moment, and the following moment are formed into a time series, which is used as the input to train the RNN model; the output is the probability that the audio segment at the current moment is speech.
The RNN model is trained with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all moments are speech meets the preset precision requirement.
The advantage of this technical solution is that audio segments containing both speech and non-speech data are used and the RNN model is trained with the Adam optimization algorithm; the contextual information of the samples can be fully exploited for recognition, improving recognition accuracy. In addition, the input to the RNN model is a three-dimensional vector; this low-dimensional feature representation reduces the training and classification time of the model, improves operational efficiency, and reduces the interference of redundant information, so that the probability that a sample is speech can be obtained more accurately, a more accurate speech detection model is obtained, and the detection accuracy of the entire speech detection model is improved.
(3) An audio data test module, configured to test audio data; the process is as follows:
The audio data to be tested is divided into several audio segments, and the audio segments are then input one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding moment is speech.
Specifically, in the audio data test module, if the probability is greater than or equal to a set threshold, the audio segment at the corresponding moment is judged to be speech; otherwise, the audio segment at the corresponding moment is judged not to be speech.
The advantage of this technical solution is that, by comparing the probability that an audio segment is speech with a set threshold, whether the audio segment at the corresponding moment is speech can be judged, making the detection result more intuitive.
In this embodiment, the input to the RNN model is converted from a traditional audio spectrum into a low-dimensional feature representation by the GMM models and the LSTM model. The low-dimensional features reduce the amount of computation of the RNN model, shorten its training and classification time, and improve operational efficiency.
In this embodiment, the classification scores of the different models serve as the feature representation of an audio sample. This representation effectively extracts the important information in the audio sample and reduces the interference of redundant information, so that the classification model can better identify speech samples and the detection accuracy is improved.
This embodiment realizes an effective combination of a traditional classification model (the GMM) with the LSTM and RNN models: the GMM can model the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification. The speech detection model of the present disclosure can give full play to the respective advantages of the GMM, LSTM, and RNN models, improving the overall classification and detection performance.
This embodiment can also achieve good speech detection performance at low signal-to-noise ratios, and thus has good robustness to noise.
Embodiment three
A third aspect of the present disclosure provides a computer-readable storage medium.
A computer-readable storage medium has a computer program stored thereon; when the program is executed by a processor, the steps in the speech detection method shown in Fig. 1 are implemented.
In this embodiment, the input to the RNN model is converted from a traditional audio spectrum into a low-dimensional feature representation by the GMM models and the LSTM model. The low-dimensional features reduce the amount of computation of the RNN model, shorten its training and classification time, and improve operational efficiency.
In this embodiment, the classification scores of the different models serve as the feature representation of an audio sample. This representation effectively extracts the important information in the audio sample and reduces the interference of redundant information, so that the classification model can better identify speech samples and the detection accuracy is improved.
This embodiment realizes an effective combination of a traditional classification model (the GMM) with the LSTM and RNN models: the GMM can model the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification. The speech detection model of the present disclosure can give full play to the respective advantages of the GMM, LSTM, and RNN models, improving the overall classification and detection performance.
This embodiment can also achieve good speech detection performance at low signal-to-noise ratios, and thus has good robustness to noise.
Embodiment four
A fourth aspect of the present disclosure provides a computer device.
A computer device comprises a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps in the speech detection method shown in Fig. 1 are implemented.
In this embodiment, the input to the RNN model is converted from a traditional audio spectrum into a low-dimensional feature representation by the GMM models and the LSTM model. The low-dimensional features reduce the amount of computation of the RNN model, shorten its training and classification time, and improve operational efficiency.
In this embodiment, the classification scores of the different models serve as the feature representation of an audio sample. This representation effectively extracts the important information in the audio sample and reduces the interference of redundant information, so that the classification model can better identify speech samples and the detection accuracy is improved.
This embodiment realizes an effective combination of a traditional classification model (the GMM) with the LSTM and RNN models: the GMM can model the feature structure of audio samples well, while the LSTM and RNN models can effectively exploit the contextual information of audio samples for classification. The speech detection model of the present disclosure can give full play to the respective advantages of the GMM, LSTM, and RNN models, improving the overall classification and detection performance.
This embodiment can also achieve good speech detection performance at low signal-to-noise ratios, and thus has good robustness to noise.
It should be understood by those skilled in the art that embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program; the program may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The foregoing is merely a preferred embodiment of the present disclosure and is not intended to limit the disclosure; for those skilled in the art, the disclosure may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present disclosure shall be included within the protection scope of the present disclosure.
Claims (10)
1. A speech detection method, characterized by comprising:
constructing a speech detection model, the speech detection model being formed by connecting a first GMM model, a second GMM model, and an LSTM model in parallel and then in series with an RNN model;
training the speech detection model, the process being as follows:
using speech data, non-speech data, and mixed speech and non-speech data to respectively train the first GMM model, the second GMM model, and the LSTM model, outputting the corresponding recognition scores, and forming them into a three-dimensional vector that serves as the vector representation of an audio segment;
forming the vector representations of the audio segments at each moment, the moment before it, and the moment after it into a time series, and training the RNN model with the time series as input until the deviation of the average probability that the audio segments at all moments are speech meets a preset precision requirement; and
testing audio data, the process being as follows:
dividing the audio data to be tested into several audio segments, inputting the audio segments one by one into the trained speech detection model, and obtaining the probability that the audio segment at the corresponding moment is speech.
2. The speech detection method according to claim 1, characterized in that, after the probability that an audio segment to be tested is speech is obtained, if the probability is greater than or equal to a set threshold, the audio segment at the corresponding moment is judged to be speech; otherwise, the audio segment at the corresponding moment is judged not to be speech.
3. The speech detection method according to claim 1, characterized in that the process of training the first GMM model is as follows:
dividing the audio segments containing only speech data into frames, extracting the mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a first training sample set;
inputting the training samples in the first training sample set into the first GMM model, outputting a speech recognition score for each audio frame, and averaging the speech recognition scores of all frames in an audio segment to obtain the speech recognition score of the corresponding audio segment; and
obtaining all parameters of the first GMM model by training on the samples in the first training sample set with an expectation-maximization algorithm;
or the process of training the second GMM model is as follows:
dividing the audio segments containing only non-speech data into frames, extracting the mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a second training sample set;
inputting the training samples in the second training sample set into the second GMM model, outputting a non-speech recognition score for each audio frame, and averaging the non-speech recognition scores of all frames in an audio segment to obtain the non-speech recognition score of the corresponding audio segment; and
obtaining all parameters of the second GMM model by training on the samples in the second training sample set with the expectation-maximization algorithm.
4. The speech detection method according to claim 1, characterized in that the process of training the LSTM model is as follows:
dividing the audio segments containing speech data and non-speech data into frames, extracting the mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and arranging these audio features in chronological order to form a time series;
inputting the time series into the LSTM model, and outputting the recognition score of the corresponding audio segment; and
training the LSTM model with the Adam optimization algorithm until the parameters of the LSTM model are optimal.
5. The speech detection method according to claim 1, characterized in that the process of training the RNN model is as follows:
forming a three-dimensional vector from the recognition scores output respectively by the trained first GMM model, second GMM model and LSTM model, the vector serving as the representation of an audio segment;
forming a time series from the vector representations of the audio segments at the current moment, the previous moment and the next moment, and using it as the input quantity to train the RNN model, which outputs the probability that the audio segment at the current moment is speech;
training the RNN model with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all moments are speech meets a preset accuracy requirement.
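The fusion stage of claim 5 can be sketched as follows: the three model scores at each moment form a 3-dimensional vector, and the vectors of the previous, current and next moments form the input sequence of a simple RNN whose output is the probability that the current segment is speech. The weights below are illustrative random values, not trained ones, and the score values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_speech_prob(window, n_hidden=4):
    """window: (3, 3) array of [previous, current, next] score vectors."""
    W_in = rng.normal(scale=0.5, size=(n_hidden, 3))
    W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
    w_out = rng.normal(scale=0.5, size=n_hidden)
    h = np.zeros(n_hidden)
    for v in window:                       # previous, current, next moment
        h = np.tanh(W_in @ v + W_h @ h)    # simple (Elman) RNN update
    return float(1.0 / (1.0 + np.exp(-(w_out @ h))))  # speech probability

# Per-moment scores from (speech GMM, non-speech GMM, LSTM), stacked in time.
scores = np.array([[0.90, 0.10, 0.80],   # previous moment
                   [0.80, 0.20, 0.90],   # current moment
                   [0.70, 0.30, 0.85]])  # next moment
p = rnn_speech_prob(scores)
print(0.0 < p < 1.0)  # True
```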
6. A speech detection device, characterized in that it comprises:
a speech detection model construction module, configured to construct a speech detection model in which a first GMM model, a second GMM model and an LSTM model are connected in parallel and then connected in series with an RNN model;
a speech detection model training module, configured to train the speech detection model as follows:
training the first GMM model, the second GMM model and the LSTM model with speech data, non-speech data, and mixed speech and non-speech data respectively, outputting the corresponding recognition scores, and forming them into a three-dimensional vector that serves as the representation of an audio segment;
forming a time series from the vector representations of the audio segments at each moment, its previous moment and its next moment, and using it as the input quantity to train the RNN model until the deviation of the average probability that the audio segments at all moments are speech meets a preset accuracy requirement;
an audio data test module, configured to test audio data as follows:
segmenting the test audio data into several audio segments, then inputting the audio segments one by one into the trained speech detection model to obtain the probability that the audio segment at the corresponding moment is speech.
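The segmentation step of the test module in claim 6 amounts to slicing the test audio into fixed-length segments before feeding them to the model one by one. A minimal sketch, where the sample array and segment length are made-up stand-ins (the claim does not fix a segment length):

```python
import numpy as np

# Stand-in for a decoded audio sample array; a real system would load
# PCM samples from the test audio file.
samples = np.arange(10)
seg_len = 4  # assumed segment length in samples (illustrative only)
segments = [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
print(len(segments))               # 3
print([len(s) for s in segments])  # [4, 4, 2]
```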
7. The speech detection device according to claim 6, characterized in that, in the audio data test module, if the probability is greater than or equal to a given threshold, the audio segment at the corresponding moment is judged to be speech; otherwise, the audio segment at the corresponding moment is judged not to be speech.
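The decision rule of claim 7 is a simple threshold test on each segment's speech probability. A sketch, where the 0.5 threshold and the probability values are assumed example numbers (the claim only requires "a given threshold"):

```python
probs = [0.92, 0.40, 0.65, 0.08]          # per-segment speech probabilities
THRESHOLD = 0.5                           # assumed example threshold
labels = [p >= THRESHOLD for p in probs]  # True = speech, False = non-speech
print(labels)  # [True, False, True, False]
```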
8. The speech detection device according to claim 6, characterized in that, in the speech detection model training module, the process of training the first GMM model is as follows:
dividing the audio segments containing only speech data into frames, extracting mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a first training sample set;
inputting the training samples in the first training sample set into the first GMM model, outputting a speech recognition score for each audio frame, and averaging the speech recognition scores of all frames in an audio segment to obtain the speech recognition score of the corresponding audio segment;
obtaining all parameters of the first GMM model by training on the training samples in the first training sample set with an expectation-maximization algorithm;
Or
in the speech detection model training module, the process of training the second GMM model is as follows:
dividing the audio segments containing only non-speech data into frames, extracting mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and forming training samples that are stored in a second training sample set;
inputting the training samples in the second training sample set into the second GMM model, outputting a non-speech recognition score for each audio frame, and averaging the non-speech recognition scores of all frames in an audio segment to obtain the non-speech recognition score of the corresponding audio segment;
obtaining all parameters of the second GMM model by training on the training samples in the second training sample set with an expectation-maximization algorithm;
Or
in the speech detection model training module, the process of training the LSTM model is as follows:
dividing the audio segments containing both speech data and non-speech data into frames, extracting mel-frequency cepstral coefficients of a preset dimension from each audio frame as audio features, and arranging these audio features in chronological order to form a time series;
inputting the above time series into the LSTM model, which outputs the recognition score of the corresponding audio segment;
training the LSTM model with the Adam optimization algorithm until the parameters of the LSTM model are optimal;
Or
in the speech detection model training module, the process of training the RNN model is as follows:
forming a three-dimensional vector from the recognition scores output respectively by the trained first GMM model, second GMM model and LSTM model, the vector serving as the representation of an audio segment;
forming a time series from the vector representations of the audio segments at the current moment, the previous moment and the next moment, and using it as the input quantity to train the RNN model, which outputs the probability that the audio segment at the current moment is speech;
training the RNN model with the Adam optimization algorithm until the deviation of the average probability that the audio segments at all moments are speech meets a preset accuracy requirement.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the speech detection method according to any one of claims 1 to 5.
10. A computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the program, implements the steps of the speech detection method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910594785.2A CN110349597B (en) | 2019-07-03 | 2019-07-03 | Voice detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110349597A true CN110349597A (en) | 2019-10-18 |
CN110349597B CN110349597B (en) | 2021-06-25 |
Family
ID=68177773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910594785.2A Active CN110349597B (en) | 2019-07-03 | 2019-07-03 | Voice detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110349597B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060111900A1 (en) * | 2004-11-25 | 2006-05-25 | Lg Electronics Inc. | Speech distinction method |
CN101548313A (en) * | 2006-11-16 | 2009-09-30 | 国际商业机器公司 | Voice activity detection system and method |
CN103226948A (en) * | 2013-04-22 | 2013-07-31 | 山东师范大学 | Audio scene recognition method based on acoustic events |
CN103646649A (en) * | 2013-12-30 | 2014-03-19 | 中国科学院自动化研究所 | High-efficiency voice detecting method |
US20150228277A1 (en) * | 2014-02-11 | 2015-08-13 | Malaspina Labs (Barbados), Inc. | Voiced Sound Pattern Detection |
US20170069313A1 (en) * | 2015-09-06 | 2017-03-09 | International Business Machines Corporation | Covariance matrix estimation with structural-based priors for speech processing |
US20170372725A1 (en) * | 2016-06-28 | 2017-12-28 | Pindrop Security, Inc. | System and method for cluster-based audio event detection |
US20180166067A1 (en) * | 2016-12-14 | 2018-06-14 | International Business Machines Corporation | Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier |
CN108305616A (en) * | 2018-01-16 | 2018-07-20 | 国家计算机网络与信息安全管理中心 | A kind of audio scene recognition method and device based on long feature extraction in short-term |
CN108492820A (en) * | 2018-03-20 | 2018-09-04 | 华南理工大学 | Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model |
CN108597539A (en) * | 2018-02-09 | 2018-09-28 | 桂林电子科技大学 | Speech-emotion recognition method based on parameter migration and sound spectrograph |
CN108831508A (en) * | 2018-06-13 | 2018-11-16 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, device and equipment |
CN109192210A (en) * | 2018-10-25 | 2019-01-11 | 腾讯科技(深圳)有限公司 | A kind of method of speech recognition, the method and device for waking up word detection |
CN109697982A (en) * | 2019-02-01 | 2019-04-30 | 北京清帆科技有限公司 | A kind of speaker speech recognition system in instruction scene |
Non-Patent Citations (7)
Title |
---|
EYBEN, F. , ET AL.: "Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies", 《ICASSP》 * |
GOSZTOLYA, G. , ET AL.: "DNN-Based Feature Extraction and Classifier Combination for Child-Directed Speech, Cold and Snoring Identification", 《INTERSPEECH》 * |
LENG, Y. , ET AL.: "Employing unlabeled data to improve the classification performance of SVM, and its application in audio event classification", 《KNOWLEDGE-BASED SYSTEMS》 * |
SOO HYUN BAE ET AL.: "Acoustic Scene Classification Using Parallel Combination of LSTM and CNN", 《DETECTION AND CLASSIFICATION OF ACOUSTIC SCENES AND EVENTS 2016》 *
YAN LENG ET AL.: "Classification of Overlapped Audio Events Based on AT, PLSA, and the Combination of Them", 《RADIO ENGINEERING》 * |
LIU, W., ET AL.: "Research status and progress of deep-learning-based speech separation techniques", 《ACTA AUTOMATICA SINICA》 *
SHEN, L., ET AL.: "Automatic tone recognition method for short Chinese speech based on fused features", 《TECHNICAL ACOUSTICS》 *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110827798A (en) * | 2019-11-12 | 2020-02-21 | 广州欢聊网络科技有限公司 | Audio signal processing method and device |
WO2021136029A1 (en) * | 2019-12-31 | 2021-07-08 | 百果园技术(新加坡)有限公司 | Training method and device for re-scoring model and method and device for speech recognition |
CN111341351A (en) * | 2020-02-25 | 2020-06-26 | 厦门亿联网络技术股份有限公司 | Voice activity detection method and device based on self-attention mechanism and storage medium |
CN111444379A (en) * | 2020-03-30 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Audio feature vector generation method and audio segment representation model training method |
CN111444379B (en) * | 2020-03-30 | 2023-08-08 | 腾讯科技(深圳)有限公司 | Audio feature vector generation method and audio fragment representation model training method |
CN112151072A (en) * | 2020-08-21 | 2020-12-29 | 北京搜狗科技发展有限公司 | Voice processing method, apparatus and medium |
CN112270933A (en) * | 2020-11-12 | 2021-01-26 | 北京猿力未来科技有限公司 | Audio identification method and device |
WO2022100691A1 (en) * | 2020-11-12 | 2022-05-19 | 北京猿力未来科技有限公司 | Audio recognition method and device |
CN112270933B (en) * | 2020-11-12 | 2024-03-12 | 北京猿力未来科技有限公司 | Audio identification method and device |
CN112885350A (en) * | 2021-02-25 | 2021-06-01 | 北京百度网讯科技有限公司 | Control method and device of network conference, electronic equipment and storage medium |
CN113724734A (en) * | 2021-08-31 | 2021-11-30 | 上海师范大学 | Sound event detection method and device, storage medium and electronic device |
CN113724734B (en) * | 2021-08-31 | 2023-07-25 | 上海师范大学 | Sound event detection method and device, storage medium and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN110349597B (en) | 2021-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110349597A (en) | A kind of speech detection method and device | |
Jiang et al. | Improving transformer-based speech recognition using unsupervised pre-training | |
JP6933264B2 (en) | Label generators, model learning devices, emotion recognition devices, their methods, programs, and recording media | |
CN104143327B (en) | A kind of acoustic training model method and apparatus | |
CN108831445A (en) | Sichuan dialect recognition methods, acoustic training model method, device and equipment | |
Tong et al. | A comparative study of robustness of deep learning approaches for VAD | |
CN108711421A (en) | A kind of voice recognition acoustic model method for building up and device and electronic equipment | |
CN108346436A (en) | Speech emotional detection method, device, computer equipment and storage medium | |
Yi et al. | Singing voice synthesis using deep autoregressive neural networks for acoustic modeling | |
Dinkel et al. | Voice activity detection in the wild via weakly supervised sound event detection | |
CN111899766B (en) | Speech emotion recognition method based on optimization fusion of depth features and acoustic features | |
CN111599339B (en) | Speech splicing synthesis method, system, equipment and medium with high naturalness | |
CN114330551A (en) | Multi-modal emotion analysis method based on multi-task learning and attention layer fusion | |
CN106448660B (en) | It is a kind of introduce big data analysis natural language smeared out boundary determine method | |
Jiang et al. | Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit. | |
JP3920749B2 (en) | Acoustic model creation method for speech recognition, apparatus thereof, program thereof and recording medium thereof, speech recognition apparatus using acoustic model | |
Yu et al. | Language Recognition Based on Unsupervised Pretrained Models. | |
Liu et al. | Hierarchical component-attention based speaker turn embedding for emotion recognition | |
Zhou et al. | Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis. | |
CN102237082B (en) | Self-adaption method of speech recognition system | |
Xu | Intelligent automobile auxiliary propagation system based on speech recognition and AI driven feature extraction techniques | |
Reshma et al. | A survey on speech emotion recognition | |
Harrag et al. | GA-based feature subset selection: Application to Arabic speaker recognition system | |
Yuan et al. | Vector quantization codebook design method for speech recognition based on genetic algorithm | |
Krishna et al. | Self supervised representation learning with deep clustering for acoustic unit discovery from raw speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||