CN105679316A - Voice keyword identification method and apparatus based on deep neural network - Google Patents
- Publication number: CN105679316A (application number CN201511016642.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- neural network
- keyword
- deep neural
- posterior probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/54—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
Abstract
The invention provides a voice keyword identification method and apparatus based on a deep neural network. The method comprises: framing the voice to be identified to obtain a plurality of voice frames; performing feature extraction on each voice frame to obtain a Mel-frequency cepstral coefficient (MFCC) sequence for each voice frame; inputting the MFCC sequence of each voice frame into a preset deep neural network model in parallel; calculating, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model; forming, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames; monitoring the posterior probability sequence under each neural unit of the output layer; and determining the keyword of the voice to be identified according to the comparison between each posterior probability sequence and a preset threshold probability sequence. Because a pre-trained deep neural network performs the keyword identification, the method and apparatus improve identification speed and alleviate the problem of identification delay.
Description
Technical field
The present invention relates to the technical field of voice keyword recognition, and in particular to a voice keyword recognition method and apparatus based on a deep neural network.
Background technology
At present, with the widespread use of smart products, the improvement of storage device performance and capacity, and the rapid development of networks and communications, voice has become a powerful carrier of information, so voice processing and utilization technologies receive more and more attention. Voice keyword recognition refers to identifying a given keyword in a given segment of voice and indicating its position; it is an important branch of speech recognition technology and an effective solution for processing natural speech and realizing human-machine voice interaction. Voice keyword identification is widely used in many application scenarios, such as voice query systems, speech search systems, and real-time voice command control systems, which do not need to recognize word for word all the content the voice contains, but only need to identify predetermined keywords in the given voice. Voice keyword recognition technology therefore has broad application prospects and has become a research focus in the field of speech recognition.
Currently, the related art provides model-based voice keyword recognition technologies. For example, in large-vocabulary continuous speech recognition, the speech signal must first be converted into text by a speech recognizer, and then a text search is performed for the given keyword; this technology can only perform the conversion after a whole segment of continuous voice has been completely input. As another example, keyword identification based on keyword models and filler models must model all non-keywords with filler models and the keywords with keyword models, and likewise can only determine the keywords of a whole segment of continuous voice after it has been completely input.
In the process of realizing the present invention, the inventor found that the related art has at least the following problem: current voice keyword recognition technology suffers from identification delay, and therefore cannot realize timely and fast human-computer interaction.
Summary of the invention
In view of this, the object of the embodiments of the present invention is to provide a voice keyword recognition method and apparatus based on a deep neural network, so as to solve the identification delay problem in voice keyword recognition technology, improve the identification speed of voice keywords, and realize timely and fast human-computer interaction.
In a first aspect, an embodiment of the present invention provides a voice keyword recognition method based on a deep neural network, the recognition method comprising:
framing the input voice to be identified to obtain a plurality of voice frames;
performing feature extraction on each voice frame to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each voice frame;
inputting the MFCC sequence of each voice frame into a preset deep neural network model in parallel, calculating, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model, and forming, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames, wherein each neural unit of the output layer corresponds to one keyword;
monitoring the posterior probability sequence under each neural unit of the output layer; and
determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and a preset threshold probability sequence.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, wherein the preset deep neural network model is established in the following manner:
using a deep learning method to perform deep neural network training on selected voice sample data to obtain the preset deep neural network model, wherein the deep neural network model comprises: an input layer composed of neural units corresponding to the MFCC sequence, hidden layers composed of nonlinear mapping units, and an output layer composed of neural units corresponding to the posterior probability of each keyword.
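As an illustrative sketch (not part of the original disclosure), the three-part structure described above — an MFCC input layer, hidden layers of nonlinear mapping units, and one output unit per keyword — can be written as a small feed-forward network. The hidden-layer widths, the ReLU nonlinearity, and the ten-keyword output are assumptions for demonstration only:

```python
import numpy as np

def init_dnn(n_input=39, n_hidden=(128, 128, 128), n_keywords=10, seed=0):
    """Initialize weights for a 39 -> hidden -> keyword-posterior network."""
    rng = np.random.default_rng(seed)
    sizes = (n_input, *n_hidden, n_keywords)
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Map one 39-dim MFCC vector to a posterior over the keywords."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)   # nonlinear mapping units (ReLU assumed)
    W, b = params[-1]
    z = h @ W + b
    e = np.exp(z - z.max())              # softmax so outputs behave as posteriors
    return e / e.sum()

params = init_dnn()
posterior = forward(params, np.zeros(39))
```

A softmax output layer is used here so that the output-layer values are non-negative and sum to one, matching their interpretation as posterior probabilities.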
With reference to the first possible implementation of the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, wherein using a deep learning method to perform deep neural network training on the selected voice sample data to obtain the preset deep neural network model comprises:
training hidden Markov models (HMMs) and Gaussian mixture models according to the selected voice sample data, wherein the HMMs correspond one-to-one to the selected voice sample data, and the Gaussian mixture models are used to describe the output probability distributions of the HMM states;
using a Viterbi decoding algorithm, with the trained HMMs and Gaussian mixture models, to perform initial-frame and end-frame alignment processing on the selected voice sample data, thereby determining the boundary information of the voice sample data; and
training the preset deep neural network model according to the voice information and text content of the voice sample data and the boundary information of the voice sample data.
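The alignment step can be sketched with a toy Viterbi forced alignment; the two-state, six-frame example and the log-likelihood values below are invented for illustration, and in practice the per-frame scores would come from the trained HMM/GMM:

```python
import numpy as np

def viterbi_align(log_lik, log_trans):
    """Forced alignment: best left-to-right state path through the frames.

    log_lik[t, s]: frame-level log-likelihood of state s (e.g. from a GMM).
    log_trans[s, s2]: log transition probability between states.
    Returns one state index per frame; boundaries fall where the state changes.
    """
    T, S = log_lik.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0, 0] = log_lik[0, 0]             # path must start in the first state
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + log_lik[t, s]
    path = [S - 1]                          # path must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

On a toy input whose first three frames favor state 0 and last three favor state 1, the recovered path places the state boundary between frames 2 and 3, which is exactly the boundary information the training step needs.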
With reference to the second possible implementation of the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, wherein, after using the deep learning method to perform deep neural network training on the selected voice sample data to obtain the preset deep neural network model, the method further comprises:
monitoring the posterior probability of each voice sample under each neural unit of the output layer of the trained preset deep neural network model;
judging whether the posterior probability of each voice sample is maximal under its corresponding neural unit; and
if not, adjusting the parameters of the preset deep neural network model by a back-propagation algorithm until the posterior probability of every voice sample is maximal under its corresponding neural unit.
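The monitor-judge-adjust loop above can be illustrated on a minimal model; here a single softmax layer stands in for the full deep network, and the random data, learning rate, and iteration cap are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 39))        # 20 sample frames, 39-dim MFCC each
y = rng.integers(0, 3, size=20)          # target keyword unit per sample
W = np.zeros((39, 3))

def posteriors(W, X):
    z = X @ W
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Keep adjusting by gradient descent (back-propagation through the softmax
# layer) until every sample's posterior is maximal under its own unit.
for _ in range(2000):
    P = posteriors(W, X)
    if (P.argmax(axis=1) == y).all():    # the "judging" step
        break
    G = P.copy()
    G[np.arange(len(y)), y] -= 1.0       # cross-entropy gradient
    W -= 0.5 * (X.T @ G) / len(y)        # the "adjusting" step

trained = bool((posteriors(W, X).argmax(axis=1) == y).all())
```

The loop terminates as soon as the judging condition holds for all samples, mirroring the stopping criterion stated in the text.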
With reference to the first aspect or any one of the first to third possible implementations of the first aspect, an embodiment of the present invention provides a fourth possible implementation of the first aspect, wherein the recognition method further comprises:
scoring the identified keyword with the corresponding hidden Markov model, and calculating the likelihood probability of the keyword under the hidden Markov model; and
if the likelihood probability is greater than a preset threshold, determining that the recognition result is true.
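The likelihood scoring step corresponds to the standard HMM forward algorithm. The sketch below assumes the per-frame state log-likelihoods have already been computed (e.g. by the Gaussian mixture models); it is an illustration, not the patent's specific implementation:

```python
import numpy as np

def hmm_log_likelihood(log_lik, log_trans, log_init):
    """Forward algorithm: total log-likelihood of a frame sequence under an HMM.

    log_lik[t, s]: log-likelihood of frame t under state s (e.g. from a GMM).
    log_trans[s, s2]: log transition probabilities; log_init[s]: log start probs.
    """
    alpha = log_init + log_lik[0]
    for t in range(1, len(log_lik)):
        m = alpha.max()                  # log-sum-exp over previous states
        alpha = np.log(np.exp(alpha - m) @ np.exp(log_trans)) + m + log_lik[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())
```

Comparing the returned log-likelihood against a preset threshold then confirms or rejects the keyword hypothesis, as in the implementation above.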
With reference to the fourth possible implementation of the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, wherein determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and the preset threshold probability sequence comprises:
judging whether the posterior probability sequence contains a continuous numerical sub-segment whose values are all greater than the preset threshold probability sequence;
if so, judging whether the time length between the initial frame and the end frame corresponding to the continuous numerical sub-segment is greater than a preset time; and
when the time length between the initial frame and the end frame corresponding to the continuous numerical sub-segment is greater than the preset time, taking the keyword corresponding to the neural unit to which the continuous numerical sub-segment belongs as the keyword represented by the input voice to be identified.
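The judgments above amount to searching each posterior sequence for a sufficiently long run of above-threshold values. A minimal sketch, with the threshold and minimum run length (the preset time converted into a frame count) left as free parameters, since the patent does not fix concrete values:

```python
def detect_keyword(posterior_seq, threshold, min_frames):
    """Return (start, end) frame indices of the first run of posteriors that
    stays above `threshold` for at least `min_frames` frames, else None."""
    start = None
    for i, p in enumerate(list(posterior_seq) + [0.0]):  # sentinel ends a run
        if p > threshold:
            if start is None:
                start = i                 # continuous sub-segment begins
        else:
            if start is not None and i - start >= min_frames:
                return (start, i - 1)     # long enough: keyword detected
            start = None                  # too short: discard the run
    return None
```

With a 10 ms frame shift, `min_frames = 30` would correspond to a preset time of roughly 300 ms.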
In a second aspect, an embodiment of the present invention further provides a voice keyword identification apparatus based on a deep neural network, the identification apparatus comprising:
a voice framing module, configured to frame the input voice to be identified to obtain a plurality of voice frames;
a feature extraction module, configured to perform feature extraction on each voice frame to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each voice frame;
a probability calculation module, configured to input the MFCC sequence of each voice frame into a preset deep neural network model in parallel, to calculate, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model, and to form, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames, wherein each neural unit of the output layer corresponds to one keyword;
a monitoring module, configured to monitor the posterior probability sequence under each neural unit of the output layer; and
a keyword identification module, configured to determine the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and a preset threshold probability sequence.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation of the second aspect, wherein the preset deep neural network model is established by the following module:
a model determination module, configured to use a deep learning method to perform deep neural network training on selected voice sample data to obtain the preset deep neural network model, wherein the deep neural network model comprises: an input layer composed of neural units corresponding to the MFCC sequence, hidden layers composed of nonlinear mapping units, and an output layer composed of neural units corresponding to the posterior probability of each keyword.
With reference to the first possible implementation of the second aspect, an embodiment of the present invention provides a second possible implementation of the second aspect, wherein the model determination module comprises:
a training unit, configured to train hidden Markov models (HMMs) and Gaussian mixture models according to the selected voice sample data, wherein the HMMs correspond one-to-one to the selected voice sample data, and the Gaussian mixture models are used to describe the output probability distributions of the HMM states;
an alignment processing unit, configured to use a Viterbi decoding algorithm, with the trained HMMs and Gaussian mixture models, to perform initial-frame and end-frame alignment processing on the selected voice sample data, thereby determining the boundary information of the voice sample data; and
a model determining unit, configured to train the preset deep neural network model according to the voice information and text content of the voice sample data and the boundary information of the voice sample data.
With reference to the second possible implementation of the second aspect, an embodiment of the present invention provides a third possible implementation of the second aspect, wherein the identification apparatus further comprises:
a monitoring module, configured to monitor the posterior probability of each voice sample under each neural unit of the output layer of the trained preset deep neural network model;
a judging module, configured to judge whether the posterior probability of each voice sample is maximal under its corresponding neural unit; and
a fine-tuning module, configured to, when the posterior probability of a voice sample is judged not to be maximal under its corresponding neural unit, adjust the parameters of the preset deep neural network model by a back-propagation algorithm until the posterior probability of every voice sample is maximal under its corresponding neural unit.
With reference to the second aspect or any one of the first to third possible implementations of the second aspect, an embodiment of the present invention provides a fourth possible implementation of the second aspect, wherein the identification apparatus further comprises:
a scoring module, configured to score the identified keyword with the corresponding hidden Markov model and to calculate the likelihood probability of the keyword under the hidden Markov model; and
a recognition result confirmation module, configured to determine that the recognition result is true if the likelihood probability is greater than a preset threshold.
With reference to the fourth possible implementation of the second aspect, an embodiment of the present invention provides a fifth possible implementation of the second aspect, wherein the keyword identification module comprises:
a first judging unit, configured to judge whether the posterior probability sequence contains a continuous numerical sub-segment whose values are all greater than the preset threshold probability sequence;
a second judging unit, configured to, when it is judged that the posterior probability sequence contains a continuous numerical sub-segment whose values are all greater than the preset threshold probability sequence, judge whether the time length between the initial frame and the end frame corresponding to the continuous numerical sub-segment is greater than a preset time; and
a keyword determining unit, configured to, when the time length between the initial frame and the end frame corresponding to the continuous numerical sub-segment is greater than the preset time, take the keyword corresponding to the neural unit to which the continuous numerical sub-segment belongs as the keyword represented by the input voice to be identified.
In the voice keyword recognition method and apparatus based on a deep neural network provided by the embodiments of the present invention, the method comprises: first, performing framing processing on the input voice to be identified, and performing feature extraction on the resulting plurality of voice frames to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each voice frame; then, inputting the MFCC sequence of each voice frame into a preset deep neural network model in parallel, calculating, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model, and forming, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames; finally, monitoring the posterior probability sequence under each neural unit of the output layer, and determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and a preset threshold probability sequence. In the embodiments of the present invention, a pre-trained deep neural network performs the voice keyword identification, which improves the identification speed of voice keywords and alleviates the identification delay problem, so that timely and fast human-computer interaction can be realized.
To make the above objects, features and advantages of the present invention more apparent and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be regarded as limiting the scope; for those of ordinary skill in the art, other relevant drawings can be obtained from these drawings without creative work.
Fig. 1 shows a flow chart of a voice keyword recognition method based on a deep neural network provided by an embodiment of the present invention;
Fig. 2 shows a flow chart of another voice keyword recognition method based on a deep neural network provided by an embodiment of the present invention;
Fig. 3 shows a structural diagram of a voice keyword identification apparatus based on a deep neural network provided by an embodiment of the present invention;
Fig. 4 shows a structural diagram of another voice keyword identification apparatus based on a deep neural network provided by an embodiment of the present invention.
Detailed description of embodiments
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the drawings herein may be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the claimed scope of the present invention, but merely represents selected embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative work fall within the protection scope of the present invention.
Considering that current voice keyword recognition technology in the related art suffers from identification delay and therefore cannot realize timely and fast human-computer interaction, the embodiments of the present invention provide a voice keyword recognition method and apparatus based on a deep neural network, which are described below by way of embodiments.
As shown in Fig. 1, an embodiment of the present invention provides a voice keyword recognition method based on a deep neural network, the method comprising steps S102-S110, specifically as follows:
Step S102: framing the input voice to be identified to obtain a plurality of voice frames;
Here, the input voice to be identified is first subjected to framing processing. The time length of each voice frame may be set to 25 ms with a frame shift of 10 ms; that is, the input voice to be identified is divided into a plurality of voice frames according to a preset framing scheme.
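With the 25 ms window and 10 ms shift given above, the framing step can be sketched as follows; the 16 kHz sampling rate is an assumption not stated in the text:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames (25 ms window, 10 ms shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(16000))   # one second of audio -> 98 frames
```

One second of 16 kHz audio yields 98 frames of 400 samples each, since consecutive frames overlap by 15 ms.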
Step S104: performing feature extraction on each voice frame to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each voice frame;
Specifically, feature extraction is performed on the plurality of voice frames obtained after framing, extracting the discriminative components of the audio signal of each voice frame to obtain the MFCC sequence corresponding to each voice frame. This MFCC sequence has 39 dimensions, and the 39-dimensional MFCC sequence corresponding to each voice frame serves as the input feature of the input layer of the preset deep neural network; accordingly, the input layer of the preset deep neural network is set to 39 neural units.
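A simplified sketch of 39-dimensional MFCC extraction (13 cepstral coefficients plus delta and delta-delta features) is shown below; the filterbank size, FFT length, and the use of `np.gradient` for the deltas are illustrative choices, not specified by the patent:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        for k in range(l, c):
            fb[j, k] = (k - l) / max(c - l, 1)   # rising edge of the triangle
        for k in range(c, r):
            fb[j, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc_39(frames, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    """13 cepstral coefficients per frame, plus delta and delta-delta -> 39 dims."""
    win = np.hamming(frames.shape[1])
    spec = np.abs(np.fft.rfft(frames * win, n_fft)) ** 2   # power spectrum
    logmel = np.log(spec @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    ceps = logmel @ dct.T                                  # DCT-II -> cepstrum
    delta = np.gradient(ceps, axis=0)                      # simple delta features
    return np.hstack([ceps, delta, np.gradient(delta, axis=0)])
```

Each input frame thus yields one 39-dimensional feature vector, matching the 39 neural units of the input layer.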
Step S106: inputting the MFCC sequence of each voice frame into the preset deep neural network model in parallel, calculating, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model, and forming, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames, wherein each neural unit of the output layer corresponds to one keyword;
Specifically, the 39-dimensional MFCC sequence of each voice frame obtained by feature extraction serves as the input-layer feature of the preset deep neural network. The neural units of the input layer are mutually independent; after the 39-dimensional MFCC sequences of the input voice frames are processed in parallel, they are passed to the hidden layers of the preset deep neural network. The hidden layers may be 3 layers, composed of nonlinear mapping units. The 39-dimensional MFCC sequence of each voice frame is input in turn, and the posterior probability of the MFCC sequence of each voice frame is calculated under each neural unit of the output layer of the preset deep neural network model. The neural units of the output layer are also mutually independent, so parallel recognition can be realized. Since a segment of voice becomes a plurality of voice frames after framing, and the 39-dimensional MFCC sequence of each voice frame obtained by feature extraction serves as the input, for one segment of input voice to be identified there exists a posterior probability sequence under each neural unit of the output layer.
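Because the frames are independent at the input layer, the forward pass can be computed for all frames at once; column `P[:, k]` then gives the posterior probability sequence under output unit k. The weights below are random placeholders, and the three 128-unit hidden layers and five keywords are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_keywords = 50, 5
layers = [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
          for a, b in zip((39, 128, 128, 128), (128, 128, 128, n_keywords))]

H = rng.standard_normal((n_frames, 39))   # one 39-dim MFCC row per frame
for W, b in layers[:-1]:
    H = np.maximum(H @ W + b, 0.0)        # 3 hidden layers, all frames at once
W, b = layers[-1]
Z = H @ W + b
P = np.exp(Z - Z.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)         # P[:, k]: posterior sequence of unit k
```

Each row of `P` is the keyword posterior for one frame, and each column is the frame-by-frame posterior probability sequence that step S108 monitors.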
Step S108: monitoring the posterior probability sequence under each neural unit of the output layer;
Step S110: determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and a preset threshold probability sequence.
Here, the posterior probability values of the voice frames of the input voice under the corresponding neural unit in the preset deep neural network should be greater than the preset threshold. Therefore, under the neural unit corresponding to the voice frames of the input voice there exists a continuous numerical sub-segment whose values are all greater than the preset threshold probability sequence, from which it can be judged that the keyword corresponding to that output-layer neural unit is the keyword of this segment of input voice to be identified.
For example, suppose the input voice to be identified is "please turn left". First, the input voice "please turn left" is framed, then feature extraction is performed on each voice frame, and the MFCC sequences of the voice frames of "please turn left" are input into the preset deep neural network model. When this model identifies "please turn left", a run of relatively large posterior probability values should appear under the output-layer neural unit representing "turn left", while under the other output-layer neural units no such run appears, or only several intermittent larger posterior probability values appear.
In the voice keyword recognition method based on a deep neural network provided by the embodiment of the present invention, the method comprises: first, performing framing processing on the input voice to be identified, and performing feature extraction on the resulting plurality of voice frames to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each voice frame; then, inputting the MFCC sequence of each voice frame into a preset deep neural network model in parallel, calculating, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model, and forming, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames; finally, monitoring the posterior probability sequence under each neural unit of the output layer, and determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and a preset threshold probability sequence. In the embodiment of the present invention, a pre-trained deep neural network performs the voice keyword identification, which improves the identification speed of voice keywords and alleviates the identification delay problem, so that timely and fast human-computer interaction can be realized.
Considering that similar keywords may exist, it is not enough to judge whether the posterior probability sequence contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence; it is also necessary to judge whether the time length between the initial frame and the end frame of that continuous sub-segment exceeds a preset duration. Based on this, determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and the preset threshold sequence comprises:
Judging whether the posterior probability sequence contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence;
If so, judging whether the time length between the initial frame and the end frame of the continuous numerical sub-segment exceeds the preset duration;
When that time length exceeds the preset duration, taking the keyword corresponding to the neuron to which the continuous numerical sub-segment belongs as the keyword expressed by the input voice to be identified.
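The two-stage check above — find a continuous run of posteriors above the threshold, then apply a minimum-duration test to that run — can be sketched in Python as follows (a hypothetical helper; a single scalar threshold and a frame count stand in for the preset threshold sequence and the preset duration):

```python
def find_keyword_segment(posteriors, prob_threshold, min_frames):
    """Scan one neuron's per-frame posterior sequence; return (start, end)
    of the first run of consecutive frames all above prob_threshold that
    also lasts at least min_frames, else None."""
    start = None
    for i, p in enumerate(posteriors + [0.0]):  # sentinel closes a trailing run
        if p > prob_threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                return (start, i - 1)
            start = None
    return None

# One long run (frames 2-5) and one isolated spike (frame 7):
seq = [0.1, 0.2, 0.9, 0.95, 0.92, 0.97, 0.3, 0.91, 0.1]
seg = find_keyword_segment(seq, prob_threshold=0.8, min_frames=3)  # -> (2, 5)
```

The isolated spike at frame 7 is rejected by the duration test, mirroring how the interrupted runs under the "turn right" neuron are rejected.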
Specifically, again taking the input voice "please turn left" as an example, an input voice with a similar keyword is "please turn right". When the input voice to be identified is "please turn left", runs of large posterior probability values appear both under the output-layer neuron of the preset deep neural network model representing "turn left" and under the neuron representing "turn right"; however, the run under the "turn left" neuron is long and continuous, whereas the large values under the "turn right" neuron are broken into two interrupted runs. The keyword in the input voice is therefore determined by additionally judging the time length between the initial frame and the end frame of the continuous numerical sub-segment.
In the embodiments provided by the present invention, in addition to judging whether the posterior probability sequence under each output-layer neuron contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence, it is also judged whether the time length between the initial frame and the end frame of that sub-segment exceeds the preset duration, which further improves the accuracy of voice keyword recognition.
Further, the above-mentioned preset deep neural network model is established as follows:
The selected voice sample data are subjected to deep neural network training using a deep learning method, yielding the preset deep neural network model, wherein the model comprises: an input layer composed of neurons corresponding to the MFCC sequence, hidden layers composed of nonlinear mapping units, and an output layer composed of neurons corresponding to the posterior probabilities of the keywords. The output-layer neurons comprise one neuron for each keyword, one neuron for ambient sound, and one neuron for non-keywords; that is, if the model is trained for N keywords, the output layer of the deep neural network model has N+2 neurons.
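A toy forward pass matching this architecture (illustrative only: the 13-dimensional MFCC input, the single 16-unit hidden layer, the tanh nonlinearity, and the random weights are all assumptions; with N = 3 keywords the softmax output layer has N + 2 = 5 units):

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, weights):
    """One pass through a toy MLP: tanh hidden layers (the nonlinear mapping
    units), then a softmax output layer over N keywords + ambient + filler."""
    h = x
    for W, b in weights[:-1]:
        h = [math.tanh(sum(wi * hi for wi, hi in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
    W, b = weights[-1]
    z = [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(W, b)]
    return softmax(z)

def layer(n_in, n_out):
    return ([[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

random.seed(0)
N = 3                                # keywords -> output layer has N + 2 units
net = [layer(13, 16), layer(16, N + 2)]
post = forward([0.1] * 13, net)      # posterior distribution for one MFCC frame
```

Feeding one such frame per time step yields the per-neuron posterior probability sequences monitored in the method.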
Specifically, subjecting the selected voice sample data to deep neural network training using the deep learning method to obtain the preset deep neural network model comprises:
Training a hidden Markov model (HMM) and a Gaussian mixture model (GMM) from the selected voice sample data, wherein the HMMs correspond one-to-one to the selected voice samples, and the GMM describes the output probability distribution of the HMM states;
Performing initial-frame and end-frame alignment on the selected voice sample data with a Viterbi decoding algorithm, using the trained HMM and GMM, to determine the boundary information of the voice sample data;
Training the preset deep neural network model from the audio, the text content, and the boundary information of the voice sample data.
In one implementation, each HMM has 9 states, and the output probability distribution of each state is described by a GMM with 8 mixture components. The HMMs and GMMs are trained with open-source software using the forward-backward (Baum-Welch) algorithm, the expectation-maximization (EM) algorithm, and the maximum-likelihood estimation (MLE) criterion.
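The alignment step can be illustrated with a minimal log-domain Viterbi decoder for a left-to-right HMM (a sketch, not the cited open-source tooling; the toy 2-state, 4-frame example below stands in for the 9-state keyword models, with per-frame state log-likelihoods assumed to come from the GMMs):

```python
def viterbi_align(log_b, log_a):
    """Minimal Viterbi for a left-to-right HMM.
    log_b[t][s]: log output probability of frame t in state s (from the GMMs);
    log_a[s][s2]: log transition probability. Returns the best state path,
    from which start/end frames of each state can be read off."""
    T, S = len(log_b), len(log_b[0])
    NEG = -1e18
    delta = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    delta[0][0] = log_b[0][0]           # must start in the first state
    for t in range(1, T):
        for s in range(S):
            best, arg = NEG, 0
            for s2 in (s - 1, s):       # left-to-right: advance or stay
                if s2 < 0:
                    continue
                v = delta[t - 1][s2] + log_a[s2][s]
                if v > best:
                    best, arg = v, s2
            delta[t][s] = best + log_b[t][s]
            back[t][s] = arg
    path = [S - 1]                      # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Frames 0-1 fit state 0, frames 2-3 fit state 1:
log_b = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
log_a = [[-0.7, -0.7], [-1e18, 0.0]]    # no backward transitions
path = viterbi_align(log_b, log_a)      # -> [0, 0, 1, 1]
```

The frame indices where the path enters and leaves each state give the boundary information used to trim redundant audio.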
When training the deep neural network model, the selected voice sample data serve as the training objects. Since this training is the core link of the whole voice keyword recognition process, selecting representative voice samples is very important for obtaining a model whose coverage is comprehensive and that generalizes well. For example, recordings of identical content spoken with different accents can be used as voice samples; if sufficient such samples are selected as training objects, the trained model captures the diversity of different accents for the same content, and using this deep neural network model in subsequent keyword recognition makes the results more accurate.
In the embodiments provided by the present invention, when training the deep neural network model, the selected voice sample data undergo initial-frame and end-frame alignment to determine their boundary information. Concretely, the Viterbi decoding algorithm, together with the trained HMM and GMM, is applied to the current voice sample to perform the alignment, i.e. the initial-frame and end-frame registration. Determining the boundary information improves the accuracy and speed of learning from the voice samples: for a sample such as "turn your head to the left" whose keyword is "turn left", the start and end positions of the keyword within the audio must be determined so that the redundant audio data can be removed. As a result, the trained deep neural network model can quickly and accurately locate the boundaries of a voice keyword during subsequent recognition and can determine a keyword immediately once it is identified, instead of performing recognition only after waiting for a whole segment of speech input, which further alleviates the voice keyword recognition delay problem.
In order to further optimize the trained deep neural network and thereby improve the accuracy of subsequent recognition, after the selected voice sample data have been subjected to deep neural network training with the deep learning method and the preset deep neural network model has been obtained, the method further comprises:
Monitoring the posterior probability of each voice sample under each neuron of the output layer of the trained preset deep neural network model;
Judging whether, for each voice sample, the posterior probability under the corresponding neuron is the maximum;
If not, adjusting the parameters of the preset deep neural network model with a back-propagation algorithm until, for every voice sample, the posterior probability under the corresponding neuron is the maximum.
For example, for the selected voice sample "please turn left", suppose that in the training frame sequence the text content of frames 15 to 75 is "turn left". When frame 15 of the sample is input, the maximum posterior probability value should be obtained under the output-layer neuron of the trained model that represents "turn left". If the maximum is instead obtained under another output-layer neuron, the parameters of the deep neural network model are adjusted with the back-propagation algorithm until inputting frame 15 yields the maximum posterior probability under the "turn left" neuron; frames 16, 17, ... up to frame 75 are then checked in the same way, so that for all of them the maximum posterior probability value is obtained under the "turn left" neuron of the trained model's output layer.
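The per-frame check driving this fine-tuning loop can be sketched as follows (a hypothetical helper; the 3-unit posterior rows are toy values, with unit 1 standing in for the "turn left" neuron):

```python
def frames_needing_adjustment(posterior_rows, target_unit):
    """For each frame of a labeled keyword segment, check whether the maximum
    posterior falls on the target output unit; return indices of frames where
    it does not (candidates for a further back-propagation pass)."""
    bad = []
    for t, row in enumerate(posterior_rows):
        if max(range(len(row)), key=row.__getitem__) != target_unit:
            bad.append(t)
    return bad

# Toy posteriors for three frames of a segment labeled with unit 1:
rows = [[0.1, 0.8, 0.1], [0.5, 0.3, 0.2], [0.05, 0.9, 0.05]]
misaligned = frames_needing_adjustment(rows, target_unit=1)  # -> [1]
```

Training would repeat back-propagation until this list is empty for every labeled sample.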
In the embodiments provided by the present invention, by adjusting the parameters of the trained deep neural network model so that each input speech frame attains its maximum posterior probability value under the corresponding output-layer neuron, the accuracy and speed of subsequent voice keyword recognition are further improved.
Further, in order to improve the accuracy of the voice keyword recognition result, as shown in Figure 2, the recognition method further comprises:
Step S112: scoring the identified keyword with the corresponding hidden Markov model, i.e. calculating the likelihood probability of the keyword under that HMM;
Step S114: if the likelihood probability is greater than a preset threshold, determining that the recognition result is true.
During the training of the deep neural network model, a corresponding hidden Markov model (HMM) and Gaussian mixture model (GMM) are trained from each labeled keyword voice sample, the HMM using the GMM to describe the feature space distribution. In the above steps, the identified keyword is first scored with its corresponding HMM, i.e. the likelihood probability of the keyword under that HMM is calculated. For example, again taking the input voice "please turn left", the identified keyword is "turn left", so the likelihood probability of this keyword under the HMM for "turn left" is calculated. The calculated likelihood is then compared with the preset threshold: if it is greater than the threshold, the recognition result is determined to be true, i.e. the keyword of the input voice is "turn left"; if it is less than the threshold, the recognition result is determined to be false and recognition must be performed again.
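The scoring step can be illustrated with a log-domain forward algorithm that computes the keyword's likelihood under its HMM and compares it with the threshold (a sketch under a toy single-state model; the log output probabilities, transitions, and threshold values are all assumed, not the patent's trained parameters):

```python
import math

def log_likelihood(frames_logb, log_a, log_pi):
    """Forward algorithm in the log domain: total log-likelihood of the
    keyword's frames under its HMM (per-frame output log-probs frames_logb,
    transition matrix log_a, initial distribution log_pi)."""
    def logsumexp(vals):
        m = max(vals)
        return m + math.log(sum(math.exp(v - m) for v in vals))
    S = len(log_pi)
    alpha = [log_pi[s] + frames_logb[0][s] for s in range(S)]
    for obs in frames_logb[1:]:
        alpha = [logsumexp([alpha[s2] + log_a[s2][s] for s2 in range(S)]) + obs[s]
                 for s in range(S)]
    return logsumexp(alpha)

def accept(frames_logb, log_a, log_pi, threshold):
    """True if the keyword's HMM score exceeds the verification threshold."""
    return log_likelihood(frames_logb, log_a, log_pi) > threshold

# Trivial single-state HMM, two frames each with log-prob -1.0 -> total -2.0
ll = log_likelihood([[-1.0], [-1.0]], [[0.0]], [0.0])
```

A keyword scoring -2.0 passes a threshold of -3.0 but fails a threshold of -1.0, triggering re-recognition.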
In the embodiments provided by the present invention, the recognition result is verified after a keyword has been identified: first, the identified keyword is scored with its corresponding hidden Markov model, i.e. its likelihood probability under that HMM is calculated; then the calculated likelihood is compared with the preset threshold, and whether the recognition result is correct is determined from the comparison, so that the accuracy of the voice keyword recognition result can be further improved.
In the deep-neural-network-based voice keyword recognition method and apparatus provided by the embodiments of the present invention, the method comprises: first, framing the input voice to be identified and performing feature extraction on the resulting speech frames, thereby obtaining the Mel-frequency cepstral coefficient (MFCC) sequence of each speech frame; then, inputting the MFCC sequence of each speech frame into the preset deep neural network model, calculating the posterior probability of each frame's MFCC sequence under each neuron of the model's output layer, and assembling the posterior probabilities under each output-layer neuron into the posterior probability sequence corresponding to the frames; finally, monitoring the posterior probability sequence under each output-layer neuron and determining the keyword of the input voice according to the comparison between the posterior probability sequence and the preset threshold sequence.
In the embodiments of the present invention, a pre-trained deep neural network performs the voice keyword recognition, which increases the recognition speed and alleviates the recognition delay, so that timely and fast human-computer interaction can be realized. Further, in addition to judging whether the posterior probability sequence under each output-layer neuron contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence, it is also judged whether the time length between the initial frame and the end frame of that sub-segment exceeds the preset duration, which further improves the recognition accuracy. Further, the recognition result is verified after a keyword has been identified: the keyword is scored with its corresponding hidden Markov model, i.e. its likelihood probability under that HMM is calculated and compared with the preset threshold, and the correctness of the recognition result is determined from the comparison, so that the accuracy of the voice keyword recognition result can be further improved.
Corresponding to the above recognition method, an embodiment of the present invention further provides a deep-neural-network-based voice keyword recognition apparatus. As shown in Figure 3, the recognition apparatus comprises:
A voice framing module 302, configured to frame the input voice to be identified into multiple speech frames;
A feature extraction module 304, configured to perform feature extraction on each speech frame to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each speech frame;
A probability calculation module 306, configured to input the MFCC sequence of each speech frame into the preset deep neural network model, calculate the posterior probability of each frame's MFCC sequence under each neuron of the model's output layer, and assemble the posterior probabilities under each output-layer neuron into the posterior probability sequence corresponding to the speech frames, wherein each output-layer neuron corresponds to one keyword;
A monitoring module 308, configured to monitor the posterior probability sequence under each output-layer neuron;
A keyword identification module 310, configured to determine the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and the preset threshold sequence.
Further, the preset deep neural network model is established by the following module:
A model determination module, configured to subject the selected voice sample data to deep neural network training using a deep learning method to obtain the preset deep neural network model, wherein the model comprises: an input layer composed of neurons corresponding to the MFCC sequence, hidden layers composed of nonlinear mapping units, and an output layer composed of neurons corresponding to the posterior probabilities of the keywords.
Further, the model determination module can be realized by the following functional units:
A training unit, configured to train a hidden Markov model and a Gaussian mixture model from the selected voice sample data, wherein the HMMs correspond one-to-one to the selected voice samples, and the GMM describes the output probability distribution of the HMM states;
An alignment unit, configured to perform initial-frame and end-frame alignment on the selected voice sample data with a Viterbi decoding algorithm, using the trained HMM and GMM, to determine the boundary information of the voice sample data;
A model determining unit, configured to train the preset deep neural network model from the audio, the text content, and the boundary information of the voice sample data.
Further, the recognition apparatus also comprises:
A monitoring module, configured to monitor the posterior probability of each voice sample under each neuron of the output layer of the trained preset deep neural network model;
A judging module, configured to judge whether, for each voice sample, the posterior probability under the corresponding neuron is the maximum;
A fine-tuning module, configured to, when the posterior probability of a voice sample under the corresponding neuron is judged not to be the maximum, adjust the parameters of the preset deep neural network model with a back-propagation algorithm until, for every voice sample, the posterior probability under the corresponding neuron is the maximum.
Further, as shown in Figure 4, the recognition apparatus also comprises:
A scoring module 312, configured to score the identified keyword with the corresponding hidden Markov model, i.e. calculate the likelihood probability of the keyword under that HMM;
A recognition result confirmation module 314, configured to determine that the recognition result is true if the likelihood probability is greater than the preset threshold.
Further, the keyword identification module 310 comprises:
A first judging unit, configured to judge whether the posterior probability sequence contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence;
A second judging unit, configured to, when it is judged that such a continuous numerical sub-segment exists, judge whether the time length between the initial frame and the end frame of the sub-segment exceeds the preset duration;
A keyword determining unit, configured to, when that time length exceeds the preset duration, take the keyword corresponding to the neuron to which the continuous numerical sub-segment belongs as the keyword expressed by the input voice to be identified.
Based on the above analysis, compared with the related art, the voice keyword recognition apparatus provided by the embodiments of the present invention first frames the input voice to be identified and performs feature extraction on the resulting speech frames, thereby obtaining the MFCC sequence of each speech frame; then inputs the MFCC sequence of each speech frame into the preset deep neural network model, calculates the posterior probability of each frame's MFCC sequence under each neuron of the model's output layer, and assembles the posterior probabilities under each output-layer neuron into the posterior probability sequence corresponding to the frames; finally, monitors the posterior probability sequence under each output-layer neuron and determines the keyword of the input voice according to the comparison between the posterior probability sequence and the preset threshold sequence.
In the embodiments of the present invention, a pre-trained deep neural network performs the voice keyword recognition, which increases the recognition speed and alleviates the recognition delay, so that timely and fast human-computer interaction can be realized. Further, in addition to judging whether the posterior probability sequence under each output-layer neuron contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence, it is also judged whether the time length between the initial frame and the end frame of that sub-segment exceeds the preset duration, which further improves the recognition accuracy. Further, the recognition result is verified after a keyword has been identified: the keyword is scored with its corresponding hidden Markov model, i.e. its likelihood probability under that HMM is calculated and compared with the preset threshold, and the correctness of the recognition result is determined from the comparison, so that the accuracy of the voice keyword recognition result can be further improved.
The voice keyword recognition apparatus provided by the embodiments of the present invention may be specific hardware on a device, or software or firmware installed on a device. The implementation principle and technical effects of the apparatus are the same as those of the foregoing method embodiments; for brevity, where the apparatus embodiments omit details, reference may be made to the corresponding content of the foregoing method embodiments. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may all refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments provided by the present invention may be integrated into one processing unit, may exist physically as separate units, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and comprises instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be noted that similar reference numerals and letters denote similar items in the accompanying drawings; therefore, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings. In addition, the terms "first", "second", "third", etc. are used only for distinguishing descriptions and shall not be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, intended to illustrate rather than limit its technical solution, and the scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that anyone familiar with the technical field may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or replace some of the technical features with equivalents; and such modifications, changes, or replacements do not cause the essence of the corresponding technical solution to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be encompassed within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.
Claims (12)
1. A voice keyword recognition method based on a deep neural network, characterized by comprising:
Framing input voice to be identified to obtain multiple speech frames;
Performing feature extraction on each of said speech frames to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each of said speech frames;
Inputting the MFCC sequence of each of said speech frames into a preset deep neural network model, calculating the posterior probability of the MFCC sequence of each of said speech frames under each neuron of the output layer of said preset deep neural network model, and composing, from the posterior probabilities under each neuron of said output layer, the posterior probability sequence corresponding to said multiple speech frames, wherein each neuron of the output layer corresponds to one keyword;
Monitoring said posterior probability sequence under each neuron of the output layer;
Determining the keyword of said input voice to be identified according to the comparison result between said posterior probability sequence and a preset threshold probability sequence.
2. The voice keyword recognition method based on a deep neural network according to claim 1, characterized in that said preset deep neural network model is established in the following manner:
Subjecting selected voice sample data to deep neural network training using a deep learning method to obtain the preset deep neural network model, wherein said deep neural network model comprises: an input layer composed of neurons corresponding to the MFCC sequence, hidden layers composed of nonlinear mapping units, and an output layer composed of neurons corresponding to the posterior probabilities of the keywords.
3. The voice keyword recognition method based on a deep neural network according to claim 2, characterized in that said subjecting the selected voice sample data to deep neural network training using the deep learning method to obtain the preset deep neural network model comprises:
Training a hidden Markov model and a Gaussian mixture model from the selected voice sample data, wherein said hidden Markov models correspond one-to-one to said selected voice sample data, and said Gaussian mixture model is used to describe the output probability distribution of the states of said hidden Markov model;
Performing initial-frame and end-frame alignment on the selected voice sample data with a Viterbi decoding algorithm, using said trained hidden Markov model and said Gaussian mixture model, to determine the boundary information of said voice sample data;
Training the preset deep neural network model from the audio, the text content, and the boundary information of said voice sample data.
4. The voice keyword recognition method based on a deep neural network according to claim 3, characterized in that, after said subjecting the selected voice sample data to deep neural network training using the deep learning method to obtain the preset deep neural network model, the method further comprises:
Monitoring the posterior probability of each voice sample under each neuron of the output layer of said trained preset deep neural network model;
Judging whether, for each voice sample, the posterior probability under the corresponding neuron is the maximum;
If not, adjusting the parameters of said preset deep neural network model with a back-propagation algorithm until, for every voice sample, the posterior probability under the corresponding neuron is the maximum.
5. The voice keyword recognition method based on a deep neural network according to any one of claims 1-4, characterized by further comprising:
Scoring the identified said keyword with the corresponding hidden Markov model, i.e. calculating the likelihood probability of said keyword under said hidden Markov model;
If said likelihood probability is greater than a preset threshold, determining that the recognition result is true.
6. The speech keyword recognition method based on a deep neural network according to claim 5, characterized in that determining the keyword of the input speech to be recognized according to the comparison result between the posterior probability sequence and the preset threshold comprises:
Judging whether the posterior probability sequence contains a continuous numerical sub-segment in which every probability is greater than the preset threshold;
If so, judging whether the time length between the start frame and the end frame corresponding to the continuous numerical sub-segment is greater than a preset duration;
When the time length between the start frame and the end frame corresponding to the continuous numerical sub-segment is greater than the preset duration, taking the keyword corresponding to the neural unit to which the continuous numerical sub-segment belongs as the keyword represented by the input speech to be recognized.
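Claim 6's decision rule, restated: fire a keyword only if the posterior sequence contains a run of consecutive frames all above the threshold, and that run spans more than a preset duration. A minimal sketch, using frame count as a proxy for duration (with a 10 ms frame shift, `min_frames` maps directly to a minimum time; the threshold and probability values are illustrative):

```python
def detect_keyword(posterior_seq, threshold, min_frames):
    """Return (start, end) frame indices of the first above-threshold run
    longer than min_frames, or None if no such run exists."""
    start = None
    for i, p in enumerate(posterior_seq + [0.0]):   # sentinel closes a final run
        if p > threshold:
            if start is None:
                start = i                           # run begins
        elif start is not None:
            if i - start > min_frames:              # duration check
                return (start, i - 1)
            start = None                            # too short; discard run
    return None

probs = [0.1, 0.2, 0.85, 0.9, 0.92, 0.88, 0.3, 0.95, 0.1]
print(detect_keyword(probs, threshold=0.8, min_frames=3))  # (2, 5)
print(detect_keyword(probs, threshold=0.8, min_frames=5))  # None
```

In the apparatus of claims 7 and 12, this check would run once per output-layer neural unit, and the unit whose sequence fires determines the recognized keyword.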
7. A speech keyword recognition apparatus based on a deep neural network, characterized in that it comprises:
A speech framing module, configured to frame the input speech to be recognized to obtain a plurality of speech frames;
A feature extraction module, configured to perform feature extraction on each speech frame to obtain a Mel-frequency cepstral coefficient (MFCC) sequence for each speech frame;
A probability calculation module, configured to input the MFCC sequence of each speech frame into a preset deep neural network model, calculate the posterior probability of the MFCC sequence of each speech frame under each neural unit of the output layer of the preset deep neural network model, and compose the posterior probabilities of the plurality of speech frames under each neural unit of the output layer into a corresponding posterior probability sequence, wherein each neural unit of the output layer corresponds to one keyword;
A monitoring module, configured to monitor the posterior probability sequence under each neural unit of the output layer;
A keyword recognition module, configured to determine the keyword of the input speech to be recognized according to the comparison result between the posterior probability sequence and the preset threshold.
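The framing and feature-extraction modules of claim 7 can be sketched end-to-end. This is a minimal NumPy MFCC front end under common assumptions (16 kHz audio, 25 ms frames, 10 ms hop, 512-point FFT, 26 mel filters, 13 cepstra); these parameters are not specified by the patent, and a production front end would typically add pre-emphasis, liftering, and delta features:

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Toy MFCC pipeline: framing -> Hamming window -> power spectrum
    -> triangular mel filterbank -> log -> DCT-II."""
    # 1. Split the signal into overlapping frames (25 ms window, 10 ms hop).
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Power spectrum of each frame (512-point FFT -> 257 bins).
    spec = np.abs(np.fft.rfft(frames, n=512, axis=1)) ** 2

    # 3. Triangular mel filterbank between 0 Hz and sr/2.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((512 + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    logmel = np.log(spec @ fbank.T + 1e-10)

    # 4. DCT-II, keeping the first n_ceps cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz tone
feats = mfcc(sig)
print(feats.shape)  # (98, 13)
```

Each row of `feats` is the per-frame MFCC vector that the probability calculation module would feed (typically with context stacking) into the DNN's input layer.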
8. The speech keyword recognition apparatus based on a deep neural network according to claim 7, characterized in that the preset deep neural network model is established by the following module:
A model determination module, configured to perform deep neural network training on the selected speech sample data using a deep learning method to obtain the preset deep neural network model, wherein the deep neural network model comprises: an input layer composed of neural units corresponding to the MFCC sequences, hidden layers composed of nonlinear mapping units, and an output layer composed of neural units corresponding to the posterior probability of each keyword.
9. The speech keyword recognition apparatus based on a deep neural network according to claim 8, characterized in that the model determination module comprises:
A training unit, configured to train a hidden Markov model (HMM) and a Gaussian mixture model (GMM) on the selected speech sample data, wherein the HMMs correspond one-to-one with the selected speech sample data, and the GMM describes the output probability distribution of the HMM states;
An alignment unit, configured to perform start-frame and end-frame alignment on the selected speech sample data with a Viterbi decoding algorithm, using the trained HMM and GMM, to determine the boundary information of the speech sample data;
A model determining unit, configured to train the preset deep neural network model according to the speech information, text content, and boundary information of the speech sample data.
10. The speech keyword recognition apparatus based on a deep neural network according to claim 9, characterized in that the apparatus further comprises:
A monitoring module, configured to monitor the posterior probability of each speech sample under each neural unit of the output layer of the trained preset deep neural network model;
A judging module, configured to judge whether the posterior probability of each speech sample is maximal under its corresponding neural unit;
A fine-tuning module, configured to, when the posterior probability of a speech sample is not maximal under its corresponding neural unit, adjust the parameters of the preset deep neural network model using the back-propagation algorithm until the posterior probability of every speech sample is maximal under its corresponding neural unit.
11. The speech keyword recognition apparatus based on a deep neural network according to any one of claims 7-10, characterized in that it further comprises:
A scoring module, configured to score the identified keyword using the corresponding hidden Markov model (HMM) and calculate the likelihood probability of the keyword under that HMM;
A recognition result confirmation module, configured to determine that the recognition result is true if the likelihood probability is greater than a preset threshold.
12. The speech keyword recognition apparatus based on a deep neural network according to claim 11, characterized in that the keyword recognition module comprises:
A first judging unit, configured to judge whether the posterior probability sequence contains a continuous numerical sub-segment in which every probability is greater than the preset threshold;
A second judging unit, configured to, when the posterior probability sequence contains such a sub-segment, judge whether the time length between the start frame and the end frame corresponding to the continuous numerical sub-segment is greater than a preset duration;
A keyword determining unit, configured to, when the time length between the start frame and the end frame corresponding to the continuous numerical sub-segment is greater than the preset duration, take the keyword corresponding to the neural unit to which the continuous numerical sub-segment belongs as the keyword represented by the input speech to be recognized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511016642.1A CN105679316A (en) | 2015-12-29 | 2015-12-29 | Voice keyword identification method and apparatus based on deep neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105679316A (en) | 2016-06-15 |
Family
ID=56189743
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511016642.1A Pending CN105679316A (en) | 2015-12-29 | 2015-12-29 | Voice keyword identification method and apparatus based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105679316A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102194454A (en) * | 2010-03-05 | 2011-09-21 | 富士通株式会社 | Equipment and method for detecting key word in continuous speech |
CN103559881A (en) * | 2013-11-08 | 2014-02-05 | 安徽科大讯飞信息科技股份有限公司 | Language-irrelevant key word recognition method and system |
CN103730115A (en) * | 2013-12-27 | 2014-04-16 | 北京捷成世纪科技股份有限公司 | Method and device for detecting keywords in voice |
CN104575490A (en) * | 2014-12-30 | 2015-04-29 | 苏州驰声信息科技有限公司 | Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm |
US20150279358A1 (en) * | 2014-03-31 | 2015-10-01 | International Business Machines Corporation | Method and system for efficient spoken term detection using confusion networks |
Non-Patent Citations (3)
Title |
---|
GUOGUO CHEN et al.: "Small-footprint keyword spotting using deep neural networks", 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) * |
LI WENXIN: "Research on confidence measures in speech keyword recognition", Master's thesis, PLA Information Engineering University * |
WANG CHAOSONG, HAN JIQING, ZHENG TIERAN: "Acoustic model training based on the non-uniform MCE criterion in a DNN keyword spotting system", Intelligent Computer and Applications * |
Cited By (75)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106209491B (en) * | 2016-06-16 | 2019-07-02 | 苏州科达科技股份有限公司 | A kind of time delay detecting method and device |
CN106209491A (en) * | 2016-06-16 | 2016-12-07 | 苏州科达科技股份有限公司 | A kind of time delay detecting method and device |
CN107767863A (en) * | 2016-08-22 | 2018-03-06 | 科大讯飞股份有限公司 | voice awakening method, system and intelligent terminal |
CN106297792A (en) * | 2016-09-14 | 2017-01-04 | 厦门幻世网络科技有限公司 | The recognition methods of a kind of voice mouth shape cartoon and device |
CN108073978A (en) * | 2016-11-14 | 2018-05-25 | 顾泽苍 | A kind of constructive method of the ultra-deep learning model of artificial intelligence |
CN108073985A (en) * | 2016-11-14 | 2018-05-25 | 张素菁 | A kind of importing ultra-deep study method for voice recognition of artificial intelligence |
CN108205525B (en) * | 2016-12-20 | 2021-11-19 | 阿里巴巴集团控股有限公司 | Method and device for determining user intention based on user voice information |
CN108205525A (en) * | 2016-12-20 | 2018-06-26 | 阿里巴巴集团控股有限公司 | The method and apparatus that user view is determined based on user speech information |
CN106919702A (en) * | 2017-02-14 | 2017-07-04 | 北京时间股份有限公司 | Keyword method for pushing and device based on document |
CN106919702B (en) * | 2017-02-14 | 2020-02-11 | 北京时间股份有限公司 | Keyword pushing method and device based on document |
CN108694940A (en) * | 2017-04-10 | 2018-10-23 | 北京猎户星空科技有限公司 | A kind of audio recognition method, device and electronic equipment |
CN107086036A (en) * | 2017-04-19 | 2017-08-22 | 杭州派尼澳电子科技有限公司 | A kind of freeway tunnel method for safety monitoring |
CN107221326A (en) * | 2017-05-16 | 2017-09-29 | 百度在线网络技术(北京)有限公司 | Voice awakening method, device and computer equipment based on artificial intelligence |
CN107230475A (en) * | 2017-05-27 | 2017-10-03 | 腾讯科技(深圳)有限公司 | A kind of voice keyword recognition method, device, terminal and server |
CN107230475B (en) * | 2017-05-27 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Voice keyword recognition method and device, terminal and server |
US11062699B2 (en) | 2017-06-12 | 2021-07-13 | Ping An Technology (Shenzhen) Co., Ltd. | Speech recognition with trained GMM-HMM and LSTM models |
WO2018227780A1 (en) * | 2017-06-12 | 2018-12-20 | 平安科技(深圳)有限公司 | Speech recognition method and device, computer device and storage medium |
CN107393539A (en) * | 2017-07-17 | 2017-11-24 | 傅筱萸 | A kind of sound cipher control method |
CN107680597A (en) * | 2017-10-23 | 2018-02-09 | 平安科技(深圳)有限公司 | Audio recognition method, device, equipment and computer-readable recording medium |
WO2019080248A1 (en) * | 2017-10-23 | 2019-05-02 | 平安科技(深圳)有限公司 | Speech recognition method, device, and apparatus, and computer readable storage medium |
CN109074822A (en) * | 2017-10-24 | 2018-12-21 | 深圳和而泰智能控制股份有限公司 | Specific sound recognition methods, equipment and storage medium |
CN108389575A (en) * | 2018-01-11 | 2018-08-10 | 苏州思必驰信息科技有限公司 | Audio data recognition methods and system |
CN108389575B (en) * | 2018-01-11 | 2020-06-26 | 苏州思必驰信息科技有限公司 | Audio data identification method and system |
CN108417207A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | A kind of depth mixing generation network self-adapting method and system |
CN110097870A (en) * | 2018-01-30 | 2019-08-06 | 阿里巴巴集团控股有限公司 | Method of speech processing, device, equipment and storage medium |
CN108305617A (en) * | 2018-01-31 | 2018-07-20 | 腾讯科技(深圳)有限公司 | The recognition methods of voice keyword and device |
CN108305617B (en) * | 2018-01-31 | 2020-09-08 | 腾讯科技(深圳)有限公司 | Method and device for recognizing voice keywords |
CN110444195B (en) * | 2018-01-31 | 2021-12-14 | 腾讯科技(深圳)有限公司 | Method and device for recognizing voice keywords |
US11222623B2 (en) | 2018-01-31 | 2022-01-11 | Tencent Technology (Shenzhen) Company Limited | Speech keyword recognition method and apparatus, computer-readable storage medium, and computer device |
WO2019149108A1 (en) * | 2018-01-31 | 2019-08-08 | 腾讯科技(深圳)有限公司 | Identification method and device for voice keywords, computer-readable storage medium, and computer device |
CN110444195A (en) * | 2018-01-31 | 2019-11-12 | 腾讯科技(深圳)有限公司 | The recognition methods of voice keyword and device |
CN110444193A (en) * | 2018-01-31 | 2019-11-12 | 腾讯科技(深圳)有限公司 | The recognition methods of voice keyword and device |
CN108538285A (en) * | 2018-03-05 | 2018-09-14 | 清华大学 | A kind of various keyword detection method based on multitask neural network |
CN108538285B (en) * | 2018-03-05 | 2021-05-04 | 清华大学 | Multi-instance keyword detection method based on multitask neural network |
CN108564941A (en) * | 2018-03-22 | 2018-09-21 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, equipment and storage medium |
US11450312B2 (en) | 2018-03-22 | 2022-09-20 | Tencent Technology (Shenzhen) Company Limited | Speech recognition method, apparatus, and device, and storage medium |
CN108564941B (en) * | 2018-03-22 | 2020-06-02 | 腾讯科技(深圳)有限公司 | Voice recognition method, device, equipment and storage medium |
CN108898076A (en) * | 2018-06-13 | 2018-11-27 | 北京大学深圳研究生院 | The method that a kind of positioning of video behavior time shaft and candidate frame extract |
CN109065032A (en) * | 2018-07-16 | 2018-12-21 | 杭州电子科技大学 | A kind of external corpus audio recognition method based on depth convolutional neural networks |
CN109086387A (en) * | 2018-07-26 | 2018-12-25 | 上海慧子视听科技有限公司 | A kind of audio stream methods of marking, device, equipment and storage medium |
CN108922521A (en) * | 2018-08-15 | 2018-11-30 | 合肥讯飞数码科技有限公司 | A kind of voice keyword retrieval method, apparatus, equipment and storage medium |
CN110837758A (en) * | 2018-08-17 | 2020-02-25 | 杭州海康威视数字技术股份有限公司 | Keyword input method and device and electronic equipment |
CN110837758B (en) * | 2018-08-17 | 2023-06-02 | 杭州海康威视数字技术股份有限公司 | Keyword input method and device and electronic equipment |
CN109215647A (en) * | 2018-08-30 | 2019-01-15 | 出门问问信息科技有限公司 | Voice awakening method, electronic equipment and non-transient computer readable storage medium |
CN109300279A (en) * | 2018-10-01 | 2019-02-01 | 厦门快商通信息技术有限公司 | A kind of shop security monitoring method |
CN109243446A (en) * | 2018-10-01 | 2019-01-18 | 厦门快商通信息技术有限公司 | A kind of voice awakening method based on RNN network |
CN109559735B (en) * | 2018-10-11 | 2023-10-27 | 平安科技(深圳)有限公司 | Voice recognition method, terminal equipment and medium based on neural network |
CN109559735A (en) * | 2018-10-11 | 2019-04-02 | 平安科技(深圳)有限公司 | A kind of audio recognition method neural network based, terminal device and medium |
CN109273003A (en) * | 2018-11-20 | 2019-01-25 | 苏州思必驰信息科技有限公司 | Sound control method and system for automobile data recorder |
CN109273003B (en) * | 2018-11-20 | 2021-11-02 | 思必驰科技股份有限公司 | Voice control method and system for automobile data recorder |
CN110503970A (en) * | 2018-11-23 | 2019-11-26 | 腾讯科技(深圳)有限公司 | A kind of audio data processing method, device and storage medium |
CN110503970B (en) * | 2018-11-23 | 2021-11-23 | 腾讯科技(深圳)有限公司 | Audio data processing method and device and storage medium |
CN111354352B (en) * | 2018-12-24 | 2023-07-14 | 中国科学院声学研究所 | Automatic template cleaning method and system for audio retrieval |
CN111354352A (en) * | 2018-12-24 | 2020-06-30 | 中国科学院声学研究所 | Automatic template cleaning method and system for audio retrieval |
CN109545190A (en) * | 2018-12-29 | 2019-03-29 | 联动优势科技有限公司 | A kind of audio recognition method based on keyword |
CN110322871A (en) * | 2019-05-30 | 2019-10-11 | 清华大学 | A kind of sample keyword retrieval method based on acoustics characterization vector |
CN110223678A (en) * | 2019-06-12 | 2019-09-10 | 苏州思必驰信息科技有限公司 | Audio recognition method and system |
CN110689880A (en) * | 2019-10-21 | 2020-01-14 | 国家电网公司华中分部 | Voice recognition method and device applied to power dispatching field |
CN110930997B (en) * | 2019-12-10 | 2022-08-16 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN110930997A (en) * | 2019-12-10 | 2020-03-27 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN111508498B (en) * | 2020-04-09 | 2024-01-30 | 携程计算机技术(上海)有限公司 | Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium |
CN111508498A (en) * | 2020-04-09 | 2020-08-07 | 携程计算机技术(上海)有限公司 | Conversational speech recognition method, system, electronic device and storage medium |
CN111508475A (en) * | 2020-04-16 | 2020-08-07 | 五邑大学 | Robot awakening voice keyword recognition method and device and storage medium |
CN111508475B (en) * | 2020-04-16 | 2022-08-09 | 五邑大学 | Robot awakening voice keyword recognition method and device and storage medium |
CN113658596A (en) * | 2020-04-29 | 2021-11-16 | 扬智科技股份有限公司 | Semantic identification method and semantic identification device |
CN111833888A (en) * | 2020-07-24 | 2020-10-27 | 清华大学 | Near sensor processing system, circuit and method for voice keyword recognition |
CN111833888B (en) * | 2020-07-24 | 2022-11-11 | 清华大学 | Near sensor processing system, circuit and method for voice keyword recognition |
CN112735469B (en) * | 2020-10-28 | 2024-05-17 | 西安电子科技大学 | Low-memory voice keyword detection method, system, medium, equipment and terminal |
CN112735469A (en) * | 2020-10-28 | 2021-04-30 | 西安电子科技大学 | Low-memory voice keyword detection method, system, medium, device and terminal |
CN112750445A (en) * | 2020-12-30 | 2021-05-04 | 标贝(北京)科技有限公司 | Voice conversion method, device and system and storage medium |
CN112750445B (en) * | 2020-12-30 | 2024-04-12 | 标贝(青岛)科技有限公司 | Voice conversion method, device and system and storage medium |
CN113035231A (en) * | 2021-03-18 | 2021-06-25 | 三星(中国)半导体有限公司 | Keyword detection method and device |
CN113035231B (en) * | 2021-03-18 | 2024-01-09 | 三星(中国)半导体有限公司 | Keyword detection method and device |
CN113361647A (en) * | 2021-07-06 | 2021-09-07 | 青岛洞听智能科技有限公司 | Method for identifying type of missed call |
CN113888846A (en) * | 2021-09-27 | 2022-01-04 | 深圳市研色科技有限公司 | Method and device for reminding driving in advance |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105679316A (en) | Voice keyword identification method and apparatus based on deep neural network | |
CN108305634B (en) | Decoding method, decoder and storage medium | |
Tripathi et al. | Deep learning based emotion recognition system using speech features and transcriptions | |
Momeni et al. | Seeing wake words: Audio-visual keyword spotting | |
Polzehl et al. | Anger recognition in speech using acoustic and linguistic cues | |
US8069042B2 (en) | Using child directed speech to bootstrap a model based speech segmentation and recognition system | |
CN106847259B (en) | Method for screening and optimizing audio keyword template | |
CN105654940B (en) | Speech synthesis method and device | |
CN105336324A (en) | Language identification method and device | |
CN111737991B (en) | Text sentence breaking position identification method and system, electronic equipment and storage medium | |
CN112233680A (en) | Speaker role identification method and device, electronic equipment and storage medium | |
US10446136B2 (en) | Accent invariant speech recognition | |
EP4002354B1 (en) | Method and system for automatic speech recognition in resource constrained devices | |
CN114722822B (en) | Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium | |
CN112331207B (en) | Service content monitoring method, device, electronic equipment and storage medium | |
CN111753524A (en) | Text sentence break position identification method and system, electronic device and storage medium | |
Bhati et al. | Self-expressing autoencoders for unsupervised spoken term discovery | |
Bhati et al. | Unsupervised Acoustic Segmentation and Clustering Using Siamese Network Embeddings. | |
Bhati et al. | Phoneme based embedded segmental k-means for unsupervised term discovery | |
Aronowitz et al. | Context and uncertainty modeling for online speaker change detection | |
Musaev et al. | Automatic recognition of Uzbek speech based on integrated neural networks | |
CN112037772B (en) | Response obligation detection method, system and device based on multiple modes | |
CN112309398B (en) | Method and device for monitoring working time, electronic equipment and storage medium | |
CN114360584A (en) | Phoneme-level-based speech emotion layered recognition method and system | |
CN111429919B (en) | Crosstalk prevention method based on conference real recording system, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 2016-06-15