CN105679316A - Voice keyword identification method and apparatus based on deep neural network - Google Patents
- Publication number: CN105679316A (application number CN201511016642.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- neural network
- keyword
- deep neural
- posterior probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/54—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
Abstract
The invention provides a voice keyword identification method and apparatus based on a deep neural network. The method comprises: framing the voice to be identified to obtain a plurality of voice frames; performing feature extraction on each voice frame to obtain a Mel-frequency cepstral coefficient (MFCC) sequence for each voice frame; inputting the MFCC sequence of each voice frame into a preset deep neural network model in parallel; calculating, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model; forming, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames; monitoring the posterior probability sequence under each neural unit of the output layer; and determining the keyword of the voice to be identified according to the comparison between each posterior probability sequence and a preset threshold probability sequence. Because a pre-trained deep neural network performs the keyword identification, the method and apparatus improve identification speed and alleviate the problem of identification delay.
Description
Technical field
The present invention relates to the technical field of voice keyword recognition, and in particular to a voice keyword recognition method and apparatus based on a deep neural network.
Background technology
At present, with the widespread use of smart products, the improvement of storage device performance and capacity, and the rapid development of networks and communications, voice has become a powerful carrier of information, so voice processing and utilization technologies receive more and more attention. Voice keyword recognition refers to identifying a given keyword in a given segment of voice and indicating its position; it is an important branch of speech recognition technology and an effective solution for processing natural speech and realizing human-machine voice interaction. Voice keyword identification is widely used in many application scenarios, such as voice query systems, speech search systems, and real-time voice command control systems, which do not need to recognize word for word all the content the voice contains, but only need to identify predetermined keywords in the given voice. Voice keyword recognition technology therefore has broad application prospects and has become a research focus in the field of speech recognition.
Currently, the related art provides model-based voice keyword recognition technologies. For example, in large-vocabulary continuous speech recognition, the speech signal must first be converted into text by a speech recognizer, and then a text search is performed for the given keyword; this technology can only perform the conversion after a whole segment of continuous voice has been completely input. As another example, keyword identification based on keyword models and filler models must model all non-keywords with filler models and the keywords with keyword models, and likewise can only determine the keywords of a whole segment of continuous voice after it has been completely input.
In the process of realizing the present invention, the inventor found that the related art has at least the following problem: current voice keyword recognition technology suffers from identification delay, and therefore cannot realize timely and fast human-computer interaction.
Summary of the invention
In view of this, the object of the embodiments of the present invention is to provide a voice keyword recognition method and apparatus based on a deep neural network, so as to solve the identification delay problem in voice keyword recognition technology, improve the identification speed of voice keywords, and realize timely and fast human-computer interaction.
In a first aspect, an embodiment of the present invention provides a voice keyword recognition method based on a deep neural network, the recognition method comprising:
framing the input voice to be identified to obtain a plurality of voice frames;
performing feature extraction on each voice frame to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each voice frame;
inputting the MFCC sequence of each voice frame into a preset deep neural network model in parallel, calculating, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model, and forming, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames, wherein each neural unit of the output layer corresponds to one keyword;
monitoring the posterior probability sequence under each neural unit of the output layer; and
determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and a preset threshold probability sequence.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, wherein the preset deep neural network model is established in the following manner:
using a deep learning method to perform deep neural network training on selected voice sample data to obtain the preset deep neural network model, wherein the deep neural network model comprises: an input layer composed of neural units corresponding to the MFCC sequence, hidden layers composed of nonlinear mapping units, and an output layer composed of neural units corresponding to the posterior probability of each keyword.
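As an illustrative sketch (not part of the original disclosure), the three-part structure described above — an MFCC input layer, hidden layers of nonlinear mapping units, and one output unit per keyword — can be written as a small feed-forward network. The hidden-layer widths, the ReLU nonlinearity, and the ten-keyword output are assumptions for demonstration only:

```python
import numpy as np

def init_dnn(n_input=39, n_hidden=(128, 128, 128), n_keywords=10, seed=0):
    """Initialize weights for a 39 -> hidden -> keyword-posterior network."""
    rng = np.random.default_rng(seed)
    sizes = (n_input, *n_hidden, n_keywords)
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Map one 39-dim MFCC vector to a posterior over the keywords."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)   # nonlinear mapping units (ReLU assumed)
    W, b = params[-1]
    z = h @ W + b
    e = np.exp(z - z.max())              # softmax so outputs behave as posteriors
    return e / e.sum()

params = init_dnn()
posterior = forward(params, np.zeros(39))
```

A softmax output layer is used here so that the output-layer values are non-negative and sum to one, matching their interpretation as posterior probabilities.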
With reference to the first possible implementation of the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, wherein using a deep learning method to perform deep neural network training on the selected voice sample data to obtain the preset deep neural network model comprises:
training hidden Markov models (HMMs) and Gaussian mixture models according to the selected voice sample data, wherein the HMMs correspond one-to-one to the selected voice sample data, and the Gaussian mixture models are used to describe the output probability distributions of the HMM states;
using a Viterbi decoding algorithm, with the trained HMMs and Gaussian mixture models, to perform initial-frame and end-frame alignment processing on the selected voice sample data, thereby determining the boundary information of the voice sample data; and
training the preset deep neural network model according to the voice information and text content of the voice sample data and the boundary information of the voice sample data.
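The alignment step can be sketched with a toy Viterbi forced alignment; the two-state, six-frame example and the log-likelihood values below are invented for illustration, and in practice the per-frame scores would come from the trained HMM/GMM:

```python
import numpy as np

def viterbi_align(log_lik, log_trans):
    """Forced alignment: best left-to-right state path through the frames.

    log_lik[t, s]: frame-level log-likelihood of state s (e.g. from a GMM).
    log_trans[s, s2]: log transition probability between states.
    Returns one state index per frame; boundaries fall where the state changes.
    """
    T, S = log_lik.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0, 0] = log_lik[0, 0]             # path must start in the first state
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + log_lik[t, s]
    path = [S - 1]                          # path must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

On a toy input whose first three frames favor state 0 and last three favor state 1, the recovered path places the state boundary between frames 2 and 3, which is exactly the boundary information the training step needs.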
With reference to the second possible implementation of the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, wherein, after using the deep learning method to perform deep neural network training on the selected voice sample data to obtain the preset deep neural network model, the method further comprises:
monitoring the posterior probability of each voice sample under each neural unit of the output layer of the trained preset deep neural network model;
judging whether the posterior probability of each voice sample is maximal under its corresponding neural unit; and
if not, adjusting the parameters of the preset deep neural network model by a back-propagation algorithm until the posterior probability of every voice sample is maximal under its corresponding neural unit.
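The monitor-judge-adjust loop above can be illustrated on a minimal model; here a single softmax layer stands in for the full deep network, and the random data, learning rate, and iteration cap are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 39))        # 20 sample frames, 39-dim MFCC each
y = rng.integers(0, 3, size=20)          # target keyword unit per sample
W = np.zeros((39, 3))

def posteriors(W, X):
    z = X @ W
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Keep adjusting by gradient descent (back-propagation through the softmax
# layer) until every sample's posterior is maximal under its own unit.
for _ in range(2000):
    P = posteriors(W, X)
    if (P.argmax(axis=1) == y).all():    # the "judging" step
        break
    G = P.copy()
    G[np.arange(len(y)), y] -= 1.0       # cross-entropy gradient
    W -= 0.5 * (X.T @ G) / len(y)        # the "adjusting" step

trained = bool((posteriors(W, X).argmax(axis=1) == y).all())
```

The loop terminates as soon as the judging condition holds for all samples, mirroring the stopping criterion stated in the text.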
With reference to the first aspect or any one of the first to third possible implementations of the first aspect, an embodiment of the present invention provides a fourth possible implementation of the first aspect, wherein the recognition method further comprises:
scoring the identified keyword with the corresponding hidden Markov model, and calculating the likelihood probability of the keyword under the hidden Markov model; and
if the likelihood probability is greater than a preset threshold, determining that the recognition result is true.
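The likelihood scoring step corresponds to the standard HMM forward algorithm. The sketch below assumes the per-frame state log-likelihoods have already been computed (e.g. by the Gaussian mixture models); it is an illustration, not the patent's specific implementation:

```python
import numpy as np

def hmm_log_likelihood(log_lik, log_trans, log_init):
    """Forward algorithm: total log-likelihood of a frame sequence under an HMM.

    log_lik[t, s]: log-likelihood of frame t under state s (e.g. from a GMM).
    log_trans[s, s2]: log transition probabilities; log_init[s]: log start probs.
    """
    alpha = log_init + log_lik[0]
    for t in range(1, len(log_lik)):
        m = alpha.max()                  # log-sum-exp over previous states
        alpha = np.log(np.exp(alpha - m) @ np.exp(log_trans)) + m + log_lik[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())
```

Comparing the returned log-likelihood against a preset threshold then confirms or rejects the keyword hypothesis, as in the implementation above.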
With reference to the fourth possible implementation of the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, wherein determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and the preset threshold probability sequence comprises:
judging whether the posterior probability sequence contains a continuous numerical sub-segment whose values are all greater than the preset threshold probability sequence;
if so, judging whether the time length between the initial frame and the end frame corresponding to the continuous numerical sub-segment is greater than a preset time; and
when the time length between the initial frame and the end frame corresponding to the continuous numerical sub-segment is greater than the preset time, taking the keyword corresponding to the neural unit to which the continuous numerical sub-segment belongs as the keyword represented by the input voice to be identified.
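The judgments above amount to searching each posterior sequence for a sufficiently long run of above-threshold values. A minimal sketch, with the threshold and minimum run length (the preset time converted into a frame count) left as free parameters, since the patent does not fix concrete values:

```python
def detect_keyword(posterior_seq, threshold, min_frames):
    """Return (start, end) frame indices of the first run of posteriors that
    stays above `threshold` for at least `min_frames` frames, else None."""
    start = None
    for i, p in enumerate(list(posterior_seq) + [0.0]):  # sentinel ends a run
        if p > threshold:
            if start is None:
                start = i                 # continuous sub-segment begins
        else:
            if start is not None and i - start >= min_frames:
                return (start, i - 1)     # long enough: keyword detected
            start = None                  # too short: discard the run
    return None
```

With a 10 ms frame shift, `min_frames = 30` would correspond to a preset time of roughly 300 ms.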
In a second aspect, an embodiment of the present invention further provides a voice keyword identification apparatus based on a deep neural network, the identification apparatus comprising:
a voice framing module, configured to frame the input voice to be identified to obtain a plurality of voice frames;
a feature extraction module, configured to perform feature extraction on each voice frame to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each voice frame;
a probability calculation module, configured to input the MFCC sequence of each voice frame into a preset deep neural network model in parallel, to calculate, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model, and to form, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames, wherein each neural unit of the output layer corresponds to one keyword;
a monitoring module, configured to monitor the posterior probability sequence under each neural unit of the output layer; and
a keyword identification module, configured to determine the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and a preset threshold probability sequence.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation of the second aspect, wherein the preset deep neural network model is established by the following module:
a model determination module, configured to use a deep learning method to perform deep neural network training on selected voice sample data to obtain the preset deep neural network model, wherein the deep neural network model comprises: an input layer composed of neural units corresponding to the MFCC sequence, hidden layers composed of nonlinear mapping units, and an output layer composed of neural units corresponding to the posterior probability of each keyword.
With reference to the first possible implementation of the second aspect, an embodiment of the present invention provides a second possible implementation of the second aspect, wherein the model determination module comprises:
a training unit, configured to train hidden Markov models (HMMs) and Gaussian mixture models according to the selected voice sample data, wherein the HMMs correspond one-to-one to the selected voice sample data, and the Gaussian mixture models are used to describe the output probability distributions of the HMM states;
an alignment processing unit, configured to use a Viterbi decoding algorithm, with the trained HMMs and Gaussian mixture models, to perform initial-frame and end-frame alignment processing on the selected voice sample data, thereby determining the boundary information of the voice sample data; and
a model determining unit, configured to train the preset deep neural network model according to the voice information and text content of the voice sample data and the boundary information of the voice sample data.
With reference to the second possible implementation of the second aspect, an embodiment of the present invention provides a third possible implementation of the second aspect, wherein the identification apparatus further comprises:
a monitoring module, configured to monitor the posterior probability of each voice sample under each neural unit of the output layer of the trained preset deep neural network model;
a judging module, configured to judge whether the posterior probability of each voice sample is maximal under its corresponding neural unit; and
a fine-tuning module, configured to, when the posterior probability of a voice sample is judged not to be maximal under its corresponding neural unit, adjust the parameters of the preset deep neural network model by a back-propagation algorithm until the posterior probability of every voice sample is maximal under its corresponding neural unit.
With reference to the second aspect or any one of the first to third possible implementations of the second aspect, an embodiment of the present invention provides a fourth possible implementation of the second aspect, wherein the identification apparatus further comprises:
a scoring module, configured to score the identified keyword with the corresponding hidden Markov model and to calculate the likelihood probability of the keyword under the hidden Markov model; and
a recognition result confirmation module, configured to determine that the recognition result is true if the likelihood probability is greater than a preset threshold.
With reference to the fourth possible implementation of the second aspect, an embodiment of the present invention provides a fifth possible implementation of the second aspect, wherein the keyword identification module comprises:
a first judging unit, configured to judge whether the posterior probability sequence contains a continuous numerical sub-segment whose values are all greater than the preset threshold probability sequence;
a second judging unit, configured to, when it is judged that the posterior probability sequence contains a continuous numerical sub-segment whose values are all greater than the preset threshold probability sequence, judge whether the time length between the initial frame and the end frame corresponding to the continuous numerical sub-segment is greater than a preset time; and
a keyword determining unit, configured to, when the time length between the initial frame and the end frame corresponding to the continuous numerical sub-segment is greater than the preset time, take the keyword corresponding to the neural unit to which the continuous numerical sub-segment belongs as the keyword represented by the input voice to be identified.
In the voice keyword recognition method and apparatus based on a deep neural network provided by the embodiments of the present invention, the method comprises: first, performing framing processing on the input voice to be identified, and performing feature extraction on the resulting plurality of voice frames to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each voice frame; then, inputting the MFCC sequence of each voice frame into a preset deep neural network model in parallel, calculating, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model, and forming, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames; finally, monitoring the posterior probability sequence under each neural unit of the output layer, and determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and a preset threshold probability sequence. In the embodiments of the present invention, a pre-trained deep neural network performs the voice keyword identification, which improves the identification speed of voice keywords and alleviates the identification delay problem, so that timely and fast human-computer interaction can be realized.
To make the above objects, features and advantages of the present invention more apparent and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be regarded as limiting the scope; for those of ordinary skill in the art, other relevant drawings can be obtained from these drawings without creative work.
Fig. 1 shows a flow chart of a voice keyword recognition method based on a deep neural network provided by an embodiment of the present invention;
Fig. 2 shows a flow chart of another voice keyword recognition method based on a deep neural network provided by an embodiment of the present invention;
Fig. 3 shows a structural diagram of a voice keyword identification apparatus based on a deep neural network provided by an embodiment of the present invention;
Fig. 4 shows a structural diagram of another voice keyword identification apparatus based on a deep neural network provided by an embodiment of the present invention.
Detailed description of embodiments
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the drawings herein may be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the claimed scope of the present invention, but merely represents selected embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative work fall within the protection scope of the present invention.
Considering that current voice keyword recognition technology in the related art suffers from identification delay and therefore cannot realize timely and fast human-computer interaction, the embodiments of the present invention provide a voice keyword recognition method and apparatus based on a deep neural network, which are described below by way of embodiments.
As shown in Fig. 1, an embodiment of the present invention provides a voice keyword recognition method based on a deep neural network, the method comprising steps S102-S110, specifically as follows:
Step S102: framing the input voice to be identified to obtain a plurality of voice frames;
Here, the input voice to be identified is first subjected to framing processing. The time length of each voice frame may be set to 25 ms with a frame shift of 10 ms; that is, the input voice to be identified is divided into a plurality of voice frames according to a preset framing scheme.
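With the 25 ms window and 10 ms shift given above, the framing step can be sketched as follows; the 16 kHz sampling rate is an assumption not stated in the text:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames (25 ms window, 10 ms shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(16000))   # one second of audio -> 98 frames
```

One second of 16 kHz audio yields 98 frames of 400 samples each, since consecutive frames overlap by 15 ms.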
Step S104: performing feature extraction on each voice frame to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each voice frame;
Specifically, feature extraction is performed on the plurality of voice frames obtained after framing, extracting the discriminative components of the audio signal of each voice frame to obtain the MFCC sequence corresponding to each voice frame. This MFCC sequence has 39 dimensions, and the 39-dimensional MFCC sequence corresponding to each voice frame serves as the input feature of the input layer of the preset deep neural network; accordingly, the input layer of the preset deep neural network is set to 39 neural units.
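A simplified sketch of 39-dimensional MFCC extraction (13 cepstral coefficients plus delta and delta-delta features) is shown below; the filterbank size, FFT length, and the use of `np.gradient` for the deltas are illustrative choices, not specified by the patent:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        for k in range(l, c):
            fb[j, k] = (k - l) / max(c - l, 1)   # rising edge of the triangle
        for k in range(c, r):
            fb[j, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc_39(frames, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    """13 cepstral coefficients per frame, plus delta and delta-delta -> 39 dims."""
    win = np.hamming(frames.shape[1])
    spec = np.abs(np.fft.rfft(frames * win, n_fft)) ** 2   # power spectrum
    logmel = np.log(spec @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    ceps = logmel @ dct.T                                  # DCT-II -> cepstrum
    delta = np.gradient(ceps, axis=0)                      # simple delta features
    return np.hstack([ceps, delta, np.gradient(delta, axis=0)])
```

Each input frame thus yields one 39-dimensional feature vector, matching the 39 neural units of the input layer.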
Step S106: inputting the MFCC sequence of each voice frame into the preset deep neural network model in parallel, calculating, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model, and forming, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames, wherein each neural unit of the output layer corresponds to one keyword;
Specifically, the 39-dimensional MFCC sequence of each voice frame obtained by feature extraction serves as the input-layer feature of the preset deep neural network. The neural units of the input layer are mutually independent; after the 39-dimensional MFCC sequences of the input voice frames are processed in parallel, they are passed to the hidden layers of the preset deep neural network. The hidden layers may be 3 layers, composed of nonlinear mapping units. The 39-dimensional MFCC sequence of each voice frame is input in turn, and the posterior probability of the MFCC sequence of each voice frame is calculated under each neural unit of the output layer of the preset deep neural network model. The neural units of the output layer are also mutually independent, so parallel recognition can be realized. Since a segment of voice becomes a plurality of voice frames after framing, and the 39-dimensional MFCC sequence of each voice frame obtained by feature extraction serves as the input, for one segment of input voice to be identified there exists a posterior probability sequence under each neural unit of the output layer.
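Because the frames are independent at the input layer, the forward pass can be computed for all frames at once; column `P[:, k]` then gives the posterior probability sequence under output unit k. The weights below are random placeholders, and the three 128-unit hidden layers and five keywords are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_keywords = 50, 5
layers = [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
          for a, b in zip((39, 128, 128, 128), (128, 128, 128, n_keywords))]

H = rng.standard_normal((n_frames, 39))   # one 39-dim MFCC row per frame
for W, b in layers[:-1]:
    H = np.maximum(H @ W + b, 0.0)        # 3 hidden layers, all frames at once
W, b = layers[-1]
Z = H @ W + b
P = np.exp(Z - Z.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)         # P[:, k]: posterior sequence of unit k
```

Each row of `P` is the keyword posterior for one frame, and each column is the frame-by-frame posterior probability sequence that step S108 monitors.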
Step S108: monitoring the posterior probability sequence under each neural unit of the output layer;
Step S110: determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and a preset threshold probability sequence.
Here, the posterior probability values of the voice frames of the input voice under the corresponding neural unit in the preset deep neural network should be greater than the preset threshold. Therefore, under the neural unit corresponding to the voice frames of the input voice there exists a continuous numerical sub-segment whose values are all greater than the preset threshold probability sequence, from which it can be judged that the keyword corresponding to that output-layer neural unit is the keyword of this segment of input voice to be identified.
For example, suppose the input voice to be identified is "please turn left". First, the input voice "please turn left" is framed, then feature extraction is performed on each voice frame, and the MFCC sequences of the voice frames of "please turn left" are input into the preset deep neural network model. When this model identifies "please turn left", a run of relatively large posterior probability values should appear under the output-layer neural unit representing "turn left", while under the other output-layer neural units no such run appears, or only several intermittent larger posterior probability values appear.
In the voice keyword recognition method based on a deep neural network provided by the embodiment of the present invention, the method comprises: first, performing framing processing on the input voice to be identified, and performing feature extraction on the resulting plurality of voice frames to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each voice frame; then, inputting the MFCC sequence of each voice frame into a preset deep neural network model in parallel, calculating, for the MFCC sequence of each voice frame, the posterior probability under each neural unit of the output layer of the preset deep neural network model, and forming, from the posterior probabilities under each neural unit of the output layer, the posterior probability sequences corresponding to the plurality of voice frames; finally, monitoring the posterior probability sequence under each neural unit of the output layer, and determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and a preset threshold probability sequence. In the embodiment of the present invention, a pre-trained deep neural network performs the voice keyword identification, which improves the identification speed of voice keywords and alleviates the identification delay problem, so that timely and fast human-computer interaction can be realized.
Considering that similar keywords may exist, it is not enough to judge whether the posterior probability sequence contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence; it is also necessary to judge whether the time length between the initial frame and the end frame of that continuous sub-segment exceeds a preset duration. Based on this, determining the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and the preset threshold sequence comprises:
Judging whether the posterior probability sequence contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence;
If so, judging whether the time length between the initial frame and the end frame of the continuous numerical sub-segment exceeds the preset duration;
When that time length exceeds the preset duration, taking the keyword corresponding to the neuron to which the continuous numerical sub-segment belongs as the keyword expressed by the input voice to be identified.
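The two-stage check above — find a continuous run of posteriors above the threshold, then apply a minimum-duration test to that run — can be sketched in Python as follows (a hypothetical helper; a single scalar threshold and a frame count stand in for the preset threshold sequence and the preset duration):

```python
def find_keyword_segment(posteriors, prob_threshold, min_frames):
    """Scan one neuron's per-frame posterior sequence; return (start, end)
    of the first run of consecutive frames all above prob_threshold that
    also lasts at least min_frames, else None."""
    start = None
    for i, p in enumerate(posteriors + [0.0]):  # sentinel closes a trailing run
        if p > prob_threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                return (start, i - 1)
            start = None
    return None

# One long run (frames 2-5) and one isolated spike (frame 7):
seq = [0.1, 0.2, 0.9, 0.95, 0.92, 0.97, 0.3, 0.91, 0.1]
seg = find_keyword_segment(seq, prob_threshold=0.8, min_frames=3)  # -> (2, 5)
```

The isolated spike at frame 7 is rejected by the duration test, mirroring how the interrupted runs under the "turn right" neuron are rejected.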
Specifically, again taking the input voice "please turn left" as an example, an input voice with a similar keyword is "please turn right". When the input voice to be identified is "please turn left", runs of large posterior probability values appear both under the output-layer neuron of the preset deep neural network model representing "turn left" and under the neuron representing "turn right"; however, the run under the "turn left" neuron is long and continuous, whereas the large values under the "turn right" neuron are broken into two interrupted runs. The keyword in the input voice is therefore determined by additionally judging the time length between the initial frame and the end frame of the continuous numerical sub-segment.
In the embodiments provided by the present invention, in addition to judging whether the posterior probability sequence under each output-layer neuron contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence, it is also judged whether the time length between the initial frame and the end frame of that sub-segment exceeds the preset duration, which further improves the accuracy of voice keyword recognition.
Further, the above-mentioned preset deep neural network model is established as follows:
The selected voice sample data are subjected to deep neural network training using a deep learning method, yielding the preset deep neural network model, wherein the model comprises: an input layer composed of neurons corresponding to the MFCC sequence, hidden layers composed of nonlinear mapping units, and an output layer composed of neurons corresponding to the posterior probabilities of the keywords. The output-layer neurons comprise one neuron for each keyword, one neuron for ambient sound, and one neuron for non-keywords; that is, if the model is trained for N keywords, the output layer of the deep neural network model has N+2 neurons.
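A toy forward pass matching this architecture (illustrative only: the 13-dimensional MFCC input, the single 16-unit hidden layer, the tanh nonlinearity, and the random weights are all assumptions; with N = 3 keywords the softmax output layer has N + 2 = 5 units):

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, weights):
    """One pass through a toy MLP: tanh hidden layers (the nonlinear mapping
    units), then a softmax output layer over N keywords + ambient + filler."""
    h = x
    for W, b in weights[:-1]:
        h = [math.tanh(sum(wi * hi for wi, hi in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
    W, b = weights[-1]
    z = [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(W, b)]
    return softmax(z)

def layer(n_in, n_out):
    return ([[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

random.seed(0)
N = 3                                # keywords -> output layer has N + 2 units
net = [layer(13, 16), layer(16, N + 2)]
post = forward([0.1] * 13, net)      # posterior distribution for one MFCC frame
```

Feeding one such frame per time step yields the per-neuron posterior probability sequences monitored in the method.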
Specifically, subjecting the selected voice sample data to deep neural network training using the deep learning method to obtain the preset deep neural network model comprises:
Training a hidden Markov model (HMM) and a Gaussian mixture model (GMM) from the selected voice sample data, wherein the HMMs correspond one-to-one to the selected voice samples, and the GMM describes the output probability distribution of the HMM states;
Performing initial-frame and end-frame alignment on the selected voice sample data with a Viterbi decoding algorithm, using the trained HMM and GMM, to determine the boundary information of the voice sample data;
Training the preset deep neural network model from the audio, the text content, and the boundary information of the voice sample data.
In one implementation, each HMM has 9 states, and the output probability distribution of each state is described by a GMM with 8 mixture components. The HMMs and GMMs are trained with open-source software using the forward-backward (Baum-Welch) algorithm, the expectation-maximization (EM) algorithm, and the maximum-likelihood estimation (MLE) criterion.
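The alignment step can be illustrated with a minimal log-domain Viterbi decoder for a left-to-right HMM (a sketch, not the cited open-source tooling; the toy 2-state, 4-frame example below stands in for the 9-state keyword models, with per-frame state log-likelihoods assumed to come from the GMMs):

```python
def viterbi_align(log_b, log_a):
    """Minimal Viterbi for a left-to-right HMM.
    log_b[t][s]: log output probability of frame t in state s (from the GMMs);
    log_a[s][s2]: log transition probability. Returns the best state path,
    from which start/end frames of each state can be read off."""
    T, S = len(log_b), len(log_b[0])
    NEG = -1e18
    delta = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    delta[0][0] = log_b[0][0]           # must start in the first state
    for t in range(1, T):
        for s in range(S):
            best, arg = NEG, 0
            for s2 in (s - 1, s):       # left-to-right: advance or stay
                if s2 < 0:
                    continue
                v = delta[t - 1][s2] + log_a[s2][s]
                if v > best:
                    best, arg = v, s2
            delta[t][s] = best + log_b[t][s]
            back[t][s] = arg
    path = [S - 1]                      # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Frames 0-1 fit state 0, frames 2-3 fit state 1:
log_b = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
log_a = [[-0.7, -0.7], [-1e18, 0.0]]    # no backward transitions
path = viterbi_align(log_b, log_a)      # -> [0, 0, 1, 1]
```

The frame indices where the path enters and leaves each state give the boundary information used to trim redundant audio.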
When training the deep neural network model, the selected voice sample data serve as the training objects. Since this training is the core link of the whole voice keyword recognition process, selecting representative voice samples is very important for obtaining a model whose coverage is comprehensive and that generalizes well. For example, recordings of identical content spoken with different accents can be used as voice samples; if sufficient such samples are selected as training objects, the trained model captures the diversity of different accents for the same content, and using this deep neural network model in subsequent keyword recognition makes the results more accurate.
In the embodiments provided by the present invention, when training the deep neural network model, the selected voice sample data undergo initial-frame and end-frame alignment to determine their boundary information. Concretely, the Viterbi decoding algorithm, together with the trained HMM and GMM, is applied to the current voice sample to perform the alignment, i.e. the initial-frame and end-frame registration. Determining the boundary information improves the accuracy and speed of learning from the voice samples: for a sample such as "turn your head to the left" whose keyword is "turn left", the start and end positions of the keyword within the audio must be determined so that the redundant audio data can be removed. As a result, the trained deep neural network model can quickly and accurately locate the boundaries of a voice keyword during subsequent recognition and can determine a keyword immediately once it is identified, instead of performing recognition only after waiting for a whole segment of speech input, which further alleviates the voice keyword recognition delay problem.
In order to further optimize the trained deep neural network and thereby improve the accuracy of subsequent recognition, after the selected voice sample data have been subjected to deep neural network training with the deep learning method and the preset deep neural network model has been obtained, the method further comprises:
Monitoring the posterior probability of each voice sample under each neuron of the output layer of the trained preset deep neural network model;
Judging whether, for each voice sample, the posterior probability under the corresponding neuron is the maximum;
If not, adjusting the parameters of the preset deep neural network model with a back-propagation algorithm until, for every voice sample, the posterior probability under the corresponding neuron is the maximum.
For example, for the selected voice sample "please turn left", suppose that in the training frame sequence the text content of frames 15 to 75 is "turn left". When frame 15 of the sample is input, the maximum posterior probability value should be obtained under the output-layer neuron of the trained model that represents "turn left". If the maximum is instead obtained under another output-layer neuron, the parameters of the deep neural network model are adjusted with the back-propagation algorithm until inputting frame 15 yields the maximum posterior probability under the "turn left" neuron; frames 16, 17, ... up to frame 75 are then checked in the same way, so that for all of them the maximum posterior probability value is obtained under the "turn left" neuron of the trained model's output layer.
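The per-frame check driving this fine-tuning loop can be sketched as follows (a hypothetical helper; the 3-unit posterior rows are toy values, with unit 1 standing in for the "turn left" neuron):

```python
def frames_needing_adjustment(posterior_rows, target_unit):
    """For each frame of a labeled keyword segment, check whether the maximum
    posterior falls on the target output unit; return indices of frames where
    it does not (candidates for a further back-propagation pass)."""
    bad = []
    for t, row in enumerate(posterior_rows):
        if max(range(len(row)), key=row.__getitem__) != target_unit:
            bad.append(t)
    return bad

# Toy posteriors for three frames of a segment labeled with unit 1:
rows = [[0.1, 0.8, 0.1], [0.5, 0.3, 0.2], [0.05, 0.9, 0.05]]
misaligned = frames_needing_adjustment(rows, target_unit=1)  # -> [1]
```

Training would repeat back-propagation until this list is empty for every labeled sample.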
In the embodiments provided by the present invention, by adjusting the parameters of the trained deep neural network model so that each input speech frame attains its maximum posterior probability value under the corresponding output-layer neuron, the accuracy and speed of subsequent voice keyword recognition are further improved.
Further, in order to improve the accuracy of the voice keyword recognition result, as shown in Figure 2, the recognition method further comprises:
Step S112: scoring the identified keyword with the corresponding hidden Markov model, i.e. calculating the likelihood probability of the keyword under that HMM;
Step S114: if the likelihood probability is greater than a preset threshold, determining that the recognition result is true.
During the training of the deep neural network model, a corresponding hidden Markov model (HMM) and Gaussian mixture model (GMM) are trained from each labeled keyword voice sample, the HMM using the GMM to describe the feature space distribution. In the above steps, the identified keyword is first scored with its corresponding HMM, i.e. the likelihood probability of the keyword under that HMM is calculated. For example, again taking the input voice "please turn left", the identified keyword is "turn left", so the likelihood probability of this keyword under the HMM for "turn left" is calculated. The calculated likelihood is then compared with the preset threshold: if it is greater than the threshold, the recognition result is determined to be true, i.e. the keyword of the input voice is "turn left"; if it is less than the threshold, the recognition result is determined to be false and recognition must be performed again.
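The scoring step can be illustrated with a log-domain forward algorithm that computes the keyword's likelihood under its HMM and compares it with the threshold (a sketch under a toy single-state model; the log output probabilities, transitions, and threshold values are all assumed, not the patent's trained parameters):

```python
import math

def log_likelihood(frames_logb, log_a, log_pi):
    """Forward algorithm in the log domain: total log-likelihood of the
    keyword's frames under its HMM (per-frame output log-probs frames_logb,
    transition matrix log_a, initial distribution log_pi)."""
    def logsumexp(vals):
        m = max(vals)
        return m + math.log(sum(math.exp(v - m) for v in vals))
    S = len(log_pi)
    alpha = [log_pi[s] + frames_logb[0][s] for s in range(S)]
    for obs in frames_logb[1:]:
        alpha = [logsumexp([alpha[s2] + log_a[s2][s] for s2 in range(S)]) + obs[s]
                 for s in range(S)]
    return logsumexp(alpha)

def accept(frames_logb, log_a, log_pi, threshold):
    """True if the keyword's HMM score exceeds the verification threshold."""
    return log_likelihood(frames_logb, log_a, log_pi) > threshold

# Trivial single-state HMM, two frames each with log-prob -1.0 -> total -2.0
ll = log_likelihood([[-1.0], [-1.0]], [[0.0]], [0.0])
```

A keyword scoring -2.0 passes a threshold of -3.0 but fails a threshold of -1.0, triggering re-recognition.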
In the embodiments provided by the present invention, the recognition result is verified after a keyword has been identified: first, the identified keyword is scored with its corresponding hidden Markov model, i.e. its likelihood probability under that HMM is calculated; then the calculated likelihood is compared with the preset threshold, and whether the recognition result is correct is determined from the comparison, so that the accuracy of the voice keyword recognition result can be further improved.
In the deep-neural-network-based voice keyword recognition method and apparatus provided by the embodiments of the present invention, the method comprises: first, framing the input voice to be identified and performing feature extraction on the resulting speech frames, thereby obtaining the Mel-frequency cepstral coefficient (MFCC) sequence of each speech frame; then, inputting the MFCC sequence of each speech frame into the preset deep neural network model, calculating the posterior probability of each frame's MFCC sequence under each neuron of the model's output layer, and assembling the posterior probabilities under each output-layer neuron into the posterior probability sequence corresponding to the frames; finally, monitoring the posterior probability sequence under each output-layer neuron and determining the keyword of the input voice according to the comparison between the posterior probability sequence and the preset threshold sequence.
In the embodiments of the present invention, a pre-trained deep neural network performs the voice keyword recognition, which increases the recognition speed and alleviates the recognition delay, so that timely and fast human-computer interaction can be realized. Further, in addition to judging whether the posterior probability sequence under each output-layer neuron contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence, it is also judged whether the time length between the initial frame and the end frame of that sub-segment exceeds the preset duration, which further improves the recognition accuracy. Further, the recognition result is verified after a keyword has been identified: the keyword is scored with its corresponding hidden Markov model, i.e. its likelihood probability under that HMM is calculated and compared with the preset threshold, and the correctness of the recognition result is determined from the comparison, so that the accuracy of the voice keyword recognition result can be further improved.
Corresponding to the above recognition method, an embodiment of the present invention further provides a deep-neural-network-based voice keyword recognition apparatus. As shown in Figure 3, the recognition apparatus comprises:
A voice framing module 302, configured to frame the input voice to be identified into multiple speech frames;
A feature extraction module 304, configured to perform feature extraction on each speech frame to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each speech frame;
A probability calculation module 306, configured to input the MFCC sequence of each speech frame into the preset deep neural network model, calculate the posterior probability of each frame's MFCC sequence under each neuron of the model's output layer, and assemble the posterior probabilities under each output-layer neuron into the posterior probability sequence corresponding to the speech frames, wherein each output-layer neuron corresponds to one keyword;
A monitoring module 308, configured to monitor the posterior probability sequence under each output-layer neuron;
A keyword identification module 310, configured to determine the keyword of the input voice to be identified according to the comparison between the posterior probability sequence and the preset threshold sequence.
Further, the preset deep neural network model is established by the following module:
A model determination module, configured to subject the selected voice sample data to deep neural network training using a deep learning method to obtain the preset deep neural network model, wherein the model comprises: an input layer composed of neurons corresponding to the MFCC sequence, hidden layers composed of nonlinear mapping units, and an output layer composed of neurons corresponding to the posterior probabilities of the keywords.
Further, the model determination module can be realized by the following functional units:
A training unit, configured to train a hidden Markov model and a Gaussian mixture model from the selected voice sample data, wherein the HMMs correspond one-to-one to the selected voice samples, and the GMM describes the output probability distribution of the HMM states;
An alignment unit, configured to perform initial-frame and end-frame alignment on the selected voice sample data with a Viterbi decoding algorithm, using the trained HMM and GMM, to determine the boundary information of the voice sample data;
A model determining unit, configured to train the preset deep neural network model from the audio, the text content, and the boundary information of the voice sample data.
Further, the recognition apparatus also comprises:
A monitoring module, configured to monitor the posterior probability of each voice sample under each neuron of the output layer of the trained preset deep neural network model;
A judging module, configured to judge whether, for each voice sample, the posterior probability under the corresponding neuron is the maximum;
A fine-tuning module, configured to, when the posterior probability of a voice sample under the corresponding neuron is judged not to be the maximum, adjust the parameters of the preset deep neural network model with a back-propagation algorithm until, for every voice sample, the posterior probability under the corresponding neuron is the maximum.
Further, as shown in Figure 4, the recognition apparatus also comprises:
A scoring module 312, configured to score the identified keyword with the corresponding hidden Markov model, i.e. calculate the likelihood probability of the keyword under that HMM;
A recognition result confirmation module 314, configured to determine that the recognition result is true if the likelihood probability is greater than the preset threshold.
Further, the keyword identification module 310 comprises:
A first judging unit, configured to judge whether the posterior probability sequence contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence;
A second judging unit, configured to, when it is judged that such a continuous numerical sub-segment exists, judge whether the time length between the initial frame and the end frame of the sub-segment exceeds the preset duration;
A keyword determining unit, configured to, when that time length exceeds the preset duration, take the keyword corresponding to the neuron to which the continuous numerical sub-segment belongs as the keyword expressed by the input voice to be identified.
Based on the above analysis, compared with the related art, the voice keyword recognition apparatus provided by the embodiments of the present invention first frames the input voice to be identified and performs feature extraction on the resulting speech frames, thereby obtaining the MFCC sequence of each speech frame; then inputs the MFCC sequence of each speech frame into the preset deep neural network model, calculates the posterior probability of each frame's MFCC sequence under each neuron of the model's output layer, and assembles the posterior probabilities under each output-layer neuron into the posterior probability sequence corresponding to the frames; finally, monitors the posterior probability sequence under each output-layer neuron and determines the keyword of the input voice according to the comparison between the posterior probability sequence and the preset threshold sequence.
In the embodiments of the present invention, a pre-trained deep neural network performs the voice keyword recognition, which increases the recognition speed and alleviates the recognition delay, so that timely and fast human-computer interaction can be realized. Further, in addition to judging whether the posterior probability sequence under each output-layer neuron contains a continuous numerical sub-segment whose values all exceed the preset threshold sequence, it is also judged whether the time length between the initial frame and the end frame of that sub-segment exceeds the preset duration, which further improves the recognition accuracy. Further, the recognition result is verified after a keyword has been identified: the keyword is scored with its corresponding hidden Markov model, i.e. its likelihood probability under that HMM is calculated and compared with the preset threshold, and the correctness of the recognition result is determined from the comparison, so that the accuracy of the voice keyword recognition result can be further improved.
The voice keyword recognition apparatus provided by the embodiments of the present invention may be specific hardware on a device, or software or firmware installed on a device. The implementation principle and technical effects of the apparatus are the same as those of the foregoing method embodiments; for brevity, where the apparatus embodiments omit details, reference may be made to the corresponding content of the foregoing method embodiments. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may all refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments provided by the present invention may be integrated into one processing unit, may exist physically as separate units, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and comprises instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be noted that similar reference numerals and letters denote similar items in the accompanying drawings; therefore, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings. In addition, the terms "first", "second", "third", etc. are used only for distinguishing descriptions and shall not be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, intended to illustrate rather than limit its technical solution, and the scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that anyone familiar with the technical field may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or replace some of the technical features with equivalents; and such modifications, changes, or replacements do not cause the essence of the corresponding technical solution to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be encompassed within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.
Claims (12)
1. A voice keyword recognition method based on a deep neural network, characterized by comprising:
Framing input voice to be identified to obtain multiple speech frames;
Performing feature extraction on each of said speech frames to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each of said speech frames;
Inputting the MFCC sequence of each of said speech frames into a preset deep neural network model, calculating the posterior probability of the MFCC sequence of each of said speech frames under each neuron of the output layer of said preset deep neural network model, and composing, from the posterior probabilities under each neuron of said output layer, the posterior probability sequence corresponding to said multiple speech frames, wherein each neuron of the output layer corresponds to one keyword;
Monitoring said posterior probability sequence under each neuron of the output layer;
Determining the keyword of said input voice to be identified according to the comparison result between said posterior probability sequence and a preset threshold probability sequence.
2. The voice keyword recognition method based on a deep neural network according to claim 1, characterized in that said preset deep neural network model is established in the following manner:
Subjecting selected voice sample data to deep neural network training using a deep learning method to obtain the preset deep neural network model, wherein said deep neural network model comprises: an input layer composed of neurons corresponding to the MFCC sequence, hidden layers composed of nonlinear mapping units, and an output layer composed of neurons corresponding to the posterior probabilities of the keywords.
3. The voice keyword recognition method based on a deep neural network according to claim 2, characterized in that said subjecting the selected voice sample data to deep neural network training using the deep learning method to obtain the preset deep neural network model comprises:
Training a hidden Markov model and a Gaussian mixture model from the selected voice sample data, wherein said hidden Markov models correspond one-to-one to said selected voice sample data, and said Gaussian mixture model is used to describe the output probability distribution of the states of said hidden Markov model;
Performing initial-frame and end-frame alignment on the selected voice sample data with a Viterbi decoding algorithm, using said trained hidden Markov model and said Gaussian mixture model, to determine the boundary information of said voice sample data;
Training the preset deep neural network model from the audio, the text content, and the boundary information of said voice sample data.
4. The voice keyword recognition method based on a deep neural network according to claim 3, characterized in that, after said subjecting the selected voice sample data to deep neural network training using the deep learning method to obtain the preset deep neural network model, the method further comprises:
Monitoring the posterior probability of each voice sample under each neuron of the output layer of said trained preset deep neural network model;
Judging whether, for each voice sample, the posterior probability under the corresponding neuron is the maximum;
If not, adjusting the parameters of said preset deep neural network model with a back-propagation algorithm until, for every voice sample, the posterior probability under the corresponding neuron is the maximum.
5. The voice keyword recognition method based on a deep neural network according to any one of claims 1-4, characterized by further comprising:
Scoring the identified said keyword with the corresponding hidden Markov model, i.e. calculating the likelihood probability of said keyword under said hidden Markov model;
If said likelihood probability is greater than a preset threshold, determining that the recognition result is true.
6. The speech keyword recognition method based on a deep neural network according to claim 5, characterized in that determining the keyword of the input speech to be recognized according to the comparison result between the posterior probability sequence and the preset threshold comprises:
Judging whether the posterior probability sequence contains a continuous numerical sub-segment in which every probability is greater than the preset threshold;
If so, judging whether the time length between the start frame and the end frame corresponding to the continuous numerical sub-segment is greater than a preset duration;
When the time length between the start frame and the end frame corresponding to the continuous numerical sub-segment is greater than the preset duration, taking the keyword corresponding to the neural unit to which the continuous numerical sub-segment belongs as the keyword represented by the input speech to be recognized.
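Claim 6's decision rule, restated: fire a keyword only if the posterior sequence contains a run of consecutive frames all above the threshold, and that run spans more than a preset duration. A minimal sketch, using frame count as a proxy for duration (with a 10 ms frame shift, `min_frames` maps directly to a minimum time; the threshold and probability values are illustrative):

```python
def detect_keyword(posterior_seq, threshold, min_frames):
    """Return (start, end) frame indices of the first above-threshold run
    longer than min_frames, or None if no such run exists."""
    start = None
    for i, p in enumerate(posterior_seq + [0.0]):   # sentinel closes a final run
        if p > threshold:
            if start is None:
                start = i                           # run begins
        elif start is not None:
            if i - start > min_frames:              # duration check
                return (start, i - 1)
            start = None                            # too short; discard run
    return None

probs = [0.1, 0.2, 0.85, 0.9, 0.92, 0.88, 0.3, 0.95, 0.1]
print(detect_keyword(probs, threshold=0.8, min_frames=3))  # (2, 5)
print(detect_keyword(probs, threshold=0.8, min_frames=5))  # None
```

In the apparatus of claims 7 and 12, this check would run once per output-layer neural unit, and the unit whose sequence fires determines the recognized keyword.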
7. A speech keyword recognition apparatus based on a deep neural network, characterized in that it comprises:
A speech framing module, configured to frame the input speech to be recognized to obtain a plurality of speech frames;
A feature extraction module, configured to perform feature extraction on each speech frame to obtain a Mel-frequency cepstral coefficient (MFCC) sequence for each speech frame;
A probability calculation module, configured to input the MFCC sequence of each speech frame into a preset deep neural network model, calculate the posterior probability of the MFCC sequence of each speech frame under each neural unit of the output layer of the preset deep neural network model, and compose the posterior probabilities of the plurality of speech frames under each neural unit of the output layer into a corresponding posterior probability sequence, wherein each neural unit of the output layer corresponds to one keyword;
A monitoring module, configured to monitor the posterior probability sequence under each neural unit of the output layer;
A keyword recognition module, configured to determine the keyword of the input speech to be recognized according to the comparison result between the posterior probability sequence and the preset threshold.
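The framing and feature-extraction modules of claim 7 can be sketched end-to-end. This is a minimal NumPy MFCC front end under common assumptions (16 kHz audio, 25 ms frames, 10 ms hop, 512-point FFT, 26 mel filters, 13 cepstra); these parameters are not specified by the patent, and a production front end would typically add pre-emphasis, liftering, and delta features:

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Toy MFCC pipeline: framing -> Hamming window -> power spectrum
    -> triangular mel filterbank -> log -> DCT-II."""
    # 1. Split the signal into overlapping frames (25 ms window, 10 ms hop).
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Power spectrum of each frame (512-point FFT -> 257 bins).
    spec = np.abs(np.fft.rfft(frames, n=512, axis=1)) ** 2

    # 3. Triangular mel filterbank between 0 Hz and sr/2.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((512 + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    logmel = np.log(spec @ fbank.T + 1e-10)

    # 4. DCT-II, keeping the first n_ceps cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz tone
feats = mfcc(sig)
print(feats.shape)  # (98, 13)
```

Each row of `feats` is the per-frame MFCC vector that the probability calculation module would feed (typically with context stacking) into the DNN's input layer.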
8. The speech keyword recognition apparatus based on a deep neural network according to claim 7, characterized in that the preset deep neural network model is established by the following module:
A model determination module, configured to perform deep neural network training on the selected speech sample data using a deep learning method to obtain the preset deep neural network model, wherein the deep neural network model comprises: an input layer composed of neural units corresponding to the MFCC sequences, hidden layers composed of nonlinear mapping units, and an output layer composed of neural units corresponding to the posterior probability of each keyword.
9. The speech keyword recognition apparatus based on a deep neural network according to claim 8, characterized in that the model determination module comprises:
A training unit, configured to train a hidden Markov model (HMM) and a Gaussian mixture model (GMM) on the selected speech sample data, wherein the HMMs correspond one-to-one with the selected speech sample data, and the GMM describes the output probability distribution of the HMM states;
An alignment unit, configured to perform start-frame and end-frame alignment on the selected speech sample data with a Viterbi decoding algorithm, using the trained HMM and GMM, to determine the boundary information of the speech sample data;
A model determining unit, configured to train the preset deep neural network model according to the speech information, text content, and boundary information of the speech sample data.
10. The speech keyword recognition apparatus based on a deep neural network according to claim 9, characterized in that the apparatus further comprises:
A monitoring module, configured to monitor the posterior probability of each speech sample under each neural unit of the output layer of the trained preset deep neural network model;
A judging module, configured to judge whether the posterior probability of each speech sample is maximal under its corresponding neural unit;
A fine-tuning module, configured to, when the posterior probability of a speech sample is not maximal under its corresponding neural unit, adjust the parameters of the preset deep neural network model using the back-propagation algorithm until the posterior probability of every speech sample is maximal under its corresponding neural unit.
11. The speech keyword recognition apparatus based on a deep neural network according to any one of claims 7-10, characterized in that it further comprises:
A scoring module, configured to score the identified keyword using the corresponding hidden Markov model (HMM) and calculate the likelihood probability of the keyword under that HMM;
A recognition result confirmation module, configured to determine that the recognition result is true if the likelihood probability is greater than a preset threshold.
12. The speech keyword recognition apparatus based on a deep neural network according to claim 11, characterized in that the keyword recognition module comprises:
A first judging unit, configured to judge whether the posterior probability sequence contains a continuous numerical sub-segment in which every probability is greater than the preset threshold;
A second judging unit, configured to, when the posterior probability sequence contains such a sub-segment, judge whether the time length between the start frame and the end frame corresponding to the continuous numerical sub-segment is greater than a preset duration;
A keyword determining unit, configured to, when the time length between the start frame and the end frame corresponding to the continuous numerical sub-segment is greater than the preset duration, take the keyword corresponding to the neural unit to which the continuous numerical sub-segment belongs as the keyword represented by the input speech to be recognized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511016642.1A CN105679316A (en) | 2015-12-29 | 2015-12-29 | Voice keyword identification method and apparatus based on deep neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105679316A (en) | 2016-06-15 |
Family
ID=56189743
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511016642.1A Pending CN105679316A (en) | 2015-12-29 | 2015-12-29 | Voice keyword identification method and apparatus based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105679316A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102194454A (en) * | 2010-03-05 | 2011-09-21 | 富士通株式会社 | Equipment and method for detecting key word in continuous speech |
CN103559881A (en) * | 2013-11-08 | 2014-02-05 | 安徽科大讯飞信息科技股份有限公司 | Language-irrelevant key word recognition method and system |
CN103730115A (en) * | 2013-12-27 | 2014-04-16 | 北京捷成世纪科技股份有限公司 | Method and device for detecting keywords in voice |
CN104575490A (en) * | 2014-12-30 | 2015-04-29 | 苏州驰声信息科技有限公司 | Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm |
US20150279358A1 (en) * | 2014-03-31 | 2015-10-01 | International Business Machines Corporation | Method and system for efficient spoken term detection using confusion networks |
Non-Patent Citations (3)
Title |
---|
GUOGUO CHEN et al.: "Small-footprint keyword spotting using deep neural networks", 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) * |
LI WENXIN: "Research on confidence measures in speech keyword recognition", Master's thesis, PLA Information Engineering University * |
WANG CHAOSONG, HAN JIQING, ZHENG TIERAN: "Acoustic model training based on the non-uniform MCE criterion in a DNN keyword spotting system", Intelligent Computer and Applications * |
Cited By (75)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106209491B (en) * | 2016-06-16 | 2019-07-02 | 苏州科达科技股份有限公司 | A kind of time delay detecting method and device |
CN106209491A (en) * | 2016-06-16 | 2016-12-07 | 苏州科达科技股份有限公司 | A kind of time delay detecting method and device |
CN107767863A (en) * | 2016-08-22 | 2018-03-06 | 科大讯飞股份有限公司 | voice awakening method, system and intelligent terminal |
CN106297792A (en) * | 2016-09-14 | 2017-01-04 | 厦门幻世网络科技有限公司 | The recognition methods of a kind of voice mouth shape cartoon and device |
CN108073978A (en) * | 2016-11-14 | 2018-05-25 | 顾泽苍 | A kind of constructive method of the ultra-deep learning model of artificial intelligence |
CN108073985A (en) * | 2016-11-14 | 2018-05-25 | 张素菁 | A kind of importing ultra-deep study method for voice recognition of artificial intelligence |
CN108205525B (en) * | 2016-12-20 | 2021-11-19 | 阿里巴巴集团控股有限公司 | Method and device for determining user intention based on user voice information |
CN108205525A (en) * | 2016-12-20 | 2018-06-26 | 阿里巴巴集团控股有限公司 | The method and apparatus that user view is determined based on user speech information |
CN106919702A (en) * | 2017-02-14 | 2017-07-04 | 北京时间股份有限公司 | Keyword method for pushing and device based on document |
CN106919702B (en) * | 2017-02-14 | 2020-02-11 | 北京时间股份有限公司 | Keyword pushing method and device based on document |
CN108694940A (en) * | 2017-04-10 | 2018-10-23 | 北京猎户星空科技有限公司 | A kind of audio recognition method, device and electronic equipment |
CN107086036A (en) * | 2017-04-19 | 2017-08-22 | 杭州派尼澳电子科技有限公司 | A kind of freeway tunnel method for safety monitoring |
CN107221326A (en) * | 2017-05-16 | 2017-09-29 | 百度在线网络技术(北京)有限公司 | Voice awakening method, device and computer equipment based on artificial intelligence |
CN107230475A (en) * | 2017-05-27 | 2017-10-03 | 腾讯科技(深圳)有限公司 | A kind of voice keyword recognition method, device, terminal and server |
CN107230475B (en) * | 2017-05-27 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Voice keyword recognition method and device, terminal and server |
US11062699B2 (en) | 2017-06-12 | 2021-07-13 | Ping An Technology (Shenzhen) Co., Ltd. | Speech recognition with trained GMM-HMM and LSTM models |
WO2018227780A1 (en) * | 2017-06-12 | 2018-12-20 | 平安科技(深圳)有限公司 | Speech recognition method and device, computer device and storage medium |
CN107393539A (en) * | 2017-07-17 | 2017-11-24 | 傅筱萸 | A kind of sound cipher control method |
CN107680597A (en) * | 2017-10-23 | 2018-02-09 | 平安科技(深圳)有限公司 | Audio recognition method, device, equipment and computer-readable recording medium |
WO2019080248A1 (en) * | 2017-10-23 | 2019-05-02 | 平安科技(深圳)有限公司 | Speech recognition method, device, and apparatus, and computer readable storage medium |
CN109074822A (en) * | 2017-10-24 | 2018-12-21 | 深圳和而泰智能控制股份有限公司 | Specific sound recognition methods, equipment and storage medium |
CN108389575A (en) * | 2018-01-11 | 2018-08-10 | 苏州思必驰信息科技有限公司 | Audio data recognition methods and system |
CN108389575B (en) * | 2018-01-11 | 2020-06-26 | 苏州思必驰信息科技有限公司 | Audio data identification method and system |
CN108417207A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | A kind of depth mixing generation network self-adapting method and system |
CN110097870A (en) * | 2018-01-30 | 2019-08-06 | 阿里巴巴集团控股有限公司 | Method of speech processing, device, equipment and storage medium |
CN108305617A (en) * | 2018-01-31 | 2018-07-20 | 腾讯科技(深圳)有限公司 | The recognition methods of voice keyword and device |
CN108305617B (en) * | 2018-01-31 | 2020-09-08 | 腾讯科技(深圳)有限公司 | Method and device for recognizing voice keywords |
CN110444195B (en) * | 2018-01-31 | 2021-12-14 | 腾讯科技(深圳)有限公司 | Method and device for recognizing voice keywords |
US11222623B2 (en) | 2018-01-31 | 2022-01-11 | Tencent Technology (Shenzhen) Company Limited | Speech keyword recognition method and apparatus, computer-readable storage medium, and computer device |
WO2019149108A1 (en) * | 2018-01-31 | 2019-08-08 | 腾讯科技(深圳)有限公司 | Identification method and device for voice keywords, computer-readable storage medium, and computer device |
CN110444195A (en) * | 2018-01-31 | 2019-11-12 | 腾讯科技(深圳)有限公司 | The recognition methods of voice keyword and device |
CN110444193A (en) * | 2018-01-31 | 2019-11-12 | 腾讯科技(深圳)有限公司 | The recognition methods of voice keyword and device |
CN108538285A (en) * | 2018-03-05 | 2018-09-14 | 清华大学 | A kind of various keyword detection method based on multitask neural network |
CN108538285B (en) * | 2018-03-05 | 2021-05-04 | 清华大学 | Multi-instance keyword detection method based on multitask neural network |
CN108564941A (en) * | 2018-03-22 | 2018-09-21 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, equipment and storage medium |
US11450312B2 (en) | 2018-03-22 | 2022-09-20 | Tencent Technology (Shenzhen) Company Limited | Speech recognition method, apparatus, and device, and storage medium |
CN108564941B (en) * | 2018-03-22 | 2020-06-02 | 腾讯科技(深圳)有限公司 | Voice recognition method, device, equipment and storage medium |
CN108898076A (en) * | 2018-06-13 | 2018-11-27 | 北京大学深圳研究生院 | The method that a kind of positioning of video behavior time shaft and candidate frame extract |
CN109065032A (en) * | 2018-07-16 | 2018-12-21 | 杭州电子科技大学 | A kind of external corpus audio recognition method based on depth convolutional neural networks |
CN109086387A (en) * | 2018-07-26 | 2018-12-25 | 上海慧子视听科技有限公司 | A kind of audio stream methods of marking, device, equipment and storage medium |
CN108922521A (en) * | 2018-08-15 | 2018-11-30 | 合肥讯飞数码科技有限公司 | A kind of voice keyword retrieval method, apparatus, equipment and storage medium |
CN110837758A (en) * | 2018-08-17 | 2020-02-25 | 杭州海康威视数字技术股份有限公司 | Keyword input method and device and electronic equipment |
CN110837758B (en) * | 2018-08-17 | 2023-06-02 | 杭州海康威视数字技术股份有限公司 | Keyword input method and device and electronic equipment |
CN109215647A (en) * | 2018-08-30 | 2019-01-15 | 出门问问信息科技有限公司 | Voice awakening method, electronic equipment and non-transient computer readable storage medium |
CN109300279A (en) * | 2018-10-01 | 2019-02-01 | 厦门快商通信息技术有限公司 | A kind of shop security monitoring method |
CN109243446A (en) * | 2018-10-01 | 2019-01-18 | 厦门快商通信息技术有限公司 | A kind of voice awakening method based on RNN network |
CN109559735B (en) * | 2018-10-11 | 2023-10-27 | 平安科技(深圳)有限公司 | Voice recognition method, terminal equipment and medium based on neural network |
CN109559735A (en) * | 2018-10-11 | 2019-04-02 | 平安科技(深圳)有限公司 | A kind of audio recognition method neural network based, terminal device and medium |
CN109273003A (en) * | 2018-11-20 | 2019-01-25 | 苏州思必驰信息科技有限公司 | Sound control method and system for automobile data recorder |
CN109273003B (en) * | 2018-11-20 | 2021-11-02 | 思必驰科技股份有限公司 | Voice control method and system for automobile data recorder |
CN110503970A (en) * | 2018-11-23 | 2019-11-26 | 腾讯科技(深圳)有限公司 | A kind of audio data processing method, device and storage medium |
CN110503970B (en) * | 2018-11-23 | 2021-11-23 | 腾讯科技(深圳)有限公司 | Audio data processing method and device and storage medium |
CN111354352B (en) * | 2018-12-24 | 2023-07-14 | 中国科学院声学研究所 | Automatic template cleaning method and system for audio retrieval |
CN111354352A (en) * | 2018-12-24 | 2020-06-30 | 中国科学院声学研究所 | Automatic template cleaning method and system for audio retrieval |
CN109545190A (en) * | 2018-12-29 | 2019-03-29 | 联动优势科技有限公司 | A kind of audio recognition method based on keyword |
CN110322871A (en) * | 2019-05-30 | 2019-10-11 | 清华大学 | A kind of sample keyword retrieval method based on acoustics characterization vector |
CN110223678A (en) * | 2019-06-12 | 2019-09-10 | 苏州思必驰信息科技有限公司 | Audio recognition method and system |
CN110689880A (en) * | 2019-10-21 | 2020-01-14 | 国家电网公司华中分部 | Voice recognition method and device applied to power dispatching field |
CN110930997B (en) * | 2019-12-10 | 2022-08-16 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN110930997A (en) * | 2019-12-10 | 2020-03-27 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN111508498B (en) * | 2020-04-09 | 2024-01-30 | 携程计算机技术(上海)有限公司 | Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium |
CN111508498A (en) * | 2020-04-09 | 2020-08-07 | 携程计算机技术(上海)有限公司 | Conversational speech recognition method, system, electronic device and storage medium |
CN111508475A (en) * | 2020-04-16 | 2020-08-07 | 五邑大学 | Robot awakening voice keyword recognition method and device and storage medium |
CN111508475B (en) * | 2020-04-16 | 2022-08-09 | 五邑大学 | Robot awakening voice keyword recognition method and device and storage medium |
CN113658596A (en) * | 2020-04-29 | 2021-11-16 | 扬智科技股份有限公司 | Semantic identification method and semantic identification device |
CN111833888A (en) * | 2020-07-24 | 2020-10-27 | 清华大学 | Near sensor processing system, circuit and method for voice keyword recognition |
CN111833888B (en) * | 2020-07-24 | 2022-11-11 | 清华大学 | Near sensor processing system, circuit and method for voice keyword recognition |
CN112735469B (en) * | 2020-10-28 | 2024-05-17 | 西安电子科技大学 | Low-memory voice keyword detection method, system, medium, equipment and terminal |
CN112735469A (en) * | 2020-10-28 | 2021-04-30 | 西安电子科技大学 | Low-memory voice keyword detection method, system, medium, device and terminal |
CN112750445A (en) * | 2020-12-30 | 2021-05-04 | 标贝(北京)科技有限公司 | Voice conversion method, device and system and storage medium |
CN112750445B (en) * | 2020-12-30 | 2024-04-12 | 标贝(青岛)科技有限公司 | Voice conversion method, device and system and storage medium |
CN113035231A (en) * | 2021-03-18 | 2021-06-25 | 三星(中国)半导体有限公司 | Keyword detection method and device |
CN113035231B (en) * | 2021-03-18 | 2024-01-09 | 三星(中国)半导体有限公司 | Keyword detection method and device |
CN113361647A (en) * | 2021-07-06 | 2021-09-07 | 青岛洞听智能科技有限公司 | Method for identifying type of missed call |
CN113888846A (en) * | 2021-09-27 | 2022-01-04 | 深圳市研色科技有限公司 | Method and device for reminding driving in advance |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105679316A (en) | Voice keyword identification method and apparatus based on deep neural network | |
CN108305634B (en) | Decoding method, decoder and storage medium | |
Tripathi et al. | Deep learning based emotion recognition system using speech features and transcriptions | |
Momeni et al. | Seeing wake words: Audio-visual keyword spotting | |
Polzehl et al. | Anger recognition in speech using acoustic and linguistic cues | |
US8069042B2 (en) | Using child directed speech to bootstrap a model based speech segmentation and recognition system | |
CN106847259B (en) | Method for screening and optimizing audio keyword template | |
CN105654940B (en) | Speech synthesis method and device | |
CN105336324A (en) | Language identification method and device | |
CN111737991B (en) | Text sentence breaking position identification method and system, electronic equipment and storage medium | |
CN112233680A (en) | Speaker role identification method and device, electronic equipment and storage medium | |
US10446136B2 (en) | Accent invariant speech recognition | |
EP4002354B1 (en) | Method and system for automatic speech recognition in resource constrained devices | |
CN114722822B (en) | Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium | |
CN112331207B (en) | Service content monitoring method, device, electronic equipment and storage medium | |
CN111753524A (en) | Text sentence break position identification method and system, electronic device and storage medium | |
Bhati et al. | Self-expressing autoencoders for unsupervised spoken term discovery | |
Bhati et al. | Unsupervised Acoustic Segmentation and Clustering Using Siamese Network Embeddings. | |
Bhati et al. | Phoneme based embedded segmental k-means for unsupervised term discovery | |
Aronowitz et al. | Context and uncertainty modeling for online speaker change detection | |
Musaev et al. | Automatic recognition of Uzbek speech based on integrated neural networks | |
CN112037772B (en) | Response obligation detection method, system and device based on multiple modes | |
CN112309398B (en) | Method and device for monitoring working time, electronic equipment and storage medium | |
CN114360584A (en) | Phoneme-level-based speech emotion layered recognition method and system | |
CN111429919B (en) | Crosstalk prevention method based on conference real recording system, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 2016-06-15