CN105976812A - Voice identification method and equipment thereof - Google Patents
- Publication number
- CN105976812A CN105976812A CN201610272292.3A CN201610272292A CN105976812A CN 105976812 A CN105976812 A CN 105976812A CN 201610272292 A CN201610272292 A CN 201610272292A CN 105976812 A CN105976812 A CN 105976812A
- Authority
- CN
- China
- Prior art keywords
- data
- feature
- target
- training
- audio data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
An embodiment of the invention discloses a voice identification method and corresponding equipment. The method comprises the following steps: acquiring target audio data input through an interactive application; extracting a target Filter bank feature from the target audio data; using the target Filter bank feature of the target audio data as the input data of a trained DNN model, and acquiring the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model; and creating a phoneme decoding network associated with the target audio data, then using the phoneme transition probabilities of a trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, within the decoding network, the target word sequence data corresponding to the target audio data. The method and equipment of the invention accommodate speech recognition across varied practical application environments and pronunciation habits, and improve the accuracy of speech recognition.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a speech recognition method and equipment thereof.
Background technology
With the continuous development and refinement of computer technology, application scenarios for speech recognition have steadily increased, for example: extracting contact information in a terminal from audio input by a user, generating chat content from audio input by a user, and performing user verification through audio input by a user. Speech recognition technology makes it easier for users to operate terminals such as mobile phones and computers, improving the user experience.
Existing speech recognition technology builds its acoustic model on the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM). In practical applications, the Mel Frequency Cepstrum Coefficient (MFCC) features of the target audio must be extracted and input into the acoustic model, which finally outputs the speech recognition result for the target audio. Because GMM-HMM acoustic modeling is a modeling approach aimed at distinguishing between pronunciation phoneme states, it requires MFCC features whose feature dimensions are mutually independent as the input data of the acoustic model. It therefore cannot accommodate the speech recognition needs of varied practical application environments and pronunciation habits, which reduces the accuracy of speech recognition.
Summary of the invention
Embodiments of the present invention provide a speech recognition method and equipment thereof, which can accommodate the speech recognition needs of varied practical application environments and pronunciation habits and improve the accuracy of speech recognition.
A first aspect of an embodiment of the present invention provides a speech recognition method, which may include:
acquiring target audio data input through an interactive application;
extracting a target Filter bank (bank of filters) feature from the target audio data;
using the target Filter bank feature of the target audio data as the input data of a trained Deep Neural Network (DNN) model, and acquiring the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model; and
creating a phoneme decoding network associated with the target audio data, and using the phoneme transition probabilities of a trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, within the decoding network, the target word sequence data corresponding to the target audio data.
A second aspect of an embodiment of the present invention provides a speech recognition equipment, which may include:
an audio data acquiring unit, configured to acquire target audio data input through an interactive application;
a feature extraction unit, configured to extract the target Filter bank feature from the target audio data;
a feature acquiring unit, configured to use the target Filter bank feature of the target audio data as the input data of a trained DNN model, and to acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model; and
a word sequence data acquiring unit, configured to create a phoneme decoding network associated with the target audio data, and to use the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, within the decoding network, the target word sequence data corresponding to the target audio data.
In embodiments of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank feature is extracted from the target audio data, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to acquire the target word sequence data. The acoustic model built from the DNN model and the HMM realizes the speech recognition function, and because the Filter bank feature serves as the input data of the acoustic model, the correlations between feature dimensions need not be removed. This accommodates the speech recognition needs of varied practical application environments and pronunciation habits and improves the accuracy of speech recognition.
Brief description of the drawings
To explain the embodiments of the present invention or prior-art technical solutions more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of a speech recognition method provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of another speech recognition method provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a speech recognition equipment provided by an embodiment of the present invention;
Fig. 4 is a structural diagram of another speech recognition equipment provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of the feature extraction unit provided by an embodiment of the present invention;
Fig. 6 is a structural diagram of the feature acquiring unit provided by an embodiment of the present invention;
Fig. 7 is a structural diagram of yet another speech recognition equipment provided by an embodiment of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The speech recognition method provided by the embodiments of the present invention can be applied to scenarios in which target audio data input by a terminal user (for example, audio containing digits or audio containing words) is recognized to generate a corresponding word sequence (for example, a digit string or a sentence). For example: a speech recognition equipment acquires target audio data input through an interactive application; the speech recognition equipment extracts the target Filter bank feature from the target audio data; the speech recognition equipment uses the target Filter bank feature of the target audio data as the input data of a trained DNN model and acquires the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model; the speech recognition equipment creates a phoneme decoding network associated with the target audio data, and uses the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, within the decoding network, the target word sequence data corresponding to the target audio data. The acoustic model built from the DNN model and the HMM realizes the speech recognition function, and because the Filter bank feature serves as the input data of the acoustic model, the correlations between feature dimensions need not be removed, accommodating varied practical application environments and pronunciation habits and improving the accuracy of speech recognition.
The speech recognition equipment involved in the embodiments of the present invention may be a terminal device with a speech recognition function, such as a tablet computer, smartphone, handheld computer, vehicle-mounted terminal, personal computer (PC), or mobile internet device (MID); it may also be a server device with a speech recognition function corresponding to the interactive application. The interactive application may be a terminal application that needs the audio input by the user to realize a corresponding interactive function, such as a transaction application or an instant messaging application; the speech recognition method provided by the embodiments of the present invention can support verification-code input, password input, communication-content input, and the like.
A speech recognition method provided by the embodiments of the present invention is described in detail below with reference to Fig. 1 and Fig. 2.
Referring to Fig. 1, a flow diagram of a speech recognition method provided by an embodiment of the present invention is shown. As shown in Fig. 1, the method of the embodiment of the present invention may include the following steps S101-S104.
S101: acquiring target audio data input through an interactive application.
Specifically, the speech recognition equipment acquires the target audio data input by the user through the interactive application. The target audio data may specifically be the voice input by the user on the application interface of the interactive application that currently requires voice input, i.e. the audio data on which speech recognition currently needs to be performed.
S102: extracting the target Filter bank feature from the target audio data.
Specifically, the speech recognition equipment can extract the target Filter bank feature from the target audio data. It should be noted that the speech recognition equipment needs to split the target audio data into multiple frames of audio data and extract the Filter bank feature of each frame separately; that is, the input into the trained DNN model described below is frame by frame, and the posterior probability feature over phoneme states is calculated per frame. Preferably, the speech recognition equipment can perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and then acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently and actually needs to be calculated; the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
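The framing described in S102 can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the frame length and hop size (25 ms and 10 ms at 16 kHz) are assumed values not specified by the patent.

```python
# Split an audio signal into overlapping frames prior to per-frame
# feature extraction, as required before DNN input.
import numpy as np

def frame_audio(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

audio = np.random.randn(16000)          # one second of 16 kHz audio
frames = frame_audio(audio)
print(frames.shape)                     # (98, 400)
```

Each row of `frames` is one frame of first audio data whose Filter bank feature is extracted separately.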
S103: using the target Filter bank feature of the target audio data as the input data of the trained DNN model, and acquiring the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model.
Specifically, the speech recognition equipment can use the target Filter bank feature of the target audio data as the input data of the trained DNN model and acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model. Preferably, a phoneme state corresponds to a phonetic symbol, and the target phoneme states are the phoneme states present in the target audio data. During training, the DNN model obtains the matrix weight values and matrix bias values between the nodes up to the output layer. The output layer may contain at least one node; the number of output layer nodes is related to (for example, equal to) the number of phoneme states, so that each output layer node represents the feature vector of one phoneme state.
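The mapping in S103 from a Filter bank input vector to a posterior probability feature can be sketched as a standard feed-forward pass: the trained matrix weight values and bias values transform the input through hidden layers, and a softmax over the output layer nodes yields a posterior distribution over phoneme states. The layer sizes and the ReLU/softmax choice below are illustrative assumptions, not taken from the patent.

```python
# Forward pass of a DNN acoustic model: one output node per phoneme state,
# softmax output interpreted as the posterior probability feature.
import numpy as np

rng = np.random.default_rng(0)

def dnn_posteriors(x, weights, biases):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)          # hidden-layer activation
    logits = h @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max())           # softmax -> posterior
    return e / e.sum()

dims = [1224, 1024, 1024, 1024, 100]            # input, 3 hidden, P states
Ws = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(dims, dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]
p = dnn_posteriors(rng.standard_normal(1224), Ws, bs)
print(round(p.sum(), 6))                        # 1.0 — a valid posterior
```

The output vector `p` assigns one probability per phoneme state and sums to 1, which is exactly the property the decoding step relies on.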
S104: creating a phoneme decoding network associated with the target audio data, and using the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, within the decoding network, the target word sequence data corresponding to the target audio data.
Specifically, the speech recognition equipment can create a phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network can be a word-graph decoding network built on the Weighted Finite-State Transducer (WFST) framework, taking a phoneme state sequence as input and word sequence data as output. It can be understood that the phoneme decoding network can also be created in advance when the DNN model and the HMM are trained.
The speech recognition equipment uses the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, within the decoding network, the target word sequence data corresponding to the target audio data. The phoneme transition probabilities of the trained HMM include, for each phoneme state, the probability of jumping to itself and the probability of jumping to its next phoneme state. It can be understood that the speech recognition equipment can, according to the phoneme transition probabilities of the trained HMM and the posterior probability features over the target phoneme states of all first target Filter bank features, set the probability value of every network path in the phoneme decoding network, filter out the optimal path according to the probability values of the network paths, and take the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data.
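A toy sketch of the path scoring in S104: combine the HMM's two transition probabilities (self-loop and jump to the next state) with the per-frame posteriors, and keep the best-scoring path through a small left-to-right network. Real decoding runs over a WFST word graph; this three-state, three-frame example and its probabilities are purely illustrative.

```python
# Viterbi search over a left-to-right HMM: at each frame, state s either
# loops to itself or advances to s+1; path scores multiply transition
# probability by the frame's posterior for the entered state.
import numpy as np

def best_path(posteriors, self_p, next_p):
    """posteriors: (frames, states). Returns log-prob of the best path."""
    T, S = posteriors.shape
    score = np.full((T, S), -np.inf)
    score[0, 0] = np.log(posteriors[0, 0])
    for t in range(1, T):
        for s in range(S):
            stay = score[t-1, s] + np.log(self_p)
            move = score[t-1, s-1] + np.log(next_p) if s > 0 else -np.inf
            score[t, s] = max(stay, move) + np.log(posteriors[t, s])
    return score[-1, -1]

post = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])
lp = best_path(post, self_p=0.5, next_p=0.5)
print(round(lp, 3))    # -2.056: best path visits states 0 -> 1 -> 2
```

The best path advances through all three states, picking the 0.8 posterior at each frame.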
In embodiments of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank feature is extracted from the target audio data, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to acquire the target word sequence data. The acoustic model built from the DNN model and the HMM realizes the speech recognition function, and because the Filter bank feature serves as the input data of the acoustic model, the correlations between feature dimensions need not be removed. This accommodates the speech recognition needs of varied practical application environments and pronunciation habits and improves the accuracy of speech recognition.
Referring to Fig. 2, a flow diagram of another speech recognition method provided by an embodiment of the present invention is shown. As shown in Fig. 2, the method of the embodiment of the present invention may include the following steps S201-S211.
S201: using a training audio corpus to train the GMM and the HMM, acquiring the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and acquiring the phoneme transition probabilities of the trained HMM.
Specifically, before the DNN model is trained, the acoustic model of the GMM and the HMM needs to be trained first. The speech recognition equipment can use a training audio corpus to train the GMM and the HMM, acquire the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and acquire the phoneme transition probabilities of the trained HMM. The training audio corpus should, as far as possible, contain audio data under scenarios such as different noise environments, different speaking rates, and different pauses between words.
It should be noted that the speech recognition equipment can perform data preprocessing on the training audio corpus. The data preprocessing may include: performing data framing, data pre-emphasis, and data windowing on the training audio corpus to obtain at least one frame of audio data in the time domain; performing a fast Fourier transform to transform the at least one frame of audio data into the frequency domain, obtaining at least one power spectrum data corresponding to the at least one frame of audio data in the frequency domain; passing the at least one power spectrum data through Mel-frequency filters with triangular filtering characteristics to obtain at least one Mel power spectrum data; and taking the logarithmic energy of the at least one Mel power spectrum data to obtain at least one Mel logarithmic energy spectrum data, which at this point constitutes the Filter bank feature. A DCT is then used to remove the data correlation of the at least one Mel logarithmic energy spectrum data and obtain the MFCC feature. The speech recognition equipment uses the MFCC feature as the input data of the GMM in order to train the GMM and the HMM, and acquires the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM as well as the phoneme transition probabilities of the trained HMM. It can be understood that, for the same frame of audio data in the training audio corpus, the Filter bank feature and the MFCC feature correspond one to one.
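The preprocessing chain above can be sketched end to end for a single frame: pre-emphasis, windowing, FFT power spectrum, triangular mel filtering, log energy (the Filter bank feature), and finally a DCT that decorrelates it into an MFCC. The filter count, FFT size, and coefficient count below are common assumed values; the patent does not fix them.

```python
# One frame through the S201 pipeline: Filter bank feature, then MFCC
# via DCT-II (the step that removes inter-dimension correlation).
import numpy as np

def fbank_and_mfcc(frame, sr=16000, n_mels=24, n_mfcc=13):
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    frame = frame * np.hamming(len(frame))                      # windowing
    power = np.abs(np.fft.rfft(frame, 512)) ** 2                # power spectrum
    # Triangular mel filter bank: equally spaced on the mel scale
    mel = np.linspace(0, 2595 * np.log10(1 + sr / 2 / 700), n_mels + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((512 + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, 257))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    fbank = np.log(fb @ power + 1e-10)          # Filter bank feature
    # DCT-II over the log-mel energies yields the MFCC feature
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return fbank, dct @ fbank

fbank, mfcc = fbank_and_mfcc(np.random.randn(400))
print(fbank.shape, mfcc.shape)                  # (24,) (13,)
```

The one-to-one correspondence noted in the text is visible here: both features come from the same frame, and the MFCC is a deterministic transform of the Filter bank vector.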
S202: using a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state.
Specifically, the speech recognition equipment can use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state. It can be understood that, because the likelihood probability feature is a similarity-type probability feature, for one frame of audio data in the training audio corpus the feature values of its likelihood probability features over all phoneme states do not sum to 1, whereas the feature values of its posterior probability features over all phoneme states must sum to 1. Therefore, the phoneme state with the largest likelihood probability feature value is chosen, the posterior probability feature value of that phoneme state is set to 1, and the posterior probability feature values of all other phoneme states for that frame of audio data are set to 0. Proceeding by analogy, the likelihood probability features over phoneme states of every frame of audio data in the training audio corpus are converted, yielding the posterior probability feature over phoneme states of every frame of audio data in the training audio corpus.
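The conversion in S202 reduces to a one-hot assignment per frame, which can be sketched directly; the likelihood values below are illustrative.

```python
# S202 sketch: per frame, the highest-likelihood phoneme state gets
# posterior 1 and every other state gets 0, so each frame sums to 1.
import numpy as np

def likelihood_to_posterior(likelihoods):
    """likelihoods: (frames, states) array; returns one-hot posteriors."""
    post = np.zeros_like(likelihoods)
    post[np.arange(len(likelihoods)), likelihoods.argmax(axis=1)] = 1.0
    return post

lik = np.array([[0.2, 0.9, 0.1],
                [0.7, 0.1, 0.3]])
post = likelihood_to_posterior(lik)
print(post.tolist())       # [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
```

These one-hot posteriors are exactly the training targets paired with the training Filter bank features in S203.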
S203: calculating the matrix weight values and matrix bias values between the nodes up to the output layer in the DNN model according to the training Filter bank feature extracted from the training audio corpus and the posterior probability feature of each phoneme state.
S204: adding the matrix weight values and matrix bias values to the DNN model to generate the trained DNN model.
Specifically, the speech recognition equipment can calculate the matrix weight values and matrix bias values between the nodes up to the output layer in the DNN model according to the training Filter bank feature extracted from the training audio corpus and the posterior probability feature of each phoneme state. Preferably, the speech recognition equipment can extract, by the method described above, the training Filter bank feature corresponding to every frame of audio data in the training audio corpus, and take each training Filter bank feature together with its corresponding posterior probability feature as a training sample pair. The training audio corpus may thus yield multiple training sample pairs; based on these multiple training sample pairs, the back-propagation algorithm under the maximum likelihood criterion is used to calculate the matrix weight values and matrix bias values between the nodes up to the output layer in the DNN model. The speech recognition equipment adds the matrix weight values and matrix bias values to the DNN model to generate the trained DNN model.
S205: acquiring the occurrence probabilities of training word sequence data in a training word sequence corpus, and generating an N-Gram language model according to the occurrence probabilities of the training word sequence data.
Specifically, while training the acoustic model of the DNN model and the HMM, the speech recognition equipment can also train the language model. The speech recognition equipment can acquire the occurrence probabilities of training word sequence data in the training word sequence corpus and generate an N-Gram language model according to those occurrence probabilities. The N-Gram language model is based on the assumption that the occurrence of the k-th word is related only to the preceding K-1 words and unrelated to any other word, so that the probability of a word string is the product of the occurrence probabilities of its words.
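The N-Gram assumption above can be made concrete with the simplest case, K = 2 (a bigram model): the probability of a word string is the product of each word's probability given its single predecessor, estimated from counts in the training word sequence corpus. The two-sentence corpus is a made-up illustration.

```python
# Bigram sketch of S205: P(sentence) = product of P(word | previous word),
# with conditional probabilities estimated by count ratios.
from collections import Counter

corpus = [["open", "the", "door"], ["open", "the", "window"]]
unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((a, b) for s in corpus for a, b in zip(s, s[1:]))

def bigram_prob(sentence):
    p = 1.0
    for a, b in zip(sentence, sentence[1:]):
        p *= bigrams[(a, b)] / unigrams[a]
    return p

print(bigram_prob(["open", "the", "door"]))   # 1.0 * 0.5 = 0.5
```

Here "the" is followed by "door" in half the training occurrences, so the sentence probability is 0.5.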
S206: acquiring target audio data input through an interactive application.
Specifically, the speech recognition equipment acquires the target audio data input by the user through the interactive application. The target audio data may specifically be the voice input by the user on the application interface of the interactive application that currently requires voice input, i.e. the audio data on which speech recognition currently needs to be performed.
S207: performing data framing on the target audio data to obtain at least one frame of audio data in the target audio data.
S208: acquiring the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
Specifically, the speech recognition equipment needs to split the target audio data into multiple frames of audio data and extract the Filter bank feature of each frame separately; that is, the input into the trained DNN model described below is frame by frame, and the posterior probability feature over phoneme states is calculated per frame. Preferably, the speech recognition equipment can perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and then acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently and actually needs to be calculated; the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
Further, the speech recognition equipment can perform data preprocessing on the target audio data. The data preprocessing may include: data framing, data pre-emphasis, and data windowing to obtain at least one frame of audio data in the time domain; a fast Fourier transform to transform the at least one frame of audio data into the frequency domain, obtaining at least one power spectrum data corresponding to the at least one frame of audio data in the frequency domain; passing the at least one power spectrum data through Mel-frequency filters with triangular filtering characteristics to obtain at least one Mel power spectrum data; and taking the logarithmic energy of the at least one Mel power spectrum data to obtain at least one Mel logarithmic energy spectrum data, the set of which is the target Filter bank feature. It can be understood that the Filter bank feature has data correlation between its feature dimensions, whereas the MFCC feature is obtained by using a Discrete Cosine Transform (DCT) to remove the data correlation of the Filter bank feature.
Preferably, the speech recognition equipment can further perform feature post-processing on the target Filter bank feature. The feature post-processing can include feature extension and feature normalization. Feature extension can consist of computing the first-order difference and second-order difference features of the target Filter bank feature, obtaining a target Filter bank feature with a preset number of feature dimensions for each frame of first audio data. Feature normalization can consist of using the Cepstrum Mean Subtraction (CMS) technique to normalize the target Filter bank feature of the preset feature dimensions corresponding to each frame of first audio data, obtaining the first target Filter bank feature corresponding to each frame of first audio data. Preferably, the preset number of dimensions can be 72.
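The feature extension and normalization just described can be sketched as follows. The 24-dimensional base Filter bank size is an assumption chosen so that static + first-order + second-order differences give 24 × 3 = 72 dimensions, consistent with the preferred dimension above; `np.gradient` stands in for the difference computation.

```python
# Feature post-processing sketch: delta and delta-delta extension to 72
# dims, then cepstral mean subtraction (CMS) across the utterance.
import numpy as np

def extend_and_normalize(feats):
    """feats: (frames, 24). Returns (frames, 72) CMS-normalized features."""
    delta = np.gradient(feats, axis=0)          # first-order difference
    delta2 = np.gradient(delta, axis=0)         # second-order difference
    full = np.concatenate([feats, delta, delta2], axis=1)
    return full - full.mean(axis=0)             # cepstral mean subtraction

out = extend_and_normalize(np.random.randn(100, 24))
print(out.shape)    # (100, 72); every dimension now has zero mean
```

CMS removes the per-utterance mean of each dimension, which suppresses constant channel effects across the recording.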
S209: obtaining, according to the time ordering of the at least one frame of audio data, second audio data of a preset number of frames before and after each frame of first audio data;

S210: using the first target Filter bank feature and the second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and obtaining the posterior probability features on the target phoneme states of the first target Filter bank feature output by the trained DNN model;
Specifically, the speech recognition apparatus may obtain, according to the time ordering of the at least one frame of audio data, the second audio data of the preset number of frames before and after each frame of first audio data. The speech recognition apparatus uses the first target Filter bank feature and the second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and obtains the posterior probability features on the target phoneme states of the first target Filter bank feature output by the trained DNN model. It is to be understood that the second audio data is data that has a dimensional correlation with the first audio data.
Assume that the target audio data contains N frames of audio data, and that the first target Filter bank feature corresponding to the i-th frame of first audio data is F_i, i = 1, 2, 3, ..., N. If the preset number of frames before and after is 8 frames each, the input data may include F_i and the second target Filter bank features of the 8 frames before and after the i-th frame of first audio data. Based on the above preferred preset dimension, the number of input layer nodes of the trained DNN model corresponding to the input data is (8+1+8) * 72 = 1224 nodes, and the number of output layer nodes of the trained DNN model is equal to the number P of all phoneme states. A preset number of hidden layers exists between the input layer and the output layer; the number of hidden layers is preferably 3, and each hidden layer has 1024 nodes. In the trained DNN model, the matrix weight value and matrix bias between the nodes of the (M-1)-th layer and the nodes of the M-th layer may be expressed as W_M and b_M, and the feature vector y_i^M of the i-th frame of first audio data at the M-th layer satisfies y_i^M = f(W_M * y_i^(M-1) + b_M), where f(x) is the activation function, preferably the ReLU function. The posterior probability feature of F_i on the m-th phoneme state output by the trained DNN model is then the softmax over the P output layer nodes: p_i^m = exp(y_i(m)) / Σ_{j=1..P} exp(y_i(j)).
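The forward pass described above can be sketched as follows, with random placeholder parameters standing in for the trained W_M and b_M, and an assumed (hypothetical) state count P; only the dimensions from the text (17 spliced frames of 72-dim features, three 1024-node ReLU hidden layers, softmax output) are taken from the source.

```python
import numpy as np

P = 200                                        # number of phoneme states (assumed)
sizes = [17 * 72, 1024, 1024, 1024, P]         # (8+1+8)*72 = 1224 input nodes
rng = np.random.default_rng(0)
weights = [rng.standard_normal((o, i)) * 0.01 for i, o in zip(sizes, sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]

def relu(x):
    return np.maximum(x, 0.0)

def posterior(spliced_frame):
    """spliced_frame: 1224-dim concatenation F_{i-8}..F_i..F_{i+8}."""
    y = spliced_frame
    for W, b in zip(weights[:-1], biases[:-1]):
        y = relu(W @ y + b)                    # y^M = f(W_M y^{M-1} + b_M)
    logits = weights[-1] @ y + biases[-1]
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()                         # posterior over the P phoneme states

p = posterior(rng.standard_normal(17 * 72))
```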
S211: creating a phoneme decoding network associated with the target audio data, and using the phoneme transition probabilities of the trained HMM and the posterior probability features on the target phoneme states of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data;

Specifically, the speech recognition apparatus may create a phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word lattice decoding network that takes a WFST as its framework, takes phoneme state sequences as input, and outputs word sequence data. It is to be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.
The speech recognition apparatus uses the phoneme transition probabilities of the trained HMM and the posterior probability features on the target phoneme states of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data. The phoneme transition probabilities of the trained HMM include the probability of each phoneme state jumping to itself and the probability of each phoneme state jumping to its next phoneme state. It is to be understood that the speech recognition apparatus may set the probability value of every network path in the phoneme decoding network according to the phoneme transition probabilities of the trained HMM and the posterior probability features on the target phoneme states of all the first target Filter bank features, filter out the optimal path according to the probability values of the network paths, and use the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data.
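Scoring network paths from the two transition types named above (self-loop and jump-to-next) and picking the best one can be sketched with a toy Viterbi search. This is only an illustration over a linear chain of states; the claimed method decodes over a WFST word-lattice network, and the transition probabilities here are assumed values.

```python
import numpy as np

def viterbi(posteriors, self_loop=0.6, forward=0.4):
    """posteriors: (T, S) per-frame state posteriors; returns the optimal state path."""
    T, S = posteriors.shape
    log_post = np.log(posteriors + 1e-12)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_post[0, 0]               # paths start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + np.log(self_loop)          # jump to itself
            move = score[t - 1, s - 1] + np.log(forward) if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1           # jump to next state
            score[t, s] = max(stay, move) + log_post[t, s]
    path = [S - 1]                             # trace back from the final state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

post = np.full((6, 3), 1 / 3)                  # uniform toy posteriors, 6 frames
path = viterbi(post)
```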
Further, the speech recognition apparatus may use the phoneme transition probabilities of the trained HMM, the posterior probability features on the target phoneme states of the first target Filter bank features, and the N-Gram language model to obtain, in the decoding network, the target word sequence data corresponding to the target audio data. Since the N-Gram language model can infer the probability that the next word occurs, the probability value of each network path can be weighted by this occurrence probability, increasing the probability of the correct network path. Obtaining the target word sequence data corresponding to the target audio data in combination with the N-Gram language model can therefore further improve the accuracy of speech recognition.
In the embodiments of the present invention, when target audio data input based on an interactive application is acquired, the target Filter bank feature in the target audio data is obtained, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain target word sequence data. The acoustic model established by the DNN model and the HMM realizes the speech recognition function, and using the Filter bank feature as the input data of the acoustic model makes it unnecessary to remove the correlation between feature dimensions, so that speech recognition can accommodate various actual application environments and pronunciation habits, improving recognition accuracy. By integrating the Filter bank feature extraction method with the training method of the DNN-HMM acoustic model, a complete process from training to recognition is realized. The target word sequence data corresponding to the target audio data is obtained in combination with the N-Gram language model; since the N-Gram language model can infer the probability that the next word occurs, the probability value of each network path can be weighted by this occurrence probability, increasing the probability of the correct network path and further improving the accuracy of speech recognition.
The speech recognition apparatus provided by the embodiments of the present invention is described in detail below with reference to Fig. 3 to Fig. 6. It should be noted that the speech recognition apparatus shown in Fig. 3 to Fig. 6 is configured to perform the methods of the embodiments shown in Fig. 1 and Fig. 2 of the present invention. For convenience of description, only the parts relevant to the embodiments of the present invention are shown; for specific technical details not disclosed, refer to the embodiments shown in Fig. 1 and Fig. 2 of the present invention.

Refer to Fig. 3, which provides a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention. As shown in Fig. 3, the speech recognition apparatus 1 of the embodiment of the present invention may include: an audio data acquiring unit 11, a feature extraction unit 12, a feature acquiring unit 13, and a word sequence data acquiring unit 14.
The audio data acquiring unit 11 is configured to acquire target audio data input based on an interactive application.

In a specific implementation, the audio data acquiring unit 11 acquires the target audio data input by a user based on the interactive application. The target audio data may specifically be the voice input by the user through the application interface of the interactive application that currently needs voice input, and is the audio data on which speech recognition currently needs to be performed.
The feature extraction unit 12 is configured to extract the target Filter bank feature in the target audio data.

In a specific implementation, the feature extraction unit 12 may extract the target Filter bank feature from the target audio data. It should be noted that the feature extraction unit 12 needs to split the target audio data into multiple frames of audio data, extract the Filter bank feature of each frame of audio data, and input the features frame by frame into the trained DNN model described below, i.e. framewise input for the calculation of the posterior probability features of the phoneme states. Preferably, the feature extraction unit 12 may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and obtain the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data on which the posterior probability feature calculation currently actually needs to be performed; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
The feature acquiring unit 13 is configured to use the target Filter bank feature in the target audio data as the input data of the trained DNN model, and obtain the posterior probability features on the target phoneme states of the target audio data output by the trained DNN model.

In a specific implementation, the feature acquiring unit 13 may use the target Filter bank feature in the target audio data as the input data of the trained DNN model and obtain the posterior probability features on the target phoneme states of the target audio data output by the trained DNN model. Preferably, a phoneme state corresponds to a phonetic symbol, and the target phoneme states are the phoneme states present in the target audio data. The DNN model can obtain the matrix weight values and matrix biases between the output layer nodes during the training process. There may be at least one output layer node; the number of output layer nodes is related to the number of phoneme states (for example, equal to it), i.e. one output layer node represents the feature vector of one phoneme state.
The word sequence data acquiring unit 14 is configured to create a phoneme decoding network associated with the target audio data, and use the phoneme transition probabilities of the trained HMM and the posterior probability features on the target phoneme states of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.

In a specific implementation, the word sequence data acquiring unit 14 may create a phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word lattice decoding network that takes a WFST as its framework, takes phoneme state sequences as input, and outputs word sequence data. It is to be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.

The word sequence data acquiring unit 14 uses the phoneme transition probabilities of the trained HMM and the posterior probability features on the target phoneme states of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data. The phoneme transition probabilities of the trained HMM include the probability of each phoneme state jumping to itself and the probability of each phoneme state jumping to its next phoneme state. It is to be understood that the word sequence data acquiring unit 14 may set the probability value of every network path in the phoneme decoding network according to the phoneme transition probabilities of the trained HMM and the posterior probability features on the target phoneme states of all the first target Filter bank features, filter out the optimal path according to the probability values of the network paths, and use the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data.
In the embodiments of the present invention, when target audio data input based on an interactive application is acquired, the target Filter bank feature in the target audio data is obtained, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain target word sequence data. The acoustic model established by the DNN model and the HMM realizes the speech recognition function, and using the Filter bank feature as the input data of the acoustic model makes it unnecessary to remove the correlation between feature dimensions, so that speech recognition can accommodate various actual application environments and pronunciation habits, improving recognition accuracy.
Refer to Fig. 4, which provides a schematic structural diagram of another speech recognition apparatus according to an embodiment of the present invention. As shown in Fig. 4, the speech recognition apparatus 1 of the embodiment of the present invention may include: an audio data acquiring unit 11, a feature extraction unit 12, a feature acquiring unit 13, a word sequence data acquiring unit 14, an acoustic model training unit 15, a feature conversion unit 16, a parameter calculation unit 17, an acoustic model generating unit 18, and a language model generating unit 19.
The acoustic model training unit 15 is configured to train a GMM and an HMM using a training audio corpus, obtain the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtain the phoneme transition probabilities of the trained HMM.

In a specific implementation, before the DNN model is trained, the acoustic model of the GMM and the HMM needs to be trained first. The acoustic model training unit 15 may train the GMM and the HMM using the training audio corpus, obtain the likelihood probability feature of each phoneme state in the at least one phoneme state output by the trained GMM, and obtain the phoneme transition probabilities of the trained HMM. The training audio corpus may, as far as possible, contain audio data under scenarios such as different noise environments, different speech rates, and different pauses between words.
It should be noted that the acoustic model training unit 15 may perform data preprocessing on the training audio corpus. The data preprocessing may include: performing data framing, data pre-emphasis, data windowing, and other operations on the training audio corpus to obtain at least one frame of audio data in the time domain; performing a fast Fourier transform to transform the at least one frame of audio data into the frequency domain, obtaining at least one piece of power spectrum data corresponding to the at least one frame of audio data in the frequency domain; passing the at least one piece of power spectrum data in the frequency domain through mel-frequency filters having a triangular filtering characteristic to obtain at least one piece of Mel power spectrum data; and taking the logarithmic energy of the at least one piece of Mel power spectrum data to obtain at least one piece of Mel log-energy spectrum data (i.e. the Filter bank feature). The DCT is then used to remove the data correlation of the at least one piece of Mel log-energy spectrum data to obtain the MFCC feature. The acoustic model training unit 15 uses the MFCC feature as the input data of the GMM so as to train the GMM and the HMM, and obtains the likelihood probability feature of each phoneme state in the at least one phoneme state output by the trained GMM and the phoneme transition probabilities of the trained HMM. It is to be understood that, for the same frame of audio data in the training audio corpus, there is a one-to-one relationship between its Filter bank feature and its MFCC feature.
The feature conversion unit 16 is configured to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state using a forced alignment operation.

In a specific implementation, the feature conversion unit 16 may convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state using a forced alignment operation. It is to be understood that, since a likelihood probability feature is a similarity-based probability feature, for one frame of audio data in the training audio corpus the feature values of its likelihood probability features over the phoneme states do not sum to 1, whereas the feature values of its posterior probability features over the phoneme states must sum to 1. It is therefore necessary to select the phoneme state with the largest likelihood probability feature value, set the feature value of the posterior probability feature on this phoneme state to 1, and set the feature values of the posterior probability features on the other phoneme states of this frame of audio data to 0. By analogy, the likelihood probability features on the phoneme states of every frame of audio data in the training audio corpus are converted, obtaining the posterior probability features on the phoneme states of every frame of audio data in the training audio corpus.
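The conversion described above reduces to a per-frame argmax followed by one-hot encoding, which can be sketched as:

```python
import numpy as np

def likelihood_to_posterior(likelihoods):
    """likelihoods: (T, S) per-frame likelihood features (rows need not sum to 1).
    Returns (T, S) one-hot posterior features (each row sums to exactly 1)."""
    best = likelihoods.argmax(axis=1)                  # state with largest likelihood
    posteriors = np.zeros_like(likelihoods)
    posteriors[np.arange(len(likelihoods)), best] = 1.0   # 1 on that state, 0 elsewhere
    return posteriors

lik = np.array([[0.2, 0.9, 0.1],
                [0.7, 0.3, 0.4]])
post = likelihood_to_posterior(lik)
```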
The parameter calculation unit 17 is configured to calculate the matrix weight values and matrix biases between the output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state.

The acoustic model generating unit 18 is configured to add the matrix weight values and the matrix biases to the DNN model to generate the trained DNN model.

In a specific implementation, the parameter calculation unit 17 may calculate the matrix weight values and matrix biases between the output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state. Preferably, the parameter calculation unit 17 may extract, based on the above method, the training Filter bank feature corresponding to every frame of audio data in the training audio corpus, and take each training Filter bank feature together with its corresponding posterior probability feature as a training sample pair. Preferably, multiple training sample pairs may exist in the training audio corpus; based on the multiple training sample pairs, the backward propagation algorithm under the maximum likelihood criterion is used to calculate the matrix weight values and matrix biases between the output layer nodes in the DNN model. The acoustic model generating unit 18 adds the matrix weight values and the matrix biases to the DNN model to generate the trained DNN model.
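A toy sketch of fitting matrix weight values and matrix biases from (training Filter bank feature, posterior probability feature) sample pairs by gradient descent: a single softmax layer with a cross-entropy objective stands in for the full multi-layer topology and the patent's "backward propagation under the maximum likelihood criterion"; all sizes, data, and the learning rate are assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
D, P, T = 8, 4, 32                             # feature dims, states, frames (toy)
X = rng.standard_normal((T, D))                # training Filter bank features
states = rng.integers(0, P, T)                 # forced-aligned state per frame
Y = np.eye(P)[states]                          # one-hot posterior features

def forward(W, b):
    logits = X @ W.T + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

W, b = np.zeros((P, D)), np.zeros(P)
loss_start = -np.mean(np.sum(Y * np.log(forward(W, b)), axis=1))
for _ in range(500):
    probs = forward(W, b)
    grad = (probs - Y) / T                     # d(cross-entropy)/d(logits)
    W -= 0.1 * grad.T @ X                      # matrix weight value update
    b -= 0.1 * grad.sum(axis=0)                # matrix bias update
loss_end = -np.mean(np.sum(Y * np.log(forward(W, b)), axis=1))
```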
The language model generating unit 19 is configured to obtain the occurrence probabilities of training word sequence data in a training word sequence corpus, and generate an N-Gram language model according to the occurrence probabilities of the training word sequence data.

In a specific implementation, while the acoustic model of the DNN model and the HMM is being trained, the language model generating unit 19 may train the language model. The language model generating unit 19 may obtain the occurrence probabilities of the training word sequence data in the training word sequence corpus and generate the N-Gram language model according to these occurrence probabilities. The N-Gram language model is based on the assumption that the occurrence of the k-th word is related only to the preceding K-1 words and is unrelated to any other word, so that the probability of a word string is the product of the occurrence probabilities of its words.
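The stated assumption can be illustrated with a toy bigram (N = 2) model on an assumed miniature corpus, where a word string's probability is the product of each word's conditional occurrence probability given its predecessor:

```python
from collections import Counter

corpus = "open the door open the window".split()   # toy training word sequence corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(sentence):
    """P(w1..wn) = P(w1) * prod P(w_k | w_{k-1}), via counted occurrence probabilities."""
    words = sentence.split()
    p = unigrams[words[0]] / len(corpus)
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

p_door = bigram_prob("open the door")      # (2/6) * (2/2) * (1/2) = 1/6
p_window = bigram_prob("open the window")  # (2/6) * (2/2) * (1/2) = 1/6
```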
The audio data acquiring unit 11 is configured to acquire target audio data input based on an interactive application.

In a specific implementation, the audio data acquiring unit 11 acquires the target audio data input by a user based on the interactive application. The target audio data may specifically be the voice input by the user through the application interface of the interactive application that currently needs voice input, and is the audio data on which speech recognition currently needs to be performed.

The feature extraction unit 12 is configured to extract the target Filter bank feature in the target audio data.

In a specific implementation, the feature extraction unit 12 may extract the target Filter bank feature from the target audio data. It should be noted that the feature extraction unit 12 needs to split the target audio data into multiple frames of audio data, extract the Filter bank feature of each frame of audio data, and input the features frame by frame into the trained DNN model described below, i.e. framewise input for the calculation of the posterior probability features of the phoneme states. Preferably, the feature extraction unit 12 may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and obtain the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data on which the posterior probability feature calculation currently actually needs to be performed; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
Specifically, refer also to Fig. 5, which provides a schematic structural diagram of the feature extraction unit according to an embodiment of the present invention. As shown in Fig. 5, the feature extraction unit 12 may include:

a first data acquiring subunit 121, configured to perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data;

a first feature acquiring subunit 122, configured to obtain the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.

In a specific implementation, the first data acquiring subunit 121 needs to split the target audio data into multiple frames of audio data so that the Filter bank feature of each frame of audio data can be extracted and input frame by frame into the trained DNN model described below, i.e. framewise input for the calculation of the posterior probability features of the phoneme states. Preferably, the first data acquiring subunit 121 may perform data framing on the target audio data to obtain the at least one frame of audio data in the target audio data, and the first feature acquiring subunit 122 obtains the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data on which the posterior probability feature calculation currently actually needs to be performed; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
Further, the first data acquiring subunit 121 may perform data preprocessing on the target audio data. The data preprocessing may include: data framing, data pre-emphasis, data windowing, and other operations to obtain at least one frame of audio data in the time domain; performing a fast Fourier transform to transform the at least one frame of audio data into the frequency domain, obtaining at least one piece of power spectrum data corresponding to the at least one frame of audio data in the frequency domain; passing the at least one piece of power spectrum data in the frequency domain through mel-frequency filters having a triangular filtering characteristic to obtain at least one piece of Mel power spectrum data; and taking the logarithmic energy of the at least one piece of Mel power spectrum data to obtain at least one piece of Mel log-energy spectrum data. The set of Mel log-energy spectrum data obtained at this point is the target Filter bank feature. It is to be understood that a Filter bank feature has data correlation between different feature dimensions, whereas an MFCC feature is the feature obtained by using the DCT to remove the data correlation of the Filter bank feature.
Preferably, the first feature acquiring subunit 122 may further perform feature post-processing on the target Filter bank feature. The feature post-processing may include feature extension and feature normalization. Feature extension may be computing the first-order difference and second-order difference features of the target Filter bank feature to obtain a target Filter bank feature of a preset number of dimensions corresponding to each frame of first audio data; feature normalization may be using the CMS technique to normalize the target Filter bank feature of the preset dimensions corresponding to each frame of first audio data, obtaining the first target Filter bank feature corresponding to each frame of first audio data. Preferably, the preset dimension may be 72.
The feature acquiring unit 13 is configured to use the target Filter bank feature in the target audio data as the input data of the trained DNN model, and obtain the posterior probability features on the target phoneme states of the target audio data output by the trained DNN model.

In a specific implementation, the feature acquiring unit 13 may use the target Filter bank feature in the target audio data as the input data of the trained DNN model and obtain the posterior probability features on the target phoneme states of the target audio data output by the trained DNN model. Preferably, a phoneme state corresponds to a phonetic symbol, and the target phoneme states are the phoneme states present in the target audio data. The DNN model can obtain the matrix weight values and matrix biases between the output layer nodes during the training process. There may be at least one output layer node; the number of output layer nodes is related to the number of phoneme states (for example, equal to it), i.e. one output layer node represents the feature vector of one phoneme state.
Specifically, refer also to Fig. 6, which provides a schematic structural diagram of the feature acquiring unit according to an embodiment of the present invention. As shown in Fig. 6, the feature acquiring unit 13 may include:

a second data acquiring subunit 131, configured to obtain, according to the time ordering of the at least one frame of audio data, second audio data of a preset number of frames before and after each frame of first audio data;

a second feature acquiring subunit 132, configured to use the first target Filter bank feature and the second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and obtain the posterior probability features on the target phoneme states of the first target Filter bank feature output by the trained DNN model.

In a specific implementation, the second data acquiring subunit 131 may obtain, according to the time ordering of the at least one frame of audio data, the second audio data of the preset number of frames before and after each frame of first audio data; the second feature acquiring subunit 132 uses the first target Filter bank feature and the second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and obtains the posterior probability features on the target phoneme states of the first target Filter bank feature output by the trained DNN model. It is to be understood that the second audio data is data that has a dimensional correlation with the first audio data.
Assume that the target audio data contains N frames of audio data, and that the first target Filter bank feature corresponding to the i-th frame of first audio data is F_i, i = 1, 2, 3, ..., N. If the preset number of frames before and after is 8 frames each, the input data may include F_i and the second target Filter bank features of the 8 frames before and after the i-th frame of first audio data. Based on the above preferred preset dimension, the number of input layer nodes of the trained DNN model corresponding to the input data is (8+1+8) * 72 = 1224 nodes, and the number of output layer nodes of the trained DNN model is equal to the number P of all phoneme states. A preset number of hidden layers exists between the input layer and the output layer; the number of hidden layers is preferably 3, and each hidden layer has 1024 nodes. In the trained DNN model, the matrix weight value and matrix bias between the nodes of the (M-1)-th layer and the nodes of the M-th layer may be expressed as W_M and b_M, and the feature vector y_i^M of the i-th frame of first audio data at the M-th layer satisfies y_i^M = f(W_M * y_i^(M-1) + b_M), where f(x) is the activation function, preferably the ReLU function. The posterior probability feature of F_i on the m-th phoneme state output by the trained DNN model is then the softmax over the P output layer nodes: p_i^m = exp(y_i(m)) / Σ_{j=1..P} exp(y_i(j)).
The word sequence data acquisition unit 14 is configured to create a phoneme decoding network associated with the target audio data, and to use the phoneme transition probabilities of the trained HMM together with the posterior probability features on the target phoneme states of the target audio data in the decoding network to obtain the target word sequence data corresponding to the target audio data.
In a specific implementation, the word sequence data acquisition unit 14 may create a phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word-graph decoding network built on a WFST framework, taking a phoneme state sequence as input and word sequence data as output. It should be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.
The word sequence data acquisition unit 14 uses the phoneme transition probabilities of the trained HMM and the posterior probability features on the target phoneme states of the target audio data in the decoding network to obtain the target word sequence data corresponding to the target audio data. The phoneme transition probabilities of the trained HMM include, for each phoneme state, the probability of the state transitioning to itself and the probability of the state transitioning to the next phoneme state. It can be understood that the word sequence data acquisition unit 14 may, according to the phoneme transition probabilities of the trained HMM and the posterior probability features on the target phoneme states of all the first target Filter bank features, compute the probability value of every network path in the phoneme decoding network, select the optimal path according to the probability values of the network paths, and take the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data.
Further, the word sequence data acquisition unit 14 may use the phoneme transition probabilities of the trained HMM, the posterior probability features on the target phoneme states of the first target Filter bank features, and the N-Gram language model to obtain the target word sequence data corresponding to the target audio data in the decoding network. Because the N-Gram language model can infer the probability of the next word occurring, the probability value of each network path can be weighted by this occurrence probability, increasing the probability of the correct network path. Obtaining the target word sequence data corresponding to the target audio data in combination with the N-Gram language model can therefore further improve the accuracy of speech recognition.
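The path weighting described above can be illustrated with a small sketch (toy probabilities and hypothetical paths; the actual decoder operates on a WFST word graph): each network path is scored by combining, in log space, the frame posterior probabilities, the HMM transition probabilities, and the N-Gram occurrence probabilities, and the path with the highest combined score is selected:

```python
import math

def path_score(posteriors, transitions, lm_probs):
    """Log-score of one decoding path: per-frame acoustic posteriors,
    HMM self/next transition probabilities, and N-Gram word probabilities.
    The three factors combine additively in log space, so the language
    model re-weights each network path."""
    return (sum(math.log(p) for p in posteriors)
            + sum(math.log(p) for p in transitions)
            + sum(math.log(p) for p in lm_probs))

# Two hypothetical network paths for the same audio (illustrative numbers):
# path_a has the better acoustic score, path_b the better LM score.
path_a = path_score([0.9, 0.8, 0.7], [0.6, 0.4], [0.3])
path_b = path_score([0.9, 0.8, 0.6], [0.6, 0.4], [0.5])
best = max(("path_a", path_a), ("path_b", path_b), key=lambda t: t[1])
```

Here the language-model weighting flips the ranking: the path favoured by the acoustic model alone loses to the path whose word sequence is more probable.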
In the embodiments of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank features in the target audio data are obtained, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain target word sequence data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and because the Filter bank features serve as the input data of the acoustic model, there is no need to remove the correlation between feature dimensions, so speech recognition can accommodate a variety of real application environments and pronunciation habits, improving recognition accuracy. By integrating the Filter bank feature extraction method with the training method of the DNN-HMM acoustic model, a complete process from training to recognition is realized. The target word sequence data corresponding to the target audio data is obtained in combination with the N-Gram language model; since the N-Gram language model can infer the probability of the next word occurring, the probability value of each network path can be weighted by this occurrence probability, increasing the probability of the correct network path and further improving the accuracy of speech recognition.
Referring to Fig. 7, a structural schematic diagram of another speech recognition device provided by an embodiment of the present invention is shown. As shown in Fig. 7, the speech recognition device 1000 may include: at least one processor 1001 (for example a CPU), at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 implements the connection and communication between these components. The user interface 1003 may include a display and a keyboard; optionally, the user interface 1003 may also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory, or a non-volatile memory, for example at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in Fig. 7, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a speech recognition application program.
In the speech recognition device 1000 shown in Fig. 7, the user interface 1003 is mainly used to provide the user with an input interface and to obtain the data input by the user, while the processor 1001 may be used to call the speech recognition application program stored in the memory 1005 and specifically perform the following operations:
obtaining target audio data input through an interactive application;
extracting target Filter bank features from the target audio data;
taking the target Filter bank features in the target audio data as input data of a trained DNN model, and obtaining the posterior probability features, output by the trained DNN model, on the target phoneme states of the target audio data;
creating a phoneme decoding network associated with the target audio data, and using the phoneme transition probabilities of a trained HMM and the posterior probability features on the target phoneme states of the target audio data in the decoding network to obtain the target word sequence data corresponding to the target audio data.
In one embodiment, before performing the operation of obtaining target audio data input through an interactive application, the processor 1001 also performs the following operations:
training a GMM and an HMM with a training audio corpus, obtaining the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtaining the phoneme transition probabilities of the trained HMM;
converting the likelihood probability feature of each phoneme state into the posterior probability feature of that phoneme state using a forced alignment operation;
calculating the inter-node matrix weight values and matrix bias values of the output layer in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state;
adding the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
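A minimal sketch of the last two steps, assuming a toy corpus of (Filter bank feature, forced-alignment state) pairs and plain gradient descent on a softmax output layer (the patent does not fix the optimizer; the function names and toy data are illustrative):

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_output_layer(feats, labels, n_states, lr=0.5, epochs=200):
    """Gradient-descent estimate of the output-layer matrix weights W and
    biases b of a softmax over phoneme states, trained on (feature,
    aligned-state) pairs produced by forced alignment.  Pure-Python sketch."""
    dim = len(feats[0])
    W = [[0.0] * dim for _ in range(n_states)]
    b = [0.0] * n_states
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            logits = [sum(w * v for w, v in zip(W[s], x)) + b[s]
                      for s in range(n_states)]
            p = softmax(logits)
            for s in range(n_states):
                g = p[s] - (1.0 if s == y else 0.0)  # softmax CE gradient
                b[s] -= lr * g
                for j in range(dim):
                    W[s][j] -= lr * g * x[j]
    return W, b

# Toy corpus: 2-D features aligned to 2 phoneme states.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
W, b = train_output_layer(feats, labels, n_states=2)
pred = softmax([sum(w * v for w, v in zip(W[s], feats[0])) + b[s]
                for s in range(2)])
```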
In one embodiment, before performing the operation of obtaining target audio data input through an interactive application, the processor 1001 also performs the following operation:
obtaining the occurrence probabilities of training word sequence data from a training word sequence corpus, and generating an N-Gram language model according to the occurrence probabilities of the training word sequence data.
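A minimal sketch of estimating such occurrence probabilities, assuming a toy word-sequence corpus and N = 2 (bigram); the `<s>` start token and function names are illustrative:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate bigram probabilities P(w_i | w_{i-1}) by counting word-pair
    occurrences in a training word-sequence corpus (toy data assumed)."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent               # sentence-start marker
        uni.update(toks[:-1])               # history counts
        bi.update(zip(toks[:-1], toks[1:])) # pair counts
    return {pair: c / uni[pair[0]] for pair, c in bi.items()}

corpus = [["open", "the", "door"],
          ["open", "the", "window"],
          ["close", "the", "door"]]
lm = train_bigram(corpus)
```

With this corpus the model infers, for example, that "door" follows "the" with probability 2/3, which is exactly the occurrence probability the decoder would use to weight a network path.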
In one embodiment, when extracting the target Filter bank features from the target audio data, the processor 1001 specifically performs the following operations:
performing data framing on the target audio data to obtain at least one frame of audio data in the target audio data;
obtaining the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
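The data framing step can be sketched as follows (a 25 ms window with a 10 ms hop at 16 kHz is a common convention; the patent does not fix these values):

```python
def frame_audio(samples, frame_len=400, hop=160):
    """Split an audio sample sequence into overlapping frames.
    At 16 kHz, frame_len=400 samples is a 25 ms window and hop=160
    samples is a 10 ms shift -- assumed defaults, not from the patent."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

frames = frame_audio([0.0] * 1600)   # 100 ms of silence at 16 kHz
```

Each resulting frame would then be passed to the Filter bank feature extraction to produce one first target Filter bank feature per frame.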
In one embodiment, when taking the target Filter bank features in the target audio data as the input data of the trained DNN model and obtaining the posterior probability features, output by the trained DNN model, on the target phoneme states of the target audio data, the processor 1001 specifically performs the following operations:
obtaining, according to the time ordering of the at least one frame of audio data, the second audio data of a preset number of frames before and after each frame of first audio data;
taking the first target Filter bank feature and the second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and obtaining the posterior probability features, output by the trained DNN model, on the target phoneme states of the first target Filter bank feature;
wherein the first audio data is the data for which posterior probability feature calculation is currently required, and the second audio data is the data that has dimensional correlation with the first audio data.
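The construction of the DNN input from a frame and its preceding and following frames can be sketched as follows (edge frames are replicated where neighbours are missing, which is one common convention the patent does not specify; with 8 context frames on each side and 72-dimensional features, the spliced vector has the (8+1+8) × 72 = 1224 dimensions described earlier):

```python
def splice(features, context=8):
    """Build the DNN input for each frame by concatenating its own
    Filter bank feature with those of the `context` frames before and
    after it, replicating edge frames where neighbours are missing."""
    spliced = []
    n = len(features)
    for i in range(n):
        window = []
        for j in range(i - context, i + context + 1):
            window.extend(features[min(max(j, 0), n - 1)])  # clamp to edges
        spliced.append(window)
    return spliced

feats = [[float(i)] * 72 for i in range(20)]   # 20 frames of 72-dim features
inputs = splice(feats)
```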
In one embodiment, when creating the phoneme decoding network associated with the target audio data and using the phoneme transition probabilities of the trained HMM and the posterior probability features on the target phoneme states of the target audio data in the decoding network to obtain the target word sequence data corresponding to the target audio data, the processor 1001 specifically performs the following operations:
creating the phoneme decoding network associated with the target audio data, and using the phoneme transition probabilities of the trained HMM, the posterior probability features on the target phoneme states of the first target Filter bank feature, and the N-Gram language model to obtain the target word sequence data corresponding to the target audio data in the decoding network.
In the embodiments of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank features in the target audio data are obtained, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain target word sequence data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and because the Filter bank features serve as the input data of the acoustic model, there is no need to remove the correlation between feature dimensions, so speech recognition can accommodate a variety of real application environments and pronunciation habits, improving recognition accuracy. By integrating the Filter bank feature extraction method with the training method of the DNN-HMM acoustic model, a complete process from training to recognition is realized. The target word sequence data corresponding to the target audio data is obtained in combination with the N-Gram language model; since the N-Gram language model can infer the probability of the next word occurring, the probability value of each network path can be weighted by this occurrence probability, increasing the probability of the correct network path and further improving the accuracy of speech recognition.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure is merely a description of preferred embodiments of the present invention and certainly cannot be used to limit the scope of the claims of the present invention. Equivalent variations made according to the claims of the present invention therefore still fall within the scope covered by the present invention.
Claims (12)
1. A speech recognition method, characterized by comprising:
obtaining target audio data input through an interactive application;
extracting target filter bank (Filter bank) features from the target audio data;
taking the target Filter bank features in the target audio data as input data of a trained deep neural network (DNN) model, and obtaining posterior probability features, output by the trained DNN model, on target phoneme states of the target audio data;
creating a phoneme decoding network associated with the target audio data, and using phoneme transition probabilities of a trained hidden Markov model (HMM) and the posterior probability features on the target phoneme states of the target audio data in the decoding network to obtain target word sequence data corresponding to the target audio data.
2. The method according to claim 1, characterized in that before the obtaining of the target audio data input through the interactive application, the method further comprises:
training a Gaussian mixture model (GMM) and the HMM with a training audio corpus, obtaining a likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtaining the phoneme transition probabilities of the trained HMM;
converting the likelihood probability feature of each phoneme state into a posterior probability feature of that phoneme state using a forced alignment operation;
calculating inter-node matrix weight values and matrix bias values of an output layer in the DNN model according to training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state;
adding the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
3. The method according to claim 2, characterized in that before the obtaining of the target audio data input through the interactive application, the method further comprises:
obtaining occurrence probabilities of training word sequence data from a training word sequence corpus, and generating an N-Gram language model according to the occurrence probabilities of the training word sequence data.
4. The method according to claim 3, characterized in that the extracting of the target Filter bank features from the target audio data comprises:
performing data framing on the target audio data to obtain at least one frame of audio data in the target audio data;
obtaining a first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
5. The method according to claim 4, characterized in that the taking of the target Filter bank features in the target audio data as the input data of the trained DNN model and the obtaining of the posterior probability features, output by the trained DNN model, on the target phoneme states of the target audio data comprise:
obtaining, according to a time ordering of the at least one frame of audio data, second audio data of a preset number of frames before and after each frame of first audio data;
taking the first target Filter bank feature and second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and obtaining the posterior probability features, output by the trained DNN model, on the target phoneme states of the first target Filter bank feature;
wherein the first audio data is data for which posterior probability feature calculation is currently required, and the second audio data is data having dimensional correlation with the first audio data.
6. The method according to claim 5, characterized in that the creating of the phoneme decoding network associated with the target audio data and the using of the phoneme transition probabilities of the trained HMM and the posterior probability features on the target phoneme states of the target audio data in the decoding network to obtain the target word sequence data corresponding to the target audio data comprise:
creating the phoneme decoding network associated with the target audio data, and using the phoneme transition probabilities of the trained HMM, the posterior probability features on the target phoneme states of the first target Filter bank feature, and the N-Gram language model to obtain the target word sequence data corresponding to the target audio data in the decoding network.
7. A speech recognition device, characterized by comprising:
an audio data acquisition unit, configured to obtain target audio data input through an interactive application;
a feature extraction unit, configured to extract target Filter bank features from the target audio data;
a feature acquisition unit, configured to take the target Filter bank features in the target audio data as input data of a trained DNN model, and to obtain posterior probability features, output by the trained DNN model, on target phoneme states of the target audio data;
a word sequence data acquisition unit, configured to create a phoneme decoding network associated with the target audio data, and to use phoneme transition probabilities of a trained HMM and the posterior probability features on the target phoneme states of the target audio data in the decoding network to obtain target word sequence data corresponding to the target audio data.
8. The device according to claim 7, characterized by further comprising:
an acoustic model training unit, configured to train a GMM and the HMM with a training audio corpus, to obtain a likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and to obtain the phoneme transition probabilities of the trained HMM;
a feature conversion unit, configured to convert the likelihood probability feature of each phoneme state into a posterior probability feature of that phoneme state using a forced alignment operation;
a parameter calculation unit, configured to calculate inter-node matrix weight values and matrix bias values of an output layer in the DNN model according to training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state;
an acoustic model generation unit, configured to add the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
9. The device according to claim 8, characterized by further comprising:
a language model generation unit, configured to obtain occurrence probabilities of training word sequence data from a training word sequence corpus, and to generate an N-Gram language model according to the occurrence probabilities of the training word sequence data.
10. The device according to claim 9, characterized in that the feature extraction unit comprises:
a first data acquisition subunit, configured to perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data;
a first feature acquisition subunit, configured to obtain a first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
11. The device according to claim 10, characterized in that the feature acquisition unit comprises:
a second data acquisition subunit, configured to obtain, according to a time ordering of the at least one frame of audio data, second audio data of a preset number of frames before and after each frame of first audio data;
a second feature acquisition subunit, configured to take the first target Filter bank feature and second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and to obtain the posterior probability features, output by the trained DNN model, on the target phoneme states of the first target Filter bank feature;
wherein the first audio data is data for which posterior probability feature calculation is currently required, and the second audio data is data having dimensional correlation with the first audio data.
12. The device according to claim 11, characterized in that the word sequence data acquisition unit is specifically configured to create the phoneme decoding network associated with the target audio data, and to use the phoneme transition probabilities of the trained HMM, the posterior probability features on the target phoneme states of the first target Filter bank feature, and the N-Gram language model to obtain the target word sequence data corresponding to the target audio data in the decoding network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610272292.3A CN105976812B (en) | 2016-04-28 | 2016-04-28 | A kind of audio recognition method and its equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105976812A true CN105976812A (en) | 2016-09-28 |
CN105976812B CN105976812B (en) | 2019-04-26 |
Family
ID=56994150
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610272292.3A Active CN105976812B (en) | 2016-04-28 | 2016-04-28 | A kind of audio recognition method and its equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105976812B (en) |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106601240A (en) * | 2015-10-16 | 2017-04-26 | 三星电子株式会社 | Apparatus and method for normalizing input data of acoustic model and speech recognition apparatus |
CN106710599A (en) * | 2016-12-02 | 2017-05-24 | 深圳撒哈拉数据科技有限公司 | Particular sound source detection method and particular sound source detection system based on deep neural network |
CN106919662A (en) * | 2017-02-14 | 2017-07-04 | 复旦大学 | A kind of music recognition methods and system |
CN106952645A (en) * | 2017-03-24 | 2017-07-14 | 广东美的制冷设备有限公司 | The recognition methods of phonetic order, the identifying device of phonetic order and air-conditioner |
CN107170444A (en) * | 2017-06-15 | 2017-09-15 | 上海航空电器有限公司 | Aviation cockpit environment self-adaption phonetic feature model training method |
CN107331384A (en) * | 2017-06-12 | 2017-11-07 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN107633842A (en) * | 2017-06-12 | 2018-01-26 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN107748898A (en) * | 2017-11-03 | 2018-03-02 | 北京奇虎科技有限公司 | File classifying method, device, computing device and computer-readable storage medium |
CN107871506A (en) * | 2017-11-15 | 2018-04-03 | 北京云知声信息技术有限公司 | The awakening method and device of speech identifying function |
CN108245177A (en) * | 2018-01-05 | 2018-07-06 | 安徽大学 | Intelligent infant monitoring wearable device and GMM-HMM-DNN-based infant crying identification method |
CN108281137A (en) * | 2017-01-03 | 2018-07-13 | 中国科学院声学研究所 | A kind of universal phonetic under whole tone element frame wakes up recognition methods and system |
CN108288467A (en) * | 2017-06-07 | 2018-07-17 | 腾讯科技(深圳)有限公司 | A kind of audio recognition method, device and speech recognition engine |
CN108648769A (en) * | 2018-04-20 | 2018-10-12 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, apparatus and equipment |
CN108922521A (en) * | 2018-08-15 | 2018-11-30 | 合肥讯飞数码科技有限公司 | A kind of voice keyword retrieval method, apparatus, equipment and storage medium |
WO2018232591A1 (en) * | 2017-06-20 | 2018-12-27 | Microsoft Technology Licensing, Llc. | Sequence recognition processing |
CN109274845A (en) * | 2018-08-31 | 2019-01-25 | 平安科技(深圳)有限公司 | Intelligent sound pays a return visit method, apparatus, computer equipment and storage medium automatically |
WO2019019252A1 (en) * | 2017-07-28 | 2019-01-31 | 平安科技(深圳)有限公司 | Acoustic model training method, speech recognition method and apparatus, device and medium |
CN109637523A (en) * | 2018-12-28 | 2019-04-16 | 睿驰达新能源汽车科技(北京)有限公司 | A kind of voice-based door lock for vehicle control method and device |
CN109863554A (en) * | 2016-10-27 | 2019-06-07 | 香港中文大学 | Acoustics font model and acoustics font phonemic model for area of computer aided pronunciation training and speech processes |
CN109887484A (en) * | 2019-02-22 | 2019-06-14 | 平安科技(深圳)有限公司 | A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device |
CN110390948A (en) * | 2019-07-24 | 2019-10-29 | 厦门快商通科技股份有限公司 | A kind of method and system of Rapid Speech identification |
CN110491382A (en) * | 2019-03-11 | 2019-11-22 | 腾讯科技(深圳)有限公司 | Audio recognition method, device and interactive voice equipment based on artificial intelligence |
CN110491388A (en) * | 2018-05-15 | 2019-11-22 | 视联动力信息技术股份有限公司 | A kind of processing method and terminal of audio data |
CN110556125A (en) * | 2019-10-15 | 2019-12-10 | 出门问问信息科技有限公司 | Feature extraction method and device based on voice signal and computer storage medium |
CN111243574A (en) * | 2020-01-13 | 2020-06-05 | 苏州奇梦者网络科技有限公司 | Voice model adaptive training method, system, device and storage medium |
CN111613209A (en) * | 2020-04-14 | 2020-09-01 | 北京三快在线科技有限公司 | Acoustic model training method and device, electronic equipment and storage medium |
CN111785256A (en) * | 2020-06-28 | 2020-10-16 | 北京三快在线科技有限公司 | Acoustic model training method and device, electronic equipment and storage medium |
CN112863496A (en) * | 2019-11-27 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Voice endpoint detection method and device |
CN113284514A (en) * | 2021-05-19 | 2021-08-20 | 北京大米科技有限公司 | Audio processing method and device |
CN113640699A (en) * | 2021-10-14 | 2021-11-12 | 南京国铁电气有限责任公司 | Fault judgment method, system and equipment for microcomputer control type alternating current and direct current power supply system |
CN113780408A (en) * | 2021-09-09 | 2021-12-10 | 安徽农业大学 | Live pig state identification method based on audio features |
CN116978368A (en) * | 2023-09-25 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Wake-up word detection method and related device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103559879A (en) * | 2013-11-08 | 2014-02-05 | 安徽科大讯飞信息科技股份有限公司 | Method and device for extracting acoustic features in language identification system |
CN105118501A (en) * | 2015-09-07 | 2015-12-02 | 徐洋 | Speech recognition method and system |
US9240184B1 (en) * | 2012-11-15 | 2016-01-19 | Google Inc. | Frame-level combination of deep neural network and gaussian mixture models |
- 2016-04-28 CN CN201610272292.3A patent/CN105976812B/en active Active
Non-Patent Citations (4)
Title |
---|
张德良: "《深度神经网络在中文语音识别系统中的实现》", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
王一等: "《一种基于层次结构深度信念网络的音素识别方法》", 《应用科学学报》 * |
肖业鸣等: "《深度神经网络在汉语语音识别声学建模中的优化策略》", 《重庆邮电大学学报(自然科学版)》 * |
麦麦提艾力.吐尔逊等: "《深度神经网络在维吾尔语大词汇量连续语音识别中的应用》", 《数据采集与处理》 * |
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106601240A (en) * | 2015-10-16 | 2017-04-26 | 三星电子株式会社 | Apparatus and method for normalizing input data of acoustic model and speech recognition apparatus |
CN106601240B (en) * | 2015-10-16 | 2021-10-01 | 三星电子株式会社 | Apparatus and method for normalizing input data of acoustic model, and speech recognition apparatus |
CN109863554B (en) * | 2016-10-27 | 2022-12-02 | 香港中文大学 | Acoustic font model and acoustic font phoneme model for computer-aided pronunciation training and speech processing |
CN109863554A (en) * | 2016-10-27 | 2019-06-07 | 香港中文大学 | Acoustics font model and acoustics font phonemic model for area of computer aided pronunciation training and speech processes |
CN106710599A (en) * | 2016-12-02 | 2017-05-24 | 深圳撒哈拉数据科技有限公司 | Particular sound source detection method and particular sound source detection system based on deep neural network |
CN108281137A (en) * | 2017-01-03 | 2018-07-13 | 中国科学院声学研究所 | A kind of universal phonetic under whole tone element frame wakes up recognition methods and system |
CN106919662A (en) * | 2017-02-14 | 2017-07-04 | 复旦大学 | A kind of music recognition methods and system |
CN106919662B (en) * | 2017-02-14 | 2021-08-31 | 复旦大学 | Music identification method and system |
CN106952645A (en) * | 2017-03-24 | 2017-07-14 | 广东美的制冷设备有限公司 | The recognition methods of phonetic order, the identifying device of phonetic order and air-conditioner |
CN106952645B (en) * | 2017-03-24 | 2020-11-17 | 广东美的制冷设备有限公司 | Voice instruction recognition method, voice instruction recognition device and air conditioner |
CN108288467A (en) * | 2017-06-07 | 2018-07-17 | 腾讯科技(深圳)有限公司 | A kind of audio recognition method, device and speech recognition engine |
CN108288467B (en) * | 2017-06-07 | 2020-07-14 | 腾讯科技(深圳)有限公司 | Voice recognition method and device and voice recognition engine |
US11062699B2 (en) | 2017-06-12 | 2021-07-13 | Ping An Technology (Shenzhen) Co., Ltd. | Speech recognition with trained GMM-HMM and LSTM models |
CN107331384B (en) * | 2017-06-12 | 2018-05-04 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN107633842A (en) * | 2017-06-12 | 2018-01-26 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
WO2018227780A1 (en) * | 2017-06-12 | 2018-12-20 | 平安科技(深圳)有限公司 | Speech recognition method and device, computer device and storage medium |
WO2018227781A1 (en) * | 2017-06-12 | 2018-12-20 | 平安科技(深圳)有限公司 | Voice recognition method, apparatus, computer device, and storage medium |
CN107331384A (en) * | 2017-06-12 | 2017-11-07 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN107170444A (en) * | 2017-06-15 | 2017-09-15 | 上海航空电器有限公司 | Aviation cockpit environment self-adaption phonetic feature model training method |
WO2018232591A1 (en) * | 2017-06-20 | 2018-12-27 | Microsoft Technology Licensing, Llc. | Sequence recognition processing |
WO2019019252A1 (en) * | 2017-07-28 | 2019-01-31 | 平安科技(深圳)有限公司 | Acoustic model training method, speech recognition method and apparatus, device and medium |
CN107748898A (en) * | 2017-11-03 | 2018-03-02 | 北京奇虎科技有限公司 | File classifying method, device, computing device and computer-readable storage medium |
CN107871506A (en) * | 2017-11-15 | 2018-04-03 | 北京云知声信息技术有限公司 | The awakening method and device of speech identifying function |
CN108245177A (en) * | 2018-01-05 | 2018-07-06 | 安徽大学 | Intelligent infant monitoring wearable device and GMM-HMM-DNN-based infant crying identification method |
CN108245177B (en) * | 2018-01-05 | 2021-01-01 | 安徽大学 | Intelligent infant monitoring wearable device and GMM-HMM-DNN-based infant crying identification method |
CN108648769A (en) * | 2018-04-20 | 2018-10-12 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, apparatus and equipment |
CN110491388A (en) * | 2018-05-15 | 2019-11-22 | 视联动力信息技术股份有限公司 | Audio data processing method and terminal |
CN108922521A (en) * | 2018-08-15 | 2018-11-30 | 合肥讯飞数码科技有限公司 | Voice keyword retrieval method, apparatus, device and storage medium |
CN109274845A (en) * | 2018-08-31 | 2019-01-25 | 平安科技(深圳)有限公司 | Intelligent voice automatic return-visit method, apparatus, computer device and storage medium |
CN109637523A (en) * | 2018-12-28 | 2019-04-16 | 睿驰达新能源汽车科技(北京)有限公司 | Voice-based vehicle door lock control method and device |
CN109887484A (en) * | 2019-02-22 | 2019-06-14 | 平安科技(深圳)有限公司 | Dual learning-based voice recognition and voice synthesis method and device |
CN109887484B (en) * | 2019-02-22 | 2023-08-04 | 平安科技(深圳)有限公司 | Dual learning-based voice recognition and voice synthesis method and device |
CN110491382A (en) * | 2019-03-11 | 2019-11-22 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based speech recognition method and device, and voice interaction equipment |
CN110390948B (en) * | 2019-07-24 | 2022-04-19 | 厦门快商通科技股份有限公司 | Method and system for rapid speech recognition |
CN110390948A (en) * | 2019-07-24 | 2019-10-29 | 厦门快商通科技股份有限公司 | Method and system for rapid speech recognition |
CN110556125A (en) * | 2019-10-15 | 2019-12-10 | 出门问问信息科技有限公司 | Feature extraction method and device based on voice signal and computer storage medium |
CN112863496A (en) * | 2019-11-27 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Voice endpoint detection method and device |
CN112863496B (en) * | 2019-11-27 | 2024-04-02 | 阿里巴巴集团控股有限公司 | Voice endpoint detection method and device |
CN111243574A (en) * | 2020-01-13 | 2020-06-05 | 苏州奇梦者网络科技有限公司 | Voice model adaptive training method, system, device and storage medium |
CN111613209A (en) * | 2020-04-14 | 2020-09-01 | 北京三快在线科技有限公司 | Acoustic model training method and device, electronic equipment and storage medium |
CN111785256A (en) * | 2020-06-28 | 2020-10-16 | 北京三快在线科技有限公司 | Acoustic model training method and device, electronic equipment and storage medium |
CN113284514A (en) * | 2021-05-19 | 2021-08-20 | 北京大米科技有限公司 | Audio processing method and device |
CN113780408A (en) * | 2021-09-09 | 2021-12-10 | 安徽农业大学 | Live pig state identification method based on audio features |
CN113640699B (en) * | 2021-10-14 | 2021-12-24 | 南京国铁电气有限责任公司 | Fault judgment method, system and equipment for microcomputer control type alternating current and direct current power supply system |
CN113640699A (en) * | 2021-10-14 | 2021-11-12 | 南京国铁电气有限责任公司 | Fault judgment method, system and equipment for microcomputer control type alternating current and direct current power supply system |
CN116978368A (en) * | 2023-09-25 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Wake-up word detection method and related device |
CN116978368B (en) * | 2023-09-25 | 2023-12-15 | 腾讯科技(深圳)有限公司 | Wake-up word detection method and related device |
Also Published As
Publication number | Publication date |
---|---|
CN105976812B (en) | 2019-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105976812A (en) | | Voice identification method and equipment thereof |
CN107195296B (en) | | Voice recognition method, device, terminal and system |
CN110600017B (en) | | Training method of voice processing model, voice recognition method, system and device |
CN107610709B (en) | | Method and system for training voiceprint recognition model |
CN110310623B (en) | | Sample generation method, model training method, device, medium, and electronic apparatus |
CN107481717B (en) | | Acoustic model training method and system |
CN108615525B (en) | | Voice recognition method and device |
WO2017218465A1 (en) | | Neural network-based voiceprint information extraction method and apparatus |
CN109509470A (en) | | Voice interactive method, device, computer readable storage medium and terminal device |
CN112786004B (en) | | Speech synthesis method, electronic equipment and storage device |
CN107093422B (en) | | Voice recognition method and voice recognition system |
KR20200044388A (en) | | Device and method to recognize voice and device and method to train voice recognition model |
CN112837669B (en) | | Speech synthesis method, device and server |
CN112349289B (en) | | Voice recognition method, device, equipment and storage medium |
CN113096647B (en) | | Voice model training method and device and electronic equipment |
CN112927674B (en) | | Voice style migration method and device, readable medium and electronic equipment |
CN111508466A (en) | | Text processing method, device and equipment and computer readable storage medium |
CN114678032B (en) | | Training method, voice conversion method and device and electronic equipment |
CN114283783A (en) | | Speech synthesis method, model training method, device and storage medium |
CN112542173A (en) | | Voice interaction method, device, equipment and medium |
CN112216270A (en) | | Method and system for recognizing speech phonemes, electronic equipment and storage medium |
CN111640423A (en) | | Word boundary estimation method and device and electronic equipment |
CN114913859B (en) | | Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium |
CN116665642A (en) | | Speech synthesis method, speech synthesis system, electronic device, and storage medium |
CN113724689B (en) | | Speech recognition method and related device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |