CN105976812A - Voice identification method and equipment thereof - Google Patents

Voice identification method and equipment thereof

Info

Publication number
CN105976812A
Authority
CN
China
Prior art keywords
data
feature
target
training
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610272292.3A
Other languages
Chinese (zh)
Other versions
CN105976812B (en)
Inventor
钱柄桦
吴富章
李为
李科
吴永坚
黄飞跃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201610272292.3A priority Critical patent/CN105976812B/en
Publication of CN105976812A publication Critical patent/CN105976812A/en
Application granted granted Critical
Publication of CN105976812B publication Critical patent/CN105976812B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/144 Training of HMMs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention discloses a voice identification method and equipment thereof. The method comprises the following steps: acquiring target audio data input through an interactive application; extracting a target Filter bank feature from the target audio data; taking the target Filter bank feature of the target audio data as input data of a trained DNN model, and acquiring the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model; and creating a phoneme decoding network associated with the target audio data, and using the phoneme transition probabilities of a trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. By using the method and the equipment of the invention, speech recognition under various practical application environments and pronunciation habits can be satisfied, and the accuracy of speech recognition is increased.

Description

Speech recognition method and device therefor
Technical field
The present invention relates to the field of computer technology, and in particular to a speech recognition method and a device therefor.
Background technology
With the continuous development and maturation of computer technology, application scenarios for speech recognition have gradually increased, for example: extracting contact information in a terminal from audio input by a user, generating corresponding chat content from audio input by a user, performing user verification through audio input by a user, and so on. Speech recognition technology facilitates the user's operation of terminals such as mobile phones and computers, and improves the user experience.
Existing speech recognition technology builds its acoustic model on the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM). In practical applications, the Mel Frequency Cepstrum Coefficient (MFCC) feature must be extracted from the target audio and input into the acoustic model, which finally outputs the speech recognition result for the target audio. Because GMM-HMM acoustic modeling is a discriminative modeling approach intended to solve the problem of distinguishing pronunciation phoneme states, it requires MFCC features with independence between feature dimensions as the input data of the acoustic model. It therefore cannot satisfy speech recognition under various practical application environments and pronunciation habits, which reduces the accuracy of speech recognition.
Summary of the invention
The embodiments of the present invention provide a speech recognition method and a device therefor, which can satisfy speech recognition under various practical application environments and pronunciation habits and improve the accuracy of speech recognition.
A first aspect of the embodiments of the present invention provides a speech recognition method, which may include:
acquiring target audio data input through an interactive application;
extracting a target Filter bank feature from the target audio data;
taking the target Filter bank feature of the target audio data as input data of a trained Deep Neural Network (DNN) model, and acquiring the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model;
creating a phoneme decoding network associated with the target audio data, and using the phoneme transition probabilities of a trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data.
A second aspect of the embodiments of the present invention provides a speech recognition device, which may include:
an audio data acquiring unit, configured to acquire target audio data input through an interactive application;
a feature extraction unit, configured to extract a target Filter bank feature from the target audio data;
a feature acquiring unit, configured to take the target Filter bank feature of the target audio data as input data of a trained DNN model, and to acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model;
a word sequence data acquiring unit, configured to create a phoneme decoding network associated with the target audio data, and to use the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data.
In the embodiments of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank feature is extracted from the target audio data, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and the Filter bank feature is used as the input data of the acoustic model, so there is no need to remove the correlation between feature dimensions. Speech recognition under various practical application environments and pronunciation habits can thus be satisfied, improving the accuracy of speech recognition.
Brief description of the drawings
In order to describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another speech recognition method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a speech recognition device according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another speech recognition device according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a feature extraction unit according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a feature acquiring unit according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of still another speech recognition device according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The speech recognition method provided by the embodiments of the present invention can be applied to scenarios in which target audio data input by a terminal user (for example, audio containing digits or audio containing words) is recognized and a corresponding word sequence (for example, a digit string or a sentence) is generated. For example: a speech recognition device acquires target audio data input through an interactive application; the speech recognition device extracts the target Filter bank feature from the target audio data; the speech recognition device takes the target Filter bank feature of the target audio data as input data of a trained DNN model, and acquires the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model; the speech recognition device creates a phoneme decoding network associated with the target audio data, and uses the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and the Filter bank feature is used as the input data of the acoustic model, so there is no need to remove the correlation between feature dimensions. Speech recognition under various practical application environments and pronunciation habits can thus be satisfied, improving the accuracy of speech recognition.
The speech recognition device in the embodiments of the present invention may be a terminal device with a speech recognition function, such as a tablet computer, a smartphone, a palmtop computer, a vehicle-mounted terminal, a personal computer (PC), or a mobile internet device (MID); it may also be a server device with a speech recognition function corresponding to the interactive application. The interactive application may be a terminal application that needs to combine audio input by the user to implement a corresponding interactive function, such as a transaction application or an instant messaging application; verification code input, password input, communication content input, and the like can be performed through the speech recognition method provided by the embodiments of the present invention.
The speech recognition method provided by the embodiments of the present invention is described in detail below with reference to Fig. 1 and Fig. 2.
Refer to Fig. 1, which is a schematic flowchart of a speech recognition method according to an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment of the present invention may include the following steps S101-S104.
S101: acquire target audio data input through an interactive application;
Specifically, the speech recognition device acquires the target audio data input by the user through the interactive application. The target audio data may specifically be the voice input by the user on the application interface of the interactive application that currently requires voice input, that is, the audio data on which speech recognition currently needs to be performed.
S102: extract the target Filter bank feature from the target audio data;
Specifically, the speech recognition device may extract the target Filter bank feature from the target audio data. It should be noted that the speech recognition device needs to split the target audio data into multiple frames of audio data and extract the Filter bank feature of each frame separately for input into the trained DNN model described below; that is, the input is framed and the posterior probability feature over the phoneme states is calculated frame by frame, as in the sketch below. Further, the speech recognition device may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently needs to be calculated; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
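To make the framing step concrete, the following is a minimal sketch in Python, assuming 25 ms frames with a 10 ms shift; these window parameters are illustrative and are not specified by this embodiment:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D audio signal into overlapping frames.
    The 25 ms / 10 ms window parameters are illustrative assumptions."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = max(1, 1 + (len(signal) - frame_len) // shift)
    pad = (n_frames - 1) * shift + frame_len - len(signal)
    if pad > 0:  # zero-pad the tail so the last frame is complete
        signal = np.concatenate([signal, np.zeros(pad)])
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)
```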
S103: take the target Filter bank feature of the target audio data as the input data of the trained DNN model, and acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model;
Specifically, the speech recognition device may take the target Filter bank feature of the target audio data as the input data of the trained DNN model, and acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model. Preferably, a phoneme state is a phonetic symbol, and the target phoneme states are the phoneme states present in the target audio data. The DNN model may obtain the matrix weight values and matrix bias values between output layer nodes in the training process. There may be at least one output layer node; the number of output layer nodes is related to (for example, equal to) the number of phoneme states, and one output layer node represents the feature vector of one phoneme state.
S104: create a phoneme decoding network associated with the target audio data, and use the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data;
Specifically, the speech recognition device may create a phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word-graph decoding network built on the Weighted Finite-State Transducer (WFST) framework, with phoneme state sequences as input and word sequence data as output. It can be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.
The speech recognition device uses the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. The phoneme transition probabilities of the trained HMM include, for each phoneme state, the probability of the state jumping to itself and the probability of the state jumping to its next phoneme state. It can be understood that the speech recognition device may, according to the phoneme transition probabilities of the trained HMM and the posterior probability features over the target phoneme states of all the first target Filter bank features, set the probability value of every network path in the phoneme decoding network, filter out the optimal path according to the probability values of the network paths, and take the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data, as in the path-scoring sketch below.
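The embodiment does not spell out the search itself; the following is a minimal Viterbi-style sketch of how per-frame DNN posteriors and the two HMM transition probabilities (self-loop and next-state) can be combined to score state paths over a simple left-to-right topology. The linear topology and all names are assumptions; a full WFST decoder is considerably more involved:

```python
import numpy as np

def viterbi_path(log_posteriors, log_self, log_next):
    """log_posteriors: (T, P) log posterior of each phoneme state per frame (DNN output).
    log_self, log_next: (P,) log probability of a state jumping to itself / to its next state.
    Returns the best left-to-right state path (illustrative linear topology)."""
    T, P = log_posteriors.shape
    score = np.full(P, -np.inf)
    score[0] = log_posteriors[0, 0]          # start in the first state
    back = np.zeros((T, P), dtype=int)
    for t in range(1, T):
        stay = score + log_self               # predecessor is the state itself
        move = np.full(P, -np.inf)
        move[1:] = score[:-1] + log_next[:-1]  # predecessor is the previous state
        back[t] = np.where(stay >= move, np.arange(P), np.arange(P) - 1)
        score = np.maximum(stay, move) + log_posteriors[t]
    path = [int(np.argmax(score))]            # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

In a real decoder the same max-and-backtrack recursion runs over the arcs of the WFST word graph rather than over a single linear state chain.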
In this embodiment of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank feature is extracted from the target audio data, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and the Filter bank feature is used as the input data of the acoustic model, so there is no need to remove the correlation between feature dimensions. Speech recognition under various practical application environments and pronunciation habits can thus be satisfied, improving the accuracy of speech recognition.
Refer to Fig. 2, which is a schematic flowchart of another speech recognition method according to an embodiment of the present invention. As shown in Fig. 2, the method of this embodiment of the present invention may include the following steps S201-S211.
S201: use a training audio corpus to train the GMM and the HMM, obtain the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtain the phoneme transition probabilities of the trained HMM;
Specifically, before the DNN model is trained, the GMM-HMM acoustic model needs to be trained first. The speech recognition device may use a training audio corpus to train the GMM and the HMM, obtain the likelihood probability feature of each phoneme state in the at least one phoneme state output by the trained GMM, and obtain the phoneme transition probabilities of the trained HMM. The training audio corpus may, as far as possible, contain audio data under scenarios such as different noise environments, different speaking rates, and different pauses between words.
It should be noted that the speech recognition device may perform data preprocessing on the training audio corpus. The data preprocessing may include: performing data framing, data pre-emphasis, data windowing, and the like on the training audio corpus to obtain at least one frame of audio data in the time domain; performing a fast Fourier transform to convert the at least one frame of audio data to the frequency domain, obtaining at least one power spectrum corresponding to the at least one frame of audio data in the frequency domain; passing the at least one power spectrum through Mel-frequency filters with a triangular filtering characteristic to obtain at least one Mel power spectrum; and taking the logarithmic energy of the at least one Mel power spectrum to obtain at least one Mel log-energy spectrum, which constitutes the Filter bank feature. The Discrete Cosine Transform (DCT) is then used to remove the data correlation of the at least one Mel log-energy spectrum to obtain the MFCC feature. The speech recognition device takes the MFCC feature as the input data of the GMM so as to train the GMM and the HMM, and obtains the likelihood probability feature of each phoneme state in the at least one phoneme state output by the trained GMM as well as the phoneme transition probabilities of the trained HMM. It can be understood that, for the same frame of audio data in the training audio corpus, the Filter bank feature and the MFCC feature are in one-to-one correspondence.
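A minimal sketch of this preprocessing chain (FFT, triangular Mel filters, log energy, and the optional DCT step that turns the Filter bank feature into the MFCC feature). The FFT size and filter count are illustrative assumptions; this embodiment does not fix them:

```python
import numpy as np
from scipy.fftpack import dct  # DCT-II, used only for the optional MFCC step

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel-frequency filters (illustrative construction)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0), mel(sample_rate / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def fbank_and_mfcc(frames, sample_rate, n_fft=512, n_filters=24):
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # power spectrum per frame
    mel_power = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    fbank_feat = np.log(mel_power + 1e-10)                 # Filter bank feature (log Mel energy)
    mfcc_feat = dct(fbank_feat, type=2, axis=1, norm='ortho')[:, :13]
    return fbank_feat, mfcc_feat
```

Note that fbank_feat retains the correlation between feature dimensions, while the DCT in the last line is exactly the decorrelation step that the DNN input does not need.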
S202: use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state;
Specifically, the speech recognition device may use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state. It can be understood that, because the likelihood probability feature is a similarity-type probability feature, for one frame of audio data in the training audio corpus, the sum of the feature values of its likelihood probability features over the phoneme states is not 1, whereas the sum of the feature values of its posterior probability features over the phoneme states is 1. Therefore, the phoneme state whose likelihood probability feature value is the largest is chosen, the feature value of the posterior probability feature on that phoneme state is set to 1, and the feature values of the posterior probability features on the other phoneme states of this frame of audio data are set to 0. Proceeding frame by frame in this way, the likelihood probability feature of every frame of audio data in the training audio corpus over the phoneme states is converted, obtaining the posterior probability feature of every frame of audio data in the training audio corpus over the phoneme states.
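A minimal sketch of this conversion, assuming the forced-alignment step has already produced a (frames x states) matrix of likelihoods:

```python
import numpy as np

def likelihood_to_posterior(likelihoods):
    """likelihoods: (T, P) per-frame likelihood of each phoneme state (from GMM alignment).
    Returns (T, P) one-hot posteriors: 1 on the max-likelihood state, 0 elsewhere."""
    posteriors = np.zeros_like(likelihoods)
    posteriors[np.arange(len(likelihoods)), np.argmax(likelihoods, axis=1)] = 1.0
    return posteriors
```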
S203: calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state;
S204: add the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model;
Specifically, the speech recognition device may calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state. Preferably, the speech recognition device may, based on the method described above, extract the training Filter bank feature corresponding to each frame of audio data in the training audio corpus, and take each training Filter bank feature together with its corresponding posterior probability feature as a training sample pair, so that the training audio corpus may contain multiple training sample pairs. Based on the multiple training sample pairs, the backward propagation algorithm under the maximum likelihood criterion is used to calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model. The speech recognition device then adds the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
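This embodiment names backward propagation under a maximum likelihood criterion but gives no further detail, so the sketch below only shows the assembly of training sample pairs; any standard backpropagation implementation is assumed to consume them:

```python
def build_training_pairs(fbank_feats, posteriors):
    """Pair each frame's training Filter bank feature with its one-hot posterior target.
    Each pair supplies one (input, target) example for backpropagation."""
    assert len(fbank_feats) == len(posteriors)
    return list(zip(fbank_feats, posteriors))  # [(x_frame, y_one_hot), ...]
```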
S205: obtain the occurrence probabilities of training word sequence data in a training word sequence corpus, and generate an N-Gram language model according to the occurrence probabilities of the training word sequence data;
Specifically, while training the DNN model and the HMM acoustic model, the speech recognition device may also train a language model. The speech recognition device may obtain the occurrence probabilities of the training word sequence data in the training word sequence corpus, and generate an N-Gram language model according to those occurrence probabilities. The N-Gram language model is based on the assumption that the occurrence of the K-th word is related only to the preceding K-1 words and is unrelated to any other word, so that the probability of a word string is the product of the occurrence probabilities of its individual words.
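A minimal sketch of that factorization with maximum-likelihood counts; a practical model would add smoothing for unseen n-grams, which is omitted here:

```python
from collections import Counter

def train_ngram(sentences, n=2):
    """Count n-grams and (n-1)-gram histories over a training word sequence corpus."""
    grams, hists = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(len(padded) - n + 1):
            grams[tuple(padded[i:i + n])] += 1
            hists[tuple(padded[i:i + n - 1])] += 1
    return grams, hists

def sequence_prob(words, grams, hists, n=2):
    """P(word string) = product over words of P(w_k | preceding K-1 words)."""
    padded = ["<s>"] * (n - 1) + words + ["</s>"]
    p = 1.0
    for i in range(n - 1, len(padded)):
        h = tuple(padded[i - n + 1:i])
        p *= grams[tuple(padded[i - n + 1:i + 1])] / max(hists[h], 1)
    return p
```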
S206: acquire target audio data input through an interactive application;
Specifically, the speech recognition device acquires the target audio data input by the user through the interactive application. The target audio data may specifically be the voice input by the user on the application interface of the interactive application that currently requires voice input, that is, the audio data on which speech recognition currently needs to be performed.
S207: perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data;
S208: acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data;
Specifically, the speech recognition device needs to split the target audio data into multiple frames of audio data and extract the Filter bank feature of each frame separately for input into the trained DNN model described below; that is, the input is framed and the posterior probability feature over the phoneme states is calculated frame by frame. Further, the speech recognition device may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently needs to be calculated; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
Further, the speech recognition device may perform data preprocessing on the target audio data. The data preprocessing may include: data framing, data pre-emphasis, data windowing, and the like, to obtain at least one frame of audio data in the time domain; performing a fast Fourier transform to convert the at least one frame of audio data to the frequency domain, obtaining at least one power spectrum corresponding to the at least one frame of audio data in the frequency domain; passing the at least one power spectrum through Mel-frequency filters with a triangular filtering characteristic to obtain at least one Mel power spectrum; and taking the logarithmic energy of the at least one Mel power spectrum to obtain at least one Mel log-energy spectrum. The set of Mel log-energy spectra obtained at this point is the target Filter bank feature. It can be understood that the Filter bank feature has data correlation between different feature dimensions, whereas the MFCC feature is the feature obtained by using the Discrete Cosine Transform (DCT) to remove the data correlation of the Filter bank feature.
Preferably, the speech recognition device may further perform feature post-processing on the target Filter bank feature. The feature post-processing may include feature extension and feature normalization. Feature extension may consist of computing the first-order difference and second-order difference features of the target Filter bank feature to obtain, for each frame of first audio data, a target Filter bank feature with a preset number of feature dimensions. Feature normalization may consist of using the Cepstrum Mean Subtraction (CMS) technique to normalize the target Filter bank feature with the preset number of feature dimensions corresponding to each frame of first audio data, obtaining the first target Filter bank feature corresponding to each frame of first audio data. Preferably, the preset dimension may be 72, as in the post-processing sketch below.
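A minimal sketch of this post-processing, assuming a 24-dim base Filter bank feature so that stacking the first- and second-order differences reaches the preferred 72 dimensions; np.gradient as the difference operator is an illustrative choice:

```python
import numpy as np

def post_process(fbank_feat):
    """fbank_feat: (T, D) log Mel features. Returns (T, 3*D) CMS-normalized features.
    With D = 24 this yields the preferred 72 dimensions."""
    delta = np.gradient(fbank_feat, axis=0)         # first-order difference (illustrative)
    delta2 = np.gradient(delta, axis=0)             # second-order difference
    feat = np.concatenate([fbank_feat, delta, delta2], axis=1)
    return feat - feat.mean(axis=0, keepdims=True)  # Cepstrum Mean Subtraction
```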
S209: according to the time ordering of the at least one frame of audio data, acquire the second audio data consisting of a preset number of frames before and after each frame of first audio data;
S210: take the first target Filter bank feature and the second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and acquire the posterior probability feature over the target phoneme states of the first target Filter bank feature output by the trained DNN model;
Specifically, the speech recognition device may, according to the time ordering of the at least one frame of audio data, acquire the second audio data consisting of the preset number of frames before and after each frame of first audio data. The speech recognition device takes the first target Filter bank feature and the second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and acquires the posterior probability feature over the target phoneme states of the first target Filter bank feature output by the trained DNN model. It can be understood that the second audio data is data that has a dimensional association with the first audio data.
Suppose the target audio data contains N frames of audio data, the first target Filter bank feature corresponding to the i-th frame of first audio data is F_i, i = 1, 2, 3, ..., N, and the preset number of frames before and after is 8 frames each. The input data may then include F_i and the second target Filter bank features of the 8 frames before and the 8 frames after the i-th frame of first audio data. With the preferred preset dimension above, the number of input layer nodes of the trained DNN model corresponding to the input data is (8+1+8) x 72 = 1224 nodes, and the number of output layer nodes of the trained DNN model equals the number P of all phoneme states. Between the input layer and the output layer there is a preset number of hidden layers, preferably 3 hidden layers with 1024 nodes each. Denoting the matrix weight value and matrix bias value between the (M-1)-th layer and the M-th layer of the trained DNN model as $W_M$ and $b_M$, the feature vector $h_M^i$ of the phoneme state corresponding to the i-th frame of first audio data at the M-th layer satisfies $h_M^i = f(W_M h_{M-1}^i + b_M)$, where $f(x)$ is the activation function, preferably the ReLU function. The posterior probability feature of F_i on the M-th phoneme state output by the trained DNN model is then:
$$O_M^i = \frac{\exp(h_M^i)}{\sum_{m=1}^{P} \exp(h_m^i)}$$
where the sum runs over the P phoneme states of the output layer.
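A minimal sketch of the forward pass just described: a 17-frame context window (8 before, 8 after) of 72-dim features gives a 1224-dim input, three 1024-node ReLU hidden layers, and a softmax over the P phoneme states. All weights are assumed already trained:

```python
import numpy as np

def dnn_posteriors(feats, weights, biases, context=8):
    """feats: (T, 72) post-processed features. weights/biases: trained layer parameters,
    e.g. shapes 1224 -> 1024 -> 1024 -> 1024 -> P. Returns (T, P) phoneme-state posteriors."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    # stack 8 frames before + current + 8 after: (T, 17 * 72) = (T, 1224)
    x = np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ W + b)             # ReLU hidden layers
    logits = x @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # softmax over the P phoneme states
```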
S211: create a phoneme decoding network associated with the target audio data, and use the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data;
Specifically, the speech recognition device may create a phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word-graph decoding network built on the WFST framework, with phoneme state sequences as input and word sequence data as output. It can be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.
The speech recognition device uses the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. The phoneme transition probabilities of the trained HMM include, for each phoneme state, the probability of the state jumping to itself and the probability of the state jumping to its next phoneme state. It can be understood that the speech recognition device may, according to the phoneme transition probabilities of the trained HMM and the posterior probability features over the target phoneme states of all the first target Filter bank features, set the probability value of every network path in the phoneme decoding network, filter out the optimal path according to the probability values of the network paths, and take the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data.
Further, the speech recognition device may use the phoneme transition probabilities of the trained HMM, the posterior probability features over the target phoneme states of the first target Filter bank features, and the N-Gram language model to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. Because the N-Gram language model can itself infer the probability of the next word occurring, the probability value of every network path can be weighted with these occurrence probabilities, increasing the probability of the correct network path. Acquiring the target word sequence data corresponding to the target audio data by additionally combining the N-Gram language model can thus further improve the accuracy of speech recognition.
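The embodiment describes weighting path probabilities with the N-Gram occurrence probabilities but gives no formula; the log-linear combination and the scale factor in the sketch below are assumptions, and sequence_prob is reused from the N-Gram sketch above:

```python
import math

def path_score(acoustic_logprob, word_sequence, grams, hists, lm_scale=1.0, n=2):
    """Combine the acoustic path score with the N-Gram probability of its word sequence."""
    lm_logprob = math.log(sequence_prob(word_sequence, grams, hists, n) + 1e-300)
    return acoustic_logprob + lm_scale * lm_logprob
```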
In this embodiment of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank feature is extracted from the target audio data, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and the Filter bank feature is used as the input data of the acoustic model, so there is no need to remove the correlation between feature dimensions; speech recognition under various practical application environments and pronunciation habits can be satisfied, improving the accuracy of speech recognition. By fusing the Filter bank feature extraction method with the training method of the DNN-HMM acoustic model, a complete process from training to recognition is achieved. The target word sequence data corresponding to the target audio data is acquired by combining the N-Gram language model; because the N-Gram language model can itself infer the probability of the next word occurring, the probability value of every network path can be weighted with these occurrence probabilities, increasing the probability of the correct network path and further improving the accuracy of speech recognition.
The speech recognition device provided by the embodiments of the present invention is described in detail below with reference to Fig. 3 to Fig. 6. It should be noted that the speech recognition devices shown in Fig. 3 to Fig. 6 are configured to perform the methods of the embodiments shown in Fig. 1 and Fig. 2 of the present invention. For ease of description, only the parts related to the embodiments of the present invention are shown; for the specific technical details that are not disclosed, refer to the embodiments shown in Fig. 1 and Fig. 2 of the present invention.
Refer to Fig. 3, which is a schematic structural diagram of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 3, the speech recognition device 1 of this embodiment of the present invention may include: an audio data acquiring unit 11, a feature extraction unit 12, a feature acquiring unit 13, and a word sequence data acquiring unit 14.
The audio data acquiring unit 11 is configured to acquire target audio data input through an interactive application.
In a specific implementation, the audio data acquiring unit 11 acquires the target audio data input by the user through the interactive application. The target audio data may specifically be the voice input by the user on the application interface of the interactive application that currently requires voice input, that is, the audio data on which speech recognition currently needs to be performed.
The feature extraction unit 12 is configured to extract the target Filter bank feature from the target audio data.
In a specific implementation, the feature extraction unit 12 may extract the target Filter bank feature from the target audio data. It should be noted that the feature extraction unit 12 needs to split the target audio data into multiple frames of audio data and extract the Filter bank feature of each frame separately for input into the trained DNN model described below; that is, the input is framed and the posterior probability feature over the phoneme states is calculated frame by frame. Further, the feature extraction unit 12 may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently needs to be calculated; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
The feature acquiring unit 13 is configured to take the target Filter bank feature of the target audio data as the input data of the trained DNN model, and to acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model.
In a specific implementation, the feature acquiring unit 13 may take the target Filter bank feature of the target audio data as the input data of the trained DNN model, and acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model. Preferably, a phoneme state is a phonetic symbol, and the target phoneme states are the phoneme states present in the target audio data. The DNN model may obtain the matrix weight values and matrix bias values between output layer nodes in the training process. There may be at least one output layer node; the number of output layer nodes is related to (for example, equal to) the number of phoneme states, and one output layer node represents the feature vector of one phoneme state.
The word sequence data acquiring unit 14 is configured to create a phoneme decoding network associated with the target audio data, and to use the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data.
In a specific implementation, the word sequence data acquiring unit 14 may create a phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word-graph decoding network built on the WFST framework, with phoneme state sequences as input and word sequence data as output. It can be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.
The word sequence data acquiring unit 14 uses the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. The phoneme transition probabilities of the trained HMM include, for each phoneme state, the probability of the state jumping to itself and the probability of the state jumping to its next phoneme state. It can be understood that the word sequence data acquiring unit 14 may, according to the phoneme transition probabilities of the trained HMM and the posterior probability features over the target phoneme states of all the first target Filter bank features, set the probability value of every network path in the phoneme decoding network, filter out the optimal path according to the probability values of the network paths, and take the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data.
In this embodiment of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank feature is extracted from the target audio data, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and the Filter bank feature is used as the input data of the acoustic model, so there is no need to remove the correlation between feature dimensions. Speech recognition under various practical application environments and pronunciation habits can thus be satisfied, improving the accuracy of speech recognition.
Refer to Fig. 4, which is a schematic structural diagram of another speech recognition device according to an embodiment of the present invention. As shown in Fig. 4, the speech recognition device 1 of this embodiment of the present invention may include: an audio data acquiring unit 11, a feature extraction unit 12, a feature acquiring unit 13, a word sequence data acquiring unit 14, an acoustic model training unit 15, a feature conversion unit 16, a parameter calculation unit 17, an acoustic model generating unit 18, and a language model generating unit 19.
The acoustic model training unit 15 is configured to use a training audio corpus to train the GMM and the HMM, to obtain the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and to obtain the phoneme transition probabilities of the trained HMM.
In a specific implementation, before the DNN model is trained, the GMM-HMM acoustic model needs to be trained first. The acoustic model training unit 15 may use a training audio corpus to train the GMM and the HMM, obtain the likelihood probability feature of each phoneme state in the at least one phoneme state output by the trained GMM, and obtain the phoneme transition probabilities of the trained HMM. The training audio corpus may, as far as possible, contain audio data under scenarios such as different noise environments, different speaking rates, and different pauses between words.
It should be noted that the acoustic model training unit 15 may perform data preprocessing on the training audio corpus. The data preprocessing may include: performing data framing, data pre-emphasis, data windowing, and the like on the training audio corpus to obtain at least one frame of audio data in the time domain; performing a fast Fourier transform to convert the at least one frame of audio data to the frequency domain, obtaining at least one power spectrum corresponding to the at least one frame of audio data in the frequency domain; passing the at least one power spectrum through Mel-frequency filters with a triangular filtering characteristic to obtain at least one Mel power spectrum; and taking the logarithmic energy of the at least one Mel power spectrum to obtain at least one Mel log-energy spectrum, which constitutes the Filter bank feature. The DCT is then used to remove the data correlation of the at least one Mel log-energy spectrum to obtain the MFCC feature. The acoustic model training unit 15 takes the MFCC feature as the input data of the GMM so as to train the GMM and the HMM, and obtains the likelihood probability feature of each phoneme state in the at least one phoneme state output by the trained GMM as well as the phoneme transition probabilities of the trained HMM. It can be understood that, for the same frame of audio data in the training audio corpus, the Filter bank feature and the MFCC feature are in one-to-one correspondence.
The feature conversion unit 16 is configured to use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state.
In a specific implementation, the feature conversion unit 16 may use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state. It can be understood that, because the likelihood probability feature is a similarity-type probability feature, for one frame of audio data in the training audio corpus, the sum of the feature values of its likelihood probability features over the phoneme states is not 1, whereas the sum of the feature values of its posterior probability features over the phoneme states is 1. Therefore, the phoneme state whose likelihood probability feature value is the largest is chosen, the feature value of the posterior probability feature on that phoneme state is set to 1, and the feature values of the posterior probability features on the other phoneme states of this frame of audio data are set to 0. Proceeding frame by frame in this way, the likelihood probability feature of every frame of audio data in the training audio corpus over the phoneme states is converted, obtaining the posterior probability feature of every frame of audio data in the training audio corpus over the phoneme states.
The parameter calculation unit 17 is configured to calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state.
The acoustic model generating unit 18 is configured to add the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
In a specific implementation, the parameter calculation unit 17 may calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state. Preferably, the parameter calculation unit 17 may, based on the method described above, extract the training Filter bank feature corresponding to each frame of audio data in the training audio corpus, and take each training Filter bank feature together with its corresponding posterior probability feature as a training sample pair, so that the training audio corpus may contain multiple training sample pairs. Based on the multiple training sample pairs, the backward propagation algorithm under the maximum likelihood criterion is used to calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model. The acoustic model generating unit 18 adds the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
The language model generating unit 19 is configured to obtain the occurrence probabilities of training word sequence data in a training word sequence corpus, and to generate an N-Gram language model according to the occurrence probabilities of the training word sequence data.
In a specific implementation, while the DNN model and the HMM acoustic model are being trained, the language model generating unit 19 may train the language model. The language model generating unit 19 may obtain the occurrence probabilities of the training word sequence data in the training word sequence corpus, and generate an N-Gram language model according to those occurrence probabilities. The N-Gram language model is based on the assumption that the occurrence of the K-th word is related only to the preceding K-1 words and is unrelated to any other word, so that the probability of a word string is the product of the occurrence probabilities of its individual words.
The audio data acquiring unit 11 is configured to acquire target audio data input through an interactive application.
In a specific implementation, the audio data acquiring unit 11 acquires the target audio data input by the user through the interactive application. The target audio data may specifically be the voice input by the user on the application interface of the interactive application that currently requires voice input, that is, the audio data on which speech recognition currently needs to be performed.
The feature extraction unit 12 is configured to extract the target Filter bank feature from the target audio data.
In a specific implementation, the feature extraction unit 12 may extract the target Filter bank feature from the target audio data. It should be noted that the feature extraction unit 12 needs to split the target audio data into multiple frames of audio data and extract the Filter bank feature of each frame separately for input into the trained DNN model described below; that is, the input is framed and the posterior probability feature over the phoneme states is calculated frame by frame. Further, the feature extraction unit 12 may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently needs to be calculated; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
Specifically, please also refer to Fig. 5, which is a schematic structural diagram of a feature extraction unit according to an embodiment of the present invention. As shown in Fig. 5, the feature extraction unit 12 may include:
a first data acquiring subunit 121, configured to perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data;
a first feature acquiring subunit 122, configured to acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
In a specific implementation, the first data acquiring subunit 121 needs to split the target audio data into multiple frames of audio data so that the Filter bank feature of each frame can be extracted separately for input into the trained DNN model described below; that is, the input is framed and the posterior probability feature over the phoneme states is calculated frame by frame. Further, the first data acquiring subunit 121 may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and the first feature acquiring subunit 122 acquires the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently needs to be calculated; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
Further, described first data acquisition subelement 121 can carry out data to described target audio data and locates in advance Reason, described data prediction may include that data framing, data preemphasis, data windowing operation etc. are to obtain in time domain extremely A few frame voice data;Carry out fast Fourier transform, described at least one frame voice data be transformed into frequency domain, obtain described in extremely At least one power spectrum data that a few frame voice data is corresponding on frequency domain;At least one power spectrum data on frequency domain is led to Cross the mel-frequency wave filter with triangle filtering characteristic, obtain at least one Mel power spectrum data;To at least one prunus mume (sieb.) sieb.et zucc. Your power spectrum data is taken the logarithm energy, obtains at least one Mel logarithmic energy modal data, now obtained by least one The set of Mel logarithmic energy modal data is described target Filter bank feature, it is to be understood that Filter bank There is data dependence in feature between different characteristic dimension, MFCC feature is then to use DCT to remove Filter bank feature Data dependence obtained by feature.
Preferably, the first feature acquisition subunit 122 may further perform feature post-processing on the target Filter bank feature. The feature post-processing may include feature extension and feature normalization. The feature extension may compute the first-order difference and second-order difference of the target Filter bank feature, obtaining a target Filter bank feature of a preset dimension for each frame of first audio data; the feature normalization may use CMS (cepstral mean subtraction) to normalize the target Filter bank feature of the preset dimension for each frame of first audio data, obtaining the first target Filter bank feature corresponding to each frame of first audio data. Preferably, the preset dimension may be 72.
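A sketch of this feature post-processing, under the assumption that a 24-dimensional Filter bank vector extended with its first- and second-order differences gives the preferred 72 dimensions (3 × 24), and using np.gradient as a simple stand-in for the usual regression-based delta computation:

```python
import numpy as np

def post_process(feats):
    """feats: (n_frames, 24) Filter bank features -> (n_frames, 72) normalized features."""
    delta1 = np.gradient(feats, axis=0)            # first-order difference over time
    delta2 = np.gradient(delta1, axis=0)           # second-order difference
    ext = np.concatenate([feats, delta1, delta2], axis=1)  # feature extension: 3 * 24 = 72 dims
    return ext - ext.mean(axis=0, keepdims=True)   # CMS-style per-utterance mean removal
```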
The feature acquisition unit 13 is configured to use the target Filter bank feature of the target audio data as input data of the trained DNN model, and obtain the posterior probability feature on the target phoneme state of the target audio data output by the trained DNN model.
In a specific implementation, the feature acquisition unit 13 may use the target Filter bank feature of the target audio data as the input data of the trained DNN model and obtain the posterior probability feature on the target phoneme state of the target audio data output by the trained DNN model. Preferably, a phoneme state corresponds to a phonetic symbol, and the target phoneme states are the phoneme states present in the target audio data. The DNN model obtains the matrix weight values and matrix bias values between output layer nodes during training; there is at least one output layer node, the number of output layer nodes is related to (for example, equal to) the number of phoneme states, and each output layer node represents the feature vector of one phoneme state.
Specifically, referring also to Fig. 6, which provides a structural diagram of the feature acquisition unit according to an embodiment of the present invention. As shown in Fig. 6, the feature acquisition unit 13 may include:
a second data acquisition subunit 131, configured to obtain, according to the time ordering of the at least one frame of audio data, second audio data of a preset number of frames before and after each frame of first audio data;
a second feature acquisition subunit 132, configured to use the first target Filter bank feature and the second target Filter bank feature corresponding to the second audio data as input data of the trained DNN model, and obtain the posterior probability feature on the target phoneme state of the first target Filter bank feature output by the trained DNN model.
In a specific implementation, the second data acquisition subunit 131 may obtain, according to the time ordering of the at least one frame of audio data, the second audio data of the preset number of frames before and after each frame of first audio data, and the second feature acquisition subunit 132 uses the first target Filter bank feature and the second target Filter bank feature corresponding to the second audio data as the input data of the trained DNN model and obtains the posterior probability feature on the target phoneme state of the first target Filter bank feature output by the trained DNN model. It can be understood that the second audio data are the data possessing dimensional correlation with the first audio data.
Assume that the target audio data contains N frames of audio data and that the first target Filter bank feature corresponding to the i-th frame of first audio data is F_i, i = 1, 2, 3, ..., N, and let the preset number of frames before and after be 8 frames each. The input data then includes F_i together with the second target Filter bank features of the 8 frames before and the 8 frames after the i-th frame of first audio data. Based on the preferred preset dimension above, the number of input layer nodes of the trained DNN model corresponding to the input data is (8+1+8) × 72 = 1224, and the number of output layer nodes of the trained DNN model equals the number P of all phoneme states. A preset number of hidden layers lies between the input layer and the output layer; preferably there are 3 hidden layers, each with 1024 nodes. The matrix weight value and matrix bias value between the nodes of the (M−1)-th layer and the M-th layer of the trained DNN model can be expressed as W_M and b_M, and the feature vector h_M^i of the i-th frame of first audio data at the M-th layer satisfies h_M^i = f(W_M h_{M-1}^i + b_M), where h_0^i is the 1224-dimensional spliced input vector and f(x) is the activation function, preferably the ReLU function. Writing h_M^i for the activation of the M-th output layer node, M = 1, 2, 3, ..., P, the posterior probability feature O_M^i of F_i on the M-th phoneme state output by the trained DNN model is:

O_M^i = \frac{\exp(h_M^i)}{\sum_{M'=1}^{P} \exp(h_{M'}^i)}
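Combining the context splicing and the forward pass, a minimal sketch with the preferred sizes above (1224 input nodes, three 1024-node ReLU hidden layers, P output nodes); padding edge frames by repetition is an assumption, as the embodiment does not specify edge handling:

```python
import numpy as np

def splice(feats, left=8, right=8):
    """Stack each 72-dim frame with its 8 left / 8 right neighbours -> (8+1+8)*72 = 1224 dims."""
    padded = np.pad(feats, ((left, right), (0, 0)), mode='edge')   # repeat edge frames
    return np.concatenate([padded[i:i + len(feats)]
                           for i in range(left + right + 1)], axis=1)

def dnn_posteriors(feats, weights, biases):
    """weights/biases: trained parameters of a 1224 -> 1024 -> 1024 -> 1024 -> P network."""
    h = splice(feats)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)              # ReLU hidden layers
    logits = h @ weights[-1] + biases[-1]           # one output node per phoneme state
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)         # softmax: the posterior features O_M^i
```

Subtracting the per-frame maximum before exponentiating does not change the softmax result and simply keeps the computation numerically stable.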
The word sequence data acquisition unit 14 is configured to create a phoneme decoding network associated with the target audio data, and use the phoneme conversion probability of the trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.
In a specific implementation, the word sequence data acquisition unit 14 may create the phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word-lattice decoding network built on a WFST framework, with a phoneme state sequence as input and word sequence data as output. It can be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.
The word sequence data acquisition unit 14 uses the phoneme conversion probability of the trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data. The phoneme conversion probability of the trained HMM includes, for each phoneme state, the conversion probability of jumping back to itself and the conversion probability of jumping to its next phoneme state. It can be understood that the word sequence data acquisition unit 14 may compute the probability value of every network path in the phoneme decoding network according to the phoneme conversion probability of the trained HMM and the posterior probability features on the target phoneme states of all the first target Filter bank features, select the optimal path according to the probability values of the network paths, and use the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data.
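To illustrate how the self-loop and next-state conversion probabilities combine with the frame posteriors to score network paths, here is a toy Viterbi search over a single left-to-right phoneme state sequence; a real decoder searches the full WFST word lattice, so this sketch only conveys the scoring principle:

```python
import numpy as np

def viterbi(post, self_loop, next_prob, order):
    """Toy path search. post: (T, P) frame posteriors from the DNN;
    order: the expected left-to-right sequence of phoneme-state ids;
    self_loop[s]/next_prob[s]: HMM probability of staying in state s / leaving it."""
    log = lambda p: np.log(max(p, 1e-30))        # floor to avoid log(0)
    T, S = len(post), len(order)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log(post[0, order[0]])
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + log(self_loop[order[s]])
            move = score[t - 1, s - 1] + log(next_prob[order[s - 1]]) if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + log(post[t, order[s]])
    path = [S - 1]                               # backtrack the optimal path
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return score[-1, -1], path[::-1]
```

Working in log probabilities turns the products along a path into sums, which is the usual way to keep long paths from underflowing.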
Further, the word sequence data acquisition unit 14 may use the phoneme conversion probability of the trained HMM, the posterior probability feature on the target phoneme state of the first target Filter bank feature, and the N-Gram language model to obtain, in the decoding network, the target word sequence data corresponding to the target audio data. Since the N-Gram language model can itself infer the probability that the next word occurs, the probability value of every network path can be weighted by that occurrence probability, increasing the probability of likely network paths; obtaining the target word sequence data corresponding to the target audio data in combination with the N-Gram language model can therefore further improve the accuracy of speech recognition.
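The language-model weighting can be sketched as adding a scaled bigram log-probability to a path's acoustic score; the scale factor lm_weight and the floor for unseen word pairs are assumed tuning choices, not values from the embodiment:

```python
import math

def weighted_path_score(acoustic_logprob, words, bigram, lm_weight=10.0):
    """bigram[(w_prev, w)] -> P(w | w_prev); unseen word pairs get a small floor."""
    lm = 0.0
    for prev, w in zip(['<s>'] + words[:-1], words):
        lm += math.log(bigram.get((prev, w), 1e-6))
    return acoustic_logprob + lm_weight * lm       # weight the path by LM occurrence probability
```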
In the embodiment of the present invention, when target audio data input based on an interactive application is acquired, the target Filter bank feature in the target audio data is obtained, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model established by the DNN model and the HMM realizes the speech recognition function, and using the Filter bank feature as the input data of the acoustic model makes it unnecessary to remove the correlation between feature dimensions, so that speech recognition under various practical application environments and pronunciation habits can be satisfied and the accuracy of speech recognition is improved. By integrating the Filter bank feature extraction method with the training method of the DNN-HMM acoustic model, the complete process from training to recognition is realized. The target word sequence data corresponding to the target audio data is obtained in combination with the N-Gram language model; since the N-Gram language model can itself infer the probability that the next word occurs, the probability value of every network path can be weighted by that occurrence probability, increasing the probability of likely network paths and further improving the accuracy of speech recognition.
Referring to Fig. 7, which provides a structural diagram of another speech recognition device according to an embodiment of the present invention. As shown in Fig. 7, the speech recognition device 1000 may include: at least one processor 1001 (for example, a CPU), at least one network interface 1004, a user interface 1003, a memory 1005 and at least one communication bus 1002. The communication bus 1002 is used to realize the connection and communication between these components. The user interface 1003 may include a display and a keyboard, and may optionally also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (for example, a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or may be a non-volatile memory, for example, at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in Fig. 7, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a speech recognition application program.
In the speech recognition device 1000 shown in Fig. 7, the user interface 1003 is mainly used to provide the user with an input interface and obtain the data input by the user, while the processor 1001 may be used to call the speech recognition application program stored in the memory 1005 and specifically perform the following operations:
obtain target audio data input based on an interactive application;
extract the target Filter bank feature from the target audio data;
use the target Filter bank feature of the target audio data as input data of the trained DNN model, and obtain the posterior probability feature on the target phoneme state of the target audio data output by the trained DNN model;
create a phoneme decoding network associated with the target audio data, and use the phoneme conversion probability of the trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.
In one embodiment, before obtaining the target audio data input based on the interactive application, the processor 1001 further performs the following operations:
train the GMM and the HMM using a training audio corpus, obtain the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtain the phoneme conversion probability of the trained HMM;
use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of that phoneme state;
calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability feature of each phoneme state;
add the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
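A compressed sketch of this training flow, assuming the frame-level phoneme state labels have already been produced by the forced alignment above; for brevity it updates only the output-layer matrix weight and bias values with plain cross-entropy SGD, while the embodiment does not fix an optimizer:

```python
import numpy as np

def train_output_layer(inputs, state_labels, weights, biases, lr=0.1, epochs=5):
    """inputs: (T, 1224) spliced training Filter bank features; state_labels: (T,)
    phoneme-state ids from forced alignment. Hidden layers are kept fixed for brevity."""
    for _ in range(epochs):
        h = inputs
        for W, b in zip(weights[:-1], biases[:-1]):
            h = np.maximum(0.0, h @ W + b)                      # forward through hidden layers
        logits = h @ weights[-1] + biases[-1]
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs = e / e.sum(axis=1, keepdims=True)
        grad = probs
        grad[np.arange(len(grad)), state_labels] -= 1.0         # softmax cross-entropy gradient
        weights[-1] -= lr * (h.T @ grad) / len(grad)            # update matrix weight values
        biases[-1] -= lr * grad.mean(axis=0)                    # update matrix bias values
```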
In one embodiment, before obtaining the target audio data input based on the interactive application, the processor 1001 further performs the following operation:
obtain the occurrence probabilities of training word sequence data from a training word sequence corpus, and generate the N-Gram language model according to the occurrence probabilities of the training word sequence data.
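Estimating the model from occurrence probabilities can be sketched for the bigram case as relative-frequency counting over the training word sequence corpus (the `<s>` sentence-start token is an assumed convention):

```python
from collections import Counter

def build_bigram(corpus):
    """corpus: list of word lists. Returns P(w | w_prev) as relative frequencies."""
    uni, bi = Counter(), Counter()
    for words in corpus:
        seq = ['<s>'] + words                     # assumed sentence-start token
        uni.update(seq[:-1])                      # counts of each context word
        bi.update(zip(seq[:-1], seq[1:]))         # counts of each word pair
    return {pair: n / uni[pair[0]] for pair, n in bi.items()}
```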
In one embodiment, when extracting the target Filter bank feature from the target audio data, the processor 1001 specifically performs the following operations:
perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data;
obtain the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
In one embodiment, when using the target Filter bank feature of the target audio data as the input data of the trained DNN model and obtaining the posterior probability feature on the target phoneme state of the target audio data output by the trained DNN model, the processor 1001 specifically performs the following operations:
obtain, according to the time ordering of the at least one frame of audio data, the second audio data of the preset number of frames before and after each frame of first audio data;
use the first target Filter bank feature and the second target Filter bank feature corresponding to the second audio data as the input data of the trained DNN model, and obtain the posterior probability feature on the target phoneme state of the first target Filter bank feature output by the trained DNN model;
wherein the first audio data are the data currently requiring posterior probability feature calculation, and the second audio data are the data possessing dimensional correlation with the first audio data.
In one embodiment, when creating the phoneme decoding network associated with the target audio data and using the phoneme conversion probability of the trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data, the processor 1001 specifically performs the following operation:
create the phoneme decoding network associated with the target audio data, and use the phoneme conversion probability of the trained HMM, the posterior probability feature on the target phoneme state of the first target Filter bank feature and the N-Gram language model to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.
In the embodiment of the present invention, when target audio data input based on an interactive application is acquired, the target Filter bank feature in the target audio data is obtained, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model established by the DNN model and the HMM realizes the speech recognition function, and using the Filter bank feature as the input data of the acoustic model makes it unnecessary to remove the correlation between feature dimensions, so that speech recognition under various practical application environments and pronunciation habits can be satisfied and the accuracy of speech recognition is improved. By integrating the Filter bank feature extraction method with the training method of the DNN-HMM acoustic model, the complete process from training to recognition is realized. The target word sequence data corresponding to the target audio data is obtained in combination with the N-Gram language model; since the N-Gram language model can itself infer the probability that the next word occurs, the probability value of every network path can be weighted by that occurrence probability, increasing the probability of likely network paths and further improving the accuracy of speech recognition.
Those of ordinary skill in the art will appreciate that all or part of the flows in the methods of the above embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the flows of the embodiments of each of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.
The above disclosure is merely the preferred embodiments of the present invention and certainly cannot be used to limit the scope of the claims of the present invention; equivalent variations made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (12)

1. A speech recognition method, characterized by comprising:
acquiring target audio data input based on an interactive application;
extracting a target filter bank (Filter bank) feature from the target audio data;
using the target Filter bank feature of the target audio data as input data of a trained deep neural network (DNN) model, and obtaining a posterior probability feature on a target phoneme state of the target audio data output by the trained DNN model; and
creating a phoneme decoding network associated with the target audio data, and using a phoneme conversion probability of a trained hidden Markov model (HMM) and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, target word sequence data corresponding to the target audio data.
2. The method according to claim 1, characterized in that, before the acquiring target audio data input based on an interactive application, the method further comprises:
training a Gaussian mixture model (GMM) and the HMM using a training audio corpus, obtaining a likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtaining the phoneme conversion probability of the trained HMM;
using a forced alignment operation to convert the likelihood probability feature of each phoneme state into a posterior probability feature of the phoneme state;
calculating matrix weight values and matrix bias values between output layer nodes in the DNN model according to training Filter bank features extracted from the training audio corpus and the posterior probability feature of each phoneme state; and
adding the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
3. The method according to claim 2, characterized in that, before the acquiring target audio data input based on an interactive application, the method further comprises:
obtaining occurrence probabilities of training word sequence data from a training word sequence corpus, and generating an N-Gram language model according to the occurrence probabilities of the training word sequence data.
4. The method according to claim 3, characterized in that the extracting a target Filter bank feature from the target audio data comprises:
performing data framing on the target audio data to obtain at least one frame of audio data in the target audio data; and
obtaining a first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
5. The method according to claim 4, characterized in that the using the target Filter bank feature of the target audio data as input data of the trained DNN model and obtaining the posterior probability feature on the target phoneme state of the target audio data output by the trained DNN model comprises:
obtaining, according to a time ordering of the at least one frame of audio data, second audio data of a preset number of frames before and after each frame of first audio data; and
using the first target Filter bank feature and a second target Filter bank feature corresponding to the second audio data as the input data of the trained DNN model, and obtaining a posterior probability feature on a target phoneme state of the first target Filter bank feature output by the trained DNN model;
wherein the first audio data are the data currently requiring posterior probability feature calculation, and the second audio data are data possessing dimensional correlation with the first audio data.
6. The method according to claim 5, characterized in that the creating a phoneme decoding network associated with the target audio data and using the phoneme conversion probability of the trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data comprises:
creating the phoneme decoding network associated with the target audio data, and using the phoneme conversion probability of the trained HMM, the posterior probability feature on the target phoneme state of the first target Filter bank feature and the N-Gram language model to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.
7. A speech recognition device, characterized by comprising:
an audio data acquisition unit, configured to acquire target audio data input based on an interactive application;
a feature extraction unit, configured to extract a target Filter bank feature from the target audio data;
a feature acquisition unit, configured to use the target Filter bank feature of the target audio data as input data of a trained DNN model and obtain a posterior probability feature on a target phoneme state of the target audio data output by the trained DNN model; and
a word sequence data acquisition unit, configured to create a phoneme decoding network associated with the target audio data, and use a phoneme conversion probability of a trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, target word sequence data corresponding to the target audio data.
8. The device according to claim 7, characterized by further comprising:
an acoustic model training unit, configured to train a GMM and the HMM using a training audio corpus, obtain a likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtain the phoneme conversion probability of the trained HMM;
a feature conversion unit, configured to use a forced alignment operation to convert the likelihood probability feature of each phoneme state into a posterior probability feature of the phoneme state;
a parameter calculation unit, configured to calculate matrix weight values and matrix bias values between output layer nodes in the DNN model according to training Filter bank features extracted from the training audio corpus and the posterior probability feature of each phoneme state; and
an acoustic model generation unit, configured to add the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
9. The device according to claim 8, characterized by further comprising:
a language model generation unit, configured to obtain occurrence probabilities of training word sequence data from a training word sequence corpus and generate an N-Gram language model according to the occurrence probabilities of the training word sequence data.
10. The device according to claim 9, characterized in that the feature extraction unit comprises:
a first data acquisition subunit, configured to perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data; and
a first feature acquisition subunit, configured to obtain a first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
11. The device according to claim 10, characterized in that the feature acquisition unit comprises:
a second data acquisition subunit, configured to obtain, according to a time ordering of the at least one frame of audio data, second audio data of a preset number of frames before and after each frame of first audio data; and
a second feature acquisition subunit, configured to use the first target Filter bank feature and a second target Filter bank feature corresponding to the second audio data as input data of the trained DNN model, and obtain a posterior probability feature on a target phoneme state of the first target Filter bank feature output by the trained DNN model;
wherein the first audio data are the data currently requiring posterior probability feature calculation, and the second audio data are data possessing dimensional correlation with the first audio data.
12. The device according to claim 11, characterized in that the word sequence data acquisition unit is specifically configured to create the phoneme decoding network associated with the target audio data, and use the phoneme conversion probability of the trained HMM, the posterior probability feature on the target phoneme state of the first target Filter bank feature and the N-Gram language model to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.
CN201610272292.3A 2016-04-28 2016-04-28 Speech recognition method and device thereof Active CN105976812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610272292.3A CN105976812B (en) 2016-04-28 2016-04-28 Speech recognition method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610272292.3A CN105976812B (en) 2016-04-28 2016-04-28 Speech recognition method and device thereof

Publications (2)

Publication Number Publication Date
CN105976812A true CN105976812A (en) 2016-09-28
CN105976812B CN105976812B (en) 2019-04-26

Family

ID=56994150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610272292.3A Active CN105976812B (en) 2016-04-28 2016-04-28 Speech recognition method and device thereof

Country Status (1)

Country Link
CN (1) CN105976812B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9240184B1 (en) * 2012-11-15 2016-01-19 Google Inc. Frame-level combination of deep neural network and gaussian mixture models
CN103559879A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and device for extracting acoustic features in language identification system
CN105118501A (en) * 2015-09-07 2015-12-02 徐洋 Speech recognition method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Zhang Deliang: "Implementation of Deep Neural Networks in a Chinese Speech Recognition System", China Master's Theses Full-text Database, Information Science and Technology *
Wang Yi et al.: "A Phoneme Recognition Method Based on Hierarchical Deep Belief Networks", Journal of Applied Sciences *
Xiao Yeming et al.: "Optimization Strategies of Deep Neural Networks for Acoustic Modeling in Mandarin Speech Recognition", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) *
Maimaitiaili Tuerxun et al.: "Application of Deep Neural Networks in Uyghur Large-Vocabulary Continuous Speech Recognition", Journal of Data Acquisition and Processing *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106601240A (en) * 2015-10-16 2017-04-26 三星电子株式会社 Apparatus and method for normalizing input data of acoustic model and speech recognition apparatus
CN106601240B (en) * 2015-10-16 2021-10-01 三星电子株式会社 Apparatus and method for normalizing input data of acoustic model, and speech recognition apparatus
CN109863554B (en) * 2016-10-27 2022-12-02 香港中文大学 Acoustic font model and acoustic font phoneme model for computer-aided pronunciation training and speech processing
CN109863554A (en) * 2016-10-27 2019-06-07 香港中文大学 Acoustics font model and acoustics font phonemic model for area of computer aided pronunciation training and speech processes
CN106710599A (en) * 2016-12-02 2017-05-24 深圳撒哈拉数据科技有限公司 Particular sound source detection method and particular sound source detection system based on deep neural network
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN106919662A (en) * 2017-02-14 2017-07-04 复旦大学 A kind of music recognition methods and system
CN106919662B (en) * 2017-02-14 2021-08-31 复旦大学 Music identification method and system
CN106952645A (en) * 2017-03-24 2017-07-14 广东美的制冷设备有限公司 The recognition methods of phonetic order, the identifying device of phonetic order and air-conditioner
CN106952645B (en) * 2017-03-24 2020-11-17 广东美的制冷设备有限公司 Voice instruction recognition method, voice instruction recognition device and air conditioner
CN108288467A (en) * 2017-06-07 2018-07-17 腾讯科技(深圳)有限公司 A kind of audio recognition method, device and speech recognition engine
CN108288467B (en) * 2017-06-07 2020-07-14 腾讯科技(深圳)有限公司 Voice recognition method and device and voice recognition engine
US11062699B2 (en) 2017-06-12 2021-07-13 Ping An Technology (Shenzhen) Co., Ltd. Speech recognition with trained GMM-HMM and LSTM models
CN107331384B (en) * 2017-06-12 2018-05-04 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107633842A (en) * 2017-06-12 2018-01-26 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
WO2018227780A1 (en) * 2017-06-12 2018-12-20 平安科技(深圳)有限公司 Speech recognition method and device, computer device and storage medium
WO2018227781A1 (en) * 2017-06-12 2018-12-20 平安科技(深圳)有限公司 Voice recognition method, apparatus, computer device, and storage medium
CN107331384A (en) * 2017-06-12 2017-11-07 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107170444A (en) * 2017-06-15 2017-09-15 上海航空电器有限公司 Aviation cockpit environment self-adaption phonetic feature model training method
WO2018232591A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Sequence recognition processing
WO2019019252A1 (en) * 2017-07-28 2019-01-31 平安科技(深圳)有限公司 Acoustic model training method, speech recognition method and apparatus, device and medium
CN107748898A (en) * 2017-11-03 2018-03-02 北京奇虎科技有限公司 File classifying method, device, computing device and computer-readable storage medium
CN107871506A (en) * 2017-11-15 2018-04-03 北京云知声信息技术有限公司 The awakening method and device of speech identifying function
CN108245177A (en) * 2018-01-05 2018-07-06 安徽大学 Intelligent infant monitoring wearable device and GMM-HMM-DNN-based infant crying identification method
CN108245177B (en) * 2018-01-05 2021-01-01 安徽大学 Intelligent infant monitoring wearable device and GMM-HMM-DNN-based infant crying identification method
CN108648769A (en) * 2018-04-20 2018-10-12 百度在线网络技术(北京)有限公司 Voice activity detection method, apparatus and equipment
CN110491388A (en) * 2018-05-15 2019-11-22 视联动力信息技术股份有限公司 A kind of processing method and terminal of audio data
CN108922521A (en) * 2018-08-15 2018-11-30 合肥讯飞数码科技有限公司 A kind of voice keyword retrieval method, apparatus, equipment and storage medium
CN109274845A (en) * 2018-08-31 2019-01-25 平安科技(深圳)有限公司 Intelligent sound pays a return visit method, apparatus, computer equipment and storage medium automatically
CN109637523A (en) * 2018-12-28 2019-04-16 睿驰达新能源汽车科技(北京)有限公司 A kind of voice-based door lock for vehicle control method and device
CN109887484A (en) * 2019-02-22 2019-06-14 平安科技(深圳)有限公司 A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN109887484B (en) * 2019-02-22 2023-08-04 平安科技(深圳)有限公司 Dual learning-based voice recognition and voice synthesis method and device
CN110491382A (en) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Audio recognition method, device and interactive voice equipment based on artificial intelligence
CN110390948B (en) * 2019-07-24 2022-04-19 厦门快商通科技股份有限公司 Method and system for rapid speech recognition
CN110390948A (en) * 2019-07-24 2019-10-29 厦门快商通科技股份有限公司 A kind of method and system of Rapid Speech identification
CN110556125A (en) * 2019-10-15 2019-12-10 出门问问信息科技有限公司 Feature extraction method and device based on voice signal and computer storage medium
CN112863496A (en) * 2019-11-27 2021-05-28 阿里巴巴集团控股有限公司 Voice endpoint detection method and device
CN112863496B (en) * 2019-11-27 2024-04-02 阿里巴巴集团控股有限公司 Voice endpoint detection method and device
CN111243574A (en) * 2020-01-13 2020-06-05 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
CN111613209A (en) * 2020-04-14 2020-09-01 北京三快在线科技有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111785256A (en) * 2020-06-28 2020-10-16 北京三快在线科技有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN113284514A (en) * 2021-05-19 2021-08-20 北京大米科技有限公司 Audio processing method and device
CN113780408A (en) * 2021-09-09 2021-12-10 安徽农业大学 Live pig state identification method based on audio features
CN113640699B (en) * 2021-10-14 2021-12-24 南京国铁电气有限责任公司 Fault judgment method, system and equipment for microcomputer control type alternating current and direct current power supply system
CN113640699A (en) * 2021-10-14 2021-11-12 南京国铁电气有限责任公司 Fault judgment method, system and equipment for microcomputer control type alternating current and direct current power supply system
CN116978368A (en) * 2023-09-25 2023-10-31 腾讯科技(深圳)有限公司 Wake-up word detection method and related device
CN116978368B (en) * 2023-09-25 2023-12-15 腾讯科技(深圳)有限公司 Wake-up word detection method and related device

Also Published As

Publication number Publication date
CN105976812B (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN105976812A (en) Voice identification method and equipment thereof
CN107195296B (en) Voice recognition method, device, terminal and system
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN107610709B (en) Method and system for training voiceprint recognition model
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN107481717B (en) Acoustic model training method and system
CN108615525B (en) Voice recognition method and device
WO2017218465A1 (en) Neural network-based voiceprint information extraction method and apparatus
CN109509470A (en) Voice interactive method, device, computer readable storage medium and terminal device
CN112786004B (en) Speech synthesis method, electronic equipment and storage device
CN107093422B (en) Voice recognition method and voice recognition system
KR20200044388A (en) Device and method to recognize voice and device and method to train voice recognition model
CN112837669B (en) Speech synthesis method, device and server
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN113096647B (en) Voice model training method and device and electronic equipment
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN114283783A (en) Speech synthesis method, model training method, device and storage medium
CN112542173A (en) Voice interaction method, device, equipment and medium
CN112216270A (en) Method and system for recognizing speech phonemes, electronic equipment and storage medium
CN111640423A (en) Word boundary estimation method and device and electronic equipment
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN116665642A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN113724689B (en) Speech recognition method and related device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant