CN105976812A - Voice identification method and equipment thereof - Google Patents

Voice identification method and equipment thereof

Info

Publication number
CN105976812A
Authority
CN
China
Prior art keywords
data
feature
target
training
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610272292.3A
Other languages
Chinese (zh)
Other versions
CN105976812B (en)
Inventor
钱柄桦
吴富章
李为
李科
吴永坚
黄飞跃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201610272292.3A priority Critical patent/CN105976812B/en
Publication of CN105976812A publication Critical patent/CN105976812A/en
Application granted granted Critical
Publication of CN105976812B publication Critical patent/CN105976812B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/144 Training of HMMs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention discloses a voice identification method and equipment thereof. The method comprises the following steps: acquiring target audio data input through an interactive application; extracting a target Filter bank feature from the target audio data; taking the target Filter bank feature of the target audio data as input data of a trained DNN model, and acquiring the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model; and creating a phoneme decoding network associated with the target audio data, and using the phoneme transition probabilities of a trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. By using the method and the equipment of the invention, speech recognition under various practical application environments and pronunciation habits can be satisfied, and the accuracy of speech recognition is increased.

Description

Speech recognition method and device therefor
Technical field
The present invention relates to the field of computer technology, and in particular to a speech recognition method and a device therefor.
Background technology
With the continuous development and maturation of computer technology, application scenarios for speech recognition have gradually increased, for example: extracting contact information in a terminal from audio input by a user, generating corresponding chat content from audio input by a user, performing user verification through audio input by a user, and so on. Speech recognition technology facilitates the user's operation of terminals such as mobile phones and computers, and improves the user experience.
Existing speech recognition technology builds its acoustic model on the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM). In practical applications, the Mel Frequency Cepstrum Coefficient (MFCC) feature must be extracted from the target audio and input into the acoustic model, which finally outputs the speech recognition result for the target audio. Because GMM-HMM acoustic modeling is a discriminative modeling approach intended to solve the problem of distinguishing pronunciation phoneme states, it requires MFCC features with independence between feature dimensions as the input data of the acoustic model. It therefore cannot satisfy speech recognition under various practical application environments and pronunciation habits, which reduces the accuracy of speech recognition.
Summary of the invention
The embodiments of the present invention provide a speech recognition method and a device therefor, which can satisfy speech recognition under various practical application environments and pronunciation habits and improve the accuracy of speech recognition.
A first aspect of the embodiments of the present invention provides a speech recognition method, which may include:
acquiring target audio data input through an interactive application;
extracting a target Filter bank feature from the target audio data;
taking the target Filter bank feature of the target audio data as input data of a trained Deep Neural Network (DNN) model, and acquiring the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model;
creating a phoneme decoding network associated with the target audio data, and using the phoneme transition probabilities of a trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data.
A second aspect of the embodiments of the present invention provides a speech recognition device, which may include:
an audio data acquiring unit, configured to acquire target audio data input through an interactive application;
a feature extraction unit, configured to extract a target Filter bank feature from the target audio data;
a feature acquiring unit, configured to take the target Filter bank feature of the target audio data as input data of a trained DNN model, and to acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model;
a word sequence data acquiring unit, configured to create a phoneme decoding network associated with the target audio data, and to use the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data.
In the embodiments of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank feature is extracted from the target audio data, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and the Filter bank feature is used as the input data of the acoustic model, so there is no need to remove the correlation between feature dimensions. Speech recognition under various practical application environments and pronunciation habits can thus be satisfied, improving the accuracy of speech recognition.
Brief description of the drawings
In order to describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another speech recognition method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a speech recognition device according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another speech recognition device according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a feature extraction unit according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a feature acquiring unit according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of still another speech recognition device according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The speech recognition method provided by the embodiments of the present invention can be applied to scenarios in which target audio data input by a terminal user (for example, audio containing digits or audio containing words) is recognized and a corresponding word sequence (for example, a digit string or a sentence) is generated. For example: a speech recognition device acquires target audio data input through an interactive application; the speech recognition device extracts the target Filter bank feature from the target audio data; the speech recognition device takes the target Filter bank feature of the target audio data as input data of a trained DNN model, and acquires the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model; the speech recognition device creates a phoneme decoding network associated with the target audio data, and uses the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and the Filter bank feature is used as the input data of the acoustic model, so there is no need to remove the correlation between feature dimensions. Speech recognition under various practical application environments and pronunciation habits can thus be satisfied, improving the accuracy of speech recognition.
The speech recognition device in the embodiments of the present invention may be a terminal device with a speech recognition function, such as a tablet computer, a smartphone, a palmtop computer, a vehicle-mounted terminal, a personal computer (PC), or a mobile internet device (MID); it may also be a server device with a speech recognition function corresponding to the interactive application. The interactive application may be a terminal application that needs to combine audio input by the user to implement a corresponding interactive function, such as a transaction application or an instant messaging application; verification code input, password input, communication content input, and the like can be performed through the speech recognition method provided by the embodiments of the present invention.
The speech recognition method provided by the embodiments of the present invention is described in detail below with reference to Fig. 1 and Fig. 2.
Refer to Fig. 1, which is a schematic flowchart of a speech recognition method according to an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment of the present invention may include the following steps S101-S104.
S101: acquire target audio data input through an interactive application;
Specifically, the speech recognition device acquires the target audio data input by the user through the interactive application. The target audio data may specifically be the voice input by the user on the application interface of the interactive application that currently requires voice input, that is, the audio data on which speech recognition currently needs to be performed.
S102: extract the target Filter bank feature from the target audio data;
Specifically, the speech recognition device may extract the target Filter bank feature from the target audio data. It should be noted that the speech recognition device needs to split the target audio data into multiple frames of audio data and extract the Filter bank feature of each frame separately for input into the trained DNN model described below; that is, the input is framed and the posterior probability feature over the phoneme states is calculated frame by frame, as in the sketch below. Further, the speech recognition device may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently needs to be calculated; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
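To make the framing step concrete, the following is a minimal sketch in Python, assuming 25 ms frames with a 10 ms shift; these window parameters are illustrative and are not specified by this embodiment:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D audio signal into overlapping frames.
    The 25 ms / 10 ms window parameters are illustrative assumptions."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = max(1, 1 + (len(signal) - frame_len) // shift)
    pad = (n_frames - 1) * shift + frame_len - len(signal)
    if pad > 0:  # zero-pad the tail so the last frame is complete
        signal = np.concatenate([signal, np.zeros(pad)])
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)
```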
S103: take the target Filter bank feature of the target audio data as the input data of the trained DNN model, and acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model;
Specifically, the speech recognition device may take the target Filter bank feature of the target audio data as the input data of the trained DNN model, and acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model. Preferably, a phoneme state is a phonetic symbol, and the target phoneme states are the phoneme states present in the target audio data. The DNN model may obtain the matrix weight values and matrix bias values between output layer nodes in the training process. There may be at least one output layer node; the number of output layer nodes is related to (for example, equal to) the number of phoneme states, and one output layer node represents the feature vector of one phoneme state.
S104: create a phoneme decoding network associated with the target audio data, and use the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data;
Specifically, the speech recognition device may create a phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word-graph decoding network built on the Weighted Finite-State Transducer (WFST) framework, with phoneme state sequences as input and word sequence data as output. It can be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.
The speech recognition device uses the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. The phoneme transition probabilities of the trained HMM include, for each phoneme state, the probability of the state jumping to itself and the probability of the state jumping to its next phoneme state. It can be understood that the speech recognition device may, according to the phoneme transition probabilities of the trained HMM and the posterior probability features over the target phoneme states of all the first target Filter bank features, set the probability value of every network path in the phoneme decoding network, filter out the optimal path according to the probability values of the network paths, and take the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data, as in the path-scoring sketch below.
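The embodiment does not spell out the search itself; the following is a minimal Viterbi-style sketch of how per-frame DNN posteriors and the two HMM transition probabilities (self-loop and next-state) can be combined to score state paths over a simple left-to-right topology. The linear topology and all names are assumptions; a full WFST decoder is considerably more involved:

```python
import numpy as np

def viterbi_path(log_posteriors, log_self, log_next):
    """log_posteriors: (T, P) log posterior of each phoneme state per frame (DNN output).
    log_self, log_next: (P,) log probability of a state jumping to itself / to its next state.
    Returns the best left-to-right state path (illustrative linear topology)."""
    T, P = log_posteriors.shape
    score = np.full(P, -np.inf)
    score[0] = log_posteriors[0, 0]          # start in the first state
    back = np.zeros((T, P), dtype=int)
    for t in range(1, T):
        stay = score + log_self               # predecessor is the state itself
        move = np.full(P, -np.inf)
        move[1:] = score[:-1] + log_next[:-1]  # predecessor is the previous state
        back[t] = np.where(stay >= move, np.arange(P), np.arange(P) - 1)
        score = np.maximum(stay, move) + log_posteriors[t]
    path = [int(np.argmax(score))]            # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

In a real decoder the same max-and-backtrack recursion runs over the arcs of the WFST word graph rather than over a single linear state chain.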
In this embodiment of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank feature is extracted from the target audio data, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and the Filter bank feature is used as the input data of the acoustic model, so there is no need to remove the correlation between feature dimensions. Speech recognition under various practical application environments and pronunciation habits can thus be satisfied, improving the accuracy of speech recognition.
Refer to Fig. 2, which is a schematic flowchart of another speech recognition method according to an embodiment of the present invention. As shown in Fig. 2, the method of this embodiment of the present invention may include the following steps S201-S211.
S201: use a training audio corpus to train the GMM and the HMM, obtain the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtain the phoneme transition probabilities of the trained HMM;
Specifically, before the DNN model is trained, the GMM-HMM acoustic model needs to be trained first. The speech recognition device may use a training audio corpus to train the GMM and the HMM, obtain the likelihood probability feature of each phoneme state in the at least one phoneme state output by the trained GMM, and obtain the phoneme transition probabilities of the trained HMM. The training audio corpus may, as far as possible, contain audio data under scenarios such as different noise environments, different speaking rates, and different pauses between words.
It should be noted that the speech recognition device may perform data preprocessing on the training audio corpus. The data preprocessing may include: performing data framing, data pre-emphasis, data windowing, and the like on the training audio corpus to obtain at least one frame of audio data in the time domain; performing a fast Fourier transform to convert the at least one frame of audio data to the frequency domain, obtaining at least one power spectrum corresponding to the at least one frame of audio data in the frequency domain; passing the at least one power spectrum through Mel-frequency filters with a triangular filtering characteristic to obtain at least one Mel power spectrum; and taking the logarithmic energy of the at least one Mel power spectrum to obtain at least one Mel log-energy spectrum, which constitutes the Filter bank feature. The Discrete Cosine Transform (DCT) is then used to remove the data correlation of the at least one Mel log-energy spectrum to obtain the MFCC feature. The speech recognition device takes the MFCC feature as the input data of the GMM so as to train the GMM and the HMM, and obtains the likelihood probability feature of each phoneme state in the at least one phoneme state output by the trained GMM as well as the phoneme transition probabilities of the trained HMM. It can be understood that, for the same frame of audio data in the training audio corpus, the Filter bank feature and the MFCC feature are in one-to-one correspondence.
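A minimal sketch of this preprocessing chain (FFT, triangular Mel filters, log energy, and the optional DCT step that turns the Filter bank feature into the MFCC feature). The FFT size and filter count are illustrative assumptions; this embodiment does not fix them:

```python
import numpy as np
from scipy.fftpack import dct  # DCT-II, used only for the optional MFCC step

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel-frequency filters (illustrative construction)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0), mel(sample_rate / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def fbank_and_mfcc(frames, sample_rate, n_fft=512, n_filters=24):
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # power spectrum per frame
    mel_power = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    fbank_feat = np.log(mel_power + 1e-10)                 # Filter bank feature (log Mel energy)
    mfcc_feat = dct(fbank_feat, type=2, axis=1, norm='ortho')[:, :13]
    return fbank_feat, mfcc_feat
```

Note that fbank_feat retains the correlation between feature dimensions, while the DCT in the last line is exactly the decorrelation step that the DNN input does not need.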
S202: use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state;
Specifically, the speech recognition device may use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state. It can be understood that, because the likelihood probability feature is a similarity-type probability feature, for one frame of audio data in the training audio corpus, the sum of the feature values of its likelihood probability features over the phoneme states is not 1, whereas the sum of the feature values of its posterior probability features over the phoneme states is 1. Therefore, the phoneme state whose likelihood probability feature value is the largest is chosen, the feature value of the posterior probability feature on that phoneme state is set to 1, and the feature values of the posterior probability features on the other phoneme states of this frame of audio data are set to 0. Proceeding frame by frame in this way, the likelihood probability feature of every frame of audio data in the training audio corpus over the phoneme states is converted, obtaining the posterior probability feature of every frame of audio data in the training audio corpus over the phoneme states.
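A minimal sketch of this conversion, assuming the forced-alignment step has already produced a (frames x states) matrix of likelihoods:

```python
import numpy as np

def likelihood_to_posterior(likelihoods):
    """likelihoods: (T, P) per-frame likelihood of each phoneme state (from GMM alignment).
    Returns (T, P) one-hot posteriors: 1 on the max-likelihood state, 0 elsewhere."""
    posteriors = np.zeros_like(likelihoods)
    posteriors[np.arange(len(likelihoods)), np.argmax(likelihoods, axis=1)] = 1.0
    return posteriors
```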
S203: calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state;
S204: add the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model;
Specifically, the speech recognition device may calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state. Preferably, the speech recognition device may, based on the method described above, extract the training Filter bank feature corresponding to each frame of audio data in the training audio corpus, and take each training Filter bank feature together with its corresponding posterior probability feature as a training sample pair, so that the training audio corpus may contain multiple training sample pairs. Based on the multiple training sample pairs, the backward propagation algorithm under the maximum likelihood criterion is used to calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model. The speech recognition device then adds the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
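This embodiment names backward propagation under a maximum likelihood criterion but gives no further detail, so the sketch below only shows the assembly of training sample pairs; any standard backpropagation implementation is assumed to consume them:

```python
def build_training_pairs(fbank_feats, posteriors):
    """Pair each frame's training Filter bank feature with its one-hot posterior target.
    Each pair supplies one (input, target) example for backpropagation."""
    assert len(fbank_feats) == len(posteriors)
    return list(zip(fbank_feats, posteriors))  # [(x_frame, y_one_hot), ...]
```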
S205: obtain the occurrence probabilities of training word sequence data in a training word sequence corpus, and generate an N-Gram language model according to the occurrence probabilities of the training word sequence data;
Specifically, while training the DNN model and the HMM acoustic model, the speech recognition device may also train a language model. The speech recognition device may obtain the occurrence probabilities of the training word sequence data in the training word sequence corpus, and generate an N-Gram language model according to those occurrence probabilities. The N-Gram language model is based on the assumption that the occurrence of the K-th word is related only to the preceding K-1 words and is unrelated to any other word, so that the probability of a word string is the product of the occurrence probabilities of its individual words.
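A minimal sketch of that factorization with maximum-likelihood counts; a practical model would add smoothing for unseen n-grams, which is omitted here:

```python
from collections import Counter

def train_ngram(sentences, n=2):
    """Count n-grams and (n-1)-gram histories over a training word sequence corpus."""
    grams, hists = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(len(padded) - n + 1):
            grams[tuple(padded[i:i + n])] += 1
            hists[tuple(padded[i:i + n - 1])] += 1
    return grams, hists

def sequence_prob(words, grams, hists, n=2):
    """P(word string) = product over words of P(w_k | preceding K-1 words)."""
    padded = ["<s>"] * (n - 1) + words + ["</s>"]
    p = 1.0
    for i in range(n - 1, len(padded)):
        h = tuple(padded[i - n + 1:i])
        p *= grams[tuple(padded[i - n + 1:i + 1])] / max(hists[h], 1)
    return p
```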
S206: acquire target audio data input through an interactive application;
Specifically, the speech recognition device acquires the target audio data input by the user through the interactive application. The target audio data may specifically be the voice input by the user on the application interface of the interactive application that currently requires voice input, that is, the audio data on which speech recognition currently needs to be performed.
S207: perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data;
S208: acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data;
Specifically, the speech recognition device needs to split the target audio data into multiple frames of audio data and extract the Filter bank feature of each frame separately for input into the trained DNN model described below; that is, the input is framed and the posterior probability feature over the phoneme states is calculated frame by frame. Further, the speech recognition device may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently needs to be calculated; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
Further, the speech recognition device may perform data preprocessing on the target audio data. The data preprocessing may include: data framing, data pre-emphasis, data windowing, and the like, to obtain at least one frame of audio data in the time domain; performing a fast Fourier transform to convert the at least one frame of audio data to the frequency domain, obtaining at least one power spectrum corresponding to the at least one frame of audio data in the frequency domain; passing the at least one power spectrum through Mel-frequency filters with a triangular filtering characteristic to obtain at least one Mel power spectrum; and taking the logarithmic energy of the at least one Mel power spectrum to obtain at least one Mel log-energy spectrum. The set of Mel log-energy spectra obtained at this point is the target Filter bank feature. It can be understood that the Filter bank feature has data correlation between different feature dimensions, whereas the MFCC feature is the feature obtained by using the Discrete Cosine Transform (DCT) to remove the data correlation of the Filter bank feature.
Preferably, the speech recognition device may further perform feature post-processing on the target Filter bank feature. The feature post-processing may include feature extension and feature normalization. Feature extension may consist of computing the first-order difference and second-order difference features of the target Filter bank feature to obtain, for each frame of first audio data, a target Filter bank feature with a preset number of feature dimensions. Feature normalization may consist of using the Cepstrum Mean Subtraction (CMS) technique to normalize the target Filter bank feature with the preset number of feature dimensions corresponding to each frame of first audio data, obtaining the first target Filter bank feature corresponding to each frame of first audio data. Preferably, the preset dimension may be 72, as in the post-processing sketch below.
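A minimal sketch of this post-processing, assuming a 24-dim base Filter bank feature so that stacking the first- and second-order differences reaches the preferred 72 dimensions; np.gradient as the difference operator is an illustrative choice:

```python
import numpy as np

def post_process(fbank_feat):
    """fbank_feat: (T, D) log Mel features. Returns (T, 3*D) CMS-normalized features.
    With D = 24 this yields the preferred 72 dimensions."""
    delta = np.gradient(fbank_feat, axis=0)         # first-order difference (illustrative)
    delta2 = np.gradient(delta, axis=0)             # second-order difference
    feat = np.concatenate([fbank_feat, delta, delta2], axis=1)
    return feat - feat.mean(axis=0, keepdims=True)  # Cepstrum Mean Subtraction
```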
S209: according to the time ordering of the at least one frame of audio data, acquire the second audio data consisting of a preset number of frames before and after each frame of first audio data;
S210: take the first target Filter bank feature and the second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and acquire the posterior probability feature over the target phoneme states of the first target Filter bank feature output by the trained DNN model;
Specifically, the speech recognition device may, according to the time ordering of the at least one frame of audio data, acquire the second audio data consisting of the preset number of frames before and after each frame of first audio data. The speech recognition device takes the first target Filter bank feature and the second target Filter bank features corresponding to the second audio data as the input data of the trained DNN model, and acquires the posterior probability feature over the target phoneme states of the first target Filter bank feature output by the trained DNN model. It can be understood that the second audio data is data that has a dimensional association with the first audio data.
Suppose the target audio data contains N frames of audio data, the first target Filter bank feature corresponding to the i-th frame of first audio data is F_i, i = 1, 2, 3, ..., N, and the preset number of frames before and after is 8 frames each. The input data may then include F_i and the second target Filter bank features of the 8 frames before and the 8 frames after the i-th frame of first audio data. With the preferred preset dimension above, the number of input layer nodes of the trained DNN model corresponding to the input data is (8+1+8) x 72 = 1224 nodes, and the number of output layer nodes of the trained DNN model equals the number P of all phoneme states. Between the input layer and the output layer there is a preset number of hidden layers, preferably 3 hidden layers with 1024 nodes each. Denoting the matrix weight value and matrix bias value between the (M-1)-th layer and the M-th layer of the trained DNN model as $W_M$ and $b_M$, the feature vector $h_M^i$ of the phoneme state corresponding to the i-th frame of first audio data at the M-th layer satisfies $h_M^i = f(W_M h_{M-1}^i + b_M)$, where $f(x)$ is the activation function, preferably the ReLU function. The posterior probability feature of F_i on the M-th phoneme state output by the trained DNN model is then:
$$O_M^i = \frac{\exp(h_M^i)}{\sum_{m=1}^{P} \exp(h_m^i)}$$
where the sum runs over the P phoneme states of the output layer.
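A minimal sketch of the forward pass just described: a 17-frame context window (8 before, 8 after) of 72-dim features gives a 1224-dim input, three 1024-node ReLU hidden layers, and a softmax over the P phoneme states. All weights are assumed already trained:

```python
import numpy as np

def dnn_posteriors(feats, weights, biases, context=8):
    """feats: (T, 72) post-processed features. weights/biases: trained layer parameters,
    e.g. shapes 1224 -> 1024 -> 1024 -> 1024 -> P. Returns (T, P) phoneme-state posteriors."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    # stack 8 frames before + current + 8 after: (T, 17 * 72) = (T, 1224)
    x = np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ W + b)             # ReLU hidden layers
    logits = x @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # softmax over the P phoneme states
```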
S211: create a phoneme decoding network associated with the target audio data, and use the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data;
Specifically, the speech recognition device may create a phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word-graph decoding network built on the WFST framework, with phoneme state sequences as input and word sequence data as output. It can be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.
The speech recognition device uses the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. The phoneme transition probabilities of the trained HMM include, for each phoneme state, the probability of the state jumping to itself and the probability of the state jumping to its next phoneme state. It can be understood that the speech recognition device may, according to the phoneme transition probabilities of the trained HMM and the posterior probability features over the target phoneme states of all the first target Filter bank features, set the probability value of every network path in the phoneme decoding network, filter out the optimal path according to the probability values of the network paths, and take the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data.
Further, the speech recognition device may use the phoneme transition probabilities of the trained HMM, the posterior probability features over the target phoneme states of the first target Filter bank features, and the N-Gram language model to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. Because the N-Gram language model can itself infer the probability of the next word occurring, the probability value of every network path can be weighted with these occurrence probabilities, increasing the probability of the correct network path. Acquiring the target word sequence data corresponding to the target audio data by additionally combining the N-Gram language model can thus further improve the accuracy of speech recognition.
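The embodiment describes weighting path probabilities with the N-Gram occurrence probabilities but gives no formula; the log-linear combination and the scale factor in the sketch below are assumptions, and sequence_prob is reused from the N-Gram sketch above:

```python
import math

def path_score(acoustic_logprob, word_sequence, grams, hists, lm_scale=1.0, n=2):
    """Combine the acoustic path score with the N-Gram probability of its word sequence."""
    lm_logprob = math.log(sequence_prob(word_sequence, grams, hists, n) + 1e-300)
    return acoustic_logprob + lm_scale * lm_logprob
```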
In this embodiment of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank feature is extracted from the target audio data, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and the Filter bank feature is used as the input data of the acoustic model, so there is no need to remove the correlation between feature dimensions; speech recognition under various practical application environments and pronunciation habits can be satisfied, improving the accuracy of speech recognition. By fusing the Filter bank feature extraction method with the training method of the DNN-HMM acoustic model, a complete process from training to recognition is achieved. The target word sequence data corresponding to the target audio data is acquired by combining the N-Gram language model; because the N-Gram language model can itself infer the probability of the next word occurring, the probability value of every network path can be weighted with these occurrence probabilities, increasing the probability of the correct network path and further improving the accuracy of speech recognition.
The speech recognition device provided by the embodiments of the present invention is described in detail below with reference to Fig. 3 to Fig. 6. It should be noted that the speech recognition devices shown in Fig. 3 to Fig. 6 are configured to perform the methods of the embodiments shown in Fig. 1 and Fig. 2 of the present invention. For ease of description, only the parts related to the embodiments of the present invention are shown; for the specific technical details that are not disclosed, refer to the embodiments shown in Fig. 1 and Fig. 2 of the present invention.
Refer to Fig. 3, which is a schematic structural diagram of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 3, the speech recognition device 1 of this embodiment of the present invention may include: an audio data acquiring unit 11, a feature extraction unit 12, a feature acquiring unit 13, and a word sequence data acquiring unit 14.
The audio data acquiring unit 11 is configured to acquire target audio data input through an interactive application.
In a specific implementation, the audio data acquiring unit 11 acquires the target audio data input by the user through the interactive application. The target audio data may specifically be the voice input by the user on the application interface of the interactive application that currently requires voice input, that is, the audio data on which speech recognition currently needs to be performed.
The feature extraction unit 12 is configured to extract the target Filter bank feature from the target audio data.
In a specific implementation, the feature extraction unit 12 may extract the target Filter bank feature from the target audio data. It should be noted that the feature extraction unit 12 needs to split the target audio data into multiple frames of audio data and extract the Filter bank feature of each frame separately for input into the trained DNN model described below; that is, the input is framed and the posterior probability feature over the phoneme states is calculated frame by frame. Further, the feature extraction unit 12 may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently needs to be calculated; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
The feature acquiring unit 13 is configured to take the target Filter bank feature of the target audio data as the input data of the trained DNN model, and to acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model.
In a specific implementation, the feature acquiring unit 13 may take the target Filter bank feature of the target audio data as the input data of the trained DNN model, and acquire the posterior probability feature over the target phoneme states of the target audio data output by the trained DNN model. Preferably, a phoneme state is a phonetic symbol, and the target phoneme states are the phoneme states present in the target audio data. The DNN model may obtain the matrix weight values and matrix bias values between output layer nodes in the training process. There may be at least one output layer node; the number of output layer nodes is related to (for example, equal to) the number of phoneme states, and one output layer node represents the feature vector of one phoneme state.
The word sequence data acquiring unit 14 is configured to create a phoneme decoding network associated with the target audio data, and to use the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data.
In a specific implementation, the word sequence data acquiring unit 14 may create a phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word-graph decoding network built on the WFST framework, with phoneme state sequences as input and word sequence data as output. It can be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.
The word sequence data acquiring unit 14 uses the phoneme transition probabilities of the trained HMM together with the posterior probability feature over the target phoneme states of the target audio data to acquire, in the decoding network, the target word sequence data corresponding to the target audio data. The phoneme transition probabilities of the trained HMM include, for each phoneme state, the probability of the state jumping to itself and the probability of the state jumping to its next phoneme state. It can be understood that the word sequence data acquiring unit 14 may, according to the phoneme transition probabilities of the trained HMM and the posterior probability features over the target phoneme states of all the first target Filter bank features, set the probability value of every network path in the phoneme decoding network, filter out the optimal path according to the probability values of the network paths, and take the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data.
In this embodiment of the present invention, when target audio data input through an interactive application is acquired, the target Filter bank feature is extracted from the target audio data, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model built from the DNN model and the HMM implements the speech recognition function, and the Filter bank feature is used as the input data of the acoustic model, so there is no need to remove the correlation between feature dimensions. Speech recognition under various practical application environments and pronunciation habits can thus be satisfied, improving the accuracy of speech recognition.
Refer to Fig. 4, which is a schematic structural diagram of another speech recognition device according to an embodiment of the present invention. As shown in Fig. 4, the speech recognition device 1 of this embodiment of the present invention may include: an audio data acquiring unit 11, a feature extraction unit 12, a feature acquiring unit 13, a word sequence data acquiring unit 14, an acoustic model training unit 15, a feature conversion unit 16, a parameter calculation unit 17, an acoustic model generating unit 18, and a language model generating unit 19.
The acoustic model training unit 15 is configured to use a training audio corpus to train the GMM and the HMM, to obtain the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and to obtain the phoneme transition probabilities of the trained HMM.
In a specific implementation, before the DNN model is trained, the GMM-HMM acoustic model needs to be trained first. The acoustic model training unit 15 may use a training audio corpus to train the GMM and the HMM, obtain the likelihood probability feature of each phoneme state in the at least one phoneme state output by the trained GMM, and obtain the phoneme transition probabilities of the trained HMM. The training audio corpus may, as far as possible, contain audio data under scenarios such as different noise environments, different speaking rates, and different pauses between words.
It should be noted that the acoustic model training unit 15 may perform data preprocessing on the training audio corpus. The data preprocessing may include: performing data framing, data pre-emphasis, data windowing, and the like on the training audio corpus to obtain at least one frame of audio data in the time domain; performing a fast Fourier transform to convert the at least one frame of audio data to the frequency domain, obtaining at least one power spectrum corresponding to the at least one frame of audio data in the frequency domain; passing the at least one power spectrum through Mel-frequency filters with a triangular filtering characteristic to obtain at least one Mel power spectrum; and taking the logarithmic energy of the at least one Mel power spectrum to obtain at least one Mel log-energy spectrum, which constitutes the Filter bank feature. The DCT is then used to remove the data correlation of the at least one Mel log-energy spectrum to obtain the MFCC feature. The acoustic model training unit 15 takes the MFCC feature as the input data of the GMM so as to train the GMM and the HMM, and obtains the likelihood probability feature of each phoneme state in the at least one phoneme state output by the trained GMM as well as the phoneme transition probabilities of the trained HMM. It can be understood that, for the same frame of audio data in the training audio corpus, the Filter bank feature and the MFCC feature are in one-to-one correspondence.
The feature conversion unit 16 is configured to use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state.
In a specific implementation, the feature conversion unit 16 may use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of each phoneme state. It can be understood that, because the likelihood probability feature is a similarity-type probability feature, for one frame of audio data in the training audio corpus, the sum of the feature values of its likelihood probability features over the phoneme states is not 1, whereas the sum of the feature values of its posterior probability features over the phoneme states is 1. Therefore, the phoneme state whose likelihood probability feature value is the largest is chosen, the feature value of the posterior probability feature on that phoneme state is set to 1, and the feature values of the posterior probability features on the other phoneme states of this frame of audio data are set to 0. Proceeding frame by frame in this way, the likelihood probability feature of every frame of audio data in the training audio corpus over the phoneme states is converted, obtaining the posterior probability feature of every frame of audio data in the training audio corpus over the phoneme states.
The parameter calculation unit 17 is configured to calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state.
The acoustic model generating unit 18 is configured to add the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
In a specific implementation, the parameter calculation unit 17 may calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability features of each phoneme state. Preferably, the parameter calculation unit 17 may, based on the method described above, extract the training Filter bank feature corresponding to each frame of audio data in the training audio corpus, and take each training Filter bank feature together with its corresponding posterior probability feature as a training sample pair, so that the training audio corpus may contain multiple training sample pairs. Based on the multiple training sample pairs, the backward propagation algorithm under the maximum likelihood criterion is used to calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model. The acoustic model generating unit 18 adds the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
The language model generating unit 19 is configured to obtain the occurrence probabilities of training word sequence data in a training word sequence corpus, and to generate an N-Gram language model according to the occurrence probabilities of the training word sequence data.
In a specific implementation, while the DNN model and the HMM acoustic model are being trained, the language model generating unit 19 may train the language model. The language model generating unit 19 may obtain the occurrence probabilities of the training word sequence data in the training word sequence corpus, and generate an N-Gram language model according to those occurrence probabilities. The N-Gram language model is based on the assumption that the occurrence of the K-th word is related only to the preceding K-1 words and is unrelated to any other word, so that the probability of a word string is the product of the occurrence probabilities of its individual words.
The audio data acquiring unit 11 is configured to acquire target audio data input through an interactive application.
In a specific implementation, the audio data acquiring unit 11 acquires the target audio data input by the user through the interactive application. The target audio data may specifically be the voice input by the user on the application interface of the interactive application that currently requires voice input, that is, the audio data on which speech recognition currently needs to be performed.
The feature extraction unit 12 is configured to extract the target Filter bank feature from the target audio data.
In a specific implementation, the feature extraction unit 12 may extract the target Filter bank feature from the target audio data. It should be noted that the feature extraction unit 12 needs to split the target audio data into multiple frames of audio data and extract the Filter bank feature of each frame separately for input into the trained DNN model described below; that is, the input is framed and the posterior probability feature over the phoneme states is calculated frame by frame. Further, the feature extraction unit 12 may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently needs to be calculated; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
Specifically, please also refer to Fig. 5, which is a schematic structural diagram of a feature extraction unit according to an embodiment of the present invention. As shown in Fig. 5, the feature extraction unit 12 may include:
a first data acquiring subunit 121, configured to perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data;
a first feature acquiring subunit 122, configured to acquire the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
In a specific implementation, the first data acquiring subunit 121 needs to split the target audio data into multiple frames of audio data so that the Filter bank feature of each frame can be extracted separately for input into the trained DNN model described below; that is, the input is framed and the posterior probability feature over the phoneme states is calculated frame by frame. Further, the first data acquiring subunit 121 may perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data, and the first feature acquiring subunit 122 acquires the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data. The target Filter bank feature denotes the Filter bank feature belonging to the target audio data; the first audio data is the speech data in the target audio data for which the posterior probability feature currently needs to be calculated; and the first target Filter bank feature denotes the Filter bank feature belonging to the first audio data.
Further, described first data acquisition subelement 121 can carry out data to described target audio data and locates in advance Reason, described data prediction may include that data framing, data preemphasis, data windowing operation etc. are to obtain in time domain extremely A few frame voice data;Carry out fast Fourier transform, described at least one frame voice data be transformed into frequency domain, obtain described in extremely At least one power spectrum data that a few frame voice data is corresponding on frequency domain;At least one power spectrum data on frequency domain is led to Cross the mel-frequency wave filter with triangle filtering characteristic, obtain at least one Mel power spectrum data;To at least one prunus mume (sieb.) sieb.et zucc. Your power spectrum data is taken the logarithm energy, obtains at least one Mel logarithmic energy modal data, now obtained by least one The set of Mel logarithmic energy modal data is described target Filter bank feature, it is to be understood that Filter bank There is data dependence in feature between different characteristic dimension, MFCC feature is then to use DCT to remove Filter bank feature Data dependence obtained by feature.
Preferably, the first feature acquisition subunit 122 may further perform feature post-processing on the target Filter bank feature. The feature post-processing may include feature extension and feature normalization. The feature extension may compute the first-order difference and second-order difference of the target Filter bank feature, obtaining a target Filter bank feature of a preset dimension for each frame of first audio data; the feature normalization may use CMS (cepstral mean subtraction) to normalize the target Filter bank feature of the preset dimension for each frame of first audio data, obtaining the first target Filter bank feature corresponding to each frame of first audio data. Preferably, the preset dimension may be 72.
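A sketch of this feature post-processing, under the assumption that a 24-dimensional Filter bank vector extended with its first- and second-order differences gives the preferred 72 dimensions (3 × 24), and using np.gradient as a simple stand-in for the usual regression-based delta computation:

```python
import numpy as np

def post_process(feats):
    """feats: (n_frames, 24) Filter bank features -> (n_frames, 72) normalized features."""
    delta1 = np.gradient(feats, axis=0)            # first-order difference over time
    delta2 = np.gradient(delta1, axis=0)           # second-order difference
    ext = np.concatenate([feats, delta1, delta2], axis=1)  # feature extension: 3 * 24 = 72 dims
    return ext - ext.mean(axis=0, keepdims=True)   # CMS-style per-utterance mean removal
```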
The feature acquisition unit 13 is configured to use the target Filter bank feature of the target audio data as input data of the trained DNN model, and obtain the posterior probability feature on the target phoneme state of the target audio data output by the trained DNN model.
In a specific implementation, the feature acquisition unit 13 may use the target Filter bank feature of the target audio data as the input data of the trained DNN model and obtain the posterior probability feature on the target phoneme state of the target audio data output by the trained DNN model. Preferably, a phoneme state corresponds to a phonetic symbol, and the target phoneme states are the phoneme states present in the target audio data. The DNN model obtains the matrix weight values and matrix bias values between output layer nodes during training; there is at least one output layer node, the number of output layer nodes is related to (for example, equal to) the number of phoneme states, and each output layer node represents the feature vector of one phoneme state.
Specifically, referring also to Fig. 6, which provides a structural diagram of the feature acquisition unit according to an embodiment of the present invention. As shown in Fig. 6, the feature acquisition unit 13 may include:
a second data acquisition subunit 131, configured to obtain, according to the time ordering of the at least one frame of audio data, second audio data of a preset number of frames before and after each frame of first audio data;
a second feature acquisition subunit 132, configured to use the first target Filter bank feature and the second target Filter bank feature corresponding to the second audio data as input data of the trained DNN model, and obtain the posterior probability feature on the target phoneme state of the first target Filter bank feature output by the trained DNN model.
In a specific implementation, the second data acquisition subunit 131 may obtain, according to the time ordering of the at least one frame of audio data, the second audio data of the preset number of frames before and after each frame of first audio data, and the second feature acquisition subunit 132 uses the first target Filter bank feature and the second target Filter bank feature corresponding to the second audio data as the input data of the trained DNN model and obtains the posterior probability feature on the target phoneme state of the first target Filter bank feature output by the trained DNN model. It can be understood that the second audio data are the data possessing dimensional correlation with the first audio data.
Assume that the target audio data contains N frames of audio data and that the first target Filter bank feature corresponding to the i-th frame of first audio data is F_i, i = 1, 2, 3, ..., N, and let the preset number of frames before and after be 8 frames each. The input data then includes F_i together with the second target Filter bank features of the 8 frames before and the 8 frames after the i-th frame of first audio data. Based on the preferred preset dimension above, the number of input layer nodes of the trained DNN model corresponding to the input data is (8+1+8) × 72 = 1224, and the number of output layer nodes of the trained DNN model equals the number P of all phoneme states. A preset number of hidden layers lies between the input layer and the output layer; preferably there are 3 hidden layers, each with 1024 nodes. The matrix weight value and matrix bias value between the nodes of the (M−1)-th layer and the M-th layer of the trained DNN model can be expressed as W_M and b_M, and the feature vector h_M^i of the i-th frame of first audio data at the M-th layer satisfies h_M^i = f(W_M h_{M-1}^i + b_M), where h_0^i is the 1224-dimensional spliced input vector and f(x) is the activation function, preferably the ReLU function. Writing h_M^i for the activation of the M-th output layer node, M = 1, 2, 3, ..., P, the posterior probability feature O_M^i of F_i on the M-th phoneme state output by the trained DNN model is:

O_M^i = \frac{\exp(h_M^i)}{\sum_{M'=1}^{P} \exp(h_{M'}^i)}
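Combining the context splicing and the forward pass, a minimal sketch with the preferred sizes above (1224 input nodes, three 1024-node ReLU hidden layers, P output nodes); padding edge frames by repetition is an assumption, as the embodiment does not specify edge handling:

```python
import numpy as np

def splice(feats, left=8, right=8):
    """Stack each 72-dim frame with its 8 left / 8 right neighbours -> (8+1+8)*72 = 1224 dims."""
    padded = np.pad(feats, ((left, right), (0, 0)), mode='edge')   # repeat edge frames
    return np.concatenate([padded[i:i + len(feats)]
                           for i in range(left + right + 1)], axis=1)

def dnn_posteriors(feats, weights, biases):
    """weights/biases: trained parameters of a 1224 -> 1024 -> 1024 -> 1024 -> P network."""
    h = splice(feats)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)              # ReLU hidden layers
    logits = h @ weights[-1] + biases[-1]           # one output node per phoneme state
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)         # softmax: the posterior features O_M^i
```

Subtracting the per-frame maximum before exponentiating does not change the softmax result and simply keeps the computation numerically stable.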
The word sequence data acquisition unit 14 is configured to create a phoneme decoding network associated with the target audio data, and use the phoneme conversion probability of the trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.
In a specific implementation, the word sequence data acquisition unit 14 may create the phoneme decoding network associated with the target audio data. Preferably, the phoneme decoding network may be a word-lattice decoding network built on a WFST framework, with a phoneme state sequence as input and word sequence data as output. It can be understood that the phoneme decoding network may also be created in advance when the DNN model and the HMM are trained.
The word sequence data acquisition unit 14 uses the phoneme conversion probability of the trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data. The phoneme conversion probability of the trained HMM includes, for each phoneme state, the conversion probability of jumping back to itself and the conversion probability of jumping to its next phoneme state. It can be understood that the word sequence data acquisition unit 14 may compute the probability value of every network path in the phoneme decoding network according to the phoneme conversion probability of the trained HMM and the posterior probability features on the target phoneme states of all the first target Filter bank features, select the optimal path according to the probability values of the network paths, and use the recognition result indicated by the optimal path as the target word sequence data corresponding to the target audio data.
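To illustrate how the self-loop and next-state conversion probabilities combine with the frame posteriors to score network paths, here is a toy Viterbi search over a single left-to-right phoneme state sequence; a real decoder searches the full WFST word lattice, so this sketch only conveys the scoring principle:

```python
import numpy as np

def viterbi(post, self_loop, next_prob, order):
    """Toy path search. post: (T, P) frame posteriors from the DNN;
    order: the expected left-to-right sequence of phoneme-state ids;
    self_loop[s]/next_prob[s]: HMM probability of staying in state s / leaving it."""
    log = lambda p: np.log(max(p, 1e-30))        # floor to avoid log(0)
    T, S = len(post), len(order)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log(post[0, order[0]])
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + log(self_loop[order[s]])
            move = score[t - 1, s - 1] + log(next_prob[order[s - 1]]) if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + log(post[t, order[s]])
    path = [S - 1]                               # backtrack the optimal path
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return score[-1, -1], path[::-1]
```

Working in log probabilities turns the products along a path into sums, which is the usual way to keep long paths from underflowing.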
Further, the word sequence data acquisition unit 14 may use the phoneme conversion probability of the trained HMM, the posterior probability feature on the target phoneme state of the first target Filter bank feature, and the N-Gram language model to obtain, in the decoding network, the target word sequence data corresponding to the target audio data. Since the N-Gram language model can itself infer the probability that the next word occurs, the probability value of every network path can be weighted by that occurrence probability, increasing the probability of likely network paths; obtaining the target word sequence data corresponding to the target audio data in combination with the N-Gram language model can therefore further improve the accuracy of speech recognition.
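The language-model weighting can be sketched as adding a scaled bigram log-probability to a path's acoustic score; the scale factor lm_weight and the floor for unseen word pairs are assumed tuning choices, not values from the embodiment:

```python
import math

def weighted_path_score(acoustic_logprob, words, bigram, lm_weight=10.0):
    """bigram[(w_prev, w)] -> P(w | w_prev); unseen word pairs get a small floor."""
    lm = 0.0
    for prev, w in zip(['<s>'] + words[:-1], words):
        lm += math.log(bigram.get((prev, w), 1e-6))
    return acoustic_logprob + lm_weight * lm       # weight the path by LM occurrence probability
```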
In the embodiment of the present invention, when target audio data input based on an interactive application is acquired, the target Filter bank feature in the target audio data is obtained, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model established by the DNN model and the HMM realizes the speech recognition function, and using the Filter bank feature as the input data of the acoustic model makes it unnecessary to remove the correlation between feature dimensions, so that speech recognition under various practical application environments and pronunciation habits can be satisfied and the accuracy of speech recognition is improved. By integrating the Filter bank feature extraction method with the training method of the DNN-HMM acoustic model, the complete process from training to recognition is realized. The target word sequence data corresponding to the target audio data is obtained in combination with the N-Gram language model; since the N-Gram language model can itself infer the probability that the next word occurs, the probability value of every network path can be weighted by that occurrence probability, increasing the probability of likely network paths and further improving the accuracy of speech recognition.
Referring to Fig. 7, which provides a structural diagram of another speech recognition device according to an embodiment of the present invention. As shown in Fig. 7, the speech recognition device 1000 may include: at least one processor 1001 (for example, a CPU), at least one network interface 1004, a user interface 1003, a memory 1005 and at least one communication bus 1002. The communication bus 1002 is used to realize the connection and communication between these components. The user interface 1003 may include a display and a keyboard, and may optionally also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (for example, a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or may be a non-volatile memory, for example, at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in Fig. 7, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a speech recognition application program.
In the speech recognition device 1000 shown in Fig. 7, the user interface 1003 is mainly used to provide the user with an input interface and obtain the data input by the user, while the processor 1001 may be used to call the speech recognition application program stored in the memory 1005 and specifically perform the following operations:
obtain target audio data input based on an interactive application;
extract the target Filter bank feature from the target audio data;
use the target Filter bank feature of the target audio data as input data of the trained DNN model, and obtain the posterior probability feature on the target phoneme state of the target audio data output by the trained DNN model;
create a phoneme decoding network associated with the target audio data, and use the phoneme conversion probability of the trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.
In one embodiment, before obtaining the target audio data input based on the interactive application, the processor 1001 further performs the following operations:
train the GMM and the HMM using a training audio corpus, obtain the likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtain the phoneme conversion probability of the trained HMM;
use a forced alignment operation to convert the likelihood probability feature of each phoneme state into the posterior probability feature of that phoneme state;
calculate the matrix weight values and matrix bias values between output layer nodes in the DNN model according to the training Filter bank features extracted from the training audio corpus and the posterior probability feature of each phoneme state;
add the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
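A compressed sketch of this training flow, assuming the frame-level phoneme state labels have already been produced by the forced alignment above; for brevity it updates only the output-layer matrix weight and bias values with plain cross-entropy SGD, while the embodiment does not fix an optimizer:

```python
import numpy as np

def train_output_layer(inputs, state_labels, weights, biases, lr=0.1, epochs=5):
    """inputs: (T, 1224) spliced training Filter bank features; state_labels: (T,)
    phoneme-state ids from forced alignment. Hidden layers are kept fixed for brevity."""
    for _ in range(epochs):
        h = inputs
        for W, b in zip(weights[:-1], biases[:-1]):
            h = np.maximum(0.0, h @ W + b)                      # forward through hidden layers
        logits = h @ weights[-1] + biases[-1]
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs = e / e.sum(axis=1, keepdims=True)
        grad = probs
        grad[np.arange(len(grad)), state_labels] -= 1.0         # softmax cross-entropy gradient
        weights[-1] -= lr * (h.T @ grad) / len(grad)            # update matrix weight values
        biases[-1] -= lr * grad.mean(axis=0)                    # update matrix bias values
```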
In one embodiment, before obtaining the target audio data input based on the interactive application, the processor 1001 further performs the following operation:
obtain the occurrence probabilities of training word sequence data from a training word sequence corpus, and generate the N-Gram language model according to the occurrence probabilities of the training word sequence data.
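Estimating the model from occurrence probabilities can be sketched for the bigram case as relative-frequency counting over the training word sequence corpus (the `<s>` sentence-start token is an assumed convention):

```python
from collections import Counter

def build_bigram(corpus):
    """corpus: list of word lists. Returns P(w | w_prev) as relative frequencies."""
    uni, bi = Counter(), Counter()
    for words in corpus:
        seq = ['<s>'] + words                     # assumed sentence-start token
        uni.update(seq[:-1])                      # counts of each context word
        bi.update(zip(seq[:-1], seq[1:]))         # counts of each word pair
    return {pair: n / uni[pair[0]] for pair, n in bi.items()}
```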
In one embodiment, when extracting the target Filter bank feature from the target audio data, the processor 1001 specifically performs the following operations:
perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data;
obtain the first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
In one embodiment, when using the target Filter bank feature of the target audio data as the input data of the trained DNN model and obtaining the posterior probability feature on the target phoneme state of the target audio data output by the trained DNN model, the processor 1001 specifically performs the following operations:
obtain, according to the time ordering of the at least one frame of audio data, the second audio data of the preset number of frames before and after each frame of first audio data;
use the first target Filter bank feature and the second target Filter bank feature corresponding to the second audio data as the input data of the trained DNN model, and obtain the posterior probability feature on the target phoneme state of the first target Filter bank feature output by the trained DNN model;
wherein the first audio data are the data currently requiring posterior probability feature calculation, and the second audio data are the data possessing dimensional correlation with the first audio data.
In one embodiment, when creating the phoneme decoding network associated with the target audio data and using the phoneme conversion probability of the trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data, the processor 1001 specifically performs the following operation:
create the phoneme decoding network associated with the target audio data, and use the phoneme conversion probability of the trained HMM, the posterior probability feature on the target phoneme state of the first target Filter bank feature and the N-Gram language model to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.
In the embodiment of the present invention, when target audio data input based on an interactive application is acquired, the target Filter bank feature in the target audio data is obtained, and speech recognition is performed on the target audio data based on the trained DNN model and the trained HMM to obtain the target word sequence data. The acoustic model established by the DNN model and the HMM realizes the speech recognition function, and using the Filter bank feature as the input data of the acoustic model makes it unnecessary to remove the correlation between feature dimensions, so that speech recognition under various practical application environments and pronunciation habits can be satisfied and the accuracy of speech recognition is improved. By integrating the Filter bank feature extraction method with the training method of the DNN-HMM acoustic model, the complete process from training to recognition is realized. The target word sequence data corresponding to the target audio data is obtained in combination with the N-Gram language model; since the N-Gram language model can itself infer the probability that the next word occurs, the probability value of every network path can be weighted by that occurrence probability, increasing the probability of likely network paths and further improving the accuracy of speech recognition.
Those of ordinary skill in the art will appreciate that all or part of the flows in the methods of the above embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the flows of the embodiments of each of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.
The above disclosure is merely the preferred embodiments of the present invention and certainly cannot be used to limit the scope of the claims of the present invention; equivalent variations made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (12)

1. A speech recognition method, characterized by comprising:
acquiring target audio data input based on an interactive application;
extracting a target filter bank (Filter bank) feature from the target audio data;
using the target Filter bank feature of the target audio data as input data of a trained deep neural network (DNN) model, and obtaining a posterior probability feature on a target phoneme state of the target audio data output by the trained DNN model; and
creating a phoneme decoding network associated with the target audio data, and using a phoneme conversion probability of a trained hidden Markov model (HMM) and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, target word sequence data corresponding to the target audio data.
2. The method according to claim 1, characterized in that, before the acquiring target audio data input based on an interactive application, the method further comprises:
training a Gaussian mixture model (GMM) and the HMM using a training audio corpus, obtaining a likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtaining the phoneme conversion probability of the trained HMM;
using a forced alignment operation to convert the likelihood probability feature of each phoneme state into a posterior probability feature of the phoneme state;
calculating matrix weight values and matrix bias values between output layer nodes in the DNN model according to training Filter bank features extracted from the training audio corpus and the posterior probability feature of each phoneme state; and
adding the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
3. The method according to claim 2, characterized in that, before the acquiring target audio data input based on an interactive application, the method further comprises:
obtaining occurrence probabilities of training word sequence data from a training word sequence corpus, and generating an N-Gram language model according to the occurrence probabilities of the training word sequence data.
4. The method according to claim 3, characterized in that the extracting a target Filter bank feature from the target audio data comprises:
performing data framing on the target audio data to obtain at least one frame of audio data in the target audio data; and
obtaining a first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
5. The method according to claim 4, characterized in that the using the target Filter bank feature of the target audio data as input data of the trained DNN model and obtaining the posterior probability feature on the target phoneme state of the target audio data output by the trained DNN model comprises:
obtaining, according to a time ordering of the at least one frame of audio data, second audio data of a preset number of frames before and after each frame of first audio data; and
using the first target Filter bank feature and a second target Filter bank feature corresponding to the second audio data as the input data of the trained DNN model, and obtaining a posterior probability feature on a target phoneme state of the first target Filter bank feature output by the trained DNN model;
wherein the first audio data are the data currently requiring posterior probability feature calculation, and the second audio data are data possessing dimensional correlation with the first audio data.
6. The method according to claim 5, characterized in that the creating a phoneme decoding network associated with the target audio data and using the phoneme conversion probability of the trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, the target word sequence data corresponding to the target audio data comprises:
creating the phoneme decoding network associated with the target audio data, and using the phoneme conversion probability of the trained HMM, the posterior probability feature on the target phoneme state of the first target Filter bank feature and the N-Gram language model to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.
7. A speech recognition device, characterized by comprising:
an audio data acquisition unit, configured to acquire target audio data input based on an interactive application;
a feature extraction unit, configured to extract a target Filter bank feature from the target audio data;
a feature acquisition unit, configured to use the target Filter bank feature of the target audio data as input data of a trained DNN model and obtain a posterior probability feature on a target phoneme state of the target audio data output by the trained DNN model; and
a word sequence data acquisition unit, configured to create a phoneme decoding network associated with the target audio data, and use a phoneme conversion probability of a trained HMM and the posterior probability feature on the target phoneme state of the target audio data to obtain, in the decoding network, target word sequence data corresponding to the target audio data.
8. The device according to claim 7, characterized by further comprising:
an acoustic model training unit, configured to train a GMM and the HMM using a training audio corpus, obtain a likelihood probability feature of each phoneme state in at least one phoneme state output by the trained GMM, and obtain the phoneme conversion probability of the trained HMM;
a feature conversion unit, configured to use a forced alignment operation to convert the likelihood probability feature of each phoneme state into a posterior probability feature of the phoneme state;
a parameter calculation unit, configured to calculate matrix weight values and matrix bias values between output layer nodes in the DNN model according to training Filter bank features extracted from the training audio corpus and the posterior probability feature of each phoneme state; and
an acoustic model generation unit, configured to add the matrix weight values and the matrix bias values to the DNN model to generate the trained DNN model.
9. The device according to claim 8, characterized by further comprising:
a language model generation unit, configured to obtain occurrence probabilities of training word sequence data from a training word sequence corpus and generate an N-Gram language model according to the occurrence probabilities of the training word sequence data.
10. The device according to claim 9, characterized in that the feature extraction unit comprises:
a first data acquisition subunit, configured to perform data framing on the target audio data to obtain at least one frame of audio data in the target audio data; and
a first feature acquisition subunit, configured to obtain a first target Filter bank feature corresponding to each frame of first audio data in the at least one frame of audio data.
11. The device according to claim 10, characterized in that the feature acquisition unit comprises:
a second data acquisition subunit, configured to obtain, according to a time ordering of the at least one frame of audio data, second audio data of a preset number of frames before and after each frame of first audio data; and
a second feature acquisition subunit, configured to use the first target Filter bank feature and a second target Filter bank feature corresponding to the second audio data as input data of the trained DNN model, and obtain a posterior probability feature on a target phoneme state of the first target Filter bank feature output by the trained DNN model;
wherein the first audio data are the data currently requiring posterior probability feature calculation, and the second audio data are data possessing dimensional correlation with the first audio data.
12. The device according to claim 11, characterized in that the word sequence data acquisition unit is specifically configured to create the phoneme decoding network associated with the target audio data, and use the phoneme conversion probability of the trained HMM, the posterior probability feature on the target phoneme state of the first target Filter bank feature and the N-Gram language model to obtain, in the decoding network, the target word sequence data corresponding to the target audio data.
CN201610272292.3A 2016-04-28 2016-04-28 Speech recognition method and device thereof Active CN105976812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610272292.3A CN105976812B (en) 2016-04-28 2016-04-28 Speech recognition method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610272292.3A CN105976812B (en) 2016-04-28 2016-04-28 Speech recognition method and device thereof

Publications (2)

Publication Number Publication Date
CN105976812A true CN105976812A (en) 2016-09-28
CN105976812B CN105976812B (en) 2019-04-26

Family

ID=56994150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610272292.3A Active CN105976812B (en) 2016-04-28 2016-04-28 Speech recognition method and device thereof

Country Status (1)

Country Link
CN (1) CN105976812B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9240184B1 (en) * 2012-11-15 2016-01-19 Google Inc. Frame-level combination of deep neural network and gaussian mixture models
CN103559879A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and device for extracting acoustic features in language identification system
CN105118501A (en) * 2015-09-07 2015-12-02 徐洋 Speech recognition method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Zhang Deliang: "Implementation of Deep Neural Networks in a Chinese Speech Recognition System", China Master's Theses Full-text Database, Information Science and Technology *
Wang Yi et al.: "A Phoneme Recognition Method Based on Hierarchical Deep Belief Networks", Journal of Applied Sciences *
Xiao Yeming et al.: "Optimization Strategies of Deep Neural Networks for Acoustic Modeling in Mandarin Speech Recognition", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) *
Maimaitiaili Tuerxun et al.: "Application of Deep Neural Networks in Uyghur Large-Vocabulary Continuous Speech Recognition", Journal of Data Acquisition and Processing *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106601240A (en) * 2015-10-16 2017-04-26 三星电子株式会社 Apparatus and method for normalizing input data of acoustic model and speech recognition apparatus
CN106601240B (en) * 2015-10-16 2021-10-01 三星电子株式会社 Apparatus and method for normalizing input data of acoustic model, and speech recognition apparatus
CN109863554B (en) * 2016-10-27 2022-12-02 香港中文大学 Acoustic font model and acoustic font phoneme model for computer-aided pronunciation training and speech processing
CN109863554A (en) * 2016-10-27 2019-06-07 香港中文大学 Acoustics font model and acoustics font phonemic model for area of computer aided pronunciation training and speech processes
CN106710599A (en) * 2016-12-02 2017-05-24 深圳撒哈拉数据科技有限公司 Particular sound source detection method and particular sound source detection system based on deep neural network
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN106919662A (en) * 2017-02-14 2017-07-04 复旦大学 A kind of music recognition methods and system
CN106919662B (en) * 2017-02-14 2021-08-31 复旦大学 Music identification method and system
CN106952645A (en) * 2017-03-24 2017-07-14 广东美的制冷设备有限公司 The recognition methods of phonetic order, the identifying device of phonetic order and air-conditioner
CN106952645B (en) * 2017-03-24 2020-11-17 广东美的制冷设备有限公司 Voice instruction recognition method, voice instruction recognition device and air conditioner
CN108288467A (en) * 2017-06-07 2018-07-17 腾讯科技(深圳)有限公司 A kind of audio recognition method, device and speech recognition engine
CN108288467B (en) * 2017-06-07 2020-07-14 腾讯科技(深圳)有限公司 Voice recognition method and device and voice recognition engine
US11062699B2 (en) 2017-06-12 2021-07-13 Ping An Technology (Shenzhen) Co., Ltd. Speech recognition with trained GMM-HMM and LSTM models
CN107331384B (en) * 2017-06-12 2018-05-04 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107633842A (en) * 2017-06-12 2018-01-26 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
WO2018227780A1 (en) * 2017-06-12 2018-12-20 平安科技(深圳)有限公司 Speech recognition method and device, computer device and storage medium
WO2018227781A1 (en) * 2017-06-12 2018-12-20 平安科技(深圳)有限公司 Voice recognition method, apparatus, computer device, and storage medium
CN107331384A (en) * 2017-06-12 2017-11-07 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107170444A (en) * 2017-06-15 2017-09-15 上海航空电器有限公司 Aviation cockpit environment self-adaption phonetic feature model training method
WO2018232591A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Sequence recognition processing
WO2019019252A1 (en) * 2017-07-28 2019-01-31 平安科技(深圳)有限公司 Acoustic model training method, speech recognition method and apparatus, device and medium
CN107748898A (en) * 2017-11-03 2018-03-02 北京奇虎科技有限公司 File classifying method, device, computing device and computer-readable storage medium
CN107871506A (en) * 2017-11-15 2018-04-03 北京云知声信息技术有限公司 The awakening method and device of speech identifying function
CN108245177A (en) * 2018-01-05 2018-07-06 安徽大学 Intelligent infant monitoring wearable device and GMM-HMM-DNN-based infant crying identification method
CN108245177B (en) * 2018-01-05 2021-01-01 安徽大学 Intelligent infant monitoring wearable device and GMM-HMM-DNN-based infant crying identification method
CN108648769A (en) * 2018-04-20 2018-10-12 百度在线网络技术(北京)有限公司 Voice activity detection method, apparatus and equipment
CN110491388A (en) * 2018-05-15 2019-11-22 视联动力信息技术股份有限公司 A kind of processing method and terminal of audio data
CN108922521A (en) * 2018-08-15 2018-11-30 合肥讯飞数码科技有限公司 A kind of voice keyword retrieval method, apparatus, equipment and storage medium
CN109274845A (en) * 2018-08-31 2019-01-25 平安科技(深圳)有限公司 Intelligent sound pays a return visit method, apparatus, computer equipment and storage medium automatically
CN109637523A (en) * 2018-12-28 2019-04-16 睿驰达新能源汽车科技(北京)有限公司 A kind of voice-based door lock for vehicle control method and device
CN109887484A (en) * 2019-02-22 2019-06-14 平安科技(深圳)有限公司 A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN109887484B (en) * 2019-02-22 2023-08-04 平安科技(深圳)有限公司 Dual learning-based voice recognition and voice synthesis method and device
CN110491382A (en) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Audio recognition method, device and interactive voice equipment based on artificial intelligence
CN110390948B (en) * 2019-07-24 2022-04-19 厦门快商通科技股份有限公司 Method and system for rapid speech recognition
CN110390948A (en) * 2019-07-24 2019-10-29 厦门快商通科技股份有限公司 A kind of method and system of Rapid Speech identification
CN110556125A (en) * 2019-10-15 2019-12-10 出门问问信息科技有限公司 Feature extraction method and device based on voice signal and computer storage medium
CN112863496A (en) * 2019-11-27 2021-05-28 阿里巴巴集团控股有限公司 Voice endpoint detection method and device
CN112863496B (en) * 2019-11-27 2024-04-02 阿里巴巴集团控股有限公司 Voice endpoint detection method and device
CN111243574A (en) * 2020-01-13 2020-06-05 苏州奇梦者网络科技有限公司 Voice model adaptive training method, system, device and storage medium
CN111613209A (en) * 2020-04-14 2020-09-01 北京三快在线科技有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111785256A (en) * 2020-06-28 2020-10-16 北京三快在线科技有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN113284514A (en) * 2021-05-19 2021-08-20 北京大米科技有限公司 Audio processing method and device
CN113780408A (en) * 2021-09-09 2021-12-10 安徽农业大学 Live pig state identification method based on audio features
CN113640699B (en) * 2021-10-14 2021-12-24 南京国铁电气有限责任公司 Fault judgment method, system and equipment for microcomputer control type alternating current and direct current power supply system
CN113640699A (en) * 2021-10-14 2021-11-12 南京国铁电气有限责任公司 Fault judgment method, system and equipment for microcomputer control type alternating current and direct current power supply system
CN116978368A (en) * 2023-09-25 2023-10-31 腾讯科技(深圳)有限公司 Wake-up word detection method and related device
CN116978368B (en) * 2023-09-25 2023-12-15 腾讯科技(深圳)有限公司 Wake-up word detection method and related device

Also Published As

Publication number Publication date
CN105976812B (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN105976812A (en) Voice identification method and equipment thereof
CN107195296B (en) Voice recognition method, device, terminal and system
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN107610709B (en) Method and system for training voiceprint recognition model
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN107481717B (en) Acoustic model training method and system
CN108615525B (en) Voice recognition method and device
WO2017218465A1 (en) Neural network-based voiceprint information extraction method and apparatus
CN109509470A (en) Voice interactive method, device, computer readable storage medium and terminal device
CN112786004B (en) Speech synthesis method, electronic equipment and storage device
CN107093422B (en) Voice recognition method and voice recognition system
KR20200044388A (en) Device and method to recognize voice and device and method to train voice recognition model
CN112837669B (en) Speech synthesis method, device and server
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN113096647B (en) Voice model training method and device and electronic equipment
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN114283783A (en) Speech synthesis method, model training method, device and storage medium
CN112542173A (en) Voice interaction method, device, equipment and medium
CN112216270A (en) Method and system for recognizing speech phonemes, electronic equipment and storage medium
CN111640423A (en) Word boundary estimation method and device and electronic equipment
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN116665642A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN113724689B (en) Speech recognition method and related device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant