CN103117060A - Modeling approach and modeling system of acoustic model used in speech recognition - Google Patents
Abstract
The invention relates to a modeling method and modeling system for an acoustic model used in speech recognition. The modeling method comprises the steps of: S1, training an initial model whose modeling unit is the tri-phone state obtained by phoneme-decision-tree clustering, the model providing the state transition probabilities; S2, force-aligning the phonetic features of the training data to tri-phone states with the initial model to obtain frame-level state information; S3, pre-training a deep neural network to obtain initial weights for each hidden layer; S4, training the initialized network with the error back-propagation algorithm on the obtained frame-level state information and updating the weights. The method uses context-dependent tri-phone states as modeling units, builds the model on a deep neural network, initializes the weights of each hidden layer with the restricted Boltzmann machine algorithm, and subsequently updates the weights with error back-propagation. This effectively mitigates the risk of the network falling into a local extremum, to which it is prone without pre-training, and substantially improves the modeling accuracy of the acoustic model.
Description
Technical field
The present invention relates to the field of speech recognition, and in particular to a modeling method and modeling system for an acoustic model used in speech recognition.
Background technology
The mainstream framework for speech recognition today is based on statistical models. A typical speech recognition system, shown in Figure 1, comprises a voice acquisition and front-end processing module, a feature extraction module, an acoustic model module, a language model module, and a decoder module. The basic procedure is as follows: speech collected by the acquisition device passes through front-end processing and then feature extraction; the extracted feature sequence, such as MFCC or PLP features, is scored by the acoustic model to obtain observation probabilities, which are fed to the decoder together with language model probabilities to find the most likely text sequence. The acoustic model is conventionally built on the hidden Markov framework, with a Gaussian mixture model (GMM) modeling the probability distribution of the phonetic features. The GMM makes some inappropriate assumptions about the phonetic features and their distribution, for example that adjacent feature frames are linearly independent and that the observation probability follows a mixture-of-Gaussians distribution. Moreover, GMM parameter training maximizes the likelihood of the observed features as its objective, while decoding uses the maximum a posteriori criterion, so the two are inconsistent as probability models. The modeling accuracy of the traditional acoustic model is therefore limited, and recognition performance suffers.
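As background for the GMM-based observation model described above, the following is a minimal sketch, not taken from the patent, of how a mixture-of-Gaussians model scores one MFCC feature frame. The function name and the diagonal-covariance simplification are illustrative assumptions:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM.

    x: (D,) feature vector, e.g. one MFCC frame
    weights: (M,) mixture weights summing to 1
    means: (M, D) component means
    variances: (M, D) per-dimension (diagonal) variances
    """
    D = x.shape[0]
    # Per-component log Gaussian density with diagonal covariance.
    log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=1))
    log_expo = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_expo
    # Log-sum-exp over mixture components for numerical stability.
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))
```

In a full HMM-GMM system one such mixture is attached to every clustered state, and these per-frame scores serve as the observation probabilities fed to the decoder.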
Summary of the invention
To address the above problems, embodiments of the present invention propose a modeling method and a modeling system for an acoustic model used in speech recognition.
In a first aspect, an embodiment of the present invention proposes a modeling method for an acoustic model used in speech recognition. The method comprises: training a hidden Markov model-Gaussian mixture (HMM-GMM) model with training data, the modeling unit of the HMM-GMM model being the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, the HMM-GMM model obtaining the state transition probabilities of the tri-phone states through the expectation-maximization (EM) algorithm; force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain frame-level state information for the phonetic features; pre-training a deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and training the initialized deep neural network with the error back-propagation algorithm based on the tri-phone states of the phonetic features of the training data, updating the weights of each hidden layer.
Preferably, force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain the frame-level state information specifically comprises: based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
Preferably, pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
In a second aspect, an embodiment of the present invention proposes a modeling system for an acoustic model used in speech recognition, comprising: a first module for training a hidden Markov model-Gaussian mixture (HMM-GMM) model with training data, the modeling unit of the HMM-GMM model being the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, the HMM-GMM model obtaining the state transition probabilities of the tri-phone states through the expectation-maximization (EM) algorithm; a second module for force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain frame-level state information for the phonetic features; a third module for pre-training a deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and a fourth module for training the deep neural network with the error back-propagation algorithm based on the tri-phone states of the phonetic features of the training data and updating the weights of each hidden layer.
Preferably, the second module force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain the frame-level state information specifically comprises: the second module, based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
Preferably, the third module pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: the third module training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
Embodiments of the present invention use the tri-phone state as the modeling unit, build the model on a deep neural network, initialize the weights of each hidden layer with the restricted Boltzmann machine algorithm, and subsequently update the weights with the back-propagation algorithm. This effectively mitigates the risk of the network falling into a local extremum, to which it is prone without pre-training, and further improves the modeling accuracy of the acoustic model.
Description of drawings
The present invention is described in further detail below in conjunction with the drawings and specific embodiments.
Fig. 1 is a schematic diagram of an existing speech recognition system;
Fig. 2 is a block diagram of the context-dependent deep neural network speech recognition system of an embodiment of the present invention;
Fig. 3 is a schematic diagram of the modeling method for an acoustic model used in speech recognition according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the modeling system for an acoustic model used in speech recognition according to an embodiment of the present invention.
Embodiment
The technical solution of the embodiments of the present invention is described in further detail below with reference to the drawings and examples.
Since the Gaussian mixture model must make incorrect assumptions about the phonetic features and their probability distribution, embodiments of the present invention use a context-dependent deep neural network in place of the GMM for acoustic modeling. The deep neural network comprises multiple hidden layers, and its modeling unit is the context-dependent tri-phone state obtained after phoneme-decision-tree clustering. The basic block diagram of the whole system is shown in Figure 2.
During deep neural network training the minimum cross-entropy criterion is used as the objective function. Because the network has multiple hidden layers, its error surface has many local extrema, so the network easily falls into a local extremum and converges prematurely during training. To address this, the neural computing field has proposed initializing the weight parameters by pre-training the network and then training the network parameters with the traditional error back-propagation algorithm. The pre-training algorithm uses the restricted Boltzmann machine (RBM), a bipartite graphical model comprising one visible layer and one hidden layer, with no connections between units within the same layer and dense connections between units of different layers. The model defines the joint distribution of the visible- and hidden-layer variables through an energy function, as follows:

p(v, h) = exp(−E(v, h)) / Z

where Z is the normalizing partition function.
Here v is the visible-layer variable, h is the hidden-layer variable, E(v, h) is the energy function, and p(v, h) is their joint probability. Training maximizes the likelihood p(v) of the observed features, and the weight update formula is as follows:
Δw_ij = ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model

w_ij(t+1) = w_ij(t) + Δw_ij

where w_ij is the connection weight between visible unit i and hidden unit j, t is the iteration number, and ⟨·⟩ denotes the average of the quantity in brackets.
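The model expectation ⟨v_i h_j⟩_model in the update above is intractable to compute exactly; in practice it is approximated with contrastive divergence (CD-1), i.e. a single Gibbs sampling step. The following is a minimal numpy sketch of one such update for a binary RBM; biases are omitted and all names are illustrative, not from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, v0, lr=0.1):
    """One CD-1 step approximating Δw_ij = <v_i h_j>_data − <v_i h_j>_model.

    W:  (D, H) weight matrix of a binary RBM (biases omitted for brevity)
    v0: (N, D) mini-batch of binary visible vectors
    """
    # Positive phase: hidden probabilities driven by the data.
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step gives an approximate model sample.
    pv1 = sigmoid(h0 @ W.T)
    ph1 = sigmoid(pv1 @ W)
    # <v h>_data − <v h>_model, averaged over the batch.
    grad = (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    return W + lr * grad
```

Running this update repeatedly over mini-batches of feature frames drives the RBM toward higher data likelihood, which is what "training to convergence" refers to below.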
By training restricted Boltzmann machines layer by layer and using their parameters to initialize the deep neural network, the initial weights fall at a reasonable starting point in weight space, which alleviates to some extent the risk of the network falling into a local extremum during training. At the same time, the tri-phone states obtained after phoneme-decision-tree clustering serve as the teacher signal of the neural network; since they encode the contextual relationships of the phonemes, the acoustic model becomes finer-grained and more accurate.
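The layer-by-layer procedure described above can be sketched as follows. This is an illustrative toy implementation under the same binary-RBM simplifications (no biases, fixed epoch count rather than a convergence test), not the patent's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(v, n_hidden, epochs=5, lr=0.1):
    """Minimal CD-1 training loop for one binary RBM (biases omitted)."""
    W = rng.normal(0.0, 0.01, size=(v.shape[1], n_hidden))
    for _ in range(epochs):
        ph0 = sigmoid(v @ W)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T)
        ph1 = sigmoid(pv1 @ W)
        W += lr * (v.T @ ph0 - pv1.T @ ph1) / v.shape[0]
    return W

def pretrain_stack(data, hidden_sizes):
    """Greedy layer-wise pre-training: each RBM's hidden activations become
    the visible data for the next RBM; the learned weight matrices then
    initialise the corresponding hidden layers of the deep network."""
    weights, v = [], data
    for h in hidden_sizes:
        W = train_rbm(v, h)
        weights.append(W)
        v = sigmoid(v @ W)  # propagate upward to train the next layer
    return weights
```

The returned list of weight matrices plays the role of the "parameters for initializing the weights of each hidden layer" in the method's step 3.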
Fig. 3 is a schematic diagram of the modeling method for an acoustic model used in speech recognition according to an embodiment of the present invention. The method comprises: Step 1, building the initial model. Specifically, a hidden Markov model-Gaussian mixture (HMM-GMM) model is trained with the training data; the modeling unit of the HMM-GMM model is the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the tri-phone states through the expectation-maximization (EM) algorithm.
Step 2, obtaining frame-level state information. Specifically, the tri-phone states of the phonetic features of the training data are force-aligned based on the HMM-GMM model to obtain the frame-level state information of the phonetic features. Preferably, this specifically comprises: based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
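The most-probable-state correspondence described above is typically computed with a Viterbi-style forced alignment. The following toy sketch, illustrative and not taken from the patent, aligns T frames to a fixed left-to-right state sequence and returns the frame-level state labels that serve as training targets; it assumes at least as many frames as states:

```python
import numpy as np

def force_align(frame_loglik, n_states):
    """Toy forced alignment: the monotonic, non-skipping path through
    states 0..n_states-1 that maximises the summed frame log-likelihood.

    frame_loglik: (T, n_states) log-likelihood of each frame under each state
    Returns a length-T array of state labels (frame-level state info).
    """
    T = frame_loglik.shape[0]
    NEG = -1e30
    dp = np.full((T, n_states), NEG)      # best score ending at (frame, state)
    back = np.zeros((T, n_states), dtype=int)
    dp[0, 0] = frame_loglik[0, 0]         # must start in the first state
    for t in range(1, T):
        for s in range(n_states):
            # Either stay in s, or advance from s-1 (left-to-right topology).
            stay = dp[t - 1, s]
            move = dp[t - 1, s - 1] if s > 0 else NEG
            if stay >= move:
                dp[t, s], back[t, s] = stay + frame_loglik[t, s], s
            else:
                dp[t, s], back[t, s] = move + frame_loglik[t, s], s - 1
    # Backtrace from the final state (alignment must end in the last state).
    path = np.zeros(T, dtype=int)
    path[-1] = n_states - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

In a real system the per-frame scores come from the trained HMM-GMM model and the state sequence is the clustered tri-phone state chain of the utterance's transcript.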
Step 3, initializing the weights of each hidden layer of the deep neural network. Specifically, the deep neural network serving as the acoustic model is pre-trained to obtain parameters for initializing the weights of each hidden layer of the deep network.
Step 4, updating the weights of each hidden layer of the deep neural network. Specifically, the deep neural network is trained with the error back-propagation algorithm based on the tri-phone states of the phonetic features of the training data, and the weights of each hidden layer are updated.
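Step 4's supervised fine-tuning minimizes the cross-entropy between the network's softmax output and the frame-level state labels. Below is a minimal one-hidden-layer sketch of a single back-propagation update; it is illustrative only (a real deep network has several hidden layers plus bias terms), and all names are assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def backprop_step(W1, W2, x, labels, lr=0.1):
    """One error-back-propagation update on a 1-hidden-layer softmax net
    trained with the cross-entropy criterion on frame-level state labels.

    x: (N, D) feature frames;  labels: (N,) integer state ids
    W1: (D, H) hidden weights; W2: (H, K) output weights (biases omitted)
    """
    # Forward pass.
    h = np.tanh(x @ W1)
    p = softmax(h @ W2)
    # Cross-entropy gradient at the softmax output: p - one_hot(labels).
    delta2 = p.copy()
    delta2[np.arange(len(labels)), labels] -= 1.0
    delta2 /= len(labels)
    # Back-propagate through the hidden layer (tanh' = 1 - h^2).
    delta1 = (delta2 @ W2.T) * (1.0 - h ** 2)
    # Gradient descent on both weight matrices.
    return W1 - lr * (x.T @ delta1), W2 - lr * (h.T @ delta2)
```

In the method described here, W1 (and the other hidden-layer weights of a deeper network) would start from the RBM pre-training of step 3 rather than from random values.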
Preferably, pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
Note that the hidden Markov-Gaussian mixture HMM-GMM model may also be written as the hidden Markov/Gaussian mixture HMM/GMM model.
The pre-training in step 3 can be regarded as unsupervised training, while the training in step 4 can be regarded as supervised training.
In addition, the pre-training of step 3 and step 2 can be carried out simultaneously.
When the model is used as the acoustic model for speech recognition, the posterior probabilities that the deep neural network generates for the phonetic features are converted into likelihoods through Bayes' formula and fed to the decoder; the text sequence obtained after decoding is the recognized speech content. The recognition quality can be assessed from the difference between the recognized content and the actual original speech; this in turn assesses the performance of the deep neural network as the acoustic model in the speech recognition system. Where necessary, the network can be retrained, and the state transition probabilities in the HMM-GMM model can even be redesigned.
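The Bayes conversion mentioned above is usually implemented in the log domain: since p(x|s) = p(s|x)·p(x)/p(s) and p(x) is the same for every state within a frame, dividing each posterior by its state prior yields a "scaled likelihood" that is sufficient for decoding. A minimal sketch, with illustrative names:

```python
import numpy as np

def posterior_to_loglik(log_posterior, log_prior):
    """Convert DNN state posteriors p(s|x) into scaled log-likelihoods
    for decoding, via Bayes' rule: p(x|s) ∝ p(s|x) / p(s).

    log_posterior: (T, S) per-frame log p(s|x) from the network
    log_prior:     (S,)   log p(s), e.g. estimated from alignment counts
    """
    # p(x) is constant per frame, so it shifts every path's score equally
    # and can be dropped for Viterbi decoding.
    return log_posterior - log_prior
```

The state priors are typically estimated by counting how often each state appears in the forced alignments of step 2.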
Fig. 4 is a schematic diagram of the modeling system for an acoustic model used in speech recognition according to an embodiment of the present invention. The modeling system comprises: a first module for training a hidden Markov model-Gaussian mixture (HMM-GMM) model with training data, the modeling unit of the HMM-GMM model being the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, the HMM-GMM model obtaining the state transition probabilities of the tri-phone states through the expectation-maximization (EM) algorithm; a second module for force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain frame-level state information for the phonetic features; a third module for pre-training a deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and a fourth module for training the deep neural network with the error back-propagation algorithm based on the tri-phone states of the phonetic features of the training data and updating the weights of each hidden layer.
Preferably, the second module force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain the frame-level state information specifically comprises: the second module, based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
Preferably, the third module pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: the third module training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
Embodiments of the present invention use a deep neural network in place of the Gaussian mixture model for acoustic modeling. The modeling exploits tri-phone states with context-dependent characteristics and, unlike the GMM, requires no ad hoc assumptions about the phonetic features and their distribution: the network directly outputs the posterior probabilities of the phonetic features. The tri-phone state fully accounts for the contextual dependence of language, making the modeling unit more fine-grained, and the multiple hidden layers resemble the human speech perception system, which aids the extraction of high-order feature information. Embodiments of the present invention initialize the weights of each hidden layer of the network with the restricted Boltzmann machine algorithm and subsequently update them with the back-propagation algorithm, effectively mitigating the risk of falling into a local extremum during pre-training and further improving the modeling accuracy of the acoustic model.
Those skilled in the art will further appreciate that the exemplary modules and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms by function. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
It should be noted that these are only preferred embodiments of the present invention and do not limit its scope of practice; technicians with the relevant professional knowledge can implement the present invention through the above embodiments. Accordingly, any variation, modification, or improvement made within the spirit and principles of the present invention is covered by the scope of the claims. That is, the above embodiments only illustrate, and do not restrict, the technical solution of the present invention; although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or equivalently replaced without departing from its spirit and scope.
Claims (6)
1. A modeling method for an acoustic model used in speech recognition, characterized in that the method comprises:
training a hidden Markov model-Gaussian mixture (HMM-GMM) model with training data, the modeling unit of the HMM-GMM model being the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, the HMM-GMM model being trained through the expectation-maximization (EM) algorithm while obtaining the state transition probabilities of the tri-phone states;
force-aligning the phonetic features of the training data based on the HMM-GMM model to obtain frame-level tri-phone state information for the phonetic features;
pre-training a deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and
training the deep neural network with the error back-propagation algorithm based on the frame-level state information of the phonetic features of the training data, updating the weights of each hidden layer.
2. The modeling method of claim 1, characterized in that force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain the frame-level state information specifically comprises: based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
3. The modeling method of claim 1, characterized in that pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
4. A modeling system for an acoustic model used in speech recognition, characterized in that the modeling system comprises:
a first module for training a hidden Markov model-Gaussian mixture (HMM-GMM) model with training data, the modeling unit of the HMM-GMM model being the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, the HMM-GMM model obtaining the state transition probabilities of the tri-phone states through the expectation-maximization (EM) algorithm;
a second module for force-aligning the phonetic features of the training data based on the HMM-GMM model to obtain frame-level tri-phone state information for the phonetic features;
a third module for pre-training a deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and
a fourth module for training the deep neural network with the error back-propagation algorithm based on the frame-level state information of the phonetic features of the training data, updating the weights of each hidden layer.
5. The modeling system of claim 4, characterized in that the second module force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain the frame-level state information specifically comprises: the second module, based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
6. The modeling system of claim 4, characterized in that the third module pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: the third module training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310020010.7A CN103117060B (en) | 2013-01-18 | 2013-01-18 | Modeling method and modeling system for an acoustic model for speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103117060A true CN103117060A (en) | 2013-05-22 |
CN103117060B CN103117060B (en) | 2015-10-28 |
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103345656A (en) * | 2013-07-17 | 2013-10-09 | 中国科学院自动化研究所 | Method and device for data identification based on multitask deep neural network |
CN103514879A (en) * | 2013-09-18 | 2014-01-15 | 广东欧珀移动通信有限公司 | Local voice recognition method based on BP neural network |
CN103680496A (en) * | 2013-12-19 | 2014-03-26 | 百度在线网络技术(北京)有限公司 | Deep-neural-network-based acoustic model training method, hosts and system |
CN103839211A (en) * | 2014-03-23 | 2014-06-04 | 合肥新涛信息科技有限公司 | Medical history transferring system based on voice recognition |
CN103839546A (en) * | 2014-03-26 | 2014-06-04 | 合肥新涛信息科技有限公司 | Voice recognition system based on Yangze river and Huai river language family |
CN104036774A (en) * | 2014-06-20 | 2014-09-10 | 国家计算机网络与信息安全管理中心 | Method and system for recognizing Tibetan dialects |
CN104347066A (en) * | 2013-08-09 | 2015-02-11 | 盛乐信息技术(上海)有限公司 | Deep neural network-based baby cry identification method and system |
CN104376842A (en) * | 2013-08-12 | 2015-02-25 | 清华大学 | Neural network language model training method and device and voice recognition method |
CN104575497A (en) * | 2013-10-28 | 2015-04-29 | 中国科学院声学研究所 | Method for building acoustic model and speech decoding method based on acoustic model |
CN105229676A (en) * | 2013-05-23 | 2016-01-06 | 国立研究开发法人情报通信研究机构 | The learning device of the learning method of deep-neural-network and learning device and category independently sub-network |
CN105654955A (en) * | 2016-03-18 | 2016-06-08 | 华为技术有限公司 | Voice recognition method and device |
CN105745700A (en) * | 2013-11-27 | 2016-07-06 | 国立研究开发法人情报通信研究机构 | Statistical-acoustic-model adaptation method, acoustic-model learning method suitable for statistical-acoustic-model adaptation, storage medium in which parameters for building deep neural network are stored, and computer program for adapting statistical acoustic model |
CN105761720A (en) * | 2016-04-19 | 2016-07-13 | 北京地平线机器人技术研发有限公司 | Interaction system based on voice attribute classification, and method thereof |
CN105874530A (en) * | 2013-10-30 | 2016-08-17 | 格林伊登美国控股有限责任公司 | Predicting recognition quality of a phrase in automatic speech recognition systems |
CN105960672A (en) * | 2014-09-09 | 2016-09-21 | 微软技术许可有限责任公司 | Variable-component deep neural network for robust speech recognition |
CN106023995A (en) * | 2015-08-20 | 2016-10-12 | 漳州凯邦电子有限公司 | Voice recognition method and wearable voice control device using the method |
CN106157953A (en) * | 2015-04-16 | 2016-11-23 | 科大讯飞股份有限公司 | continuous speech recognition method and system |
CN106297773A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | A kind of neutral net acoustic training model method |
CN106611599A (en) * | 2015-10-21 | 2017-05-03 | 展讯通信(上海)有限公司 | Voice recognition method and device based on artificial neural network and electronic equipment |
CN106652999A (en) * | 2015-10-29 | 2017-05-10 | 三星Sds株式会社 | System and method for voice recognition |
CN106782504A (en) * | 2016-12-29 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
CN106782511A (en) * | 2016-12-22 | 2017-05-31 | 太原理工大学 | Amendment linear depth autoencoder network audio recognition method |
CN106816147A (en) * | 2017-01-25 | 2017-06-09 | 上海交通大学 | Speech recognition system based on binary neural network acoustic model |
WO2017114201A1 (en) * | 2015-12-31 | 2017-07-06 | 阿里巴巴集团控股有限公司 | Method and device for executing setting operation |
CN107112006A (en) * | 2014-10-02 | 2017-08-29 | 微软技术许可有限责任公司 | Neural-network-based speech processing |
CN107680582A (en) * | 2017-07-28 | 2018-02-09 | 平安科技(深圳)有限公司 | Acoustic model training method, speech recognition method, apparatus, device and medium |
CN108111335A (en) * | 2017-12-04 | 2018-06-01 | 华中科技大学 | Method and system for scheduling and chaining virtual network functions |
CN108109615A (en) * | 2017-12-21 | 2018-06-01 | 内蒙古工业大学 | Construction and application method of a DNN-based Mongolian acoustic model |
CN108346423A (en) * | 2017-01-23 | 2018-07-31 | 北京搜狗科技发展有限公司 | Processing method and apparatus for a speech synthesis model |
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | Voice endpoint detection method and speech recognition method |
CN108648747A (en) * | 2018-03-21 | 2018-10-12 | 清华大学 | Language recognition system |
CN109215637A (en) * | 2017-06-30 | 2019-01-15 | 三星Sds株式会社 | Speech recognition method |
CN109326277A (en) * | 2018-12-05 | 2019-02-12 | 四川长虹电器股份有限公司 | Semi-supervised phoneme forced alignment model establishing method and system |
CN109545201A (en) * | 2018-12-15 | 2019-03-29 | 中国人民解放军战略支援部队信息工程大学 | Construction method of an acoustic model based on deep mixed factor analysis |
CN109741735A (en) * | 2017-10-30 | 2019-05-10 | 阿里巴巴集团控股有限公司 | Modeling method, and acoustic model acquisition method and device |
CN109975762A (en) * | 2017-12-28 | 2019-07-05 | 中国科学院声学研究所 | Underwater sound source localization method |
CN110070855A (en) * | 2018-01-23 | 2019-07-30 | 中国科学院声学研究所 | Speech recognition system and method based on a transfer neural network acoustic model |
US10452995B2 (en) | 2015-06-29 | 2019-10-22 | Microsoft Technology Licensing, Llc | Machine learning classification on hardware accelerators with stacked memory |
CN110459216A (en) * | 2019-08-14 | 2019-11-15 | 桂林电子科技大学 | Restaurant card-swiping device with speech recognition and method of use |
US10540588B2 (en) | 2015-06-29 | 2020-01-21 | Microsoft Technology Licensing, Llc | Deep neural network processing on hardware accelerators with stacked memory |
US10606651B2 (en) | 2015-04-17 | 2020-03-31 | Microsoft Technology Licensing, Llc | Free form expression accelerator with thread length-based thread assignment to clustered soft processor cores that share a functional circuit |
CN112259089A (en) * | 2019-07-04 | 2021-01-22 | 阿里巴巴集团控股有限公司 | Voice recognition method and device |
CN113450786A (en) * | 2020-03-25 | 2021-09-28 | 阿里巴巴集团控股有限公司 | Network model obtaining method, information processing method, device and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5317673A (en) * | 1992-06-22 | 1994-05-31 | Sri International | Method and apparatus for context-dependent estimation of multiple probability distributions of phonetic classes with multilayer perceptrons in a speech recognition system |
CN1427368A (en) * | 2001-12-19 | 2003-07-02 | 中国科学院自动化研究所 | Speaker-independent speech recognition method for palmtop computers |
CN1588536A (en) * | 2004-09-29 | 2005-03-02 | 上海交通大学 | State structure adjustment method in speech recognition |
CN101740024A (en) * | 2008-11-19 | 2010-06-16 | 中国科学院自动化研究所 | Automatic evaluation method based on generalized spoken-language fluency |
CN102411931A (en) * | 2010-09-15 | 2012-04-11 | 微软公司 | Deep belief network for large vocabulary continuous speech recognition |
CN102693723A (en) * | 2012-04-01 | 2012-09-26 | 北京安慧音通科技有限责任公司 | Method and device for recognizing speaker-independent isolated word based on subspace |
- 2013-01-18 CN CN201310020010.7A patent/CN103117060B/en not_active Expired - Fee Related
Non-Patent Citations (4)
Title |
---|
Ibrahim M. M. El-Emary, Mohamed Fezari and Hamza Attoui: "Hidden Markov model/Gaussian mixture models (HMM/GMM) based voice command system: a way to improve the control of remotely operated robot arm TR45", Scientific Research and Essays * |
Poonam Bansal, Anuj Kant, Sumit Kumar, Akash Sharda, Shitij Gupt: "Improved hybrid model of HMM/GMM for speech recognition", International Conference "Intelligent Information and Engineering Systems" INFOS 2008, Varna, Bulgaria, June-July 2008 * |
Ni Chongjia, Liu Wenju, Xu Bo: "Research on prosody-dependent Chinese speech recognition systems", Application Research of Computers * |
Huang Hao, Li Binghu, Wushour Silamu: "Decision-tree-based acoustic context modeling method in discriminative model combination", Acta Automatica Sinica * |
Cited By (69)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105229676A (en) * | 2013-05-23 | 2016-01-06 | 国立研究开发法人情报通信研究机构 | Learning method and learning apparatus for deep neural networks, and category-independent sub-network learning apparatus |
US9691020B2 (en) | 2013-05-23 | 2017-06-27 | National Institute Of Information And Communications Technology | Deep neural network learning method and apparatus, and category-independent sub-network learning apparatus |
CN105229676B (en) * | 2013-05-23 | 2018-11-23 | 国立研究开发法人情报通信研究机构 | The learning method and learning device of deep-neural-network |
CN103345656B (en) * | 2013-07-17 | 2016-01-20 | 中国科学院自动化研究所 | Data identification method and device based on multitask deep neural network |
CN103345656A (en) * | 2013-07-17 | 2013-10-09 | 中国科学院自动化研究所 | Method and device for data identification based on multitask deep neural network |
CN104347066A (en) * | 2013-08-09 | 2015-02-11 | 盛乐信息技术(上海)有限公司 | Deep neural network-based baby cry identification method and system |
CN104347066B (en) * | 2013-08-09 | 2019-11-12 | 上海掌门科技有限公司 | Deep-neural-network-based baby cry recognition method and system |
CN104376842A (en) * | 2013-08-12 | 2015-02-25 | 清华大学 | Neural network language model training method and device and voice recognition method |
CN103514879A (en) * | 2013-09-18 | 2014-01-15 | 广东欧珀移动通信有限公司 | Local voice recognition method based on BP neural network |
CN104575497A (en) * | 2013-10-28 | 2015-04-29 | 中国科学院声学研究所 | Method for building acoustic model and speech decoding method based on acoustic model |
CN104575497B (en) * | 2013-10-28 | 2017-10-03 | 中国科学院声学研究所 | Acoustic model establishing method and speech decoding method based on the model |
US10319366B2 (en) | 2013-10-30 | 2019-06-11 | Genesys Telecommunications Laboratories, Inc. | Predicting recognition quality of a phrase in automatic speech recognition systems |
CN105874530A (en) * | 2013-10-30 | 2016-08-17 | 格林伊登美国控股有限责任公司 | Predicting recognition quality of a phrase in automatic speech recognition systems |
CN105874530B (en) * | 2013-10-30 | 2020-03-03 | 格林伊登美国控股有限责任公司 | Predicting phrase recognition quality in an automatic speech recognition system |
CN105745700A (en) * | 2013-11-27 | 2016-07-06 | 国立研究开发法人情报通信研究机构 | Statistical-acoustic-model adaptation method, acoustic-model learning method suitable for statistical-acoustic-model adaptation, storage medium in which parameters for building deep neural network are stored, and computer program for adapting statistical acoustic model |
CN105745700B (en) * | 2013-11-27 | 2019-11-01 | 国立研究开发法人情报通信研究机构 | Adaptation method and learning method for statistical acoustic models, and recording medium |
CN103680496B (en) * | 2013-12-19 | 2016-08-10 | 百度在线网络技术(北京)有限公司 | Deep-neural-network-based acoustic model training method, hosts and system |
CN103680496A (en) * | 2013-12-19 | 2014-03-26 | 百度在线网络技术(北京)有限公司 | Deep-neural-network-based acoustic model training method, hosts and system |
CN103839211A (en) * | 2014-03-23 | 2014-06-04 | 合肥新涛信息科技有限公司 | Medical history transferring system based on voice recognition |
CN103839546A (en) * | 2014-03-26 | 2014-06-04 | 合肥新涛信息科技有限公司 | Voice recognition system based on Yangtze river and Huai river language family |
CN104036774A (en) * | 2014-06-20 | 2014-09-10 | 国家计算机网络与信息安全管理中心 | Method and system for recognizing Tibetan dialects |
CN105960672A (en) * | 2014-09-09 | 2016-09-21 | 微软技术许可有限责任公司 | Variable-component deep neural network for robust speech recognition |
CN105960672B (en) * | 2014-09-09 | 2019-11-26 | 微软技术许可有限责任公司 | Variable-component deep neural network for robust speech recognition |
CN107112006A (en) * | 2014-10-02 | 2017-08-29 | 微软技术许可有限责任公司 | Neural-network-based speech processing |
CN106157953B (en) * | 2015-04-16 | 2020-02-07 | 科大讯飞股份有限公司 | Continuous speech recognition method and system |
CN106157953A (en) * | 2015-04-16 | 2016-11-23 | 科大讯飞股份有限公司 | Continuous speech recognition method and system |
US10606651B2 (en) | 2015-04-17 | 2020-03-31 | Microsoft Technology Licensing, Llc | Free form expression accelerator with thread length-based thread assignment to clustered soft processor cores that share a functional circuit |
CN106297773B (en) * | 2015-05-29 | 2019-11-19 | 中国科学院声学研究所 | Neural network acoustic model training method |
CN106297773A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | Neural network acoustic model training method |
US10452995B2 (en) | 2015-06-29 | 2019-10-22 | Microsoft Technology Licensing, Llc | Machine learning classification on hardware accelerators with stacked memory |
US10540588B2 (en) | 2015-06-29 | 2020-01-21 | Microsoft Technology Licensing, Llc | Deep neural network processing on hardware accelerators with stacked memory |
CN106023995A (en) * | 2015-08-20 | 2016-10-12 | 漳州凯邦电子有限公司 | Voice recognition method and wearable voice control device using the method |
CN106611599A (en) * | 2015-10-21 | 2017-05-03 | 展讯通信(上海)有限公司 | Voice recognition method and device based on artificial neural network and electronic equipment |
CN106652999A (en) * | 2015-10-29 | 2017-05-10 | 三星Sds株式会社 | System and method for voice recognition |
CN106940998A (en) * | 2015-12-31 | 2017-07-11 | 阿里巴巴集团控股有限公司 | Method and device for executing a setting operation |
WO2017114201A1 (en) * | 2015-12-31 | 2017-07-06 | 阿里巴巴集团控股有限公司 | Method and device for executing setting operation |
CN105654955A (en) * | 2016-03-18 | 2016-06-08 | 华为技术有限公司 | Voice recognition method and device |
CN105654955B (en) * | 2016-03-18 | 2019-11-12 | 华为技术有限公司 | Speech recognition method and device |
CN105761720B (en) * | 2016-04-19 | 2020-01-07 | 北京地平线机器人技术研发有限公司 | Interactive system and method based on voice attribute classification |
CN105761720A (en) * | 2016-04-19 | 2016-07-13 | 北京地平线机器人技术研发有限公司 | Interaction system based on voice attribute classification, and method thereof |
CN106782511A (en) * | 2016-12-22 | 2017-05-31 | 太原理工大学 | Rectified-linear deep autoencoder network speech recognition method |
CN106782504A (en) * | 2016-12-29 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Speech recognition method and device |
CN108346423A (en) * | 2017-01-23 | 2018-07-31 | 北京搜狗科技发展有限公司 | Processing method and apparatus for a speech synthesis model |
CN106816147A (en) * | 2017-01-25 | 2017-06-09 | 上海交通大学 | Speech recognition system based on binary neural network acoustic model |
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | Voice endpoint detection method and speech recognition method |
CN109215637A (en) * | 2017-06-30 | 2019-01-15 | 三星Sds株式会社 | Speech recognition method |
CN109215637B (en) * | 2017-06-30 | 2023-09-01 | 三星Sds株式会社 | Speech recognition method |
WO2019019252A1 (en) * | 2017-07-28 | 2019-01-31 | 平安科技(深圳)有限公司 | Acoustic model training method, speech recognition method and apparatus, device and medium |
CN107680582B (en) * | 2017-07-28 | 2021-03-26 | 平安科技(深圳)有限公司 | Acoustic model training method, voice recognition method, device, equipment and medium |
CN107680582A (en) * | 2017-07-28 | 2018-02-09 | 平安科技(深圳)有限公司 | Acoustic model training method, speech recognition method, apparatus, device and medium |
US11030998B2 (en) | 2017-07-28 | 2021-06-08 | Ping An Technology (Shenzhen) Co., Ltd. | Acoustic model training method, speech recognition method, apparatus, device and medium |
CN109741735A (en) * | 2017-10-30 | 2019-05-10 | 阿里巴巴集团控股有限公司 | Modeling method, and acoustic model acquisition method and device |
CN109741735B (en) * | 2017-10-30 | 2023-09-01 | 阿里巴巴集团控股有限公司 | Modeling method, acoustic model acquisition method and acoustic model acquisition device |
CN108111335B (en) * | 2017-12-04 | 2019-07-23 | 华中科技大学 | Method and system for scheduling and chaining virtual network functions |
CN108111335A (en) * | 2017-12-04 | 2018-06-01 | 华中科技大学 | Method and system for scheduling and chaining virtual network functions |
CN108109615A (en) * | 2017-12-21 | 2018-06-01 | 内蒙古工业大学 | Construction and application method of a DNN-based Mongolian acoustic model |
CN109975762A (en) * | 2017-12-28 | 2019-07-05 | 中国科学院声学研究所 | Underwater sound source localization method |
CN109975762B (en) * | 2017-12-28 | 2021-05-18 | 中国科学院声学研究所 | Underwater sound source positioning method |
CN110070855B (en) * | 2018-01-23 | 2021-07-23 | 中国科学院声学研究所 | Voice recognition system and method based on migrating neural network acoustic model |
CN110070855A (en) * | 2018-01-23 | 2019-07-30 | 中国科学院声学研究所 | Speech recognition system and method based on a transfer neural network acoustic model |
CN108648747B (en) * | 2018-03-21 | 2020-06-02 | 清华大学 | Language identification system |
CN108648747A (en) * | 2018-03-21 | 2018-10-12 | 清华大学 | Language recognition system |
CN109326277A (en) * | 2018-12-05 | 2019-02-12 | 四川长虹电器股份有限公司 | Semi-supervised phoneme forced alignment model establishing method and system |
CN109326277B (en) * | 2018-12-05 | 2022-02-08 | 四川长虹电器股份有限公司 | Semi-supervised phoneme forced alignment model establishing method and system |
CN109545201B (en) * | 2018-12-15 | 2023-06-06 | 中国人民解放军战略支援部队信息工程大学 | Construction method of acoustic model based on deep mixing factor analysis |
CN109545201A (en) * | 2018-12-15 | 2019-03-29 | 中国人民解放军战略支援部队信息工程大学 | Construction method of an acoustic model based on deep mixed factor analysis |
CN112259089A (en) * | 2019-07-04 | 2021-01-22 | 阿里巴巴集团控股有限公司 | Voice recognition method and device |
CN110459216A (en) * | 2019-08-14 | 2019-11-15 | 桂林电子科技大学 | Restaurant card-swiping device with speech recognition and method of use |
CN113450786A (en) * | 2020-03-25 | 2021-09-28 | 阿里巴巴集团控股有限公司 | Network model obtaining method, information processing method, device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN103117060B (en) | 2015-10-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103117060A (en) | Modeling approach and modeling system of acoustic model used in speech recognition | |
CN112509564B (en) | End-to-end speech recognition method based on connectionist temporal classification and self-attention mechanism | |
CN108172218B (en) | Voice modeling method and device | |
KR101415534B1 (en) | Multi-stage speech recognition apparatus and method | |
CN104681036B (en) | Language audio detection system and method | |
CN103400577B (en) | Acoustic model establishing method and device for multilingual speech recognition | |
CN103065620B (en) | Method for receiving user text input on a mobile phone or webpage and synthesizing it into personalized speech in real time | |
US20160284347A1 (en) | Processing audio waveforms | |
US10714076B2 (en) | Initialization of CTC speech recognition with standard HMM | |
CN108281137A (en) | Universal speech wake-up recognition method and system under a whole-phoneme framework | |
CN104036774A (en) | Method and system for recognizing Tibetan dialects | |
WO2019019252A1 (en) | Acoustic model training method, speech recognition method and apparatus, device and medium | |
CN104575497B (en) | Acoustic model establishing method and speech decoding method based on the model | |
CN109887484A (en) | Speech recognition and speech synthesis method and device based on paired-associate learning | |
CN107767861A (en) | Voice wake-up method, system and intelligent terminal | |
CN107093422B (en) | Voice recognition method and voice recognition system | |
CN111445898B (en) | Language identification method and device, electronic equipment and storage medium | |
CN102810311B (en) | Speaker estimation method and speaker estimation equipment | |
CN111833845A (en) | Multi-language speech recognition model training method, device, equipment and storage medium | |
CN111091809B (en) | Regional accent recognition method and device based on depth feature fusion | |
WO2017177484A1 (en) | Voice recognition-based decoding method and device | |
CN102945673A (en) | Continuous speech recognition method with dynamically variable speech command range | |
Kapralova et al. | A big data approach to acoustic model training corpus selection | |
Ferrer et al. | Spoken language recognition based on senone posteriors. | |
CN102521402B (en) | Text filtering system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20151028 |