WO2016167779A1 - Speech recognition device and rescoring device - Google Patents

Speech recognition device and rescoring device

Info

Publication number
WO2016167779A1
WO2016167779A1
Authority
WO
WIPO (PCT)
Prior art keywords
language model
speech recognition
rescoring
word
training
Prior art date
Application number
PCT/US2015/026217
Other languages
English (en)
Inventor
Yuki TACHIOKA
Shinji Watanabe
Original Assignee
Mitsubishi Electric Corporation
Mitsubishi Electric Research Laboratories, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation, Mitsubishi Electric Research Laboratories, Inc. filed Critical Mitsubishi Electric Corporation
Priority to PCT/US2015/026217 priority Critical patent/WO2016167779A1/fr
Priority to JP2017507782A priority patent/JP6461308B2/ja
Priority to TW104129304A priority patent/TW201638931A/zh
Publication of WO2016167779A1 publication Critical patent/WO2016167779A1/fr

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules

Definitions

  • This invention relates to a speech recognition device and a rescoring device and in particular to those utilizing a language model based on a recurrent neural network.
  • RNN recurrent neural network
  • LM language model
  • A language model based on a RNN (RNN-LM) is proposed by T. Mikolov, M. Karafiat, L. Burget, J. Cernocky and S. Khudanpur in "Recurrent neural network based language model," Proceedings of INTERSPEECH, 2010, pp. 1045-1048.
  • FIG. 1 shows this method.
  • Input x is a 1-of-N representation of a dictionary consisting of N words.
  • Output y represents posterior probabilities corresponding to N respective words.
  • a hidden layer includes a vector s of a small dimension.
  • a projection matrix U associates an input layer to the hidden layer.
  • Another projection matrix V associates the hidden layer to an output layer. The hidden layer of the previous instant is copied to the input layer so that the context is maintained.
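  • As an illustration, the following is a minimal sketch of one forward step of the Fig. 1 network, assuming an Elman-style formulation in which the copied previous hidden vector enters through a recurrent matrix W; W and all function names are assumptions, since the text names only the projections U and V.

```python
import numpy as np

def rnnlm_step(x, s_prev, U, W, V):
    """One RNN-LM time step: x is the 1-of-N input vector, s_prev the
    hidden vector of the previous instant (copied back to the input)."""
    s = 1.0 / (1.0 + np.exp(-(U @ x + W @ s_prev)))  # hidden layer s (sigmoid)
    a = V @ s                                        # output-layer activations
    y = np.exp(a - a.max())
    return s, y / y.sum()                            # softmax posteriors y over N words
```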
  • a RNN-LM requires a longer processing time than a conventional n-gram method using table lookup, so RNN-LM is mainly used for rescoring.
  • Fig. 2 shows a construction wherein this is used for rescoring.
  • Recognition means 4 receives speech 1 as an input, scores N hypotheses by using an acoustic model 2 and a language model for recognition 3, and provides N-best recognition results 5 in the descending order of scores.
  • N-best recognition results means a recognition result wherein the N hypotheses having the highest scores among all hypotheses are sorted in a descending order of score.
  • rescoring means 6 receives the recognition results 5 as an input and returns re-sorted recognition results 8 wherein the hypotheses are sorted in a descending order of scores.
  • original language model scores can be replaced with those obtained by a language model for rescoring 7, or language model scores can be obtained by interpolating between the original language model scores and these newly obtained scores.
  • Acoustic model scores are the same as those obtained by the acoustic model 2.
  • the language model for rescoring 7 can be a RNN-LM or discriminative language model. By using RNN-LM for the language model 7, which can consider long contexts, the revised recognition results 8 can be better than the recognition results 5.
  • the vocabulary of the words that the rescoring means 6 should recognize covers the vocabulary of the recognition means 4 because any of the words recognizable by the recognition means 4 can appear in the recognition results 5.
  • the number of words in the vocabulary of the rescoring means 6 can be smaller than that of the recognition means 4 if unknown words (UNK) are modeled as a class.
  • a word sequence w_1, w_2, ..., w_t up to the current instant is used to calculate a posterior probability for the next word w_{t+1}.
  • |V| words are included in the vocabulary to be recognized and the words are given respectively different word indexes.
  • the word indexes are represented by n, wherein 1 ≤ n ≤ |V|.
  • the word indexes may be given based on a result wherein the words are sorted on some standard. If the word index of the word appearing t-th in the speech is given as c_t, a function for evaluating training on a cross-entropy (CE) basis is given as the following Equation (1):
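  • A plausible reconstruction of Equation (1) from the surrounding definitions, with the softmax output y_t(n) (presumably the unreferenced Equation (2)) included, is:

$$F_{\mathrm{CE}} = \sum_{t}\sum_{n=1}^{|V|} \delta(n, c_t)\,\log y_t(n) \tag{1}$$

$$y_t(n) = \frac{\exp a_t(n)}{\sum_{m=1}^{|V|} \exp a_t(m)} \tag{2}$$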
  • C is a word sequence a speaker uttered (i.e., a correct sequence) converted into the word indexes, and c_t is the word index of the t-th word in the sequence. That is, C is an ordered sequence such as c_1, c_2, c_3, .... δ is the Kronecker delta.
  • a training rule is obtained as the following Equation (3) by differentiating F_CE with respect to the output activation a:
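  • A plausible reconstruction of Equation (3), consistent with the description of the error in the next item, is:

$$s_t(n) = \delta(n, c_t) - y_t(n) \tag{3}$$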
  • the correct answer is given by δ(n, c_t), so a difference between the correct answer δ(n, c_t) and the probability y_t(n) estimated at the current instant is propagated backward as an error s_t(n) so that the parameters of the neural network are updated.
  • the parameters of the RNN to be trained include at least one element of the projection matrices U and/or V in Fig. 1. Also, the parameters of the RNN to be learned may include elements of vectors representing offsets added upon projection by the projection matrices U and V.
  • the backpropagation is performed for example in order to determine a parameter set minimizing the error s t (n). Also, other known methods and training criteria may be used for the backpropagation.
  • the score is, for example, represented as a function of an acoustic model score and a language model score, and for example, as their weighted sum.
  • the discriminative language model is trained by an (averaged) perceptron algorithm using the difference between the counts of n-grams appearing in a correct sequence (or in the hypothesis having the fewest recognition errors among the N-best recognition results) and the counts of n-grams appearing in the hypothesis having the most recognition errors among the N-best recognition results. Examples of this method are described in Roark 2004 and Japanese Patent Application Laid-Open No. 2014-089247 above.
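  • For illustration, a minimal sketch of such a perceptron update, assuming bigram count features; the function names and the learning rate are illustrative, not from the source.

```python
from collections import Counter

def ngrams(words, n=2):
    """Count the n-grams of a word sequence."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def perceptron_update(weights, oracle, worst, lr=1.0):
    """Reward n-grams of the correct (or least-errorful) sequence and
    penalize n-grams of the most-errorful hypothesis."""
    diff = ngrams(oracle)
    diff.subtract(ngrams(worst))          # count difference between the two sequences
    for gram, count in diff.items():
        weights[gram] = weights.get(gram, 0.0) + lr * count
    return weights
```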
  • a defect of such a conventional method is that it cannot consider any context longer than the n-gram. That is, if the model uses bi-grams, it cannot consider any context longer than a bi-gram (i.e., two consecutive words), and if the model uses tri-grams, it cannot consider any context longer than a tri-gram (i.e., three consecutive words).
  • such a conventional method has a problem that it cannot score at all any n-grams which do not appear in the N-best recognition results, which were used for training. Because of this, although the method is effective if recognition domains of the training data and the evaluating data are close, it may not be effective if the domains are distant (for example, the training data is from a reading task of newspaper articles whereas the evaluating data is from free e-mail text generation).
  • such a conventional method has a problem that it requires the rescoring process twice if the method is used in combination with the RNN-LM. That is, additional rescoring using the RNN-LM would be required before or after the rescoring by the rescoring means which uses the discriminative language model.
  • the present invention is made in order to solve the above problems and is aimed at constructing a speech recognition device and a rescoring device that, by introducing a discriminative effect into the RNN-LM, reduce recognition errors, allow consideration of contexts longer than a conventional discriminative language model can handle, and are robust to unknown contexts to some extent.
  • a speech recognition device related to the present invention is a speech recognition device storing a discriminatively trained language model, wherein: the discriminatively trained language model is trained based on word-by-word alignment between a correct sequence and a recognized hypothesis sequence; and the discriminatively trained language model is constructed based on a RNN.
  • the alignment can be determined for example by using dynamic programming to realize a longest match in word sequences.
  • a rescoring device related to the present invention is a rescoring device for rescoring a hypothesis sequence of speech recognition by using a discriminatively trained language model, wherein: the discriminatively trained language model is trained based on word-by-word alignment between a correct sequence and a recognized hypothesis sequence; and the discriminatively trained language model is constructed based on a RNN.
  • the rescoring device may take a weighted average between a parameter of an original language model and a parameter of the discriminatively trained language model.
  • Each word in the hypothesis sequence may be given a respective confidence measure.
  • the discriminatively trained language model may be trained so that a word having a higher confidence measure is more significant.
  • a first result including a hypothesis sequence may be obtained based on an original language model and a second result including a hypothesis sequence may be obtained based on the discriminatively trained language model. These first and the second results may be integrated.
  • a speech recognition device and a rescoring device that reduce recognition errors, that allow consideration for contexts longer than a discriminative language model and that are robust to unknown contexts to some extent are provided.
  • Fig. 1 is a diagram for explaining a language model based on a recurrent neural network.
  • Fig. 2 is a functional block diagram of a conventional speech recognition device.
  • Fig. 3 is a diagram for explaining alignment between a correct sequence and a hypothesis sequence.
  • Fig. 4 is an example of a hardware construction of a speech recognition device related to a first embodiment.
  • Fig. 5 is a flowchart of processes performed by the speech recognition device of Fig. 4 for training.
  • Fig. 6 is a flowchart of processes performed by the speech recognition device of Fig. 4 for application.
  • Fig. 7 is a functional block diagram of the speech recognition device of Fig. 4.
  • Fig. 8 is a functional block diagram of the speech recognition device related to a second embodiment.
  • Fig. 9 is a functional block diagram of the speech recognition device related to a third embodiment.
  • Fig. 10 is a functional block diagram of the speech recognition device related to a fourth embodiment.
  • Fig. 11 is a functional block diagram of the speech recognition device related to a fifth embodiment.
  • Fig. 12 is a functional block diagram of the speech recognition device related to a sixth embodiment.
  • a first embodiment uses a RNN-LM based on a discriminative standard.
  • the present invention is aimed at improving recognition performance by training the RNN-LM discriminatively.
  • One of the important objects of a language model is to convert speech to be recognized into correct text data, so it is desired to construct a language model that can correct a conventional speech recognition result.
  • constructing a RNN-LM discriminatively by using hypotheses h_t from speech recognition in addition to the above-described correct labels c_t can be considered.
  • An objective function therefor may utilize a likelihood ratio at the word level as in the following Equation (4).
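  • A plausible reconstruction of Equation (4), and of the corresponding back-propagated error referenced later as Equation (5), is the word-level log likelihood ratio between the correct word c_t and the hypothesized word h_t:

$$F_{\mathrm{LR}} = \sum_{t} \log \frac{y_t(c_t)}{y_t(h_t)} \tag{4}$$

$$s_t(n) = \frac{\partial F_{\mathrm{LR}}}{\partial a_t(n)} = \delta(n, c_t) - \delta(n, h_t) \tag{5}$$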
  • other evaluation functions may be used, for example maximization of mutual information or minimum phoneme error, which are often used for discriminative training.
  • Fig. 3 considers a case wherein the correct sequence is [A,B,C,D] and a recognition hypothesis includes an insertion error (I), a deletion error (@) and a substitution error (S).
  • a corresponding relationship such as Fig. 3(a) is obtained by first aligning the correct sequence C and the speech recognition result H.
  • a special treatment is required for the insertion error. For example, suppose a hypothesis sequence of [A,B,C,I,D] is obtained wherein the word I is erroneously inserted into the correct sequence of Fig. 3(a). In this case, there are no correct words corresponding to the word I. This case may for example be processed by ignoring the word I so that the hypothesis sequence is regarded to be [A,B,C,D] or by regarding that the word C at the previous instant is repeated as shown in Fig. 3(b).
  • the number of hypotheses may be two or more (e.g. N-best recognition results).
  • Each hypothesis may be processed similarly.
  • the parameters of the RNN-LM are updated by performing the aligning process such as shown in Fig. 3 for the hypothesis of the first rank, and the parameters of the RNN-LM are again updated similarly by performing the aligning process such as shown in Fig. 3 for the hypothesis of the second rank.
  • Fig. 4 shows an example of a hardware construction of a speech recognition device 10 related to the first embodiment of the present invention.
  • the speech recognition device 10 may be constructed utilizing a known computer.
  • the speech recognition device 10 comprises operation means 20, storage means 30, speech input means 40 and result output means 50.
  • the operation means 20 includes a processor and the storage means 30 includes a storage medium such as a semiconductor memory or HDD (Hard Disk Drive).
  • the storage means 30 stores a program (not shown), and by executing the program the operation means 20 realizes the functions of the speech recognition device 10 described herein.
  • the program may be stored on a non-transitory information storage medium.
  • the speech input means 40 is for example a microphone and receives an input of speech 60 including a word sequence.
  • the speech input means 40 may be an electronic data input means and may receive an input of the speech 60 as electronic data.
  • the result output means 50 is for example a liquid crystal display, a printer, a network interface, etc., and outputs re-sorted N-best recognition results 70.
  • Figs. 5 and 6 show flowcharts of the processes performed by the speech recognition device 10.
  • Fig. 5 is a flowchart for training. If the speech recognition device 10 operates in accordance with the flowchart of Fig. 5, the speech recognition device 10 can be viewed as a speech recognition training device.
  • the speech recognition device 10 receives an input of the speech 60 for training (Step S1). Then, the speech recognition device 10 performs speech recognition for the speech 60 to obtain N-best recognition results (Step S2). Then, the speech recognition device 10 aligns the hypothesis sequences included in the N-best recognition results with a correct sequence (Step S3). Then, the speech recognition device 10 trains the language model discriminatively based on the alignment result (Step S4). Then, the speech recognition device 10 outputs the discriminatively trained language model (Step S5). Note that, although normally many correct sequences are used for training, the present invention may be carried out with at least one correct sequence and at least one hypothesis sequence.
  • Fig. 6 is a flowchart for application. If the speech recognition device 10 operates in accordance with the flowchart of Fig. 6, the speech recognition device 10 can be viewed as a rescoring device.
  • the speech recognition device 10 receives an input of the speech 60 to be recognized (Step S6). Then, the speech recognition device 10 performs a speech recognition process for the speech 60 to obtain the N-best recognition results (Step S7). Then, the speech recognition device 10 rescores the hypothesis sequences included in the N-best recognition results based on the discriminatively trained language model (Step S8). Then, the speech recognition device 10 outputs the re-sorted N-best recognition results 70 which were re-sorted in accordance with the rescored result (Step S9). Note that, although normally a plurality of hypothesis sequences are outputted, a construction for outputting at least one hypothesis sequence may correspond to the present invention.
  • Fig. 7 shows a functional block diagram of the speech recognition device 10.
  • the operation means 20 of the speech recognition device 10 functions as recognition means 21, alignment means 22, discriminative training means 23 and rescoring means 24.
  • the storage means 30 of the speech recognition device 10 can store an acoustic model 31, a first language model 32, N-best recognition results 33, correct labels 34 and a second language model 35.
  • the first language model 32 is for example a language model constructed for speech recognition and the second language model 35 is for example a language model constructed for rescoring.
  • the recognition means 21, the acoustic model 31 and the first language model 32 may be of conventional constructions. That is, the recognition means 4, the acoustic model 2 and the language model 3 of Fig. 2 may be used.
  • the alignment means 22 aligns the N-best recognition results 33 and the correct labels 34.
  • "To align" means, for example, to define corresponding relationships between the words included in a correct sequence and the words included in a hypothesis sequence.
  • words [A, S, D] in the hypothesis sequence are associated respectively with words [A, B, D] in the correct sequence.
  • for the words that cannot be associated, it is considered that there is an insertion or a deletion.
  • in the example of Fig. 3, a word C is deleted and a word I is inserted. Alignment can be determined for example by taking the longest match using dynamic programming, as in the sketch below.
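  • A minimal sketch of such an alignment, assuming the insertion handling of Fig. 3(b) (an inserted word is paired with the previous correct word); all names are illustrative.

```python
def align(correct, hypothesis):
    """Align two word sequences by Levenshtein dynamic programming and
    return (correct_word, hypothesis_word) pairs; None marks a deletion."""
    C, H = len(correct), len(hypothesis)
    d = [[0] * (H + 1) for _ in range(C + 1)]
    for i in range(C + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, C + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (correct[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    pairs, i, j = [], C, H          # backtrace from the bottom-right cell
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (correct[i - 1] != hypothesis[j - 1]):
            pairs.append((correct[i - 1], hypothesis[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            # insertion: pair the inserted word with the previous correct word
            pairs.append((correct[i - 1] if i > 0 else None, hypothesis[j - 1]))
            j -= 1
        else:
            pairs.append((correct[i - 1], None))               # deletion
            i -= 1
    return pairs[::-1]
```

  • For the example of Fig. 3(b), align(['A','B','C','D'], ['A','B','C','I','D']) pairs the inserted word I with the repeated correct word C.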
  • the discriminative training means 23 performs training discriminatively based on a result of the alignment process and generates or updates the second language model 35.
  • the second language model 35 is constructed based on a RNN.
  • the discriminative training for the second language model 35 is performed for example by the back-propagation using above Equation (5), thereby updating the parameters of the RNN. This can be performed in a manner similar to the back-propagation in conventional training.
  • the second language model 35 is trained based on alignment between the correct sequence and the hypothesis sequence.
  • the rescoring means 24 rescores the N-best recognition results 33 based on the second language model 35 to obtain the re-sorted N-best recognition result 70.
  • "To rescore” means, for example, to perform scoring again for hypothesis sequences that were already given their scores.
  • the first scoring is for example the scoring performed by the recognition means 21 in the first embodiment.
  • the rescoring means 24 replaces the language model scores of the hypotheses included in the N-best recognition results 33 with the language model scores estimated by using the neural network.
  • weighted averages between the original language model scores and the estimated language model scores can be taken.
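  • A minimal sketch of this rescoring, assuming each hypothesis carries an acoustic score and an original language model score, and that the total score is a weighted sum as described above; rnnlm_score, lam and lm_weight are illustrative assumptions.

```python
def rescore_nbest(nbest, rnnlm_score, lam=0.5, lm_weight=10.0):
    """nbest: list of (words, acoustic_score, original_lm_score) tuples.
    Interpolate language model scores and re-sort by the combined score."""
    rescored = []
    for words, am_score, lm_orig in nbest:
        lm_new = rnnlm_score(words)                 # RNN-LM score of the hypothesis
        lm = (1.0 - lam) * lm_orig + lam * lm_new   # interpolated LM score
        rescored.append((words, am_score + lm_weight * lm))
    return sorted(rescored, key=lambda h: h[1], reverse=True)
```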
  • the present invention is considered to be more robust to differences in the domains than conventional constructions combined with a discriminative language model because the present invention allows analogy for contexts not present in the training data.
  • the words "dog" and "cat" are interchangeable in some contexts, in which case the cosine similarity between them would be high if they are projected onto vectors s of a small dimension.
  • the training effect of a case wherein "dog" appears in the training data would be similar to that of a case wherein "cat" appears, so the present invention can obtain an analogizing effect based on a similar context including an interchangeable word.
  • Such an effect cannot be obtained by any conventional discriminative language model.
  • a specific dimension of the vector s can be designed as needed and is normally smaller than |V|.
  • the present invention has an advantage that a single rescoring process is sufficient, in contrast with a conventional construction wherein a RNN-LM and a discriminative language model are used together.
  • another discriminative language model can be used additionally in a subsequent stage in order to further improve the performance.
  • an additional rescoring means may be provided at a subsequent stage with respect to the rescoring means 24 and the additional rescoring means may rescore the re-sorted N-best recognition result 70 based on another discriminative language model.
  • a training device may not comprise the rescoring means 24 and an application device may not comprise the alignment means 22 or the discriminative training means 23.
  • the application device may be a conventional speech recognition device (e.g. of the construction shown in Fig. 2), provided that the second language model 35 is used for rescoring.
  • the first embodiment uses the discriminatively trained second language model 35 without any modification.
  • a second embodiment uses parameters of weighted averages between an original language model 36 and the second language model 35. Such a construction can reduce an effect of over- fitting.
  • the original language model 36 means the second language model 35 before its neural network parameters are updated by the discriminative training means 23, i.e. identical to the second language model 35 in an initial state.
  • the second language model 35 is generated by performing discriminative training on the original language model 36.
  • Fig. 8 shows a construction related to the second embodiment.
  • Weighting means 25 is added.
  • the operation means 20 of the speech recognition device 10 may function as the weighting means 25.
  • the weighting means 25 takes weighted averages between the parameters of the original language model 36 and the parameters of the second language model 35. For example, in the construction of Fig. 1, this is represented by the following Equation (6).
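  • A plausible reconstruction of Equation (6), consistent with the symbol definitions below and with the fifth embodiment's remark that r > 1 weights the discriminative parameters negatively, is:

$$\hat{U} = r\,U_{\mathrm{CE}} + (1 - r)\,U_{\mathrm{LR}}, \qquad \hat{V} = r\,V_{\mathrm{CE}} + (1 - r)\,V_{\mathrm{LR}} \tag{6}$$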
  • U_CE and V_CE are parameters of the model trained using cross entropy,
  • U_LR and V_LR are parameters of the discriminatively trained model, and
  • r is a smoothing coefficient. Note that, although the language models normally include multiple parameters, the weighted averaging can be performed if the language models respectively include at least one parameter.
  • a third embodiment uses a RNN-LM based on a discriminative standard utilizing word confidence measures.
  • Fig. 9 shows a construction related to the third embodiment.
  • This example comprises recognition means 121 instead of the recognition means 21 in the first and second embodiments and discriminative training means 123 instead of the discriminative training means 23 in the first and second embodiments.
  • the operation means 20 of the speech recognition device 10 may function as the recognition means 121 and the discriminative training means 123.
  • the recognition means 121 outputs the N-best recognition results 33, determines a confidence measure for each word included in the N-best recognition results 33 and outputs the confidence measures as word confidence measures 37.
  • the word confidence measures 37 are, for example, stored in the storage means 30 of the speech recognition device 10.
  • the discriminative training means 123 performs discriminative training based on the word confidence measures 37 in addition to the result of the alignment process and generates or updates the second language model 35.
  • a ratio of the likelihood of a particular hypothesis occurring at a certain time to the sum of likelihoods of all hypotheses at that time can be used as the word confidence measure for the particular hypothesis. For example, if the word hypotheses at a certain time t are denoted as w_t^i wherein 1 ≤ i ≤ I, the word confidence measure can be represented by using the likelihoods of the word hypotheses p(w_t^i) as:
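  • A plausible reconstruction of the omitted formula, assuming I word hypotheses at time t, is:

$$\mathrm{conf}(w_t^i) = \frac{p(w_t^i)}{\sum_{j=1}^{I} p(w_t^j)}$$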
  • each word in the hypothesis sequence has a respective confidence measure and the second language model 35 is trained so that a word having a higher confidence measure is more significant.
  • although the weighting means 25 and the original language model 36 are provided in a manner similar to the second embodiment, they may be omitted in a manner similar to the first embodiment.
  • in the second embodiment described above, results of training are integrated at the level of the language model (by averaging parameters).
  • in a fourth embodiment, results of training are instead integrated at the level of the recognition results.
  • Fig. 10 shows a construction related to the fourth embodiment.
  • First rescoring means 224 and second rescoring means 225 are provided instead of the rescoring means 24 in the first and second embodiments.
  • the operation means 20 of the speech recognition device 10 may function as the first rescoring means 224 and the second rescoring means 225.
  • the first rescoring means 224 obtains re-sorted N-best recognition results 270 (first result) by rescoring and re-sorting the N-best recognition results 33 based on the original language model 36.
  • the second rescoring means 225 obtains N-best recognition results 271 (second result) by rescoring and re-sorting the N-best recognition results 33 based on the discriminatively trained second language model 35.
  • the re-sorted N-best recognition results 270 and 271 may be stored in the storage means 30 of the speech recognition device 10.
  • the fourth embodiment is provided with result integration means 26.
  • the operation means 20 of the speech recognition device 10 may function as the result integration means 26.
  • the result integration means 26 integrates the re-sorted N-best recognition results 270 and 271 to obtain a final set of re-sorted N-best recognition results 70.
  • the integration may for example be performed by comparing the hypotheses based on respective scores and selecting the hypothesis with a higher score.
  • the integration may be performed by a majority decision.
  • Specific applications of the majority decision can be designed as needed. For example, a majority decision of three or more systems can be used, and if all systems output respectively different hypotheses, they may be compared based on the scores.
  • the score of any language model may be discounted appropriately upon integration. For example, the scores of the hypotheses of such a language model may be multiplied by a weight smaller than one (e.g. 0.8), and then the hypotheses may be compared and integrated based on the scores.
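  • A minimal sketch of such an integration, assuming one scored 1-best hypothesis per system, with majority decision first and (optionally discounted) score comparison as the fallback; all names and the discount value are illustrative.

```python
from collections import Counter

def integrate(results, discount=0.8, discounted=()):
    """results: {system_name: (hypothesis_words, score)}. Systems listed in
    `discounted` have their scores multiplied by the discount weight."""
    scored = {name: (tuple(words), score * (discount if name in discounted else 1.0))
              for name, (words, score) in results.items()}
    votes = Counter(hyp for hyp, _ in scored.values())
    top_hyp, count = votes.most_common(1)[0]
    if count > 1:                     # a majority decision is possible
        return list(top_hyp)
    best_hyp, _ = max(scored.values(), key=lambda hs: hs[1])
    return list(best_hyp)             # all hypotheses differ: compare scores
```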
  • the rescoring can be performed more robustly than using a single (or an averaged) language model.
  • a fifth embodiment has a construction that uses only incorrect hypotheses for the discriminative training of the language model.
  • in the embodiments described above, both the correct hypotheses and the incorrect hypotheses are used for training.
  • using a language model which is trained based only on the incorrect hypotheses may be considered.
  • Fig. 11 shows a construction related to the fifth embodiment.
  • Alignment means 322 is provided instead of the alignment means 22 in the second embodiment.
  • the alignment means 322 extracts incorrect hypotheses 38 from the N-best recognition results 33 and aligns them.
  • model training means 323 is provided instead of the discriminative training means 23 in the second embodiment.
  • the model training means 323 performs training based on the alignment process by using the incorrect hypotheses 38 and generates or updates the second language model 335. This training process per se does not have to be performed in accordance with any discriminative method.
  • the model training means 323 performs training by updating the parameters of the neural network in accordance with Equation (3).
  • weighting means 325 is provided instead of the weighting means 25 in the second embodiment.
  • the weighting means 325 takes weighted averages between the parameters of the original language model 36 and the parameters of the second language model 335 so that a penalty is imposed on the parameters that output an incorrect hypothesis.
  • the weighting means 325 weights the parameters of the second language model 335 to be negative, i.e. so that r in Equation (6) becomes greater than 1.
  • the speech recognition device 10 performs discriminative training in total by combining the original language model and the language model trained by the incorrect hypotheses.
  • the operation means 20 of the speech recognition device 10 may function as the alignment means 322, the model training means 323 and the weighting means 325. Also, the incorrect hypotheses 38 and the second language model 335 may be stored in the storage means 30 of the speech recognition device 10.
  • in the embodiments described above, the first language model 32 for speech recognition is not an object of discriminative training.
  • in a sixth embodiment, the language model for speech recognition is itself constructed as a RNN-LM and trained discriminatively.
  • Fig. 12 shows a construction related to the sixth embodiment.
  • discriminative training means 423 is provided instead of the discriminative training means 23 in the first embodiment.
  • the discriminative training means 423 performs training discriminatively based on the result of the aligning process and updates the language model 432.
  • recognition means 421 is provided instead of the recognition means 21 in the first embodiment.
  • the recognition means 421 performs speech recognition based on the discriminatively trained language model 432 and outputs the N-best recognition results 33.
  • an effect of the discriminative training can also be obtained in a manner similar to the first embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

A speech recognition device and a rescoring device are constructed so as to reduce recognition errors, to allow consideration of contexts longer than a discriminative language model, and to be robust to unknown contexts to some extent. In the speech recognition device and the rescoring device using a discriminatively trained language model, the discriminatively trained language model is trained based on alignment between a correct sequence and a hypothesis sequence, and the discriminatively trained language model is constructed based on a recurrent neural network.
PCT/US2015/026217 2015-04-16 2015-04-16 Speech recognition device and rescoring device WO2016167779A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/US2015/026217 WO2016167779A1 (fr) 2015-04-16 2015-04-16 Speech recognition device and rescoring device
JP2017507782A JP6461308B2 (ja) 2015-04-16 2015-04-16 Speech recognition device and rescoring device
TW104129304A TW201638931A (zh) 2015-04-16 2015-09-04 Speech recognition device and adjustment device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2015/026217 WO2016167779A1 (fr) 2015-04-16 2015-04-16 Speech recognition device and rescoring device

Publications (1)

Publication Number Publication Date
WO2016167779A1 true WO2016167779A1 (fr) 2016-10-20

Family

ID=57125816

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/026217 WO2016167779A1 (fr) 2015-04-16 2015-04-16 Speech recognition device and rescoring device

Country Status (3)

Country Link
JP (1) JP6461308B2 (fr)
TW (1) TW201638931A (fr)
WO (1) WO2016167779A1 (fr)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018051841A1 (fr) * 2016-09-16 2018-03-22 日本電信電話株式会社 Model learning device, method therefor, and program
WO2018062265A1 (fr) * 2016-09-30 2018-04-05 日本電信電話株式会社 Acoustic model learning device, method therefor, and program
CN108335694A (zh) * 2018-02-01 2018-07-27 北京百度网讯科技有限公司 Far-field environmental noise processing method, apparatus, device and storage medium
JP2019008574A (ja) * 2017-06-26 2019-01-17 合同会社Ypc Article determination device, system, method, and program
EP3648099A4 (fr) * 2017-06-29 2020-07-08 Tencent Technology (Shenzhen) Company Limited Speech recognition method, device, apparatus, and storage medium
CN112163636A (zh) * 2020-10-15 2021-01-01 电子科技大学 Unknown pattern recognition method for electromagnetic signal radiation sources based on a Siamese neural network
US20220199091A1 (en) * 2020-12-18 2022-06-23 Microsoft Technology Licensing, Llc Hypothesis stitcher for speech recognition of long-form audio

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10170110B2 (en) * 2016-11-17 2019-01-01 Robert Bosch Gmbh System and method for ranking of hybrid speech recognition results with neural networks
BR112020023552A2 (pt) * 2018-05-18 2021-02-09 Greeneden U.S. Holdings Ii, Llc Methods for training a confidence model in an automatic speech recognition system and for converting speech input into text using confidence modeling with a multiclass approach, and a system for converting input speech into text
JP6965846B2 (ja) * 2018-08-17 2021-11-10 日本電信電話株式会社 Language model score calculation device, learning device, language model score calculation method, learning method, and program
US11011156B2 (en) 2019-04-11 2021-05-18 International Business Machines Corporation Training data modification for training model
KR102577589B1 (ko) 2019-10-22 2023-09-12 삼성전자주식회사 Speech recognition method and speech recognition device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6490555B1 (en) * 1997-03-14 2002-12-03 Scansoft, Inc. Discriminatively trained mixture models in continuous speech recognition
US20040267530A1 (en) * 2002-11-21 2004-12-30 Chuang He Discriminative training of hidden Markov models for continuous speech recognition
US20080243503A1 (en) * 2007-03-30 2008-10-02 Microsoft Corporation Minimum divergence based discriminative training for pattern recognition
US8775177B1 (en) * 2012-03-08 2014-07-08 Google Inc. Speech recognition process
US20150095026A1 (en) * 2013-09-27 2015-04-02 Amazon Technologies, Inc. Speech recognizer with multi-directional decoding

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1450350A1 (fr) * 2003-02-20 2004-08-25 Sony International (Europe) GmbH Method for recognizing speech with attributes
JP2008026721A (ja) * 2006-07-24 2008-02-07 Nec Corp Speech recognition device, speech recognition method, and speech recognition program
JP2013125144A (ja) * 2011-12-14 2013-06-24 Nippon Hoso Kyokai <Nhk> Speech recognition device and program therefor

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018051841A1 (fr) * 2016-09-16 2018-03-22 日本電信電話株式会社 Model learning device, method therefor, and program
JPWO2018051841A1 (ja) * 2016-09-16 2019-07-25 日本電信電話株式会社 Model learning device, method therefor, and program
JPWO2018062265A1 (ja) * 2016-09-30 2019-07-25 日本電信電話株式会社 Acoustic model learning device, method therefor, and program
WO2018062265A1 (fr) * 2016-09-30 2018-04-05 日本電信電話株式会社 Acoustic model learning device, method therefor, and program
JP2019008574A (ja) * 2017-06-26 2019-01-17 合同会社Ypc Article determination device, system, method, and program
EP3648099A4 (fr) * 2017-06-29 2020-07-08 Tencent Technology (Shenzhen) Company Limited Speech recognition method, device, apparatus, and storage medium
CN108335694A (zh) * 2018-02-01 2018-07-27 北京百度网讯科技有限公司 Far-field environmental noise processing method, apparatus, device and storage medium
US11087741B2 2018-02-01 2021-08-10 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, device and storage medium for processing far-field environmental noise
CN108335694B (zh) * 2018-02-01 2021-10-15 北京百度网讯科技有限公司 Far-field environmental noise processing method, apparatus, device and storage medium
CN112163636A (zh) * 2020-10-15 2021-01-01 电子科技大学 Unknown pattern recognition method for electromagnetic signal radiation sources based on a Siamese neural network
CN112163636B (zh) * 2020-10-15 2023-09-26 电子科技大学 Unknown pattern recognition method for electromagnetic signal radiation sources based on a Siamese neural network
US20220199091A1 (en) * 2020-12-18 2022-06-23 Microsoft Technology Licensing, Llc Hypothesis stitcher for speech recognition of long-form audio
US11574639B2 (en) * 2020-12-18 2023-02-07 Microsoft Technology Licensing, Llc Hypothesis stitcher for speech recognition of long-form audio

Also Published As

Publication number Publication date
TW201638931A (zh) 2016-11-01
JP6461308B2 (ja) 2019-01-30
JP2017527846A (ja) 2017-09-21

Similar Documents

Publication Publication Date Title
WO2016167779A1 (fr) Speech recognition device and rescoring device
Ogawa et al. Error detection and accuracy estimation in automatic speech recognition using deep bidirectional recurrent neural networks
JP6222821B2 (ja) 誤り修正モデル学習装置、及びプログラム
De Mulder et al. A survey on the application of recurrent neural networks to statistical language modeling
Jelinek Statistical methods for speech recognition
Shannon Optimizing expected word error rate via sampling for speech recognition
Lou et al. Disfluency detection using auto-correlational neural networks
US20180144234A1 (en) Sentence Embedding for Sequence-To-Sequence Matching in a Question-Answer System
US8494847B2 (en) Weighting factor learning system and audio recognition system
JP2019159654A (ja) 時系列情報の学習システム、方法およびニューラルネットワークモデル
Cui et al. Multi-view and multi-objective semi-supervised learning for hmm-based automatic speech recognition
Munkhdalai et al. Fast contextual adaptation with neural associative memory for on-device personalized speech recognition
Wu et al. Encoding linear models as weighted finite-state transducers.
WO2023071581A1 (fr) Method and apparatus for determining a response sentence, device, and medium
Lin et al. Neural finite-state transducers: Beyond rational relations
CN111814489A (zh) 口语语义理解方法及系统
Audhkhasi et al. Theoretical analysis of diversity in an ensemble of automatic speech recognition systems
Wang et al. A new concept of deep reinforcement learning based augmented general tagging system
Ons et al. Fast vocabulary acquisition in an NMF-based self-learning vocal user interface
Saraçlar Pronunciation modeling for conversational speech recognition
JP6127778B2 (ja) モデル学習方法、モデル学習プログラム及びモデル学習装置
Khassanov et al. Enriching rare word representations in neural language models by embedding matrix augmentation
Hori et al. Adversarial training and decoding strategies for end-to-end neural conversation models
Layton Augmented statistical models for classifying sequence data
Andrew et al. Sequential deep belief networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15889373

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2017507782

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15889373

Country of ref document: EP

Kind code of ref document: A1