WO2014189399A1 - Mixed-structure language model based on n-grams - Google Patents

Mixed-structure language model based on n-grams

Info

Publication number
WO2014189399A1
Authority
WO
WIPO (PCT)
Prior art keywords
grams
gram
morphologic
mixed
lemmas
Prior art date
Application number
PCT/RS2013/000009
Other languages
English (en)
Inventor
Stevan OSTROGONAC
Milan SEČUJSKI
Vlado Delić
Dragiša MIŠKOVIĆ
Nikša JAKOVLJEVIĆ
Nataša VUJNOVIĆ SEDLAR
Original Assignee
Axon Doo
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Axon Doo filed Critical Axon Doo
Priority to PCT/RS2013/000009 priority Critical patent/WO2014189399A1/fr
Publication of WO2014189399A1 publication Critical patent/WO2014189399A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/197 Probabilistic grammars, e.g. word n-grams

Definitions

  • the invention belongs to the field of natural language processing (NLP), specifically to statistical language modeling. It is related to speech recognition and it could be applied in other fields such as spell checking and language translation.
  • NLP natural language processing
  • LVCSR Large vocabulary continuous speech recognition
  • the patent EP1290676 B1 filed May 23rd, 2001, entitled "Creating a unified task dependent language models with information retrieval techniques" relates to a method for creating a language model from a task-independent corpus for a language processing system.
  • the language model includes a plurality of context-free grammars and a hybrid n-gram model.
  • the patent EP1046157 B1 filed October 11th, 1999, entitled "Method of determining parameters of a statistical language model" relates to a method of determining parameters of a statistical language model for automatic speech recognition where elements of a vocabulary are combined so as to form context-independent vocabulary element categories.
  • the patent US6154722 A filed December 18th, 1997, entitled "Method and apparatus for a speech recognition system language model that integrates a finite state grammar probability and an n-gram probability" describes a method and an apparatus for a speech recognition system that uses a language model integrating a finite state grammar probability and an n-gram probability.
  • the patent US7020606 B1 filed December 2nd, 1998, entitled "Voice recognition using a grammar or n-gram procedures" relates to a method for voice recognition, wherein a grammar method is combined with an n-gram voice model with statistical word sequence evaluation.
  • the patent EP0801786 B1 filed November 4th, 1995, entitled "Method and apparatus for adapting the language model's size in a speech recognition system" relates to a method and an apparatus for adapting, particularly reducing, the size of a language model, which comprises word n-grams, in a speech recognition system.
  • the patent introduces a mechanism which discards n-grams for which the acoustic part of the system requires less support from the language model in order to achieve correct recognition.
  • the scientific paper "Hybrid n-gram Probability Estimation in Morphologically Rich Languages" proposes a hybrid method that joins word-based and morpheme-based language modeling.
  • the subject of this invention is a language model consisting of mixed-structure n-grams. Namely, three kinds of n-gram constituents are used: words, lemmas and morphologic classes.
  • a training corpus has to be created first.
  • POS part-of-speech
  • the morphologic dictionary contains the information about the morphologic categories and the canonical forms for the words appearing in the training corpus.
  • the POS tagging software assigns morphologic classes and the lemmatizer assigns lemmas to the words and creates a training corpus of triples.
  • a sentence consisting of three words is taken as an example: Maja voli cvece. (Maja likes flowers.)
  • C1 represents e.g. all the proper, feminine nouns in the nominative singular case
  • C2 represents all the verbs in the third person singular of the present tense
  • C3 represents the neuter gender nouns in the nominative plural case.
  • n-grams to be included are created by combining the constituents of the triples. All combinations are taken into account.
  • Such a training principle produces very large language models even when small training corpora are used. For example, if the sentence "Maja voli cvece" appears in the original training corpus, it results in 27 trigrams being added to the mixed-structure LM (lemmas are marked by curly brackets; the full list is shown in Figure 4), as illustrated by the sketch below.
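  • The combination step can be illustrated with a short sketch. The following Python snippet is a minimal illustration only, not the patented implementation; the lemma strings and class labels are placeholders assumed for the example. It enumerates the 27 mixed-structure trigrams obtainable from the three (word, lemma, morphologic class) triples of the example sentence, matching the list shown in Figure 4.

```python
from itertools import product

# Each position in the mixed-structure corpus is a (word, lemma, class) triple.
# Lemmas are wrapped in curly brackets; the class labels C1-C3 are placeholders.
triples = [
    ("Maja",  "{Maja}",   "C1"),  # C1: proper feminine noun, nominative singular
    ("voli",  "{voleti}", "C2"),  # C2: verb, 3rd person singular, present tense
    ("cvece", "{cvece}",  "C3"),  # C3: neuter noun, nominative plural
]

def mixed_structure_ngrams(triples):
    """Yield every n-gram obtained by choosing one constituent
    (word, lemma or morphologic class) at each position."""
    for combination in product(*triples):
        yield combination

trigrams = list(mixed_structure_ngrams(triples))
print(len(trigrams))   # 3**3 = 27 mixed-structure trigrams for a 3-word sentence
print(trigrams[0])     # ('Maja', 'voli', 'cvece') - the plain word trigram
```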
  • the main advantage of the mixed-structure n-gram concept is the possibility of including only the most important and reliable information in the model. For example, when observing 3-grams, some word w3 can appear with different word histories, but these histories can all be assigned to a single morphologic class (or lemma) history c1c2 (or l1l2). Therefore, if the word appears frequently, it is useful to include the n-gram consisting of the morphologic classes comprising the history (context) and the final word in its original (inflected) form, c1c2w3. Besides that, the 3-gram consisting of the same history and the final word replaced by its corresponding morphologic class (c1c2c3) may be included in the model as a separate entry.
  • the mixed-structure n-gram model keeps the information about the most frequent morphologic structures appearing in the training corpus, but also keeps the information about particular words (or lemmas) that stand out as the common constituents of some of the structures.
  • the mixed-structure LM thus takes advantage of all three types of information carriers contained in the training corpus in a way that represents a compromise between robustness to the lack of training data and modeling accuracy. Once created, such an LM is easier to use than e.g. the combined models of words, lemmas and morphologic classes, because only one document is searched to obtain the resulting probability for a given textual content.
  • the described language model can be used to estimate the probability of some textual content in a way that resembles the Katz back-off algorithm.
  • the probability of a word sequence is commonly calculated by multiplying the probabilities of n-grams consisting of each word from the sequence and its corresponding history.
  • the order of n-grams which is commonly used is 3 (trigrams).
  • the Katz algorithm implies using probabilities of n-grams included in the model to estimate the probability of the input text, but if some of the relevant n-gram probabilities are not included in the model, lower-order n-grams ((n-1)-grams) are used instead.
  • the switching to the lower-order n-grams is done iteratively when needed, but this is penalized by back-off coefficients.
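  • For reference, the single-stage back-off calculation that the proposed procedure resembles can be sketched as follows. This is a hedged illustration with assumed data structures: ngram_probs maps n-gram tuples to probabilities and backoff_weights maps histories to Katz back-off coefficients; neither name comes from the patent text.

```python
import math

def backoff_probability(word, history, ngram_probs, backoff_weights):
    """Katz-style estimate of P(word | history): use the n-gram probability if the
    model contains it, otherwise back off to a shorter history, penalized by the
    back-off coefficient of the abandoned history."""
    ngram = history + (word,)
    if ngram in ngram_probs:
        return ngram_probs[ngram]
    if not history:
        # unseen unigram: a small floor value stands in for the reserved mass
        return ngram_probs.get((word,), 1e-10)
    alpha = backoff_weights.get(history, 1.0)
    return alpha * backoff_probability(word, history[1:], ngram_probs, backoff_weights)

def sequence_log_probability(words, ngram_probs, backoff_weights, order=3):
    """Log-probability of a word sequence as the sum of per-word log-probabilities,
    each conditioned on at most (order - 1) preceding words."""
    log_p = 0.0
    for i, word in enumerate(words):
        history = tuple(words[max(0, i - order + 1):i])
        log_p += math.log(backoff_probability(word, history, ngram_probs, backoff_weights))
    return log_p
```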
  • the main difference between the Katz algorithm and the algorithm used to find the probability of the input text based on the mixed-structure language model is that the latter algorithm implies two stages of back-off.
  • the first back-off stage refers to the choice of the optimal n-gram from a set of n-grams corresponding to the input word sequence of length n.
  • if the trigram LM is used and the input word sequence consists of the words w1w2w3, there may be more than one corresponding trigram included in the LM, such as w1w2w3 or c1c2w3, and so on.
  • to find the optimal trigram probability, it is best to first search for a trigram consisting of words. If it does not exist, trigrams containing lemmas and/or morphologic classes are taken into account.
  • the hierarchy of the trigram structures can be defined in a variety of ways, but it is best to consider the size of the training corpus when defining it.
  • the probabilities of word n-grams are naturally lower than the probabilities of n-grams containing the corresponding lemmas and/or morphologic classes (besides words), since all the structures are treated as equal in the training phase even though lemmas and morphologic classes appear more frequently than the actual words.
  • the probability of the existing n-gram must be scaled in order to obtain an adequate word n-gram probability estimate.
  • One way to do this is by dividing the probability by the sizes of the morphologic classes and of the lexemes corresponding to the lemmas contained in the n-gram, where the size of a lexeme (or morphologic class) is determined as the number of types (different words) it is assigned to in the training corpus. For example, if a morphologic class is defined as nouns in the genitive case and masculine gender, and if 200 different words found in the training corpus fall into that category, then the size of this morphologic class is 200.
  • the scaling should be done in the training phase so that the resulting model would contain probabilities which are ready for use.
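  • A minimal sketch of this scaling step is given below. It assumes the illustrative conventions from the earlier example (lemmas wrapped in curly brackets, placeholder class labels) and hypothetical lexeme_size and class_size lookups built while the mixed-structure corpus is created; none of these names are prescribed by the patent text.

```python
def scale_ngram_probability(ngram, raw_probability, lexeme_size, class_size):
    """Divide the initial probability estimate by the size of every lexeme and
    every morphologic class appearing in the n-gram, so that entries containing
    generalized constituents become comparable to plain word n-grams."""
    scaled = raw_probability
    for constituent in ngram:
        if constituent.startswith("{") and constituent.endswith("}"):
            # lemma: divide by the number of distinct word forms (types) of its lexeme
            scaled /= lexeme_size[constituent]
        elif constituent in class_size:
            # morphologic class: divide by the number of distinct words assigned to it
            scaled /= class_size[constituent]
        # word constituents leave the probability unchanged
    return scaled

# Example: a class covering 200 distinct genitive masculine nouns contributes a factor of 1/200.
class_sizes = {"C1": 1, "C2": 35, "C3": 200}
print(scale_ngram_probability(("C1", "C2", "cvece"), 0.004, {}, class_sizes))
```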
  • the second back-off stage refers to resorting to (n-1)-gram probabilities when no n-gram probabilities correspond to the input word sequence (which is analogous to the back-off procedure described by the Katz algorithm).
  • the switching to a lower-order n-gram should be penalized, and then the first back-off stage can be applied to the set of (n-1)-grams found in the model, assuming that at least one (n-1)-gram is found. If not, the process is repeated iteratively.
  • the lower-order back-off penalization for the mixed-structure model would be complicated and computationally inefficient if weights analogous to the Katz back-off coefficients for word n-gram models were to be calculated.
  • the penalization can be done simply by multiplying the acquired (n-1)-gram probability by the default n-gram probability obtained during the training process.
  • This default value represents the probability mass reserved for unseen events calculated during the Good-Turing (GT) discounting which is the initial step in statistical LM training (although other discounting techniques may be used, e.g. Kneser-Ney or absolute discounting). If no entries corresponding to the input word sequence are found in the model, a default GT value for unigrams is returned. This may be a very small probability, but it is never zero, which is important for further calculation.
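  • The two back-off stages can be summarized in a short sketch. This is an assumed illustration rather than the exact patented procedure: model maps mixed-structure n-gram tuples to already-scaled probabilities, default_gt maps each n-gram order to the Good-Turing probability reserved for unseen events, each input position is a (word, lemma, class) triple, and the structure hierarchy simply prefers candidates with fewer generalized constituents, which is only one of the possible orderings noted earlier.

```python
from itertools import product

def candidate_ngrams(context):
    """First back-off stage: order the candidate mixed-structure n-grams for a
    context from most specific (all words) to most general (more lemmas/classes)."""
    candidates = list(product(*context))
    candidates.sort(key=lambda cand: sum(
        constituent != triple[0]            # triple[0] is the inflected word form
        for constituent, triple in zip(cand, context)))
    return candidates

def mixed_structure_probability(triples, model, default_gt, order=3):
    """Return the probability estimate for the last word given its history;
    triples holds the (word, lemma, class) triples of the word and its history."""
    highest = min(order, len(triples))
    for n in range(highest, 0, -1):             # second stage: shorten the history
        context = triples[-n:]
        for candidate in candidate_ngrams(context):
            if candidate in model:
                probability = model[candidate]
                # penalize every reduction of the order by the default probability
                for skipped_order in range(highest, n, -1):
                    probability *= default_gt[skipped_order]
                return probability
    return default_gt[1]                        # nothing found: default unigram value
```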
  • GT Good-Turing
  • Figure 1 - shows a block scheme of a large-vocabulary continuous speech recognition system and illustrates the role of the language model.
  • Figure 2 - contains details about the mixed-structure language model training.
  • Figure 3 - describes how a mixed-structure language model is used to estimate a word-sequence probability.
  • Figure 4 - illustrates how the words, lemmas and morphologic classes are combined to create the mixed-structure n-grams. The example shows the list of 3-grams acquired by mixing data corresponding to three words from the original training corpus.
  • This invention shows a language model based on mixed-structure n-grams, and the explanation of the following figures illustrates it in detail.
  • Figure 1 represents a block diagram of a speech recognition system.
  • the acoustic feature vector is used in the acoustic recognition level 101 which relies on the information provided by the acoustic models 102 and the pronunciation (lexical) model 103.
  • the result of the acoustic recognition is a set of word sequence hypotheses which are additionally scored on the linguistic recognition level 104 which relies on the information provided by the language model 105.
  • Figure 2 shows the training process for the mixed-structure language model proposed by this invention.
  • the initial training corpus 200 which contains textual information is used in the block for creating the mixed-structure corpus 201 during which process a record of the sizes of lexemes (corresponding to lemmas) and morphologic classes is kept.
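  • The record of sizes mentioned here can be kept with a simple type-counting pass, sketched below under the assumption that the tagged corpus is available as a flat sequence of (word, lemma, morphologic class) triples; the function name is made up for this illustration.

```python
from collections import defaultdict

def record_sizes(tagged_corpus):
    """Count, for every lexeme (lemma) and every morphologic class, the number of
    distinct word forms (types) mapped to it in the training corpus; these sizes
    are later used to rescale the n-gram probability estimates."""
    lexeme_types = defaultdict(set)
    class_types = defaultdict(set)
    for word, lemma, morph_class in tagged_corpus:
        lexeme_types[lemma].add(word)
        class_types[morph_class].add(word)
    lexeme_size = {lemma: len(types) for lemma, types in lexeme_types.items()}
    class_size = {cls: len(types) for cls, types in class_types.items()}
    return lexeme_size, class_size
```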
  • a POS tagging and lemmatization tool 202 is required for assigning the lemmas and morphologic classes to all the words from the initial training corpus.
  • the POS tagger and the lemmatizer rely on the information provided by the morphologic dictionary 203.
  • the mixed-structure corpus is forwarded to the block for combining words, lemmas and morphologic classes into mixed-structure n-grams 204, which also records the counts of all n-grams seen in the corpus.
  • the initial counts are then smoothed and some probability mass is reserved for "unseen" events through applying the discounting 205, after which the counts are forwarded to the block for calculating the initial probability estimates 206 for all the n-grams.
  • These probability estimates are then scaled by the sizes of lexemes (corresponding to lemmas) and morphologic classes contained in the n-grams in the probability recalculation block 207, and the resulting mixed-structure language model 105 is exported to the output textual document.
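  • A simple form of the Good-Turing discounting applied in block 205 can be sketched as follows. This is an illustrative simplification with assumed data structures; practical implementations smooth the count-of-count statistics, and Kneser-Ney or absolute discounting may be used instead, as noted earlier.

```python
from collections import Counter

def good_turing_discount(ngram_counts):
    """Replace each raw count c with the Good-Turing adjusted count
    c* = (c + 1) * N_{c+1} / N_c and compute the probability mass reserved
    for unseen events (the 'default' value later used during back-off)."""
    total = sum(ngram_counts.values())
    count_of_counts = Counter(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c = count_of_counts[c]
        n_c_plus_1 = count_of_counts.get(c + 1, 0)
        # fall back to the raw count when no higher count-of-count is available
        adjusted[ngram] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 else c
    unseen_mass = count_of_counts.get(1, 0) / total   # mass reserved for unseen n-grams
    return adjusted, unseen_mass
```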
  • Figure 3 shows how the mixed-structure model can be used in the speech recognition phase.
  • the mixed structures are sent to the block for generating the list of n-grams 300.
  • the block for searching for the most appropriate n-gram probability 301, which relies on the language model 105, applies the first back-off stage, or the second back-off stage if no n-grams are initially found in the model.
  • the resulting probability is the estimate which is used in the speech recognition system for scoring the word sequence hypotheses.
  • Figure 4 shows how a list of mixed-structure 3-grams is obtained by mixing data corresponding to the three words from the original training corpus. All the combinations are considered and for the given example a list of 27 mixed-structure trigrams is given.
  • This invention describes an n-gram language model containing n-grams of mixed structures.
  • Each n-gram may contain from 0 to n words in their original (inflected) forms, but also canonical word forms (lemmas) and morphologic classes.
  • This invention relies on the existence of a morphologic dictionary, a part-of-speech tagging tool and a lemmatizer, which are language-dependent.
  • the mixed-structure language model can, however, be used for different languages, and it is especially useful for highly inflective languages and domain-specific applications, in which cases the lack of training data can degrade the performance of word-based n-gram models.
  • the mixed-structure modeling technique ensures the inclusion of the most reliable information obtained from the training corpus and enables the creation of high-quality models even when small amounts of data are available or when models need to be small (e.g. for applications in mobile phones).
  • This type of language model can improve the accuracy of speech recognition systems and it can also introduce improvements into software for spell checking, automatic translation or other tools that use the information about word collocation probabilities.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention concerns a mixed-structure language model based on n-grams and a method for determining word-sequence probability based on this type of model. The mixed structure, which includes lemma and morphologic class information for all the words of an n-gram, enables a modeling technique that ensures the inclusion of the most reliable information obtained from a training corpus and enables the creation of high-quality models even when only a small amount of data is available. The invention also concerns different pruning techniques that can be used to reduce the number of n-grams included in the model if a large amount of data is available for training the morphologic classes.
PCT/RS2013/000009 2013-05-22 2013-05-22 Modèle de langage à structure mélangée basé sur n-grammes WO2014189399A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RS2013/000009 WO2014189399A1 (fr) 2013-05-22 2013-05-22 Modèle de langage à structure mélangée basé sur n-grammes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RS2013/000009 WO2014189399A1 (fr) 2013-05-22 2013-05-22 Modèle de langage à structure mélangée basé sur n-grammes

Publications (1)

Publication Number Publication Date
WO2014189399A1 true WO2014189399A1 (fr) 2014-11-27

Family

ID=48747698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RS2013/000009 WO2014189399A1 (fr) 2013-05-22 2013-05-22 Modèle de langage à structure mélangée basé sur n-grammes

Country Status (1)

Country Link
WO (1) WO2014189399A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871534A (zh) * 2019-01-10 2019-06-11 北京海天瑞声科技股份有限公司 中英混合语料的生成方法、装置、设备及存储介质


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5640487A (en) 1993-02-26 1997-06-17 International Business Machines Corporation Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models
EP0801786B1 (fr) 1995-11-04 2000-06-28 International Business Machines Corporation Procede et appareil d'adaptation de la dimension du modele de langage dans un systeme de reconnaissance vocale
US6073091A (en) 1997-08-06 2000-06-06 International Business Machines Corporation Apparatus and method for forming a filtered inflected language model for automatic speech recognition
US7020606B1 (en) 1997-12-11 2006-03-28 Harman Becker Automotive Systems Gmbh Voice recognition using a grammar or N-gram procedures
US6154722A (en) 1997-12-18 2000-11-28 Apple Computer, Inc. Method and apparatus for a speech recognition system language model that integrates a finite state grammar probability and an N-gram probability
EP1046157B1 (fr) 1998-10-21 2004-03-10 Koninklijke Philips Electronics N.V. Procede permettant de determiner des parametres d'un modele de language statistique
EP1290676B1 (fr) 2000-06-01 2006-10-18 Microsoft Corporation Creation d'un modele de langage destine a un systeme de traitement du langage
EP1320086A1 (fr) 2001-12-13 2003-06-18 Sony International (Europe) GmbH Procédé de génération et d'adaptation de modèles de langage
EP1528539A1 (fr) 2003-10-30 2005-05-04 AT&T Corp. Système et méthode pour l'utilisation de meta-données en modélisation du language
US20110161072A1 (en) 2008-08-20 2011-06-30 Nec Corporation Language model creation apparatus, language model creation method, speech recognition apparatus, speech recognition method, and recording medium
US20120278060A1 (en) * 2011-04-27 2012-11-01 Xerox Corporation Method and system for confidence-weighted learning of factored discriminative language models

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BROWN P F ET AL: "CLASS-BASED N-GRAM MODELS OF NATURAL LANGUAGE", COMPUTATIONAL LINGUISTICS, CAMBRIDGE, MA, US, vol. 18, no. 4, 1 December 1992 (1992-12-01), pages 467 - 479, XP000892488 *
KIRCHHOFF K ET AL: "Morphology-based language modeling for conversational Arabic speech recognition", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 20, no. 4, 1 October 2006 (2006-10-01), pages 589 - 608, XP024930246, ISSN: 0885-2308, [retrieved on 20061001], DOI: 10.1016/J.CSL.2005.10.001 *
STEVAN OSTROGONAC ET AL: "A language model for highly inflective non-agglutinative languages", INTELLIGENT SYSTEMS AND INFORMATICS (SISY), 2012 IEEE 10TH JUBILEE INTERNATIONAL SYMPOSIUM ON, IEEE, 20 September 2012 (2012-09-20), pages 177 - 181, XP032265283, ISBN: 978-1-4673-4751-8, DOI: 10.1109/SISY.2012.6339510 *
TOMAS BRYCHCIN ET AL: "Morphological based language models for inflectional languages", INTELLIGENT DATA ACQUISITION AND ADVANCED COMPUTING SYSTEMS (IDAACS), 2011 IEEE 6TH INTERNATIONAL CONFERENCE ON, IEEE, 15 September 2011 (2011-09-15), pages 560 - 563, XP031990283, ISBN: 978-1-4577-1426-9, DOI: 10.1109/IDAACS.2011.6072829 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871534A (zh) * 2019-01-10 2019-06-11 北京海天瑞声科技股份有限公司 中英混合语料的生成方法、装置、设备及存储介质
CN109871534B (zh) * 2019-01-10 2020-03-24 北京海天瑞声科技股份有限公司 中英混合语料的生成方法、装置、设备及存储介质

Similar Documents

Publication Publication Date Title
US8719021B2 (en) Speech recognition dictionary compilation assisting system, speech recognition dictionary compilation assisting method and speech recognition dictionary compilation assisting program
US5878390A (en) Speech recognition apparatus equipped with means for removing erroneous candidate of speech recognition
JP2003505778A (ja) 音声制御ユーザインタフェース用の認識文法作成の特定用途を有する句ベースの対話モデル化
Sak et al. Morphology-based and sub-word language modeling for Turkish speech recognition
US8255220B2 (en) Device, method, and medium for establishing language model for expanding finite state grammar using a general grammar database
EP2950306A1 (fr) Procédé et système pour construire un modèle de langage
Kipyatkova et al. Recurrent neural network-based language modeling for an automatic Russian speech recognition system
Tachbelie et al. Syllable-based and hybrid acoustic models for amharic speech recognition
WO2014189399A1 (fr) Modèle de langage à structure mélangée basé sur n-grammes
Tanigaki et al. A hierarchical language model incorporating class-dependent word models for OOV words recognition
EP4295358A1 (fr) Modèle de langage récurrent de table de consultation
Al-Anzi et al. Performance evaluation of sphinx and HTK speech recognizers for spoken Arabic language
Maskey et al. A phrase-level machine translation approach for disfluency detection using weighted finite state transducers
Donaj et al. Context-dependent factored language models
KR20050101694A (ko) 문법적 제약을 갖는 통계적인 음성 인식 시스템 및 그 방법
Sakti et al. Unsupervised determination of efficient Korean LVCSR units using a Bayesian Dirichlet process model
Smaïli et al. An hybrid language model for a continuous dictation prototype
Hasegawa-Johnson et al. Fast transcription of speech in low-resource languages
Duchateau et al. Handling disfluencies in spontaneous language models
Alumae Sentence-adapted factored language model for transcribing Estonian speech
Isotani et al. Speech recognition using a stochastic language model integrating local and global constraints
Sas et al. Pipelined language model construction for Polish speech recognition
Ogawa et al. Word class modeling for speech recognition with out-of-task words using a hierarchical language model.
Brugnara et al. Techniques for approximating a trigram language model
Bahrani et al. Building statistical language models for persian continuous speech recognition systems using the peykare corpus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13734523

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13734523

Country of ref document: EP

Kind code of ref document: A1