WO2002093416A1 - Systeme de traduction a base de memoire statistique (Statistical memory-based translation system)

Systeme de traduction a base de memoire statistique

Info

Publication number
WO2002093416A1
Authority
WO
WIPO (PCT)
Prior art keywords
translation
target language
text segment
word
current target
Prior art date
Application number
PCT/US2002/015057
Other languages
English (en)
Inventor
Daniel Marcu
Original Assignee
University Of Southern California
Priority date
2001-05-11
Filing date
2002-05-13
Publication date
2002-11-21
Priority claimed from US09/854,327 external-priority patent/US7533013B2/en
Application filed by University Of Southern California filed Critical University Of Southern California
Priority to JP2002590018A priority Critical patent/JP2005516267A/ja
Priority to CA002446811A priority patent/CA2446811A1/fr
Priority to EP02729189A priority patent/EP1390868A4/fr
Publication of WO2002093416A1 publication Critical patent/WO2002093416A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models
    • G06F40/47 Machine-assisted translation, e.g. using translation memory

Definitions

  • Machine translation (MT) concerns the automatic translation of natural language sentences from a first language (e.g., French) into another language (e.g., English).
  • Systems that perform MT techniques are said to "decode" the source language into the target language.
  • a statistical MT system that translates French sentences into English has three components: a language model (LM) that assigns a probability P(e) to any English string; a translation model (TM) that assigns a probability P(f|e) to any pair of English and French strings; and a decoder.
  • the decoder may take a previously unseen sentence f and try to find the e that maximizes P(e|f), or equivalently, the e that maximizes P(e) · P(f|e).
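Written out, this is the Bayes decision rule for the source-channel model: since P(f) is fixed for a given input sentence, maximizing P(e|f) is the same as maximizing the product of the language model and translation model scores:

$$\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(e)\,P(f \mid e)}{P(f)} = \arg\max_{e} P(e)\,P(f \mid e)$$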
  • a statistical machine translation (MT) system may include a translation memory (TMEM) and a decoder.
  • the TMEM may be a statistical TMEM generated from a corpus or a TMEM produced by a human.
  • the decoder may translate an input text segment using a statistical MT decoding algorithm, for example, a greedy decoding algorithm.
  • the system may generate a cover of the input text segment from text segments in the TMEM.
  • the decoder may use the cover as an initial translation in the decoding operation.
  • Figure 1 is a block diagram of a statistical machine translation system.
  • Figure 2 illustrates the results of a stochastic word alignment operation.
  • Figure 3 is a flowchart describing a stochastic process that explains how a source string can be mapped into a target string.
  • Figure 4 is a flowchart describing a greedy decoding procedure that uses both a TMEM and a statistical translation model.
  • Figure 1 illustrates a statistical machine translation (MT) system which utilizes a translation memory (TMEM) according to an embodiment.
  • the MT system 100 may be used to translate from a source language (e.g., French) to a target language (e.g., English).
  • the MT system 100 may include a language model 102, a translation model 105, a TMEM 110, and a decoder 115.
  • the MT system 100 may be based on a source-channel model.
  • the language model (the source) provides an a priori distribution P(e) of probabilities indicating which English text strings are more likely, e.g., which are grammatically correct and which are not.
  • the language model 102 may be an n-gram model trained on a large, naturally generated monolingual corpus (e.g., English) to determine the probability of a word sequence.
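As an illustration of the language model component, the sketch below scores candidate strings with a trigram model. The patent specifies only that the LM is an n-gram model trained on a large monolingual corpus; the class name, training interface, and add-one smoothing here are illustrative assumptions.

```python
import math
from collections import defaultdict

class TrigramLM:
    """Toy trigram language model with add-one smoothing (assumed;
    the patent does not specify a smoothing scheme)."""

    def __init__(self, sentences):
        self.tri = defaultdict(int)   # counts of (w1, w2, w3)
        self.bi = defaultdict(int)    # counts of (w1, w2) contexts
        self.vocab = set()
        for s in sentences:
            toks = ["<s>", "<s>"] + s.split() + ["</s>"]
            self.vocab.update(toks)
            for i in range(2, len(toks)):
                self.tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
                self.bi[(toks[i - 2], toks[i - 1])] += 1

    def logprob(self, sentence):
        """Approximate log P(e) of a candidate English string."""
        toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        lp = 0.0
        for i in range(2, len(toks)):
            num = self.tri[(toks[i - 2], toks[i - 1], toks[i])] + 1
            den = self.bi[(toks[i - 2], toks[i - 1])] + len(self.vocab)
            lp += math.log(num / den)
        return lp

lm = TrigramLM(["the cat sat on the mat", "the cat slept"])
assert lm.logprob("the cat sat") > lm.logprob("cat the sat")
```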
  • the translation model 105 may be used to determine the probability of correctness for a translation.
  • the translation model may be, for example, an IBM translation model 4, described in U.S. Patent No. 5,477,451.
  • the IBM translation model 4 revolves around the notion of a word alignment over a pair of sentences, such as that shown in Figure 2.
  • a word alignment assigns a single home (English string position) to each French word. If two French words align to the same English word, then that English word is said to have a fertility of two. Likewise, if an English word remains unaligned-to, then it has fertility zero. If a word has fertility greater than one, it is called very fertile.
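A small worked example of fertility, under the usual encoding in which the alignment maps each French position to the English position that generated it (position 0 being the NULL word); the sentences and alignment are hypothetical:

```python
# e_0 is the NULL word; the remaining entries are the English string.
english = ["NULL", "I", "do", "not", "understand"]
french = ["je", "ne", "comprends", "pas"]
# alignment[j] = 1-based English position generating french[j] (0 = NULL)
alignment = [1, 3, 4, 3]   # "ne" and "pas" both align to "not"

fertility = [alignment.count(i) for i in range(len(english))]
print(fertility)   # [0, 1, 0, 2, 1]
# "not" has fertility two (very fertile), "do" has fertility zero,
# and the NULL word generated nothing in this example.
```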
  • The word alignment in Figure 2 is shorthand for a hypothetical stochastic process by which an English string 200 gets converted into a French string 205.
  • Figure 3 is a flowchart describing, at a high level, such a stochastic process 300. Every English word in the string is first assigned a fertility (block 305). These assignments may be made stochastically according to a table n(φ|e_i).
  • the head of one English word is assigned a French string position based on the position assigned to the previous English word. If an English word e_{i-1} translates into something at French position j, then the French head word of e_i is stochastically placed in French position k with distortion probability d_1(k-j | class(e_{i-1}), class(f_k)), where "class" refers to automatically determined word classes for French and English vocabulary items.
  • NULL-generated words are permuted into the remaining vacant slots randomly. If there are φ_0 NULL-generated words, then any placement scheme is chosen with probability 1/φ_0!.
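The displayed likelihood formula that the next two bullets refer to did not survive extraction. The standard IBM Model 4 decomposition of P(a, f | e), reconstructed here from the model the patent cites rather than quoted verbatim, is:

$$P(a, f \mid e) = \prod_{i=1}^{l} n(\phi_i \mid e_i) \times \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} t(\tau_{ik} \mid e_i) \times \prod_{i=1,\,\phi_i > 0}^{l} d_1(\pi_{i1} - c_{\rho_i} \mid \mathrm{class}(e_{\rho_i}), \mathrm{class}(\tau_{i1}))$$

$$\times \prod_{i=1}^{l} \prod_{k=2}^{\phi_i} d_{>1}(\pi_{ik} - \pi_{i(k-1)} \mid \mathrm{class}(\tau_{ik})) \times \binom{m - \phi_0}{\phi_0}\, p_1^{\phi_0} (1 - p_1)^{m - 2\phi_0} \times \prod_{k=1}^{\phi_0} t(\tau_{0k} \mid \mathrm{NULL})$$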
  • the factors separated by "x" symbols denote fertility, translation, head permutation, non-head permutation, null-fertility, and null-translation probabilities, respectively.
  • the symbols in this formula are: l (the length of e), m (the length of f), e_i (the i-th English word in e), e_0 (the NULL word), φ_i (the fertility of e_i), φ_0 (the fertility of the NULL word), τ_{ik} (the k-th French word produced by e_i in a), π_{ik} (the position of τ_{ik} in f), ρ_i (the position of the first fertile word to the left of e_i in a), and c_{ρ_i} (the ceiling of the average of all π_{ρ_i k} for ρ_i, or 0 if ρ_i is undefined).
  • the TMEM 110 may be a pre-compiled TMEM including human-produced translation pairs.
  • a TMEM such as the Hansard Corpus, or a portion thereof, may be used.
  • the Hansard Corpus includes parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament.
  • the Hansard Corpus is presented as sequences of sentences in a version produced by IBM.
  • the IBM collection contains nearly 2.87 million parallel sentence pairs in the set.
  • the TMEM may be a statistical TMEM.
  • a statistical TMEM may be generated by training the translation model with a training corpus, e.g., the Hansard Corpus, or a portion thereof, and then extracting the Viterbi (most probable word-level) alignment of each sentence, i.e., the alignment of highest probability, to extract tuples of the form <e_i, e_{i+1}, ..., e_{i+k}; f_j, f_{j+1}, ..., f_{j+l}; a_j, a_{j+1}, ..., a_{j+l}>, where e_i, e_{i+1}, ..., e_{i+k} represents a contiguous English phrase, f_j, f_{j+1}, ..., f_{j+l} represents a contiguous French phrase, and a_j, a_{j+1}, ..., a_{j+l} represents the Viterbi alignment between the two phrases.
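The following sketch shows one way the tuple-extraction step could look in code, enforcing the "contiguous alignment" criterion described below. The function name, data layout, and phrase-length cap are assumptions, not the patent's implementation:

```python
def extract_tuples(english, french, alignment, max_phrase_len=8):
    """english/french: token lists.  alignment[j] = 1-based position of
    the English word generating french[j], or 0 for the NULL word."""
    tuples = []
    m = len(french)
    for j1 in range(1, m + 1):
        # require French phrases of at least two words
        for j2 in range(j1 + 1, min(j1 + max_phrase_len - 1, m) + 1):
            span = range(j1, j2 + 1)
            links = {alignment[j - 1] for j in span} - {0}
            if not links:
                continue
            i1, i2 = min(links), max(links)
            # contiguity: no English word in [i1, i2] may generate a
            # French word outside the span [j1, j2]
            leaks = any(i1 <= alignment[j - 1] <= i2
                        for j in range(1, m + 1) if j not in span)
            if not leaks:
                tuples.append((tuple(english[i1 - 1:i2]),
                               tuple(french[j1 - 1:j2]),
                               tuple(alignment[j1 - 1:j2])))
    return tuples
```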
  • the TMEM may contain, in addition to the contiguous French/English phrases, additional information specific to the translation model that is employed.
  • the tuples may be selected based on certain criteria.
  • the tuples may be limited to "contiguous" alignments, i.e., alignments in which the words in the English phrase generated only words in the French phrase and each word in the French phrase was generated either by the NULL word or a word from the English phrase.
  • the tuples may be limited to those in which the English and French phrases contained at least two words.
  • the tuples may be limited to those that occur most often in the data.
  • one possible English translation equivalent may be chosen for each French phrase.
  • a Frequency-based Translation Memory may be created by associating with each French phrase the English equivalent that occurred most often in the collection of phrases that are extracted.
  • a Probability-based Translation Memory may be created by associating with each French phrase the English equivalent that corresponds to the alignment of highest probability.
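A minimal sketch of the Frequency-based variant, reusing the tuples from the extraction sketch above (names are assumptions); a Probability-based TMEM would instead keep, for each French phrase, the equivalent from the highest-probability alignment:

```python
from collections import Counter, defaultdict

def build_ftmem(tuples):
    """Map each French phrase to its most frequent English equivalent."""
    counts = defaultdict(Counter)
    for eng_phrase, fr_phrase, _alignment in tuples:
        counts[fr_phrase][eng_phrase] += 1
    return {fr: eng.most_common(1)[0][0] for fr, eng in counts.items()}
```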
  • the decoder 115 may utilize a greedy decoding operation 400, such as that described in the flowchart shown in Figure 4, to produce an output sentence. Greedy decoding methods may start out with a random, approximate solution and then try to improve it incrementally until a satisfactory solution is reached.
  • the decoder 115 may receive an input sentence to be translated (block 405). Although in this example the text segment being translated is a sentence, virtually any other text segment could be used, for example, clauses, paragraphs, or entire treatises.
  • the decoder 115 may generate a "cover" for the input sentence using phrases from the TMEM (block 410).
  • the derivation attempts to cover as much of the input sentence as possible with translation pairs from the TMEM 110, using the longest phrases in the TMEM.
  • the words in the input that are not part of any phrase extracted from the TMEM 110 may be "glossed," i.e., replaced with an essentially word-for-word translation.
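A sketch of this cover-building step (block 410), assuming the TMEM is a mapping from French word tuples to English word tuples and `gloss` is a single-word dictionary; greedy longest-match is one plausible reading of "using the longest phrases in the TMEM":

```python
def build_cover(french_words, tmem, gloss):
    cover, j = [], 0
    n = len(french_words)
    while j < n:
        for k in range(n, j, -1):          # try the longest phrase first
            phrase = tuple(french_words[j:k])
            if phrase in tmem:
                cover.extend(tmem[phrase])
                j = k
                break
        else:                              # no TMEM phrase starts at j:
            w = french_words[j]            # fall back to a word-for-word
            cover.append(gloss.get(w, w))  # gloss
            j += 1
    return cover
```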
  • the decoder 115 estimates the probability of correctness of the current translation, P(c), based on probabilities assigned by the language model and the translation model (block 420).
  • the decoder 115 tries to improve the alignment (block 425). That is, the decoder tries to find an alignment (and implicitly, a translation) of higher probability by applying one or more sentence modification operators, described below.
  • the use of a word-level alignment and the particular operators described below were chosen for this particular embodiment. However, alternative embodiments using different statistical models may benefit from different or additional operations.
  • translateOneOrTwoWords(j_1, e_1, j_2, e_2): This operation changes the translation of one or two French words, those located at positions j_1 and j_2, from e_{f_j1} and e_{f_j2} into e_1 and e_2. If e_{f_j} is a word of fertility 1 and e_k is NULL, then e_{f_j} is deleted from the translation. If e_{f_j} is the NULL word, the word e_k is inserted into the translation at the position that yields an alignment of highest probability.
  • translateAndInsert(j, e_1, e_2): This operation changes the translation of the French word located at position j from e_{f_j} into e_1 and simultaneously inserts word e_2 at the position that yields the alignment of highest probability.
  • removeWordOfFertility0(i): This operation deletes the word of fertility 0 at position i in the current alignment.
  • swapSegments(i_1, i_2, j_1, j_2): This operation creates a new alignment from the old one by swapping non-overlapping English word segments [i_1, i_2] and [j_1, j_2]. During the swap operation, all existing links between English and French words are preserved. The segments can be as small as a word or as long as |e|-1 words (a toy rendering of this operation appears after this list).
  • joinWords(i_1, i_2): This operation eliminates from the alignment the English word at position i_1 (or i_2) and links the French words generated by e_{i1} (or e_{i2}) to e_{i2} (or e_{i1}).
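A toy rendering of the swapSegments operator, under the simplifying assumption that the hypothesis stores, for each English word, the set of French positions it generates; moving English words together with their link sets is what preserves all English-French links:

```python
def swap_segments(english, links, i1, i2, j1, j2):
    """Swap non-overlapping 0-based segments [i1, i2] and [j1, j2],
    with i2 < j1.  links[i] holds the French positions of english[i]."""
    assert i2 < j1, "segments must not overlap"
    order = (list(range(0, i1)) + list(range(j1, j2 + 1)) +
             list(range(i2 + 1, j1)) + list(range(i1, i2 + 1)) +
             list(range(j2 + 1, len(english))))
    return [english[i] for i in order], [links[i] for i in order]
```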
  • the decoder 115 may estimate the probabilities of correctness, P(M_1) ... P(M_n), for each of the results of the sentence modification operations, i.e., the probability for each new resulting translation is determined (block 430).
  • the decoder 115 may determine whether any of the new translations are better than the current translation by comparing their respective probabilities of correctness (block 435) .
  • the best new translation (that is, the translation solution having the highest probability of correctness) may be set as the current translation (block 440) and the decoding process may return to block 425 to perform one or more of the sentence modification operations on the new current translation solution.
  • the process may repeat until the sentence modification operations cease (as determined in block 435) to produce translation solutions having higher probabilities of correctness, at which point, the decoding process halts and the current translation is output as the final decoding solution (block 445) .
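Putting blocks 425-440 together, the hill-climbing loop has roughly this shape; `neighbors` stands for whatever subset of the five operators is enabled, and `prob` for the combined language model and translation model score P(c). All names are placeholders, not the patent's implementation:

```python
def greedy_decode(initial, neighbors, prob, max_iters=100):
    """Climb from the TMEM cover to a local maximum of P(c)."""
    current, current_p = initial, prob(initial)
    for _ in range(max_iters):             # optional iteration budget
        best, best_p = None, current_p
        for cand in neighbors(current):    # all one-operation moves
            p = prob(cand)
            if p > best_p:
                best, best_p = cand, p
        if best is None:                   # no operation improves P(c)
            break
        current, current_p = best, best_p
    return current
```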
  • the decoder 115 could cease after a predetermined number of iterations chosen, for example, either by a human end-user or by an application program using the decoder 115 as a translation engine.
  • the decoder 115 may use a process loop (blocks 425-440) to iterate exhaustively over all alignments that are one operation away from the alignment under consideration.
  • the decoder chooses the alignment of highest probability, until the probability of the current alignment can no longer be improved by the operations of sentence modification block 425.
  • either all of the five sentence modification operations can be used or any subset thereof may be used to the exclusion of the others, depending on the preferences of the system designer and/or end-user.
  • the most time-consuming operations in the decoder may be swapSegments, translateOneOrTwoWords, and translateAndInsert.
  • SwapSegments iterates over all possible non-overlapping span pairs that can be built on a sequence of length |e|.
  • TranslateOneOrTwoWords iterates over |f|^2 × |t|^2 alignments, where |f| is the length of the French sentence and |t| is the number of translations associated with each word (which may be limited, e.g., to the top 10 translations).
  • TranslateAndInsert iterates over |f| × |t| × |z| alignments, where |z| is the size of a list of words that have a high probability of having fertility 0 (e.g., 1024 words).
  • the decoder may be designed to omit one or more of these slower operations in order to speed up decoding, but potentially at the cost of accuracy.
  • the decoder may be designed to use different or additional sentence modification operations according to the objectives of the system designer and/or end-user.
  • a cover sentence may produce better results than, say, a word-by-word gloss of the input sentence because the cover sentence may bias the decoder to search in sub-spaces that are likely to yield translations of high probability, subspaces which otherwise may not be explored.
  • One of the strengths of the TMEM is its ability to encode contextual, long-distance dependencies that are incongruous with the parameters learned by a statistical MT system utilizing a context-poor, reductionist channel model.
  • Alternative ranking techniques may be used by the decoder 115 that would permit the decoder to prefer a TMEM-based translation in some instances even though that translation may not be the best translation according to the probabilistic channel model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A statistical machine translation (MT) system (100) includes a translation memory (TMEM) (110) and a decoder (115). The decoder (115) may translate an input text segment (102) using a statistical MT decoding algorithm, for example a greedy decoding algorithm. The system may generate a cover of the input text segment from text segments in the TMEM (110). The decoder (115) may use the cover as an initial translation in the decoding operation.
PCT/US2002/015057 2001-05-11 2002-05-13 Systeme de traduction a base de memoire statistique WO2002093416A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2002590018A JP2005516267A (ja) 2001-05-11 2002-05-13 統計的メモリベースの翻訳システム
CA002446811A CA2446811A1 (fr) 2001-05-11 2002-05-13 Systeme de traduction a base de memoire statistique
EP02729189A EP1390868A4 (fr) 2001-05-11 2002-05-13 Systeme de traduction a base de memoire statistique

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US09/854,327 2001-05-11
US09/854,327 US7533013B2 (en) 2000-05-11 2001-05-11 Machine translation techniques
US29185301P 2001-05-17 2001-05-17
US60/291,853 2001-05-17
US10/143,382 2002-05-09

Publications (1)

Publication Number Publication Date
WO2002093416A1 true WO2002093416A1 (fr) 2002-11-21

Family

ID=26967017

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/015057 WO2002093416A1 (fr) 2001-05-11 2002-05-13 Systeme de traduction a base de memoire statistique

Country Status (1)

Country Link
WO (1) WO2002093416A1 (fr)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5477451A (en) * 1991-07-25 1995-12-19 International Business Machines Corp. Method and system for natural language translation
US5724593A (en) * 1995-06-07 1998-03-03 International Language Engineering Corp. Machine assisted translation tools
US6131082A (en) * 1995-06-07 2000-10-10 Int'l.Com, Inc. Machine assisted translation tools utilizing an inverted index and list of letter n-grams

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1390868A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE202005021909U1 (de) 2004-03-16 2011-04-14 Star Ag Computer-aided tool for a method of creating foreign-language documents
DE202005021923U1 (de) 2004-04-02 2011-06-09 Star Ag Computer-aided tool for a method of creating foreign-language documents
DE202005021922U1 (de) 2004-04-03 2011-08-17 Star Ag Computer-aided tool for a method of creating documents
EP1889149A2 (fr) * 2005-05-06 2008-02-20 Trados GmbH Electronic translation services using a translation machine and a translation memory
EP1889149A4 (fr) * 2005-05-06 2010-03-10 Trados Gmbh Electronic translation services using a translation machine and a translation memory

Similar Documents

Publication Publication Date Title
US7295962B2 (en) Statistical memory-based translation system
CA2408819C (fr) Machine translation techniques
US7689405B2 (en) Statistical method for building a translation memory
US8239188B2 (en) Example based translation apparatus, translation method, and translation program
US8249856B2 (en) Machine translation
Choudhury et al. Investigation and modeling of the structure of texting language
US7620538B2 (en) Constructing a translation lexicon from comparable, non-parallel corpora
CN103631772A (zh) Machine translation method and apparatus
WO2007068123A1 (fr) Method and system for training and applying a distortion component to machine translation
Bangalore et al. Statistical machine translation through global lexical selection and sentence reconstruction
CN102662932B (zh) Method for constructing a tree structure and a tree-structure-based machine translation system
CN111814493B (zh) Machine translation method and apparatus, electronic device, and storage medium
CN115587590A (zh) Training corpus construction method, translation model training method, and translation method
Callison-Burch et al. Co-training for statistical machine translation
WO2002093416A1 (fr) Systeme de traduction a base de memoire statistique
Rambow et al. Parsing arabic dialects
Amengual et al. Using categories in the EuTrans system
JP2003263433A (ja) Method for generating a translation model in a statistical machine translator
Cazzaro et al. Align and Augment: Generative Data Augmentation for Compositional Generalization
KR20140049148A (ko) Part-of-speech tagging method based on morpheme segmentation, and apparatus therefor
CN108153743A (zh) Similarity-based intelligent offline translation machine
Cavalli-Sforza et al. Using morphology to improve Example-Based Machine Translation
Saers et al. Ternary Segmentation for Improving Search in Top-down Induction of Segmental ITGs
Latiri et al. Phrase-based machine translation based on text mining and statistical language modeling techniques
Ruzsics Multi-level Modelling for Upstream Text Processing

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US US US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2446811

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 2002729189

Country of ref document: EP

Ref document number: 2002590018

Country of ref document: JP

Ref document number: 1886/DELNP/2003

Country of ref document: IN

WWE Wipo information: entry into national phase

Ref document number: 028125452

Country of ref document: CN

WWP Wipo information: published in national office

Ref document number: 2002729189

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWW Wipo information: withdrawn in national office

Ref document number: 2002729189

Country of ref document: EP