US20080133218A1 - Example based machine translation system - Google Patents


Info

Publication number
US20080133218A1
US20080133218A1 (application US 11/935,938)
Authority
US
United States
Prior art keywords
continuous
alignment
word
alignments
sentence
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/935,938
Inventor
Ming Zhou
Jin-Xia Huang
Chang Ning (Tom) Huang
Wei Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Application filed by Microsoft Corp
Priority to US11/935,938
Publication of US20080133218A1
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC (assignment of assignors interest from MICROSOFT CORPORATION; see document for details)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/45Example-based machine translation; Alignment

Definitions

  • the present invention relates to machine translation. More specifically, the present invention relates to an example based machine translation system or translation memory system.
  • Machine translation is a process by which an input sentence (or sentence fragment) in a source language is provided to a machine translation system.
  • the machine translation system outputs one or more translations of the source language input as a target language sentence, or sentence fragment.
  • There are a number of different types of machine translation systems, including example based machine translation (EBMT) systems.
  • EBMT systems generally perform two fundamental operations in performing a translation. Those operations include matching and transfer.
  • the matching operation retrieves a “closest match” for a source language input string from an example database.
  • the transfer operation generates a translation in terms of the matched example(s). Specifically, the transfer operation is the process of obtaining the translation of the input string by performing alignment within the matched bilingual example(s).
  • “Alignment” as used herein means deciding which fragment in a target language sentence (or example) corresponds to the fragment in the source language sentence being translated.
  • Some EBMT systems perform similarity matching based on syntactic structures, such as parse trees or logical forms. Of course, these systems require the inputs to be parsed to obtain the syntactic structure. This type of matching method can make suitable use of examples and enhance the coverage of the example base. However, these types of systems run into trouble in certain domains, such as software localization. In software localization, software documentation and code are localized or translated into different languages. The terms used in software manuals render the parsing accuracy of conventional EBMT systems very low, because even shallow syntax information (such as word segmentation and part-of-speech tags) is often erroneous.
  • Such systems also have high example base maintenance costs, because all of the examples saved in the example base must be parsed and corrected by humans whenever the example base needs to be updated.
  • Other EBMT systems and translation memory systems employ string matching.
  • In these types of systems, example matching is typically performed using a similarity metric, normally the edit distance between the input fragment and the example.
  • the edit distance metric only provides a good indication of matching accuracy when a complete sentence or a complete sentence segment has been matched.
  • correspondences are found not by using a parser, but by utilizing co-occurrence information and geometric information.
  • Co-occurrence information is obtained by examining whether there are co-occurrences of source language fragments and target language fragments in a corpus.
  • Geometric information is used to constrain the alignment space.
  • the correspondences located are grammarless. Once the word correspondences are extracted, they are stored in an example base: the source language sentence, the corresponding target language sentence, and the word correspondence information are all saved in the example base. During translation, an example in the example base is stimulated only if a fragment in the source language side of the example matches the input string.
  • the present invention performs machine translation by matching fragments of a source language input to portions of examples in an example base. All relevant examples are identified in the example base, in which fragments of the target language sentence are aligned against fragments of the source language sentence within each example. A translation component then substitutes the aligned target language phrases from the examples for the matched fragments in the source language input.
  • example matching is performed based on position-marked term frequency/inverse document frequency (P-TF/IDF) index scores.
  • TF/IDF weights are calculated for blocks in the source language input that are covered by the examples to find a best block combination. The best examples for each block in the block combination are also found by calculating a TF/IDF weight.
  • the relevant examples once identified are provided to an alignment component.
  • the alignment component first performs word alignment to obtain alignment anchor points between the source language sentence and the target language sentence in the example pair under consideration. Then, all continuous alignments between the source language sentence and the target language sentence are generated, as are all non-continuous alignments. Scores are calculated for each alignment and the best are chosen as the translation.
  • a confidence metric is calculated for the translation output.
  • the confidence metric is used to highlight portions of the translation output which need the user's attention. This draws the user's attention to such areas for possible modification.
  • FIG. 1 is a block diagram of an environment in which the present invention can be used.
  • FIG. 2 is a block diagram of a translation engine in accordance with one embodiment of the present invention.
  • FIG. 3 is a flow diagram illustrating the overall operation of the system shown in FIG. 2 .
  • FIG. 4 is a flow diagram illustrating example matching in accordance with one embodiment of the present invention.
  • FIG. 5 illustrates a plurality of different examples corresponding to an input sentence in accordance with one embodiment of the present invention.
  • FIG. 6 is a data flow diagram illustrating word alignment in accordance with one embodiment of the present invention.
  • FIG. 7 is a flow diagram illustrating phrase alignment in accordance with one embodiment of the present invention.
  • FIGS. 8 and 9 illustrate continuous and non-continuous alignments.
  • FIG. 10 is a flow diagram illustrating the generation of continuous alignments in accordance with one embodiment of the present invention.
  • FIG. 11 is a flow diagram illustrating the generation of non-continuous alignments in accordance with one embodiment of the present invention.
  • The present invention involves a machine translation system. However, before describing the present invention in greater detail, one embodiment of an environment in which the present invention can be used will be described.
  • FIG. 1 illustrates an example of a suitable computing system environment 100 on which the invention may be implemented.
  • the computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100 .
  • the invention is operational with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
  • the invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including memory storage devices.
  • an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110 .
  • Components of computer 110 may include, but are not limited to, a processing unit 120 , a system memory 130 , and a system bus 121 that couples various system components including the system memory to the processing unit 120 .
  • the system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
  • Computer 110 typically includes a variety of computer readable media.
  • Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media.
  • Computer readable media may comprise computer storage media and communication media.
  • Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110 .
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
  • the system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132 .
  • A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110 , such as during start-up, is typically stored in ROM 131 .
  • RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120 .
  • FIG. 1 illustrates operating system 134 , application programs 135 , other program modules 136 , and program data 137 .
  • the computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media.
  • FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152 , and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media.
  • removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like.
  • the hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140 , and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150 .
  • hard disk drive 141 is illustrated as storing operating system 144 , application programs 145 , other program modules 146 , and program data 147 . Note that these components can either be the same as or different from operating system 134 , application programs 135 , other program modules 136 , and program data 137 . Operating system 144 , application programs 145 , other program modules 146 , and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
  • a user may enter commands and information into the computer 110 through input devices such as a keyboard 162 , a microphone 163 , and a pointing device 161 , such as a mouse, trackball or touch pad.
  • Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB).
  • a monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190 .
  • computers may also include other peripheral output devices such as speakers 197 and printer 196 , which may be connected through an output peripheral interface 190 .
  • the computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180 .
  • the remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110 .
  • the logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173 , but may also include other networks.
  • Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170 .
  • When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173 , such as the Internet.
  • the modem 172 which may be internal or external, may be connected to the system bus 121 via the user-input interface 160 , or other appropriate mechanism.
  • In a networked environment, program modules depicted relative to the computer 110 , or portions thereof, may be stored in the remote memory storage device.
  • FIG. 1 illustrates remote application programs 185 as residing on remote computer 180 . It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • the present invention can be carried out on a computer system such as that described with respect to FIG. 1 .
  • the present invention can be carried out on a server, a computer devoted to message handling, or on a distributed system in which different portions of the present invention are carried out on different parts of the distributed computing system.
  • FIG. 2 is a block diagram of a translation engine 200 in accordance with one embodiment of the present invention.
  • Translation engine 200 receives an input sentence (or sentence fragment) in a source language as source language input 202 .
  • Engine 200 then accesses example base 204 and term base 206 and generates a target language output 208 .
  • Target language output 208 is illustratively a translation of source language input 202 into the target language.
  • Example base 204 is a database of word aligned target language and source language examples generated from example base generator 210 based on a sentence aligned bilingual corpus of examples 212 .
  • Aligned bilingual corpus of examples 212 illustratively contains paired sentences (sentences in the source language aligned or paired with translations of those sentences in the target language).
  • Example base generator 210 generates example base 204 indexed in what is referred to as position-marked term frequency/inverse document frequency (P-TF/IDF) indexing.
  • TF/IDF is a mature information retrieval technique and is a type of word indexing that is used to enable efficient document retrieval.
  • a TF/IDF weight (or score) is calculated for each term (such as a lemma, or a term with a part-of-speech (POS) tag) in an index file. The higher the TF/IDF weight, the more important a term is.
  • the TF/IDF weight is determined by the following formulas:
  • $\mathrm{TF}_{ij} = \log(n_{ij} + 1)$  (1)
  • $\mathrm{IDF}_{i} = \log\!\left(\frac{N}{n_{i}}\right) + 1$  (2)
  • $\mathrm{TFIDF}_{ij} = \dfrac{\mathrm{TF}_{ij} \cdot \mathrm{IDF}_{i}}{\sqrt{\sum_{n_{j}} (\mathrm{TF}_{ij} \cdot \mathrm{IDF}_{i})^{2}}}$  (3)
  • where N is the number of examples in the example base (EB); $n_{i}$ is the total number of occurrences of term i in the EB; $n_{j}$ is the total number of terms in example j; $n_{ij}$ is the total number of occurrences of term i in example j; and $\mathrm{TFIDF}_{ij}$ is term i's TF/IDF weight in example j.
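  • As a minimal sketch of equations (1)-(3), assuming a simple in-memory example base where each example is given as a list of terms (uni- or bi-terms; all names here are illustrative):

```python
import math
from collections import Counter

def build_tfidf(examples):
    """Compute TFIDF_ij per equations (1)-(3); 'examples' is a list of
    examples, each given as a list of terms."""
    N = len(examples)                       # N: number of examples in the EB
    eb_counts = Counter()                   # n_i: occurrences of term i in the EB
    for example in examples:
        eb_counts.update(example)

    idf = {t: math.log(N / n_i) + 1 for t, n_i in eb_counts.items()}   # eq. (2)

    weights = []                            # weights[j][i] = TFIDF_ij
    for example in examples:
        counts = Counter(example)           # n_ij: occurrences of term i in example j
        raw = {t: math.log(n_ij + 1) * idf[t]                          # eq. (1) times IDF_i
               for t, n_ij in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))             # eq. (3) denominator
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

print(build_tfidf([["anti-virus", "tool"], ["type", "of", "tool"]]))
```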
  • Such a system is employed in the present invention because the word index can enable efficient example retrieval, and also because it is believed to reflect the factors that should be considered in sentence similarity calculation.
  • factors include the number of matched words in each example (the more matched words, the higher the example weight), the differing importance of different words in the example (the more frequently a term occurs across the example base, the lower its weight), the length of a given example (the longer the example, the lower the example weight), and the number of extra or mismatched words in the example (the more extra or mismatched words, the lower the example weight).
  • the traditional TF/IDF technique is extended to a position-marked TF/IDF format. This reflects not only the term weight, but also the term position in each example.
  • Table 1 shows an exemplary P-TF/IDF indexing file for the terms “anti-virus tool” and “type of”.
  • one embodiment of the present invention uses bi-term indexing instead of uni-term indexing.
  • the first column shows the bi-term unit indexed.
  • the second column shows the average TF/IDF weight of the bi-term in the example base
  • the third column shows the related example's index number, the weight of the bi-term in that example, and the position of the bi-term in the example sentence.
  • the bi-term “anti-virus tool” has an average TF/IDF weight of 0.33. It can be found in the example identified by index number 102454, etc.
  • the weight of the particular bi-term in the example sentence where it is found is 0.45, and the position of the bi-term in the example sentence is position number 2.
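  • As a sketch, the P-TF/IDF index of Table 1 can be represented as a mapping from each bi-term to its average weight in the example base plus a posting list; the field names below are illustrative, and the single entry mirrors the “anti-virus tool” row described above (the corresponding values for “type of” are not given in the text):

```python
from dataclasses import dataclass

@dataclass
class Posting:
    example_id: int   # index number of an example containing the bi-term
    weight: float     # TF/IDF weight of the bi-term in that example
    position: int     # position of the bi-term in the example sentence

# bi-term -> (average TF/IDF weight in the EB, posting list)
index = {
    ("anti-virus", "tool"): (0.33, [Posting(example_id=102454, weight=0.45, position=2)]),
}
```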
  • example base generator 210 can be any known example base generator that produces examples indexed as shown in Table 1. Generator 210 illustratively calculates the TF/IDF weights (or simply indexes previously calculated weights), and it also identifies the position of the bi-term in the example sentence.
  • Term base 206 is generated by term base generator 214 , which also accesses bilingual example corpus 212 .
  • Term base generator 214 simply generates correspondences between individual terms in the source and target language.
  • Engine 200 illustratively includes preprocessing component 216 , example matching component 218 , phrase alignment component 220 , translation component 222 and post processing component 224 .
  • Engine 200 first receives the source language input sentence 202 to be translated. This is indicated by block 226 in FIG. 3 .
  • preprocessing component 216 performs preprocessing on source language input 202 .
  • Preprocessing component 216 illustratively identifies the stemmed forms of words in the source language input 202 .
  • other preprocessing can be performed as well, such as employing part-of-speech tagging or other preprocessing techniques. It should also be noted, however, that the present invention can be employed on surface forms as well and thus preprocessing may not be needed.
  • preprocessing is indicated by block 228 in FIG. 3 .
  • example matching component 218 matches the preprocessed source language input against examples in example base 204 .
  • Component 218 also finds all candidate word sequences (or blocks). The best combinations of blocks are then located, as is the best example for each block. This is indicated by blocks 230 , 232 and 234 in FIG. 3 and is described in greater detail with respect to FIGS. 4 and 5 below.
  • The relevant examples 236 for each block are obtained and provided to phrase alignment component 220 .
  • the corresponding target language block is then located, and the matched phrases in the source language are replaced with the target language correspondences located. This is indicated by blocks 235 and 238 in FIG. 3 .
  • the location of the target language correspondences in this way is performed by phrase alignment component 220 and is illustrated in greater detail with respect to FIGS. 6-10 below.
  • After these stages, the source language input may still contain a number of terms that were not translated through bi-term matching and phrase alignment.
  • translation component 222 accesses term base 206 to obtain a translation of the terms which have not yet been translated.
  • Component 222 also replaces the aligned source language phrases with associated portions of the target language examples. This is indicated by block 240 in FIG. 3 .
  • the result is then provided to post processing component 224 .
  • Post processing component 224 calculates a confidence measure for the translation results, as indicated by block 242 in FIG. 3 , and can optionally highlight portions of the translation results that require the user's attention, as indicated by block 244 . This directs the user to translation output that has a low confidence metric associated with it.
  • Target language output 208 thus illustratively includes the translation result, highlighted to indicate low-confidence areas.
  • FIG. 4 is a flow diagram which better illustrates the operation of example matching component 218 .
  • example matching component 218 simply locates examples which contain the bi-term sequences that are also found in the input sentence.
  • the identifiers of the examples containing a given bi-term sequence can easily be found (for example, in the third column of Table 1).
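  • A minimal sketch of this retrieval step, reusing the Posting/index structure from the Table 1 sketch above (function and variable names are illustrative):

```python
def relevant_examples(input_terms, index):
    """Collect the ids of examples whose indexed bi-terms also occur in
    the input sentence, remembering where each bi-term matched."""
    hits = {}
    for i in range(len(input_terms) - 1):
        bi_term = (input_terms[i], input_terms[i + 1])
        if bi_term in index:
            _avg_weight, postings = index[bi_term]
            for posting in postings:
                hits.setdefault(posting.example_id, []).append((i, bi_term))
    return hits

print(relevant_examples(["run", "the", "anti-virus", "tool"], index))
# {102454: [(2, ('anti-virus', 'tool'))]}
```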
  • all matching blocks between the selected relevant example and the input sentence are identified. This is indicated by block 252 .
  • FIG. 5 better illustrates the meaning of a “matching block”.
  • the input sentence is composed of seven terms (term 1 through term 7 ), each of which is a word in this example.
  • the input sentence contains four indexed bi-terms, identified as bi-term 3 - 4 (which covers terms 3 and 4 in the input sentence), bi-term 4 - 5 (which covers terms 4 and 5 ), bi-term 5 - 6 (which covers terms 5 and 6 ) and bi-term 6 - 7 (which covers terms 6 and 7 ).
  • When the same continuous sequence of bi-terms occurs in an example (such as example 1 in FIG. 5 ), where the bi-term sequence appears continuous, the bi-terms in the source language input sentence can be combined into a single block (block 3 - 7 ).
  • example 2 contains a continuous bi-term sequence that can be blocked in the input sentence as block 3 - 5 .
  • Example 3 contains a continuous bi-term sequence that can be blocked in the input sentence as block 5 - 7 .
  • Example 4 contains a continuous bi-term sequence that can be blocked in the input sentence as block 4 - 5 and example 5 contains a bi-term sequence that can be blocked in the input sentence as block 6 - 7 .
  • Example matching component 218 thus finds the best block combination of terms in the input sentence by calculating a TF/IDF weight for each block combination. This is indicated by block 254 in FIG. 4 .
  • the best block combination problem can be viewed as a shortest-path location problem.
  • a dynamic programming algorithm can be utilized.
  • the “edge length” (or path length) associated with each block combination is calculated by the following equation:
  • $\mathrm{EdgeLen}_{i} = \sum_{k=m}^{n} \mathrm{TFIDF}_{k}$
  • where i is the “edge” (block) index number in the input sentence; m is the word indexing number of the “edge” i's starting point; n is the word indexing number of the “edge” i's ending point; $\mathrm{TFIDF}_{k}$ is term k's average TF/IDF weight in the EB; and $\mathrm{EdgeLen}_{i}$ is the weight of block i.
  • each block combination identified has its weight calculated as indicated by the above equation.
  • each block combination for the input sentence will have a weight or path length associated therewith.
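  • A dynamic-programming sketch of this shortest-path view, where word positions are nodes and candidate blocks are weighted edges; taking the best combination to be the path with the maximum total weight is an assumption here, and the block weights are hypothetical:

```python
def best_block_combination(sentence_len, blocks):
    """blocks maps (m, n) spans (0-based start, exclusive end) to their
    EdgeLen weight; returns the highest-weight block combination that
    spans positions 0..sentence_len."""
    NEG = float("-inf")
    best = [(NEG, [])] * (sentence_len + 1)   # best[p] = (weight, blocks) up to position p
    best[0] = (0.0, [])
    for p in range(sentence_len):
        if best[p][0] == NEG:
            continue
        # A word not covered by any block can always be skipped at zero weight.
        edges = {(p, p + 1): 0.0}
        edges.update({(m, n): w for (m, n), w in blocks.items() if m == p})
        for (m, n), w in edges.items():
            if best[p][0] + w > best[n][0]:
                best[n] = (best[p][0] + w, best[p][1] + ([(m, n)] if w > 0 else []))
    return best[sentence_len]

# Hypothetical weights for the blocks of FIG. 5 (0-based spans; block 3-7 -> (2, 7)).
print(best_block_combination(7, {(2, 7): 1.9, (2, 5): 1.2, (4, 7): 1.1,
                                 (3, 5): 0.7, (5, 7): 0.8}))
# (2.0, [(2, 5), (5, 7)]) -- blocks 3-5 and 6-7 beat the single block 3-7 here
```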
  • The best example for each block is likewise scored against the input sentence by summing the TF/IDF weights of their common terms:
  • $\mathrm{Similarity}_{j} = \sum_{k=1}^{K} \mathrm{TFIDF}_{kj}$
  • where K is the total number of common terms included both in example j and the input sentence; $\mathrm{TFIDF}_{kj}$ is term k's TF/IDF weight in example j; and $\mathrm{Similarity}_{j}$ is the matching weight between example j and the input sentence.
  • Finding the TFIDF weight associated with each example is indicated by block 256 in FIG. 4 .
  • example matching component 218 has now calculated a score associated with each different block combination into which the input sentence can be divided.
  • Component 218 has also calculated a score for each example associated with every block identified in the different block combinations.
  • Component 218 can then prune the list of examples to those having a sufficient similarity score, or a sufficient similarity score combined with the block combination score, and provide the relevant examples 236 in FIG. 2 to phrase alignment component 220 .
  • Phrase alignment component 220 thus accepts as input an example, which is in fact a sentence (or text fragment) pair including a source sentence (or fragment) and a target sentence (or fragment), along with boundary information specifying the portion of the example's source sentence that matched the input sentence to be translated.
  • The job of phrase alignment component 220 is to align the possible translations in the target sentence of the given example with the matched phrases or word sequences in the source sentence of the same example, and to select a best target fragment as the translation for that matched part of the source sentence, and therefore as the translation for the corresponding matched part of the input sentence.
  • phrase alignment component 220 first generates a series of word alignments as anchors in the phrase alignment process. Based on these anchors, component 220 then attempts to find the correspondent phrases in the target sentence within an example for the matched part of the source sentence in the same example.
  • FIG. 6 is a flow diagram which better illustrates the word alignment process in order to obtain anchors in accordance with one embodiment of the present invention.
  • FIG. 6 shows that in the word alignment process, an example under consideration (which includes source language input sentence 301 and target language sentence 300 ) is input to a first alignment component which operates as a bilingual dictionary aligner 302 .
  • Aligner 302 estimates how likely it is that two words in different languages are translations of one another. There are a wide variety of different ways in which this has been done. Some metrics for evaluating this type of translation confidence include a translation probability such as that found in Brown et al., The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 19(2), pp. 263-311 (1993).
  • Bilingual dictionary aligner 302 thus establishes high confidence single word anchor points which are direct word translations from source sentence to target sentence of example 300 . These are used later during phrase alignment.
  • Next, word segmentation is conducted on the target sentence. This can be done in any of a wide variety of different, known ways, and the present invention is not limited to any specific word segmentation technique. Word segmentation of the target sentence of the example 300 is indicated by block 304 in FIG. 6 .
  • The enhanced bilingual dictionary based aligner 306 is then employed, which not only utilizes word similarities computed from a bilingual dictionary, but also uses a distortion model to describe how likely it is that one position in the source sentence aligns to a given position in the target sentence.
  • Some such models include absolute distortion (such as in Brown, cited above), relative offset (such as in Brown), hidden Markov model (HMM)-based systems and structure constraint systems (also found in Brown).
  • a monolingual dictionary is accessed to merge characters into words and words into phrases. This is indicated by block 308 in FIG. 6 .
  • Even if the bilingual dictionary is very large, its coverage is still very limited because of the basic complexity of language.
  • Using a monolingual dictionary, separate words that should not be separate (because they are part of a phrase) can be identified as a phrase. Thus, phrase merging is implemented.
  • any known statistical alignment component can be used in an effort to align unaligned words. This is indicated by block 310 .
  • Such statistical alignment techniques are known and are simply provided with a threshold to constrain the statistical alignment space.
  • the word alignment results 312 are output by the word alignment system.
  • While the word alignment mechanism includes translation information from bilingual dictionary aligner 302 , the distortion model based aligner 306 , phrase merging component 308 and statistical alignment component 310 , other sources of information can be used as well.
  • For example, a t-score metric can be used, as can contextual information.
  • the word alignment results 312 provide anchor points which reflect high confidence alignments between the source language sentence 301 and target language sentence 300 . These anchor points are used during phrase alignment.
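  • A minimal sketch of the first stage of this pipeline: extracting high-confidence single-word anchor points from a bilingual dictionary. The dictionary format, the threshold value, and the word pairs are all illustrative assumptions:

```python
def anchor_points(src_words, tgt_words, dictionary, threshold=0.9):
    """Return (src_pos, tgt_pos) anchors wherever the bilingual dictionary
    gives a translation confidence at or above the threshold."""
    anchors = []
    for i, sw in enumerate(src_words):
        for j, tw in enumerate(tgt_words):
            if dictionary.get((sw, tw), 0.0) >= threshold:
                anchors.append((i, j))
    return anchors

src = ["run", "the", "anti-virus", "tool"]
tgt = ["ejecute", "la", "herramienta", "antivirus"]
dictionary = {("run", "ejecute"): 0.95, ("tool", "herramienta"): 0.93,
              ("anti-virus", "antivirus"): 0.91}
print(anchor_points(src, tgt, dictionary))  # [(0, 0), (2, 3), (3, 2)]
```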
  • FIG. 7 is a flow diagram indicative of one embodiment of phrase alignment in accordance with the present invention.
  • the phrase alignment component receives as an input the word alignment results 312 of the example and the boundary information generated from example matching component 218 identifying the boundaries of matched blocks in the source sentence of an example.
  • the phrase alignment component finds all possible target language candidate fragments corresponding to the matched blocks in the source language sentence. This is indicated by block 350 in FIG. 7 .
  • the phrase alignment component calculates a score for each candidate fragment identified. This is indicated by block 352 . From the score calculated, the phrase alignment component selects the best candidate or predetermined number of candidates as the translation output. This is indicated by block 354 in FIG. 7 .
  • In performing step 350 , the present invention breaks this task into two parts.
  • the present invention finds all possible continuous candidate fragments, and all possible non-continuous candidate fragments.
  • FIGS. 8 and 9 illustrate continuous and non-continuous fragments.
  • FIG. 8 shows a source language sentence which includes words (or word sequences) A, B, C and D.
  • FIG. 8 also shows a corresponding target language example sentence (or a portion thereof) which includes target language words (or word sequences) E, F, G and H.
  • a continuous fragment is defined as follows:
  • SFRAG is a fragment in the source language sentence
  • TFRAG is a fragment in the target language sentence. If all the aligned words in SFRAG are aligned to words in TFRAG and only to words in TFRAG, and vice versa, then SFRAG is continuous with respect to TFRAG. Otherwise, it is non-continuous.
  • In FIG. 8 , target language fragment E F G H is not a continuous fragment with respect to fragment A B C. This is because, while A B C is continuous in the source language sentence, E F H, which corresponds to A B C, is not continuous in the target language sentence. Instead, word (or word sequence) G in the target language sentence corresponds to word (or word sequence) D in the source language sentence.
  • FIG. 9 shows two instances of a source language sentence which contains words (or word sequences) A-F and a target language sentence which contains words (or word sequences) G-N.
  • In the first instance, the source language fragment for which a translation is being sought is C D. It corresponds to the continuous target language fragment H I J. This is referred to as continuous.
  • In the second instance, a continuous source language fragment A B corresponds to a non-continuous target language fragment (G H and L M).
  • The out-of-range target language words (or word sequences) I J K also correspond to a continuous source language fragment D E. This is referred to as non-continuous.
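  • This definition can be expressed directly as a predicate over word-alignment links, sketched here with (source position, target position) pairs and the FIG. 8 configuration as a test case (the positions are hypothetical):

```python
def is_continuous(src_span, tgt_span, links):
    """src_span/tgt_span are (start, end) inclusive word positions; links
    is a set of (src_pos, tgt_pos) word alignments. True iff every link
    touching one span lands inside the other, in both directions."""
    s0, s1 = src_span
    t0, t1 = tgt_span
    for s, t in links:
        if s0 <= s <= s1 and not (t0 <= t <= t1):
            return False   # a word in SFRAG aligns outside TFRAG
        if t0 <= t <= t1 and not (s0 <= s <= s1):
            return False   # a word in TFRAG aligns outside SFRAG
    return True

# FIG. 8: source A B C D (0-3), target E F G H (0-3), with A-E, B-F, C-H, D-G.
links = {(0, 0), (1, 1), (2, 3), (3, 2)}
print(is_continuous((0, 2), (0, 3), links))  # False: G aligns to D, outside A B C
print(is_continuous((0, 3), (0, 3), links))  # True
```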
  • the present invention generates all possible continuous fragments and then all possible non-continuous fragments.
  • FIG. 10 is a flow diagram illustrating how, in one embodiment of the present invention, all possible continuous fragments in the target language sentence are identified for a fragment in the source language sentence.
  • the source and target language sentences (or the preprocessed sentences) are received along with word alignment results 312 . This is indicated by block 370 in FIG. 10 .
  • Boundary information for the source language fragment for which alignments are sought is also received.
  • the boundary information in the present example is indicated by (a, b) where a and b are word positions in the source language sentence.
  • If the fragment in the source language sentence for which alignment is sought is C D in FIG. 9 , and each letter is representative of a word, then the boundary information would be ( 3 , 4 ), since word C is in word position 3 and word D is in word position 4 in the source language sentence.
  • Receiving the boundary information is indicated by block 372 in FIG. 10 .
  • the alignment component finds a word set (SET) in the target language sentence which aligns to the fragments having boundaries a, b in the source language sentence based on the word alignment results. This is indicated by block 374 in FIG. 10 .
  • the phrase alignment component finds the left-most word position (c) and the right-most word position (d) of the words in SET in the target sentence, so that the target language fragment (c, d) is the minimum possible alignment (MinPA) in the target language sentence which could be aligned with the source language fragment.
  • the target language fragment boundaries of MinPA are then extended to the left and the right until an inconsistent alignment anchor (one which shows alignment to a word in the source language input outside of (a, b)) is met in each direction.
  • the left and right boundaries, respectively, are moved one word at a time within the target language sentence until the left or right boundary (whichever is being moved) meets an inconsistent anchor point. At that point, the extension of the fragment boundary in that direction is terminated.
  • the new target language boundaries will be (e, f) and will define the maximum possible alignment (MaxPA). This is indicated by block 378 .
  • A set AP is then generated, consisting of all possible continuous substrings between MinPA and MaxPA, all of which must contain MinPA.
  • By “continuous” it is meant that no word gaps exist within the substring.
  • the set of MinPA in union with MaxPA in union with AP is then returned as all possible continuous alignments in the target language sentence for the given fragment in the source language sentence. This is indicated by block 382 .
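  • The FIG. 10 procedure can be sketched as follows; positions are 0-based, and treating unaligned target words as consistent (so that extension passes through them) is an assumption:

```python
def continuous_alignments(a, b, links, tgt_len):
    """For source fragment boundaries (a, b), inclusive, return all
    candidate continuous target spans: MinPA, MaxPA and everything in
    between (the set AP), per FIG. 10."""
    tgt_set = {t for s, t in links if a <= s <= b}       # block 374: SET
    if not tgt_set:
        return []
    c, d = min(tgt_set), max(tgt_set)                    # MinPA = (c, d)

    def consistent(t):
        # An anchor is inconsistent if it aligns target position t to a
        # source word outside (a, b); unaligned positions pass.
        return all(a <= s <= b for s, tt in links if tt == t)

    e = c                                                # block 378: extend to MaxPA = (e, f)
    while e > 0 and consistent(e - 1):
        e -= 1
    f = d
    while f < tgt_len - 1 and consistent(f + 1):
        f += 1
    # Union of MinPA, MaxPA and AP (block 382): every continuous span containing MinPA.
    return [(left, right) for left in range(e, c + 1)
                          for right in range(d, f + 1)]

# FIG. 9, first instance: fragment C D (positions 2, 3) aligned into H..J.
links = {(0, 0), (2, 1), (3, 3), (5, 6)}   # hypothetical anchors
print(continuous_alignments(2, 3, links, tgt_len=8))  # [(1, 3), (1, 4), (1, 5)]
```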
  • All of the continuous alignments are then scored (as is discussed in greater detail below). Scoring the alignments is indicated by block 384 .
  • the step of obtaining all possible continuous alignments is performed for each fragment in the source language input.
  • FIG. 11 is a flow diagram illustrating how all possible non-continuous alignments are found. Again, by non-continuous alignments it is meant those such as found in FIG. 8 and the second instance of FIG. 9 in which a continuous source fragment corresponds to a non-continuous target fragment.
  • the phrase alignment component first receives the inputs and boundaries as described with respect to blocks 370 and 372 in FIG. 10 .
  • the system finds a word set (SET 1 ) in the example (or target) sentence that is aligned with the selected fragment (a, b) in the source language sentence based on the word alignment results. This is the same as indicated by block 374 in FIG. 10 .
  • the phrase alignment component finds a word set (SET 2 ) in the source sentence that aligns to a portion of SET 1 but is beyond the range of (a, b) in the source language sentence. This is indicated by block 386 in FIG. 11 .
  • It is then determined whether SET 2 is continuous in the source language sentence. If not, no phrase alignments are calculated. This is indicated by blocks 388 and 390 . However, if SET 2 is continuous in the source language sentence (meaning that there are no word gaps in SET 2 ), then processing continues at block 392 .
  • the phrase alignment component obtains the continuous word set (SET 3 ) containing SET 2 in the source language sentence.
  • all possible alignments for SET 3 are obtained. This is illustratively done using the algorithm described with respect to FIG. 10 . Finding all possible alignments for SET 3 is indicated by block 394 in FIG. 11 .
  • the left-most position (i) and the right-most position (j) in SET 1 are then located. This is indicated by block 398 .
  • SET 4 (the target language word set aligned to SET 3 ) is then removed from the sequence (i, j). This is indicated by block 400 .
  • The remaining words constitute the MinPA of (a, b). This is indicated by block 402 .
  • MinPA is then extended to obtain MaxPA as discussed with respect to block 378 in FIG. 10 . This is indicated by block 404 in FIG. 11 .
  • AP is obtained as all possible continuous substrings between MinPA and MaxPA, all of which contain MinPA. This is indicated by block 406 in FIG. 11 .
  • the union of MinPA, MaxPA and AP is then returned as indicated by block 408 .
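  • A sketch of the FIG. 11 procedure, reusing continuous_alignments from the previous sketch. Two points the text leaves implicit are treated as assumptions here: SET 3 is taken to be SET 2's own span, and SET 4 is taken to be the narrowest target span aligned to SET 3:

```python
def non_continuous_min_pa(a, b, links, tgt_len):
    """Return the (possibly gapped) MinPA target positions for source
    fragment (a, b); extension to MaxPA and AP then proceeds as in FIG. 10."""
    set1 = {t for s, t in links if a <= s <= b}                      # target words aligned to (a, b)
    set2 = sorted({s for s, t in links                               # block 386: source words outside
                   if t in set1 and not (a <= s <= b)})              # (a, b) aligned into SET 1
    if not set2:
        return []                          # no out-of-range words; FIG. 10's continuous case applies
    if set2[-1] - set2[0] + 1 != len(set2):                          # blocks 388/390: SET 2 must be
        return []                                                    # continuous, else no alignment
    spans = continuous_alignments(set2[0], set2[-1], links, tgt_len) # block 394: alignments of SET 3
    if not spans:
        return []
    set4 = min(spans, key=lambda span: span[1] - span[0])            # assumed choice of SET 4
    i, j = min(set1), max(set1)                                      # block 398
    return [t for t in range(i, j + 1)                               # block 400: remove SET 4 from
            if not (set4[0] <= t <= set4[1])]                        # (i, j) -> MinPA of (a, b)
```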
  • Each of the possible non-continuous alignments returned is then scored as indicated by block 410 .
  • the score associated with each of the possible alignments is then calculated, and the best scoring alignments are chosen as the translation.
  • the confidence level for each translation output is calculated. This can be done by translation component 222 or post processing component 224 in system 200 . In any case, in one embodiment, the translation confidence level is determined as follows:
  • the translation confidence level is based on the alignment confidence level, the confidence of aligned words, and the number of aligned and unaligned words in the target language correspondence.
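  • The text above does not reproduce the underlying formula, so the sketch below is only an illustrative combination of the three stated ingredients (alignment confidence, aligned-word confidences, and the aligned/unaligned word counts), not the patent's actual computation:

```python
def translation_confidence(alignment_conf, word_confs, unaligned_count):
    """Illustrative only: scale the alignment confidence by the average
    aligned-word confidence and by the aligned-word coverage."""
    if not word_confs:
        return 0.0
    avg_word_conf = sum(word_confs) / len(word_confs)
    coverage = len(word_confs) / (len(word_confs) + unaligned_count)
    return alignment_conf * avg_word_conf * coverage

print(translation_confidence(0.8, [0.9, 0.95, 0.7], unaligned_count=1))  # 0.51
```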
  • the system marks portions of the output with the confidence level, which allows the user to single out low confidence translation outputs, and the areas that require attention, for particular scrutiny.
  • the present invention employs an example matching method that enhances the example matching and retrieval performance both in quality and speed over prior systems.
  • the present invention employs a word/phrase alignment technique and a score function for selecting the best candidate in phrase alignment which also produces enhancements in accuracy and speed over prior systems.
  • the present invention employs a translation confidence prediction method that indicates the quality of the translation generated by the machine, and also highlights some translation portions for scrutiny by the user.


Abstract

The present invention performs machine translation by matching fragments of a source language sentence to be translated to source language portions of an example in example base. When all relevant examples have been identified in the example base, the examples are subjected to phrase alignment in which fragments of the target language sentence in each example are aligned against the matched fragments of the source language sentence in the same example. A translation component then substitutes the aligned target language phrases from the matched examples for the matched fragments in the source language sentence.

Description

    BACKGROUND OF THE INVENTION
  • The present invention relates to machine translation. More specifically, the present invention relates to an example based machine translation system or translation memory system.
  • Machine translation is a process by which an input sentence (or sentence fragment) in a source language is provided to a machine translation system. The machine translation system outputs one or more translations of the source language input as a target language sentence, or sentence fragment. There are a number of different types of machine translation systems, including example based machine translation (EBMT) systems.
  • EBMT systems generally perform two fundamental operations in performing a translation. Those operations include matching and transfer. The matching operation retrieves a “closest match” for a source language input string from an example database. The transfer operation generates a translation in terms of the matched example(s). Specifically, the transfer operation is actually the process of getting the translation of the input string by performing alignment between the matched bilingual example (s). “Alignment” as used herein means deciding which fragment in a target language sentence (or example) corresponds to the fragment in the source language sentence being translated.
  • Some EBMT systems perform similarity matching based on syntactic structures, such as parse trees or logical forms. Of course, these systems require the inputs to be parsed to obtain the syntactic structure. This type of matching method can make suitable use of examples and enhance the coverage of the example base. However, these types of systems run into trouble in certain domains, such as software localization. In software localization, software documentation and code are localized or translated into different languages. The terms used in software manuals render the parsing accuracy of conventional EBMT systems very low, because even the shallow syntax information (such as word segmentation and part-of-speech tags) is often erroneous.
  • Also, such systems have high example base maintenance costs. This is because all of the examples saved in the example base should be parsed and corrected by humans whenever the example base needs to be updated.
  • Other EBMT systems and translation memory systems employ string matching. In these types of systems, example matching is typically performed by using a similarity metric which is normally the edit distance between the input fragment and the example. However, the edit distance metric only provides a good indication of matching accuracy when a complete sentence or a complete sentence segment has been matched.
  • A variety of different alignment techniques have been used in the past as well, particularly for phrase alignments. Most of the previous alignment techniques can be classified into one of two different categories. Structural methods find correspondences between source and target language sentences or fragments with the help of parsers. Again, the source and target language fragments are parsed to obtain paired parses. Structural correspondences are then found based on the structural constraints of the paired parse trees. As discussed above, parsers present difficult problems in certain domains such as technical domains.
  • In grammarless alignment systems, correspondences are found not by using a parser, but by utilizing co-occurrence information and geometric information. Co-occurrence information is obtained by examining whether there are co-occurrences of source language fragments and target language fragments in a corpus. Geometric information is used to constrain the alignment space. The correspondences located are grammarless. Once the word correspondences are extracted, they are stored in an example base. This means that there is a source language sentence, and the correspondent target language sentence, and the word correspondence information will be saved in the example base. During translation, an example in the example base will be stimulated only if there is a fragment in the source language side of the example matching the input string.
  • SUMMARY OF THE INVENTION
  • The present invention performs machine translation by matching fragments of a source language input to portions of examples in an example base. All relevant examples are identified in the example base, in which fragments of the target language sentence are aligned against fragments of the source language sentence within each example. A translation component then substitutes the aligned target language phrases from the examples for the matched fragments in the source language input.
  • In one embodiment, example matching is performed based on position marked term frequency/inverted document frequency index scores. TF/IDF weights are calculated for blocks in the source language input that are covered by the examples to find a best block combination. The best examples for each block in the block combination are also found by calculating a TF/IDF weight.
  • In one embodiment, the relevant examples once identified are provided to an alignment component. The alignment component first performs word alignment to obtain alignment anchor points between the source language sentence and the target language sentence in the example pair under consideration. Then, all continuous alignments between the source language sentence and the target language sentence are generated, as are all non-continuous alignments. Scores are calculated for each alignment and the best are chosen as the translation.
  • In accordance with another embodiment of the invention, a confidence metric is calculated for the translation output. The confidence metric is used to highlight portions of the translation output which need user's attention. This draws the user's attention to such areas for possible modification.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram of an environment in which the present invention can be used.
  • FIG. 2 is a block diagram of a translation engine in accordance with one embodiment of the present invention.
  • FIG. 3 is a flow diagram illustrating the overall operation of the system shown in FIG. 2.
  • FIG. 4 is a flow diagram illustrating example matching in accordance with one embodiment of the present invention.
  • FIG. 5 illustrates a plurality of different examples corresponding to an input sentence in accordance with one embodiment of the present invention.
  • FIG. 6 is a data flow diagram illustrating word alignment in accordance with one embodiment of the present invention.
  • FIG. 7 is a flow diagram illustrating phrase alignment in accordance with one embodiment of the present invention.
  • FIGS. 8 and 9 illustrate continuous and discontinuous alignments.
  • FIG. 10 is a flow diagram illustrating the generation of continuous alignments in accordance with one embodiment of the present invention.
  • FIG. 11 is a flow diagram illustrating the generation of non-continuous alignments in accordance with one embodiment of the present invention.
  • DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
  • The present invention involves a machine translation system. However, before describing the present invention in greater detail, one embodiment of an environment in which the present invention can be used will be described.
  • FIG. 1 illustrates an example of a suitable computing system environment 100 on which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.
  • The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
  • The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
  • With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
  • Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, BEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 100. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier WAV or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, FR, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
  • The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.
  • The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
  • The drives and their associated computer storage media discussed above and illustrated in FIG. 1, provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
  • A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
  • The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user-input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on remote computer 180. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • It should be noted that the present invention can be carried out on a computer system such as that described with respect to FIG. 1. However, the present invention can be carried out on a server, a computer devoted to message handling, or on a distributed system in which different portions of the present invention are carried out on different parts of the distributed computing system.
  • FIG. 2 is a block diagram of a translation engine 200 in accordance with one embodiment of the present invention. Translation engine 200 receives an input sentence (or sentence fragment) in a source language as source language input 202. Engine 200 then accesses example base 204 and term base 206 and generates a target language output 208. Target language output 208 is illustratively a translation of source language input 202 into the target language.
  • Example base 204 is a database of word aligned target language and source language examples generated from example base generator 210 based on a sentence aligned bilingual corpus of examples 212. Aligned bilingual corpus of examples 212 illustratively contains paired sentences (sentences in the source language aligned or paired with translations of those sentences in the target language). Example base generator 210 generates example base 204 indexed in what is referred to as position-marked term frequency/inverse document frequency (P-TF/IDF) indexing.
  • TF/IDF is a mature information retrieval technique and is a type of word indexing that is used to enable efficient document retrieval. A TF/IDF weight (or score) is calculated for each term (such as a lemma, or a term with a part-of-speech (POS) tag) in an index file. The higher the TF/IDF weight, the more important a term is. The TF/IDF weight is determined by the following formulas:
  • $$\mathrm{TF}_{ij} = \log(n_{ij} + 1) \tag{1}$$
    $$\mathrm{IDF}_{i} = \log\left(\frac{N}{n_i}\right) + 1 \tag{2}$$
    $$\mathrm{TFIDF}_{ij} = \frac{\mathrm{TF}_{ij} \cdot \mathrm{IDF}_{i}}{\sqrt{\sum_{i=1}^{n_j} \left(\mathrm{TF}_{ij} \cdot \mathrm{IDF}_{i}\right)^{2}}} \tag{3}$$
  • where N = the number of examples in the example base (EB);
  • n_i = the total number of occurrences of term i in the EB;
  • n_j = the total number of terms in example j;
  • n_ij = the total number of occurrences of term i in example j;
  • TF_ij = term i's normalized frequency in example j; and
  • TFIDF_ij = term i's TF/IDF weight in example j.
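  • By way of illustration, and not limitation, the following Python sketch shows one plausible reading of equations (1)-(3) over a toy example base; the data structures, function names and sample terms are illustrative assumptions rather than part of the described embodiment.

```python
import math
from collections import Counter

def tfidf_weights(example_base):
    """Compute normalized TF/IDF weights per equations (1)-(3).

    example_base: list of examples, each a list of terms (source side).
    Returns a dict mapping (term, example_index) -> TFIDF_ij.
    """
    N = len(example_base)                       # number of examples in the EB
    doc_freq = Counter()                        # n_i: examples containing term i
    for example in example_base:
        doc_freq.update(set(example))

    weights = {}
    for j, example in enumerate(example_base):
        counts = Counter(example)               # n_ij: occurrences of term i in example j
        raw = {term: math.log(n_ij + 1) * (math.log(N / doc_freq[term]) + 1)
               for term, n_ij in counts.items()}         # TF_ij * IDF_i
        norm = math.sqrt(sum(w * w for w in raw.values()))  # denominator of (3)
        for term, w in raw.items():
            weights[(term, j)] = w / norm if norm else 0.0
    return weights

toy_eb = [["type", "of", "anti-virus", "tool"],
          ["a", "type", "of", "file"],
          ["anti-virus", "tool", "update"]]
for (term, j), w in sorted(tfidf_weights(toy_eb).items(), key=lambda x: x[0][1]):
    print(f"example {j}: {term:10s} {w:.3f}")
```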
  • Such a system is employed in the present invention because the word index enables efficient example retrieval, and also because it is believed to reflect the factors that should be considered in sentence similarity calculation. Such factors include the number of matched words in each example (the more matched words, the higher the example weight), the differing importance of different words in the example (the higher the term frequency, the lower the term weight), the length of a given example (the longer the example, the lower the example weight), and the number of extra or mismatched words in the example (the more extra or mismatched words, the lower the example weight).
  • In order to maintain matching information between each term contained in an input sentence and its matched example, the traditional TF/IDF technique is extended to a position-marked TF/IDF format. This reflects not only the term weight, but also the term position in each example. Table 1 shows an exemplary P-TF/IDF indexing file for the terms “anti-virus tool” and “type of”.
  • TABLE 1
    Example of P-TF/IDF Indexing

    Bi-term           Avg. TF/IDF weight   Postings (example index | weight | position(s))
    anti-virus tool   0.33                 102454 | 0.45 | 2_ . . .
    type of           0.22                 100044 | 0.30 | 2^12_100074 | 0.20 | 7_ . . .
  • As seen in Table 1, to enhance retrieval speed, one embodiment of the present invention uses bi-term indexing instead of uni-term indexing. In Table 1, the first column shows the bi-term unit indexed. The second column shows the average TF/IDF weight of the bi-term in the example base, and the third column shows the related example's index number, the weight of the bi-term in that example, and the position of the bi-term in the example sentence. For instance, the bi-term “anti-virus tool” has an average TF/IDF weight of 0.33. It can be found in the example identified by index number 102454, etc. The weight of that bi-term in the example sentence where it is found is 0.45, and its position in the example sentence is position number 2. The bi-term “type of” can be found twice in example number 100044, at positions 2 and 12. It can also be found in example 100074 at position 7, etc. Thus, example base generator 210 can be any known example base generator that generates examples indexed as shown in Table 1. Generator 210 illustratively calculates the TF/IDF weights (or simply indexes them if they have already been calculated), and it also identifies the position of each bi-term in the example sentence.
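  • The postings of Table 1 might, for example, be represented as in the following sketch; the field layout and the simple count-based per-example weight are assumptions made for illustration (the actual per-example weights would come from equation (3)).

```python
from collections import defaultdict

def build_p_tfidf_index(example_base):
    """Build a toy position-marked bi-term index shaped like Table 1.

    Maps each bi-term to postings of (example_id, weight, positions);
    the per-example weight here is a simple stand-in, not equation (3).
    """
    index = defaultdict(list)
    for ex_id, tokens in enumerate(example_base):
        positions = defaultdict(list)
        for pos in range(len(tokens) - 1):
            positions[" ".join(tokens[pos:pos + 2])].append(pos + 1)  # 1-based
        for bi_term, pos_list in positions.items():
            weight = len(pos_list) / (len(tokens) - 1)   # stand-in weight
            index[bi_term].append((ex_id, round(weight, 2), pos_list))
    return index

index = build_p_tfidf_index([
    ["the", "anti-virus", "tool", "runs"],
    ["this", "type", "of", "file", "is", "a", "type", "of", "risk"],
])
print(index["type of"])   # [(1, 0.25, [2, 7])] -> example 1, positions 2 and 7
```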
  • Term base 206 is generated by term base generator 214, which also accesses bilingual example corpus 212. Term base generator 214 simply generates correspondences between individual terms in the source and target language.
  • The overall operation of engine 200 will now be described with respect to FIG. 2, as well as FIG. 3 which is a flow diagram of the overall operation of engine 200. Engine 200 illustratively includes preprocessing component 216, example matching component 218, phrase alignment component 220, translation component 222 and post processing component 224.
  • Engine 200 first receives the source language input sentence 202 to be translated. This is indicated by block 226 in FIG. 3. Next, preprocessing component 216 performs preprocessing on source language input 202. Preprocessing component 216 illustratively identifies the stemmed forms of words in the source language input 202. Of course, other preprocessing can be performed as well, such as employing part-of-speech tagging or other preprocessing techniques. It should also be noted, however, that the present invention can be employed on surface forms as well and thus preprocessing may not be needed. In any case, preprocessing is indicated by block 228 in FIG. 3.
  • After preprocessing has been performed, example matching component 218 matches the preprocessed source language input against examples in example base 204. Component 218 also finds all candidate word sequences (or blocks). The best combinations of blocks are then located, as is the best example for each block. This is indicated by blocks 230, 232 and 234 in FIG. 3 and is described in greater detail with respect to FIGS. 4 and 5 below.
  • The relevant examples 236 for each block are obtained and provided to phrase alignment component 220. The corresponding target language block is then located, and the matched phrases in the source language are replaced with the target language correspondences located. This is indicated by blocks 235 and 238 in FIG. 3. The location of the target language correspondences in this way is performed by phrase alignment component 220 and is illustrated in greater detail with respect to FIGS. 6-10 below.
  • The source language input may still have a number of terms which failed to be translated through the bi-term matching and the phrase alignment stage. Thus, translation component 222 accesses term base 206 to obtain a translation of the terms which have not yet been translated. Component 222 also replaces the aligned source language phrases with associated portions of the target language examples. This is indicated by block 240 in FIG. 3. The result is then provided to post processing component 224.
  • Post processing component 224 calculates a confidence measure for the translation results, as indicated by block 242 in FIG. 3, and can optionally highlight portions of the translation results that require the user's attention, as indicated by block 244. This directs the user's attention to portions of the translation output that have a low confidence metric associated with them. Target language output 208 thus illustratively includes the translation result, highlighted to indicate such areas.
  • FIG. 4 is a flow diagram which better illustrates the operation of example matching component 218. First, all relevant examples are obtained from the example base by accessing the P-TF/IDF index described above. This is illustrated by block 250 in FIG. 4. In order to do this, example matching component 218 simply locates examples which contain the bi-term sequences that are also found in the input sentence. Of course, by accessing the P-TF/IDF index, the identifier of examples containing the bi-term sequence can easily be found (for example, in the third column of Table 1). Then, for each relevant example identified in block 250, all matching blocks between the selected relevant example and the input sentence are identified. This is indicated by block 252.
  • FIG. 5 better illustrates the meaning of a “matching block”. Suppose the input sentence is composed of seven terms (term 1-term 7), each of which is a word in this example. Suppose also that the input sentence contains four indexed bi-terms, identified as bi-term 3-4 (which covers terms 3 and 4 in the input sentence), bi-term 4-5 (which covers terms 4 and 5), bi-term 5-6 (which covers terms 5 and 6) and bi-term 6-7 (which covers terms 6 and 7). Now assume that the same sequence of bi-terms occurs, and is likewise continuous, in an example (such as example 1 in FIG. 5). Then, the bi-terms in the source language input sentence can be combined into a single block (block 3-7).
  • However, the matching blocks in the input sentence can overlap one another. For example, it can be seen that example 2 contains a continuous bi-term sequence that can be blocked in the input sentence as block 3-5. Example 3 contains a continuous bi-term sequence that can be blocked in the input sentence as block 5-7. Example 4 contains a continuous bi-term sequence that can be blocked in the input sentence as block 4-5 and example 5 contains a bi-term sequence that can be blocked in the input sentence as block 6-7.
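  • The merging of chained bi-term matches into maximal blocks, as described above, might look like the following sketch; representing the bi-term matches as 1-based start positions is an assumption made only for illustration.

```python
def merge_biterms_into_blocks(biterm_starts):
    """Merge overlapping bi-term matches into maximal continuous blocks.

    biterm_starts: 1-based start positions of matched bi-terms in the
    input sentence (each bi-term covers positions p and p + 1).
    Returns (start, end) blocks, e.g. starts [3, 4, 5, 6] -> [(3, 7)].
    """
    blocks = []
    for p in sorted(biterm_starts):
        if blocks and p <= blocks[-1][1]:        # shares a word with current block
            blocks[-1] = (blocks[-1][0], p + 1)  # extend block to cover p, p + 1
        else:
            blocks.append((p, p + 1))            # start a new block
    return blocks

print(merge_biterms_into_blocks([3, 4, 5, 6]))  # [(3, 7)] -- block 3-7
print(merge_biterms_into_blocks([3, 6]))        # [(3, 4), (6, 7)]
```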
  • Therefore, a number of different block combinations can be derived. Such block combinations can be block 3-7; block 3-5 + block 6-7; block 4-5 + block 6-7; or simply block 5-7, etc. The input sentence could be blocked in any of these different ways, and examples can still be found for translation of portions of the input sentence. Example matching component 218 thus finds the best block combination of terms in the input sentence by calculating a TF/IDF weight for each block combination. This is indicated by block 254 in FIG. 4.
  • In accordance with one embodiment of the present invention, the best block combination problem can be viewed as a shortest-path location problem. Thus, a dynamic programming algorithm can be utilized. In accordance with one embodiment of the present invention, the “edge length” (or path length) associated with each block combination is calculated by the following equation:
  • $$\mathrm{EdgeLen}_i = \begin{cases} \dfrac{1}{\sum_{k=m}^{n} \mathrm{TFIDF}_k}, & \text{if } n > m \\ 10, & \text{if } n = m \end{cases} \tag{4}$$
  • where,
  • i = the “edge” (block) index number in the input sentence;
  • m = the word position at which edge i starts;
  • n = the word position at which edge i ends;
  • k = the word position of each term within edge i;
  • TFIDF_k = term k's average TF/IDF weight in the EB; and
  • EdgeLen_i = the weight (path length) of block i.
  • Therefore, each block combination identified has its weight calculated as indicated by the above equation. Thus, each block combination for the input sentence will have a weight or path length associated therewith.
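  • By way of illustration, the following sketch casts the best-block-combination search as the shortest-path problem described above, using equation (4) for the edge lengths; treating each word not covered by any block as a single-word edge of length 10 is this sketch's reading of the n = m case.

```python
def best_block_combination(num_words, blocks, avg_tfidf):
    """Shortest-path search for the best block combination.

    num_words: length of the input sentence (1-based positions).
    blocks: set of (m, n) candidate blocks over input word positions.
    avg_tfidf: {position: average TF/IDF weight of the term there}.
    Edge lengths follow equation (4): 1 / sum of the block's weights,
    or the constant 10 for a single uncovered word (n == m).
    """
    INF = float("inf")
    best = {p: (INF, None) for p in range(1, num_words + 2)}
    best[1] = (0.0, None)                    # nothing covered yet
    for p in range(1, num_words + 1):
        cost, _ = best[p]
        if cost == INF:
            continue
        if cost + 10 < best[p + 1][0]:       # single-word edge (n == m)
            best[p + 1] = (cost + 10, (p, p))
        for (m, n) in blocks:                # block edges starting at p
            if m == p:
                edge = 1.0 / sum(avg_tfidf[k] for k in range(m, n + 1))
                if cost + edge < best[n + 1][0]:
                    best[n + 1] = (cost + edge, (m, n))
    path, p = [], num_words + 1              # recover the chosen edges
    while p > 1:
        edge = best[p][1]
        path.append(edge)
        p = edge[0]
    return best[num_words + 1][0], path[::-1]

weights = {k: 0.3 for k in range(1, 8)}      # toy per-position weights
print(best_block_combination(7, {(3, 7), (3, 5), (6, 7)}, weights))
# (~20.67, [(1, 1), (2, 2), (3, 7)]): words 1-2 unmatched, block 3-7 chosen
```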
  • Next, the example associated with each block is identified, and the similarity between each identified example and the input sentence is calculated as follows:
  • $$\mathrm{Similarity}_j = \sum_{k=1}^{K} \mathrm{TFIDF}_{kj} \tag{5}$$
  • where,
  • K = the total number of common terms included in both example j and the input sentence;
  • TFIDF_kj = term k's TF/IDF weight in example j; and
  • Similarity_j = the matching weight between example j and the input sentence.
  • Finding the TF/IDF weight associated with each example is indicated by block 256 in FIG. 4.
  • Thus, example matching component 218 has now calculated a score associated with each different block combination into which the input sentence can be divided. Component 218 has also calculated a score for each example associated with every block identified in the different block combinations. Component 218 can then prune the list of examples to those having a sufficient similarity score, or a sufficient similarity score combined with the block combination score, and provide the relevant examples 236 in FIG. 2 to phrase alignment component 220.
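  • A minimal sketch of the similarity computation of equation (5) and the subsequent pruning step is given below; the threshold value and the shape of the per-example weight tables are illustrative assumptions.

```python
def example_similarity(example_terms, input_terms, tfidf_in_example):
    """Equation (5): sum the TF/IDF weights of the common terms."""
    common = set(example_terms) & set(input_terms)
    return sum(tfidf_in_example.get(term, 0.0) for term in common)

def prune_examples(examples, input_terms, tfidf, threshold):
    """Keep only examples whose similarity to the input clears a threshold.

    examples: {example_id: term list};
    tfidf: {example_id: {term: TFIDF_kj}} per-example term weights.
    """
    scored = {ex_id: example_similarity(terms, input_terms, tfidf[ex_id])
              for ex_id, terms in examples.items()}
    return {ex_id: s for ex_id, s in scored.items() if s >= threshold}

examples = {100044: ["this", "type", "of", "file"],
            102454: ["the", "anti-virus", "tool"]}
tfidf = {100044: {"type": 0.3, "of": 0.2, "file": 0.4},
         102454: {"anti-virus": 0.5, "tool": 0.45}}
print(prune_examples(examples, ["what", "type", "of", "tool"], tfidf, 0.4))
# both examples survive: {100044: 0.5, 102454: 0.45}
```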
  • It can be seen that phrase alignment component 220 thus accepts as input an example, which is in fact a sentence (or text fragment) pair including a source sentence (or fragment) and a target sentence (or fragment), along with boundary information specifying the portion of the example's source sentence that matched the input sentence to be translated. The job of phrase alignment component 220 is to align the matched phrases or word sequences in the example's source sentence with their possible translations in the example's target sentence, and to select a best target fragment as the translation for that matched part of the source sentence, and therefore as the translation for the corresponding matched part of the input sentence. In order to do this, phrase alignment component 220 first generates a series of word alignments as anchors in the phrase alignment process. Based on these anchors, component 220 then attempts to find the corresponding phrases in the target sentence of an example for the matched part of the source sentence in the same example.
  • FIG. 6 is a flow diagram which better illustrates the word alignment process used to obtain anchors in accordance with one embodiment of the present invention. FIG. 6 shows that, in the word alignment process, an example under consideration (which includes source language sentence 301 and target language sentence 300) is input to a first alignment component which operates as a bilingual dictionary aligner 302. Aligner 302 estimates how likely it is that two words in different languages are translations of one another. There are a wide variety of ways in which this has been done. Some metrics for evaluating this type of translation confidence include a translation probability such as that found in Brown et al., The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 19(2), pp. 263-311 (1993), a Dice coefficient such as that found in Ker et al., A Class-based Approach to Word Alignment, Computational Linguistics, Vol. 23, Num. 2, pp. 313-343 (1997), mutual information such as that found in Brown, P. F., A Statistical Approach to Language Translation, COLING-88, Vol. 1, pp. 71-76 (1988), and the t-score such as that found in Pascale, A Pattern Matching Method for Finding Noun and Proper Noun Translation From Noisy Parallel Corpora, Computational Linguistics, 21(4), pp. 226-233 (1995).
  • Bilingual dictionary aligner 302 thus establishes high confidence single word anchor points which are direct word translations from source sentence to target sentence of example 300. These are used later during phrase alignment.
  • Next, in cases where the target sentence of example 300 is in a non-segmented language (such as Chinese), word segmentation will be conducted. This can be done in any of a wide variety of different, known ways and the present invention is not limited to any specific word segmentation technique. Word segmentation of the target sentence of the example 300 is indicated by block 304 in FIG. 6.
  • The enhanced bilingual dictionary based aligner 306 is then employed, which not only utilizes word similarities computed based on a bilingual dictionary, but also uses a distortion model to describe how likely it is that a given position in the source sentence aligns to a given position in the target sentence. As with the bilingual dictionary aligner 302, there are a wide variety of different distortion models which can be employed. Some such models include absolute distortion and relative offset (such as in Brown, cited above), hidden Markov model (HMM)-based systems, and structure constraint systems (also found in Brown).
  • Even after word alignment and distortion modeling, there will exist some partial alignments. Therefore, a monolingual dictionary is accessed to merge characters into words and words into phrases. This is indicated by block 308 in FIG. 6. In other words, even if the bilingual dictionary is very large, its coverage is still very limited because of the basic complexity of language. Using a monolingual dictionary, some separate words (that should not be separate because they are part of a phrase) can be identified as a phrase. Thus, phrase merging is implemented.
  • Similarly, any known statistical alignment component can be used in an effort to align unaligned words. This is indicated by block 310. Such statistical alignment techniques are known and are simply provided with a threshold to constrain the statistical alignment space.
  • Taking all of these items into account, the word alignment results 312 are output by the word alignment system.
  • While, in the embodiments shown in FIG. 6, the word alignment mechanism includes translation information from bilingual dictionary aligner 302, distortion aligner model 306, phrase merging component 308 and statistical alignment component 310, other sources of information can be used as well. For example, the t-score mentioned above can be used as can contextual information. In any case, the word alignment results 312 provide anchor points which reflect high confidence alignments between the source language sentence 301 and target language sentence 300. These anchor points are used during phrase alignment.
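  • The cascade of FIG. 6 might be organized as in the following sketch. Only the dictionary-anchor stage (block 302) is made concrete here; the lower-confidence stages (blocks 306, 308 and 310) are passed in as stand-ins, since any of the cited techniques could fill those roles, and the toy lexicon is an assumption for illustration.

```python
def dictionary_anchors(source, target, bilingual_dict):
    """Stage 302: anchor source words to target words via a bilingual
    dictionary, keeping only unambiguous one-to-one matches as
    high-confidence anchors."""
    anchors = {}
    for i, s_word in enumerate(source):
        candidates = [j for j, t_word in enumerate(target)
                      if t_word in bilingual_dict.get(s_word, ())]
        if len(candidates) == 1:              # unambiguous -> high confidence
            anchors[i] = candidates[0]
    return anchors

def align_words(source, target, bilingual_dict, extra_stages=()):
    """Run the cascade: dictionary anchors first, then any lower-confidence
    stages (distortion model, phrase merging, statistical aligner), each of
    which may only fill positions the earlier stages left open."""
    alignment = dictionary_anchors(source, target, bilingual_dict)
    for stage in extra_stages:                # stand-ins for blocks 306-310
        for i, j in stage(source, target, alignment).items():
            alignment.setdefault(i, j)        # never overwrite earlier anchors
    return alignment

lexicon = {"tool": {"工具"}, "type": {"类型"}}
print(align_words(["the", "type", "of", "tool"], ["工具", "的", "类型"], lexicon))
# {1: 2, 3: 0}: "type" anchors to position 2, "tool" to position 0
```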
  • FIG. 7 is a flow diagram indicative of one embodiment of phrase alignment in accordance with the present invention. The phrase alignment component receives as an input the word alignment results 312 of the example and the boundary information generated from example matching component 218 identifying the boundaries of matched blocks in the source sentence of an example.
  • Based on these inputs, the phrase alignment component finds all possible target language candidate fragments corresponding to the matched blocks in the source language sentence. This is indicated by block 350 in FIG. 7. Next, the phrase alignment component calculates a score for each candidate fragment identified. This is indicated by block 352. From the score calculated, the phrase alignment component selects the best candidate or predetermined number of candidates as the translation output. This is indicated by block 354 in FIG. 7.
  • These steps are now described in greater detail. In finding all possible target language candidate fragments, as in step 350, the present invention breaks the task into two parts: finding all possible continuous candidate fragments, and finding all possible non-continuous candidate fragments. FIGS. 8 and 9 illustrate continuous and non-continuous fragments.
  • If a continuous source language sentence fragment always corresponded to a continuous target language fragment, the task of phrase alignment would be easy. However, this is not always true. For example, in language pairs such as English and Chinese, the situation shown in FIG. 8 often arises. FIG. 8 shows a source language sentence which includes words (or word sequences) A, B, C and D. FIG. 8 also shows a corresponding target language example sentence (or a portion thereof) which includes target language words (or word sequences) E, F, G and H. For purposes of the present discussion, a continuous fragment is defined as follows:
  • Suppose SFRAG is a fragment in the source language sentence and TFRAG is a fragment in the target language sentence. If all the aligned words in SFRAG are aligned to the words in TFRAG, and only to the words in TFRAG, then SFRAG is continuous with TFRAG, and vice versa. Otherwise, it is non-continuous.
  • In FIG. 8, for example, target language fragment E F G H is not a continuous fragment with respect to fragment A B C. This is because, while A B C is continuous in the source language sentence, E F H, which corresponds to A B C, is not continuous in the target language sentence. Instead, word (or word sequence) G in the target language sentence corresponds to word (or word sequence) D in the source language sentence.
  • In order to accommodate these difficulties, one embodiment of the present invention breaks the different circumstances into two different categories as shown in FIG. 9. FIG. 9 shows two instances of a source language sentence which contains words (or word sequences) A-F and a target language sentence which contains words (or word sequences) G-N. In the first instance, it can be seen that the source language fragment for which a translation is being sought (C D) corresponds to a continuous target language fragment in the target example illustrated (fragment H I J). This is referred to as continuous.
  • In the second instance, a continuous source language fragment A B corresponds to a non-continuous target language fragment (G H L M). However, the out-of-range target language words (or word sequences) I J K also correspond to a continuous source language fragment D E. This is referred to as non-continuous. Thus, the present invention generates all possible continuous fragments and then all possible non-continuous fragments.
  • FIG. 10 is a flow diagram illustrating one embodiment of the present invention in which all possible continuous fragments in the target language sentence are identified for a fragment in the source language sentence. First, the source and target language sentences (or the preprocessed sentences) are received along with word alignment results 312. This is indicated by block 370 in FIG. 10.
  • Boundary information for the source language fragment for which alignments are sought is also received. The boundary information in the present example is indicated by (a, b) where a and b are word positions in the source language sentence. Thus, if the fragment in the source language sentence for which alignment is sought is C D, in FIG. 9, and each letter is representative of a word, then the boundary information would be (3, 4) since word C is in word position 3 and word D is in word position 4 in the source language sentence. Receiving the boundary information is indicated by block 372 in FIG. 10.
  • The alignment component then finds a word set (SET) in the target language sentence which aligns to the fragments having boundaries a, b in the source language sentence based on the word alignment results. This is indicated by block 374 in FIG. 10.
  • The phrase alignment component then finds the left-most word position (c) and the right-most word position (d) of the words in SET in the target sentence, so that the target language sentence fragment (c, d) is the minimum possible alignment (MinPA) in the target language sentence which could be aligned with the source language fragment. This is indicated by block 376. Next, the target language fragment boundaries of MinPA are extended to the left and to the right until an inconsistent alignment anchor (one which shows alignment to a word in the SL input outside of (a, b)) is met in each direction. The left and right boundaries are each moved one word at a time within the target language sentence until the boundary being moved meets an inconsistent anchor point. At that point, the extension of the fragment boundary in that direction is terminated. Thus, the new target language boundaries will be (e, f) and will define the maximum possible alignment (MaxPA). This is indicated by block 378.
  • Next, a set of words AP is obtained. AP is all possible continuous substrings between MinPA and MaxPA, all of which must contain MinPA. By continuous is meant that no word gaps exist within the continuous substring. This is indicated by block 380. The set of MinPA in union with MaxPA in union with AP is then returned as all possible continuous alignments in the target language sentence for the given fragment in the source language sentence. This is indicated by block 382.
  • All of the continuous alignments are then scored (as is discussed in greater detail below). Scoring the alignments is indicated by block 384. The step of obtaining all possible continuous alignments is performed for each fragment in the source language input.
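  • The following sketch is one plausible implementation of the MinPA/MaxPA procedure of FIG. 10, assuming 0-based word positions and a one-to-one anchor dictionary produced by the word alignment stage.

```python
def continuous_alignments(a, b, anchors, target_len):
    """FIG. 10 sketch: candidate continuous TL spans for SL fragment (a, b).

    anchors: {sl_pos: tl_pos} high-confidence word alignments (0-based).
    Returns (c, d) spans: MinPA, MaxPA, and every continuous substring
    between them that contains MinPA.
    """
    aligned = [t for s, t in anchors.items() if a <= s <= b]
    if not aligned:
        return set()
    c, d = min(aligned), max(aligned)            # MinPA = (c, d), block 376
    # inconsistent anchors: TL words aligned to SL words outside (a, b)
    inconsistent = {t for s, t in anchors.items() if not a <= s <= b}
    e = c
    while e - 1 >= 0 and e - 1 not in inconsistent:
        e -= 1                                   # extend left, block 378
    f = d
    while f + 1 < target_len and f + 1 not in inconsistent:
        f += 1                                   # extend right, block 378
    # MaxPA = (e, f); AP = continuous substrings containing MinPA, block 380
    return {(left, right)
            for left in range(e, c + 1)
            for right in range(d, f + 1)}

anchors = {2: 3, 3: 4, 5: 1}                     # SL pos -> TL pos
print(sorted(continuous_alignments(2, 3, anchors, target_len=7)))
# MinPA is (3, 4); MaxPA is (2, 6); every returned span contains (3, 4)
```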
  • FIG. 11 is a flow diagram illustrating how all possible non-continuous alignments are found. Again, by non-continuous alignments it is meant those such as found in FIG. 8 and the second instance of FIG. 9 in which a continuous source fragment corresponds to a non-continuous target fragment.
  • In order to obtain all possible non-continuous fragments, the phrase alignment component first receives the inputs and boundaries as described with respect to blocks 370 and 372 in FIG. 10. Next, the system finds a word set (SET1) in the example (or target) sentence that is aligned with the selected fragment (a, b) in the source language sentence based on the word alignment results. This is the same as indicated by block 374 in FIG. 10.
  • Next, the phrase alignment component finds a word set (SET2) in the source sentence that aligns to a portion of SET1 but is beyond the range of (a, b) in the source language sentence. This is indicated by block 386 in FIG. 11.
  • It is next determined whether SET2 is continuous in the source language sentence. If not, no phrase alignments are calculated. This is indicated by blocks 388 and 390. However, if SET2 is continuous in the source language sentence (meaning that there are no word gaps in SET2), then processing continues at block 392.
  • In block 392, the phrase alignment component obtains the continuous word set (SET3) containing SET2 in the source language sentence. Next, all possible alignments for SET3 are obtained. This is illustratively done using the algorithm described with respect to FIG. 10. Finding all possible alignments for SET3 is indicated by block 394 in FIG. 11.
  • All of the alignments are then scored and the best alignment SET4 for SET3 is chosen. This is indicated by block 396.
  • The left-most position (i) and the right-most position (j) in SET1 are then located. This is indicated by block 398. SET4 is then removed from the sequence (i, j). This is indicated by block 400.
  • Then, the word sequence (i, j) minus SET4 is identified as MinPA of (a, b). This is indicated by block 402.
  • MinPA is then extended to obtain MaxPA as discussed with respect to block 378 in FIG. 10. This is indicated by block 404 in FIG. 11.
  • Again, AP is obtained as all possible continuous substrings between MinPA and MaxPA, all of which contain MinPA. This is indicated by block 406 in FIG. 11. The union of MinPA, MaxPA and AP is then returned as indicated by block 408. Each of the possible non-continuous alignments returned is then scored as indicated by block 410.
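  • A sketch of the non-continuous procedure of FIG. 11 is given below; it reuses continuous_alignments( ) from the previous sketch, returns only the MinPA word positions (extension to MaxPA and substring enumeration proceed as in FIG. 10), and takes an arbitrary scoring function as a stand-in for equation (6).

```python
def non_continuous_alignments(a, b, anchors, target_len, score):
    """FIG. 11 sketch: build the MinPA for a non-continuous alignment of
    SL span (a, b)."""
    # SET1: TL words aligned to the fragment (a, b)
    set1 = sorted(t for s, t in anchors.items() if a <= s <= b)
    if not set1:
        return None
    i, j = set1[0], set1[-1]                   # left-most / right-most of SET1
    # SET2: SL words outside (a, b) whose alignments fall inside SET1's range
    set2 = sorted(s for s, t in anchors.items()
                  if not a <= s <= b and i <= t <= j)
    if not set2:
        return None                            # alignment is already continuous
    if set2 != list(range(set2[0], set2[-1] + 1)):
        return None                            # SET2 not continuous: block 390
    # SET3 is the continuous SL span covering SET2; SET4 its best TL alignment
    candidates = continuous_alignments(set2[0], set2[-1], anchors, target_len)
    set4 = max(candidates, key=score)          # blocks 394-396
    # MinPA of (a, b): TL positions (i, j) with SET4's span carved out
    return [t for t in range(i, j + 1) if not set4[0] <= t <= set4[1]]

anchors = {0: 0, 1: 4, 3: 2}                   # SL pos -> TL pos
print(non_continuous_alignments(0, 1, anchors, target_len=6,
                                score=lambda span: -(span[1] - span[0])))
# [0, 1, 3, 4]: SET1's range with the best alignment of SET3 removed
```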
  • In accordance with one embodiment of the present invention, the score associated with each of the possible alignments is indicated by the following equation:

  • $$\mathrm{Weight} = P(m \mid l)\; P(\Delta k \mid m, l)\; P(\Delta j \mid m, l) \tag{6}$$
  • where,
      • m = the length of the SL fragment;
      • l = the length of the TL fragment;
      • k = the number of content words in a fragment;
      • j = the number of functional words in a fragment;
      • Δj = |j of TL − j of SL|; and
      • Δk = |k of TL − k of SL|.
        However, other scoring techniques could be used as well.
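  • One plausible reading of equation (6) is sketched below; the probability tables and the stop-word test separating content words from functional words are toy assumptions standing in for models that would be estimated from aligned data.

```python
def alignment_weight(sl_frag, tl_frag, is_content, p_len, p_delta):
    """Equation (6) sketch: Weight = P(m|l) * P(dk|m,l) * P(dj|m,l)."""
    m, l = len(sl_frag), len(tl_frag)
    k_sl = sum(1 for w in sl_frag if is_content(w))   # content words, SL side
    k_tl = sum(1 for w in tl_frag if is_content(w))   # content words, TL side
    dk = abs(k_tl - k_sl)                             # delta k
    dj = abs((l - k_tl) - (m - k_sl))                 # delta j (functional words)
    back_off = 1e-6                                   # assumed unseen-event floor
    return (p_len.get((m, l), back_off)
            * p_delta.get(dk, back_off)
            * p_delta.get(dj, back_off))

stop_words = {"the", "of", "a", "的"}
print(alignment_weight(["type", "of", "tool"], ["工具", "的", "类型"],
                       lambda w: w not in stop_words,
                       p_len={(3, 3): 0.4}, p_delta={0: 0.6, 1: 0.3}))
# 0.4 * 0.6 * 0.6 = 0.144
```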
  • Finally, after replacing the source language words and phrases with the aligned target language words and phrases, the confidence level for each translation output is calculated. This can be done by translation component 222 or post processing component 224 in system 200. In any case, in one embodiment, the translation confidence level is determined as follows:
  • $$\mathrm{ConL} = c_1 \log(\mathrm{AlignCon} \cdot 10) + c_2 \log(\mathrm{TransPercent} \cdot 10) + c_3 \log\left(\frac{10}{\mathrm{Example\_num}}\right) + c_4 \log\left(\frac{10}{\mathrm{Valid\_block\_num}}\right) \tag{7}$$
    $$\mathrm{AlignCon} = \frac{\sum_{w_i \in \mathrm{PhrSL},\; w_j \in \mathrm{PhrTL},\; i \ldots j\ \text{connected}} \mathrm{Conf}(C_{ij})}{|\mathrm{PhrTL}|} \tag{8}$$
    $$\left(0 \le \mathrm{AlignCon} \le 1, \quad 0 \le \mathrm{TransPercent} \le 1, \quad \sum_{i=1}^{4} c_i = 1\right)$$
  • where,
    • ConL: translation confidence level;
    • c_1, c_2, . . . , c_4: constants;
    • AlignCon: alignment confidence level;
    • TransPercent: weighted translation percentage;
    • Example_num: the number of examples employed;
    • Valid_block_num: the number of fragments in the input string translation;
    • PhrSL: the SL phrase in the example that is related to the given input string;
    • PhrTL: the TL correspondence in the translation of the example;
    • |PhrTL|: the number of words in PhrTL;
    • C_ij: the connection between SL word i and TL word j; and
    • Conf(C_ij): the confidence level of that word alignment.
  • Thus, the translation confidence level is based on the alignment confidence level, the confidence of the aligned words, and the number of aligned and unaligned words in the target language correspondence. The system marks portions of the output with the confidence level, which allows the user to identify low-confidence translation outputs for particular scrutiny and the areas that require the user's attention.
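  • The confidence computation of equations (7) and (8) might be implemented as in the following sketch; the constants c_1 . . . c_4 and the sample inputs are illustrative only.

```python
import math

def alignment_confidence(connections, phr_tl_len):
    """Equation (8): average word-alignment confidence over the TL phrase.

    connections: {(sl_word, tl_word): Conf(C_ij)} for connected word pairs.
    """
    return sum(connections.values()) / phr_tl_len

def confidence_level(align_con, trans_percent, example_num, valid_block_num,
                     c=(0.25, 0.25, 0.25, 0.25)):
    """Equation (7): combine the four factors into one confidence score;
    c1..c4 must sum to 1, and the values used here are illustrative only."""
    return (c[0] * math.log(align_con * 10)
            + c[1] * math.log(trans_percent * 10)
            + c[2] * math.log(10 / example_num)
            + c[3] * math.log(10 / valid_block_num))

conns = {("anti-virus", "防病毒"): 0.9, ("tool", "工具"): 0.8}
align_con = alignment_confidence(conns, phr_tl_len=2)   # 0.85
print(confidence_level(align_con, trans_percent=0.9,
                       example_num=2, valid_block_num=3))
```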
  • It can thus be seen that the present invention employs an example matching method that enhances the example matching and retrieval performance both in quality and speed over prior systems. Similarly, the present invention employs a word/phrase alignment technique and a score function for selecting the best candidate in phrase alignment which also produces enhancements in accuracy and speed over prior systems. Finally, the present invention employs a translation confidence prediction method that indicates the quality of the translation generated by the machine, and also highlights some translation portions for scrutiny by the user.
  • Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

Claims (21)

1-14. (canceled)
15. A method of performing machine translation of a source language (SL) input to a translation output in a target language (TL), comprising:
selecting examples, from an example base, that match fragments of the SL input;
aligning TL portions of the selected examples with SL portions that match the fragments of the SL input by, for each example:
performing word alignment to identify anchor alignment points corresponding to words in the SL portion that are translations of words in the TL portion;
finding continuous alignments between the TL portion and the SL portion based on the anchor alignment points;
wherein finding continuous alignments comprises:
obtaining SL boundary information indicative of positions of words in the SL input that define a boundary for a fragment of the SL portion to be aligned;
obtaining TL boundary information identifying boundary positions of words in the TL portion of the example that are aligned with the SL portion, based on the anchor alignment points, to obtain a minimum possible alignment (MinPA);
identifying a maximum possible alignment (MaxPA) by extending boundaries identified by the TL boundary information until an inconsistent alignment anchor point is reached;
finding non-continuous alignments between the TL portion and the SL portion; and
translating the SL input to the translation output from the continuous and non-continuous alignments.
16. The method of claim 15 comprises:
generating a plurality of translation outputs based on the continuous and non-continuous alignments;
calculating a score for each translation output; and
selecting at least one translation output.
17. The method of claim 16 and further comprising:
calculating a confidence measure for the selected translation output; and
identifying one or more portions of the translation output that have a confidence measure below a threshold level.
18. (canceled)
19. (canceled)
20. (canceled)
21. The method of claim 15 wherein finding continuous alignments further comprises:
generating all alignments between MinPA and MaxPA, all of which include MinPA.
22. The method of claim 18 wherein finding all non-continuous alignments comprises:
identifying a word set in the TL portion of the example that corresponds to the SL portion to be aligned, based on the anchor alignment points.
23. The method of claim 22 wherein finding all non-continuous alignments further comprises:
identifying a word set in the SL portion of the example that aligns to a portion of the word set in the TL portion but is outside the SL boundary information.
24. The method of claim 23 wherein finding all non-continuous alignments further comprises:
if the word set in the SL portion is continuous, finding all possible continuous alignments for the word set in the SL portion and the TL portion of the example.
25. The method of claim 23 wherein finding all non-continuous alignments further comprises:
removing from the word set in the TL portion the words that align with the words in the SL portion that are outside the SL boundary information to obtain a minimum possible alignment (MinPA).
26. The method of claim 25 wherein finding all non-continuous alignments further comprises:
extending boundaries of MinPA, until an inconsistent alignment anchor point is reached, to obtain a maximum possible alignment (MaxPA).
27. The method of claim 26 wherein finding all non-continuous alignments further comprises:
generating continuous substrings from the TL portion between MinPA and MaxPA, all of which include MinPA.
28. The method of claim 15 wherein performing word alignment comprises:
accessing a bilingual dictionary to obtain dictionary information indicative of word translations between the SL portion and the TL portion of the example.
29. The method of claim 28 wherein word alignment further comprises:
if the TL portion of the example is in a non-segmented language, performing word segmentation on the example.
30. (canceled)
31. (canceled)
32. (canceled)
33. (canceled)
34. (canceled)
US11/935,938 2002-06-28 2007-11-06 Example based machine translation system Abandoned US20080133218A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/935,938 US20080133218A1 (en) 2002-06-28 2007-11-06 Example based machine translation system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/185,376 US7353165B2 (en) 2002-06-28 2002-06-28 Example based machine translation system
US11/935,938 US20080133218A1 (en) 2002-06-28 2007-11-06 Example based machine translation system

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/185,376 Continuation US7353165B2 (en) 2002-06-28 2002-06-28 Example based machine translation system

Publications (1)

Publication Number Publication Date
US20080133218A1 true US20080133218A1 (en) 2008-06-05

Family

ID=29779611

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/185,376 Expired - Fee Related US7353165B2 (en) 2002-06-28 2002-06-28 Example based machine translation system
US11/935,938 Abandoned US20080133218A1 (en) 2002-06-28 2007-11-06 Example based machine translation system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US10/185,376 Expired - Fee Related US7353165B2 (en) 2002-06-28 2002-06-28 Example based machine translation system

Country Status (3)

Country Link
US (2) US7353165B2 (en)
JP (2) JP4694111B2 (en)
CN (1) CN100440150C (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009149549A1 (en) * 2008-06-09 2009-12-17 National Research Council Of Canada Method and system for using alignment means in matching translation
US20100250232A1 (en) * 2009-03-25 2010-09-30 Fujitsu Limited Retrieval result outputting apparatus and retrieval result outputting method
US20120123766A1 (en) * 2007-03-22 2012-05-17 Konstantin Anisimovich Indicating and Correcting Errors in Machine Translation Systems
US20120143593A1 (en) * 2010-12-07 2012-06-07 Microsoft Corporation Fuzzy matching and scoring based on direct alignment
US20120226489A1 (en) * 2011-03-02 2012-09-06 Bbn Technologies Corp. Automatic word alignment
US20130231916A1 (en) * 2012-03-05 2013-09-05 International Business Machines Corporation Method and apparatus for fast translation memory search
US8825469B1 (en) * 2011-08-04 2014-09-02 Google Inc. Techniques for translating documents including tags
US20150178274A1 (en) * 2013-12-25 2015-06-25 Kabushiki Kaisha Toshiba Speech translation apparatus and speech translation method
US9235573B2 (en) 2006-10-10 2016-01-12 Abbyy Infopoisk Llc Universal difference measure
US9323747B2 (en) 2006-10-10 2016-04-26 Abbyy Infopoisk Llc Deep model statistics method for machine translation
US9495358B2 (en) 2006-10-10 2016-11-15 Abbyy Infopoisk Llc Cross-language text clustering
US9626358B2 (en) 2014-11-26 2017-04-18 Abbyy Infopoisk Llc Creating ontologies by analyzing natural language texts
US9626353B2 (en) 2014-01-15 2017-04-18 Abbyy Infopoisk Llc Arc filtering in a syntactic graph
US9633005B2 (en) 2006-10-10 2017-04-25 Abbyy Infopoisk Llc Exhaustive automatic processing of textual information
US9740682B2 (en) 2013-12-19 2017-08-22 Abbyy Infopoisk Llc Semantic disambiguation using a statistical analysis
US9817818B2 (en) 2006-10-10 2017-11-14 Abbyy Production Llc Method and system for translating sentence between languages based on semantic structure of the sentence
CN107908601A (en) * 2017-11-01 2018-04-13 北京颐圣智能科技有限公司 Participle model construction method, equipment, readable storage medium storing program for executing and the segmenting method of medical text

Families Citing this family (104)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4718687B2 (en) * 1999-03-19 2011-07-06 トラドス ゲゼルシャフト ミット ベシュレンクテル ハフツング Workflow management system
US20060116865A1 (en) 1999-09-17 2006-06-01 Www.Uniscape.Com E-services translation utilizing machine translation and translation memory
US7904595B2 (en) 2001-01-18 2011-03-08 Sdl International America Incorporated Globalization management system and method therefor
US8868543B1 (en) * 2002-11-20 2014-10-21 Google Inc. Finding web pages relevant to multimedia streams
JP2005100335A (en) * 2003-09-01 2005-04-14 Advanced Telecommunication Research Institute International Machine translation apparatus, machine translation computer program, and computer
JP3919771B2 (en) * 2003-09-09 2007-05-30 株式会社国際電気通信基礎技術研究所 Machine translation system, control device thereof, and computer program
CA2549769A1 (en) * 2003-12-15 2005-06-30 Laboratory For Language Technology Incorporated System, method, and program for identifying the corresponding translation
CN1661593B (en) * 2004-02-24 2010-04-28 北京中专翻译有限公司 Method for translating computer language and translation system
US7983896B2 (en) 2004-03-05 2011-07-19 SDL Language Technology In-context exact (ICE) matching
JP4076520B2 (en) * 2004-05-26 2008-04-16 富士通株式会社 Translation support program and word mapping program
GB2415518A (en) * 2004-06-24 2005-12-28 Sharp Kk Method and apparatus for translation based on a repository of existing translations
JP4473702B2 (en) * 2004-11-02 2010-06-02 株式会社東芝 Machine translation system, machine translation method and program
US7680646B2 (en) * 2004-12-21 2010-03-16 Xerox Corporation Retrieval method for translation memories containing highly structured documents
US20060206797A1 (en) * 2005-03-08 2006-09-14 Microsoft Corporation Authorizing implementing application localization rules
US8219907B2 (en) * 2005-03-08 2012-07-10 Microsoft Corporation Resource authoring with re-usability score and suggested re-usable data
WO2007004391A1 (en) * 2005-07-06 2007-01-11 Matsushita Electric Industrial Co., Ltd. Conversation support apparatus
US10319252B2 (en) 2005-11-09 2019-06-11 Sdl Inc. Language capability assessment and training apparatus and techniques
US8041556B2 (en) * 2005-12-01 2011-10-18 International Business Machines Corporation Chinese to english translation tool
WO2007068123A1 (en) * 2005-12-16 2007-06-21 National Research Council Of Canada Method and system for training and applying a distortion component to machine translation
US7536295B2 (en) * 2005-12-22 2009-05-19 Xerox Corporation Machine translation using non-contiguous fragments of text
JP2007233486A (en) * 2006-02-27 2007-09-13 Fujitsu Ltd Translator support program, translator support device and translator support method
US7711546B2 (en) * 2006-04-21 2010-05-04 Microsoft Corporation User interface for machine aided authoring and translation
US7827155B2 (en) * 2006-04-21 2010-11-02 Microsoft Corporation System for processing formatted data
US8171462B2 (en) * 2006-04-21 2012-05-01 Microsoft Corporation User declarative language for formatted data processing
US20070250528A1 (en) * 2006-04-21 2007-10-25 Microsoft Corporation Methods for processing formatted data
US8549492B2 (en) * 2006-04-21 2013-10-01 Microsoft Corporation Machine declarative language for formatted data processing
US9020804B2 (en) * 2006-05-10 2015-04-28 Xerox Corporation Method for aligning sentences at the word level enforcing selective contiguity constraints
US7542893B2 (en) * 2006-05-10 2009-06-02 Xerox Corporation Machine translation using elastic chunks
US7725306B2 (en) * 2006-06-28 2010-05-25 Microsoft Corporation Efficient phrase pair extraction from bilingual word alignments
US20080019281A1 (en) * 2006-07-21 2008-01-24 Microsoft Corporation Reuse of available source data and localizations
US8521506B2 (en) * 2006-09-21 2013-08-27 Sdl Plc Computer-implemented method, computer software and apparatus for use in a translation system
JP4481972B2 (en) 2006-09-28 2010-06-16 株式会社東芝 Speech translation device, speech translation method, and speech translation program
JP5082374B2 (en) * 2006-10-19 2012-11-28 富士通株式会社 Phrase alignment program, translation program, phrase alignment device, and phrase alignment method
GB2444084A (en) * 2006-11-23 2008-05-28 Sharp Kk Selecting examples in an example based machine translation system
US8600736B2 (en) * 2007-01-04 2013-12-03 Thinking Solutions Pty Ltd Linguistic analysis
US8788258B1 (en) 2007-03-15 2014-07-22 At&T Intellectual Property Ii, L.P. Machine translation using global lexical selection and sentence reconstruction
JP4971844B2 (en) * 2007-03-16 2012-07-11 日本放送協会 Example database creation device, example database creation program, translation device, and translation program
US8290967B2 (en) * 2007-04-19 2012-10-16 Barnesandnoble.Com Llc Indexing and search query processing
US20090326916A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Unsupervised chinese word segmentation for statistical machine translation
US9176952B2 (en) * 2008-09-25 2015-11-03 Microsoft Technology Licensing, Llc Computerized statistical machine translation with phrasal decoder
US20100082324A1 (en) * 2008-09-30 2010-04-01 Microsoft Corporation Replacing terms in machine translation
GB2468278A (en) 2009-03-02 2010-09-08 Sdl Plc Computer assisted natural language translation outputs selectable target text associated in bilingual corpus with input target text from partial translation
US9262403B2 (en) 2009-03-02 2016-02-16 Sdl Plc Dynamic generation of auto-suggest dictionary for natural language translation
US8185373B1 (en) * 2009-05-05 2012-05-22 The United States Of America As Represented By The Director, National Security Agency, The Method of assessing language translation and interpretation
US8874426B2 (en) * 2009-06-30 2014-10-28 International Business Machines Corporation Method for translating computer generated log files
CN101996166B (en) * 2009-08-14 2015-08-05 张龙哺 Bilingual sentence is to medelling recording method and interpretation method and translation system
WO2011029011A1 (en) * 2009-09-04 2011-03-10 Speech Cycle, Inc. System and method for the localization of statistical classifiers based on machine translation
TW201113870A (en) * 2009-10-09 2011-04-16 Inst Information Industry Method for analyzing sentence emotion, sentence emotion analyzing system, computer readable and writable recording medium and multimedia device
US10417646B2 (en) 2010-03-09 2019-09-17 Sdl Inc. Predicting the cost associated with translating textual content
CN102214166B (en) * 2010-04-06 2013-02-20 三星电子(中国)研发中心 Machine translation system and machine translation method based on syntactic analysis and hierarchical model
US20110264437A1 (en) * 2010-04-26 2011-10-27 Honeywell International Inc. System and method for translating an english language message into another language
US8375061B2 (en) * 2010-06-08 2013-02-12 International Business Machines Corporation Graphical models for representing text documents for computer analysis
US8554558B2 (en) * 2010-07-12 2013-10-08 Nuance Communications, Inc. Visualizing automatic speech recognition and machine translation output
US20120158398A1 (en) * 2010-12-17 2012-06-21 John Denero Combining Model-Based Aligner Using Dual Decomposition
JP5747508B2 (en) * 2011-01-05 2015-07-15 富士ゼロックス株式会社 Bilingual information search device, translation device, and program
US9128929B2 (en) 2011-01-14 2015-09-08 Sdl Language Technologies Systems and methods for automatically estimating a translation time including preparation time in addition to the translation itself
US9547626B2 (en) 2011-01-29 2017-01-17 Sdl Plc Systems, methods, and media for managing ambient adaptability of web applications and web services
US10657540B2 (en) 2011-01-29 2020-05-19 Sdl Netherlands B.V. Systems, methods, and media for web content management
US10580015B2 (en) 2011-02-25 2020-03-03 Sdl Netherlands B.V. Systems, methods, and media for executing and optimizing online marketing initiatives
US10140320B2 (en) 2011-02-28 2018-11-27 Sdl Inc. Systems, methods, and media for generating analytical data
US9552213B2 (en) * 2011-05-16 2017-01-24 D2L Corporation Systems and methods for facilitating software interface localization between multiple languages
US9984054B2 (en) 2011-08-24 2018-05-29 Sdl Inc. Web interface including the review and manipulation of a web document and utilizing permission based control
US20130275436A1 (en) * 2012-04-11 2013-10-17 Microsoft Corporation Pseudo-documents to facilitate data discovery
US9773270B2 (en) 2012-05-11 2017-09-26 Fredhopper B.V. Method and system for recommending products based on a ranking cocktail
US10261994B2 (en) 2012-05-25 2019-04-16 Sdl Inc. Method and system for automatic management of reputation of translators
US9317500B2 (en) * 2012-05-30 2016-04-19 Audible, Inc. Synchronizing translated digital content
US9116886B2 (en) * 2012-07-23 2015-08-25 Google Inc. Document translation including pre-defined term translator and translation model
US11386186B2 (en) 2012-09-14 2022-07-12 Sdl Netherlands B.V. External content library connector systems and methods
US10452740B2 (en) 2012-09-14 2019-10-22 Sdl Netherlands B.V. External content libraries
US11308528B2 (en) 2012-09-14 2022-04-19 Sdl Netherlands B.V. Blueprinting of multimedia assets
US9916306B2 (en) 2012-10-19 2018-03-13 Sdl Inc. Statistical linguistic analysis of source content
US9424360B2 (en) * 2013-03-12 2016-08-23 Google Inc. Ranking events
JP5850512B2 (en) * 2014-03-07 2016-02-03 国立研究開発法人情報通信研究機構 Word alignment score calculation device, word alignment device, and computer program
US10181098B2 (en) 2014-06-06 2019-01-15 Google Llc Generating representations of input sequences using neural networks
US9740687B2 (en) 2014-06-11 2017-08-22 Facebook, Inc. Classifying languages for objects and entities
US9864744B2 (en) 2014-12-03 2018-01-09 Facebook, Inc. Mining multi-lingual data
US10067936B2 (en) 2014-12-30 2018-09-04 Facebook, Inc. Machine translation output reranking
US9830386B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Determining trending topics in social media
US9830404B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Analyzing language dependency structures
US9477652B2 (en) 2015-02-13 2016-10-25 Facebook, Inc. Machine learning dialect identification
CN104866547B (en) * 2015-05-08 2019-04-23 湖北荆楚网络科技股份有限公司 A kind of filter method for combined characters class keywords
CN104850609B (en) * 2015-05-08 2019-04-23 湖北荆楚网络科技股份有限公司 A kind of filter method for rising space class keywords
JP2017058865A (en) * 2015-09-15 2017-03-23 株式会社東芝 Machine translation device, machine translation method, and machine translation program
US9734142B2 (en) * 2015-09-22 2017-08-15 Facebook, Inc. Universal translation
US10614167B2 (en) 2015-10-30 2020-04-07 Sdl Plc Translation review workflow systems and methods
US9690777B1 (en) * 2015-12-10 2017-06-27 Webinterpret Translating website listings and propagating the translated listings to listing websites in other regions
US10133738B2 (en) 2015-12-14 2018-11-20 Facebook, Inc. Translation confidence scores
US9734143B2 (en) 2015-12-17 2017-08-15 Facebook, Inc. Multi-media context language processing
JP2017120616A (en) * 2015-12-25 2017-07-06 パナソニックIpマネジメント株式会社 Machine translation method and machine translation system
US20170185587A1 (en) * 2015-12-25 2017-06-29 Panasonic Intellectual Property Management Co., Ltd. Machine translation method and machine translation system
US9805029B2 (en) 2015-12-28 2017-10-31 Facebook, Inc. Predicting future translations
US10002125B2 (en) 2015-12-28 2018-06-19 Facebook, Inc. Language model personalization
US9747283B2 (en) 2015-12-28 2017-08-29 Facebook, Inc. Predicting future translations
US10902221B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10902215B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
CN107818086B (en) * 2016-09-13 2021-08-10 株式会社东芝 Machine translation method and device
US10180935B2 (en) 2016-12-30 2019-01-15 Facebook, Inc. Identifying multiple languages in a content item
US10380249B2 (en) 2017-10-02 2019-08-13 Facebook, Inc. Predicting future trending topics
US10635863B2 (en) 2017-10-30 2020-04-28 Sdl Inc. Fragment recall and adaptive automated translation
US10817676B2 (en) 2017-12-27 2020-10-27 Sdl Inc. Intelligent routing services and systems
US10747962B1 (en) * 2018-03-12 2020-08-18 Amazon Technologies, Inc. Artificial intelligence system using phrase tables to evaluate and improve neural network based machine translation
CN112074840A (en) * 2018-05-04 2020-12-11 瑞典爱立信有限公司 Method and apparatus for enriching entities with alternative text in multiple languages
US11256867B2 (en) 2018-10-09 2022-02-22 Sdl Inc. Systems and methods of machine learning for digital assets and message creation
CN113778582B (en) * 2021-07-28 2024-06-28 赤子城网络技术(北京)有限公司 Setting method, device, equipment and storage medium for localized multi-language adaptation

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5541836A (en) * 1991-12-30 1996-07-30 At&T Corp. Word disambiguation apparatus and methods
US5659765A (en) * 1994-03-15 1997-08-19 Toppan Printing Co., Ltd. Machine translation system
US5907821A (en) * 1995-11-06 1999-05-25 Hitachi, Ltd. Method of computer-based automatic extraction of translation pairs of words from a bilingual text
US6182026B1 (en) * 1997-06-26 2001-01-30 U.S. Philips Corporation Method and device for translating a source text into a target using modeling and dynamic programming
US6330530B1 (en) * 1999-10-18 2001-12-11 Sony Corporation Method and system for transforming a source language linguistic structure into a target language linguistic structure based on example linguistic feature structures
US20020138250A1 (en) * 2001-03-19 2002-09-26 Fujitsu Limited Translation supporting apparatus and method and translation supporting program
US20020143537A1 (en) * 2001-03-30 2002-10-03 Fujitsu Limited Of Kawasaki, Japan Process of automatically generating translation- example dictionary, program product, computer-readable recording medium and apparatus for performing thereof
US6631346B1 (en) * 1999-04-07 2003-10-07 Matsushita Electric Industrial Co., Ltd. Method and apparatus for natural language parsing using multiple passes and tags
US6772180B1 (en) * 1999-01-22 2004-08-03 International Business Machines Corporation Data representation schema translation through shared examples

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3189186B2 (en) * 1992-03-23 2001-07-16 インターナショナル・ビジネス・マシーンズ・コーポレ−ション Translation device based on patterns
JPH1063669A (en) * 1996-08-21 1998-03-06 Oki Electric Ind Co Ltd Bilingual data base preparing device and translated example retrieving device
JPH10312382A (en) * 1997-05-13 1998-11-24 Keiichi Shinoda Similar example translation system
US6073096A (en) * 1998-02-04 2000-06-06 International Business Machines Corporation Speaker adaptation system and method based on class-specific pre-clustering training speakers
US6107935A (en) * 1998-02-11 2000-08-22 International Business Machines Corporation Systems and methods for access filtering employing relaxed recognition constraints
JPH11259482A (en) * 1998-03-12 1999-09-24 Kdd Corp Machine translation system for composite noun
US6195631B1 (en) * 1998-04-15 2001-02-27 At&T Corporation Method and apparatus for automatic construction of hierarchical transduction models for language translation
CN1302415C (en) * 2000-06-19 2007-02-28 李玉鑑 English-Chinese translation machine
US7295962B2 (en) * 2001-05-11 2007-11-13 University Of Southern California Statistical memory-based translation system

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9633005B2 (en) 2006-10-10 2017-04-25 ABBYY InfoPoisk LLC Exhaustive automatic processing of textual information
US9495358B2 (en) 2006-10-10 2016-11-15 ABBYY InfoPoisk LLC Cross-language text clustering
US9323747B2 (en) 2006-10-10 2016-04-26 ABBYY InfoPoisk LLC Deep model statistics method for machine translation
US9235573B2 (en) 2006-10-10 2016-01-12 ABBYY InfoPoisk LLC Universal difference measure
US9817818B2 (en) 2006-10-10 2017-11-14 ABBYY Production LLC Method and system for translating sentence between languages based on semantic structure of the sentence
US20120123766A1 (en) * 2007-03-22 2012-05-17 Konstantin Anisimovich Indicating and Correcting Errors in Machine Translation Systems
US9772998B2 (en) 2007-03-22 2017-09-26 ABBYY Production LLC Indicating and correcting errors in machine translation systems
US8959011B2 (en) * 2007-03-22 2015-02-17 ABBYY InfoPoisk LLC Indicating and correcting errors in machine translation systems
US8594992B2 (en) 2008-06-09 2013-11-26 National Research Council Of Canada Method and system for using alignment means in matching translation
WO2009149549A1 (en) * 2008-06-09 2009-12-17 National Research Council Of Canada Method and system for using alignment means in matching translation
US20110093254A1 (en) * 2008-06-09 2011-04-21 Roland Kuhn Method and System for Using Alignment Means in Matching Translation
US8566079B2 (en) * 2009-03-25 2013-10-22 Fujitsu Limited Retrieval result outputting apparatus and retrieval result outputting method
US20100250232A1 (en) * 2009-03-25 2010-09-30 Fujitsu Limited Retrieval result outputting apparatus and retrieval result outputting method
US20120143593A1 (en) * 2010-12-07 2012-06-07 Microsoft Corporation Fuzzy matching and scoring based on direct alignment
US8655640B2 (en) * 2011-03-02 2014-02-18 Raytheon Bbn Technologies Corp. Automatic word alignment
US20120226489A1 (en) * 2011-03-02 2012-09-06 Bbn Technologies Corp. Automatic word alignment
US8825469B1 (en) * 2011-08-04 2014-09-02 Google Inc. Techniques for translating documents including tags
US20130231916A1 (en) * 2012-03-05 2013-09-05 International Business Machines Corporation Method and apparatus for fast translation memory search
US8874428B2 (en) * 2012-03-05 2014-10-28 International Business Machines Corporation Method and apparatus for fast translation memory search
US9740682B2 (en) 2013-12-19 2017-08-22 ABBYY InfoPoisk LLC Semantic disambiguation using a statistical analysis
US20150178274A1 (en) * 2013-12-25 2015-06-25 Kabushiki Kaisha Toshiba Speech translation apparatus and speech translation method
US9626353B2 (en) 2014-01-15 2017-04-18 ABBYY InfoPoisk LLC Arc filtering in a syntactic graph
US9626358B2 (en) 2014-11-26 2017-04-18 ABBYY InfoPoisk LLC Creating ontologies by analyzing natural language texts
CN107908601A (en) * 2017-11-01 2018-04-13 Beijing Yisheng Intelligent Technology Co., Ltd. Word segmentation model construction method, device, readable storage medium and word segmentation method for medical text

Also Published As

Publication number Publication date
CN100440150C (en) 2008-12-03
JP2004038976A (en) 2004-02-05
JP4993762B2 (en) 2012-08-08
US7353165B2 (en) 2008-04-01
US20040002848A1 (en) 2004-01-01
CN1475907A (en) 2004-02-18
JP4694111B2 (en) 2011-06-08
JP2008262587A (en) 2008-10-30

Similar Documents

Publication Title
US7353165B2 (en) Example based machine translation system
EP1422634B1 (en) Statistical method and apparatus for statistical learning of translation relationships among phrases
US7496496B2 (en) System and method for machine learning a confidence metric for machine translation
US8548794B2 (en) Statistical noun phrase translation
US8185377B2 (en) Diagnostic evaluation of machine translators
US7050964B2 (en) Scaleable machine translation system
US7707025B2 (en) Method and apparatus for translation based on a repository of existing translations
EP1462948B1 (en) Ordering component for sentence realization for a natural language generation system, based on linguistically informed statistical models of constituent structure
US8275605B2 (en) Machine language translation with transfer mappings having varying context
EP0839357A1 (en) Method and apparatus for automated search and retrieval processing
US9311299B1 (en) Weakly supervised part-of-speech tagging with coupled token and type constraints
Piperidis et al. From sentences to words and clauses
KR100420474B1 (en) Apparatus and method of long sentence translation using partial sentence frame
Salloum et al. Unsupervised Arabic dialect segmentation for machine translation
JP2006127405A (en) Method for aligning bilingual parallel text, and program executable in a computer
JP2000250914A (en) Machine translation method and device, and recording medium recording a machine translation program
Slayden et al. Large-scale Thai statistical machine translation
JPH09223143A (en) Document information processor
Jawaid Statistical Machine Translation between Languages with

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034766/0001

Effective date: 20141014