EP2132657A1 - Machine learning for transliteration - Google Patents
Machine learning for transliteration
Info
- Publication number
- EP2132657A1 (EP08731575A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- word
- transliteration
- script
- source
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/16—Automatic learning of transformation rules, e.g. from examples
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
Definitions
- This invention relates to automatic transliteration of words from one writing system to another writing system.
- Electronic documents are typically written in many different languages.
- Each language is normally expressed in a particular writing system (i.e., a script), which is usually characterized by a particular alphabet.
- The English language is expressed using the Latin alphabet, while the Hindi language is normally expressed using the Devanagari alphabet.
- the scripts used by some languages include a particular alphabet that has been extended to include additional marks or characters.
- The French language is written using a script that includes the basic Latin alphabet (i.e., the 26 unaccented characters from A to Z, upper and lower case) and also includes diacritics (i.e., accented characters) and ligatures (e.g., Æ, Œ).
- keyboards or mobile devices are configured to generate characters of the basic Latin alphabet.
- These input devices are quite frequently used by users who want to produce characters and words in non-Latin based scripts (e.g., Indic, Russian, Hebrew, Chinese or Japanese).
- A user may not be able to use these input devices to conveniently produce the letters of the script that they prefer. Instead, the user will often use the input device to provide a character or character sequence that is a close substitute. For example, a user may provide AE in lieu of Æ.
- substitutions are a form of transliteration, whereby the script of one language (e.g., Latin alphabet) is used to express the script of another language (e.g., the French alphabet).
- the system receiving the substitute characters is often expected to transliterate the given characters into characters of the desired script.
- the rules and conventions of transliteration between scripts can vary even among the same two languages, often by geographic region and even from user to user.
- In some regions, a given Hindi word is expressed in the Latin alphabet as "Sharda", whereas in other regions the same Hindi word is expressed as "Sharada".
- the conventional approach for transliteration is to use rules, which specify that one or two particular characters in one script can be mapped to one or two particular characters in another script. These rules are typically provided by a language expert. This approach depends heavily on the expertise of the language expert or on cultural conventions.
- Embodiments feature methods, systems, apparatus, including computer program product apparatus. Each of these will be described in this summary by reference to the methods, for which there are corresponding systems and apparatus.
- one aspect of the subject matter described in this specification can be embodied in a method that includes receiving from a user an input of a sequence of multiple input characters entered in an input script. The sequence is terminated by entry of a word-break character where the word-break character is not part of the sequence. A transliteration model is used, after entry of the word-break character, to determine an output word in an output script from the sequence of multiple input characters.
- Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
- the transliteration model can include a plurality of segments, each segment mapping one or more characters of the input script to one or more characters of the output script. Each segment in the plurality of segments can correspond to a word pair in a corpus of word pairs, where each segment can have a score based on a frequency of occurrence of the word pair in the corpus of word pairs.
- Using the transliteration model can include generating potential transliterations from the segments, each potential transliteration being derived from a combination of one or more segments; and selecting the transliteration to use to determine the output word based on the scores of the segments in each of the potential transliterations.
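The generate-and-select step described above can be sketched as a small dynamic program. This is only an illustration under stated assumptions: the `segments` table, its scores, the Latin-to-Greek toy data, and the choice to combine segment scores by multiplication are all hypothetical and are not taken from the specification.

```python
def best_transliteration(word, segments):
    """Return (score, output) for the best-scoring way to cover `word`
    with known segments, or None if no combination of segments covers it.

    `segments` maps an input-script string to a list of
    (output-script string, score) candidates. Scores are multiplied,
    which is an assumption made for this sketch.
    """
    # best[i] holds the best (score, output) found for the prefix word[:i].
    best = {0: (1.0, "")}
    for end in range(1, len(word) + 1):
        for start in range(end):
            if start not in best:
                continue  # the prefix word[:start] cannot be covered
            for target, score in segments.get(word[start:end], []):
                candidate = (best[start][0] * score, best[start][1] + target)
                if end not in best or candidate[0] > best[end][0]:
                    best[end] = candidate
    return best.get(len(word))


# Hypothetical toy segment table (Latin input, Greek output).
toy_segments = {
    "th": [("θ", 0.9)],
    "t": [("τ", 0.8)],
    "h": [("η", 0.3)],
    "e": [("ε", 0.7), ("η", 0.2)],
    "a": [("α", 0.9)],
}
result = best_transliteration("thea", toy_segments)
```

Keeping only the best candidate per prefix avoids enumerating every combination of segments, at the cost of returning a single transliteration rather than a ranked list.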
- the transliteration model can include a dictionary having entries in the input script and, for each entry, a corresponding word in the output script.
- the word-break character can be a space character or an end-of-sentence character.
- the sequence of multiple input characters in a user interface can be replaced with the output word in the output script.
- another aspect of the subject matter described in this specification can be embodied in a method that includes deriving multiple word pairs from multiple electronic documents that contain parallel text.
- the parallel text including text in a first script corresponding to text in a different, second script.
- a similarity score between the words in each word pair is determined based on a phonetic metric value of each word in the word pair.
- Word pairs are used that have a similarity score satisfying a threshold criterion for automatic transliteration.
- Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
- Each phonetic metric value can be a soundex value.
- Deriving word pairs from multiple electronic documents can include aligning text within each document to identify text that is parallel; and deriving word pairs based on word alignments between parallel text.
- Deriving word pairs from multiple electronic documents can include using phonetic metric scoring and matching to align corresponding word pairs in unstructured text.
- the phonetic metric scoring can be a soundex scoring.
- each word pair in the corpus includes a source word and a target word.
- Each source word is specified in a source script and each target word is a transliteration of the corresponding source word in a different, target script.
- Relevant word pairs from the corpus are selected. Selection includes excluding trivial words in the corpus, where trivial words comprise one-letter words and numerical characters, and selecting the word pairs based on how frequently the source words of the word pairs occur in the corpus. The relevant word pairs are ranked for use in automatic transliteration.
- Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
- Trivial words can include acronyms.
- The corpus of word pairs can include user-generated word pairs. Multiple possible transliterations for a source word can be provided to a user. A selection of a first transliteration from among the multiple transliterations can be received from the user. A word pair comprising the source word and the first transliteration is added to the corpus of word pairs. The frequencies of source words can be measured based on a number of documents in which the source words occur. Selecting relevant word pairs can include selecting additional word pairs from the corpus based on a randomized statistically biased selection. Selecting relevant word pairs can include filtering from the selected word pairs based on the respective sources of the word pairs.
- Another aspect of the subject matter described in this specification can be embodied in a method that includes generating a training model from ranked word pairs.
- Each word pair in the ranked word pairs includes a source word and a target word.
- Each source word is specified in a source script and each target word is a transliteration of the corresponding source word in a different, target script.
- The training model includes alignments between the letters of each of a plurality of source words and the letters of the corresponding target word.
- Generating the training model includes generating alignments from each of multiple word pairs including: for each word pair, matching the letters from the source word with the letters of the target word of the word pair. The letters are matched based on a statistical likelihood that one or more letters in the source word co-occur with one or more letters in the target word.
- Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
- the statistical likelihood can be measured by Dice coefficients.
- The letter-to-letter matches can include a k-to-n alignment, where k and n are each integers greater than 2. Some characters in the target script can be ignored or skipped in determining the alignment of letters.
- Pre-determined consonant maps can be used to map specific letters from source words to target words.
- Another aspect of the subject matter described in this specification can be embodied in a method that includes clustering users into groups based on usage patterns of the users in selecting or correcting transliterations. A transliteration of a word for a first user in a first group is automatically corrected based on corrections made by other users in the first group. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
- another aspect of the subject matter described in this specification can be embodied in a method that includes clustering users into groups by identifying geographic locations of the users. A transliteration of a word for a first user in a first group is automatically corrected based on corrections made by other users in the first group. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
- Another aspect of the subject matter described in this specification can be embodied in a method that includes recording word pairs for transliteration.
- Each word pair has a source word in a source script and one or more target words in a different, target script.
- the method includes generating an entry-aligned dictionary of transliterations.
- the dictionary includes, for every source word in the dictionary, a single target word. Whenever a particular source word is mapped to multiple target words, then the entry-aligned dictionary includes an entry for each target word, where each entry includes the same source word repeated in each entry.
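A minimal sketch of the entry-aligned layout described above, assuming the recorded word pairs are available as a mapping from each source word to a list of target words. The function name `entry_align` and the stand-in target strings are hypothetical.

```python
def entry_align(word_pairs):
    """Flatten {source: [target, ...]} into one (source, target) entry
    per target word, repeating the source word in each entry."""
    entries = []
    for source, targets in word_pairs.items():
        for target in targets:
            entries.append((source, target))
    return entries


# Stand-in target strings; real entries would hold words in the target script.
recorded = {"sharda": ["target_1", "target_2"], "ghar": ["target_3"]}
aligned = entry_align(recorded)
```

Every entry then has exactly one source word and one target word, so entries can be stored, ranked, or filtered individually.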
- Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
- the entry-aligned dictionary of transliterations can include parts of a global dictionary of transliterations.
- the entry-aligned dictionary of transliterations can include a user's dictionary of transliterations.
- another aspect of the subject matter described in this specification can be embodied in a method that includes generating a transliteration model based on statistical information derived from a corpus of parallel text having first text in an input script and corresponding second text in an output script.
- the transliteration model is used to transliterate a sequence of input characters in the input script to a sequence of output characters in the output script.
- Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
- Multiple input words can be identified from the sequence of input characters.
- a first portion of the multiple input words can be transliterated, using the transliteration model, based on one or more of: 1) a second portion of the multiple input words preceding the first portion, or 2) a third portion of the multiple input words following the first portion.
- Each of the first, second and third portions corresponds to a word, a phrase, or a sentence in the multiple input words.
- a transliteration of the first portion can be selected from a plurality of potential transliterations of the first portion based on a statistical likelihood that a potential transliteration in the plurality of potential transliterations co-occurs in the corpus with a transliteration of the second portion preceding the first portion.
- the rules that govern transliteration are automatically learned from a corpus of examples.
- the rules that govern transliteration are also learned and improved through use and user interaction.
- Dynamic rule sets enable transliteration to adapt to the dynamic nature of language and the varying expectations of users.
- Transliteration rules can be automatically customized for each individual user. Groups of users can be identified, based on geographical location or usage patterns, and can be provided with transliterations that are more likely to meet the particular expectations of users in the group.
- Transliteration rules can be provided to a client, such as a web browser, to provide interactive and timely transliterations. Common transliterations can be cached to further expedite transliteration. Common transliterations can be provided at least in part to a client to efficiently enable interactive transliteration.
- FIG. 1 is a diagram of a user interface for receiving text for transliteration.
- FIG. 2 is a diagram of alignment between characters of a target and source word.
- FIG. 3 is a diagram of segmentations of the aligned words shown in FIG. 2.
- FIG. 4A is a diagram of segmentations derived from multiple word pairs.
- FIG. 4B is a diagram of partially generated potential transliterations.
- FIG. 5 is a flow diagram for selecting relevant word pairs from a corpus.
- FIG. 6 is a flow diagram for transliterating words.
- FIG. 7 shows an input string from which two potential transliterations are derived.
- FIG. 8 shows a hierarchy of groups and their associated dictionaries.
- FIG. 9 is a block diagram of a transliteration system.
- an exemplary graphical user interface 100 includes a text box 110 for receiving text-based user input.
- the graphical user interface 100 can be that of a web page rendered by a web browser 120 or, in other implementations, can be a part of a stand alone application.
- the textual user input is provided in a particular input script (e.g., using the Latin alphabet).
- text is provided by a user using an input device (e.g., a keyboard, a mouse, stylus, or microphone).
- Exemplary user input 130 is shown displayed in the text box, representing text received from a user in a particular input script (e.g., Latin alphabet).
- the user interface also includes a selection list 140.
- the selection list includes one or more transliterations 145A, 145B.
- Each transliteration is a string that includes characters in a script other than the input script.
- The exemplary transliterations 145 are strings in an Indic script, e.g., Devanagari, that ideally correspond to the Latin input 130.
- Transliteration is, in general, an imprecise process that can be dependent on context of both the transliterated string and the expectations of the user.
- the expectations of a user may be shaped by social norms, personal habits, regional practices or any number of external influences.
- The transliterations 145 presented in the selection list 140 can be presented in an order that reflects the likelihood that the transliteration correctly corresponds to one or more words in the input string 130. Whenever a user selects any but the first transliteration in the selection list, that selection can be recognized as a correction. For example, the transliteration 145A is presented first because it is considered the most likely transliteration of the input string 130. If a user selects another transliteration 145B, that selection represents a correction, namely that the second transliteration 145B is considered by the user as a more accurate transliteration than the first transliteration 145A.
- User corrections can be recorded to improve the accuracy of subsequent transliterations.
- a record of user corrections identifies characteristics of the correction including the input word (or source word) as well as the transliterated word (or target word) that was selected by the user.
- correction records generated by multiple users can also include other statistical information. Statistical information can include how many users made the correction and how frequently the correction occurred both absolutely and relatively to the number of times the transliteration was presented (but not necessarily selected).
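One way to picture the correction records described above is a small log that counts presentations and selections. This is a sketch only: the class name `CorrectionLog`, its fields, and the stand-in transliteration strings are assumptions made for illustration.

```python
from collections import Counter


class CorrectionLog:
    """Records which transliterations were shown and which were chosen
    instead of the top-ranked candidate (hypothetical structure)."""

    def __init__(self):
        self.presented = Counter()  # (source, target) -> times shown
        self.corrected = Counter()  # (source, target) -> times chosen as a correction
        self.users = {}             # (source, target) -> set of correcting users

    def record(self, user, source, candidates, chosen_index):
        """Log one interaction; return True if it counts as a correction."""
        for target in candidates:
            self.presented[(source, target)] += 1
        # Selecting anything but the first candidate is treated as a correction.
        is_correction = chosen_index != 0
        if is_correction:
            key = (source, candidates[chosen_index])
            self.corrected[key] += 1
            self.users.setdefault(key, set()).add(user)
        return is_correction

    def relative_frequency(self, source, target):
        """Corrections relative to the number of times the target was presented."""
        shown = self.presented[(source, target)]
        return self.corrected[(source, target)] / shown if shown else 0.0


log = CorrectionLog()
first = log.record("user_1", "sharda", ["candidate_A", "candidate_B"], 1)
second = log.record("user_2", "sharda", ["candidate_A", "candidate_B"], 0)
```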
- a user may manually correct a particular transliteration by adding, removing or replacing characters in a transliterated word.
- a user may use a letter-level transliteration software or a software keyboard to insert individual letters into a transliterated word.
- Such manually corrected transliterations are also recognized as corrections and can be recorded as such.
- Context information can include how a user provided a correction (e.g. selection compared to manual correction) and the time the user provided the corrections.
- the context information can be used to rank corrections and determine their relative relevance and confidence.
- User input received from the user can be transliterated on a word by word basis. For example, all of a user's text immediately preceding a word-break character (e.g., punctuation, a space, carriage return, end-of-line or end-of-file character) can be transliterated at once as a complete word, even while the user continues to provide additional input. In other implementations, the entire user input provided is transliterated at once (e.g., when the user submits the input or explicitly selects to have user input transliterated on demand). For example, a user can position a cursor over a particular word, and in response, the selection list 140 of transliterations can be presented. In other implementations, word fragments can be transliterated before the user has provided input that completes the word.
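The word-by-word behavior can be sketched as follows, assuming a callable `transliterate_word` stands in for the transliteration model. The `WORD_BREAKS` set and the function name are hypothetical, and the sketch covers only the first mode described (transliterating each word once its word-break character arrives).

```python
# Hypothetical set of word-break characters for this sketch.
WORD_BREAKS = set(" \n\r.,;:!?")


def transliterate_stream(chars, transliterate_word):
    """Transliterate each completed word as soon as a word-break character
    is seen; a trailing, unterminated word is left in the input script."""
    output = []
    pending = []
    for ch in chars:
        if ch in WORD_BREAKS:
            if pending:
                output.append(transliterate_word("".join(pending)))
                pending = []
            output.append(ch)  # the word-break character is not part of the word
        else:
            pending.append(ch)
    output.extend(pending)
    return "".join(output)


# str.upper is used here purely as a visible stand-in for a real model.
preview = transliterate_stream("namaste duniya", str.upper)
```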
- Transliteration can be performed between any two scripts where the letters of one script can be expressed using a combination of letters in another script.
- Latin and an Indic alphabet will be used to illustrate concepts of automatic machine-assisted transliteration.
- Source-words, specified in Latin characters, are transliterated to target-words, specified in Indic characters. Note, however, that the methods and processes described below can apply, in general, between any two differing scripts where transliteration is applicable.
- FIG. 5 shows a process 500 for selecting relevant word pairs from a training corpus of word pairs for use with automatic transliteration learning algorithms.
- Each training pair has a word-set, a source-word and one or more target-words.
- The following description assumes a single target-word.
- The source-word is specified in the source script and the target-word is a transliteration of the source-word.
- The target-word is in the target script.
- Word pairs in the corpus can be derived from a variety of sources including existing electronic documents and recorded interaction with individual users (e.g., transliteration corrections).
- word pairs are automatically derived from electronic documents, such as documents that include parallel text (e.g., text in one script corresponding to a transliteration of text in another script).
- Publicly accessible web pages that contain parallel text can include language instruction material and transliteration guidance (e.g., governmental, corporate and academic literature).
- Suitable documents can be identified based on whether the document includes two different scripts.
- Well-known text and word alignment techniques can be used to align text within the document and determine whether the text is parallel (e.g., whether the text in one script is likely the translation of text in the other script).
- Word pairs can be derived based on word alignments between parallel text. Word pairs can be verified by comparing each word's soundex value (or other phonetic metric).
- Phonetic metric scoring can be used to align and match corresponding word pairs in unstructured text. For example, the soundex scores of words are used to determine a similarity score for the words in a potential word pair. A potential word pair whose similarity score exceeds a particular criterion threshold can be identified and recorded. Using soundex scoring can help prevent erroneous word pairs (e.g., incorrectly transliterated words) from being subsequently used during automatic transliteration.
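As an illustration of soundex-based verification, the sketch below implements the standard American Soundex code for Latin-script strings and a simple similarity score. Two assumptions are made here and are not from the specification: both words are available in Latin letters (for instance, after some romanization step for the target word), and similarity is measured as the fraction of matching positions in the two four-character codes.

```python
# Standard Soundex digit groups; vowels, h, w and y carry no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character American Soundex code of a Latin-script word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets the previous digit to ""
        if len(code) == 4:
            break
    return code.ljust(4, "0")


def soundex_similarity(word_a, word_b):
    """Fraction of matching positions between the two Soundex codes
    (a hypothetical similarity measure for this sketch)."""
    a, b = soundex(word_a), soundex(word_b)
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / 4.0


def is_plausible_pair(word_a, word_b, threshold=0.75):
    """Keep a candidate word pair only if its similarity meets the threshold."""
    return soundex_similarity(word_a, word_b) >= threshold
```

Under this sketch the two spellings mentioned earlier, "Sharda" and "Sharada", receive the same code and are therefore treated as a plausible match.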
- The corpus of word pairs can include user-generated word pairs.
- A user-generated word pair is derived when a user provides or selects one or more transliterations for a particular source-word specified in the input text. For example, a user selecting one of several possible transliterations for a particular input word (e.g., as described in reference to FIG. 1) generates a word pair between the input word and the selected transliteration.
- User generated word pairs can also be provided by an expert user.
- the process 500 includes omitting or ignoring trivial words in the corpus (step 510).
- Trivial words are words from which meaningful transliteration information cannot be acquired. Trivial words include one letter words and numerical characters. Acronyms can also be ignored.
- From the remaining word pairs in the corpus several word pairs can be selected based on how frequently the source-word occurs in the corpus (step 520). In some implementations, selection is based on how often the word appears anywhere in the corpus (e.g., all instances in all documents in the corpus). In other implementations, selection is based on the number of unique documents in which the word occurs (e.g., multiple instances of the same word in a particular document count only as one occurrence).
- the top 90% of all non-unique words can be selected.
- the number of selected words may be significantly less than the total number of distinct words that occur in the corpus. For example, some estimate that in English fewer than 5,000 unique words are used in 80% of all written texts.
- The process 500 includes selecting additional word pairs from the corpus based on any sampling method such as a randomized statistically biased selection (e.g., the higher the frequency, the higher the probability of selection) (step 530). For example, an additional 5% of words can be selected that are both non-trivial and not selected (e.g., not among the top 90%). Thus, if 10,000 non-trivial words occur in less than 10% of all documents, then an additional 500 words are randomly selected from the 10,000 words.
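Steps 510 through 530 can be sketched together as below. Several details are assumptions made for this sketch: documents are lists of source-word tokens, frequency is counted per unique document, "top 90%" is read as the top fraction of distinct words ranked by that frequency, and the extra words are drawn without replacement with probability proportional to frequency. All function and parameter names are hypothetical.

```python
import random
from collections import Counter


def document_frequency(documents):
    """Count, for each word, the number of documents it occurs in."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # repeated words in one document count once
    return df


def is_trivial(word):
    """Step 510: one-letter words and purely numeric tokens are trivial."""
    return len(word) <= 1 or word.isdigit()


def select_source_words(documents, top_fraction=0.9, extra_fraction=0.05, seed=0):
    df = document_frequency(documents)
    ranked = sorted((w for w in df if not is_trivial(w)),
                    key=lambda w: (-df[w], w))
    # Step 520: keep the most frequent words.
    cut = int(len(ranked) * top_fraction)
    selected, rest = ranked[:cut], ranked[cut:]
    # Step 530: frequency-biased random sample from the remaining words.
    rng = random.Random(seed)
    n_extra = min(len(rest), round(len(rest) * extra_fraction))
    for _ in range(n_extra):
        pick = rng.choices(rest, weights=[df[w] for w in rest], k=1)[0]
        rest.remove(pick)
        selected.append(pick)
    return selected


docs = [["namaste", "a", "42", "ghar"],
        ["namaste", "ghar", "namaste"],
        ["namaste", "pani"]]
frequencies = document_frequency(docs)
```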
- the process 500 includes filtering from the selected word pairs based on the source of a word pair (step 540).
- The sources from which each word pair originates can be grouped into entities. Words that originate from users can be grouped according to the particular user. Words that originate from web pages or documents can be grouped according to an associated characteristic of the document (e.g., domain name, article, author, directory, or database). Words that have been used by only a few entities (e.g., three or fewer) can be filtered (e.g., ignored or omitted). Alternatively, a squashing function can be used to score each word based on how often the word occurs both across different entities and within a particular entity, and words below a pre-defined score can be filtered.
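A minimal sketch of the entity-based filter in step 540, assuming each word pair has already been tagged with the entity identifiers (users, domain names, and so on) it came from. The function name, the input shape, and the stand-in identifiers are hypothetical; the default cutoff follows the "three or fewer" example.

```python
def filter_by_entities(pair_sources, min_entities=4):
    """Keep only word pairs used by at least `min_entities` distinct entities.

    `pair_sources` maps a word pair to an iterable of entity identifiers.
    With the default, pairs seen in three or fewer entities are dropped.
    """
    return {pair: sorted(set(entities))
            for pair, entities in pair_sources.items()
            if len(set(entities)) >= min_entities}


sources = {
    ("sharda", "target_1"): ["user:1", "user:2", "example.org", "example.com"],
    ("sharda", "target_2"): ["user:1", "user:1", "user:3"],
}
kept = filter_by_entities(sources)
```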
- Each of the word pairs can be weighted based on their source (e.g., particular user or location). For example, the word pairs provided by a language expert or derived from a user correction (e.g., as described in reference to FIG. 1) can be given more weight compared to the same word pair derived from another source.
- The process 500 includes filtering from the selected word pairs based on the frequency of a word pair in the corpus (step 550) (e.g., based on how often the target-word or source-word appears in the corpus).
- A threshold can be used to filter all word pairs that include a word that infrequently occurs in the corpus.
- A word pair can be filtered if the target-word occurs proportionally very rarely compared to other target-words that all share the same source-word.
- A word pair can be filtered if the target-word occurs proportionally rarely compared to all other words in the target script (e.g., words that occur less than 2% of the time, compared to all other words in the same script).
- all of the above filtering techniques can be used as an aggregate of signals.
- a single filtering function can be used to score a word pair based on its signals, whereby any word pair with sufficiently low score is subsequently omitted.
- the remaining selected word pairs are ranked (step 560).
- the rank of a word pair is a function of the number of times the word pair occurs in the corpus, a confidence signal and the weight of the word pair.
- the confidence signal is based on the number of unique word-pair sources (e.g., distinct users and document sources) which have used the transliteration represented by the word pair.
- word pairs can be ranked according to a squashing function (e.g., using values 1, 10 -> 2, 100 -> 10).
- The number of unique word-pair sources can be squashed to some small, maximal value for frequently occurring word-pairs, while the values of less frequently occurring words are boosted relatively.
- the squashing function is a non-linear function used to normalize linear predictions into probabilities (e.g., that range between 0 and 1).
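The ranking in step 560 can be pictured with the sketch below. The specification does not give a formula, so everything here is an assumption: the saturating form of `squash`, its `scale` parameter, the logarithmic damping of the occurrence count, and the multiplicative combination of the three signals.

```python
import math


def squash(x, scale=10.0):
    """Map a non-negative count into [0, 1); large counts saturate near 1.
    The exponential form is a hypothetical choice for this sketch."""
    return 1.0 - math.exp(-x / scale)


def rank_word_pairs(stats):
    """Order word pairs from highest to lowest rank score.

    `stats` maps a word pair to (occurrences, unique_sources, weight).
    """
    def score(item):
        occurrences, unique_sources, weight = item[1]
        confidence = squash(unique_sources)
        return weight * confidence * math.log1p(occurrences)

    return [pair for pair, _ in sorted(stats.items(), key=score, reverse=True)]


example_stats = {
    "pair_a": (100, 20, 1.0),  # frequent, many independent sources
    "pair_b": (100, 2, 1.0),   # frequent, few sources
    "pair_c": (5, 2, 1.0),     # rare, few sources
}
ordering = rank_word_pairs(example_stats)
```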
- a training model is generated using the ranked word pairs.
- the training model includes alignments between the letters of a source word and the letters of the source word's corresponding target word.
- An alignment between source letters and target letters ideally identifies letter transliterations (e.g., the source letters are a transliteration of the target letters and vice-versa).
- the letters from the source word are matched with the letters of the target word.
- Letters are matched based on the statistical likelihood that one or more letters in the source word co-occur with one or more letters in the target word.
- co-occurrence probabilities are measured by Dice coefficients.
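For reference, the Dice coefficient of two events X and Y is 2·|X ∩ Y| / (|X| + |Y|). The sketch below applies it to letter co-occurrence, treating X as the set of word pairs whose source word contains a given source unit and Y as the set whose target word contains a given target unit. Counting at the word-pair level and the Latin-to-Greek toy pairs are assumptions made for illustration.

```python
def dice(pairs, source_unit, target_unit):
    """Dice coefficient for a source unit and a target unit co-occurring
    in the same (source_word, target_word) pair."""
    with_source = sum(1 for s, _ in pairs if source_unit in s)
    with_target = sum(1 for _, t in pairs if target_unit in t)
    with_both = sum(1 for s, t in pairs
                    if source_unit in s and target_unit in t)
    total = with_source + with_target
    return 2.0 * with_both / total if total else 0.0


# Hypothetical toy corpus of (source, target) word pairs.
toy_pairs = [("ma", "μα"), ("na", "να"), ("mi", "μι")]
```

In this toy corpus 'm' and 'μ' always appear together, so their coefficient is 1.0, while 'a' and 'μ' share only one of the pairs in which either occurs and score 0.5.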
- normal alignment techniques are relatively unconstrained and purely Dice-based alignment can be error-prone.
- Letter-level alignment is a many-to-many mapping of characters; in practice, however, alignments are typically one-to-one, two-to-one, one-to-two, one-to-three, or three-to-one mappings.
- In determining the alignment of letters, some characters in the target script can be ignored or skipped. For example, in some Indic scripts, a class of characters known as viramas can be skipped during alignment. Even if viramas are skipped for alignment, they may still be considered during subsequent analysis (e.g., distance scoring and segmentation, as described below).
- Pre-determined consonant maps can be used to map specific characters from the source word to characters of the target word. Generally, consonants produce well-defined sounds. The consonants of one script map to one or a small number of consonant letters in another script. Consonant maps can be pre-determined by an expert user, or can be learned in a separate consonant mapping process. Consonant maps provide additional constraints during alignment, requiring a specific consonant in the source word to map to one of a specific set of consonants in the corresponding word. Using consonant maps reduces the number of potential alignments, reducing the search space, increasing efficiency and reducing the likelihood of alignment error. When the characters of a word in both the source and target scripts are pronounced in the order written (e.g.
- a monotonic constraint can be used to constrain alignment mapping.
- the following description assumes that both source and destination are in the same direction.
- the monotonic constraint requires that the beginning and end of a source and corresponding target word align. Moreover, the character preceding an aligned sub-part of the source word must align with the preceding character of the corresponding sub-part of the target word.
- the monotonic constraint makes alignment mapping a smaller, linear, chained-alignment problem.
- the alignment problem can be treated as a discrete or non-linear, constrained optimization problem, and techniques like BFGS (Broyden-Fletcher-Goldfarb-Shanno method), simulated annealing, SPSA (simultaneous perturbation stochastic approximation) can be applied to finding an optimal or near optimal solution.
- a monotonic constraint is used as a potential field (energy field) when aligning word-pairs using a constraint-based optimization. Under the monotonic constraint a measure of distance between the first (and last) character of one word and the first (and last) character of the corresponding word is zero.
- the distance between corresponding consonants is also zero.
- the distances of all other characters are measured with respect to these zero points.
- The probability of a character in one word mapping to a character in the corresponding word is highest if their respective distances from corresponding zero points are the same. The probability decreases as the difference in distances increases.
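The distance-based preference can be sketched as a position score. This is a simplified illustration under stated assumptions: only the word boundaries are used as zero points (aligned consonants are ignored), positions are normalized by word length, and the score decays exponentially with the difference in positions. The function name and the `sharpness` parameter are hypothetical, and the result is an unnormalized score rather than a true probability.

```python
import math


def position_score(i, source_len, j, target_len, sharpness=4.0):
    """Score how well source index i lines up with target index j.

    The score is 1.0 when both characters sit at the same relative
    distance from the word boundaries and shrinks as they diverge.
    """
    source_pos = i / (source_len - 1) if source_len > 1 else 0.0
    target_pos = j / (target_len - 1) if target_len > 1 else 0.0
    return math.exp(-sharpness * abs(source_pos - target_pos))
```

A score like this could be multiplied with a co-occurrence measure so that candidate alignments far from the diagonal are penalized.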
- Using the monotonic constraint to set distance values makes the alignment mapping a smaller optimization problem.
- silent characters like viramas can be used to modify the distance functions.
- additional constraint rules can be used to simplify the alignment mapping.
- the inherent language-based characteristics of a script can be used to derive special constraints.
- matras are characters that represent a phonetic modifier to a consonant.
- Special rules that map matras to particular characters can be used to improve alignment.
- Matras in an Indic-script word can be represented in a corresponding Latin-script transliteration as a vowel or as no character at all, depending on preceding characters. These conventions can be encoded as constraint rules.
- One such rule restricts which characters occur after a Latin character representing a corresponding Indic character. For example, the matra ' T' (in 3TRT) extends the
- a rule can indicate that the letter 'a' occurring after another letter that aligns with an Indic consonant character (e.g., ⁇ ) will most likely align
- FIG. 2 is an illustration 200 of an alignment between characters in a source word
- The source word 210 is specified in Latin script while the target word 240 is specified in Devanagari script.
- the target word 240 includes the ten characters 230A-230K. Note that the rendering of the target word 240 can appear to betray the actual order of the word's constituent characters.
- the ten individual characters 230A-230K are shown in their actual order (e.g., the character 230A is in a memory location successive to the memory location of character 230B).
- Some characters between the source and target words align one-to-one, such as the alignment 220 between the first 'n' in the source word 210 and character 230A.
- One or more letter alignments between a word pair can be grouped together producing a segmentation consisting of one or more contiguous alignments.
- the segmentation of a word pair effectively provides a mapping of a segment (e.g., one or more letters) from a source word to a segment in a target word.
- Each segmentation represents a transliteration that can potentially be applied to another source-word.
- a word pair may be used to generate multiple varying length overlapping segments; however, each segment obeys intra-word alignment boundaries.
- alignments between consonants are used to constrain segmentation.
- Consonant alignments are used as a boundary to limit segmentation, which effectively prevents coalescing letters on both sides of a consonant into a single segment.
- Each segment can be associated with an occurrence or frequency property whose value is based on how often the segment (e.g., a particular sequence of letters) occurs within the corpus. This property can be expressed as a segment prior probability derived from the number of times the segment occurs in the corpus relative to all other segments.
- Each segmentation can also be associated with an occurrence or frequency property whose value is based on the number of times the segmentation can be derived from word pairs in the corpus. This property can be expressed as a segmentation prior probability derived from the number of segmentations relative to all other segmentations.
- Each segment and segmentation can be associated with information about its conditional probability. The conditional probability of a segment indicates the probability that a particular series of target letters is generated given a particular series of source letters.
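- The prior and conditional probabilities described above can be estimated from simple counts. The sketch below assumes the corpus has already been reduced to a list of observed (source segment, target segment) pairs; the function name and the placeholder target tokens such as `T1` are hypothetical.

```python
from collections import Counter


def segment_statistics(segment_pairs):
    """Estimate prior and conditional probabilities from observed pairs.

    segment_pairs: iterable of (source_segment, target_segment) tuples,
    one per occurrence in the corpus.
    """
    pairs = list(segment_pairs)
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    total = len(pairs)

    # Prior: how often this mapping occurs relative to all observed mappings.
    prior = {pair: n / total for pair, n in pair_counts.items()}
    # Conditional: probability of the target letters given the source letters.
    conditional = {
        (src, tgt): n / source_counts[src]
        for (src, tgt), n in pair_counts.items()
    }
    return prior, conditional
```

- For example, if 'ni' is observed twice with `T1` and once with `T2`, the conditional probability of `T1` given 'ni' is 2/3.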
- Statistical similarity (co-occurrence) metrics such as Dice's coefficient, which measures the correlation between discrete events, can be used to measure the likelihood of a particular segment mapping to one or more corresponding segments.
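- Dice's coefficient for a source segment and a target segment is twice their co-occurrence count divided by the sum of their individual counts. A minimal sketch, with a hypothetical function name and count arguments:

```python
def dice_coefficient(pair_count: int, source_count: int, target_count: int) -> float:
    """Dice's coefficient: 2 * |X and Y| / (|X| + |Y|).

    pair_count:   times the source and target segments occur together
    source_count: times the source segment occurs at all
    target_count: times the target segment occurs at all
    """
    denominator = source_count + target_count
    return 0.0 if denominator == 0 else 2.0 * pair_count / denominator
```

- The value ranges from 0.0 (the segments never co-occur) to 1.0 (they only ever occur together).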
- Each potential segmentation can be scored based on the frequency of occurrences in the corpus and a confidence signal (e.g., how many times the segmentation is used by users). Segmentations whose scores are not enough to exceed a preset threshold can be removed, omitted or ignored.
- Segmentation rules can be used to aggregate segments. For example, in Indic scripts, a segmentation rule can specify that viramas, which are particular characters that occur before or after consonants, can be collapsed with (e.g., added to) their associated consonant into the same segment. Accents (e.g., a matra) that follow a consonant can be collapsed with the consonant. Accents and viramas can be recursively collapsed to generate larger segments.
- Individual segments can be associated with information identifying whether the segment is a prefix or suffix depending on whether the segment occurs most frequently at the beginning or end of the word.
- Common prefixes and suffixes can be identified from specific target-script letter sequences that frequently occur at the beginning or end of a word.
- a corresponding suffix or prefix in a source-script can be identified where the occurrence of a particular source-script letter sequence correlates with a corresponding occurrence of the target-script suffix or prefix.
- Prefixes and suffixes are automatically detected based on frequency of occurrence in the corpus and conditional probability correlation.
- a particular segmentation can be checked by computing a soundex value for the source segment and its corresponding target segment. Segmentations whose soundex values are determined to be significantly different can be removed, omitted or ignored. In addition to computing soundex values, other phonetic comparisons (e.g., pre-defined consonant maps, matra-vowel maps and syllable maps) can be used to verify segment mappings.
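- A soundex check of this kind can be sketched with the standard American Soundex code for Latin letters. The sketch assumes the target segment has already been converted to a Latin phonetic form by some other step (not shown), and it treats any difference in code as "significantly different"; both are simplifying assumptions, and the helper names are hypothetical.

```python
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            prev = code
    return ("".join(encoded) + "000")[:4]


def segments_sound_alike(source_segment: str, romanized_target: str) -> bool:
    """Keep a mapping only if both sides produce the same Soundex code."""
    return soundex(source_segment) == soundex(romanized_target)
```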
- character classes include consonants, vowels, consonant clusters (e.g., consecutive consonants), vowel clusters (e.g., consecutive vowels, such as those occurring for matras), accented characters, or viramas. For example, statistics identifying the probability that a particular consonant cluster follows another consonant cluster or that a particular accented character precedes a particular vowel can be collected.
- Statistical information can also be collected which describes the likelihood that a character or character class has particular characteristics with respect to the word in which the character is found (e.g., whether a character is usually accented, appears at the beginning or end of the word, or is followed or preceded by a virama). This statistical information can be generated for all corpora and can be used to determine whether a potential automatic transliteration is likely valid or not. This statistical information can also be verified to check validity and usefulness of particular segments. Automatic transliteration is described in further detail in reference to FIG. 6.
- Information about additional combinations or consonant clusters can be generated using one and two letter generation rules, which can include language specific information (e.g., accents and viramas). These generation rules can be provided by expert users.
- a global dictionary of common transliteration mappings can be recorded. That is, a source word that occurs in the corpus with sufficient frequency can be recorded in the global dictionary with the source word's corresponding target words.
- This global dictionary serves as a transliteration cache from which the transliteration of common words can be quickly and easily retrieved.
- a global dictionary can be generated for each script or corpus.
- FIG. 3 is an illustration 300 of segmentations of the aligned words shown in FIG. 2.
- the segmentation 320 includes the first two alignments and represents a mapping of 'ni' source-word characters to the characters 230A and 230B of the target word 240.
- the segmentation 330 includes the next three alignments.
- the segmentation 340 includes the last two alignments between the source and target word.
- the segmentation 350 includes the last five alignments and overlaps with the segmentation 340 (e.g., the last two alignments are in both segmentations). Notice that each segmentation obeys the character alignments between the words (e.g., no segmentation crosses an alignment boundary). Although only four segmentations are shown, in general, the word pair can be used to generate as many segmentations as possible (e.g., every combination of contiguous alignments).
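- Enumerating every run of contiguous alignments can be sketched as below. The input is assumed to be an ordered list of (source letters, target letters) alignments for one word pair; the function name and the upper-case placeholder target letters in the test data are hypothetical. Consonant-boundary constraints are not applied in this sketch.

```python
def contiguous_segmentations(alignments):
    """Return every segmentation formed from a contiguous run of alignments.

    alignments: ordered list of (source_letters, target_letters) pairs.
    Each result joins the source sides and the target sides of one run, so
    no segmentation ever splits an individual alignment.
    """
    result = []
    for start in range(len(alignments)):
        for end in range(start + 1, len(alignments) + 1):
            run = alignments[start:end]
            source = "".join(src for src, _ in run)
            target = "".join(tgt for _, tgt in run)
            result.append((source, target))
    return result
```

- A word pair with n alignments yields n(n+1)/2 segmentations this way, many of which overlap.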
- a process 600 for transliterating a source word includes receiving a source word from a user (step 610).
- the particular user can be identified, separate from all other users from which source words may be received.
- a user accessing transliteration through the use of a web browser can be identified through login authentication, session keys, cookies, IP addresses or a combination thereof.
- an identified user has a profile which can include a user transliteration dictionary.
- a user transliteration dictionary identifies particular source words and respective target words that the user has identified.
- the user's transliteration dictionary can include mappings that have been explicitly or implicitly identified by the user (e.g., when the user makes a correction).
- a user's transliteration dictionary may differ from the global dictionary.
- a user transliteration dictionary is described in further detail below in reference to FIG. 8. If the source word is found in the user's transliteration dictionary, the corresponding target word can be provided to the user (step 620). Otherwise, the source word is used to search the global dictionary of common transliteration mappings (step 630). If the source word is found in the global dictionary, the corresponding target word can be provided to the user.
- the global dictionary can include region-specific or group-specific dictionaries for regions or groups that the user may belong to. In one implementation, the more specific the group, the higher the priority of that dictionary for the user. The most specific of these is the user's personal dictionary, as described in reference to FIG. 8.
- the source word can be transliterated as a sequence of segments.
- a list of potential transliterations is generated (step 640). The generation of potential transliterations can begin by matching either prefix segments or suffix segments, or by matching both prefix and suffix segments.
- the portion of the word that remains e.g., end, beginning or middle, respectively
- the entire word can be transliterated by the application of segment maps in no particular order using a global optimization approach.
- a source word can be transliterated by first identifying all applicable prefix and suffix segments based on the letters in the source word. All of these segments, in combination, constitute a list of potential partial transliterations. Each partial transliteration includes only prefix and suffix segments. A partial transliteration will also include some unmapped letters of the source word, namely those letters between the end of the prefix and the beginning of the suffix. The partial transliteration can be "filled in" by applying additional segment maps. Applying the segment maps can produce additional transliterations if more than one segment mapping applies to a particular combination of characters in the source word. For example, FIG. 4A is an illustration 400A of segmentations 410-450 derived from multiple word pairs.
- the segmentations 410-440 are exemplary segmentations that can be derived from the word pair illustrated in FIG. 3.
- the segmentation 450 is a segmentation that is derived from another word pair.
- Each segmentation represents a mapping of word segments in the source script (e.g., Latin) to word segments in the target script (e.g., an Indic script). As described above, some of these segmentations can be associated with information identifying whether the segmentation is a prefix or a suffix.
- the segmentations 410-430, each derived from the beginning of the word pair in FIG. 3, are prefix segmentations.
- the segmentation 440 is derived from the end of the same word pair and can be designated as a suffix segmentation.
- the segmentation 450 is not derived from the word pair shown in FIG.
- FIG. 4B is an illustration 400B of generating potential transliterations 470A-D for an input string 460.
- Each potential transliteration 470A-D is generated based on the segmentations shown in FIG. 4A.
- the potential transliteration 470A is generated from the segmentation 410.
- the potential transliteration 470B is generated from the prefix segmentation 420 and the suffix segmentation 440. Every character of the input string 460 is used to generate characters in the potential transliterations 470A-B.
- the potential transliterations 470C-D do not map every character from the input. Instead, each of these transliterations is generated based on the suffix segmentation 440 and one of two distinct prefix segmentations 430 and 450. The prefix segmentations 430 and 450 map the same source-word characters to distinct target-word characters, so each segmentation is used to derive a potential transliteration.
- These transliterations 470A-D are generated from all combinations of prefix and suffix transliterations illustrated in FIG. 4A.
- the blank 490 represents the characters in the target word that are unknown.
- the unmapped characters of the potential transliterations 470C-D can be used to generate missing characters to fill in the blank 490.
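- The combination of prefix and suffix segmentations into partial transliterations can be sketched as follows. The maps, the input word 'nirma' and the placeholder target tokens (`P1`, `S1`, `W1` and so on) are hypothetical and do not reproduce the contents of FIG. 4A or FIG. 4B. Each candidate records the target prefix, the source letters still unmapped (corresponding to the blank 490), and the target suffix.

```python
def partial_transliterations(word, prefix_map, suffix_map):
    """Return (prefix_target, unmapped_middle, suffix_target) candidates.

    prefix_map / suffix_map: dict from source letters to a list of
    alternative target segments learned for the start / end of words.
    """
    candidates = []
    for p_src, p_targets in prefix_map.items():
        if not word.startswith(p_src):
            continue
        rest = word[len(p_src):]
        for p_tgt in p_targets:
            if not rest:
                # The prefix segment alone covers the whole word.
                candidates.append((p_tgt, "", ""))
                continue
            for s_src, s_targets in suffix_map.items():
                if s_src and rest.endswith(s_src):
                    middle = rest[:len(rest) - len(s_src)]
                    for s_tgt in s_targets:
                        candidates.append((p_tgt, middle, s_tgt))
    return candidates
```

- Candidates with an empty middle use every input character, as transliterations 470A-B do; candidates with a non-empty middle still need additional segment maps applied, as with 470C-D.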
- Each potential transliteration 470A-D is subject to pruning and scoring to identify a likely transliteration for the input string 460.
- Unviable transliterations are potential transliterations that exhibit letter and segment patterns that are not supported by the statistical information collected from the corpus. For example, if, according to corpus statistics, there are no words that begin with an accent, then all potential transliterations with an initial accent can be pruned. All aspects of the statistical information collected from the corpus can be used to prune potential transliterations (e.g., prefix/suffixes, segment combinations, character pair and character-class pair co-occurrences, and other letter characteristics). In some implementations, a threshold can be specified to further increase the pruning rate.
- the threshold can specify that the statistical information from the corpus must exceed a particular value before a transliteration is considered viable. Therefore, characteristics of the potential transliteration (e.g., character-class combinations, suffixes, prefixes and so on) must not only have occurred in the corpus but must constitute a certain proportion thereof. For example, a particular segment may occur as a prefix in only 1% of all words in which the segment occurs. A potential transliteration that has the particular segment occurring as a prefix can be pruned if 1% does not exceed the threshold value.
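- Threshold-based pruning on the prefix statistic can be sketched as below. The data layout (each candidate as a list of target segments whose first element is the prefix), the count dictionaries and the function names are assumptions for illustration; a full implementation would apply the same test to the other corpus statistics.

```python
def prefix_rate(segment, prefix_counts, total_counts):
    """Fraction of the segment's occurrences in which it starts a word."""
    total = total_counts.get(segment, 0)
    return prefix_counts.get(segment, 0) / total if total else 0.0


def prune(candidates, prefix_counts, total_counts, threshold):
    """Keep candidates whose leading segment exceeds the prefix threshold.

    candidates: list of segment lists; candidate[0] is used as the prefix.
    """
    return [
        candidate for candidate in candidates
        if prefix_rate(candidate[0], prefix_counts, total_counts) > threshold
    ]
```

- With a threshold of 0.05, a candidate whose leading segment is a prefix in only 1% of its occurrences is pruned, matching the example above.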
- special characters can be inserted between segments that are otherwise not viable. For example, a special character can be inserted between a segment that ends with a consonant and the next segment that begins with a consonant. The special character can be later mapped to a vowel and is added to a potential transliteration when doing so would increase the score of the potential transliteration significantly. All potential transliterations are scored based on the conditional and prior probability and the length of each segment used to generate the transliteration (step 660). In general, long segments are scored more favorably than short segments because a longer segment typically represents a more specific and, ideally, a more accurate transliteration. In some implementations, the transliteration can be scored based on the prior and conditional probability of the entire word (e.g., rather than an individual segment).
- Transliterations can also be scored based on co-occurrence probabilities of each segment pair in the potential transliteration.
- the contribution of each segment to the score of the transliteration can be additive, multiplicative or some other monotonically increasing function.
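- One possible scoring function is sketched below. It is an assumed form, not the scoring defined by the specification: each segment contributes the log of its prior probability, the log of its conditional probability and a length bonus, and contributions are summed (additive in log space, which is multiplicative in probability space). The names and the `length_weight` parameter are hypothetical, and probabilities are assumed to be greater than zero.

```python
import math


def segment_score(prior: float, conditional: float, length: int,
                  length_weight: float = 0.5) -> float:
    """Score of one segment; increases with each probability and with length."""
    return math.log(prior) + math.log(conditional) + length_weight * length


def transliteration_score(segments, length_weight: float = 0.5) -> float:
    """Sum of segment scores.

    segments: list of (prior, conditional, length) tuples, one per segment
    used to build the potential transliteration.
    """
    return sum(
        segment_score(prior, conditional, length, length_weight)
        for prior, conditional, length in segments
    )
```

- Under this form, a word covered by one four-letter segment scores higher than the same word covered by two two-letter segments with the same probabilities, because fewer probabilities are multiplied together.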
- Other words in the input string can be used to contextually score potential transliterations.
- if the scores of several transliterations are all below a particular threshold value or, alternatively, if the scores of the transliterations are all near in value, then the score of each transliteration can be re-evaluated based on other words in the input string.
- the preceding or following words from the input string can be used.
- multi-word (e.g., phrase or sentence) matching can be used with preceding or following characters in the input string.
- the prior probability of word co-occurrences e.g., according to the corpus
- FIG. 7 shows an input string 710 from which two equally viable transliterations 760 and 770 are derived.
- the input string 710 includes two words; the first word 712 corresponds to the transliteration 740.
- the second word 714 has an ambiguous transliteration as it can be transliterated into either of the words 720 or 730. Both transliterations 720 and 730 are equally viable transliterations of the second word in the string 710.
- the complete combined transliteration of the whole input string 710 can either be 760 or 770.
- the relative occurrences 780 of each whole transliteration can be considered to determine which of the combined transliterations is likely more accurate.
- the score of the transliteration 720 can be improved relative to the transliteration of 730.
- n-word portions of the transliteration that include the ambiguous transliteration are considered. For example, in a four-word transliteration that includes a word transliterated from the word 714, the potential transliterations for 714 are grouped in a 2-word portion including a transliteration for a single preceding or succeeding word. The relative occurrence in the corpus of the n-word portion is used to score each potential transliteration of the source word.
- the portion of the input string 710 being transliterated and the preceding and following portions of the input string used to affect the transliteration can each correspond to a word, multiple words (e.g., a phrase) or sentences.
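- Contextual re-scoring with 2-word portions can be sketched as a ranking by corpus co-occurrence counts. The count table, the function name and the tokens `W740`, `W720` and `W730` (standing in for the transliterations 740, 720 and 730) are hypothetical.

```python
def rank_by_context(preceding_word, candidates, bigram_counts):
    """Order ambiguous candidates by how often each follows the preceding word.

    bigram_counts: dict mapping (previous_word, word) to its corpus count.
    Pairs never seen in the corpus count as zero.
    """
    return sorted(
        candidates,
        key=lambda word: bigram_counts.get((preceding_word, word), 0),
        reverse=True,
    )
```

- If the pair (`W740`, `W720`) occurs more often in the corpus than (`W740`, `W730`), the candidate `W720` is ranked first.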
- each potential, viable transliteration is ordered or ranked based on the respective score of the transliteration (step 670).
- the transliterations are presented in order to the user (step 680). If the user corrects the transliteration (e.g. selects any but the first transliteration in the ordered list), then the corrected word is added to the user's dictionary. All transliterations used by the user (e.g., whether corrected or not) can also be added to the training corpus, thus altering corpus and segmentation statistics. Transliterations of the same source word by a particular user can be added to a user's inferred dictionary.
- the word is added to the inferred dictionary and can be used to boost the score of subsequent potential transliterations.
- users can be clustered into groups based on their usage patterns.
- a group of users who make one or more particular transliteration corrections can be recognized by statistical correlation - or by applying any collaborative filtering method.
- a culturally similar group of users can be identified based on their input, transliteration choices, and other context information such as their geographical location (e.g., based on the user's IP address or information in the user's profile), language preference, age, place of birth and so on.
- the users in a group share at least one particular commonality.
- User groups can be used to refine the transliterations provided to users of the group and to use for other services that may require personalization.
- the transliteration of words for these recognized users can automatically be corrected based on corrections made by other users in the group.
- user groups can also be identified based on words that are most frequently transliterated by the user. A particular group of users may be more likely to use and transliterate particular words than another group of users.
- Transliteration conventions often differ from one geographic region to another, so the usage pattern of users from a particular geographical region can be used to adapt transliterations for those users.
- user groups can be associated with particular group specific transliteration information.
- a particular group is associated with unique segment mappings, and group-specific transliteration statistics such as segmentation frequency, word pair frequency and prior probability information.
- This transliteration information can be based on transliteration selection and corrections by users in the group.
- the transliteration information can be included in a group dictionary which can include word pairs that are frequently used by users within the group.
- the global dictionary, one or more group dictionaries and a user's own personalized dictionary represent a prioritized hierarchy of dictionaries that can affect a particular user's transliterations.
- FIG. 8 shows a hierarchy 800 of groups and their associated dictionaries.
- the transliteration information applicable to all users e.g., the global dictionary and corpus-wide transliteration statistics
- a first group 810 is a group derived based on a particular geographical location of users.
- the first group 810 has associated group-specific transliteration information 815 identifying particular transliterations often used by users of the group 810.
- the group may be one of other groups that respectively correspond to other particular geographical locations.
- the first group 810 includes at least two other subgroups.
- the group 820 may correspond to users in the first group who have a particular language preference.
- the group 820 is also associated with group-specific transliteration information 825.
- a user may belong to many groups at many varying levels in a hierarchy of groups.
- One particular user 840 in the group 820 is also associated with personal transliteration information 845, such as the user's personalized transliteration dictionary or the user's inferred dictionary.
- the transliteration information associated with the user and the user's groups can be consulted in order of personalization.
- the entries of a user's personalized transliteration information 845 can be used first, the transliteration information 825 of sub-group 820 used second, the transliteration information 815 of group 810 used third and the global transliteration information used last.
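- Consulting the dictionaries in order of personalization can be sketched as a simple cascade. The function name, the layer labels and the placeholder target words are hypothetical; the ordering follows the hierarchy described above (user, sub-group, group, global).

```python
def lookup(source_word, dictionaries):
    """Return (layer_name, target_word) from the most personalized match.

    dictionaries: list of (layer_name, mapping) ordered from most to least
    personalized. Returns None when no layer contains the word.
    """
    for layer_name, mapping in dictionaries:
        if source_word in mapping:
            return layer_name, mapping[source_word]
    return None
```

- When `lookup` returns `None`, the word would fall through to segment-based transliteration.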
- FIG. 9 is a block diagram of a transliteration system 900 for providing transliterations responsive to user requests. The system 900 includes a transliteration module 910.
- user input is received by the transliteration module 910, upon which transliteration is performed.
- the transliteration module 910 provides a transliteration of the user input back to the user.
- the transliteration module 910 is a server that communicates with a client 920 such as a web browser, which is running on a device (e.g., a computer 964 or portable device 962) connected to the server using a wired or wireless network 958.
- the client 920 provides user input to the transliteration module 910 using any convenient data submission techniques.
- the system 900 can provide a user interface 952 to the client 920 in accordance with the hyper-text transfer protocol (HTTP).
- the client 920 can include client-side scripting capabilities that allow instructions to be received from the transliteration module 910 that are executed by the client 920. These instructions can be specified in client-side scripting languages such as JavaScript, VBScript, Flash, and others.
- the transliteration module 910 can provide data and client-side instructions to enable the client to generate complete or partial transliterations within the client 920.
- the transliteration module 910 can provide the client with a client-side copy of the user's transliteration dictionary 923 (or common words from the global transliteration dictionary).
- the client will also receive instructions that enable the client to automatically transliterate words that appear in the client-side dictionary without further interaction with the transliteration module 910.
- segment maps 927 can be provided to the client along with instructions such that the client can generate viable transliterations for some words through application of the segment maps.
- the segment maps sent to the user can be identified based on a confidence score of the map and the frequency with which the map is used to produce a successful transliteration.
- the segments that are both likely to be correct and often used can be provided to the client for client-side transliteration. If a transliteration cannot be computed on the client (e.g., the word is not in the user's dictionary, or the provided rules are insufficient) the text can be provided to the transliteration module 910.
- the particular maps and dictionary entries that are provided to the client compared to the maps and dictionaries that reside only on the server can depend on a caching strategy.
- the caching strategy can require that all transliteration occur on the server-side without client-side computation (e.g., unsupported web browsers, mobile devices, slow devices, memory-constrained devices).
- the caching strategy can require that maps and dictionary entries are provided to the client for client-side computation.
- the selected mapping strategy can depend on the words being transliterated, the capabilities of the client, the capacity of the network connection or a combination thereof.
- transliteration module 910 includes two sub-modules, a back end 930 and a front end 940. Each sub-module can be distinguished by its role in transliteration.
- the front end can include the user dictionary 914 and the global dictionary 918. The front end, on receipt of a particular input string, can attempt to transliterate the string based on word look-ups in each dictionary.
- the back end can include a transliteration processor for transliterating a word algorithmically based on segmentation maps 985 and the training corpus of word pairs 974 (e.g., using corpus-related statistics such as prior probabilities).
- the training corpus of word pairs 974 is derived from the search corpus 972.
- the front end can ideally transliterate many common words while the back end transliterates the obscure or rare words that the front end is unable to transliterate directly.
- the caching behaviors of the front and back end can reflect the unique role of each sub-module during transliteration.
- the front end can cache the top 500 transliterations in the global dictionary, while the back end caches the top 1000 segmentation maps.
- Caching policies affecting how often caches are refreshed or when cache items are replaced (e.g., based on least-recently-used (LRU) or least-frequently-used (LFU) cache algorithms).
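- A least-recently-used replacement policy of the kind mentioned above can be sketched with an ordered dictionary. The class name and interface are hypothetical; a capacity of 500 or 1000 entries would correspond to the front-end and back-end examples given earlier.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        """Return the cached value (marking it recently used) or None."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Insert or update an entry, evicting the oldest one if over capacity."""
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)
```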
- the transliteration provided by the client may be undesirable.
- the user can provide user input indicating that the user would prefer to select a transliteration from other potential transliterations.
- the word can be provided to the transliteration module 910, and potential transliterations can be received from the transliteration module 910 and presented to the user.
- the system 900 can include an entry-aligned dictionary of transliterations.
- the entry aligned dictionary of transliterations includes, for every source word in the dictionary, a single target word.
- the dictionary can include parts of the global dictionary of transliterations and/or the user's dictionary of transliterations. If a particular source word can be mapped to multiple target words, then the entry-aligned dictionary includes an entry for each target word, where each entry includes the same source word repeated in each entry.
- the entry-aligned dictionary is a space-efficient way to record word pairs.
- a consecutive word stream of the same language and encoding will compress (e.g., using conventional compression techniques) more effectively than alternating languages and encodings.
- each word in the entry-aligned dictionary has a simple one-to-one relationship and therefore does not require any special structural overhead for recording potential alternatives.
- the entry-aligned dictionary can be provided by the system 900 to the user's client 920.
- the client 920 can subsequently use the dictionary to transliterate words that appear in the dictionary.
- compression can be achieved by HTTP compression as specified in the HTTP 1.1 protocol standard.
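- An entry-aligned dictionary can be sketched as two parallel word streams, one per script, compressed together. The helper names, the newline-delimited layout and the sample word pairs are illustrative assumptions rather than part of the specification; `zlib` stands in here for whatever content coding the HTTP layer applies, and words are assumed not to contain newline characters.

```python
import zlib


def to_entry_aligned(mapping):
    """Flatten {source: [targets]} into two parallel, equally long lists.

    A source word with several targets is repeated once per target, so
    entry i of `sources` always pairs with entry i of `targets`.
    """
    sources, targets = [], []
    for source, target_words in mapping.items():
        for target in target_words:
            sources.append(source)
            targets.append(target)
    return sources, targets


def pack(sources, targets):
    """Serialize each script as one consecutive stream, then compress."""
    payload = "\n".join(sources) + "\n\n" + "\n".join(targets)
    return zlib.compress(payload.encode("utf-8"))


def unpack(blob):
    """Inverse of pack: recover the two parallel lists."""
    text = zlib.decompress(blob).decode("utf-8")
    source_block, target_block = text.split("\n\n")
    return source_block.split("\n"), target_block.split("\n")
```

- On the client, zipping the two unpacked lists gives back the word pairs without any per-entry structure for alternatives.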
- the system 900 can include an alignment and segmentation module 980.
- the alignment and segmentation module 980 can analyze the training corpus 974 to derive alignment, segmentation maps, transliteration dictionaries and corpus statistics.
- the analysis of the training corpus is conducted asynchronously from receiving user input or generating potential transliterations for such user input.
- the system 900 can include a search engine.
- the search engine receives a source word as a search query.
- the source word can be transliterated producing, potentially, several transliterated words that can be used to replace or amend the search query.
- Embodiments of the invention and all of the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the invention can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus.
- the computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.
- data processing apparatus encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus.
- a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program does not necessarily correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
- the processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
- a processor will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few.
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- embodiments of the invention can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- Embodiments of the invention can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the invention, or any combination of one or more such back-end, middleware, or front-end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- While this specification contains many specifics, these should not be construed as limitations on the scope of the invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of the invention. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US89337207P | 2007-03-06 | 2007-03-06 | |
PCT/US2008/056087 WO2008109769A1 (en) | 2007-03-06 | 2008-03-06 | Machine learning for transliteration |
Publications (2)
Publication Number | Publication Date |
---|---|
EP2132657A1 (en) | 2009-12-16 |
EP2132657A4 (en) | 2018-01-03 |
Family
ID=39738804
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
EP08731575.0A | Withdrawn | EP2132657A4 (en) | 2007-03-06 | 2008-03-06 | Machine learning for transliteration |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP2132657A4 (en) |
WO (1) | WO2008109769A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8463597B2 (en) | 2008-05-11 | 2013-06-11 | Research In Motion Limited | Mobile electronic device and associated method enabling identification of previously entered data for transliteration of an input |
US8438008B2 (en) | 2010-08-03 | 2013-05-07 | King Fahd University Of Petroleum And Minerals | Method of generating a transliteration font |
CN104657343B (en) * | 2013-11-15 | 2017-10-10 | 富士通株式会社 | Recognize the method and device of transliteration name |
US10943143B2 (en) | 2018-12-28 | 2021-03-09 | Paypal, Inc. | Algorithm for scoring partial matches between words |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0877173A (en) * | 1994-09-01 | 1996-03-22 | Fujitsu Ltd | System and method for correcting character string |
US5787452A (en) * | 1996-05-21 | 1998-07-28 | Sybase, Inc. | Client/server database system with methods for multi-threaded data processing in a heterogeneous language environment |
KR100350787B1 (en) | 1999-09-22 | 2002-08-28 | 엘지전자 주식회사 | Multimedia browser based on user profile having ordering preference of searching item of multimedia data |
US7610189B2 (en) * | 2001-10-18 | 2009-10-27 | Nuance Communications, Inc. | Method and apparatus for efficient segmentation of compound words using probabilistic breakpoint traversal |
US7031911B2 (en) * | 2002-06-28 | 2006-04-18 | Microsoft Corporation | System and method for automatic detection of collocation mistakes in documents |
US8200475B2 (en) * | 2004-02-13 | 2012-06-12 | Microsoft Corporation | Phonetic-based text input method |
US20050216253A1 (en) * | 2004-03-25 | 2005-09-29 | Microsoft Corporation | System and method for reverse transliteration using statistical alignment |
2008
- 2008-03-06: EP application EP08731575.0A (EP2132657A4), status: Withdrawn
- 2008-03-06: WO application PCT/US2008/056087 (WO2008109769A1), status: Application Filing
Non-Patent Citations (1)
Title |
---|
See references of WO2008109769A1 * |
Also Published As
Publication number | Publication date |
---|---|
WO2008109769A1 (en) | 2008-09-12 |
EP2132657A4 (en) | 2018-01-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20080221866A1 (en) | Machine Learning For Transliteration | |
US8386237B2 (en) | Automatic correction of user input based on dictionary | |
US8341520B2 (en) | Method and system for spell checking | |
US8626486B2 (en) | Automatic spelling correction for machine translation | |
US8521761B2 (en) | Transliteration for query expansion | |
US8762358B2 (en) | Query language determination using query terms and interface language | |
US7475063B2 (en) | Augmenting queries with synonyms selected using language statistics | |
US8255376B2 (en) | Augmenting queries with synonyms from synonyms map | |
KR101425182B1 (en) | Typing candidate generating method for enhancing typing efficiency | |
CA2614416C (en) | Processing collocation mistakes in documents | |
US20110184723A1 (en) | Phonetic suggestion engine | |
US9514098B1 (en) | Iteratively learning coreference embeddings of noun phrases using feature representations that include distributed word representations of the noun phrases | |
US7835903B2 (en) | Simplifying query terms with transliteration | |
US8515731B1 (en) | Synonym verification | |
WO2009000103A1 (en) | Word probability determination | |
KR102552811B1 (en) | System for providing cloud based grammar checker service | |
Way et al. | wEBMT: developing and validating an example-based machine translation system using the world wide web | |
WO2008109769A1 (en) | Machine learning for transliteration | |
Shibli et al. | Automatic back transliteration of romanized Bengali (Banglish) to Bengali | |
EP2016486A2 (en) | Processing of query terms | |
US9208233B1 (en) | Using synthetic descriptive text to rank search results | |
Dashti et al. | PERCORE: A Deep Learning-Based Framework for Persian Spelling Correction with Phonetic Analysis | |
Hasan et al. | SweetCoat-2D: Two-Dimensional Bangla Spelling Correction and Suggestion Using Levenshtein Edit Distance and String Matching Algorithm | |
Dashti et al. | Automatic real-word error correction in persian text | |
Wang | Research on College English curriculum algorithm based on hierarchical model |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
17P | Request for examination filed | Effective date: 20091002 |
AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR |
DAX | Request for extension of the european patent (deleted) | |
RAP1 | Party data changed (applicant data changed or rights of an application transferred) | Owner name: GOOGLE LLC |
RA4 | Supplementary search report drawn up and despatched (corrected) | Effective date: 20171130 |
RIC1 | Information provided on ipc code assigned before grant | Ipc: G06F 17/28 20060101AFI20171124BHEP |
17Q | First examination report despatched | Effective date: 20180726 |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
18D | Application deemed to be withdrawn | Effective date: 20191001 |
P01 | Opt-out of the competence of the unified patent court (upc) registered | Effective date: 20230519 |