WO2010013228A1 - Automatic context sensitive language generation, correction and enhancement using an internet corpus - Google Patents
- Publication number
- WO2010013228A1 (PCT/IL2009/000130)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- correction
- functionality
- words
- word
- input
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Definitions
- the present invention relates generally to computer-assisted language generation and correction, and more particularly as applicable to machine translation.
- the present invention seeks to provide improved systems and functionalities for computer-assisted language generation.
- a computer-assisted language generation system comprising: sentence retrieval functionality, operative on the basis of an input text containing words, to retrieve from an internet corpus a plurality of sentences containing words which correspond to the words in the input text; and sentence generation functionality operative using a plurality of sentences retrieved by the sentence retrieval functionality from the internet corpus to generate at least one correct sentence giving expression to the input text.
- the sentence retrieval functionality comprises: an independent phrase generator splitting the input text into one or more independent phrases; a word stem generator and classifier, operative for each independent phrase to generate word stems for words appearing therein and to assign importance weights thereto; and an alternatives generator for generating alternative word stems corresponding to the word stems.
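The phrase splitting, stemming and importance weighting described above can be sketched as follows. This is an illustrative assumption, not the patent's implementation: the punctuation-based splitting heuristic, the toy suffix-stripping stemmer and the stop-word weighting values are all placeholders.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and"}

def split_independent_phrases(text):
    # Naive split on clause-separating punctuation; the splitting
    # heuristic is an illustrative assumption, not the patent's.
    return [p.strip() for p in re.split(r"[,;.]", text) if p.strip()]

def stem(word):
    # Toy suffix-stripping stemmer standing in for a real stemmer
    # such as Porter's; purely illustrative.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def stems_with_weights(phrase):
    # Assign a low importance weight to stop words and a uniform
    # higher weight to content words; the weighting values are
    # placeholders, since the text leaves the scheme open.
    return [(stem(w), 0.1 if w in STOP_WORDS else 1.0)
            for w in phrase.lower().split()]
```

The alternatives generator would then propose alternative stems (for example synonyms or morphological variants) for each weighted stem.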
- the computer-assisted language generation system also comprises a stem-to-sentence index which interacts with the internet corpus for retrieving the plurality of sentences containing words which correspond to the words in the input text.
- the sentence generation functionality comprises: sentence simplification functionality operative to simplify the sentences retrieved from the internet corpus; simplified sentence grouping functionality for grouping similar simplified sentences provided by the sentence simplification functionality; and simplified sentence group ranking functionality for ranking groups of the similar simplified sentences.
- the simplified sentence group ranking functionality operates using at least some of a set of criteria, referred to below as criteria A, B and C.
- the simplified sentence group ranking functionality operates using at least part of the following procedure: defining the weight of a word stem to indicate the importance of the word in the language; calculating a Positive Match Rank corresponding to criterion B; calculating a Negative Match Rank corresponding to criterion C; and calculating a Composite Rank based on: the number of simplified sentences contained in a group, corresponding to criterion A; the Positive Match Rank; and the Negative Match Rank.
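The ranking procedure above can be sketched as follows. The exact formulas for the two match ranks and the combining weights (alpha, beta, gamma) are illustrative assumptions; the text specifies only which quantities enter the Composite Rank.

```python
import math

def positive_match_rank(group_stems, query_stems, weights):
    # Weighted fraction of input-text stems found in the group
    # (criterion B); the exact formula is an assumption.
    total = sum(weights[s] for s in query_stems)
    matched = sum(weights[s] for s in query_stems if s in group_stems)
    return matched / total if total else 0.0

def negative_match_rank(group_stems, query_stems, weights):
    # Weighted penalty for group stems absent from the input text
    # (criterion C); again an assumed formulation.
    extra = [s for s in group_stems if s not in query_stems]
    return sum(weights.get(s, 1.0) for s in extra) / (len(group_stems) or 1)

def composite_rank(group_size, pos_rank, neg_rank,
                   alpha=1.0, beta=1.0, gamma=0.5):
    # Combine group size (criterion A) with the two match ranks;
    # alpha, beta and gamma are placeholder weights.
    return alpha * math.log1p(group_size) + beta * pos_rank - gamma * neg_rank
```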
- the computer-assisted language generation system also comprises machine translation functionality providing the input text.
- a machine translation system comprising: machine translation functionality; sentence retrieval functionality, operative on the basis of an input text provided by the machine translation functionality, to retrieve from an internet corpus a plurality of sentences containing words which correspond to words in the input text; and sentence generation functionality operative using a plurality of sentences retrieved by the sentence retrieval functionality from the internet corpus to generate at least one correct sentence giving expression to the input text generated by the machine translation functionality.
- the machine translation functionality provides a plurality of alternatives corresponding to words in the input text and the sentence retrieval functionality is operative to retrieve from the internet corpus a plurality of sentences containing words which correspond to the alternatives.
- language generation comprises text correction.
- a text correction system comprising: sentence retrieval functionality, operative on the basis of an input text provided by the text correction functionality, to retrieve from an internet corpus a plurality of sentences containing words which correspond to words in the input text; and sentence correction functionality operative using a plurality of sentences retrieved by the sentence retrieval functionality from the internet corpus to generate at least one correct sentence giving expression to the input text.
- the system also comprises sentence search functionality providing the input text based on user-entered query words.
- a sentence search system comprising: sentence search functionality providing an input text based on user- entered query words; sentence retrieval functionality, operative on the basis of the input text provided by the sentence search functionality, to retrieve from an internet corpus a plurality of sentences containing words which correspond to words in the input text; and sentence generation functionality operative using a plurality of sentences retrieved by the sentence retrieval functionality from the internet corpus to generate at least one correct sentence giving expression to the input text generated by the sentence search functionality.
- the computer-assisted language generation system also comprises speech-to-text conversion functionality providing the input text.
- a speech-to-text conversion system comprising: speech-to-text conversion functionality providing an input text; sentence retrieval functionality, operative on the basis of the input text provided by the speech-to-text conversion functionality, to retrieve from an internet corpus a plurality of sentences containing words which correspond to words in the input text; and sentence generation functionality operative using a plurality of sentences retrieved by the sentence retrieval functionality from the internet corpus to generate at least one correct sentence giving expression to the input text generated by the speech-to-text conversion functionality.
- a computer-assisted language correction system including an alternatives generator, generating on the basis of an input sentence a text-based representation providing multiple alternatives for each of a plurality of words in the sentence, a selector for selecting among at least the multiple alternatives for each of the plurality of words in the sentence, based at least partly on an internet corpus, and a correction generator operative to provide a correction output based on selections made by the selector.
- the selector is operative to make the selections based on at least one of the following correction functions: spelling correction, misused word correction, grammar correction and vocabulary enhancement.
- the selector is operative to make the selections based on at least two of the following correction functions: spelling correction, misused word correction, grammar correction and vocabulary enhancement. Additionally, the selector is operative to make the selections based on at least one of the following time ordering of corrections: spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement, and misused word correction and grammar correction prior to vocabulary enhancement. Additionally or alternatively, the input sentence is provided by one of the following functionalities: word processor functionality, machine translation functionality, speech-to-text conversion functionality, optical character recognition functionality and instant messaging functionality, and the selector is operative to make the selections based on at least one of the following correction functions: misused word correction, grammar correction and vocabulary enhancement.
- the correction generator includes a corrected language input generator operative to provide a corrected language output based on selections made by the selector without requiring user intervention.
- the grammar correction functionality includes at least one of punctuation, verb inflection, single/plural, article and preposition correction functionalities.
- the grammar correction functionality includes at least one of replacement, insertion and omission correction functionalities.
- the selector includes context based scoring functionality operative to rank the multiple alternatives, based at least partially on contextual feature-sequence (CFS) frequencies of occurrences in an internet corpus.
- the context based scoring functionality is also operative to rank the multiple alternatives based at least partially on normalized CFS frequencies of occurrences in the internet corpus.
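Context based scoring by normalized CFS frequencies can be sketched as follows. Dividing each CFS frequency by the alternative's standalone frequency, so that globally common words do not dominate, is one plausible reading of "normalized CFS frequencies"; the patent does not fix the formula. The `"<slot>"` placeholder convention and the `freq` lookup function are hypothetical.

```python
def cfs_score(alternative, cfs_list, freq):
    # `freq` returns the corpus frequency of a word tuple. Each CFS is
    # given with a "<slot>" placeholder where the alternative goes.
    # Dividing by the alternative's standalone frequency is one
    # plausible reading of "normalized CFS frequencies", not
    # necessarily the patent's formula.
    base = freq((alternative,)) or 1
    score = 0.0
    for cfs in cfs_list:
        substituted = tuple(alternative if w == "<slot>" else w for w in cfs)
        score += freq(substituted) / base
    return score

def rank_alternatives(alternatives, cfs_list, freq):
    # Rank alternatives by their normalized CFS score, highest first.
    return sorted(alternatives,
                  key=lambda a: cfs_score(a, cfs_list, freq),
                  reverse=True)
```

With a toy frequency table, "sea" outranks "see" in the context "the _ shore" even though "see" is the more frequent word overall, which is the point of the normalization.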
- a computer-assisted language correction system including at least one of spelling correction functionality, misused word correction functionality, grammar correction functionality and vocabulary enhancement functionality, and contextual feature-sequence functionality cooperating with at least one of the spelling correction functionality, the misused word correction functionality, the grammar correction functionality and the vocabulary enhancement functionality and employing an internet corpus.
- the grammar correction functionality includes at least one of punctuation, verb inflection, single/plural, article and preposition correction functionalities. Additionally or alternatively, the grammar correction functionality includes at least one of replacement, insertion and omission correction functionalities.
- the computer-assisted language correction system includes at least two of the spelling correction functionality, the misused word correction functionality, the grammar correction functionality and the vocabulary enhancement functionality, and the contextual feature-sequence functionality cooperates with at least two of the spelling correction functionality, the misused word correction functionality, the grammar correction functionality and the vocabulary enhancement functionality, and employs an internet corpus.
- the computer-assisted language correction system also includes at least three of the spelling correction functionality, the misused word correction functionality, the grammar correction functionality and the vocabulary enhancement functionality, and the contextual feature-sequence functionality cooperates with at least three of the spelling correction functionality, the misused word correction functionality, the grammar correction functionality and the vocabulary enhancement functionality, and employs an internet corpus.
- the computer-assisted language correction system also includes the spelling correction functionality, the misused word correction functionality, the grammar correction functionality and the vocabulary enhancement functionality, and the contextual feature-sequence functionality cooperates with the spelling correction functionality, the misused word correction functionality, the grammar correction functionality and the vocabulary enhancement functionality, and employs an internet corpus.
- the correction generator includes a corrected language generator operative to provide a corrected language output based on selections made by the selector without requiring user intervention.
- a computer-assisted language correction system including an alternatives generator, generating on the basis of a language input a text-based representation providing multiple alternatives for each of a plurality of words in the language input, a selector for selecting among at least the multiple alternatives for each of the plurality of words in the language input, based at least partly on a relationship between selected ones of the multiple alternatives for at least some of the plurality of words in the language input, and a correction generator operative to provide a correction output based on selections made by the selector.
- the language input includes at least one of an input sentence and an input text.
- the language input is speech and the generator converts the language input in speech to a text-based representation providing multiple alternatives for a plurality of words in the language input.
- the language input is at least one of a text input, an output of optical character recognition functionality, an output of machine translation functionality and an output of word processing functionality, and the generator converts the language input in text to a text-based representation providing multiple alternatives for a plurality of words in the language input.
- the selector is operative to make the selections based on at least two of the following correction functions: spelling correction, misused word correction, grammar correction and vocabulary enhancement. Additionally, the selector is operative to make the selections based on at least one of the following time ordering of corrections: spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement, and misused word correction and grammar correction prior to vocabulary enhancement.
- the language input is speech and the selector is operative to make the selections based on at least one of the following correction functions: misused word correction, grammar correction and vocabulary enhancement.
- the selector is operative to make the selections by carrying out at least two of the following functions: selection of a first set of words or combinations of words which include less than all of the plurality of words in the language input for an initial selection, thereafter ordering elements of the first set of words or combinations of words to establish priority of selection, and thereafter, when selecting among the multiple alternatives for an element of the first set of words, choosing some, but not all, of the other words of the plurality of words as a context to influence the selecting.
- the selector is operative to make the selections by carrying out the following function: when selecting for an element having at least two words, evaluating each of the multiple alternatives for each of the at least two words in combination with each of the multiple alternatives for each other of the at least two words.
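The joint evaluation described above, in which every alternative for each word of a multi-word element is tried in combination with every alternative for the other words, can be sketched as a Cartesian-product search. The `score` function, which would rate a candidate tuple for example by corpus frequency, is a caller-supplied assumption.

```python
from itertools import product

def best_joint_selection(alternatives_per_word, score):
    # Evaluate every combination of the alternatives for the element's
    # words jointly; `score` is a caller-supplied function rating a
    # candidate word tuple (for example by corpus frequency).
    return max(product(*alternatives_per_word), key=score)
```

For a two-word element with alternatives [piece, peace] and [of, off], all four combinations are scored and the best-scoring pair is selected as a unit.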
- the correction generator includes a corrected language input generator operative to provide a corrected language output based on selections made by the selector without requiring user intervention.
- a computer-assisted language correction system including a misused-word suspector evaluating at least most of the words in a language input on the basis of their fit within a context of the language input and a correction generator operative to provide a correction output based at least partially on an evaluation performed by the suspector.
- the computer-assisted language correction system also includes an alternatives generator, generating on the basis of the language input, a text-based representation providing multiple alternatives for at least one of the at least most words in the language input and a selector for selecting among at least the multiple alternatives for each of the at least one of the at least most words in the language input, and the correction generator is operative to provide the correction output based on selections made by the selector.
- the computer-assisted language correction system also includes a suspect word output indicator indicating an extent to which at least some of the at least most of the words in the language input are suspect as misused words.
- the correction generator includes an automatic corrected language generator operative to provide a corrected text output based at least partially on an evaluation performed by the suspector, without requiring user intervention.
- the language input is speech and the selector is operative to make the selections based on at least one of the following correction functions: misused word correction, grammar correction and vocabulary enhancement.
- a computer-assisted language correction system including a misused-word suspector evaluating words in a language input, an alternatives generator, generating multiple alternatives for at least some of the words in the language input evaluated as suspect words by the suspector, at least one of the multiple alternatives for a word in the language input being consistent with a contextual feature of the word in the language input in an internet corpus, a selector for selecting among at least the multiple alternatives and a correction generator operative to provide a correction output based at least partially on a selection made by the selector.
- a computer-assisted language correction system including a misused-word suspector evaluating words in a language input and identifying suspect words, an alternatives generator, generating multiple alternatives for the suspect words, a selector, grading each suspect word as well as ones of the multiple alternatives therefor generated by the alternatives generator according to multiple selection criteria, and applying a bias in favor of the suspect word vis-a-vis ones of the multiple alternatives therefor generated by the alternatives generator, and a correction generator operative to provide a correction output based at least partially on a selection made by the selector.
- a computer-assisted language correction system including an alternatives generator, generating on the basis of an input multiple alternatives for at least one word in the input, a selector, grading the at least one word as well as ones of the multiple alternatives therefor generated by the alternatives generator according to multiple selection criteria, and applying a bias in favor of the at least one word vis-a-vis ones of the multiple alternatives therefor generated by the alternatives generator, the bias being a function of an input uncertainty metric indicating uncertainty of a person providing the input, and a correction generator operative to provide a correction output based on a selection made by the selector.
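The bias mechanism above can be sketched as follows: the original word's grade is boosted, and the boost shrinks as the input uncertainty metric grows, so a confident typist's original word is kept unless an alternative is clearly better. The linear mapping from uncertainty to bias is an assumed functional form, not the patent's.

```python
def select_with_bias(original, alternatives, grade, uncertainty):
    # Grade the original word and each alternative, then boost the
    # original's grade by a bias that shrinks as the input uncertainty
    # metric grows (0 = confident typist, 1 = very uncertain). The
    # linear mapping is an assumed functional form.
    bias = 1.0 - uncertainty
    scored = [(grade(original) * (1.0 + bias), original)]
    scored += [(grade(a), a) for a in alternatives]
    return max(scored)[1]
```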
- a computer-assisted language correction system including an incorrect word suspector evaluating at least most of the words in a language input, the suspector being at least partially responsive to an input uncertainty metric indicating uncertainty of a person providing the input, the suspector providing a suspected incorrect word output, and an alternatives generator, generating a plurality of alternatives for suspected incorrect words identified by the suspected incorrect word output, a selector for selecting among each suspected incorrect word and the plurality of alternatives generated by the alternatives generator, and a correction generator operative to provide a correction output based on a selection made by the selector.
- a computer-assisted language correction system including at least one of a spelling correction module, a misused-word correction module, a grammar correction module and a vocabulary enhancement module receiving a multi-word input and providing a correction output, each of the at least one of a spelling correction module, a misused-word correction module, a grammar correction module and a vocabulary enhancement module including an alternative word candidate generator including phonetic similarity functionality operative to propose alternative words based on phonetic similarity to a word in the input and to indicate a metric of phonetic similarity and character string similarity functionality operative to propose alternative words based on character string similarity to a word in the input and to indicate a metric of character string similarity for each alternative word, and a selector operative to select either a word in the input or an alternative word candidate proposed by the alternative word candidate generator by employing the phonetic similarity and character string similarity metrics together with context-based selection functionality.
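The two similarity metrics above can be sketched as follows. The crude phonetic key stands in for a real phonetic algorithm such as Soundex or Metaphone, and `difflib.SequenceMatcher.ratio` stands in for whatever character-string metric the system uses; both choices are illustrative assumptions.

```python
import difflib

def phonetic_key(word):
    # Crude phonetic key (drop vowels after the first letter, collapse
    # repeated letters), standing in for a real algorithm such as
    # Soundex or Metaphone; illustrative only.
    word = word.lower()
    key = word[0]
    for ch in word[1:]:
        if ch in "aeiou" or ch == key[-1]:
            continue
        key += ch
    return key

def phonetic_similarity(a, b):
    return difflib.SequenceMatcher(None, phonetic_key(a), phonetic_key(b)).ratio()

def string_similarity(a, b):
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_metrics(word, candidates):
    # Each candidate is returned with both metrics, for use by a
    # downstream context-based selector.
    return [(c, phonetic_similarity(word, c), string_similarity(word, c))
            for c in candidates]
```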
- a computer-assisted language correction system including suspect word identification functionality, receiving a multi-word language input and providing a suspect word output which indicates suspect words, feature identification functionality operative to identify features including the suspect words, an alternative selector identifying alternatives to the suspect words, feature occurrence functionality employing a corpus and providing an occurrence output, ranking various features including the alternatives as to their frequency of use in the corpus, and a selector employing the occurrence output to provide a correction output.
- the feature identification functionality including feature filtration functionality including at least one of functionality for eliminating features containing suspected errors, functionality for negatively biasing features which contain words introduced in an earlier correction iteration of the multi-word input and which have a confidence level below a predetermined confidence level threshold, and functionality for eliminating features which are contained in another feature having a frequency of occurrence above a predetermined frequency threshold.
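The three filtration steps above can be sketched as one pass over the candidate features. The thresholds, the 0.5 penalty weight, and the representation of containment (each feature mapped to the features containing it) are illustrative assumptions.

```python
def filter_features(features, contains_error, confidence, freq,
                    conf_threshold=0.5, freq_threshold=1000):
    # `features` maps each feature to the features that contain it.
    # Step 1 eliminates features containing suspected errors; step 2
    # negatively biases low-confidence features (here by a 0.5 weight);
    # step 3 eliminates features subsumed by a high-frequency container.
    # The thresholds and the penalty value are illustrative assumptions.
    kept = {}
    for feat, containers in features.items():
        if contains_error(feat):
            continue
        if any(freq(c) > freq_threshold for c in containers):
            continue
        kept[feat] = 0.5 if confidence(feat) < conf_threshold else 1.0
    return kept
```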
- the selector is operative to make the selections based on at least two of the following correction functions: spelling correction, misused word correction, grammar correction and vocabulary enhancement. Additionally, the selector is operative to make the selections based on at least one of the following time ordering of corrections: spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement and misused word correction and grammar correction prior to vocabulary enhancement.
- the language input is speech and the selector is operative to make the selections based on at least one of the following correction functions: misused word correction, grammar correction and vocabulary enhancement.
- the correction generator includes a corrected language input generator operative to provide a corrected language output based on selections made by the selector without requiring user intervention.
- the selector is also operative to make the selections based at least partly on a user input uncertainty metric.
- the user input uncertainty metric is a function based on a measurement of the uncertainty of a person providing the input. Additionally or alternatively, the selector also employs user input history learning functionality.
- a computer-assisted language correction system including suspect word identification functionality, receiving a multi-word language input and providing a suspect word output which indicates suspect words, feature identification functionality operative to identify features including the suspect words, an alternative selector identifying alternatives to the suspect words, occurrence functionality employing a corpus and providing an occurrence output, ranking features including the alternatives as to their frequency of use in the corpus, and a correction output generator, employing the occurrence output to provide a correction output, the feature identification functionality including at least one of: N-gram identification functionality and co-occurrence identification functionality, and at least one of: skip-gram identification functionality, switch-gram identification functionality and previously-used-by-user feature identification functionality.
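The N-gram, skip-gram and switch-gram feature types named above can be sketched as follows. The specific definitions (bigram skip-grams with one intervening word; switch-grams as reversed adjacent pairs) are one plausible reading of the terms, not definitions given in the text.

```python
def ngrams(words, n):
    # Contiguous n-grams.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def skip_grams(words, skip=1):
    # Word pairs that skip over `skip` intervening words, one reading
    # of "skip-gram identification"; the parameter is assumed.
    return [(words[i], words[i + skip + 1])
            for i in range(len(words) - skip - 1)]

def switch_grams(words):
    # Adjacent word pairs with the two words switched, one reading of
    # "switch-gram identification functionality".
    return [(b, a) for a, b in ngrams(words, 2)]
```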
- a computer-assisted language correction system including a grammatical error suspector evaluating at least most of the words in a language input on the basis of their fit within a context of the language input and a correction generator operative to provide a correction output based at least partially on an evaluation performed by the suspector.
- the computer-assisted language correction system also includes an alternatives generator, generating on the basis of the language input, a text-based representation providing multiple alternatives for at least one of the at least most words in the language input, and a selector for selecting among at least the multiple alternatives for each of the at least one of the at least most words in the language input, and the correction generator is operative to provide the correction output based on selections made by the selector.
- the computer-assisted language correction system also includes a suspect word output indicator indicating an extent to which at least some of the at least most of the words in the language input are suspect as containing grammatical errors.
- the correction generator includes an automatic corrected language generator operative to provide a corrected text output based at least partially on an evaluation performed by the suspector, without requiring user intervention.
- a computer-assisted language correction system including a grammatical error suspector evaluating words in a language input, an alternatives generator, generating multiple alternatives for at least some of the words in the language input evaluated as suspect words by the suspector, at least one of the multiple alternatives for a word in the language input being consistent with a contextual feature of the word in the language input, a selector for selecting among at least the multiple alternatives and a correction generator operative to provide a correction output based at least partially on a selection made by the selector.
- a computer-assisted language correction system including a grammatical error suspector evaluating words in a language input and identifying suspect words, an alternatives generator, generating multiple alternatives for the suspect words, a selector, grading each suspect word as well as ones of the multiple alternatives therefor generated by the alternatives generator according to multiple selection criteria, and applying a bias in favor of the suspect word vis-a-vis ones of the multiple alternatives therefor generated by the alternatives generator, and a correction generator operative to provide a correction output based at least partially on a selection made by the selector.
- the various embodiments summarized above may be combined with or also include a computer-assisted language correction system including context based scoring of various alternative corrections, based at least partially on contextual feature-sequence (CFS) frequencies of occurrences in an internet corpus.
- the computer-assisted language correction system also includes at least one of spelling correction functionality, misused word correction functionality, grammar correction functionality and vocabulary enhancement functionality, cooperating with the context based scoring.
- the context based scoring is also based at least partially on normalized CFS frequencies of occurrences in an internet corpus.
- the context based scoring is also based at least partially on a CFS importance score.
- the CFS importance score is a function of at least one of the following: operation of a part-of- speech tagging and sentence parsing functionality; a CFS length; a frequency of occurrence of each of the words in the CFS and a CFS type.
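The factors named above (POS/parse information, CFS length, per-word frequency and CFS type) can be combined into an importance score as sketched below. The multiplicative combination and all constants are illustrative assumptions; the text only lists which factors may contribute.

```python
import math

def cfs_importance(cfs, pos_tags, word_freq, type_weight):
    # Combine the factors listed above: CFS length, per-word corpus
    # frequency and a POS-derived factor, scaled by a CFS-type weight.
    # The combination and the constants are illustrative assumptions.
    length_factor = math.log1p(len(cfs))
    # Rarer words make a CFS more informative: inverse log frequency.
    rarity = sum(1.0 / math.log1p(word_freq(w) + 1) for w in cfs) / len(cfs)
    pos_factor = 1.5 if any(t.startswith(("N", "V")) for t in pos_tags) else 1.0
    return length_factor * rarity * pos_factor * type_weight
```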
- a computer-assisted language correction system including vocabulary enhancement functionality including vocabulary-challenged words identification functionality, alternative vocabulary enhancements generation functionality and context based scoring functionality, based at least partially on contextual feature-sequence (CFS) frequencies of occurrences in an internet corpus, the alternative vocabulary enhancements generation functionality including thesaurus preprocessing functionality operative to generate candidates for vocabulary enhancement.
- the multiple alternatives are evaluated based on contextual feature sequences (CFSs) and the confidence level is based on at least one of the following parameters: number, type and scoring of selected CFSs, a measure of statistical significance of frequency of occurrence of the multiple alternatives, in the context of the CFSs, degree of consensus on the selection of one of the multiple alternatives, based on preference metrics of each of the CFSs and word similarity scores of the multiple alternatives, a non-contextual similarity score of the one of the multiple alternatives being above a first predetermined minimum threshold and an extent of contextual data available, as indicated by the number of the CFSs having CFS scores above a second predetermined minimum threshold and having preference scores over a third predetermined threshold.
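The confidence-level parameters enumerated above can be combined as sketched below: the similarity-score and strong-CFS-count conditions act as gates, and the remaining parameters are blended into a value in [0, 1]. The gating/blending structure, thresholds and weights are placeholders, not values taken from the text.

```python
def correction_confidence(n_cfs_selected, stat_significance, consensus,
                          similarity_score, n_strong_cfs,
                          min_similarity=0.7, min_strong=2):
    # Gate on the two threshold conditions named above, then blend the
    # remaining parameters into [0, 1]; thresholds and weights are
    # placeholders, not values taken from the text.
    if similarity_score < min_similarity or n_strong_cfs < min_strong:
        return 0.0
    return min(1.0, 0.4 * consensus + 0.4 * stat_significance
               + 0.2 * min(n_cfs_selected / 10.0, 1.0))
```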
- a computer-assisted language correction system including a punctuation error suspector evaluating at least some of the words and punctuation in a language input on the basis of their fit within a context of the language input based on frequency of occurrence of feature-grams of the language input in an internet corpus and a correction generator operative to provide a correction output based at least partially on an evaluation performed by the suspector.
- the correction generator includes at least one of missing punctuation correction functionality, superfluous punctuation correction functionality and punctuation replacement correction functionality.
- a computer-assisted language correction system including a grammatical element error suspector evaluating at least some of the words in a language input on the basis of their fit within a context of the language input based on frequency of occurrence of feature-grams of the language input in an internet corpus and a correction generator operative to provide a correction output based at least partially on an evaluation performed by the suspector.
- the correction generator includes at least one of missing grammatical element correction functionality, superfluous grammatical element correction functionality and grammatical element replacement correction functionality.
- the grammatical element is one of an article, a preposition and a conjunction.
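Detecting a missing grammatical element via feature-gram frequencies, as described above, can be sketched as follows: a position is flagged when inserting some article, preposition or conjunction there makes the surrounding feature-gram much more frequent in the corpus. The window size and the frequency-ratio test are assumed heuristics, and `freq` is a hypothetical corpus-lookup function.

```python
def suspect_missing_element(words, i, elements, freq, ratio_threshold=5.0):
    # Flag position i as missing an article, preposition or conjunction
    # when inserting some element there makes the surrounding
    # feature-gram far more frequent in the corpus; the ratio test is
    # an assumed heuristic, not the patent's criterion.
    window = tuple(words[max(0, i - 1):i + 1])
    base = freq(window) or 1
    best = max(elements, key=lambda e: freq(window[:1] + (e,) + window[1:]))
    if freq(window[:1] + (best,) + window[1:]) / base > ratio_threshold:
        return best
    return None
```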
- Fig. 1 is a simplified block diagram illustration of a system and functionality for computer-assisted language correction constructed and operative in accordance with a preferred embodiment of the present invention;
- Fig. 2 is a simplified flow chart illustrating spelling correction functionality, preferably employed in the system and functionality of Fig. 1;
- Fig. 3 is a simplified flow chart illustrating misused word and grammar correction functionality, preferably employed in the system and functionality of Fig. 1;
- Fig. 4 is a simplified flow chart illustrating vocabulary enhancement functionality, preferably employed in the system and functionality of Fig. 1;
- Fig. 5 is a simplified block diagram illustrating contextual-feature-sequence (CFS) functionality, preferably employed in the system and functionality of Fig. 1;
- Fig. 6A is a simplified flow chart illustrating spelling correction functionality forming part of the functionality of Fig. 2 in accordance with a preferred embodiment of the present invention
- Fig. 6B is a simplified flow chart illustrating misused word and grammar correction functionality forming part of the functionality of Fig. 3 in accordance with a preferred embodiment of the present invention
- Fig. 6C is a simplified flow chart illustrating vocabulary enhancement functionality forming part of the functionality of Fig. 4 in accordance with a preferred embodiment of the present invention
- Fig. 7A is a simplified flow chart illustrating functionality for generating alternative corrections which is useful in the functionalities of Figs. 2 and 3;
- Fig. 7B is a simplified flow chart illustrating functionality for generating alternative enhancements which is useful in the functionality of Fig. 4;
- Fig. 8 is a simplified flow chart illustrating functionality for non-contextual word similarity-based scoring and contextual scoring, preferably using an internet corpus, of various alternative corrections useful in the spelling correction functionality of Fig. 2;
- Fig. 9 is a simplified flow chart illustrating functionality for non-contextual word similarity-based scoring and contextual scoring, preferably using an internet corpus, of various alternative corrections useful in the misused word and grammar correction functionalities of Figs. 3, 10 and 11 and in the vocabulary enhancement functionality of Fig. 4;
- Fig. 10 is a simplified flowchart illustrating the operation of missing article, preposition and punctuation correction functionality;
- Fig. 11 is a simplified flowchart illustrating the operation of superfluous article, preposition and punctuation correction functionality
- Fig. 12 is a simplified block diagram illustration of a system and functionality for computer-assisted language translation and generation, constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 13 is a simplified flow chart illustrating sentence retrieval functionality preferably forming part of the system and functionality of Fig. 12;
- Figs. 14A and 14B together are a simplified flow chart illustrating sentence generation functionality preferably forming part of the system and functionality of Fig. 12;
- Fig. 15 is a simplified flow chart illustrating functionality for generating alternatives which is useful in the functionalities of Figs. 13, 14A & 14B.
- Fig. 1 is a simplified block diagram illustration of a system and functionality for computer-assisted language correction constructed and operative in accordance with a preferred embodiment of the present invention.
- text for correction is supplied to a language correction module 100 from one or more sources, including, without limitation, word processor functionality 102, machine translation functionality 104, speech-to-text conversion functionality 106, optical character recognition functionality 108 and any other text source 110, such as instant messaging or the internet.
- Language correction module 100 preferably includes spelling correction functionality 112, misused word and grammar correction functionality 114 and vocabulary enhancement functionality 116. It is a particular feature of the present invention that spelling correction functionality 112, misused word and grammar correction functionality 114 and vocabulary enhancement functionality 116 each interact with contextual-feature-sequence (CFS) functionality 118, which utilizes an internet corpus 120.
- a contextual-feature-sequence or CFS is defined for the purposes of the present description as including N-grams, skip-grams, switch-grams, co-occurrences, "previously used by user" features and combinations thereof, which are in turn defined hereinbelow with reference to Fig. 5. It is noted that for simplicity and clarity of description, most of the examples which follow employ N-grams only. It is understood that the invention is not so limited.
- the use of an internet corpus is important in that it provides significant statistical data for an extremely large number of contextual-feature-sequences, resulting in highly robust language correction functionality. In practice, combinations of over two words have very poor statistics in conventional non-internet corpuses but have acceptable or good statistics in internet corpuses.
- An internet corpus is a large representative sample of natural language text which is collected from the world wide web, usually by crawling on the internet and collecting text from website pages.
- dynamic text such as chat transcripts, texts from web forums and texts from blogs, is also collected.
- the collected text is used for accumulating statistics on natural language text.
- the size of an internet corpus can be, for example, one trillion (1,000,000,000,000) words or several trillion words, as opposed to more typical corpus sizes of up to 2 billion words.
- a small sample of the web, such as a 10-billion-word web corpus, represents significantly less than one percent of the web texts indexed by search engines, such as GOOGLE®.
- the present invention can work with a sample of the web, such as the web corpus, but preferably it utilizes a significantly larger sample of the web for the task of text correction.
- An internet corpus is preferably employed in one of the following two ways:
- One or more internet search engines is employed using a CFS as a search query.
- the number of results for each such query provides the frequency of occurrence of that CFS.
- a local index is built up over time by crawling and indexing the internet.
- the number of occurrences of each CFS provides the CFS frequency.
- the local index, as well as the search queries, may be based on selectable parts of the internet and may be identified with those selected parts. Similarly, parts of the internet may be excluded or appropriately weighted in order to correct anomalies between internet usage and general language usage. In such a way, websites that are reliable in terms of language usage, such as news and government websites, may be given greater weight than other websites, such as chat or user forums.
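By way of non-limiting illustration, the local-index approach described above may be sketched as follows. The miniature document set, the dictionary-based index and the function names are hypothetical stand-ins for a crawled internet corpus and are not part of the invention:

```python
# Illustrative sketch of a local CFS index: n-grams are counted while
# "crawling" documents, and a CFS frequency is then a simple index lookup.
from collections import Counter

def build_local_index(documents, max_n=3):
    """Count every n-gram (n <= max_n) appearing in the crawled documents."""
    index = Counter()
    for doc in documents:
        words = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])] += 1
    return index

def cfs_frequency(index, cfs):
    """The frequency of occurrence of a CFS is its count in the index."""
    return index[cfs.lower()]

docs = ["Some students exercise daily", "students exercise often"]
idx = build_local_index(docs)
print(cfs_frequency(idx, "students exercise"))  # 2
```

In practice, as noted above, the crawl may be restricted to, or weighted toward, selected parts of the internet, such as websites that are reliable in terms of language usage.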
- input text is initially supplied to spelling correction functionality 112 and thereafter to misused word and grammar correction functionality 114.
- the input text may be any suitable text and in the context of word processing is preferably a part of a document, such as a sentence.
- Vocabulary enhancement functionality 116 preferably is operated at the option of a user on text that has already been supplied to spelling correction functionality 112 and to misused word and grammar correction functionality 114.
- the language correction module 100 provides an output which includes corrected text accompanied by one or more suggested alternatives for each corrected word or group of words.
- Fig. 2 is a simplified flow chart illustrating spelling correction functionality, preferably employed in the system and functionality of Fig. 1.
- the spelling correction functionality preferably comprises the following steps: identifying spelling errors in an input text, preferably using a conventional dictionary enriched with proper names and words commonly used on the internet; grouping spelling errors into clusters, which may include single or multiple words, consecutive or near consecutive, having spelling mistakes and selecting a cluster for correction. This selection attempts to find the cluster which contains the largest amount of correct contextual data. Preferably, the cluster that has the longest sequence or sequences of correctly spelled words in its vicinity is selected.
- the foregoing steps are described hereinbelow in greater detail with reference to Fig. 6A.
- Regarding Cluster 2, it is noted that "their" is correctly spelled but is nevertheless included in a cluster since it is surrounded by misspelled words.
- Cluster 1, "eksersiv", is selected for correction inasmuch as it has the longest sequence or sequences of correctly spelled words in its vicinity.
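The cluster-selection criterion above, preferring the cluster surrounded by the longest run(s) of correctly spelled words, can be sketched as follows. The word list, error flags and function name are hypothetical illustrations:

```python
def select_cluster(words, is_error, clusters):
    """Pick the cluster surrounded by the most correctly spelled words,
    i.e. the longest runs of correct words in its vicinity.

    clusters: list of (start, end) word-index ranges.
    """
    def run(idx, step):
        # Length of the run of correctly spelled words starting at idx.
        n = 0
        while 0 <= idx < len(words) and not is_error[idx]:
            n += 1
            idx += step
        return n

    return max(clusters, key=lambda c: run(c[0] - 1, -1) + run(c[1] + 1, 1))

words = "Becase I have to exercise daly at hme".split()
errs = [True, False, False, False, False, True, False, True]
clusters = [(0, 0), (5, 5), (7, 7)]
print(select_cluster(words, errs, clusters))  # (5, 5), i.e. "daly"
```

Here "daly" is selected because it has four correctly spelled words on one side and one on the other, more correct contextual data than either of the other clusters.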
- the non-contextual score may be derived in various ways.
- One example is by using the Levenshtein Distance algorithm, which is described at http://en.wikipedia.org/wiki/Levenshtein_distance. This algorithm can be applied to word strings, to phonetic representations of words, or to a combination of both.
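A minimal dynamic-programming implementation of the Levenshtein Distance, together with one hypothetical way of turning the distance into a non-contextual similarity score, is sketched below; the normalization in `similarity` is an illustrative choice, not prescribed by the present description:

```python
def levenshtein(a, b):
    """Classic edit distance: minimum insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Hypothetical non-contextual similarity score in [0, 1]."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))  # 3
```

The same routine can be run on phonetic encodings of the two words instead of, or in addition to, their character strings.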
- Each alternative is also given a contextual score, as seen in Table 3, based on its fit in the context of the input sentence.
- the context that is used is "Some students should ⁇ eksersiv> daily"
- the contextual score is preferably derived as described hereinbelow with reference to Fig. 8 and is based on contextual feature sequence (CFS) frequencies in an internet corpus.
- the word "exercise” is selected as the best alternative based on a combination of the contextual score and non-contextual word similarity score, as described hereinbelow with reference to Fig. 8.
- the spelling- corrected input text following spelling correction in accordance with a preferred embodiment of the present invention is:
- Fig. 3 is a simplified flow chart illustrating misused word and grammar correction functionality, preferably employed in the system and functionality of Fig. 1.
- the misused word and grammar correction functionality provides correction of words which are correctly spelled but misused in the context of the input text and correction of grammar mistakes, including the use of a grammatically incorrect word in place of a grammatically correct word, the use of superfluous words, and missing words and punctuation.
- the misused word and grammar correction functionality preferably comprises the following steps: identifying suspected misused words and words having grammar mistakes in a spelling-corrected input text output from the spelling correction functionality of Fig. 2; grouping suspected misused words and words having grammar mistakes into clusters, which are preferably non-overlapping; and selecting a cluster for correction.
- the identifying, grouping and selecting steps are preferably based on an algorithm described hereinbelow with reference to Fig. 6B; generating one or preferably more alternative corrections for each cluster, preferably based on an alternative correction generation algorithm described hereinbelow with reference to Fig. 7A; generating one or preferably more alternative corrections for each cluster, based on a missing article, preposition and punctuation correction algorithm described hereinbelow with reference to Fig. 10;
- the scoring includes applying a bias in favor of the suspect word vis-a-vis ones of the multiple alternatives therefor, the bias being a function of an input uncertainty metric indicating uncertainty of a person providing the input.
- the following words are identified as suspected misused words: money, book
- the following cluster is generated: money book
- The following are examples of alternative corrections which are generated for the cluster (partial list): money books; money back; money box; money bulk; money Buick; money ebook; money bank; mini book; mummy book; Monet book; honey book; mannerly book; mono book; Monday book; many books; mini bike; mummy back; monkey bunk; Monday booked; Monarchy back; Mourned brook
- Fig. 4 is a simplified flow chart illustrating vocabulary enhancement functionality, employed in the system and functionality of Fig. 1.
- the vocabulary enhancement functionality preferably comprises the following steps: identifying vocabulary-challenged words having suspected suboptimal vocabulary usage in a spelling, misused word and grammar-corrected input text output from the misused word and grammar correction functionality of Fig. 3; grouping vocabulary-challenged words into clusters, which are preferably non-overlapping; selecting a cluster for correction.
- the identifying, grouping and selecting steps are preferably based on an algorithm described hereinbelow with reference to Fig. 6C.
- CFSs are the feature-grams:
- FIG. 5 is a simplified block diagram illustrating contextual-feature-sequence (CFS) functionality 118 (Fig. 1) useful in the system and functionality for computer-assisted language correction of a preferred embodiment of the present invention.
- the CFS functionality 118 preferably includes feature extraction functionality including N-gram extraction functionality and optionally at least one of skip-gram extraction functionality; switch-gram extraction functionality; co-occurrence extraction functionality; and previously used by user feature extraction functionality.
- N-gram which is a known term of the art, refers to a sequence of N consecutive words in an input text.
- the N-gram extraction functionality may employ conventional part-of-speech tagging and sentence parsing functionality in order to avoid generating certain N-grams which, based on grammatical considerations, are not expected to appear with high frequency in a corpus, preferably an internet corpus.
- skip-gram extraction functionality means functionality operative to extract "skip-grams", which are modified n-grams which leave out certain non-essential words or phrases, such as adjectives, adverbs, adjectival phrases and adverbial phrases, or which contain only words having predetermined grammatical relationships, such as subject-verb, verb-object, adverb-verb or verb-time phrase.
- the skip-gram extraction functionality may employ conventional part-of-speech tagging and sentence parsing functionality to assist in deciding which words may be skipped in a given context.
- switch-gram extraction functionality means functionality which identifies "switch-grams", which are modified n-grams in which the order of appearance of certain words is switched.
- the switch-gram extraction functionality may employ conventional part-of-speech tagging and sentence parsing functionality to assist in deciding which words may have their order of appearance switched in a given context.
- co-occurrence extraction functionality means functionality which identifies word combinations in an input sentence or an input document containing many input sentences, having input text word co-occurrence for all words in the input text other than those included in the N-grams, switch-grams or skip-grams, together with indications of distance from an input word and direction, following filtering out of commonly occurring words, such as prepositions, articles, conjunctions and other words whose function is primarily grammatical.
- previously used by user feature extraction functionality means functionality which identifies words used by a user in other documents, following filtering out of commonly occurring words, such as prepositions, articles, conjunctions and other words whose function is primarily grammatical.
- For the purposes of the present description, N-grams, skip-grams, switch-grams and combinations thereof are termed feature-grams.
- N-grams, skip-grams, switch-grams, co-occurrences, "previously used by user" features and combinations thereof are termed contextual-feature-sequences or CFSs.
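The three feature-gram types defined above can be sketched as follows. A real implementation would employ part-of-speech tagging to decide which words may be skipped or switched; here a fixed list of skippable words is a hypothetical stand-in:

```python
# Hypothetical stand-in for part-of-speech-based skippable words
# (adjectives, adverbs and the like).
SKIPPABLE = {"very", "quickly", "really"}

def ngrams(words, n):
    """All sequences of n consecutive words in the input text."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def skip_grams(words, n):
    """n-grams over the text with non-essential words left out."""
    kept = [w for w in words if w not in SKIPPABLE]
    return ngrams(kept, n)

def switch_grams(words, n):
    """n-grams in which the order of adjacent words is switched."""
    grams = []
    for g in ngrams(words, n):
        for i in range(n - 1):
            s = list(g)
            s[i], s[i + 1] = s[i + 1], s[i]
            grams.append(tuple(s))
    return grams

words = "he very quickly ran home".split()
print(ngrams(words, 2))
print(skip_grams(words, 2))  # [('he', 'ran'), ('ran', 'home')]
```

The skip-gram "he ran" can have meaningful corpus statistics even when the full 4-gram "he very quickly ran" is too rare to score.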
- The functionality of Fig. 5 preferably operates on individual words or clusters of words in an input text.
- N-grams 2-grams: Cherlock Homes; Homes the
- CFSs are each given an "importance score" based on at least one of, preferably more than one of and most preferably all of the following: a. operation of conventional part-of-speech tagging and sentence parsing functionality. A CFS which includes parts of multiple parsing tree nodes is given a relatively low score.
- Fig. 6A is a simplified flow chart illustrating functionality for identifying misspelled words in the input text; grouping misspelled words into clusters, which are preferably non-overlapping; and selecting a cluster for correction.
- identifying misspelled words is preferably carried out by using a conventional dictionary enriched with proper names and words commonly used on the internet.
- Grouping misspelled words into clusters is preferably carried out by grouping consecutive or nearly consecutive misspelled words into a single cluster along with misspelled words which have a grammatical relationship.
- Selecting a cluster for correction is preferably carried out by attempting to find the cluster which contains the largest amount of non-suspected contextual data.
- the cluster that has the longest sequence or sequences of correctly spelled words in its vicinity is selected.
- Fig. 6B is a simplified flow chart illustrating functionality for identifying suspected misused words and words having grammar mistakes in a spelling-corrected input text; grouping suspected misused words and words having grammar mistakes into clusters, which are preferably non-overlapping; and selecting a cluster for correction.
- Identifying suspected misused words is preferably carried out as follows: feature-grams are generated for each word in the spelling-corrected input text; the frequency of occurrence of each of the feature-grams in a corpus, preferably an internet corpus, is noted; the number of suspected feature-grams for each word is noted. Suspected feature-grams have a frequency which is significantly lower than their expected frequency or which lies below a minimum frequency threshold. The expected frequency of a feature-gram is estimated on the basis of the frequencies of its constituent elements and combinations thereof. a word is suspected if the number of suspected feature-grams containing the word exceeds a predetermined threshold.
- the frequency of occurrence of each feature-gram in the spelling-corrected input text in a corpus is ascertained.
- the frequency of occurrence of each word in the spelling-corrected input text in that corpus (FREQ W) is also ascertained and the frequency of occurrence of each feature-gram without that word (FREQ FG-W) is additionally ascertained.
- EFREQ F-G = (FREQ F-G-W * FREQ W)/(TOTAL OF FREQUENCIES OF ALL WORDS IN THE CORPUS). If the ratio of the frequency of occurrence of each feature-gram in the spelling-corrected input text in a corpus, preferably an internet corpus, to the expected frequency of occurrence of each feature-gram, FREQ F-G/EFREQ F-G, is less than a predetermined threshold, or if FREQ F-G is less than another predetermined threshold, the feature-gram is considered to be a suspected feature-gram. Every word that is included in a suspected feature-gram is considered to be a suspected misused word or a word having a suspected grammar mistake.
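The suspicion test above can be sketched numerically. The corpus counts below are hypothetical round numbers chosen for illustration, not measured internet-corpus frequencies:

```python
def expected_freq(freq_fg_without_word, freq_word, total_words):
    """EFREQ F-G = (FREQ F-G-W * FREQ W) / total corpus word count."""
    return freq_fg_without_word * freq_word / total_words

def is_suspected(freq_fg, efreq_fg, ratio_threshold=1.0, min_freq=10):
    """A feature-gram is suspected if it occurs far less often than
    expected, or if it occurs below a minimum frequency threshold."""
    return freq_fg / efreq_fg < ratio_threshold or freq_fg < min_freq

# Hypothetical internet-corpus counts for the 2-gram "money book":
total = 1_000_000_000_000       # one trillion words in the corpus
freq_money = 300_000_000        # FREQ W for "money"
freq_book = 700_000_000         # FREQ F-G-W, the feature-gram minus "money"
freq_money_book = 300           # FREQ F-G, observed count of "money book"

efreq = expected_freq(freq_book, freq_money, total)
print(efreq)                             # 210000.0
print(is_suspected(freq_money_book, efreq))  # True, so "money book" is suspect
```

Because the observed count (300) is far below the expected count (210,000), the ratio falls below the threshold of 1 and the cluster "money book" is suspected, consistent with the worked example below.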
- the operation of the functionality of Fig. 6B for identifying suspected misused words and words having grammar mistakes in a spelling-corrected input text may be better understood from a consideration of the following example:
- the following spelling-corrected input text is provided: I have money book
- the feature-grams include the following:
- Table 8 indicates the frequencies of occurrence in an internet corpus of the above feature-grams:
- EFREQ F-G = (FREQ F-G-W * FREQ W)/(TOTAL OF FREQUENCIES OF ALL WORDS IN THE CORPUS)
- the expected 2-gram frequency for a 2-gram (x,y) = (1-gram frequency of x * 1-gram frequency of y)/number of words in the internet corpus, e.g., one trillion (1,000,000,000,000) words.
- the ratio of the frequency of occurrence of each feature-gram in the spelling-corrected input text in a corpus, preferably an internet corpus, to the expected frequency of occurrence of each feature-gram is calculated as follows:
- FREQ F-G/EFREQ F-G The ratio of the frequency of occurrence of each of the above 2-grams in the spelling-corrected input text in a corpus, preferably an internet corpus, to the expected frequency of occurrence of each of the above 2-grams are seen in Table 9:
- FREQ F-G of "money book” is substantially lower than its expected frequency and thus FREQ F-G/EFREQ F-G may be considered to be lower than a predetermined threshold, such as 1, and therefore the cluster "money book” is suspected.
- Grouping suspected misused words and words having grammar mistakes into clusters is preferably carried out as follows: consecutive or nearly consecutive suspected misused words are grouped into a single cluster; and suspected misused words which have a grammatical relationship between themselves are grouped into the same cluster.
- Selecting a cluster for correction is preferably carried out by attempting to find the cluster which contains the largest amount of non-suspected contextual data.
- the cluster that has the longest sequence or sequences of non-suspected words in its vicinity is selected.
- Fig. 6C is a simplified flow chart illustrating functionality for identifying vocabulary-challenged words having suspected suboptimal vocabulary usage in a spelling, misused word and grammar-corrected input text; grouping vocabulary-challenged words into clusters, which are preferably non-overlapping; and selecting a cluster for correction.
- Identifying vocabulary-challenged words is preferably carried out as follows: pre-processing a thesaurus in order to assign language richness scores to each word, which indicate the level of the word in a hierarchy wherein written language is preferred over spoken language, wherein among internet sources, articles and books are preferred over chat and forums, for example, and wherein less frequently used words are preferred over more frequently used words; further pre-processing of the thesaurus to eliminate words which are not likely candidates for vocabulary enhancement, based on the results of the preceding pre-processing step and on grammatical rules; additional pre-processing to indicate, for each remaining word, candidates for vocabulary enhancement which have a language richness score higher than that of the input word; and checking whether each word in the spelling, misused word and grammar-corrected input text appears as a remaining word in the pre-processed thesaurus and identifying each such word which appears as a remaining word as a candidate for vocabulary enhancement.
- Grouping vocabulary-challenged words into clusters is optional and is preferably carried out as follows: consecutive vocabulary-challenged words are grouped into a single cluster; and vocabulary-challenged words which have a grammatical relationship are grouped into the same cluster.
- Selecting a cluster for correction is preferably carried out by attempting to find the cluster which contains the largest amount of non vocabulary-challenged words.
- the cluster that has the longest sequence or sequences of non vocabulary-challenged words in its vicinity is selected.
- Fig. 7A is a simplified flow chart illustrating functionality for generating alternative corrections for a cluster, which is useful in the functionalities of Figs. 2 and 3. If the original input word is correctly spelled, it is considered as an alternative.
- a plurality of alternative corrections is initially generated in the following manner: A plurality of words, taken from a dictionary, similar to each word in the cluster, both on the basis of their written appearance, expressed in character string similarity, and on the basis of sound or phonetic similarity, is retrieved.
- This functionality is known and available on the internet as freeware, such as GNU Aspell and Google® GSpell.
- the retrieved and prioritized words provide a first plurality of alternative corrections.
- the word "physics” will be retrieved from the dictionary, based on a similar sound, even though it has only one character, namely "i", in common.
- the word "felix” will be retrieved, based on its string character similarity, even though it doesn't have a similar sound.
- Additional alternatives may be generated by employing rules based on known alternative usages as well as accumulated user inputs, e.g., u -> you, r -> are, Im -> I am.
- Contextual information, such as CFSs and more particularly feature-grams, is employed to generate alternative corrections and not only for scoring such "contextually retrieved" alternative corrections.
- Frequently occurring word combinations such as CFSs and more particularly feature-grams, may be retrieved from an existing corpus, such as an internet corpus.
- Contextually retrieved alternatives are then filtered, such that only contextually retrieved alternatives having some phonetic or writing similarity to the original word, in the present example "kts", remain.
- the alternative having the highest phonetic and writing similarity, "kittens", is retrieved.
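Filtering contextually retrieved alternatives by similarity to the original word can be sketched as follows. Only character-string similarity is shown, using Python's standard `difflib` as a hypothetical similarity measure; a full implementation would also employ phonetic similarity, and the candidate list and threshold are illustrative:

```python
import difflib

def filter_by_similarity(original, candidates, threshold=0.4):
    """Keep only contextually retrieved alternatives that have some
    writing (character-string) similarity to the original word."""
    return [c for c in candidates
            if difflib.SequenceMatcher(None, original, c).ratio() >= threshold]

# Candidates retrieved from the corpus by context, e.g. words that
# frequently follow "I like their", are then filtered against "kts".
alts = filter_by_similarity("kts", ["kittens", "dogs", "cats"])
print(alts)  # ['kittens', 'cats']
```

Dissimilar candidates such as "dogs" are discarded even if they fit the context well, leaving only alternatives such as "kittens" that also resemble the original input.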
- the input text is generated automatically by an external system, such as an optical character recognition, speech-to-text or machine translation system
- additional alternatives may be received directly from such system.
- Such additional alternatives typically are generated in the course of operation of such system.
- the alternative translations of a word in a foreign language may be supplied to the present system for use as alternatives.
- cluster alternatives for the entire cluster are generated by ascertaining all possible combinations of the various alternatives and subsequent filtering of the combinations based on the frequency of their occurrence in a corpus, preferably an internet corpus.
- Fig. 7B is a simplified flow chart illustrating functionality for generating alternative enhancements for a cluster, which is useful in the functionality of Fig. 4.
- the retrieved and prioritized words provide a first plurality of alternative enhancements.
- Additional alternatives may be generated by employing rules based on known alternative usages as well as accumulated user inputs.
- contextual information such as CFSs and more particularly feature-grams is employed to generate alternative enhancements and not only for scoring such "contextually retrieved" alternative enhancements.
- Frequently occurring word combinations such as CFSs and more particularly feature-grams, may be retrieved from an existing corpus, such as an internet corpus.
- Fig. 8 is a simplified flow chart illustrating functionality for context-based and word similarity-based scoring of various alternative corrections useful in the spelling correction functionality of Fig. 2.
- In sub-stage IIA, frequency of occurrence analysis is carried out, preferably using an internet corpus, on the various alternative cluster corrections produced by the functionality of Fig. 7A, in the context of the CFSs extracted as described hereinabove with reference to Fig. 5.
- CFS selection and weighting of the various CFSs is carried out based on, inter alia, the results of the frequency of occurrence analysis of sub-stage IIA.
- Weighting is also based on relative inherent importance of various CFSs. It is appreciated that some of the CFSs may be given a weighting of zero and are thus not selected. The selected CFSs preferably are given relative weightings.
- a frequency of occurrence metric is assigned to each alternative correction for each of the selected CFSs in sub-stage IIB.
- a reduced set of alternative cluster corrections is generated, based, inter alia, on the results of the frequency of occurrence analysis of sub-stage IIA, the frequency of occurrence metric of sub-stage IIC and the CFS selection and weighting of sub-stage IIB.
- the cluster having the highest non-contextual similarity score in stage I is selected from the reduced set in sub-stage IID for use as a reference cluster correction.
- in sub-stage IIE, a frequency of occurrence metric is assigned to the reference cluster correction for each of the CFSs selected in sub-stage IIB.
- a ratio metric is assigned to each of the CFSs selected in sub-stage IIB, which represents the ratio of the frequency of occurrence metric for each alternative correction for that feature to the frequency of occurrence metric assigned to the reference cluster correction of sub-stage IIE.
- a most preferred alternative cluster correction is selected based on the results of stage I and the results of stage II.
- a matrix is generated indicating the frequency of occurrence in a corpus, preferably an internet corpus, of each of the alternative corrections for the cluster in each of the CFSs. All CFSs for which all alternative corrections have a zero frequency of occurrence are eliminated. Thereafter, all CFSs which are entirely included in other CFSs having at least a minimum threshold frequency of occurrence are eliminated.
- each of the remaining CFSs is given a score as described hereinabove with reference to Fig. 5.
- CFSs which contain words introduced in an earlier correction iteration of the multi-word input and have a confidence level below a predetermined confidence level threshold are negatively biased.
- a normalized frequency matrix is generated indicating the normalized frequency of occurrence of each CFS in the internet corpus.
- the normalized frequency matrix is normally generated from the frequency matrix by dividing each CFS frequency by a function of the frequencies of occurrence of the relevant cluster alternatives.
- the normalization is operative to neutralize the effect of substantial differences in overall popularity of various alternative corrections.
- a suitable normalization factor is based on the overall frequencies of occurrence of various alternative corrections in a corpus as a whole, without regard to particular CFSs.
- normalized frequencies of occurrence which neutralize substantial differences in overall popularity of various alternative corrections, are preferably used in selecting among the alternative corrections. It is appreciated that other metrics of frequency of occurrence, other than normalized frequencies of occurrence, may alternatively or additionally be employed as metrics. Where the frequencies of occurrence are relatively low or particularly high, additional or alternative metrics are beneficial. It will be appreciated from the discussion that follows that additional functionalities are often useful in selecting among various alternative corrections. These functionalities are described hereinbelow.
- each alternative cluster correction which is less preferred than another alternative cluster correction according to both of the following metrics is eliminated: i. having a word similarity score lower than the other alternative cluster correction; and ii. having lower frequencies of occurrences and preferably also lower normalized frequencies of occurrence for all of the CFSs than the other alternative cluster correction.
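The normalization and the two-metric elimination rule above can be sketched as follows. The alternatives, CFS labels and frequency counts are hypothetical illustrations:

```python
def normalized_matrix(freq_matrix, overall_freq):
    """Divide each CFS frequency by the alternative's overall corpus
    frequency, neutralizing differences in overall popularity."""
    return {alt: {cfs: f / overall_freq[alt] for cfs, f in row.items()}
            for alt, row in freq_matrix.items()}

def eliminate_dominated(similarity, freq_matrix):
    """Drop any alternative that scores lower than some other alternative
    both on word similarity and on every CFS frequency."""
    alts = list(freq_matrix)
    return [a for a in alts
            if not any(similarity[b] > similarity[a] and
                       all(freq_matrix[b][c] > freq_matrix[a][c]
                           for c in freq_matrix[a])
                       for b in alts if b != a)]

# Hypothetical frequencies for two alternatives across two CFSs:
freqs = {"exercise": {"should _ daily": 800, "students should _": 300},
         "excessive": {"should _ daily": 5, "students should _": 2}}
sim = {"exercise": 0.9, "excessive": 0.7}
print(eliminate_dominated(sim, freqs))  # ['exercise']
```

Here "excessive" is eliminated because "exercise" beats it on the similarity score and on both CFS frequencies, so it can never be the preferred correction.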
- non-contextual similarity scores of the alternative cluster corrections are as indicated in Table 14:
- The term "frequency function" is used below to refer to the frequency, the normalized frequency or a function of both the frequency and the normalized frequency.
- One possible preference metric is the highest occurrence frequency function for each alternative cluster correction in the reduced matrix or matrices for any of the CFSs in the reduced matrix or matrices. For example, the various alternative cluster corrections would be scored as follows:
- non-contextual similarity scores of the alternative cluster corrections are as indicated in Table 16:
- the alternative 'eagle' is selected because it has a CFS with a maximum frequency of occurrence.
- Another possible preference metric is the average occurrence frequency function of all CFSs for each alternative correction.
- the various alternative corrections would be scored as follows: The following input text is provided: A while ago sthe lived 3 dwarfs
- a further possible preference metric is the weighted sum, over all CFSs for each alternative correction, of the occurrence frequency function for each CFS multiplied by the score of that CFS as computed by the functionality described hereinabove with reference to Fig. 5.
- a Specific Alternative Correction/CFS preference metric is generated, as described hereinabove with reference to sub-stages IIE - IIG, by any one or more, and more preferably most and most preferably all of the following operations on the alternative corrections in the reduced matrix or matrices: i.
- the alternative cluster correction having the highest non-contextual similarity score is selected to be the reference cluster.
- ii. A modified matrix is produced wherein in each preference matrix, the occurrence frequency function of each alternative correction in each feature gram is replaced by the ratio of the occurrence frequency function of each alternative correction to the occurrence frequency function of the reference cluster.
- iii. A modified matrix of the type described hereinabove in ii. is further modified to replace the ratio in each preference metric by a function of the ratio which reduces the computational importance of very large differences in ratios, such as a logarithmic function.
- a modified matrix of the type described hereinabove in ii or iii is additionally modified by multiplying the applicable ratio or function of ratio in each preference metric by the appropriate CFS score. This provides emphasis based on correct grammatical usage and other factors which are reflected in the CFS score.
- a modified matrix of the type described hereinabove in ii, iii or iv is additionally modified by generating a function of the applicable ratio, function of ratio, frequency of occurrence and normalized frequency of occurrence.
- a preferred function is generated by multiplying the applicable ratio or function of ratio in each preference metric by the frequency of occurrence of that CFS.
- a final preference metric is computed for each alternative correction based on the Specific Alternative Correction/CFS preference metric as described hereinabove in D by multiplying the similarity score of the alternative correction by the sum of the Specific Alternative Correction/CFS preference metrics for all CFS for that Alternative Correction.
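Sub-step ii and the final metric just described can be sketched as follows; the similarity score and frequency values are hypothetical, and the per-CFS ratios stand in for whichever modified-matrix variant is in use.

```python
def ratio_row(alt_freqs, ref_freqs):
    """Step ii: replace each alternative's occurrence frequency function,
    per CFS, by its ratio to the reference cluster's value for that CFS."""
    return [a / r for a, r in zip(alt_freqs, ref_freqs)]

def final_preference(similarity, per_cfs_metrics):
    """Final metric: multiply the alternative's similarity score by the sum
    of its Specific Alternative Correction/CFS preference metrics."""
    return similarity * sum(per_cfs_metrics)

# Hypothetical two-CFS example: the alternative loses on the first CFS
# but wins on the second relative to the reference cluster.
ref_freqs = [100.0, 40.0]
alt_freqs = [50.0, 80.0]
metrics = ratio_row(alt_freqs, ref_freqs)   # [0.5, 2.0]
score = final_preference(0.9, metrics)      # 0.9 * (0.5 + 2.0)
```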
- An example illustrating the use of such a modified matrix is as follows:
- both the frequency of occurrence and the normalized frequency of occurrence of "teach” are greater than those of "touch”, but for another feature, both the frequency of occurrence and the normalized frequency of occurrence of "touch” are greater than those of "teach”.
- ratio metrics described hereinabove with reference to sub-stage IIG are preferably employed as described hereinbelow.
- non-contextual similarity scores of the alternative cluster corrections are as indicated in Table 21 :
- the reference cluster is "teach", since it has the highest similarity score. Nevertheless "touch” is selected based on the final preference score described hereinabove. This is not intuitive, as may be appreciated from a consideration of the above matrices which indicate that "teach” has the highest frequency of occurrence and the highest normalized frequency of occurrence.
- the final preference score indicates a selection of "touch” over “teach” since the ratio of frequencies of occurrence for a feature in which "touch" is favored is much greater than the ratio of frequencies of occurrence for the other feature in which "teach” is favored.
- an alternative correction may be filtered out on the basis of a comparison of frequency function values and preference metrics for that alternative correction and for the reference cluster using one or more of the following decision rules:
- non-contextual similarity scores of the alternative cluster corrections are as indicated in Table 23:
- a ranking is established based on the final preference metric developed as described hereinabove at A - E on the alternative corrections which survive the filtering in F.
- the alternative correction having the highest final preference score is selected.
- H. As discussed hereinabove with reference to Stage IV, a confidence level is assigned to the selected alternative correction. This confidence level is calculated based on one or more of the following parameters: a. number, type and scoring of selected CFSs as provided in sub-stage IIB above; b. statistical significance of frequency of occurrence of the various alternative cluster corrections, in the context of the CFSs; c. degree of consensus on the selection of an alternative correction, based on preference metrics of each of the CFSs and the word similarity scores of the various alternative corrections; d. non-contextual similarity score (stage I) of the selected alternative cluster correction being above a predetermined minimum threshold; and e. extent of contextual data available, as indicated by the number of CFSs in the reduced matrix having CFS scores above a predetermined minimum threshold and having preference scores over another predetermined threshold.
- the selected alternative correction is implemented without user interaction. If the confidence level is below the predetermined threshold but above a lower predetermined threshold, the selected alternative correction is implemented but user interaction is invited. If the confidence level is below the lower predetermined threshold, user selection based on a prioritized list of alternative corrections is invited.
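The three-way policy just described can be sketched as follows; the threshold values are hypothetical placeholders, since the text leaves them as predetermined parameters.

```python
def correction_policy(confidence, upper=0.8, lower=0.5):
    """Map a confidence level to one of the three actions described above.
    The 0.8/0.5 thresholds are illustrative assumptions only."""
    if confidence >= upper:
        return "implement without user interaction"
    if confidence >= lower:
        return "implement and invite user interaction"
    return "invite user selection from a prioritized list"
```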
- the confidence level is somewhat less, due to the fact that the alternative correction 'back' has a higher frequency of occurrence than 'beach' in the CFS 'beech in the summer' but 'beach' has a higher frequency of occurrence than 'back' in the CFSs 'on the beech in' and 'the beech in the'.
- the alternative correction 'beach' is selected with an intermediate confidence level based on criterion H(c).
- the alternative correction 'beach' is selected with an intermediate confidence level based on criterion H(c).
- the confidence level is even less, based on criterion H(a):
- Fig. 9 is a simplified flow chart illustrating functionality for context-based and word similarity-based scoring of various alternative corrections useful in the misused word and grammar correction functionality of Figs. 3, 10 and 11, and also in the vocabulary enhancement functionality of Fig. 4.
- the context-based and word similarity-based scoring of various alternative corrections proceeds in the following general stages: I. NON-CONTEXTUAL SCORING - Various cluster alternatives are scored on the basis of similarity to a cluster in the input text in terms of their written appearance and sound similarity. This scoring does not take into account any contextual similarity outside of the given cluster. II. CONTEXTUAL SCORING USING INTERNET CORPUS - Each of the various cluster alternatives is also scored on the basis of extracted contextual-feature-sequences (CFSs), which are provided as described hereinabove with reference to Fig. 5. This scoring includes the following sub-stages: IIA. Frequency of occurrence analysis is carried out, preferably using an internet corpus, on the various alternative cluster corrections produced by the functionality of Figs. 7A or 7B, in the context of the CFSs extracted as described hereinabove in Fig. 5.
- CFS selection and weighting of the various CFSs based on, inter alia, the results of the frequency of occurrence analysis of sub-stage IIA. Weighting is also based on relative inherent importance of various CFSs. It is appreciated that some of the CFSs may be given a weighting of zero and are thus not selected. The selected CFSs preferably are given relative weightings.
- a reduced set of alternative cluster corrections is generated, based, inter alia, on the results of the frequency of occurrence analysis of sub-stage IIA, the frequency of occurrence metric of sub-stage IIC and the CFS selection and weighting of sub-stage IIB.
- the input cluster is selected for use as a reference cluster correction.
- a frequency of occurrence metric is assigned to the reference cluster correction of sub-stage IIE for each of the selected CFSs in sub-stage IIB.
- a most preferred alternative cluster correction is selected based on the results of stage I and the results of stage II.
- a matrix is generated indicating the frequency of occurrence in a corpus, preferably an internet corpus, of each of the alternative corrections for the cluster in each of the CFSs. All CFSs for which all alternative corrections have a zero frequency of occurrence are eliminated. Thereafter, all CFSs which are entirely included in other CFSs having at least a minimum threshold frequency of occurrence are eliminated.
- each of the remaining CFSs is given a score as described hereinabove with reference to Fig. 5. Additionally CFSs which contain words introduced in an earlier correction iteration of the multi-word input and have a confidence level below a predetermined confidence level threshold are negatively biased.
- a normalized frequency matrix is generated indicating the normalized frequency of occurrence of each CFS in the internet corpus.
- the normalized frequency matrix is normally generated from the frequency matrix by dividing each CFS frequency by a function of the frequencies of occurrence of the relevant cluster alternatives.
- the normalization is operative to neutralize the effect of substantial differences in overall popularity of various alternative corrections.
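A minimal sketch of this normalization, assuming the normalization factor is simply each alternative's overall frequency of occurrence in the corpus:

```python
def normalize_matrix(freq_matrix, overall_freq):
    """Sub-stage IIC sketch: divide each CFS frequency by the alternative's
    overall corpus frequency, neutralizing differences in overall popularity
    among the alternative corrections."""
    return {
        alt: {cfs: f / overall_freq[alt] for cfs, f in row.items()}
        for alt, row in freq_matrix.items()
    }

# Hypothetical counts: 'cat' is globally far more popular than 'catamaran',
# so its raw CFS frequency overstates its contextual fit.
freq = {"cat": {"the cat sat": 5000}, "catamaran": {"the catamaran sat": 50}}
overall = {"cat": 1_000_000, "catamaran": 5_000}
norm = normalize_matrix(freq, overall)
# After normalization, 'catamaran' actually scores higher in this CFS.
```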
- a suitable normalization factor is based on the overall frequencies of occurrence of various alternative corrections in a corpus as a whole, without regard to CFSs. The following example illustrates the generation of a normalized frequency of occurrence matrix:
- normalized frequencies which neutralize substantial differences in overall popularity of various alternative corrections, are used in selecting among the alternative corrections. It is appreciated that other metrics of frequency of occurrence, other than normalized frequencies of occurrence, may alternatively or additionally be employed as metrics. Where the frequencies of occurrence are relatively low or particularly high, additional or alternative metrics are beneficial.
- each alternative cluster correction which is less preferred than another alternative correction according to both of the following metrics is eliminated: i. having a word similarity score lower than the other alternative cluster correction; and ii. having lower frequencies of occurrences and preferably also lower normalized frequencies of occurrence for all of the CFSs than the other alternative cluster correction.
- non-contextual similarity scores of the alternative cluster corrections are as indicated in Table 31 :
- the alternative cluster correction "love in” is eliminated as it has a lower similarity score as well as lower frequencies of occurrence and lower normalized frequencies of occurrence than “live in”.
- the alternative cluster correction "leave in” is not eliminated at this stage since its similarity score is higher than that of "live in”.
- the result of operation of the functionality of sub-stage IID is a reduced frequency matrix and preferably also a reduced normalized frequency matrix, indicating the frequency of occurrence and preferably also the normalized frequency of occurrence of each of a reduced plurality of alternative corrections, each of which has a similarity score, for each of a reduced plurality of CFSs.
- the reduced set of alternative cluster corrections is preferably employed for all further alternative cluster selection functionalities as is seen from the examples which follow hereinbelow.
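The elimination rule of this sub-stage can be sketched as follows; the similarity scores and CFS frequencies echo the "live in"/"love in"/"leave in" example with hypothetical numbers.

```python
def reduce_alternatives(alts):
    """Sub-stage IID sketch: eliminate any alternative that has BOTH a lower
    word similarity score and lower frequencies for all CFSs than some other
    alternative; the survivors form the reduced matrix."""
    def dominated(a, b):
        return (a["similarity"] < b["similarity"]
                and all(a["freqs"][c] < b["freqs"][c] for c in a["freqs"]))
    return [a["name"] for a in alts
            if not any(a is not b and dominated(a, b) for b in alts)]

# Hypothetical data: 'love in' loses to 'live in' on similarity AND on every
# CFS frequency, so it is eliminated; 'leave in' has low frequencies but the
# highest similarity score, so it survives this stage.
alts = [
    {"name": "live in",  "similarity": 0.8, "freqs": {"c1": 100, "c2": 50}},
    {"name": "love in",  "similarity": 0.6, "freqs": {"c1": 10,  "c2": 5}},
    {"name": "leave in", "similarity": 0.9, "freqs": {"c1": 8,   "c2": 4}},
]
survivors = reduce_alternatives(alts)
```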
- a final preference metric is generated.
- One or more of the following alternative metrics may be employed to generate a final preference score for each alternative correction:
- frequency function is used below to refer to the frequency, the normalized frequency or a function of both the frequency and the normalized frequency.
- One possible preference metric is the highest occurrence frequency function for each alternative cluster correction in the reduced matrix or matrices for any of the CFSs in the reduced matrix or matrices.
- the various alternative cluster corrections would be scored as follows: The following input text is provided:
- Another possible preference metric is the average occurrence frequency function of all CFSs for each alternative correction.
- the various alternative corrections would be scored as follows: The following input text is provided:
- non-contextual similarity scores of the alternative cluster corrections are as indicated in Table 36:
- a further possible preference metric is the weighted sum over all CFSs for each alternative correction of the occurrence frequency function for each CFS multiplied by the score of that CFS as computed by the functionality described hereinabove with reference to Fig. 5.
- D. A Specific Alternative Correction/CFS preference metric is generated, as described hereinabove with reference to sub-stages IIE - IIG, by any one or more, and more preferably most and most preferably all of the following operations on the alternative corrections in the reduced matrix or matrices: i. The cluster from the original input text that is selected for correction is selected to be the reference cluster. ii.
- a modified matrix is produced wherein in each preference matrix, the occurrence frequency function of each alternative correction in each feature gram is replaced by the ratio of the occurrence frequency function of each alternative correction to the occurrence frequency function of the reference cluster.
- a modified matrix of the type described hereinabove in ii. is further modified to replace the ratio in each preference metric by a function of the ratio which function reduces the computational importance of very large differences in ratios.
- a suitable such function is a logarithmic function. The purpose of this operation is to de-emphasize the importance of large differences in frequencies of occurrence in the final preference scoring of the most preferred alternative corrections, while maintaining the importance of large differences in frequencies of occurrence in the final preference scoring, and thus elimination, of the least preferred alternative corrections.
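One way to realize such a function is a logarithmic compression of the ratio; the specific form below (log1p) is an illustrative choice, since the text requires only that very large ratio differences be de-emphasized.

```python
import math

def damp(ratio):
    """Compress a frequency ratio logarithmically: large ratios still win,
    but by far smaller margins, while ratios far below one remain strongly
    penalized relative to ratios near one."""
    return math.log1p(ratio)

# A 100x raw advantage shrinks to roughly a 3x advantage after damping,
# yet the ordering of alternatives is preserved.
raw_advantage = 1000 / 10
damped_advantage = damp(1000) / damp(10)
```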
- a modified matrix of the type described hereinabove in ii or iii is additionally modified by multiplying the applicable ratio or function of ratio in each preference metric by the appropriate CFS score. This provides emphasis based on correct grammatical usage and other factors which are reflected in the CFS score.
- a modified matrix of the type described hereinabove in ii, iii or iv is additionally modified by multiplying the applicable ratio or function of ratio in each preference metric by a function of a user uncertainty metric.
- Examples of a user input uncertainty metric include the number of edit actions related to an input word or cluster performed in a word processor, vis-a-vis edit actions on other words of the document; the timing of writing of an input word or cluster performed in a word processor, vis-a-vis the time of writing of other words of the document; and the timing of speaking of an input word or cluster performed in a speech recognition input functionality, vis-a-vis the time of speaking of other words by this user.
- the user input uncertainty metric provides an indication of how certain the user was of this choice of words. This step takes the computed bias to a reference cluster and modifies it by a function of the user's certainty or uncertainty regarding this cluster. vi.
- a modified matrix of the type described hereinabove in ii, iii, iv or v is additionally modified by generating a function of the applicable ratio, function of ratio, frequency of occurrence and normalized frequency of occurrence.
- a preferred function is generated by multiplying the applicable ratio or function of ratio in each preference metric by the frequency of occurrence of that CFS.
- a final preference metric is computed for each alternative correction based on the Specific Alternative Correction/CFS preference metric as described hereinabove in D by multiplying the similarity score of the alternative correction by the sum of the Specific Alternative Correction/CFS preference metrics for all CFSs for that alternative correction.
- both the frequency of occurrence and the normalized frequency of occurrence of "teach” are greater than those of "touch”, but for another feature, both the frequency of occurrence and the normalized frequency of occurrence of "touch” are greater than those of "teach”.
- ratio metrics described hereinabove with reference to sub-stage IIG are preferably employed as described hereinbelow.
- non-contextual similarity scores of the alternative cluster corrections are as indicated in Table 38:
- the reference cluster is "teach", since it has the highest similarity score. Nevertheless "touch” is selected based on the final preference score described hereinabove. This is not intuitive as may be appreciated from a consideration of the above matrices which indicate that "teach” has the highest frequency of occurrence and the highest normalized frequency of occurrence.
- the final preference score indicates a selection of "touch” over “teach” since the ratio of frequencies of occurrence for a feature in which "touch" is favored is much greater than the ratio of frequencies of occurrence for the other feature in which "teach” is favored.
- an alternative correction may be filtered out on the basis of a comparison of frequency function values and preference metrics for that alternative correction and for the reference cluster using one or more of the following decision rules:
- a confidence level is assigned to the selected alternative correction. This confidence level is calculated based on one or more of the following parameters: a. number, type and scoring of selected CFSs as provided in sub-stage IIB above; b. statistical significance of frequency of occurrence of the various alternative cluster corrections, in the context of the CFSs; c. degree of consensus on the selection of an alternative correction, based on preference metrics of each of the CFSs and the word similarity scores of the various alternative corrections; d. non-contextual similarity score (stage I) of the selected alternative cluster correction being above a predetermined minimum threshold; and e. extent of contextual data available, as indicated by the number of CFSs in the reduced matrix having CFS scores above a predetermined minimum threshold and having preference scores over another predetermined threshold.
- the selected alternative correction is implemented without user interaction. If the confidence level is below the predetermined threshold but above a lower predetermined threshold, the selected alternative correction is implemented but user interaction is invited. If the confidence level is below the lower predetermined threshold, user selection based on a prioritized list of alternative corrections is invited.
- the confidence level is somewhat less, due to the fact that the alternative correction 'back' has a higher frequency of occurrence than 'beach' in the CFS 'beech in the summer' but 'beach' has a higher frequency of occurrence than 'back' in the CFSs 'on the beech in' and 'the beech in the'.
- the alternative correction 'beach' is selected with an intermediate confidence level based on criterion H(c).
- the alternative correction 'beach' is selected with an intermediate confidence level based on criterion H(c).
- the confidence level is even less, based on criterion H(a):
- the only CFS that survives the filtering process is 'Exerts are'.
- the confidence level is relatively low, since the selection is based on only a single CFS, which is relatively short and includes, aside from the suspected word, only one word, which is a frequently occurring word.
- non-contextual similarity scores of the alternative cluster corrections are as indicated in Table 44:
- Fig. 10 is a detailed flowchart illustrating the operation of missing item correction functionality.
- the missing item correction functionality is operative to correct for missing articles, prepositions, punctuation and other items having principally grammatical functions in an input text. This functionality preferably operates on a spelling-corrected input text output from the spelling correction functionality of Fig. 1.
- feature-grams are generated for a spelling-corrected input text.
- the frequency of occurrence of each feature-gram in the spelling-corrected input text in a corpus, preferably an internet corpus (FREQ F-G), is ascertained.
- the expected frequency of a feature-gram, based on division of the words in the feature-gram into two consecutive parts following a word Wi, can be expressed as follows:
- If FREQ F-G/EFREQ F-G in respect of Wi is less than a predetermined threshold, the feature-gram in respect of Wi is considered to be suspect in terms of there being a missing article, preposition or punctuation between Wi and Wi+1 in that feature-gram.
- a suspect word junction between two consecutive words in a spelling-corrected input text is selected for correction, preferably by attempting to find the word junction which is surrounded by the largest amount of non-suspected contextual data.
- the word junction that has the longest sequence or sequences of non-suspected word junctions in its vicinity is selected.
- One or, preferably, more alternative insertions is generated for each word junction, preferably based on a predefined set of possibly missing punctuation, articles, prepositions, conjunctions or other items, which normally do not include nouns, verbs or adjectives.
- EFREQ F-G in respect of Wi = (FREQ (W1 - Wi) * FREQ (Wi+1 - Wn)) / (TOTAL OF OCCURRENCES OF ALL WORDS IN THE CORPUS)
- the actual frequency of occurrence of each of the feature-grams is less than the expected frequency of occurrence thereof. This indicates suspected absence of an item, such as punctuation.
- a list of alternative insertions to follow the word "read” is generated.
- This list preferably includes a predetermined list of punctuation, articles, conjunctions and prepositions. Specifically, it will include a period ".”
- the CFS frequency of occurrence that includes the cluster with the '.' is retrieved separately for the text before and after the '.'; i.e., the feature-gram "can't read. Please" will not be generated because it includes two separate grammar parsing phrases.
- a '.' is omitted from the beginning of a feature gram when calculating its frequency of occurrence in the corpus. For example, the frequency of ". Please help me” is identical to the frequency of "Please help me”.
- the final preference metric selects the alternative correction "read. Please” and the corrected input text is: I can't read. Please help me.
- the following example illustrates the functionality of adding a missing preposition.
- the final preference metric selects the alternative correction "sit on the" and the corrected input text is: I sit on the sofa.
- Fig. 11 is a detailed flowchart illustrating the operation of superfluous item correction functionality.
- the superfluous item correction functionality is operative to correct for superfluous articles, prepositions, punctuation and other items having principally grammatical functions in an input text. This functionality preferably operates on a spelling-corrected input text output from the spelling correction functionality of Fig. 1.
- Fig. 11 may be combined with the functionality of Fig. 10 or alternatively carried out in parallel therewith, prior thereto or following operation thereof.
- Identification of suspected superfluous items is carried out preferably in the following manner:
- a search is carried out on the spelling-corrected input text to identify items belonging to a predefined set of possibly superfluous punctuation, articles, prepositions, conjunctions and other items, which normally do not include nouns, verbs or adjectives.
- feature-grams are generated for all portions of the misused-word and grammar corrected, spelling-corrected input text containing such item.
- a frequency of occurrence is calculated for each such feature-gram and for a corresponding feature-gram in which the item is omitted. If the frequency of occurrence for the feature-gram in which the item is omitted exceeds the frequency of occurrence for the corresponding feature-gram in which the item is present, the item is considered as suspect.
- a suspect item in a misused-word and grammar corrected, spelling-corrected input text is selected for correction, preferably by attempting to find the item which is surrounded by the largest amount of non-suspected contextual data.
- the item that has the longest sequence or sequences of non-suspected words in its vicinity is selected.
- a possible item deletion is generated for each suspect item.
- At least partially context-based and word similarity-based scoring of the various alternatives, i.e. deletion of the item or non-deletion of the item, is provided, preferably based on a correction alternatives scoring algorithm, described hereinabove with reference to Fig. 9 and hereinbelow.
- the input text is searched to identify any items which belong to a predetermined list of commonly superfluous items, such as, for example, punctuation, prepositions, conjunctions and articles.
- the comma ",” is identified as belonging to such a list.
- the feature-grams, seen in Table 50, which include a comma ",” are generated and identical feature-grams without the comma are also generated (partial list): TABLE 50
- the frequency of occurrence for the feature grams with the ",” omitted exceeds the frequency of occurrence for corresponding feature grams with the ",” present. Therefore, the ",” is considered as suspect of being superfluous.
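The with/without frequency comparison can be sketched as a simple pairwise test; the comma counts below are hypothetical.

```python
def superfluous_suspect(freq_with, freq_without):
    """An item is suspected superfluous in a feature-gram when the variant
    with the item omitted occurs more often than the variant containing it."""
    return freq_without > freq_with

def item_is_suspect(pairs):
    """Apply the test across all feature-grams containing the item; any
    winning omission marks the item as suspect."""
    return any(superfluous_suspect(w, wo) for w, wo in pairs)

# Hypothetical (with-comma, without-comma) frequency pairs for the
# feature-grams around "a nice, thing".
pairs = [(120, 9500), (75, 4100)]
comma_is_suspect = item_is_suspect(pairs)
```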
- the remaining CFSs are the feature-grams: 'is a nice,'; 'a nice, thing'; 'nice, thing to'
- the final preference metric selects the alternative correction "food” and the corrected input text is: We should provide them food and water.
- Fig. 12 is a simplified block diagram illustration of a system and functionality for computer-assisted language translation and generation, constructed and operative in accordance with a preferred embodiment of the present invention.
- input text is supplied to a language generation module 200 from one or more sources, including, without limitation: sentence search functionality 201, which assists a user to construct sentences by enabling the user to enter a query containing a few words and to receive complete sentences containing such words; machine text generation functionality 202, which generates natural language sentences from a machine representation system such as a knowledge base or a logical form; word processor functionality 203, which may produce any suitable text, preferably part of a document, such as a sentence; machine translation functionality 204, which converts text in a source language into text in a target language and which is capable of providing multiple alternative translated texts, phrases and/or words in the target language, which may be processed by the language generation module as alternative input texts, alternative phrases and/or alternative words; and speech-to-text conversion functionality 205, which converts spoken input into text.
- sentence retrieval functionality 212 interacts with a stem-to-sentence index 216, which utilizes an internet corpus 220.
- the use of an internet corpus is important in that it provides an extremely large number of sentences, resulting in highly robust language generation functionality.
- An internet corpus is a large representative sample of natural language text which is collected from the world wide web, usually by crawling on the internet and collecting text from website pages.
- dynamic text such as chat transcripts, texts from web forums and texts from blogs, is also collected.
- the collected text is used for accumulating statistics on natural language text.
- the size of an internet corpus can be, for example, one trillion (1,000,000,000,000) words or several trillion words, as opposed to more typical corpus sizes of up to 2 billion words.
- a small sample of the web, such as a web corpus of 10 billion words, constitutes significantly less than one percent of the web texts indexed by search engines such as GOOGLE®.
- the present invention can work with a sample of the web, such as the web corpus, but preferably it utilizes a significantly larger sample of the web for the task of text generation.
- An internet corpus is preferably employed in one of the following two ways: One or more internet search engines is employed using modified input text as a search query. Sentences which include words contained in the search query may be extracted from the search results.
- the stem-to-sentence index 216 is built up over time by crawling and indexing the internet. Preferably this is done by reducing inflected words appearing in the internet corpus to their respective stems and listing all sentences in the corpus which include words having such stems.
- the stem-to-sentence index, as well as the search queries, may be based on selectable parts of the internet and may be identified with those selected parts. Similarly, parts of the internet may be excluded or appropriately weighted in order to correct anomalies between internet usage and general language usage. In such a way, websites that are reliable in terms of language usage, such as news and government websites, may be given greater weight than other websites, such as chat or user forums.
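A toy version of such a stem-to-sentence index might look as follows; the trailing-character stemmer is a stand-in for a real morphological stemmer.

```python
from collections import defaultdict

def naive_stem(word):
    """Placeholder stemmer: lowercase, strip trailing punctuation and a
    trailing inflection 's'; a real system would use proper morphology."""
    return word.lower().rstrip(".,!?").rstrip("s")

def build_stem_index(sentences):
    """Map each word stem to the set of sentence indices containing a word
    with that stem, mirroring the role of stem-to-sentence index 216."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in sentence.split():
            index[naive_stem(word)].add(i)
    return index

corpus = ["The cats sit on the sofa.", "A cat sleeps here."]
index = build_stem_index(corpus)
```

Retrieval then reduces to set intersection over the stems of the query, which is what makes all-mandatory-stem lookups cheap at corpus scale.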
- Preferably, input text is initially supplied to sentence retrieval functionality 212.
- The operation of sentence retrieval functionality 212 is described hereinbelow with additional reference to Fig. 13.
- the sentence retrieval functionality 212 is operative to split the input text into independent phrases which are then processed independently in the sentence generation module 214.
- Word stems are generated for all words in each independent phrase. Alternatively, word stems are not generated for some or all of the words in each independent phrase and in such a case, the words themselves are used in a word to sentence index to retrieve sentences from the internet corpus.
- the word stems are then classified as being either mandatory word stems or optional word stems.
- Optional word stems are word stems of adjectives, adverbs, articles, prepositions, punctuation and other items having principally grammatical functions in an input text as well as items in a predefined list of optional words.
- Mandatory word stems are all word stems which are not optional word stems. The optional word stems may be ranked as to their degree of importance in the input text.
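The classification step can be sketched as follows; the part-of-speech tag set and the optional-word list are illustrative assumptions.

```python
# Parts of speech whose stems are treated as optional (adjectives, adverbs,
# articles/determiners, prepositions, punctuation); illustrative tag set.
OPTIONAL_POS = {"ADJ", "ADV", "DET", "ADP", "PUNCT"}

def classify_stems(tagged_stems, optional_words=frozenset()):
    """Split word stems into mandatory and optional: a stem is optional if
    its part of speech is in OPTIONAL_POS or it appears on the predefined
    optional-word list; every other stem is mandatory."""
    mandatory, optional = [], []
    for stem, pos in tagged_stems:
        bucket = optional if (pos in OPTIONAL_POS or stem in optional_words) else mandatory
        bucket.append(stem)
    return mandatory, optional

phrase = [("sit", "VERB"), ("big", "ADJ"), ("sofa", "NOUN"), ("on", "ADP")]
mandatory, optional = classify_stems(phrase)
```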
- the stem-to-sentence index 216 is employed to retrieve all sentences in the internet corpus 220 which include all word stems. For each independent phrase, if the number of sentences retrieved is less than a predetermined threshold, the stem-to-sentence index 216 is employed to retrieve all sentences in the internet corpus 220 which include all mandatory word stems.
- a word stem alternatives generator is employed to generate alternatives for all mandatory word stems, as described hereinbelow with reference to Fig. 15.
- the stem-to-sentence index 216 is employed to retrieve all sentences in the internet corpus 220 which include as many mandatory word stems as possible, but no less than one mandatory word stem and also alternatives of all remaining mandatory word stems.
- the outputs of the sentence retrieval functionality 212 are preferably as follows: the independent phrases; for each independent phrase, the mandatory and optional word stems, together with their ranking; and the sentences retrieved from internet corpus 220.
- The above outputs of the sentence retrieval functionality 212 are supplied to the sentence generation functionality 214.
- the operation of sentence generation functionality 214 is described hereinbelow with additional reference to Figs 14A & 14B.
- Simplification of the sentences retrieved from the internet corpus 220 is carried out as described hereinbelow:
- Phrases are extracted from all of the sentences using standard parsing functionality. Phrases which do not include any word stem which appears in the corresponding independent phrase or which is an alternative word stem are deleted.
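The phrase-deletion step above can be sketched as a filter over extracted phrases. This sketch assumes whitespace tokenization and that tokens have already been reduced to their stems; a real parser would extract the phrases.

```python
def simplify_sentence(phrases, relevant_stems):
    """Keep only phrases containing at least one word stem from the
    corresponding independent phrase or an alternative word stem
    (illustrative; assumes tokens are already stemmed)."""
    return [p for p in phrases if any(s in p.split() for s in relevant_stems)]

kept = simplify_sentence(["the tall boy", "in the park"], {"boy", "run"})
```

The phrase "in the park" contains neither a relevant stem nor an alternative, so it is deleted.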
- The thus-simplified sentences resulting from the foregoing steps are grouped into groups having at least a predetermined degree of similarity, and the number of simplified sentences in each group is counted.
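The grouping-and-counting step can be sketched as a greedy clustering over token sets. The Jaccard similarity measure and the 0.8 threshold are illustrative assumptions; the patent does not specify the similarity metric.

```python
def group_similar(sentences, min_similarity=0.8):
    """Greedily group sentences whose token sets exceed a Jaccard
    similarity threshold, and count each group (illustrative sketch)."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)

    groups = []  # (representative token set, member sentences)
    for sent in sentences:
        tokens = set(sent.lower().split())
        for rep, members in groups:
            if jaccard(rep, tokens) >= min_similarity:
                members.append(sent)
                break
        else:
            groups.append((tokens, [sent]))
    return [(members, len(members)) for _, members in groups]
```

For example, "the boy runs fast" and "The boy runs fast" fall into one group of size two, while "cats sleep" forms its own group.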
- Each such group is ranked using the following criteria:
- A suitable composite ranking based on criteria A, B and C is preferably provided.
- Groups having rankings according to all of criteria A, B and C, taken individually, which fall below predetermined thresholds are eliminated.
- Groups whose rankings according to all of criteria A, B and C fall below the rankings of another group are eliminated.
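The two elimination rules above can be sketched together: a per-criterion threshold cut followed by removal of groups dominated on every criterion by another group. Representing each group's ranking as an (A, B, C) tuple with higher values better is an assumption for the example.

```python
def prune_groups(groups, thresholds):
    """Eliminate groups scoring below the thresholds on all of criteria
    A, B and C, then groups dominated on every criterion by another group.
    Scores are (A, B, C) tuples; higher is better (an assumption)."""
    # Rule 1: drop groups whose scores fall below the thresholds on all criteria
    survivors = {
        name: scores for name, scores in groups.items()
        if not all(v < t for v, t in zip(scores, thresholds))
    }

    # Rule 2: drop groups beaten on every criterion by some other group
    def dominated(scores):
        return any(
            all(o > v for o, v in zip(other, scores))
            for other in survivors.values()
            if other is not scores
        )

    return {name: s for name, s in survivors.items() if not dominated(s)}

remaining = prune_groups(
    {"g1": (5, 5, 5), "g2": (1, 1, 1), "g3": (4, 4, 4)},
    thresholds=(2, 2, 2),
)
```

Here g2 falls below the thresholds on all three criteria and g3 is dominated by g1, so only g1 remains.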
- Fig. 15 is a simplified flow chart illustrating functionality for generating alternatives for a word stem, which is useful in the functionalities of Figs. 12 and 13.
- This functionality is known and available on the internet as freeware, such as GNU Aspell and Google® GSpell.
- The retrieved and prioritized words provide a first plurality of alternatives.
- Additional alternatives may be generated by employing rules based on known alternative usages as well as accumulated user inputs. E.g., u -> you, r -> are, Im -> I am.
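A minimal sketch of such rule-based alternatives follows. The rule table is seeded with the u/r/Im examples from the text; any further rules would be accumulated from user inputs and are not shown.

```python
# Rule table seeded with the examples above; additional entries would be
# accumulated from user inputs over time.
REWRITE_RULES = {"u": ["you"], "r": ["are"], "im": ["I am"]}

def rule_alternatives(word):
    """Return known alternative usages for a word, if any."""
    return REWRITE_RULES.get(word.lower(), [])
```

Lowercasing the lookup key lets "Im" and "im" share one rule.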
- Contextual information, such as CFSs and more particularly feature-grams, may also be employed to generate alternatives: word stems which often appear in the same context may be valid alternatives.
- Frequently occurring word combinations, such as CFSs and more particularly feature-grams may be retrieved from an existing corpus, such as an internet corpus.
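The context-sharing idea can be sketched with a toy feature-gram table. The table contents are invented for illustration; in the described system such contexts would be retrieved from an internet corpus.

```python
# Toy feature-gram table: each word maps to the contexts ("*" marks the
# word's slot) in which it frequently occurs. The contents are invented.
NGRAM_CONTEXTS = {
    "car": {("drive", "*", "fast"), ("park", "the", "*")},
    "automobile": {("drive", "*", "fast")},
    "banana": {("eat", "a", "*")},
}

def context_alternatives(word, min_shared=1):
    """Words sharing enough frequent contexts are proposed as alternatives."""
    contexts = NGRAM_CONTEXTS.get(word, set())
    return sorted(
        other for other, ctx in NGRAM_CONTEXTS.items()
        if other != word and len(ctx & contexts) >= min_shared
    )
```

"car" and "automobile" share the context ("drive", *, "fast"), so each is proposed as an alternative for the other, while "banana" shares no context with either.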
- Where the input text is generated automatically by an external system, such as an optical character recognition, speech-to-text or machine translation system, additional alternatives may be received directly from that system.
- Such additional alternatives typically are generated in the course of operation of such system.
- The alternative translations of a word in a foreign language may be supplied to the present system for use as alternatives.
- The weight of a word stem indicates the importance of the word in the language.
- The weight is equal to 1 if the word stem is mandatory, and is less than 1 if the word stem is optional.
- Weights are indicated in brackets following each word stem. For example, "you (0.5)" means that the word stem 'you' has an importance weighting of 0.5.
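A sketch of this weighting scheme follows. Mandatory stems receive weight 1 and optional stems a weight below 1, as stated above; the exact 0.5/(rank+1) decay over the optional stems' importance ranking is an illustrative assumption chosen to reproduce the "you (0.5)" example.

```python
def stem_weights(mandatory, optional_ranked):
    """Assign weight 1 to mandatory stems and a weight below 1 to optional
    stems, decreasing with importance rank. The 0.5/(rank+1) decay is an
    illustrative assumption, matching the "you (0.5)" example."""
    weights = {stem: 1.0 for stem in mandatory}
    for rank, stem in enumerate(optional_ranked):
        weights[stem] = round(0.5 / (rank + 1), 2)
    return weights

weights = stem_weights(["leave"], optional_ranked=["you", "now"])
```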
- Composite Rank: a function of the group count multiplied by a weighted sum of the positive and negative match ranks.
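The composite rank definition can be written out as a one-line formula. The specific weight values and the subtraction of the negative match rank are illustrative assumptions; the text specifies only that the group count multiplies a weighted sum of the two match ranks.

```python
def composite_rank(group_count, positive_match, negative_match,
                   w_pos=1.0, w_neg=0.5):
    """Group count multiplied by a weighted sum of the positive and negative
    match ranks. The weights and the sign convention for the negative rank
    are illustrative assumptions."""
    return group_count * (w_pos * positive_match - w_neg * negative_match)

rank = composite_rank(group_count=4, positive_match=2.0, negative_match=1.0)
```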
- Sentence Retrieved From Internet Corpus | Simplified Sentence | Group | Group Ranking
- The second group is selected.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Priority Applications (8)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA2731899A CA2731899C (en) | 2007-08-01 | 2009-02-04 | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
CN200980138185.XA CN102165435B (en) | 2007-08-01 | 2009-02-04 | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
JP2011520650A JP5584212B2 (en) | 2008-07-31 | 2009-02-04 | Generate, correct, and improve languages that are automatically context sensitive using an Internet corpus |
US13/056,563 US8645124B2 (en) | 2007-08-01 | 2009-02-04 | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
EP09802606A EP2313835A4 (en) | 2008-07-31 | 2009-02-04 | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
HK12101697.0A HK1161646A1 (en) | 2008-07-31 | 2012-02-21 | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
US14/143,827 US9026432B2 (en) | 2007-08-01 | 2013-12-30 | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
US14/658,468 US20150186336A1 (en) | 2007-08-01 | 2015-03-16 | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IL2008/001051 WO2009016631A2 (en) | 2007-08-01 | 2008-07-31 | Automatic context sensitive language correction and enhancement using an internet corpus |
ILPCT/IL2008/001051 | 2008-07-31 |
Related Child Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/056,563 A-371-Of-International US8645124B2 (en) | 2007-08-01 | 2009-02-04 | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
US14/143,827 Continuation US9026432B2 (en) | 2007-08-01 | 2013-12-30 | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010013228A1 true WO2010013228A1 (en) | 2010-02-04 |
Family
ID=41611281
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IL2009/000130 WO2010013228A1 (en) | 2007-08-01 | 2009-02-04 | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP2313835A4 (en) |
JP (2) | JP5584212B2 (en) |
WO (1) | WO2010013228A1 (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013032617A1 (en) * | 2011-09-01 | 2013-03-07 | Google Inc. | Server-based spell checking |
US8645124B2 (en) | 2007-08-01 | 2014-02-04 | Ginger Software, Inc. | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
CN103942339A (en) * | 2014-05-08 | 2014-07-23 | 深圳市宜搜科技发展有限公司 | Synonym mining method and device |
US9015036B2 (en) | 2010-02-01 | 2015-04-21 | Ginger Software, Inc. | Automatic context sensitive language correction using an internet corpus particularly for small keyboard devices |
US9135544B2 (en) | 2007-11-14 | 2015-09-15 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US9400952B2 (en) | 2012-10-22 | 2016-07-26 | Varcode Ltd. | Tamper-proof quality management barcode indicators |
US9646277B2 (en) | 2006-05-07 | 2017-05-09 | Varcode Ltd. | System and method for improved quality management in a product logistic chain |
US10176451B2 (en) | 2007-05-06 | 2019-01-08 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US10445678B2 (en) | 2006-05-07 | 2019-10-15 | Varcode Ltd. | System and method for improved quality management in a product logistic chain |
CN110348001A (en) * | 2018-04-04 | 2019-10-18 | 腾讯科技(深圳)有限公司 | A kind of term vector training method and server |
CN110678860A (en) * | 2017-03-13 | 2020-01-10 | 里德爱思唯尔股份有限公司雷克萨斯尼克萨斯分公司 | System and method for word-by-word text mining |
US10697837B2 (en) | 2015-07-07 | 2020-06-30 | Varcode Ltd. | Electronic quality indicator |
US10909973B2 (en) | 2019-01-04 | 2021-02-02 | International Business Machines Corporation | Intelligent facilitation of communications |
US11060924B2 (en) | 2015-05-18 | 2021-07-13 | Varcode Ltd. | Thermochromic ink indicia for activatable quality labels |
US11520987B2 (en) * | 2015-08-28 | 2022-12-06 | Freedom Solutions Group, Llc | Automated document analysis comprising a user interface based on content types |
US11610123B2 (en) * | 2017-08-18 | 2023-03-21 | MyFitnessPal, Inc. | Context and domain sensitive spelling correction in a database |
US11704526B2 (en) | 2008-06-10 | 2023-07-18 | Varcode Ltd. | Barcoded indicators for quality management |
US12033013B2 (en) | 2022-09-14 | 2024-07-09 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9122673B2 (en) * | 2012-03-07 | 2015-09-01 | International Business Machines Corporation | Domain specific natural language normalization |
US9164977B2 (en) | 2013-06-24 | 2015-10-20 | International Business Machines Corporation | Error correction in tables using discovered functional dependencies |
US9830314B2 (en) | 2013-11-18 | 2017-11-28 | International Business Machines Corporation | Error correction in tables using a question and answer system |
KR102396983B1 (en) | 2015-01-02 | 2022-05-12 | 삼성전자주식회사 | Method for correcting grammar and apparatus thereof |
US10095740B2 (en) | 2015-08-25 | 2018-10-09 | International Business Machines Corporation | Selective fact generation from table data in a cognitive system |
US20190385711A1 (en) | 2018-06-19 | 2019-12-19 | Ellipsis Health, Inc. | Systems and methods for mental health assessment |
JP2021529382A (en) | 2018-06-19 | 2021-10-28 | エリプシス・ヘルス・インコーポレイテッド | Systems and methods for mental health assessment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030204569A1 (en) * | 2002-04-29 | 2003-10-30 | Michael R. Andrews | Method and apparatus for filtering e-mail infected with a previously unidentified computer virus |
US20050108001A1 (en) * | 2001-11-15 | 2005-05-19 | Aarskog Brit H. | Method and apparatus for textual exploration discovery |
US7386442B2 (en) * | 2002-07-03 | 2008-06-10 | Word Data Corp. | Code, system and method for representing a natural-language text in a form suitable for text manipulation |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08235182A (en) * | 1995-02-28 | 1996-09-13 | Canon Inc | Method and device for document processing |
AU2003267953A1 (en) * | 2002-03-26 | 2003-12-22 | University Of Southern California | Statistical machine translation using a large monolingual corpus |
CA2589942A1 (en) * | 2004-12-01 | 2006-08-17 | Whitesmoke, Inc. | System and method for automatic enrichment of documents |
JP2007122509A (en) * | 2005-10-28 | 2007-05-17 | Rozetta Corp | Device, method and program for determining naturalness of phrase sequence |
2009
- 2009-02-04 EP EP09802606A patent/EP2313835A4/en not_active Withdrawn
- 2009-02-04 WO PCT/IL2009/000130 patent/WO2010013228A1/en active Application Filing
- 2009-02-04 JP JP2011520650A patent/JP5584212B2/en not_active Expired - Fee Related
2014
- 2014-07-17 JP JP2014147212A patent/JP2014238855A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050108001A1 (en) * | 2001-11-15 | 2005-05-19 | Aarskog Brit H. | Method and apparatus for textual exploration discovery |
US20030204569A1 (en) * | 2002-04-29 | 2003-10-30 | Michael R. Andrews | Method and apparatus for filtering e-mail infected with a previously unidentified computer virus |
US7386442B2 (en) * | 2002-07-03 | 2008-06-10 | Word Data Corp. | Code, system and method for representing a natural-language text in a form suitable for text manipulation |
Non-Patent Citations (1)
Title |
---|
See also references of EP2313835A4 * |
Cited By (58)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9646277B2 (en) | 2006-05-07 | 2017-05-09 | Varcode Ltd. | System and method for improved quality management in a product logistic chain |
US10037507B2 (en) | 2006-05-07 | 2018-07-31 | Varcode Ltd. | System and method for improved quality management in a product logistic chain |
US10726375B2 (en) | 2006-05-07 | 2020-07-28 | Varcode Ltd. | System and method for improved quality management in a product logistic chain |
US10445678B2 (en) | 2006-05-07 | 2019-10-15 | Varcode Ltd. | System and method for improved quality management in a product logistic chain |
US10504060B2 (en) | 2007-05-06 | 2019-12-10 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US10776752B2 (en) | 2007-05-06 | 2020-09-15 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US10176451B2 (en) | 2007-05-06 | 2019-01-08 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US8914278B2 (en) | 2007-08-01 | 2014-12-16 | Ginger Software, Inc. | Automatic context sensitive language correction and enhancement using an internet corpus |
US9026432B2 (en) | 2007-08-01 | 2015-05-05 | Ginger Software, Inc. | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
US8645124B2 (en) | 2007-08-01 | 2014-02-04 | Ginger Software, Inc. | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
US10719749B2 (en) | 2007-11-14 | 2020-07-21 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US9135544B2 (en) | 2007-11-14 | 2015-09-15 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US9558439B2 (en) | 2007-11-14 | 2017-01-31 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US10262251B2 (en) | 2007-11-14 | 2019-04-16 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US9836678B2 (en) | 2007-11-14 | 2017-12-05 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US10417543B2 (en) | 2008-06-10 | 2019-09-17 | Varcode Ltd. | Barcoded indicators for quality management |
US11341387B2 (en) | 2008-06-10 | 2022-05-24 | Varcode Ltd. | Barcoded indicators for quality management |
US11704526B2 (en) | 2008-06-10 | 2023-07-18 | Varcode Ltd. | Barcoded indicators for quality management |
US11449724B2 (en) | 2008-06-10 | 2022-09-20 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US9996783B2 (en) | 2008-06-10 | 2018-06-12 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US9646237B2 (en) | 2008-06-10 | 2017-05-09 | Varcode Ltd. | Barcoded indicators for quality management |
US10049314B2 (en) | 2008-06-10 | 2018-08-14 | Varcode Ltd. | Barcoded indicators for quality management |
US10089566B2 (en) | 2008-06-10 | 2018-10-02 | Varcode Ltd. | Barcoded indicators for quality management |
US9710743B2 (en) | 2008-06-10 | 2017-07-18 | Varcode Ltd. | Barcoded indicators for quality management |
US11238323B2 (en) | 2008-06-10 | 2022-02-01 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US9626610B2 (en) | 2008-06-10 | 2017-04-18 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US10303992B2 (en) | 2008-06-10 | 2019-05-28 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US10885414B2 (en) | 2008-06-10 | 2021-01-05 | Varcode Ltd. | Barcoded indicators for quality management |
US10789520B2 (en) | 2008-06-10 | 2020-09-29 | Varcode Ltd. | Barcoded indicators for quality management |
US10776680B2 (en) | 2008-06-10 | 2020-09-15 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
US9384435B2 (en) | 2008-06-10 | 2016-07-05 | Varcode Ltd. | Barcoded indicators for quality management |
US9317794B2 (en) | 2008-06-10 | 2016-04-19 | Varcode Ltd. | Barcoded indicators for quality management |
US10572785B2 (en) | 2008-06-10 | 2020-02-25 | Varcode Ltd. | Barcoded indicators for quality management |
US9015036B2 (en) | 2010-02-01 | 2015-04-21 | Ginger Software, Inc. | Automatic context sensitive language correction using an internet corpus particularly for small keyboard devices |
WO2013032617A1 (en) * | 2011-09-01 | 2013-03-07 | Google Inc. | Server-based spell checking |
US10242302B2 (en) | 2012-10-22 | 2019-03-26 | Varcode Ltd. | Tamper-proof quality management barcode indicators |
US10552719B2 (en) | 2012-10-22 | 2020-02-04 | Varcode Ltd. | Tamper-proof quality management barcode indicators |
US9400952B2 (en) | 2012-10-22 | 2016-07-26 | Varcode Ltd. | Tamper-proof quality management barcode indicators |
US10839276B2 (en) | 2012-10-22 | 2020-11-17 | Varcode Ltd. | Tamper-proof quality management barcode indicators |
US9633296B2 (en) | 2012-10-22 | 2017-04-25 | Varcode Ltd. | Tamper-proof quality management barcode indicators |
US9965712B2 (en) | 2012-10-22 | 2018-05-08 | Varcode Ltd. | Tamper-proof quality management barcode indicators |
CN103942339A (en) * | 2014-05-08 | 2014-07-23 | 深圳市宜搜科技发展有限公司 | Synonym mining method and device |
CN103942339B (en) * | 2014-05-08 | 2017-06-09 | 深圳市宜搜科技发展有限公司 | Synonym mining method and device |
US11060924B2 (en) | 2015-05-18 | 2021-07-13 | Varcode Ltd. | Thermochromic ink indicia for activatable quality labels |
US11781922B2 (en) | 2015-05-18 | 2023-10-10 | Varcode Ltd. | Thermochromic ink indicia for activatable quality labels |
US11614370B2 (en) | 2015-07-07 | 2023-03-28 | Varcode Ltd. | Electronic quality indicator |
US11009406B2 (en) | 2015-07-07 | 2021-05-18 | Varcode Ltd. | Electronic quality indicator |
US11920985B2 (en) | 2015-07-07 | 2024-03-05 | Varcode Ltd. | Electronic quality indicator |
US10697837B2 (en) | 2015-07-07 | 2020-06-30 | Varcode Ltd. | Electronic quality indicator |
US11520987B2 (en) * | 2015-08-28 | 2022-12-06 | Freedom Solutions Group, Llc | Automated document analysis comprising a user interface based on content types |
US11983499B2 (en) | 2015-08-28 | 2024-05-14 | Freedom Solutions Group, Llc | Automated document analysis comprising a user interface based on content types |
CN110678860A (en) * | 2017-03-13 | 2020-01-10 | 里德爱思唯尔股份有限公司雷克萨斯尼克萨斯分公司 | System and method for word-by-word text mining |
CN110678860B (en) * | 2017-03-13 | 2023-06-09 | 里德爱思唯尔股份有限公司雷克萨斯尼克萨斯分公司 | System and method for word-by-word text mining |
US11610123B2 (en) * | 2017-08-18 | 2023-03-21 | MyFitnessPal, Inc. | Context and domain sensitive spelling correction in a database |
CN110348001A (en) * | 2018-04-04 | 2019-10-18 | 腾讯科技(深圳)有限公司 | A kind of term vector training method and server |
CN110348001B (en) * | 2018-04-04 | 2022-11-25 | 腾讯科技(深圳)有限公司 | Word vector training method and server |
US10909973B2 (en) | 2019-01-04 | 2021-02-02 | International Business Machines Corporation | Intelligent facilitation of communications |
US12033013B2 (en) | 2022-09-14 | 2024-07-09 | Varcode Ltd. | System and method for quality management utilizing barcode indicators |
Also Published As
Publication number | Publication date |
---|---|
JP5584212B2 (en) | 2014-09-03 |
JP2014238855A (en) | 2014-12-18 |
JP2011529594A (en) | 2011-12-08 |
EP2313835A1 (en) | 2011-04-27 |
EP2313835A4 (en) | 2012-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9026432B2 (en) | Automatic context sensitive language generation, correction and enhancement using an internet corpus | |
WO2010013228A1 (en) | Automatic context sensitive language generation, correction and enhancement using an internet corpus | |
US9015036B2 (en) | Automatic context sensitive language correction using an internet corpus particularly for small keyboard devices | |
CN110019658B (en) | Method and related device for generating search term | |
CN110262674B (en) | Chinese character input method and device based on pinyin input and electronic equipment | |
CN113743090B (en) | Keyword extraction method and device | |
Trost et al. | The language component of the FASTY text prediction system | |
Bossard et al. | Combining a multi-document update summarization system–CBSEAS–with a genetic algorithm | |
He et al. | A corpus-based approach to the genre and diachronic distributions of English absolute clauses | |
Preiss | Probabilistic word sense disambiguation: Analysis and techniques for combining knowledge sources | |
CN115016655A (en) | English-based input method intelligent association method and related components | |
KR20230066798A (en) | Search Result Providing Method Based on User Intention Understanding of Search Word and Storage Medium Recording Program for Executing the Same | |
Flor et al. | ETS Lexical Associations System for the COGALEX-4 Shared Task | |
Tseng | Summarization Assistant for News Brief Services on Cellular Phones | |
Berger et al. | An adaptive multilingual interface for tourism information | |
Jobbins | The contribution of semantics to automatic text processing | |
Fuentes Fort et al. | FEMsum: A flexible eclectic multitask summarizer architecture evaluated in multidocument tasks | |
Merhav et al. | Short and informal documents: a probabilistic model for description enrichment |
Legal Events
Code | Title | Description
---|---|---
WWE | Wipo information: entry into national phase | Ref document number: 200980138185.X; Country of ref document: CN
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09802606; Country of ref document: EP; Kind code of ref document: A1
ENP | Entry into the national phase | Ref document number: 2731899; Country of ref document: CA
ENP | Entry into the national phase | Ref document number: 2011520650; Country of ref document: JP; Kind code of ref document: A
NENP | Non-entry into the national phase | Ref country code: DE
REEP | Request for entry into the european phase | Ref document number: 2009802606; Country of ref document: EP
WWE | Wipo information: entry into national phase | Ref document number: 2009802606; Country of ref document: EP
WWE | Wipo information: entry into national phase | Ref document number: 1013/DELNP/2011; Country of ref document: IN
WWE | Wipo information: entry into national phase | Ref document number: 13056563; Country of ref document: US