CN1460948A - Method and apparatus for correcting or improving word usage

Info

Publication number: CN1460948A
Application number: CN03138209A
Authority: CN
Inventors: P. J. Whitelock; P. G. Edmonds
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2002-05-22
Filing date: 2003-05-22
Publication date: 2003-12-10
Anticipated expiration: 2023-05-22
Also published as: JP4278090B2; JP2004005641A; GB2388940A; CN1273915C; GB0211727D0

Abstract

A database is provided containing links between words with likelihood values associated with the links providing a measure of the likelihood of such links being correct or idiomatic. The likelihood values are based on the frequencies of occurrence of the links obtained by analysing a large body of text, for example produced by native speakers of the language. In order to check a section of text for possible erroneous or unnatural usage of one or more words of the section, the text is first analysed to establish links between its words. The likelihood of the links in the analysed text is determined from the database. A plausibility value is computed for each word in the analysed text, by combining the likelihood values for the links in which that word occurs. Words are used to index another database containing sets of words which are confusable with the index word. Each of the confusable words is selected in turn and substituted in the links of the index word. The likelihood values for these new links are determined and a plausibility value for the confusable word is computed. In an error-correcting embodiment, confusables are tried for those words whose plausibility falls below a threshold, and confusables which improve plausibility are reported to the user. In a context-sensitive thesaurus embodiment, confusables may be tried for all words, and those for which the plausibility value exceeds a second threshold may be reported.

Description

Method and apparatus for correcting or improving word usage

Technical field

The present invention relates to a method and apparatus for correcting or improving the choice and use of words in natural language text. The invention also relates to a computer program for programming a computer to perform such a method, to a storage medium containing such a program, and to a computer programmed with such a program.

Background art

Choosing the right words is central to writing or speaking a language. To help with this choice, native-speaker writers use a thesaurus, and language learners generally use a bilingual dictionary. However, native-speaker writers find that a thesaurus does not give enough detail about which synonym is appropriate in a given context, while learners, lacking ability or knowledge, may select the wrong translation from a bilingual dictionary or misspell one word as another.

In a tagged corpus of English learners' writing (Nicholls, 1999, "The Cambridge Learner Corpus - Error Coding and Analysis for Writing Dictionaries and other books for English Learners", Summer Workshop on Learner Corpora, Cambridge University Press), the incorrect choice of a verb or preposition is the most common type of error after spelling and punctuation errors. For example, a writer may use "associate to" rather than "associate with", "loose one's temper" rather than "lose one's temper", or "wins me at tennis" rather than "beats me at tennis".

The present invention makes it possible to detect errors of these and other types and to suggest corrections for them. It can handle real-word spelling errors (such as lose/loose) as well as errors of other kinds.

Looking up a word such as "make" in a thesaurus, a writer finds a large number of synonyms. These may be grouped into classes that share a core meaning. One such class might contain synonyms such as "create", "construct" and "establish", but the writer will not find phrases such as "create a diversion", "construct a model" or "establish a relationship".

The present invention makes it possible to offer these synonyms as suggestions in response to input such as "make a diversion", "make a model" or "make a relationship".

The present invention makes use of links, that is, dependencies or associations formed by a relation between two words or phrases that occur together (not necessarily adjacently) in a passage of written or spoken language, referred to hereinafter as text. A link may be associated with a strength or likelihood measured on the basis of its frequency of occurrence in a large body of text. A word in a text may be associated with a plausibility value based on the likelihood values of the links in which it takes part. An implausible word in a text is likely to be erroneous or unnatural in its context.

US Patents 4,916,614, 4,942,526 and 5,406,480 disclose the creation and use of co-occurrence information in parsing and translation.

The techniques disclosed in US Patents 4,674,065, 4,868,750, 5,258,909, 5,537,317, 5,659,771, 5,799,269 and 5,907,839 all use a list of commonly confused word sets, such as "hear" and "here", or "to" and "too". The occurrence of such a word in a text is treated as a potential error. These patents then describe different methods of correcting the error.

US Patent 4,674,065 discloses a technique using a system of rules that describe the different contexts in which the confusable words are used and so serve to distinguish them.

US Patents 4,868,750, 5,537,317 and 5,799,269 disclose systems that assign probabilities to part-of-speech sequences. The probability of a sequence containing a confusable word can be compared with the probability of the sequence containing the word with which it is confused. If the latter is greater than the former, a possible error is reported.

US Patent 5,258,909 discloses a system that assigns probabilities to word sequences and to the event of one word being misspelled as another, and combines these probabilities to decide whether one word has been misspelled as another.

US Patents 5,659,771 and 5,907,839 disclose a system that associates words with features representing their context and uses a machine learning algorithm to compute a function of the feature values for a particular member of a confusable set. When a member of the confusable set appears in a text, this function is used to classify it as correct or incorrect.

Chodorow and Leacock, "An Unsupervised Method for Detecting Grammatical Errors" (Proceedings of the First Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2002, pages 140-147), discloses a system that uses a contiguous word n-gram model to detect errors. The system can detect previously unseen category-changing and category-preserving errors but, because the model is contiguous, it can cover only a very limited span. Correction of the errors is not discussed.

US Patent 5,999,896 discloses a system that identifies possible errors in word usage by the failure of a parser, and addresses them by finding confusable words that subsequently allow parsing to succeed.

Summary of the invention

According to a first aspect of the invention, there is provided a method of correcting or improving the choice of a first word or phrase in a section of written or spoken text comprising a plurality of words of a first language, comprising the steps of:

(a) providing a first database of links between words or phrases of the first language, each link being associated with at least one likelihood value related to the frequency of occurrence of the link in text of the first language;

(b) analysing the section of text to establish a first link between the first word or phrase and a second word or phrase of the section, at least one first likelihood value corresponding to the first link, and a first plausibility value for the first word or phrase based on the at least one first likelihood value;

(c) providing a second database in which each entry associates at least one word or phrase with a set of words or phrases with which it is confusable;

(d) selecting from, or computing using, the second database a confusable word or phrase as a candidate replacement for the first word or phrase in the section of text;

(e) deriving a second plausibility value for the confusable word or phrase based on the likelihood value, in the first database, of a second link formed by the confusable word or phrase and another word or phrase of the section of text; and

(f) selectively providing an indication of the confusable word or phrase based on the computed plausibility values.
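
Steps (a) to (f) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two databases are toy dictionaries with invented values, the plausibility of a word is taken to be the mean likelihood of its links (the text does not fix the combining function), and the names LIKELIHOOD, CONFUSABLES, UNSEEN, plausibility, substitute and suggest are hypothetical.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str, str]  # (first lemma, relation, second lemma)

# (a) First database: one likelihood value per link (values invented).
LIKELIHOOD: Dict[Link, float] = {
    ("associate_V", "padv", "to_PREP"): -143.05,
    ("associate_V", "padv", "with_PREP"): 60.0,
}

# (c) Second database: confusable sets (contents invented).
CONFUSABLES: Dict[str, List[str]] = {"to_PREP": ["with_PREP", "too_ADV"]}

UNSEEN = -1.0  # assumed likelihood for links missing from the database


def plausibility(word: str, links: List[Link]) -> float:
    """Combine (here: average) the likelihoods of the links containing word."""
    values = [LIKELIHOOD.get(l, UNSEEN) for l in links if word in (l[0], l[2])]
    return sum(values) / len(values) if values else 0.0


def substitute(links: List[Link], old: str, new: str) -> List[Link]:
    """Replace old by new in every link, keeping the relations unchanged."""
    return [(new if a == old else a, r, new if b == old else b)
            for a, r, b in links]


def suggest(word: str, links: List[Link]) -> List[Tuple[str, float]]:
    """Steps (d) to (f): confusables that are more plausible than word."""
    base = plausibility(word, links)                       # step (b)
    scored = [(c, plausibility(c, substitute(links, word, c)))
              for c in CONFUSABLES.get(word, [])]          # steps (d), (e)
    better = [(c, p) for c, p in scored if p > base]       # step (f)
    return sorted(better, key=lambda cp: cp[1], reverse=True)


links = [("associate_V", "padv", "to_PREP")]
print(suggest("to_PREP", links))
```

With these toy values the preposition "with" is ranked first as a replacement for "to" after "associate".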

The likelihood value of each link in the first database may also be based on the frequency of occurrence of the other links in which each of its words or phrases takes part with the same dependency relation.

The likelihood value of each link in the first database may also be based on the frequency of occurrence of all other links having the same dependency relation.

The likelihood value of each link in the first database may comprise at least one of mutual information, t-score, z-score, Yule's Q coefficient and log-likelihood.

In step (e), the other word or phrase may be the second word or phrase, and the dependency relation of the second link may be the same as that of the first link.

Step (b) may comprise establishing a plurality of first links for a plurality of first words or phrases in the section of text, and steps (d), (e) and (f) may be performed for each first link.

Step (b) may comprise establishing links between non-adjacent words or phrases in the text.

Step (d) may comprise selecting each confusable word or phrase of the set in turn, and steps (e) and (f) may be performed for each confusable word or phrase.

Step (f) may comprise indicating the confusable words or phrases in descending order of their second plausibility values.

Steps (d), (e) and (f) may be performed if the first plausibility value is less than a first threshold.

Step (f) may comprise providing the indication when the or each second plausibility value exceeds a second threshold.

Step (f) may comprise providing the indication if the second plausibility value is greater than the first plausibility value.
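
The optional conditions on step (f) can be combined in a single reporting function. The sketch below is an assumption about how they might be used together (the text presents each as an independent option); the function name report and its parameters are hypothetical.

```python
from typing import List, Tuple


def report(original: float,
           candidates: List[Tuple[str, float]],
           first_threshold: float,
           second_threshold: float) -> List[str]:
    """Return the confusables to show, best first, or [] if nothing to report.

    original   -- first plausibility value (of the word as written)
    candidates -- (confusable, second plausibility value) pairs
    """
    if original >= first_threshold:
        return []                      # the word is plausible enough as written
    kept = [(w, p) for w, p in candidates
            if p > second_threshold and p > original]
    kept.sort(key=lambda wp: wp[1], reverse=True)   # descending plausibility
    return [w for w, _ in kept]


print(report(-5.0, [("with", 3.0), ("too", -1.0), ("by", 7.0)], 0.0, 0.0))
```

In a thesaurus-style use, the first threshold could be set high enough that every word is tried, leaving the second threshold as the only filter.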

Step (b) may comprise computing the first plausibility value by means of a function learned, using machine learning techniques, from a corpus tagged with learner errors and the associated likelihood values.

The method may comprise replacing the first word in the section of text with the confusable word.

The method may comprise producing the section of text by translation from a second language.

The method may comprise producing the section of text from a printed document by optical character recognition.

According to a second aspect of the invention, there is provided a computer program for programming a computer to perform a method according to the first aspect of the invention.

According to a third aspect of the invention, there is provided a storage medium containing a program according to the second aspect of the invention.

The medium may comprise a computer-readable medium.

According to a fourth aspect of the invention, there is provided a computer containing a program according to the second aspect of the invention.

According to a fifth aspect of the invention, there is provided an apparatus for correcting or improving the choice of a first word or phrase in a section of written or spoken text comprising a plurality of words of a first language, comprising:

a first database of links between words or phrases of the first language, each link being associated with at least one likelihood value related to the frequency of occurrence of the link in a large body of text of the first language;

an analyser for analysing the section of text to establish a first link between the first word or phrase and a second word or phrase of the section, at least one first likelihood value corresponding to the first link, and a first plausibility value for the first word or phrase based on the at least one first likelihood value;

a second database in which each entry associates at least one word or phrase with a set of words or phrases with which it is confusable;

means for selecting from, or computing using, the second database a confusable word or phrase as a candidate replacement for the first word or phrase in the section of text;

means for deriving a second plausibility value for the confusable word or phrase based on the likelihood value, in the first database, of a second link formed by the confusable word or phrase and another word or phrase of the section of text; and

means for selectively providing an indication (25, 26) of the confusable word or phrase based on the computed plausibility values.

By making use of the likelihood of links between words, it is possible to provide a technique that improves on known systems that use only the probabilities of part-of-speech sequences, since such systems cannot detect and correct the very common category-preserving errors.

The technique also improves on the use of contiguous n-grams, because a dependency grammar captures dependencies between words that are not adjacent but still directly influence each other's selection. In principle, n-grams can be extended to cover such dependencies, but in practice this leads to severe data-sparseness problems. Using links concentrates the data used to compute the statistical likelihoods on linguistically meaningful units. Link fragments of three elements are almost always sufficient to obtain useful statistics, whereas even n-grams of four elements miss many likely or unlikely word combinations.

An important consequence of restricting the statistics to linguistically meaningful entities is that the likelihood values are easier to interpret in the way that error detection requires. To see this, consider what the transition probability between adjacent words means in a contiguous bigram model. Within a constituent, for example between "big" and "dog" in "a big dog", the transition probability can be compared directly with the transition probabilities of similar adjective-noun sequences. However, the transition probability between "dog" and "a" in "give the dog a bone" is a rather uninteresting (and low) probability: it is the probability that a constituent ending in "dog" is followed by another constituent beginning with "a". The probability of interest, namely that a constituent headed by "give" has a second object headed by "bone", is not represented and cannot be compared with that of possible alternatives such as "give the dog a clone".
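
The contrast can be made concrete by listing what each model scores for "give the dog a bone". In the sketch below the dependency links are written by hand, and the relation names ("object", "object2", "determiner") are illustrative assumptions rather than labels taken from the description.

```python
sentence = ["give", "the", "dog", "a", "bone"]

# Contiguous bigram model: only adjacent pairs are scored.
bigrams = list(zip(sentence, sentence[1:]))

# Dependency links as (head, relation, dependent), written by hand.
links = [
    ("give", "object", "dog"),
    ("give", "object2", "bone"),
    ("dog", "determiner", "the"),
    ("bone", "determiner", "a"),
]
linked_pairs = {(head, dep) for head, _, dep in links}

print(("dog", "a") in bigrams)            # scored by bigrams, but uninteresting
print(("give", "bone") in bigrams)        # never adjacent, so never scored
print(("give", "bone") in linked_pairs)   # a direct link in the dependency model
```

Replacing "bone" by "clone" changes the link between "give" and its second object, which is exactly the unit whose likelihood the present technique looks up.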

That is, in a contiguous n-gram model, a low transition probability may indicate a linguistically interesting unlikelihood, but it may equally indicate a linguistically uninteresting one. If a system based on contiguous n-grams treated every low probability as a trigger for error processing, it would find a large number of possible "errors", many of which would not be real errors. Processing them would be costly and would risk classifying false errors as true ones.

This is why no known technique uses low transition probability as a trigger for error processing. Known techniques rely instead on the occurrence in the text of particular words that are known to be confusable, and then consider the relative likelihood of the original sequence and of the sequence obtained with the replacement word.

By contrast, in the present technique, unlikelihood is a much more reliable indication of error.

Any unlikely link can trigger error processing, and only unlikely links do so. Unlikely links are not always due to error, of course, but in the present technique such false triggers are far fewer.

Moreover, when the presence in the text of a member of some confusable set is the only trigger for error processing (as in many known techniques), adding a member to a confusable set increases both the number of times error processing is triggered and the computational cost of evaluating each member of the set.

When the likelihood of a link, and the plausibility of a word derived from it, is the trigger for error processing (as in the present invention), a very wide range of error types can be recognised. The notion of confusability is not limited to high-frequency confusions of spelling and pronunciation.

Known techniques that use learning algorithms also use the presence of a known confusable word as the trigger for error processing; they have no way of detecting that a word is a possible error other than by applying the learned classifier to it. Moreover, like known n-gram techniques, learning systems cannot benefit from concentrating the data on linguistically meaningful units.

The present technique improves on known techniques based on parse failure, because parse failure is a very coarse mechanism for detecting word errors (especially those involving replacement by a word of the same part of speech). By contrast, the present technique provides a very fine-grained quantitative measure of likelihood, even for very small fragments of a sentence, and treats parse failure, indicated by the absence of a link, as a special extreme case of unlikelihood. In addition, parse success (the coarse condition for an error having been corrected) can be replaced by a fine-grained quantitative measure of improvement.

Description of drawings

The invention will be further described, by way of example, with reference to the accompanying drawings, in which:

Fig. 1 is a block schematic diagram of an apparatus constituting an embodiment of the invention;

Fig. 2 is a diagram illustrating the dependency structure of the sentence "Love is the most important condition to marriage";

Fig. 3 shows part of the first database, which associates likelihood values with links; and

Fig. 4 (comprising Figs. 4a and 4b) illustrates an embodiment of the invention as an error detector and corrector.

Embodiment

A method and apparatus are provided for detecting errors and unnatural expressions in a user's writing and for suggesting ways of improving the language used. The techniques can also be used as a context-sensitive thesaurus, which suggests expressions similar in meaning, in context, to a given input expression. A statistical dependency model of word combinations is used as the basis for error detection and for checking replacements. This overcomes several problems of known arrangements, which use either contiguous n-gram models or unanalysed feature sets. The techniques also make it possible to offer a wider range of candidates for replacement. Error detection does not depend on the presence of particular words known to be error-prone, so previously unseen errors can be detected and corrected.

The method uses two types of relation between words. The first type holds between two words at different positions in a single sentence. These are dependency relations, such as 'subject of', 'object of' and 'modifier'. An example is shown in Fig. 2, which illustrates the result of analysing the sentence "Love is the most important condition for marriage". Words are represented by their base form and part of speech, that is, as lemmas, so that "is" appears as "be_V". The subject of this verb is "love_N" and its object is "condition_N". The latter is determined by "the_DET" and modified by "important_ADJ". "most_ADV" is an adverb modifying "important_ADJ". "for_PREP" is a preposition modifying "condition_N", and "marriage_N" is the object of the preposition "for_PREP". A triple consisting of two lemmas and the dependency relation connecting them is referred to as a link.
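
The analysis just described can be written down as a list of triples. The following is a minimal sketch: the ordering (dependent, relation, head) and the relation names are assumptions chosen for readability, and Link, FIG2_LINKS and links_of are hypothetical names.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    dependent: str
    relation: str
    head: str


# Links for "Love is the most important condition for marriage".
FIG2_LINKS: List[Link] = [
    Link("love_N", "subject_of", "be_V"),
    Link("condition_N", "object_of", "be_V"),
    Link("the_DET", "determiner_of", "condition_N"),
    Link("important_ADJ", "modifier_of", "condition_N"),
    Link("most_ADV", "modifier_of", "important_ADJ"),
    Link("for_PREP", "modifier_of", "condition_N"),
    Link("marriage_N", "object_of", "for_PREP"),
]


def links_of(lemma: str, links: List[Link]) -> List[Link]:
    """All links in which the lemma takes part, as dependent or as head."""
    return [l for l in links if lemma in (l.dependent, l.head)]


for link in links_of("condition_N", FIG2_LINKS):
    print(link)
```

The links returned for a word are the ones whose likelihood values are later combined into that word's plausibility value.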

The second type comprises relations that define "possible replacements", that is, relations between alternative choices of word at a given position in a sentence. Some examples of replacement relations are:

thesaural relations, such as synonymy, antonymy, hyponymy and hypernymy;

misspellings that result in another word of the language, such as "loose" for "lose", a special case being homophones, that is, words that sound the same but are spelled differently, such as "pane" and "pain";

derivational relations, that is, words formed in different ways from the same root (such as "interested" and "interesting", or "safe" and "safety");

interlingual confusables, that is, alternative translations of a word of another language (for example, French "marquer" can be translated as either "mark" or "brand");

false friends, where one word is not a possible translation of its cognate (for example, "possible" and "actual", being respectively the correct and incorrect translations of French "actual"); and

insertion or deletion errors, such as "he rang (at) the doorbell" or "we paid (for) our meals", which can be treated as replacement of or by an empty word.

When the use of a word w in a sentence is identified as inappropriate, that is, as erroneous or otherwise unidiomatic, each member of a set of words known as the confusable set C(w) of w can be considered as a possible replacement. The confusable set of w is drawn from the words related to w; the actual conditions for membership may vary with the user's native language, level of ability in the language being written, and other factors.
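
One possible way to store the replacement relations and derive C(w) for a given user is sketched below. The data structure, the language codes and the names Confusable, RELATED and confusable_set are assumptions made for illustration; only the word pairs come from the examples above.

```python
from typing import Dict, List, NamedTuple, Optional


class Confusable(NamedTuple):
    word: str
    kind: str                               # type of replacement relation
    native_language: Optional[str] = None   # None: relevant to every user


RELATED: Dict[str, List[Confusable]] = {
    "make": [Confusable("create", "synonym"),
             Confusable("construct", "synonym"),
             Confusable("establish", "synonym")],
    "lose": [Confusable("loose", "spelling")],
    "pane": [Confusable("pain", "homophone")],
    # Relevant mainly to writers whose native language is French ("marquer").
    "mark": [Confusable("brand", "interlingual", "fr")],
}


def confusable_set(word: str, native_language: str) -> List[str]:
    """C(w): related words whose membership conditions fit this user."""
    return [c.word for c in RELATED.get(word, [])
            if c.native_language in (None, native_language)]


print(confusable_set("mark", "fr"))
print(confusable_set("mark", "ja"))
```

Other membership conditions mentioned in the text, such as the user's level of ability, could be added as further fields in the same way.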

Dependency relations are a widely used way of representing sentence structure. Most of the variations found among them are largely unimportant for the present technique. A dependency relation links two words, called the dependent and the head. In a typical formulation, no dependent can be linked to more than a single head, but a head can have any number of dependents; other constraints, such as a ban on cycles, ensure that the relations in a single sentence form a tree. Under this convention, a link between two words in a sentence is represented as a triple:

<first lemma, relation, second lemma>

Here a lemma is a term such as 'chase_V', which stands for all forms of the verb "to chase", namely chase, chased and chasing.
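
The two structural constraints named above (a single head per dependent, no cycles) can be checked mechanically. The sketch below assumes triples ordered as (dependent, relation, head) and assumes each word token has a unique label; is_well_formed is a hypothetical name.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str, str]  # (dependent, relation, head), an assumed ordering


def is_well_formed(links: List[Link]) -> bool:
    """True if every dependent has one head and the links contain no cycle."""
    head_of: Dict[str, str] = {}
    for dependent, _, head in links:
        if dependent in head_of:        # a second head for the same dependent
            return False
        head_of[dependent] = head
    for start in head_of:               # follow head pointers from each word
        seen = {start}
        node = head_of[start]
        while node in head_of:
            if node in seen:            # came back to a visited word: cycle
                return False
            seen.add(node)
            node = head_of[node]
    return True


print(is_well_formed([("title_N", "mod_of", "fight_N"),
                      ("world_N", "mod_of", "title_N")]))
```

The check does not require the links to be connected, so a set of links forming several separate trees also passes.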

A link can be associated with a measure of its strength or likelihood. The frequency of a link, that is, the number of times it is seen in a parsed corpus, is only a crude estimate of its strength. A more accurate measure is the degree to which the frequency of the link departs from the frequency expected from the frequencies of its components. Several such measures are disclosed in the literature (for example, K. Kageura, 1999, "Bigram Statistics Revisited: a Comparative Examination of some Statistical Measures in Morphological Analysis of Japanese Kanji Sequences", Journal of Quantitative Linguistics, 6(2), pages 149-166, and Evert et al., "Methods for the Qualitative Evaluation of Lexical Association Measures", Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, 2001, pages 188-195, which give comparative evaluations of several measures on particular tasks) and have been used in word segmentation, parsing, translation, information retrieval and lexicography. In those applications, generally only links that are significantly more likely than expected are of interest. The present technique, however, is also concerned with links that are significantly less likely than expected. The detection of such a link in a text often indicates ungrammatical or unidiomatic usage.

A word that occurs in one or more unlikely links can then be replaced by each member of its confusable set in turn, and the plausibility value resulting from each replacement can be obtained. If one or more members of the confusable set yield a sufficiently improved plausibility value, those members can be suggested as replacements.

As a preparatory step, a large body of native-speaker text is analysed with a dependency parser to build the database of likelihood values for word combinations. Any suitable parser can be used; suitable examples are disclosed in M. Collins, "Three Generative Lexicalised Models for Statistical Parsing", Proceedings of the 35th Annual Meeting of the ACL / 8th Conference of the EACL, Madrid, 1997, and Sleator and Temperley, "Parsing English with a Link Grammar", CMU-CS-91-196, Carnegie Mellon University, Department of Computer Science, 1991. The analyser need not even be a parser in the usual sense, but may use finite-state or similar techniques augmented with a mechanism for recording dependencies.

The frequency of each type of link is counted, a likelihood value is computed for each link according to one or more statistical measures (such as mutual information, t-score and log-likelihood), and the results are stored in a table. Fig. 3 shows some entries in such a database.

In Fig. 3, the first column shows the link itself. The column headed 'freq' gives the number of times the link occurs in the parsed corpus (here the British National Corpus, of about 80 million words). The remaining columns are, respectively, the mutual information, t-score, Yule's Q coefficient and log-likelihood. Each of these is a different measure computed from the following four frequencies:

<first lemma, relation, second lemma>

<first lemma, relation, *>

<*, relation, second lemma>

<*, relation, *>

Here '*' stands for any lemma. This parameterisation is disclosed in D. Lin, "Automatic Retrieval and Clustering of Similar Words", COLING-ACL 98, Montreal, Canada, August 1998. The different measures have different ranges and are sensitive in different ways to the exact values of the four parameters. In each case, however, the value is related to the likelihood of the link. A positive value indicates that the combination is more likely than chance, and a negative value that it is less likely.
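
As an illustration of a measure built from these four frequencies, the sketch below computes mutual information in its usual pointwise form, the logarithm of the observed link frequency over the frequency expected if the two lemmas were independent given the relation. The description does not spell out its exact formula, so this form, the base-2 logarithm and the function name mutual_information are assumptions. The sample frequencies are those used in the t-score example of this description.

```python
import math


def mutual_information(f_link: int, f_first: int,
                       f_second: int, f_relation: int) -> float:
    """Pointwise mutual information of <w1, rel, w2>, in bits.

    f_link     -- frequency of <w1, rel, w2>
    f_first    -- frequency of <w1, rel, *>
    f_second   -- frequency of <*,  rel, w2>
    f_relation -- frequency of <*,  rel, *>
    """
    return math.log2((f_link * f_relation) / (f_first * f_second))


# <associate_V, padv, to_PREP>: seen 25 times, far less often than expected.
mi = mutual_information(25, 7680, 1020531, 10587833)
print(mi < 0)   # a negative value marks the link as less likely than chance
```

A link seen exactly as often as expected scores zero, which matches the reading of positive and negative values given above.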

For example, the t-score of <associate_V, padv, to_PREP> is computed as:

t(associate_V, padv, to_PREP) = \frac{F / f(padv) - f(associate_V \cdot padv) \, f(padv \cdot to_PREP) / f(padv)^{2}}{\sqrt{F / f(padv)^{2}}}

= \frac{25 / 10587833 - (7680 \times 1020531) / 10587833^{2}}{\sqrt{25 / 10587833^{2}}} = -143.050

where F = f(associate_V \cdot padv \cdot to_PREP) is the frequency of the link itself, f(associate_V \cdot padv) is the frequency of <associate_V, padv, *>, f(padv \cdot to_PREP) is the frequency of <*, padv, to_PREP>, and f(padv) is the frequency of <*, padv, *>.
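
The arithmetic can be checked with a short script. The denominator is taken here as the square root of F divided by f(padv), which is the form of the t-score that reproduces the stated value of -143.050 from the four frequencies; the function name t_score and its parameter names are hypothetical.

```python
import math


def t_score(f_link: int, f_first: int, f_second: int, f_relation: int) -> float:
    """t-score of <w1, rel, w2> from the four link frequencies."""
    observed = f_link / f_relation                      # F / f(rel)
    expected = (f_first * f_second) / f_relation ** 2   # independence estimate
    return (observed - expected) / math.sqrt(f_link / f_relation ** 2)


t = t_score(f_link=25, f_first=7680, f_second=1020531, f_relation=10587833)
print(round(t, 2))   # -143.05
```

The strongly negative score reflects that "associate ... to" was seen 25 times where about 740 occurrences would be expected by chance.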

To obtain high-quality estimates of the likelihood of word combinations, the parsing of the native-speaker corpus needs to be as accurate as possible and to have wide coverage. However, accurate parsing itself requires high-quality estimates of the likelihood of word combinations, which creates a conflict. The conflict can be resolved by an iterative or bootstrapping approach, which relies on the following properties of the parsing algorithm:

each individual link in a sentence is associated with a preference value. The preference value is a measure of confidence that the link holds between the two words in the sentence. It is a function both of sentence-internal factors, such as part-of-speech probabilities and the separation of the words in the sentence, and of language-wide factors, such as the strength of the link between the words;

the parser returns a set of links that together satisfy the axioms of dependency structure (that is, links do not cross, each word is a dependent of at most one head, and so on); however, the set is not required to form a single connected tree;

the relative contributions of the sentence-internal and language-wide factors to the preference value can be varied by suitable parameter settings; and

a threshold can be set so that only links whose preference value exceeds it are returned.

The iteration of parsing will be illustrated by the analysis of a very simple phrase, "world title fight".

Grammatically, "title" must modify "fight", but it is not clear whether "world" modifies "title" or "fight". In English grammar, each noun in a sequence of nouns, except the last, may modify any noun to its right. In this example, knowledge of the strength of particular word combinations leads to the conclusion that "world" modifies "title". In other examples, such as "plastic baby pants", the first noun modifies not the noun immediately following it but the last noun.

A complete parse would give the links:

1. <title_N, mod_of, fight_N>

2. <world_N, mod_of, title_N>

In the first iteration of parsing the native-speaker corpus, no likelihood values for links between particular words are available, so the language-wide factors make no contribution to the preference value. The preference threshold is set high, so that, for example, words whose part of speech is ambiguous or which are far apart are not linked, and the links that are returned are correct with high confidence. In this example, only link 1 is returned: the penultimate noun in the sequence must modify the last noun, regardless of language-wide factors. In the absence of language-wide information, however, neither link 2 nor the incorrect <world_N, mod_of, fight_N> has a high enough preference value to be returned. Nevertheless, links from other instances in the corpus, such as "world title" (and "world fight") not followed by another noun, are returned.

Likelihood values are then computed from these high-confidence links. Subsequent iterations can use these language-wide factors in determining preference values, so the preference threshold can be lowered. This increases the number of links returned (the coverage of the parse) and allows more accurate likelihood statistics to be computed. In this example, the relative frequencies and/or likelihoods of <world, mod_of, title> and <world, mod_of, fight> now cause the former to be preferred over the latter. Further iterations continue to increase the contribution of the language-wide factors to the preference value and to lower the preference threshold. In this way, the coverage and reliability of the likelihood data are gradually improved.
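
The second-pass decision for "world title fight" can be sketched as follows. The counts are invented, raw frequency stands in for a likelihood value (the text allows relative frequency and/or likelihood), and the names counts and attach are hypothetical.

```python
from collections import Counter
from typing import List

# Links harvested in a high-threshold first pass from unambiguous phrases
# such as "world title" or "world fight" (counts invented for illustration).
counts = Counter({
    ("world_N", "mod_of", "title_N"): 120,
    ("world_N", "mod_of", "fight_N"): 3,
})


def attach(modifier: str, candidate_heads: List[str]) -> str:
    """Pick the head whose link with the modifier was seen most often."""
    return max(candidate_heads,
               key=lambda head: counts[(modifier, "mod_of", head)])


print(attach("world_N", ["title_N", "fight_N"]))   # title_N
```

Each later pass would recount the links with the newly resolved attachments included, which is what lets the threshold be lowered step by step.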

After each iteration of parsing the native-speaker corpus, the likelihood value of each type of association is determined and entered into the database.

Once a sufficiently accurate database has been prepared, or otherwise obtained, it can be used in the present invention. The text to be checked undergoes one iteration of the same parsing process. The contribution of the language range coefficients to this parse can be reduced, since these coefficients, i.e. the likelihood values of the associations, are taken into account in the next stage.

The likelihood value of each association in the text is then determined by looking it up in the native-speaker database. Associations not seen in the original native-speaker corpus are given likelihood values by assuming that they have a low frequency. In a typical embodiment, associations found with a frequency of 1 in the native-speaker corpus are discarded, which greatly reduces the amount of data. An association that cannot be found in the database is then assumed to have a frequency in the range 0 to 2, the optimum value being determined experimentally, and its likelihood value is calculated accordingly.
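As a minimal sketch of this lookup, the Python below scores an association with pointwise mutual information, one of several measures that could serve as the likelihood value. The function name `likelihood`, the count tables and the default unseen frequency of 0.5 are illustrative assumptions; the text says only that the assumed frequency lies between 0 and 2 and is tuned experimentally.

```python
import math

def likelihood(assoc, assoc_counts, dep_counts, head_counts, total, unseen_freq=0.5):
    """Pointwise mutual information (base 2) of a (dependent, relation, head) triple.

    assoc_counts maps triples to corpus frequencies (frequency-1 triples are
    assumed to have been dropped already). unseen_freq is a hypothetical
    stand-in for the experimentally tuned frequency of an unseen association.
    """
    dep, rel, head = assoc
    freq = assoc_counts.get(assoc, unseen_freq)
    # Frequency expected if the dependent and the head combined by chance.
    expected = dep_counts[(dep, rel)] * head_counts[(rel, head)] / total
    return math.log2(freq / expected)

# Toy counts, invented for illustration only.
ASSOC_COUNTS = {("world", "mod_of", "title"): 8}
DEP_COUNTS = {("world", "mod_of"): 20}
HEAD_COUNTS = {("mod_of", "title"): 10, ("mod_of", "fight"): 50}
TOTAL = 1000

seen = likelihood(("world", "mod_of", "title"), ASSOC_COUNTS, DEP_COUNTS, HEAD_COUNTS, TOTAL)
unseen = likelihood(("world", "mod_of", "fight"), ASSOC_COUNTS, DEP_COUNTS, HEAD_COUNTS, TOTAL)
```

With these toy counts the attested association scores positive and the unseen one scores negative, the kind of low value that marks a possible error.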

Associations with low (for example negative) likelihood values are indicators of possible errors. The likelihood values of the associations in which a word takes part are combined into a plausibility value for that word. Implausible words are replaced by members of their confusable sets to see whether the resulting plausibility improves.

Fig. 4 shows an embodiment of the invention as an error detector and corrector. As in the parsing example, input text is provided in step 10 and parsed in step 11. In step 12, the likelihoods of the associations in the input text are determined. In step 13, the first word in the text is selected, and in step 14 the plausibility of that word is calculated. Step 15 checks whether all words of the input text have been processed; if not, the next word is taken in step 16 and step 14 is repeated.

When all words in the text have a calculated plausibility value, the words are sorted in ascending order of plausibility in step 17. In step 18, the word of lowest plausibility is selected; if step 19 finds that its plausibility is not below a first threshold, the method stops in step 20. Otherwise, the confusable set of the word is obtained in step 21 and the first confusable word is selected in step 22. In step 23 the confusable word is substituted for the word, and in step 24 the plausibility of the confusable word in context is calculated. If step 25 detects an improvement in plausibility greater than a second threshold, the confusable word is reported to the user in step 26.

Step 27 checks whether all confusable words have been tried; if not, the next confusable word is selected in step 28 and control returns to step 23. Otherwise, step 29 determines whether all words in the text have been processed; if not, step 30 obtains the next word and returns control to step 19. Otherwise, the method ends in step 31.
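The flow of Fig. 4 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: `plausibility` and `confusables` are hypothetical stand-ins for the association database and the confusable sets, and `toy_plausibility` with its adjacent-pair scores is invented so that the example runs.

```python
def check_text(words, plausibility, confusables, first_threshold, second_threshold):
    """Return (word index, suggestion, improvement) triples, following steps 13-31."""
    # Steps 13-17: score every word, then order by ascending plausibility.
    scored = sorted((plausibility(words, i), i) for i in range(len(words)))
    reports = []
    for score, i in scored:                               # steps 18, 29, 30
        if score >= first_threshold:                      # step 19
            break                                         # step 20: the rest score higher
        for cand in confusables.get(words[i], []):        # steps 21, 22, 27, 28
            trial = words[:i] + [cand] + words[i + 1:]    # step 23
            gain = plausibility(trial, i) - score         # step 24
            if gain > second_threshold:                   # step 25
                reports.append((i, cand, gain))           # step 26
    return reports

# Toy scores for adjacent word pairs (assumed values, not corpus data).
PAIR_SCORES = {("strong", "tea"): 3.0, ("powerful", "tea"): -2.0}

def toy_plausibility(words, i):
    """Lowest score among the adjacent pairs that contain word i."""
    pairs = [tuple(words[j:j + 2]) for j in (i - 1, i) if 0 <= j < len(words) - 1]
    return min(PAIR_SCORES.get(p, -1.0) for p in pairs)

reports = check_text(["powerful", "tea"], toy_plausibility,
                     {"powerful": ["strong"]}, 0.0, 1.0)
```

Because the words are visited in ascending order of plausibility, the loop can stop at the first word that reaches the first threshold.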

We are each speech w in this embodiment _i(1≤i≤n, the length of sentence) determines the incidence set D (w at its place _i).We are to each D (w then _i) use a function probable value of incidence set is mapped as single value, this value is known as " truthlikeness " λ (w of this word _i).Press these words of plausibility ordering.If minimum truthlikeness word w _{λ min}Truthlikeness be lower than a threshold value, we just attempt to seek a correction.We use each word c successively _i(w _{λ min}) (1≤j≤n is at C (w _{λ min}) in easily the be confused number of word) replace w _{λ min}, and calculate λ (c _i(w _{λ min})).Those words that easily are confused that replace the back to improve the truthlikeness of this word can offer the user.Can show the word that easily is confused by the improved descending that they produce.

The members of a confusable set may be associated with a confusion value expressing the likelihood of confusion. For example, from an error-tagged learner corpus we can obtain the number of times each word is mistakenly used in place of another; real-word errors in pronunciation and/or spelling may be associated with a value based on edit distance; and confusable words based on semantic relatedness may be associated with a value based on path distance in a hierarchical network.
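As an illustration of the edit-distance option, the sketch below computes the standard Levenshtein distance and maps it to a confusion value. The mapping 1 / (1 + distance) and the name `confusion_value` are assumptions made for this example; the text does not specify how distance is turned into a value.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

def confusion_value(word, candidate):
    """Hypothetical confusion value: closer spellings score nearer to 1."""
    return 1.0 / (1 + edit_distance(word, candidate))
```

For instance, "safety" and "safely" differ by one substitution, so under this assumed mapping their confusion value is 0.5.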

If such information is used, suggestions can be made in a more helpful order by combining the confusability and the improvement in plausibility into a single score, the substitution score $\sigma(w_i \to c_j(w_i))$.
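One possible way to form such a score is sketched below. The product of confusability and plausibility gain is an assumption chosen for simplicity, as are the function names; the text does not say how the two quantities are combined.

```python
def substitution_score(confusability, old_plausibility, new_plausibility):
    """Assumed combination: plausibility gain scaled by confusability."""
    return confusability * (new_plausibility - old_plausibility)

def rank_suggestions(old_plausibility, candidates):
    """Order candidates (word, confusability, new_plausibility) by descending score.

    Candidates that do not improve plausibility (score <= 0) are dropped.
    """
    scored = [(substitution_score(c, old_plausibility, p), w) for w, c, p in candidates]
    return [w for s, w in sorted(scored, reverse=True) if s > 0]

# Invented values for illustration.
ranked = rank_suggestions(-2.0, [("potent", 0.2, 4.0),
                                 ("strong", 0.9, 3.0),
                                 ("weak", 0.5, -3.0)])
```

Here "strong" outranks "potent" even though "potent" gives the larger raw gain, because "strong" is assumed to be the far more likely confusion.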

In interaction with the user, the suggestions offered first are to replace $w_{\lambda\min}$ with a member of its confusable set so as to improve its plausibility. If the user accepts one of these words, the effect of the substitution is propagated to the other words associated with it, and the calculation of a new $w_{\lambda\min}$ is repeated. The propagation may include reattaching the substituted word to a word different from the one to which the original word was attached.

An association that is unlikely on its own may be likely as part of a larger structure, and vice versa. For example, "by accident" is a very strong collocation, whereas "by the accident" is unlikely and should be considered a possible error. However, there are larger structures containing the latter that may be correct, such as "horrified by the accident".

Conversely, "a knowledge" in isolation is a typical learner error, whereas "a knowledge of" is a reasonable expression. Yet "learn a knowledge of" is again an error.

These cases can be handled by calculating likelihood values for connected subgraphs comprising three or more elements linked by two or more associations. Experimental observation indicates that in most cases no more than three elements are needed. In the cases above, the likelihood of the four-element phrase can be traced back to the likelihood of smaller units. For example, "horrified by the accident" is likely because "horrified by" is such a strong collocation, and "learn a knowledge of" is unlikely because "knowledge" is an unlikely object of "learn", regardless of the other elements.

The likelihood value of a three-element subgraph can be calculated in different ways. One way is to treat two of the elements and the association between them as a single phrasal unit, and then to calculate the likelihood measure between this phrasal unit and the third element, using exactly the same calculation as in the two-element case.

The likelihood values of the two-element and three-element associations can also be combined into a single plausibility value according to different schemes. We can give the three-element phrases greater weight than the two-element phrases (a smoothing scheme), or use only the two-element phrases if the three-element phrases containing them fail to satisfy certain constraints on their frequency and/or likelihood (a back-off scheme). The parameters of these schemes can be determined empirically or by a learning process, in which the features learned are not the presence or absence of particular words in the context but the strength and frequency of combinations.
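The two schemes can be contrasted in a short sketch. The weight of 0.7 and the minimum frequency of 5 are placeholder parameters chosen for illustration; as the text notes, such parameters would be set empirically or learned.

```python
def smoothed(pair_value, triple_value, triple_weight=0.7):
    """Smoothing: weighted mix in which the three-element value counts for more."""
    return triple_weight * triple_value + (1 - triple_weight) * pair_value

def backed_off(pair_value, triple_value, triple_freq, min_freq=5):
    """Back-off: use the three-element value only if it is attested often enough."""
    return triple_value if triple_freq >= min_freq else pair_value
```

With an assumed low value of -2.0 for the pair in "by the accident" and a high value of 4.0 for the larger unit in "horrified by the accident", smoothing gives a positive result, and back-off returns 4.0 whenever the larger unit is frequent enough.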

To increase the range of errors that can be detected and corrected, some extensions can be made to the basic method.

The calculation of a word's plausibility can take into account the case in which the word is not attached to any other word. Except for a finite verb (or certain other parts of speech in lists and headings), which can be the root of the association tree, an unattached word always indicates an error (or a parsing failure). It is therefore appropriate to assign a very low likelihood value to a null attachment, which will trigger error handling.
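A minimal sketch of this extension follows. The penalty of -10.0, the neutral value of 0.0 for a permitted root, and the use of the minimum as the combining function are all assumptions made for illustration.

```python
NULL_ATTACHMENT = -10.0  # assumed "very low" likelihood for an unattached word

def word_plausibility(link_values, can_be_root=False):
    """Combine the likelihood values of a word's associations.

    link_values: likelihood values of the associations the word takes part in.
    can_be_root: True for a finite verb (or another permitted root).
    """
    if not link_values:
        # An unattached word is acceptable only if it may be the root.
        return 0.0 if can_be_root else NULL_ATTACHMENT
    return min(link_values)  # assumed combining function: the weakest association
```

Under these assumptions an unattached noun falls well below any reasonable first threshold and so enters error handling, while an unattached finite verb does not.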

To determine corrections in such cases, the method then needs to be extended as described below.

If, as described above, the parse of the text to be corrected is not strongly influenced by the language range coefficients, words will generally be attached provided that their parts of speech are suitable. Conversely, if a word is not attached, the error generally cannot be corrected by substituting a word of the same part of speech.

The error may be not a substitution but an omission. For example, a noun cannot be attached as an object to an intransitive verb. In many such cases, the error can be corrected by inserting a preposition. Insertion may also be appropriate when a noun is attached to a verb only by a weak association. In either case, the insertion must be followed by the establishment of new associations, whose likelihoods determine whether the error has been corrected.

Lack of attachment may also be caused by a substitution error that changes the category. If the confusable set of a word of one category includes a word of another category, the substitution may have to be followed by a reanalysis of part of the input. For example, if a learner writes "get out of the building safety", the sequence "building safety" may be analysed as an (unlikely) noun phrase. If the confusable set of the noun "safety" includes the adverb "safely", reanalysis is necessary to determine that the latter is a modifier of the verb "get out" and that "building", not "safety", is its object.

The method can also be used as a context-sensitive thesaurus, for example by setting no threshold on the plausibility value of each word. In this case, all words are candidates for substitution regardless of their plausibility. Similarly, a substitute need not improve plausibility. Possible substitutes may be proposed if, for example, their plausibility values exceed a threshold.
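The thesaurus mode differs from error correction only in which words are tried and in the acceptance test, as the sketch below suggests. The names `thesaurus_suggestions` and `pair_plausibility`, the pair scores and the threshold are hypothetical, invented so that the example runs.

```python
def thesaurus_suggestions(words, plausibility, confusables, threshold):
    """Map each word index to the substitutes whose plausibility exceeds threshold.

    Every word is tried, whatever its own plausibility, and a substitute
    need not improve on the original word.
    """
    suggestions = {}
    for i, word in enumerate(words):
        for cand in confusables.get(word, []):
            trial = words[:i] + [cand] + words[i + 1:]
            if plausibility(trial, i) > threshold:
                suggestions.setdefault(i, []).append(cand)
    return suggestions

# Toy scores for two-word phrases (assumed values, not corpus data).
PAIRS = {("big", "mistake"): 2.0, ("serious", "mistake"): 3.5, ("large", "mistake"): -1.0}

def pair_plausibility(words, i):
    """Plausibility of either word in a two-word phrase: the score of the pair."""
    return PAIRS.get(tuple(words), -1.0)

found = thesaurus_suggestions(["big", "mistake"], pair_plausibility,
                              {"big": ["serious", "large"]}, 1.0)
```

With these toy scores, "serious" is offered as an alternative to "big" even though "big mistake" is itself acceptable, while "large" is rejected.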

The method may be performed by any suitable apparatus but, in practice, is most likely to be performed by a computer programmed to carry it out. Fig. 1 shows a suitable computer system 100, based on a central processing unit (CPU) 1 acting as a controller. The CPU 1 is provided with a program memory 2, for example in the form of a disk drive containing a storage medium such as a magnetic or optical disc, which holds the program controlling the CPU 1. A first database 3, stored for example on a magnetic disk, contains the associations and their likelihood values. A second database 4, stored for example on the same or another disk, contains the confusable sets. A read/write or random access memory (RAM) 5 is provided in the usual way to hold temporary values of parameters.

The CPU 1 is provided with an input interface 6, which allows text to be input for the detection of errors, unnatural expressions and the like. For example, the text may be entered manually via a keyboard or may be in machine-readable form (for example on a magnetic or optical disc). The CPU 1 is also provided with an output interface 7, which allows the user to monitor the output of the method. Likewise, so that the user can interact with the method, the interfaces 6 and 7 allow the user to enter data, commands and the like and to monitor the operation of the method. For example, when a choice of confusable words that improve plausibility is offered, these may be shown on a display forming part or all of the output interface 7, and the user may select a confusable word by suitably operating a keyboard and/or mouse forming part or all of the input interface 6.

A database is provided containing associations between words together with likelihood values, which provide a measure of the likelihood that each association is correct or idiomatic. The likelihood values are based on the frequencies of occurrence of the associations, obtained by analysing a large body of text, for example text produced by native speakers. To check a section of text for possibly erroneous or unnatural usage of one or more of its words, the text is first analysed to establish the associations between its words. The likelihoods of the associations in the analysed text are determined from the database. A plausibility value is computed for each word in the analysed text by combining the likelihood values of the associations in which the word occurs. Words are used to index another database containing sets of words that are confusable with the index word. Each confusable word is selected in turn and substituted for the index word in its associations. The likelihood values of these new associations are determined and a plausibility value for the confusable word is computed. In an error-correcting embodiment, confusable words are tried for those words whose plausibility falls below a threshold, and confusable words that improve plausibility are reported to the user. In a context-sensitive thesaurus embodiment, confusable words may be tried for all words, and those whose plausibility values exceed a second threshold may be reported.

Although only an embodiment in which the invention is applied to English has been described above, the invention is not limited to English and can be applied to other languages.

The section of English text may be generated by translation from a language other than English (for example Japanese).

The section of text may be generated by reading the text of a printed document with an optical character recognition system.

According to the present invention, there are provided a method and an apparatus for detecting errors and unnatural expressions in a user's writing and for proposing ways of improving such usage.

According to the present invention, it is possible to detect errors and unnatural expressions in a user's writing and to propose corrections for them. Real-word spelling errors can be handled, as can various other types of error.

Claims

1. A method of altering or improving a selected first word or phrase in a section of written or spoken text comprising a set of words of a first language, characterised by comprising the steps of:

(a) providing a first database (3) of associations between words or phrases of the first language, each association being associated with at least one likelihood value based on the frequency of occurrence of that association in a large body of text of the first language;

(b) analysing (14) the section of text to establish a first association between the first word or phrase of the section of text and a second word or phrase, at least one first likelihood value corresponding to the association, and a first plausibility value corresponding to the first word or phrase and based on the at least one likelihood value;

(c) providing a second database (4) in which each entry associates at least one word or phrase with a set of words or phrases with which it is confusable;

(d) selecting (22) from the second database (4), or computing, a confusable word or phrase as a candidate substitute for the first word or phrase in the section of text;

(e) deriving (23, 24) a second plausibility value for the confusable word or phrase, based on the likelihood value in the first database (3) of a second association consisting of the confusable word or phrase and another word or phrase in the section of text; and

(f) selectively providing an indication (25, 26) of the confusable word or phrase based on the computed plausibility values.
2. A method as claimed in claim 1, characterised in that the likelihood value of each association in the first database (3) is further based on the frequency of occurrence of each other association that includes one of the words or phrases and has the same association relationship.
3. A method as claimed in claim 1, characterised in that the likelihood value of each association in the first database (3) is further based on the frequency of occurrence of all other associations having the same association relationship.
4. A method as claimed in claim 1, characterised in that the likelihood value of each association in the first database (3) is formed by at least one of mutual information, T-score, Yule's Q coefficient and log-likelihood.
5. A method as claimed in claim 1, characterised in that, in step (e), the other word or phrase is the second word or phrase, and the association relationship of the second association is the same as that of the first association.
6. A method as claimed in claim 1, characterised in that step (b) comprises establishing a set of first associations for a set of first words or phrases in the section of text, and steps (d), (e) and (f) are performed for each first association.
7. A method as claimed in claim 1, characterised in that step (b) comprises establishing associations between non-adjacent words or phrases in the text.
8. A method as claimed in claim 1, characterised in that step (d) comprises selecting each confusable word or phrase of the set for a word or phrase, and steps (e) and (f) are performed for each confusable word or phrase.
9. A method as claimed in claim 8, characterised in that step (f) comprises providing the indications in descending order of the second plausibility values.
10. A method as claimed in claim 1, characterised in that steps (d), (e) and (f) are performed if the first plausibility value is less than a first threshold (19).
11. A method as claimed in claim 1, characterised in that step (f) comprises providing the indication when the or each second plausibility value exceeds a second threshold (25).
12. A method as claimed in claim 1, characterised in that step (f) comprises providing the indication (26) if the second plausibility value is greater than the first plausibility value (25).
13. A method as claimed in claim 1, characterised in that step (b) comprises calculating (14) the first plausibility value by means of a function learned by machine learning techniques from an error-tagged learner corpus and its associated likelihood values.
14. A method as claimed in claim 1, characterised by replacing (23) the first word in the section of text with the confusable word.
15. A method as claimed in claim 1, characterised in that the section of text is generated by translation from a second language.
16. A method as claimed in claim 1, characterised in that the section of text is generated from a printed document by optical character recognition.
17. A computer program for programming a computer to perform a method as claimed in claim 1.
18. A storage medium containing a program as claimed in claim 17.
19. A medium as claimed in claim 18, comprising a computer-readable medium.
20. A computer containing a program as claimed in claim 17.
21. An apparatus for altering or improving a selected first word or phrase in a section of written or spoken text comprising a set of words of a first language, characterised by comprising:

a first database (3) of associations between words or phrases of the first language, each association being associated with at least one likelihood value based on the frequency of occurrence of that association in a large body of text of the first language;

a controller (1) for analysing (14) the section of text to determine a first association between the first word or phrase of the section of text and a second word or phrase, at least one first likelihood value corresponding to the association, and a first plausibility value corresponding to the first word or phrase and based on the at least one likelihood value; and

a second database (4) in which each entry associates at least one word or phrase with a set of words or phrases with which it is confusable;

wherein:

the controller (1) selects (22) from the second database (4), or computes, a confusable word or phrase as a candidate substitute for the first word or phrase in the section of text;

the controller (1) derives (23, 24) a second plausibility value for the confusable word or phrase, based on the likelihood value in the first database (3) of a second association consisting of the confusable word or phrase and another word or phrase in the section of text; and

the controller (1) selectively provides an indication (25, 26) of the confusable word or phrase based on the computed plausibility values.