CN101238459A

CN101238459A - Comparing text based documents

Info

Publication number: CN101238459A
Application number: CNA2006800254177A
Authority: CN
Inventors: Robert Francis Williams; Heinz Dreher
Original assignee: Curtin University of Technology
Current assignee: Curtin University of Technology
Priority date: 2005-05-13
Filing date: 2006-05-12
Publication date: 2008-08-06

Abstract

A method of and a system for comparing text based documents comprises lexically normalising each word of the text of a first document (104) to form a first normalised representation. A vector representation of the first document is built (206) from the first normalised representation. Each word of the text of a second document (110) is lexically normalised to form a second normalised representation. A vector representation of the second document is built (204) from the second normalised representation. The alignment of the vector representations is compared (210) to produce a score (218) of the similarity of the second document to the first document.

Description

Comparing text based documents

Technical field

The present invention relates to the automated comparison of text based documents to obtain an indication of their similarity. The invention has application in a number of fields, including but not limited to document searching and automated essay grading.

Background technology

In simple terms, Internet search engines scan web pages (which are text based documents) for specified words and return as results the web pages that match those words. No Internet search engine is known that locates documents based on a concept of similarity rather than on specified words.

Automated essay grading is more complex. Here the aim is to grade an essay (a text based document) by comparing its content with an expected answer, rather than according to a particular set of words.

Summary of the invention

According to a first aspect of the present invention, there is provided a method of comparing text based documents, comprising:

lexically normalising each word of the text of a first document to form a first normalised representation;

building a vector representation of the first document from the first normalised representation;

lexically normalising each word of the text of a second document to form a second normalised representation;

building a vector representation of the second document from the second normalised representation;

comparing the alignment of the vector representations to produce a score of the similarity of the second document to the first document.

Preferably, lexical normalisation converts each word in the document into a representation of a root concept defined in a dictionary. Each word is used to look up its root concept in the dictionary. Preferably, each root concept is assigned a numeric value, so that in some embodiments normalisation produces a numeric representation of the document. Each normalised root concept forms one dimension of the vector representation. The occurrences of each root concept are counted.

The count of each normalised root concept forms the length of the vector in the corresponding dimension of the vector representation.

Preferably, the alignment of the vector representations is compared by determining the cosine of the angle (theta) between the vectors, which produces the score.

Typically, cos(theta) is calculated from the dot product of the vectors and the lengths of the vectors.
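The calculation described above can be sketched as follows. This is a minimal illustration only: the function name `cos_theta`, the concept numbers and the treatment of an empty document are assumptions made for the example, not details taken from the specification.

```python
import math
from collections import Counter


def cos_theta(first_counts, second_counts):
    """Cosine of the angle between two concept-count vectors.

    Each key is a concept number (one vector dimension) and each value
    is the count of that concept in the document.
    """
    concepts = set(first_counts) | set(second_counts)
    dot = sum(first_counts.get(c, 0) * second_counts.get(c, 0) for c in concepts)
    len_first = math.sqrt(sum(v * v for v in first_counts.values()))
    len_second = math.sqrt(sum(v * v for v in second_counts.values()))
    if len_first == 0 or len_second == 0:
        return 0.0  # assumption: a document with no concepts scores zero
    return dot / (len_first * len_second)


# Hypothetical concept numbers mapped to their counts.
model = Counter({100: 2, 678: 1, 34: 1})
student = Counter({100: 1, 678: 1, 234: 1})
score = cos_theta(model, student)
```

A value near 1 means the two vectors point in nearly the same direction; a value of 0 means the documents share no counted concepts.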

In some embodiments, the number of root concepts in each document is counted. In one embodiment, each root concept with a non-zero count contributes to the count of concepts in each document. Particular root concepts may be excluded from the count of concepts. Preferably, the count of concepts of the second document is compared with the count of concepts of the first document, and the comparison contributes to the score of the similarity of the second document to the first document. Typically, the contribution of each root concept with a non-zero count is 1. Preferably, the comparison is a ratio.
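A short sketch of this concept count comparison is given below. The names `concept_count` and `concept_ratio` are hypothetical, and the direction of the ratio (second document over first) and the handling of an empty first document are assumptions; the text only states that the comparison is preferably a ratio.

```python
def concept_count(counts, excluded=frozenset()):
    """Number of concepts with a non-zero count; each contributes 1."""
    return sum(1 for concept, n in counts.items()
               if n > 0 and concept not in excluded)


def concept_ratio(first_counts, second_counts, excluded=frozenset()):
    """Concept count of the second document relative to the first."""
    first = concept_count(first_counts, excluded)
    if first == 0:
        return 0.0  # assumption: nothing to compare against
    return concept_count(second_counts, excluded) / first


# Hypothetical concept numbers mapped to their counts.
model = {100: 2, 678: 1, 34: 1, 987: 3}
student = {100: 1, 678: 1, 234: 0}

ratio_all = concept_ratio(model, student)                   # 2 of 4 concepts
ratio_excl = concept_ratio(model, student, excluded={100})  # 1 of 3 concepts
```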

In a preferred embodiment, the first document is a model answer essay, the second document is an essay to be marked, and the score is the mark for the second essay.

According to a second aspect of the present invention, there is provided a system for comparing text based documents, comprising:

means for lexically normalising each word of the text of a first document to form a first normalised representation;

means for building a vector representation of the first document from the first normalised representation;

means for lexically normalising each word of the text of a second document to form a second normalised representation;

means for building a vector representation of the second document from the second normalised representation;

means for lexically normalising the text of the first document;

means for comparing the alignment of the vector representations to produce a score of the similarity of the second document to the first document.

According to a third aspect of the present invention, there is provided a method of comparing text based documents, comprising:

partitioning the words of a first document into noun phrases and verb clauses;

partitioning the words of a second document into noun phrases and verb clauses;

comparing the partitioning of the first document with the partitioning of the second document to produce a score of the similarity of the second document to the first document.

In one embodiment, each word in the documents is lexically normalised into a root concept.

Preferably, the partitioning of the documents is compared by determining the following ratios: the ratio of the number of noun phrase components of one or more types in the second document to the number of noun phrase components of the corresponding type in the first document, and the ratio of the number of verb clause components of one or more types in the second document to the number of verb clause components of the corresponding type in the first document, wherein these ratios contribute to the score.

Preferably, the types of noun phrase component are: noun phrase nouns, noun phrase adjectives, noun phrase prepositions and noun phrase conjunctions. Preferably, the types of verb clause component are: verb clause verbs, verb clause adverbs, verb clause auxiliaries, verb clause prepositions and verb clause conjunctions.
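The component ratios just described might be computed along the following lines. This is a sketch under stated assumptions: the tag names, the function `component_ratios`, the sample counts and the choice of 0.0 when the first document has no component of a given type are all illustrative.

```python
# Component types named in the text, using short tags chosen for this example.
NP_TYPES = ("N", "ADJ", "PREP", "CONJ")
VC_TYPES = ("V", "ADV", "AUX", "PREP", "CONJ")


def component_ratios(first_counts, second_counts, types):
    """Ratio, per component type, of the second document to the first."""
    ratios = {}
    for t in types:
        first = first_counts.get(t, 0)
        # Assumption: a type absent from the first document yields 0.0.
        ratios[t] = second_counts.get(t, 0) / first if first else 0.0
    return ratios


# Hypothetical counts of noun phrase components in each document.
model_np = {"N": 10, "ADJ": 4, "PREP": 2, "CONJ": 0}
student_np = {"N": 5, "ADJ": 6, "PREP": 1, "CONJ": 1}
np_ratios = component_ratios(model_np, student_np, NP_TYPES)
```

The same function can be applied to verb clause counts by passing `VC_TYPES`.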

In a preferred embodiment, the first document is a model answer essay, the second document is an essay to be marked, and the score is the mark for the second essay.

According to a fourth aspect of the present invention, there is provided a system for comparing text based documents, comprising:

means for partitioning the words of a first document into noun phrases and verb clauses;

means for partitioning the words of a second document into noun phrases and verb clauses;

means for comparing the partitioning of the first document with the partitioning of the second document to produce a score of the similarity of the second document to the first document.

According to a fifth aspect of the present invention, there is provided a method of comparing text based documents, comprising:

determining the number of root concepts in a first document from a first normalised representation;

determining the number of root concepts in a second document from a second normalised representation;

comparing the number of root concepts in the first document with the number of root concepts in the second document to produce a score of the similarity of the second document to the first document.

According to a sixth aspect of the present invention, there is provided a system for comparing text based documents, comprising:

means for determining the number of root concepts in a first document from a first normalised representation;

means for determining the number of root concepts in a second document from a second normalised representation;

means for comparing the number of root concepts in the first document with the number of root concepts in the second document to produce a score of the similarity of the second document to the first document.

According to a seventh aspect of the present invention, there is provided a method of grading a text based essay, comprising:

providing a model answer;

providing a plurality of hand-marked essays;

providing a plurality of essays to be graded;

providing an equation for grading essays, wherein the equation has a plurality of measures, each measure having a coefficient, the equation calculating the score of an essay by summing each measure as modified by its respective coefficient, and each measure being determined by comparing each essay to be graded with the model answer;

determining the coefficients from the hand-marked essays;

applying the equation to each essay to be graded to produce a score for each essay.

Preferably, the coefficients are determined from the hand-marked essays by linear regression.

Preferably, the measures include the score produced by the method of comparing text based documents described above.

According to an eighth aspect of the present invention, there is provided a system for grading a text based essay document, comprising:

means for determining the coefficients of an equation from a plurality of hand-marked essays, wherein the equation is used to grade the essays to be graded, the equation comprises a plurality of measures, each measure has one of the coefficients, and the equation produces the score of an essay, the score being calculated by summing each measure as modified by its respective coefficient;

means for determining each measure by comparing each essay to be graded with the model answer;

means for applying the equation to each essay to be graded, using the determined coefficients and the determined measures, to produce a score for each essay.

According to a ninth aspect of the present invention, there is provided a method of providing visual feedback on the grading of an essay, comprising:

displaying a count of each root concept in the marked essay and a count of each root concept expected in the answer.

Preferably, each root concept corresponds to the root meaning of a word as defined in a dictionary. In some embodiments, the count of each root concept is determined by lexically normalising each word in the marked essay to produce a representation of the root meanings in the marked essay, and counting the occurrences of each root meaning in the marked essay. The counts of root concepts expected in the answer are obtained in the same way from the model answer.

Preferably, the display is graphical. More preferably, the display is a bar graph with a bar for each root concept.

In an embodiment, the method further comprises selecting a concept in the marked essay and displaying the words in the marked essay that belong to that concept. Preferably, words relating to other concepts in the answer are also displayed. Preferably, the display is by highlighting.

In another embodiment, the method further comprises selecting a concept in the expected essay and displaying the words in that essay that belong to that concept. Preferably, words relating to other concepts in the answer are also displayed. Preferably, the display is by highlighting.

Preferably, the method further comprises displaying synonyms of the selected root concept.

According to a tenth aspect of the present invention, there is provided a system for providing visual feedback on the grading of an essay, comprising:

means for displaying a count of each root concept in the marked essay and a count of each root concept expected in the answer.

According to an eleventh aspect of the present invention, there is provided a method of numerically representing a document, comprising:

lexically normalising each word of the document;

partitioning the normalised words of the document into a plurality of parts, each part being designated as a noun phrase or as a verb clause.

Preferably, a plurality of words is used to determine whether each part is a noun phrase or a verb clause. In one embodiment, the first three words of each part are used to determine whether the part is a noun phrase or a verb clause. In some embodiments, each word in a part is allocated to a slot in a row of a noun phrase table or to a slot in a row of a verb clause table.

Each slot of a table is assigned to words of one grammatical type.

If a word has the grammatical type of the next slot, the words are allocated sequentially to the slots of the appropriate table. If the next word does not belong in the next slot, that slot is left blank and allocation moves on to the following slot position.

If the next word does not belong to the table type of the current part, this indicates the end of the current part.

In some embodiments, the tables have a plurality of rows, so that when the next word does not fit in the remainder of the row after the current word, but does not indicate the end of the current part, the next word is placed in the next row of the table.

According to a twelfth aspect of the present invention, there is provided a system for numerically representing a document, comprising:

means for lexically normalising each word of the document;

means for partitioning the normalised words of the document into a plurality of parts, each part being designated as a noun phrase or as a verb clause.

According to a thirteenth aspect of the present invention, there is provided a computer program configured to control a computer to perform any one of the methods defined above.

According to a fourteenth aspect of the present invention, there is provided a computer program configured to control a computer to operate as any one of the systems defined above.

According to a fifteenth aspect of the present invention, there is provided a computer readable medium comprising a computer program as defined above.

Description of drawings

In order to provide a better understanding of the present invention, preferred embodiments will now be described in detail, by way of example only, with reference to the accompanying drawings, in which:

Fig. 1 is a schematic diagram of a preferred embodiment of an apparatus for comparing text based documents according to an embodiment of the present invention;

Fig. 2 is a flow chart of a method of comparing text based documents according to an embodiment of the present invention, in which the documents are a model answer essay and essays to be marked;

Fig. 3 is a diagram of the vector representations of 3 documents;

Fig. 4 is a screenshot of a window generated by a computer program of an embodiment of the present invention, in which an essay is marked according to a method of an embodiment of the present invention;

Fig. 5 is a screenshot of a window generated by the computer program, in which the concepts of the marked essay are compared with the concepts of the model answer;

Fig. 6 is a window showing an alphabetical list;

Fig. 7 is a set of flow charts of some embodiments of the present invention; and

Fig. 8 is a flow chart of an embodiment of the present invention.

Embodiment

Referring to Fig. 1, a system 10 for comparing text based documents typically takes the form of a computer having a processor and memory loaded with suitable software, which controls the computer to operate as the system for comparing text based documents. The system 10 comprises: an input 12 for receiving input from a user and for receiving electronic text based documents containing at least one word; a processor 14 for performing the computations that compare the text based documents; storage means 16, such as a hard disk drive or memory, for temporarily storing the text based documents being compared and the computer program that controls the processor 14; and an output 18, such as a display, for providing the result of the comparison.

The system 10 operates according to the method shown in Fig. 2. In a process 100, a set of answers is first prepared. At 102, an essay topic outlining the essay to be assessed is set. At 104, answers to the essay topic are written. The answers should be electronic text documents or should be converted into electronic text documents.

At 106, a sample is separated from the answers for hand marking by one or more markers. Preferably, the sample contains at least 10 answers. As a rule of thumb, the number of documents in the sample should be about 5 times the number of predictors. For the equations given below, the sample should contain at least 50 documents, and preferably 100. Typically, a marking key 112 is devised from the essay topic 102. A marker, or more preferably several markers, hand marks (manually marks) the sample. In the preferred case where the same essay is marked by more than one person, the average of the hand marks for each sample essay is used. The remaining answers of the answers 104 form the answers 108 for automated marking.

A model answer 110 is required. The model answer can be written from the marking key at 114, or the best answer 116 of the sample 106 used for hand marking can be used as the model answer.

Each text based answer, namely the model answer 110, the sample of hand-marked answers 106 and the remaining answers 108 for automated marking, is input 202 to the system 10 through the input 12.

The automated essay grading technique 200 then follows. Each input 202, namely the model answer 110, the hand-marked answer sample 106 and the remaining answers 108 for automated marking, is processed into the structures described further below. These steps are 204, 206 and 208 respectively. The processed model answer from 204 is then compared 210 with each processed hand-marked answer from 206 to produce a set of measures, which are defined in more detail below. A measure is in essence one or more values obtained by comparing each hand-marked answer with the model answer using one of several techniques. The measures are then used to find the coefficients of the marking equation described further below.

At 212, the measures for each hand-marked answer are compared with the marks given during hand marking, and a model building technique is used to find the coefficients that best reproduce the hand marks from the measures. Typically this is a linear regression technique, although it will be appreciated that other modelling techniques can also be used.

Each essay answer to be marked automatically from 208 is compared 214 with the model answer from 204 to produce the measures for each answer. At 216, the coefficients determined at 212 are applied to the measures of each essay to produce a mark for each essay. The set of marks is then output at 218. The essay answers can also be viewed using the display technique described further below, which provides feedback to the essay writer.
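The two stages at 212 and 216 can be illustrated with a small ordinary least-squares sketch. Everything named here is an assumption for illustration: the helper functions, the two-measure setup and the sample data are invented, and the data set is far smaller than the sample sizes recommended above. The text specifies only that linear regression is typically used, not how it is implemented.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_coefficients(measures, marks):
    """Least-squares fit of mark = intercept + sum(coefficient * measure)."""
    rows = [[1.0] + list(row) for row in measures]  # leading 1.0 fits the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, marks)) for i in range(k)]
    return solve(xtx, xty)


def mark_essay(coefficients, measure_values):
    """Apply the fitted equation to the measures of one essay."""
    return coefficients[0] + sum(
        c * v for c, v in zip(coefficients[1:], measure_values))


# Invented (CosTheta, VarRatio) measures and hand marks for five sample essays.
sample_measures = [(0.9, 1.0), (0.5, 0.8), (0.7, 0.4), (0.2, 0.6), (0.4, 0.9)]
hand_marks = [16.0, 11.0, 11.0, 7.0, 10.5]

coefficients = fit_coefficients(sample_measures, hand_marks)  # step 212
new_mark = mark_essay(coefficients, (0.6, 0.5))               # step 216
```

Solving the normal equations directly keeps the sketch free of third-party dependencies; a numerical library would normally be preferred for larger sets of measures.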

The marking equation

The following equation is used to calculate the essay mark:

Mark = C*CosTheta + D*VarRatio + other factors

The purpose of the other factors is to evaluate the essay as a whole rather than how well the essay answers the topic; they take into account aspects such as style, readability, and spelling and grammatical errors. CosTheta and VarRatio assess how well the essay answers the question.

C and D are weighting variables.

The following is a more detailed equation for calculating the essay mark:

Mark = intercept + A*FleschReadingEase
     + B*FleschKincaidGradeLevel
     + C*CosTheta + D*VarRatio
     + E*RatioNPNouns + F*RatioNPAjectives
     + G*RatioNPPrepositions + H*RatioNPConjunctions
     + I*RatioVPVerbs + J*RatioVPAdverbs
     + K*RatioVPAuxilliaries + L*RatioVPPrepositions
     + M*RatioVPConjunctions + N*NoParagraphs
     + O*NoPhrases + P*NoWords
     + Q*NoSentencesPerParagraph
     + R*NoWordsPerSentence + S*NoCharactersPerWord
     + T*NoSpellingErrors + U*NoGrammaticalErrors

where A to U are regression coefficients calculated from the corresponding variables in the training set of essays. In most cases, many of these coefficients are zero.

intercept is the intercept value calculated for the regression equation (it can be thought of as the value at the point of intersection with the Y-axis);

FleschReadingEase is the Flesch reading ease calculated by Microsoft Word for the student essay;

FleschKincaidGradeLevel is the Flesch-Kincaid reading level calculated by Microsoft Word for the student essay;

CosTheta is calculated as explained further below;

VarRatio is calculated as explained further below;

RatioNPNouns is the ratio of nouns in noun phrases in the student essay to those in the model answer;

RatioNPAjectives is the ratio of adjectives in noun phrases in the student essay to those in the model answer;

RatioNPPrepositions is the ratio of prepositions in noun phrases in the student essay to those in the model answer;

RatioNPConjunctions is the ratio of conjunctions in noun phrases in the student essay to those in the model answer;

RatioVPVerbs is the ratio of verbs in verb clauses in the student essay to those in the model answer;

RatioVPAdverbs is the ratio of adverbs in verb clauses in the student essay to those in the model answer;

RatioVPAuxilliaries is the ratio of auxiliaries in verb clauses in the student essay to those in the model answer;

RatioVPPrepositions is the ratio of prepositions in verb clauses in the student essay to those in the model answer;

RatioVPConjunctions is the ratio of conjunctions in verb clauses in the student essay to those in the model answer;

NoParagraphs is the number of paragraphs in the student essay;

NoPhrases is the total number of noun phrases and verb clauses in the student essay;

NoWords is the number of words in the student essay;

NoSentencesPerParagraph is the average number of sentences per paragraph in the student essay;

NoWordsPerSentence is the average number of words per sentence in the student essay;

NoCharactersPerWord is the average number of characters per word in the student essay;

NoSpellingErrors is the total number of spelling errors in the student essay as calculated by Microsoft Word; and

NoGrammaticalErrors is the number of grammatical errors in the student essay as calculated by Microsoft Word.

The following is an alternative equation that can be used to calculate the essay mark:

Mark = A*FleschReadingEase
     + B*FleschKincaidGradeLevel
     + C*CosTheta + D*VarRatio
     + E*%SpellingErrors + F*%GrammaticalErrors
     + G*ModelLength + H*StudentLength
     + I*StudentDotProduct + J*NoStudentConcepts
     + K*NoModelConcepts + L*NoSentences
     + M*NoWords + N*NonConceptualisedWordSRatio
     + O*RatioNPNouns + P*RatioNPAjectives
     + Q*RatioNPPrepositions
     + R*RatioNPConjunctions
     + S*RatioVPVerbs + T*RatioVPAdverbs
     + U*RatioVPAuxilliaries + V*RatioVPPrepositions
     + W*RatioVPConjunctions

where A to W are regression coefficients calculated from the corresponding variables in the training set of essays. In most cases, many of these coefficients are zero.

FleschReadingEase is the Flesch reading ease calculated by Microsoft Word for the student essay;

FleschKincaidGradeLevel is the Flesch-Kincaid reading level calculated by Microsoft Word for the student essay;

CosTheta is calculated as explained further below;

VarRatio is calculated as explained further below;

%SpellingErrors is the number of spelling errors in the student essay as calculated by Microsoft Word, expressed as a percentage of all words in the student essay;

%GrammaticalErrors is the number of grammatical errors in the student essay as calculated by Microsoft Word, expressed as a percentage of all sentences in the student essay;

ModelLength is the vector length of the model answer vector, obtained as explained further below;

StudentLength is the vector length of the student answer vector, obtained as explained further below;

StudentDotProduct is the dot product of the student vector and the model vector, obtained as explained further below;

NoStudentConcepts is the number of concepts covered by the words appearing in the student essay;

NoModelConcepts is the number of concepts covered by the words appearing in the model answer;

NoSentences is the number of sentences in the student essay;

NoWords is the number of words in the student essay;

NonConceptualisedWordSRatio is the number of words in the student essay that could not be found in the dictionary, expressed as a ratio of the total number of words in the student essay;

RatioVPPrepositions is the ratio of prepositions in verb clauses in the student essay to those in the model answer; and

RatioVPConjunctions is the ratio of conjunctions in verb clauses in the student essay to those in the model answer.

Coefficients close to zero may be set to zero to simplify the equation. Terms of the equation whose coefficient is zero (that is, the coefficient and the variable to which it is applied) may be removed from the equation.
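The simplification step can be sketched as below. The function name `simplify` and the tolerance value are assumptions; the text does not say how close to zero a coefficient must be before it is dropped, and the coefficient values are invented.

```python
def simplify(coefficients, tolerance=1e-3):
    """Drop terms whose coefficient is zero or within tolerance of zero.

    `tolerance` is a hypothetical threshold chosen for this example.
    """
    return {name: c for name, c in coefficients.items() if abs(c) > tolerance}


# Invented regression coefficients keyed by variable name.
fitted = {
    "CosTheta": 12.5,
    "VarRatio": 4.0,
    "NoParagraphs": 0.0004,
    "RatioNPNouns": 0.0,
}
reduced = simplify(fitted)
```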

To compare the essays with the model answer, they need to be converted into a structure suitable for comparison. The essays are converted as follows:

each word in each essay is lexically normalised by using a dictionary to look up the root concept of the word; and

a conceptual model of the structure of the essay is built.
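The first of these steps, lexical normalisation, can be sketched with a toy dictionary. The dictionary contents, the synonym groupings, the tokenising rule and the name `normalise` are all assumptions made for illustration; a real embodiment would look words up in a full dictionary of root concepts.

```python
import re
from collections import Counter

# Hypothetical dictionary: word -> index number of its root concept.
ROOT_CONCEPTS = {
    "small": 143, "little": 143, "tiny": 143,
    "dog": 678, "hound": 678,
    "walked": 34, "strolled": 34,
}


def normalise(text, dictionary=ROOT_CONCEPTS):
    """Replace each known word by its root concept number.

    Returns the concept numbers in order and the words not found.
    """
    words = re.findall(r"[a-z']+", text.lower())
    concepts = [dictionary[w] for w in words if w in dictionary]
    unknown = [w for w in words if w not in dictionary]
    return concepts, unknown


concepts, unknown = normalise("A tiny dog walked; the small dog strolled")
counts = Counter(concepts)  # count of each root concept in the text
```

Because synonyms map to the same number, two essays that use different words for the same idea produce the same concept counts.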

Conceptual model

For conceptual modelling, the essay is divided into noun phrases and verb clauses by a technique described later as "chunking", so that sentence structure is represented in terms of subject and predicate, referred to as the noun phrase (NP) and the verb phrase (VP). In general, the NP specifies the subject under discussion, and the VP specifies the action applied to or performed by the subject. However, VPs are notoriously more complex to process than NPs, because a VP can often contain many verb clauses (VCs) and NPs grouped together. Identifying VCs is much easier than identifying complex VPs. The basis of the technique used is to represent the meaning of the words making up NPs and VCs by consecutive structured slots, where each slot contains a numeric value representing the dictionary index number of the root meaning of the word in that slot. This is considered to build a numeric summary of the meaning of the sentences in the document.

The exact structure of the NP and VC slots is discussed further below, but the following examples explain the concept. A typical sentence contains alternating NPs and VCs. The words and numeric content of a typical NP might be:

DET ADJ ADJ N

The small black dog

100 143 97 678

DET is a determiner, ADJ is an adjective and N is a noun.

The words and numeric content of a typical VC might be:

V ADV ADV

walked slowly down

34 987 67

V is a verb and ADV is an adverb.

The words and numeric content of a typical concluding NP might be:

DET N

the street

100 234

The numbers in these examples are the dictionary index numbers of the corresponding words. The numbers here are made up and serve only for explanation. Sentences are typically made up of groups of alternating NPs and VCs, though not necessarily in that order, so a sentence summary can be represented by a set of NP and VC slot groups containing numeric dictionary indices. A document can then be made up of a collection of these sets. Note that a sentence need not start with an NP; it can also start with a VP.
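One way to hold the example sentence in memory is sketched below. The `Phrase` class and `numeric_summary` function are hypothetical names introduced for illustration; the index numbers are the made-up values from the examples above.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    kind: str      # "NP" or "VC"
    tags: list     # grammatical type of each word
    words: list    # the words themselves
    indices: list  # dictionary index number of each word (made up)


sentence = [
    Phrase("NP", ["DET", "ADJ", "ADJ", "N"],
           ["The", "small", "black", "dog"], [100, 143, 97, 678]),
    Phrase("VC", ["V", "ADV", "ADV"],
           ["walked", "slowly", "down"], [34, 987, 67]),
    Phrase("NP", ["DET", "N"], ["the", "street"], [100, 234]),
]


def numeric_summary(phrases):
    """Keep only the phrase type and the index numbers of each phrase."""
    return [(p.kind, p.indices) for p in phrases]


summary = numeric_summary(sentence)
```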

The NP structure

Martha Kolln (Kolln, M. (1994) Understanding English Grammar, MacMillan, New York) sets out on page 443 the following transformational grammar rule defining an NP:

(1) NP = (DET) + (ADJ) + N + (PREP PHR) + (S)

and, on page 429, PREP PHR as follows:

PREP PHR = PREP + NP

PREP PHR is a prepositional phrase and S is a subject.

When expressed as the slots provided for an NP, (1) above can be written as:

(2) NP = DET ADJ N PREP NP S

The basic construct of an NP is:

(3) NP = DET ADJ N

together with some additional structure. In practice it has been found that

(4) NP = DET ADJ ADJ ADJ N

is a better structure. If this structure is taken as the core structure of an NP, a complete NP structure can be built from it by joining multiple occurrences of the core structure with PREP. It has been found in practice that joining by CONJ (conjunction) should also be allowed. The basic construct is therefore:

(5) NP = CONJ PREP : DET ADJ ADJ ADJ N

where the two slots before the colon are linking slots and the slots after it are content slots. Because the NP slot template should handle most of the actual NPs encountered in general English text, practice shows that this basic construct should be allowed to occur about 40 times. In fact, the current implementation of the program allows the basic construct to occur an unlimited number of times. Table 1 shows the first 10 rows of this array.

Table 1 Noun phrase semantic structure

The first core construct in a sentence usually has its CONJ and PREP slots set to blank (in practice, the number 0). Any unused slots are also set to 0.

The VC structure

Kolln (1994) sets out on page 428 the following transformational grammar rule defining a VP:

(6) VP = AUX + V + (COMP) + (ADV)

AUX is an auxiliary verb. COMP is interpreted as an NP or ADJ, so by removing it from the VP we obtain the VC as follows:

(7) VC = AUX + V + ADV

It has been found in practice that a more useful structure is obtained if this VC definition is modified by adding extra AUX and ADV slots:

(8) VC = AUX AUX ADV ADV V AUX AUX ADV ADV

A VC is often introduced by a CONJ, and it has been found in practice that a PREP should also be allowed in a VC, so the complete VC definition is:

(9) VC = CONJ PREP AUX AUX ADV ADV V AUX AUX ADV ADV

This basic VC construct should be allowed to occur 40 times in order to handle the VCs encountered in practice. In fact, the current implementation of the program allows the basic construct to occur an unlimited number of times. Table 2 shows the first 10 rows of this array.

Table 2 Verb clause semantic structure

If a sentence starts with a VC, the CONJ slot is set to blank (in practice, the number 0). Any unused slots are also set to 0.

Table 3 illustrates the position that sentence is formed, so that determine the phrase type of 3 positions, table 4 illustrates more multipoint phrase type.P is PREP in these tables.

Table 4

Phrase type / Slot	0	1	2	3	4	5	6	7	8	9	10
NOUN	CONJ	P	DET	ADJ	ADJ	ADJ	N
VERB	CONJ	P	AUX	AUX	ADV	ADV	V	AUX	AUX	ADV	ADV

Fig. 8 shows a process 300 for parsing a sentence so as to partition it into noun phrases and verb clauses. The process 300 starts at the beginning of each sentence, which has not yet been determined to be a noun phrase or a verb phrase, at 302. The parts of speech (POS) of the first three words are obtained at 304. More or fewer words could be used, but three has been found to be particularly useful.

While at least one word remains in the sentence, the process proceeds through the loop stage 318. At 308, the three words in the three positions are looked up in the pattern table (Table 3) to determine whether they indicate an NP or a VC. If no pattern is recognised, an error is produced and the analysis moves to the next sentence, or moves on until another NP or VC is recognised.

It is then determined whether the current phrase type differs from the type currently assigned to the sentence. At the beginning of a sentence the answer is necessarily no. If the phrase type has changed, this indicates the end of the previous phrase and the beginning of a new phrase, at 312. The index of the word is obtained as further described below in relation to 316. If this is the first phrase of the sentence, or the type determined at 308 is unchanged, the current word is added to the current phrase type at 314. Then, at 316, the process advances: the second word moves to the position of the first word, the third word moves to the position of the second word and, if any words remain in the sentence, a new word is read into the position of the third word. If at least one word remains, the process returns to 306. If no words remain, the process ends.
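The loop described above can be sketched in Python as follows. This is only an illustrative sketch: Table 3 is not reproduced in this text, so the function `phrase_type` is a hypothetical stand-in that lets the first non-linking part of speech in the three-word window decide the phrase type, and the tag names and tag sets are assumptions.

```python
# Hypothetical tag sets standing in for the pattern table (Table 3).
NOUN_TAGS = {"DET", "ADJ", "N"}
VERB_TAGS = {"AUX", "ADV", "V"}

def phrase_type(window):
    """Assumed rule: the first POS that is not CONJ/PREP decides the type."""
    for pos in window:
        if pos in NOUN_TAGS:
            return "NOUN"
        if pos in VERB_TAGS:
            return "VERB"
    raise ValueError(f"no pattern recognised for {window}")

def chunk_sentence(tagged_words):
    """Partition (word, POS) pairs into alternating NOUN and VERB phrases."""
    phrases = []
    current_type = None
    for i, (word, _) in enumerate(tagged_words):
        # Look at the current word and up to two words that follow it.
        window = [pos for _, pos in tagged_words[i:i + 3]]
        new_type = phrase_type(window)
        if new_type != current_type:
            # A type change ends the previous phrase and starts a new one.
            phrases.append((new_type, []))
            current_type = new_type
        phrases[-1][1].append(word)
    return phrases

sentence = [("This", "DET"), ("essay", "N"), ("will", "AUX"),
            ("discuss", "V"), ("why", "ADV"), ("it's", "N")]
for kind, words in chunk_sentence(sentence):
    print(kind, words)
```

A real implementation would replace `phrase_type` with a look-up of the three POS tags in the pattern table.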

An example of these structures in practice is given below for the following text:

This essay will discuss why it′s a good idea for the Government to raise school leaving age to 17.

It will also state why most people in Australia agree with the Government on this particular topic.

Paragraph 1

Sentence 1

Phrase 1(Noun)

Row 1

|This|essay|

|DET |N |

|5082|238 |

Phrase 2(Verb)

Row 1

|will|discuss|why|

|AUX |V |ADV|

|2034|238 |99 |

Phrase 3(Noun)

Row 1

|it′s|

|N |

|25 |

Row 2

|a|

|N|

|-|

Row 3

|good|idea|

|ADJ |N |

|317 |317 |

Row 4

|for|the|Government|

|P |DET|N |

|705|507|63 |

Phrase 4(Verb)

Row 1

|to|raise|

|P |V |

|71|307 |

Phrase 5(Noun)

Row 1

|school|

|N |

|307 |

Phrase 6(Verb)

Row 1

|leaving|

|V |

|-1 |

Phrase 7(Noun)

Row 1

|age|

|N |

|553|

Row 2

|to|17.|

|P |N |

|71|-1 |

Paragraph 2

Sentence 1

Phrase 1(Noun)

Row 1

|It|

|N |

|25|

Row 2

|will|

|N |

|131 |

Phrase 2(Verb)

Row 1

|also|

|ADV |

|8 |

Phrase 3(Noun)

Row 1

|state|

|N |

|438 |

Phrase 4(Verb)

Row 1

|why|

|ADV|

|99 |

Phrase 5(Noun)

Row 1

|most|people|

|DET |N |

|5042|373 |

Row 2

|in|Australia|

|P |N |

|70|502 |

Phrase 6(Verb)

Row 1

|agree|

|V |

|20 |

Phrase 7(Noun)

Row 1

|with|the|Government.|

|P |DET|N |

|7142|507|63 |

Sentence 2

Phrase 1(Noun)

Row 1

|on|this|particular|topic.|

|P |DET |ADJ |N |

|70|5082|310 |455 |

This method of partitioning produces a computationally efficient numerical representation of the document.

Determining the metrics

After each essay has been processed into the required structure, the metrics are determined using the following methods.

Vector representation

To generate the following metrics, a vector representation of each essay is built:

CosTheta; VarRatio; ModelLength; StudentLength; and StudentDotProduct.

The vector representation of each essay is built as follows. Each possible basic meaning in the dictionary is assigned to one dimension of a set of axes in a hyperspace. Each word that contributes to a basic meaning is counted, and this count becomes the length of the vector in the corresponding dimension of the hyperspace.

In this way, the count of the words lexically normalised to each basic meaning is used in the vector representation.

Salton, G. (1968), Automatic Information Organization and Retrieval, McGraw-Hill, New York, gives a comprehensive discussion of building electronic dictionaries and of building vector representations of document content for automated information retrieval.

The following example is nevertheless illustrative. Consider the following sentence fragments, taken from the beginnings of consecutive sentences in three separate documents:

Document number	Document text

(1) The little boy...A small male

(2) A lazy boy...A funny girl

(3) The large boy...Some minor day

Suppose the dictionary contains the following root words (concept numbers) and words:

Concept number	Words

1. the，a

2. little，small，minor

3. boy，male

4. large

5. funny

6. girl

7. some

8. day

9. lazy

The three-dimensional vectors for the above document fragments, over the first three concept numbers (1-3), are built by counting the number of times words belonging to each concept number occur in the fragment. These vectors are:

Document number	Vector over first 3 concepts	Explanation

(1)	[2, 2, 2]	[the, a; little, small; boy, male]

(2)	[2, 0, 1]	[A, a; ; boy]

(3)	[1, 1, 1]	[The; minor; boy]
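The counting step can be sketched in a few lines of Python. The concept table below contains only the first three concepts of the example; the names `CONCEPTS`, `DOCS` and `concept_vector` are illustrative assumptions, not part of the described system.

```python
# First three concepts from the example dictionary (illustrative subset).
CONCEPTS = {
    1: {"the", "a"},
    2: {"little", "small", "minor"},
    3: {"boy", "male"},
}

# The three example document fragments.
DOCS = {
    1: "The little boy ... A small male",
    2: "A lazy boy ... A funny girl",
    3: "The large boy ... Some minor day",
}

def concept_vector(text, concepts=CONCEPTS):
    """Count, per concept number, how many words of the text belong to it."""
    words = [w.strip(".,").lower() for w in text.split()]
    return [sum(1 for w in words if w in concepts[n]) for n in sorted(concepts)]

for number, text in DOCS.items():
    print(number, concept_vector(text))
```

Extending `CONCEPTS` to the remaining concept numbers would simply add further dimensions to each vector.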

These three-dimensional vectors are shown graphically in Fig. 3.

In general, these ideas extend to the approximately 812 concepts in the Macquarie dictionary and to all the words in a document. The vectors are therefore built in about 812 dimensions. Vector theory carries over to these dimensions in exactly the same way, although such vectors are naturally difficult to visualise in this hyperspace.

From this vector representation of the essays, the ModelLength and StudentLength variables can be calculated by determining the length of the vectors in the usual way, i.e.

Length = SquareRoot(x*x + y*y + ... + z*z), where the vector is (x, y, ..., z).

Similarly, the StudentDotProduct variable can be calculated as the dot product between the model answer vector and the student essay vector in the usual way, i.e.

DotProduct = (x1*x2 + y1*y2 + ... + z1*z2), where the vectors are vector 1 (x1, y1, ..., z1) and vector 2 (x2, y2, ..., z2).

Next, the variable CosTheta can be calculated in the usual way, i.e.

Cos(theta) = DotProduct(v1, v2) / (length(v1) * length(v2)).
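As a worked check of these three formulas, the following Python sketch applies them to the example vectors [2, 2, 2], [2, 0, 1] and [1, 1, 1]. The function names are illustrative, and the sketch assumes neither vector is all zeros (otherwise the division is undefined).

```python
import math

def length(v):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

def dot_product(v1, v2):
    """Sum of the component-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(v1, v2))

def cos_theta(v1, v2):
    """Cosine of the angle between v1 and v2 (both assumed non-zero)."""
    return dot_product(v1, v2) / (length(v1) * length(v2))

model = [2, 2, 2]      # document 1, taken as the model answer
doc2 = [2, 0, 1]
doc3 = [1, 1, 1]

print(round(cos_theta(model, doc2), 4))
print(round(cos_theta(model, doc3), 4))
```

Document 3 points in the same direction as the model answer vector, so its cosine is 1 (up to floating-point rounding), while document 2 gives 6 / sqrt(60), roughly 0.77.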

If we suppose that document 1 is the model answer, we can see how semantically close documents 2 and 3 are to the model answer by examining how close their vectors are to it. The angle between the vectors varies with how "close" the vectors are. A small angle indicates that the documents contain similar content, and a large angle indicates that they do not have much content in common. Angle Theta1 is the angle between the model answer vector and the vector of document 2, and angle Theta2 is the angle between the model answer vector and the vector of document 3.

The cosines of Theta1 and Theta2 can be used as a measure of this closeness. If documents 2 and 3 are similar to the model answer, their vectors will be similar to the model answer vector, will lie along the same line as it, and will have a cosine of 1. If, on the other hand, they are completely different, their vectors will be perpendicular to the model answer vector and their cosine will be 0.

In practice, the cosine for a document usually lies between these upper and lower bounds.

The variable CosTheta used in the scoring algorithm is this cosine, calculated for the document being scored.

The variable VarRatio is determined by dividing the number of non-zero dimensions in the student answer by the number of non-zero dimensions in the model answer.

For example, the number of concepts appearing in the above model answer (document 1) is 3. This can be determined from the number of non-zero counts in the numerical vector representation.

The number of concepts appearing in document 2 above is 2, because the second vector index is 0. To calculate VarRatio for document 2, we divide the count of non-zero concepts in document 2 by the count of non-zero concepts in the model answer, i.e. VarRatio = 2/3 = 0.67. The corresponding VarRatio for document 3 is 3/3 = 1.00.

This simple variable is a very strong predictor of the essay score and usually appears as a component of the scoring algorithm.
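The VarRatio calculation for the example vectors can be reproduced with a short Python sketch. The function name `var_ratio` is an assumption made for illustration, and the model answer is assumed to contain at least one non-zero concept.

```python
def var_ratio(student, model):
    """Non-zero dimensions in the student vector divided by those in the model."""
    def nonzero(v):
        return sum(1 for x in v if x != 0)
    return nonzero(student) / nonzero(model)

model = [2, 2, 2]   # document 1 (model answer)
print(round(var_ratio([2, 0, 1], model), 2))   # document 2
print(round(var_ratio([1, 1, 1], model), 2))   # document 3
```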

To generate the following metrics, the conceptual model is used:

NoStudentConcepts；NoModelConcepts；

NonConceptualisedWordSRatio；RatioNPNouns；

RatioNPAjectives；RatioNPPrepositions；

RatioNPConjunctions；RatioVPVerbs；RatioVPAdverbs；

RatioVPAuxilliaries; RatioVPPrepositions; and

RatioVPConjunctions.

These can be determined as described above.

The calculation of the score and the metrics is shown in Fig. 4.

Once an essay has been scored, feedback can be given on where the essay covers the correct concepts and where it does not. As shown in Fig. 5, the count of each basic concept in the scored essay and the count of each basic concept expected in the answer can be displayed as the height of a bar for each concept.

Further, a word can be selected in the essay, and similar concepts in the essay can be shown by highlighting them.

Similarly, a concept can be selected in the model answer essay, and similar concepts in the scored essay can be shown by highlighting them.

The synonyms of a selected basic concept can also be displayed, as shown in Fig. 6.
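The data behind such feedback can be sketched as a comparison of concept counts. The sketch below is an assumption about one way to derive "covered" and "missing" concepts from the counts; the function name and the dictionary layout (concept number mapped to count) are hypothetical, and no display logic is included.

```python
def concept_feedback(model_counts, student_counts):
    """Split the concepts expected by the model answer into covered and missing."""
    covered, missing = [], []
    for concept, expected in model_counts.items():
        if expected == 0:
            continue  # the model answer does not expect this concept
        if student_counts.get(concept, 0) > 0:
            covered.append(concept)
        else:
            missing.append(concept)
    return covered, missing

# Counts taken from the example vectors [2, 2, 2] and [2, 0, 1].
model_counts = {1: 2, 2: 2, 3: 2}
student_counts = {1: 2, 2: 0, 3: 1}
print(concept_feedback(model_counts, student_counts))
```

The per-concept counts themselves are what a bar display of the kind described would plot side by side.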

Example

A regression equation was developed from about 100 human-scored training essays and an ideal or model answer. The document vectors described above were built. The values of a number of variables were calculated from the relationship between the content and vectors of the model answer and those of the training essays. Once training had been carried out and the scoring algorithm built, each unscored essay was processed to obtain the values of the independent variables, and the regression equation was then applied. CosTheta and VarRatio are usually highly significant in the scoring equation.

In a trial, Year 10 high school students hand-wrote essays on paper on the topic "The School Leaving Age". Three trained human graders then scored these essays according to a marking guide. The essays, 390 in total, were then transcribed into Microsoft Word document format. The essay with the highest average human score was selected as the model answer. The highest average human score was 48.5 out of a possible 54, or 90%. In one test of the system, 100 essays were used to build the scoring algorithm, namely the first 100 essays when ordered by ascending identifier. The prediction equation determined was:

Grade = -22.35 + 11.00*CosTheta + 15.70*VarRatio + 7.64*Characters Per Word + 0.20*Number of NP Adjectives

This produces a score out of 54. The prediction equation needs only 4 independent variables in this example. The remaining 290 essays were then scored using this equation.
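Applying the prediction equation is a single weighted sum, as the following Python sketch shows. The coefficients are those of the equation above; the input values in the usage example are hypothetical and are not taken from the trial.

```python
def predicted_grade(cos_theta, var_ratio, chars_per_word, np_adjectives):
    """Evaluate the four-variable prediction equation (score out of 54)."""
    return (-22.35
            + 11.00 * cos_theta
            + 15.70 * var_ratio
            + 7.64 * chars_per_word
            + 0.20 * np_adjectives)

# Hypothetical essay: perfect alignment and concept coverage,
# 4.5 characters per word, 10 noun phrase adjectives.
print(round(predicted_grade(1.0, 1.0, 4.5, 10), 2))
```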

The human average for these 290 essays was 30.34, and the average given by the computer's automatic scoring was 29.45, a difference of 0.89. The correlation between the human and automatic scores was 0.79. The mean absolute difference between the two was 3.90, which represents an average error rate of 7.23% when scored out of 54 (the maximum possible human score).

The correlations among the three human graders themselves were 0.81, 0.78 and 0.81.

The benefit of averaging the human graders' scores is shown by the fact that the correlation between the automatic score and the average of the three human scores was the highest, at 0.79, higher than each of the individual correlations of 0.67, 0.75 and 0.75.

The coefficients of the significant predictors, and the intercept, may be positive or negative. For example, one would expect the coefficient of the CosTheta predictor to be positive and the coefficient of SpellingError to be negative. However, because of the mathematical unpredictability of the data, this is often not the case.

Various transformations of the predictor metrics can also be used. These may include square roots and logarithms, which are standard transformations often used in linear regression. The fourth power of the number of words in the essay was found to be an effective predictor.

Other examples of equations calculated for batches of test essays include the following:

Grade = 31.49 + 18.92*CosTheta + 17.07*VarRatio - 0.23*Ease - 1.02*Level

for a score out of 54.

Grade = 27. + 16.07*CosTheta + 19.06*VarRatio - 0.21*Ease - 0.71*Level

for a score out of 54.

Grade = -19.59 + 7.16*CosTheta + 12.64*VarRatio + 0.07*Number of NP Adjectives + 1.82*Level

for a score out of 30.

It should be noted that scores are easily scaled, for example so as to be expressed as a percentage. Taking a score out of 54 as an example, the score can be multiplied by 100 and divided by 54 to obtain the percentage score.
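As a brief check of this scaling, the sketch below converts the highest average human score from the trial, 48.5 out of 54, into a percentage. The function name is illustrative only.

```python
def to_percent(score, max_score=54.0):
    """Scale a score out of max_score to a percentage."""
    return score * 100.0 / max_score

print(round(to_percent(48.5)))   # agrees with the 90% quoted for the model answer
```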

For scores out of roughly 30 to 50, the coefficients of CosTheta and VarRatio are usually between about 10 and 20. To obtain a percentage score, coefficients of about 20 to 40 can be used. A generic equation could nevertheless be devised, for example:

score = 20 + 40*CosTheta + 40*VarRatio - 10*SpellingErrors - 10*Grammatical Errors

Better results can be obtained by using regression analysis to determine the coefficients rather than fixing them at generic values.
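Determining coefficients by regression can be illustrated with a small ordinary least squares fit. The sketch below is not the procedure used in the trial: it solves the normal equations in plain Python, and the training rows are synthetic values generated from known coefficients purely to show that the fit recovers them.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    # Prepend 1.0 to every row so that the first coefficient is the intercept.
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

# Synthetic (CosTheta, VarRatio) pairs; NOT data from the trial.
rows = [(0.9, 1.0), (0.5, 0.6), (0.7, 0.4), (0.2, 0.8), (0.6, 0.9)]
# Synthetic scores generated from intercept 10 and coefficients 20 and 15.
targets = [10 + 20 * c + 15 * v for c, v in rows]

intercept, b_cos, b_var = fit_linear(rows, targets)
print(round(intercept, 3), round(b_cos, 3), round(b_var, 3))
```

With real human-scored essays the targets would not lie exactly on a plane, and the fitted coefficients would be the least squares estimates rather than exact values.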

A more detailed set of flowcharts is included in Fig. 7. A set of pseudocode explaining these flowcharts is listed in Appendix 1.

The skilled person will appreciate that various modifications and changes can be made to the present invention without departing from the basic inventive concept.

The present invention can be used in applications other than essay scoring, for example in the field of document searching, where the "model answer" document is a document containing the search criteria. Other applications, and the manner of using the present invention in those applications, will be apparent to those skilled in the art.

The present invention can also be used in other applications, for example in the field of machine translation of documents.

Such modifications and variations are intended to fall within the scope of the present invention, the nature of which is to be determined from the foregoing description.

Appendix 1: Pseudocode of the automatic essay scoring system

- Explanation of the flowchart of Fig. 3

1.0MarkIT

●Structure Document(Model Answer)(2.0)

●Structure Document(Student Answer)(2.0)

●Compute Ratios Between Model Answer and Student Answer(10.2)

●Compute Student Mark

2.0Structure Document(document)

●Chunk document into paragraphs(2.1)

●For each paragraph in the document(3.0)

○Set all concepts hit counts to zero(9.2)

○Chunk paragraph into sentences(3.1)

○For each sentence in the paragraph(4.0)

■Word list＝Chunk sentence into words(4.1.1)

■Get a list of non-empty words from word list(4.1.2)

■Tag each non-empty word with its Part of Speech(POS)[third-party]

■Chunk Sentence Into Phrases(4.1.4)

○Compute total hit counts for each concept by adding up the concept’s hit count and their related concepts’ hit counts(9.3，8.1)

○Contextualise each word(3.2，4.2，5.2，6.2，7.2)

●Compute grammatical statistics(10.1)

4.1.4Chunk sentence into phrases(word list)

●Current phrase type＝Untyped

●Get the first three words from word list into word1，word2 and word3

●While word1＜＞null

○New phrase type＝Look up phrase type(word1’s POS，word2’s POS，word3’s POS)in table 1，from top to bottom(5.3)

○If new phrase type＜＞current phrase type

■Current phrase＝new phrase

○Add word1 to current phrase(5.1)

○Word1＝word2，word2＝word3，word3＝next word from word list

5.1 Add Word into a phrase(word)

●Successful＝Add word into current phrase row(6.1)

●If not successful

○Current phrase row＝new phrase row

○New phrase row’s current slot＝0

○Add word into current phrase row(6.1)

6.1 Add Word into a phrase row(word)

●If row type＜＞INVALID and word’s POS＜＞NO_POS

○Search for next POS slot from current slot(inclusive)onwards(table 2)

○If end of the row

■Return false

○Else

■Slot word

■Current slot＝current slot+1

■Set word’s concept(7.1)

■Return true

●Else

○Slot word

○If word’s POS＜＞NO_POS

■Set word’s concept(7.1)

○Return true
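Routines 5.1 and 6.1 together place each word of a phrase in the next matching slot of the current row and start a new row when no slot remains. The Python sketch below is one possible reading of that pseudocode, using the noun phrase template CONJ PREP DET ADJ ADJ ADJ N from the description; the function name is hypothetical, concept assignment (7.1) is omitted, and "current slot" is assumed to advance to the slot just after the one filled.

```python
# Noun phrase row template from the description (PREP is shown as P in the tables).
NP_TEMPLATE = ["CONJ", "PREP", "DET", "ADJ", "ADJ", "ADJ", "N"]

def fill_rows(tagged_words, template=NP_TEMPLATE):
    """Place (word, POS) pairs into template rows, opening new rows as needed."""
    rows = [[None] * len(template)]
    slot = 0
    for word, pos in tagged_words:
        # Search for the next slot of this POS from the current slot onwards.
        idx = next((i for i in range(slot, len(template)) if template[i] == pos), None)
        if idx is None:
            # End of the row reached: start a new row and search from slot 0.
            rows.append([None] * len(template))
            idx = template.index(pos)  # raises ValueError if POS is not in the template
        rows[-1][idx] = word
        slot = idx + 1
    return rows

# Phrase 3 of the worked example, with the tags shown there.
phrase = [("it's", "N"), ("a", "N"), ("good", "ADJ"), ("idea", "N"),
          ("for", "PREP"), ("the", "DET"), ("Government", "N")]
for row in fill_rows(phrase):
    print([w for w in row if w is not None])
```

Under these assumptions the sketch yields the same four rows as Phrase 3 in the worked example: "it's", "a", "good idea" and "for the Government".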

7.1Set word’s concept

●Get concept list(word，POS)(9.4)

●If concept list＝＝null

○Stemmed word＝Stem word using Porter Stemmer[third-party]

○Get concept list(stemmed word，POS)(9.4)

9.4 Get concept list(word，POS)

●Concept list＝Look up concepts related to word & POS in the database system

●If concept list＜＞null

○For each concept number＜＝MAX_CONCEPT_NUMBER

■Concept[number]’s hit count++

●Return concept list
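Routines 7.1 and 9.4 can be sketched together in Python as a look-up with a stemming fallback. The sketch rests on several assumptions: the in-memory `CONCEPTS` dictionary stands in for the database system, `naive_stem` is a crude suffix stripper standing in for the third-party Porter Stemmer, and the MAX_CONCEPT_NUMBER check is omitted.

```python
from collections import Counter

# Hypothetical stand-in for the database: (word, POS) -> list of concept numbers.
CONCEPTS = {("boy", "N"): [3], ("little", "ADJ"): [2]}
hit_counts = Counter()

def naive_stem(word):
    """Crude suffix stripping; a placeholder for a real Porter stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def get_concept_list(word, pos):
    """Routine 9.4: look up concepts and bump the hit count of each one found."""
    concepts = CONCEPTS.get((word.lower(), pos))
    if concepts is not None:
        for number in concepts:
            hit_counts[number] += 1
    return concepts

def set_word_concept(word, pos):
    """Routine 7.1: try the word itself, then fall back to its stem."""
    concepts = get_concept_list(word, pos)
    if concepts is None:
        concepts = get_concept_list(naive_stem(word.lower()), pos)
    return concepts

print(set_word_concept("boys", "N"), dict(hit_counts))
```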

7.2 Set word’s most relevant concept

●If concept list＜＞null

○Most relevant concept＝one of the concepts with the highest total hit count

Claims

1. A method of comparing text based documents, comprising:

2. The method of claim 1, wherein the lexical normalisation converts each word in each document into a representation of a basic concept defined in a dictionary.

3. The method of claim 2, wherein each word is used to look up the basic concept of that word in the dictionary.

4. The method of claim 2 or 3, wherein each basic word is assigned a numerical value.

5. The method of any one of claims 1 to 4, wherein the normalisation produces a numerical representation of the document.

6. The method of any one of claims 2 to 4, wherein each normalised basic concept forms one dimension of the vector representation.

7. The method of claim 6, wherein the number of occurrences of each normalised basic concept is counted.

8. The method of claim 7, wherein the count of each normalised basic concept forms the length of the vector in the respective dimension of the vector representation.

9. The method of any one of claims 1 to 8, wherein the comparison of the alignment of the vector representations produces the score by determining the cosine of the angle (theta) between the vectors.

10. The method of claim 9, wherein cos(theta) is calculated from the dot product of the vectors and the lengths of the vectors.

11. The method of any one of claims 2 to 4 and 6 to 8, wherein the number of basic concepts in each document is counted.

12. The method of claim 11, wherein the concept count of the second document is compared with the concept count of the first document to contribute to the score of the similarity of the second document to the first document.

13. The method of claim 12, wherein the contribution of each basic concept with a non-zero count is 1.

14. The method of claim 12 or 13, wherein the comparison is a ratio.

15. The method of any one of claims 1 to 14, wherein the first document is a model answer essay, the second document is an essay to be scored, and the score is the score of the second essay.

16. The method of any one of claims 1 to 15, further comprising:

partitioning the words of the first document into noun phrases and verb clauses;

comparing the partitioning of the first document with the partitioning of the second document to contribute to the score of the similarity of the second document to the first document.

17. A system for comparing text based documents, comprising:

means for lexically normalising the text of a first document;

18. The system of claim 17, further comprising means for looking up a dictionary to find a basic concept from each word in each document and for providing the basic concept to each of the means for lexically normalising each word in each document, wherein each such means converts each word into a representation of the corresponding basic concept.

19. The system of claim 18, wherein each means for building a vector representation forms one dimension of the vector representation from each normalised basic concept.

20. The system of claim 19, wherein each means for building a vector representation counts the number of occurrences of each normalised basic concept, the count forming the length of the vector in the respective dimension of the vector representation.

21. The system of any one of claims 17 to 20, wherein the means for comparing the alignment of the vector representations produces the score by determining the cosine of the angle (theta) between the vectors.

22. The system of claim 21, wherein the means for comparing the alignment of the vector representations is configured to calculate cos(theta) from the dot product of the vectors and the lengths of the vectors.

23. The system of claim 20, wherein each means for building a vector representation counts the number of non-zero basic concepts in each document.

24. The system of claim 23, wherein the means for comparing the alignment of the vector representations compares the concept count of the second document with the concept count of the first document to contribute to the score of the similarity of the second document to the first document.

25. A method of comparing text based documents, comprising:

partitioning the words of a first document into noun phrases and verb clauses;

26. The method of claim 25, wherein each word in the documents is lexically normalised into a basic concept.

27. The method of claim 25 or 26, wherein the comparison of the partitioning of the documents is made by determining the following ratios: the ratio of the number of noun phrase components of one or more types in the second document to the number of noun phrase components of the corresponding type in the first document, and the ratio of the number of verb clause components of one or more types in the second document to the number of verb clause components of the corresponding type in the first document, wherein these ratios contribute to the score.

28. The method of claim 27, wherein the types of noun phrase components are: noun phrase nouns, noun phrase adjectives, noun phrase prepositions and noun phrase conjunctions.

29. The method of claim 27 or 28, wherein the types of verb clause components are: verb clause verbs, verb clause adverbs, verb clause auxiliaries, verb clause prepositions and verb clause conjunctions.

30. The method of claim 24, wherein the first document is a model answer essay, the second document is an essay to be scored, and the score is the score of the second essay.

31. A system for comparing text based documents, comprising:

32. A method of comparing text based documents, comprising:

33. The method of claim 32, further comprising:

partitioning the words of the first document into noun phrases and verb clauses;

34. A system for comparing text based documents, comprising:

35. A method of scoring a text based essay document, comprising:

providing a model answer;

providing a plurality of manually scored essays;

providing a plurality of essays to be scored;

determining coefficients from the manually scored essays;

36. The method of claim 35, wherein determining the coefficients from the manually scored essays is performed by linear regression.

37. The method of claim 35 or 36, wherein the metrics include a score produced by the method of comparing text based documents of any one of claims 1 to 16, 25 to 30 or 32 to 33.

38. A system for scoring a text based essay document, comprising:

39. A method of providing visual feedback on essay scoring, comprising:

40. The method of claim 39, wherein each basic concept corresponds to a basic meaning of a word as defined in a dictionary.

41. The method of claim 39 or 40, wherein the count of each basic concept is determined by lexically normalising each word in the scored essay to produce a representation of the basic meanings in the scored essay, and counting the occurrences of each basic meaning in the scored essay.

42. The method of claim 41, wherein the count of each basic concept is determined by lexically normalising each word in the model essay to produce a representation of the basic meanings of the model essay, and counting the occurrences of each basic meaning in the model essay.

43. The method of any one of claims 39 to 42, further comprising selecting a concept in the scored essay and displaying the words in the scored essay that belong to that concept.

44. The method of claim 43, wherein words in the scored essay that relate to other concepts are also displayed.

45. The method of any one of claims 39 to 44, further comprising selecting a concept in the model essay and displaying the words in the model essay that belong to that concept.

46. The method of claim 45, wherein words in the model essay that relate to other concepts are also displayed.

47. The method of any one of claims 39 to 46, further comprising displaying synonyms of a selected basic concept.

48. A system for providing visual feedback on essay scoring, comprising:

49. A method of numerically representing a document, comprising:

lexically normalising each word of the document;

50. The method of claim 49, wherein a plurality of words is used to determine whether each part is a noun phrase or a verb clause.

51. The method of claim 49, wherein the first three words of each part are used to determine whether the part is a noun phrase or a verb clause.

52. The method of any one of claims 49 to 51, wherein each word in a part is allocated to a column-wise slot of a noun phrase table or to a column-wise slot of a verb clause table.

53. The method of claim 52, wherein each slot of a table is allocated to a grammatical type of word.

54. The method of claim 53, wherein words are allocated sequentially to slots in the appropriate table if they have the grammatical type of the next slot.

55. The method of claim 54, wherein, if the next word does not belong in the next slot, that slot is left blank and the sequential allocation moves on by one slot position.

56. The method of claim 55, wherein, if the next word does not belong to the table type of the current part, this indicates the end of the current part.

57. The method of any one of claims 52 to 56, wherein the tables have a plurality of rows, so that when the next word does not fit into the remainder of the row after the current word in the current part, but the word does not indicate the end of the current part, the next word is placed in the next row of the table.

58. A system for numerically representing a document, comprising:

means for lexically normalising each word of the document;

59. A computer program configured to control a computer to perform the method of any one of claims 1 to 16, 25 to 30, 32 to 33, 35 to 37, 39 to 47 or 49 to 57.

60. A computer program configured to control a computer to operate as the system of any one of claims 17 to 24, 31, 34, 38, 48 or 58.

61. A computer readable storage medium comprising the computer program of claim 59 or 60.