CN108519974A

CN108519974A - Method for automatic detection and analysis of grammatical errors in English compositions

Info

Publication number: CN108519974A
Application number: CN201810279338.3A
Authority: CN
Inventors: 黄翰; 刘方青; 卢尔昂; 郝志峰; 许悦婷
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2018-03-31
Filing date: 2018-03-31
Publication date: 2018-09-11

Abstract

The present invention provides a method for automatically detecting and analyzing grammatical errors in English compositions. The method first splits the input composition into sentences, then tokenizes each sentence into words and spell-checks the words. Once the spelling is confirmed to be correct, all words are tagged with parts of speech, and the tags of words that received multiple labels are corrected. Rule flow charts for different error instances are then constructed, and existing grammar rules are combined with the error instances to check the grammar of each sentence. Finally, the positions of the grammatical errors in the composition are located and specific revision suggestions are given. The invention can locate grammatical errors and report the specific erroneous content together with a solution; the grammar rules can also be extended by modifying the error-instance flow charts. The invention has a strong capability for detecting and correcting grammatical errors in compositions, can quickly check the grammar of an English composition and return feedback, and is applicable to real-time environments.

Description

Method for automatic detection and analysis of grammatical errors in English compositions

Technical field

The present invention relates generally to the field of natural language processing research, and in particular to a method for automatically detecting and analyzing grammatical errors in English compositions.

Background art

Today's world is highly globalized, and in this context English has become an essential bridge for communicating with the world. English is the most widely used language in the world: about 400 million people speak it as their mother tongue, while more than 1 billion people use it as a second language. Non-native learners of English write as much as 70% of English text, and they inevitably make grammatical mistakes when writing in English. In academic exchange in particular, English-language papers are an important means of demonstrating a person's academic level and play an important role. An automatic grammar detection method for English compositions can, to some extent, help English learners of all kinds check their own writing for grammatical errors and avoid some low-level mistakes.

Existing research in natural language grammar processing falls into two broad categories: theoretical research in linguistics and pedagogy, and the research and development of application software systems. The former gives the latter theoretical support, and the latter gives the former technical support, but domestic work on software applications currently remains largely at the stage of theoretical research. Abroad there are some relatively mature methods, each with its own strengths and weaknesses. Project Essay Grader (PEG) was the earliest intelligent English composition marking system in the world, but its operating efficiency is very low and it can barely complete the scoring task. AES (Automatic Essay Scoring) systems can likewise mark and score English compositions automatically and judge grammatical errors, but they still have difficulty with semantic, structural, and pragmatic errors. Current domestic AES systems offer a variety of complex functions such as grammar detection, composition scoring, and topic relevance detection, but the accuracy of their core grammar correction component is very low, the detection rules cannot be customized to different needs, and the systems are not sufficiently extensible. The present invention focuses on these two aspects: it maintains a high error detection rate for English compositions while also leaving room to extend the grammar rules.

Summary of the invention

In view of the large current demand for grammatical error detection in English compositions and the shortcomings of existing grammar checking methods, the present invention provides a method for automatically detecting and analyzing grammatical errors in English compositions. The method aims to help English learners check their own writing automatically: without the help of an instructor, it points out grammatical errors and helps learners improve their English. The specific technical solution is as follows.

A method for automatically detecting and analyzing grammatical errors in English compositions, comprising the following steps:

(a) the English composition submitted at the front end is obtained and subjected to sentence splitting and word tokenization;

(b) all words obtained by tokenization in step (a) are spell-checked, and the spelling results and any fixed phrase collocations present are fed back;

(c) if the spelling is correct, all spell-checked words from step (b) are tagged with parts of speech using the Stanford analyzer;

(d) a word tagged in step (c) may carry several part-of-speech labels; the probability of each label is calculated and the label with the highest probability is selected;

(e) negative-example rule flow charts over part-of-speech labels are constructed for common English grammatical errors;

(f) each word with its part-of-speech label from step (d) is compared against the negative-example rule flow charts of step (e) according to its label; the grammatical errors of the composition and the recommended modifications are returned to the front end, and the data are simultaneously stored in the database.

Further, in step (a), the sentence splitting rule marks every full stop in an English composition (including ".", "!", and "?") as follows: [left word] [prefix] [full stop] [suffix] [right word], where the prefix is the character string directly attached before the full stop, the suffix is the character string directly attached after it, and the right word is the next word immediately following the full stop. Each full stop is then judged to be a sentence end or not according to flow (1)-(5), thereby splitting the text into sentences:

(1) if the last character of the prefix is ".", the full stop is judged to be a sentence end;

(2) if the suffix is empty and the right word is empty, the full stop is the end of a paragraph and therefore also a sentence end;

(3) if the suffix is a space, the right word is not empty, and the right word begins with a capital letter, the full stop is a sentence end provided the prefix is not an abbreviation such as Mr, Mrs, Ms, Dr, or Miss; a stop list of common abbreviations is established for this purpose;

(4) if the suffix is not a space and begins with a capital letter, the full stop is a sentence end provided neither the prefix nor the suffix contains another full stop;

(5) in all other cases the full stop is judged not to be a sentence end.

Further, in step (b), the English word database is built with a multi-level index table, using the first three letters of each word as the index of the dictionary.

Further, in step (b), a fixed phrase library is constructed; SQL statements combined with regular expressions are used to filter the fixed phrases occurring in the text, and the fixed phrases used in the text are flagged.

Further, in step (b), to spell-check the words obtained by tokenization, a dictionary storing standard English words is first established; every word is then matched against the dictionary. A word that matches is considered correctly spelled, and a failed match indicates a spelling error.

Further, in step (c), the open-source part-of-speech tagger (POS Tagger) uses the contextual information in the sentence to assign each sentence element a part-of-speech tag corresponding to the sentence structure, and outputs the tagged text.

Further, in step (d), words with multiple parts of speech often occur in the tagged text. Each word is labeled using the Penn Treebank tag set, and the label with the highest probability is then calculated according to the calculation formula (formula not reproduced in this text).

Further, in step (e), sentences exhibiting various English grammar patterns are statistically classified, common grammatical errors are counted, the logic of each error is analyzed and modeled, and, based on the tags produced by the part-of-speech tagger, basic logic flows for detecting incorrect grammar are derived.

Further, in step (f), after preprocessing and part-of-speech tagging the text is output in the form "... ...";

The label Tag serves as the trigger condition of the negative-example rules, and each label Tag corresponds to one group of rules in the rule base. When a label is scanned, the basic logic flow is used to match grammatical errors; if the match succeeds, the detected grammatical error and the corresponding revision suggestion are fed back.

Further, in step (f), stop words and exception-exclusion handling are added to some grammar rules, in particular for checking subject-predicate number agreement in long sentences.

Compared with the prior art, the present invention has the following advantages and technical effects:

Among existing techniques for detecting and analyzing grammar in English compositions, syntax-based methods can quickly judge whether a sentence contains an error, but cannot mark the error position or give an error prompt. Statistics-based methods have a high error detection rate and can locate errors, but cannot explain the specific cause of an error or return revision suggestions. The present invention uses a grammar checking method based on error instances and rules: the composition is first split into sentences, tokenized, and tagged with parts of speech; flow charts of the grammatical errors associated with each part-of-speech tag are then constructed; and these flow charts are used to judge whether a sentence contains a grammatical error and to suggest modifications. Checking based on error instances locates the content and position of a grammatical error, and rule-based checking supplies the concrete way to correct it.

Description of the drawings

Fig. 1 is a schematic diagram of the multi-level index structure in the embodiment.

Fig. 2 is a flow chart of a grammar rule based on part-of-speech tagging in the example.

Fig. 3 is a flow chart of the method for automatic detection and analysis of grammatical errors in English compositions in the embodiment.

Fig. 4 is a schematic diagram of the fixed phrase prompt in the example.

Fig. 5 is the first logic flow chart of the grammar rules based on part-of-speech tagging;

Fig. 6 is the second logic flow chart of the grammar rules based on part-of-speech tagging;

Fig. 7 is the third logic flow chart of the grammar rules based on part-of-speech tagging;

Fig. 8 is a schematic diagram of a simple incorrect-grammar rule in the example.

Detailed description of the embodiments

Embodiments of the present invention are further described below with reference to examples, but the implementation of the invention is not limited thereto. Any process or symbol not described in particular detail below can be understood or implemented by those skilled in the art with reference to the prior art.

The method for automatically detecting and analyzing grammatical errors in English compositions comprises the following steps:

In step (a), the sentence splitting rule marks every full stop in an English composition (including ".", "!", and "?") as follows: [left word] [prefix] [full stop] [suffix] [right word], where the prefix is the character string directly attached before the full stop, the suffix is the character string directly attached after it, and the right word is the next word immediately following the full stop. Each full stop is then judged to be a sentence end or not according to flow (1)-(5), thereby splitting the text into sentences;

In step (b), the English word database is built with a multi-level index table, using the first three letters of each word as the index of the dictionary; the construction method is shown in Fig. 1.

In step (b), a fixed phrase library is constructed; SQL statements combined with regular expressions are used to filter the fixed phrases occurring in the text, and the fixed phrases used in the text are flagged.
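The combination of SQL statements and regular expressions for fixed phrases can be sketched as follows. This is only an illustration under stated assumptions: the table name `phrases`, the helper names `build_phrase_db` and `find_fixed_phrases`, the in-memory SQLite database, and the sample phrases are all hypothetical and do not come from the patent, which does not specify the database schema or the queries.

```python
import re
import sqlite3


def build_phrase_db(phrases):
    """Create an in-memory phrase library (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE phrases (phrase TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO phrases VALUES (?)", [(p,) for p in phrases])
    return conn


def find_fixed_phrases(sentence, conn):
    """Return (phrase, start, end) for each fixed phrase found in the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    hits = []
    for word in sorted(words):
        # SQL pre-filter: only phrases whose first word occurs in the sentence.
        rows = conn.execute(
            "SELECT phrase FROM phrases WHERE phrase LIKE ?", (word + " %",)
        ).fetchall()
        for (phrase,) in rows:
            # Regular expression confirms the whole phrase, allowing any
            # whitespace between its words and ignoring case.
            pattern = r"\b" + r"\s+".join(map(re.escape, phrase.split())) + r"\b"
            for match in re.finditer(pattern, sentence, re.IGNORECASE):
                hits.append((phrase, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[1])


conn = build_phrase_db(["look forward to", "as well as", "in order to"])
for hit in find_fixed_phrases("I look forward to meeting you as well as your team.", conn):
    print(hit)
```

The SQL query narrows the candidates cheaply, and the regular expression then confirms the exact span, so the positions can be used both to prompt the phrase and to exclude it from later rule matching.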

In step (b), to spell-check the words obtained by tokenization, a dictionary storing standard English words is first established; every word is then matched against the dictionary. A word that matches is considered correctly spelled, and a failed match indicates a spelling error.

In step (c), the open-source part-of-speech tagger (POS Tagger) uses the contextual information in the sentence to assign each sentence element a part-of-speech tag corresponding to the sentence structure, and outputs the tagged text.

In step (d), words with multiple parts of speech often occur in the tagged text. Each word is labeled using the Penn Treebank tag set, shown in Table 1, and the label with the highest probability is then calculated according to the calculation formula (formula not reproduced in this text).

Table 1. Penn Treebank tag set

Tag	Description	Tag	Description
CC	Coordinating conjunction	PRP$	Possessive pronoun
CD	Cardinal number	RB	Adverb
DT	Determiner	RBR	Adverb, comparative
EX	Existential there	RBS	Adverb, superlative
FW	Foreign word	RP	Particle
IN	Preposition or subordinating conjunction	SYM	Symbol
JJ	Adjective or ordinal number	TO	to (preposition or infinitive marker)
JJR	Adjective, comparative	UH	Interjection
JJS	Adjective, superlative	VB	Verb, base form
LS	List item marker	VBD	Verb, past tense
MD	Modal auxiliary	VBG	Verb, present participle or gerund
NN	Common noun, singular or uncountable	VBN	Verb, past participle
NNP	Proper noun, singular	VBP	Verb, non-third-person singular present
NNPS	Proper noun, plural	VBZ	Verb, third-person singular present
NNS	Common noun, plural	WDT	Wh-determiner
PDT	Predeterminer	WP	Wh-pronoun (interrogative pronoun)
POS	Possessive ending	WP$	Possessive wh-pronoun (whose)
PRP	Personal pronoun	WRB	Wh-adverb (interrogative adverb)

In step (e), sentences exhibiting various English grammar patterns are statistically classified, common grammatical errors are counted, the logic of each error is analyzed and modeled, and, based on the tags produced by the part-of-speech tagger, basic logic flows for detecting incorrect grammar are derived; a specific example flow chart is shown in Fig. 2.

In step (f), after preprocessing and part-of-speech tagging the text is output in the form "... ..."; the meaning of the ellipses is clear to those skilled in the art and can be uniquely determined;

The label Tag serves as the trigger condition of the negative-example rules, and each label Tag corresponds to one group of rules in the rule base. When a label is scanned, the basic logic flow is used to match grammatical errors; if the match succeeds, the detected grammatical error and the corresponding revision suggestion are fed back.

In step (f), stop words and exception-exclusion handling are added to some grammar rules, in particular for checking subject-predicate number agreement in long sentences.

Further, as shown in Fig. 3, the method for automatically detecting and analyzing grammatical errors in English compositions comprises the following steps:

Step 1: sentence splitting and tokenization of the English composition obtained from the front end.

Every full stop in the English text (including ".", "!", and "?") is marked as follows:

[left word] [prefix] [full stop] [suffix] [right word]

Here the prefix is the character string directly attached before the full stop, the suffix is the character string directly attached after it, and the right word is the next word immediately following the full stop. Sentences are then identified according to the following flow:

(1) If the last character of the prefix is ".", the full stop is judged to be a sentence end. (This covers a sentence that ends with an abbreviation, such as "We will meet at 5 pm..")

(2) If the suffix is empty and the right word is empty, the full stop is the end of a paragraph and therefore also a sentence end.

(3) If the suffix is a space, the right word is not empty, and the right word begins with a capital letter, the full stop is a sentence end provided the prefix is not an abbreviation such as Mr, Mrs, Ms, Dr, or Miss. A stop list of common abbreviations is established for this purpose.

(4) If the suffix is not a space and begins with a capital letter, the full stop is a sentence end provided neither the prefix nor the suffix contains another full stop.

(5) In all other cases the full stop can generally be judged not to be a sentence end.

The five-step flow (1)-(5) above completes the sentence splitting of the composition. Tokenization is comparatively simple: after the text is split on spaces, English contractions and hyphenated words are checked.
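Flow (1)-(5) can be sketched in code roughly as below. The sketch rests on several assumptions that the text does not settle: the prefix is taken to be the run of non-space characters before the full stop, the suffix is either a single space or the run of non-space characters after it, and the garbled character in rule (1) is read as ".". The names `is_sentence_end`, `split_sentences`, and `ABBREVIATIONS` are hypothetical.

```python
import re

TERMINATORS = ".!?"
# Stop list of common abbreviations mentioned in rule (3).
ABBREVIATIONS = {"Mr", "Mrs", "Ms", "Dr", "Miss"}


def is_sentence_end(text, i):
    """Apply rules (1)-(5) to the terminator at position i."""
    j = i
    while j > 0 and not text[j - 1].isspace():
        j -= 1
    prefix = text[j:i]
    rest = text[i + 1:]
    if rest[:1].isspace():
        suffix = " "
        tokens = rest.split()
    else:
        suffix = re.match(r"\S*", rest).group(0)
        tokens = rest[len(suffix):].split()
    right_word = tokens[0] if tokens else ""

    if prefix.endswith("."):                      # rule (1)
        return True
    if suffix == "" and right_word == "":          # rule (2)
        return True
    if suffix == " " and right_word and right_word[0].isupper():
        return prefix not in ABBREVIATIONS         # rule (3)
    if suffix not in ("", " ") and suffix[0].isupper():
        return "." not in prefix and "." not in suffix  # rule (4)
    return False                                   # rule (5)


def split_sentences(text):
    """Split text into sentences at the terminators accepted by the rules."""
    text = text.strip()
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in TERMINATORS and is_sentence_end(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith arrived at 5 p.m.. He met Mr. Lee. They talked!"))
```

Because the rules are checked in order, an abbreviation such as "Dr." is rejected by rule (3) before rule (4) is considered, which keeps titles attached to the following name.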

Step 2: spell checking of words.

Spell checking mainly checks the text for non-word errors, i.e., misspellings that produce a string that does not exist in the language. For example, writing the word "thank" as "thonk" is a non-word error, whereas writing it as "think" falls outside the scope of non-word error checking. To check for non-word errors, a dictionary of standard English words is first established; the text is then matched against the dictionary as it is processed. If a word matches, it is spelled correctly; if it fails to match, a misspelling is indicated at that position.

Spell checking also requires a dictionary that includes specialized vocabulary. To solve the drop in matching efficiency when the vocabulary is very large, the invention optimizes the English word database with a multi-level index table (see Fig. 1): the first three letters of each word serve as the dictionary index, so that the number of matching operations for most English words is kept between 2 and 4. The invention also constructs a phrase library; SQL statements combined with regular expressions are used to filter the fixed phrases occurring in the text, which reduces false detections, and a prompt is added for each fixed phrase used in the text (see Fig. 4).
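A minimal sketch of a dictionary indexed by the first three letters of each word is given below. It assumes nested in-memory mappings rather than the database tables of Fig. 1, which is not reproduced here; the names `build_index` and `is_known` and the tiny word list are hypothetical.

```python
def build_index(words):
    """Build a three-level index: 1st letter -> 2nd letter -> 3rd letter -> words."""
    index = {}
    for word in words:
        w = word.lower()
        level = index
        for k in range(2):
            # w[k:k+1] is "" when the word is shorter than k+1 letters.
            level = level.setdefault(w[k:k + 1], {})
        level.setdefault(w[2:3], set()).add(w)
    return index


def is_known(index, word):
    """Return True if the word is in the dictionary (i.e. not a non-word error)."""
    w = word.lower()
    level = index
    for k in range(2):
        level = level.get(w[k:k + 1])
        if level is None:
            return False
    return w in level.get(w[2:3], set())


index = build_index(["thank", "think", "the", "a"])
for candidate in ["thank", "thonk", "think", "A"]:
    print(candidate, is_known(index, candidate))
```

Each lookup follows at most three index steps before reaching a small bucket of words, so a misspelling such as "thonk" is rejected without scanning the whole vocabulary.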

Third walks, and part of speech label is carried out using Stamford analyzer.

Part-of-speech tagging is carried out using part-of-speech tagging device (POS Tagger), POS Tagger can be each sentence element point The part-of-speech tagging for corresponding to sentence structure with one, finally output carry the text of part-of-speech tagging.Part-of-speech tagging uses guest State treebank tally set (Penn Treebank), Binzhou treebank tally set can be shown in Table 1.

4th step, more word meaning part of speech labels are corrected.

After part-of-speech tagging, words that have received multiple labels may appear. For example:

The result of tagging the sentence "He went to school with a dog yesterday." with the Penn Treebank tag set is:

“He/PRP went/VBD to/TO school/NN with/IN a/DT dog/NN yesterday/NN ./.”

In this example, He is tagged as a personal pronoun, went as a past-tense verb, to with the TO label (preposition or infinitive marker), school, dog, and yesterday as singular or uncountable common nouns, with as a preposition, and a as a determiner.
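The word/TAG output shown above is easy to turn into (word, tag) pairs for the later rule matching. The following is a small sketch; the function name `parse_tagged` is hypothetical, and it assumes tokens are separated by whitespace with the tag after the last slash.

```python
def parse_tagged(text):
    """Parse 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last slash so that the token "./." yields (".", ".").
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"token without a tag: {token!r}")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged(
    "He/PRP went/VBD to/TO school/NN with/IN a/DT dog/NN yesterday/NN ./."
)
print(tagged)
```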

In real English text many words belong to more than one part of speech. For example, "close" can be understood as a verb ("to shut") or as an adjective ("intimate"). The job of the part-of-speech tagger is therefore to assign, on the basis of large-scale statistics, the correct label to a word with multiple parts of speech in each sentence.

For a word that has multiple labels in the annotated corpus, i.e., a word with multiple parts of speech, the tagger calculates the label with the highest probability according to Formula 3-1 (formula not reproduced in this text).

Finally, the part-of-speech label with the highest probability is chosen.
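Because Formula 3-1 is not reproduced in this text, the sketch below substitutes a simple relative-frequency estimate, P(tag | word) = count(word, tag) / count(word), purely as an assumption; it is not the patent's formula, and unlike a real tagger it ignores sentence context. The names `train_tag_counts`, `tag_probabilities`, and `most_probable_tag` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_tag_counts(tagged_corpus):
    """Count how often each word occurs with each tag in an annotated corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return counts


def tag_probabilities(counts, word):
    """Relative-frequency estimate of P(tag | word); an assumed stand-in formula."""
    tag_counts = counts.get(word.lower())
    if not tag_counts:
        return {}
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}


def most_probable_tag(counts, word):
    """Pick the label with the highest estimated probability, or None if unseen."""
    probs = tag_probabilities(counts, word)
    return max(probs, key=probs.get) if probs else None


corpus = [("close", "VB")] * 3 + [("close", "JJ")] + [("dog", "NN")]
counts = train_tag_counts(corpus)
print(tag_probabilities(counts, "close"), most_probable_tag(counts, "close"))
```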

Step 5: grammar filtering and analysis.

This step requires a manually constructed rule base of incorrect grammar, with rules built on part-of-speech labels. First, the part-of-speech tagger analyzes the part of speech of every word of every sentence in the text according to its syntactic relations. Then sentences exhibiting various English grammar patterns are statistically classified, common grammatical errors are counted, the logic of each error is analyzed and modeled, and, based on the tagger's labels, the incorrect-grammar rules obtained are designed as a series of basic logic flows. Some of the rule flow charts are shown in Figs. 5, 6, and 7.

Step 6: the composition is processed in the back end, and the results are returned to the front end and the database.

After preprocessing and part-of-speech tagging, the text is output in the form "... ...". The method uses the label Tag as the trigger condition of the negative-example rules, and each label Tag corresponds to one group of rules in the rule base. When a label is scanned, the negative-example rules for that label in the rule base are matched; once a match succeeds, the error and a revision suggestion are fed back.

For example, the rule base contains the following incorrect-grammar rule:

PRP (non-third-person singular form) + VBZ (third-person singular form)

As shown in Table 1, in the Penn Treebank tag set PRP denotes a personal pronoun and VBZ denotes a verb in the third-person singular simple present. The rule expresses that a pronoun that is not third-person singular is followed by a third-person singular verb. If the input is "I has a cat", it matches this error rule, and the method determines that the sentence contains a grammatical error.
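The PRP + VBZ rule can be sketched as a check over adjacent (word, tag) pairs. Since the PRP tag alone does not say whether a pronoun is third-person singular, the sketch assumes a small hand-written word list; that list, the function name `check_pronoun_verb_agreement`, and the message wording are all hypothetical.

```python
# Assumed list of personal pronouns that are not third-person singular.
NON_THIRD_SINGULAR = {"i", "you", "we", "they"}


def check_pronoun_verb_agreement(tagged):
    """Flag a non-third-person-singular PRP directly followed by a VBZ verb."""
    errors = []
    for k in range(len(tagged) - 1):
        (word, tag), (next_word, next_tag) = tagged[k], tagged[k + 1]
        if tag == "PRP" and word.lower() in NON_THIRD_SINGULAR and next_tag == "VBZ":
            message = f"'{word} {next_word}': subject and verb do not agree"
            errors.append((k + 1, next_word, message))
    return errors


sentence = [("I", "PRP"), ("has", "VBZ"), ("a", "DT"), ("cat", "NN")]
print(check_pronoun_verb_agreement(sentence))
```

Each reported item carries the index of the offending verb, so the error position can be returned together with the explanation.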

The matching flow of the invention is illustrated below with one of the simplest rules in the method and a sample sentence:

“She goes swim every week.”

After part-of-speech tagging the result is:

“She/PRP goes/VBZ swim/VBP every/DT week/NN ./.”

When the invention encounters goes/VBZ while processing the text, the rule flow shown in Fig. 8 is invoked; it matches "goes/VBZ swim/VBP" in the example sentence and generates the corresponding feedback "goes: [Grammatical error] repeated predicate verb".
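The tag-triggered rule base can be sketched as a mapping from a trigger tag to a group of rule functions. Fig. 8 is not reproduced in this text, so the condition used here (a VBZ verb immediately followed by another finite or base-form verb) is only a guess at the flow; `RULE_BASE`, `rule`, `repeated_predicate`, and `check_sentence` are hypothetical names.

```python
RULE_BASE = {}  # trigger tag -> list of negative-example rule functions


def rule(trigger_tag):
    """Register a rule under the tag that triggers it."""
    def register(fn):
        RULE_BASE.setdefault(trigger_tag, []).append(fn)
        return fn
    return register


@rule("VBZ")
def repeated_predicate(tagged, k):
    # Assumed condition: a VBZ verb directly followed by another verb form.
    if k + 1 < len(tagged) and tagged[k + 1][1] in {"VB", "VBP", "VBZ"}:
        return f"{tagged[k][0]}: [Grammatical error] repeated predicate verb"
    return None


def check_sentence(tagged):
    """Scan the tags; each tag triggers its own group of rules."""
    feedback = []
    for k, (_, tag) in enumerate(tagged):
        for rule_fn in RULE_BASE.get(tag, []):
            message = rule_fn(tagged, k)
            if message:
                feedback.append((k, message))
    return feedback


sentence = [("She", "PRP"), ("goes", "VBZ"), ("swim", "VBP"),
            ("every", "DT"), ("week", "NN"), (".", ".")]
print(check_sentence(sentence))
```

Adding a new grammar rule only requires registering another function under its trigger tag, which mirrors the extensibility the description attributes to editing the error-instance flow charts.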

Claims

1. A method for automatically detecting and analyzing grammatical errors in English compositions, characterized by comprising the following steps:

(a) the English composition submitted at the front end is obtained and subjected to sentence splitting and word tokenization;

(b) all words obtained by tokenization in step (a) are spell-checked, and the spelling results and any fixed phrase collocations present are fed back;

(c) if the spelling is correct, all spell-checked words from step (b) are tagged with parts of speech using the Stanford analyzer;

(d) a word tagged in step (c) may carry several part-of-speech labels; the probability of each label is calculated and the label with the highest probability is selected;

(e) negative-example rule flow charts over part-of-speech labels are constructed for common English grammatical errors;

(f) each word with its part-of-speech label from step (d) is compared against the negative-example rule flow charts of step (e) according to its label; the grammatical errors of the composition and the recommended modifications are returned to the front end, and the data are simultaneously stored in the database.

2. The method for automatically detecting and analyzing grammatical errors in English compositions according to claim 1, characterized in that: in step (a), the sentence splitting rule marks every full stop in an English composition (including the period, the exclamation mark, and the question mark) as follows: [left word] [prefix] [full stop] [suffix] [right word], where the prefix is the character string directly attached before the full stop, the suffix is the character string directly attached after it, and the right word is the next word immediately following the full stop; each full stop is then judged to be a sentence end or not according to flow (1)-(5), thereby splitting the text into sentences:

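(1) if the last character of the prefix is ".", the full stop is judged to be a sentence end;

(2) if the suffix is empty and the right word is empty, the full stop is the end of a paragraph and therefore also a sentence end;

(3) if the suffix is a space, the right word is not empty, and the right word begins with a capital letter, the full stop is a sentence end provided the prefix is not an abbreviation such as Mr, Mrs, Ms, Dr, or Miss; a stop list of common abbreviations is established for this purpose;

(4) if the suffix is not a space and begins with a capital letter, the full stop is a sentence end provided neither the prefix nor the suffix contains another full stop;

(5) in all other cases the full stop is judged not to be a sentence end.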

3. The method for automatically detecting and analyzing grammatical errors in English compositions according to claim 1, characterized in that: in step (b), the English word database is built with a multi-level index table, using the first three letters of each word as the index of the dictionary.

4. The method for automatically detecting and analyzing grammatical errors in English compositions according to claim 1, characterized in that: in step (b), a fixed phrase library is constructed; SQL statements combined with regular expressions are used to filter the fixed phrases occurring in the text, and the fixed phrases used in the text are flagged.

5. The method for automatically detecting and analyzing grammatical errors in English compositions according to claim 1, characterized in that: in step (b), to spell-check the words obtained by tokenization, a dictionary storing standard English words is first established; every word is then matched against the dictionary; a word that matches is considered correctly spelled, and a failed match indicates a spelling error.

6. The method for automatically detecting and analyzing grammatical errors in English compositions according to claim 1, characterized in that: in step (c), the open-source part-of-speech tagger (POS Tagger) uses the contextual information in the sentence to assign each sentence element a part-of-speech tag corresponding to the sentence structure, and outputs the tagged text.

7. The method for automatically detecting and analyzing grammatical errors in English compositions according to claim 1, characterized in that: in step (d), words with multiple parts of speech often occur in the tagged text; each word is labeled using the Penn Treebank tag set; the label with the highest probability is then calculated according to the calculation formula (formula not reproduced in this text).

8. The method for automatically detecting and analyzing grammatical errors in English compositions according to claim 1, characterized in that: in step (e), sentences exhibiting various English grammar patterns are statistically classified, common grammatical errors are counted, the logic of each error is analyzed and modeled, and, based on the tags produced by the part-of-speech tagger, basic logic flows for detecting incorrect grammar are derived.

9. The method for automatically detecting and analyzing grammatical errors in English compositions according to claim 8, characterized in that: in step (f), after preprocessing and part-of-speech tagging the text is output in the form "... ...";

The label Tag serves as the trigger condition of the negative-example rules, and each label Tag corresponds to one group of rules in the rule base; when a label is scanned, the basic logic flow is used to match grammatical errors, and if the match succeeds, the detected grammatical error and the corresponding revision suggestion are fed back.

10. The method for automatically detecting and analyzing grammatical errors in English compositions according to claim 1, characterized in that: in step (f), stop words and exception-exclusion handling are added to some grammar rules, in particular for checking subject-predicate number agreement in long sentences.