CN112686030B

CN112686030B - Grammar error correction method, grammar error correction device, electronic equipment and storage medium

Info

Publication number: CN112686030B
Application number: CN202011591170.3A
Authority: CN
Inventors: 戴建新; 汪洋; 付瑞吉; 王士进; 魏思; 胡国平
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2023-12-01
Anticipated expiration: 2040-12-29
Also published as: CN112686030A

Abstract

The invention provides a grammar error correction method, a grammar error correction device, an electronic device and a storage medium. The grammar error correction method comprises: performing error detection on the text to be corrected to obtain an error text segment; performing error correction on the error text segment to obtain a corrected text segment corresponding to the error text segment; and determining an error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment, wherein the interaction vector is used to characterize the difference and commonality features between the error text segment and the corrected text segment. By performing error detection and error correction on the text to be corrected to obtain the error text segment and the corrected text segment, and by determining the error type corresponding to the error text segment based on the interaction vector between the two, the grammar error correction method provided by the embodiment of the invention is interpretable.

Description

Grammar error correction method, grammar error correction device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a method and apparatus for grammar correction, an electronic device, and a storage medium.

Background

Grammar error correction detects and corrects grammar errors in language texts, and is widely applied in fields such as grammar learning and text proofreading.

However, among current grammar error correction methods, those based on a translation mechanism can only output the corrected result and cannot explain the grammar errors present in the input sentence. A grammar error correction method whose correction results are interpretable is therefore needed.

Disclosure of Invention

The invention provides a grammar error correction method, a grammar error correction device, electronic equipment and a storage medium, which are used for solving the defect that the grammar error correction method in the prior art does not have interpretability and realizing the interpretable grammar error correction.

The invention provides a grammar error correction method, which comprises the following steps:

performing error detection on the text to be corrected to obtain an error text fragment;

performing error correction on the error text segment to obtain a corrected text segment corresponding to the error text segment;

determining an error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment; wherein the interaction vector is used to characterize the difference and commonality features between the erroneous text segment and the corrected text segment.

According to the grammar correction method provided by the invention, the interaction vector between the error text segment and the corrected text segment is determined based on the following steps:

and performing subtraction interaction, multiplication interaction and/or addition interaction on the text vector of the error text segment and the text vector of the corrected text segment to obtain the interaction vector.

According to the grammar error correction method provided by the invention, the determining the error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment comprises the following steps:

determining an error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment and the error type prior vector of the error text segment;

the error type prior vector is determined based on an error type of a sample error correction pair matched with an error correction pair in a preset error type library, and the error correction pair is composed of the error text segment and the correction text segment.

According to the grammar error correction method provided by the invention, the error type prior vector is determined based on the following steps:

Matching the error correction pair with each sample error correction pair in the error type library, and taking the error type of the matched sample error correction pair as the prior error type of the error correction pair;

the error type prior vector is determined based on the prior error type and its frequency of occurrence in the error type library.

According to the grammar error correction method provided by the invention, the error detection of the text to be corrected comprises the following steps:

determining a syntactic structure vector of the current word segment based on the syntactic context vector of each word segment in the text to be corrected and the syntactic association degree between each word segment and the current word segment;

and carrying out error detection on the text to be corrected based on the syntactic structure vector of each word segmentation.

According to the grammar error correction method provided by the invention, the error correction is carried out on the error text segment, and the grammar error correction method comprises the following steps:

determining a current error correction vector based on the previous error correction vector in the error text segment and a word vector of the previous corrected word segment;

determining a current correction word segmentation based on the current error correction vector;

wherein an initial error correction vector is determined based on the semantic context vector and the syntactic structure vector for each word segment in the erroneous text segment.

According to the grammar error correction method provided by the invention, the semantic context vector of any word segment in the error text segment is determined based on the semantic context vector of the word segment preceding that word segment and the syntactic context vector of that word segment.

The invention also provides a grammar error correction device, which comprises:

the error detection unit is used for carrying out error detection on the text to be corrected to obtain an error text fragment;

the error correction unit is used for performing error correction on the error text segment to obtain a corrected text segment corresponding to the error text segment;

an error type classification unit, configured to determine an error type corresponding to the error text segment based on an interaction vector between the error text segment and the corrected text segment; wherein the interaction vector is used to characterize the difference and commonality features between the erroneous text segment and the corrected text segment.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the grammar error correction methods described above when executing the computer program.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the grammar error correction methods described above.

According to the grammar error correction method, the grammar error correction device, the electronic device and the storage medium provided by the invention, the error text segment and the corrected text segment are obtained by performing error detection and error correction on the text to be corrected, and the error type corresponding to the error text segment is determined based on the interaction vector between the error text segment and the corrected text segment, so that the grammar error correction method provided by the embodiment of the invention is interpretable.

Drawings

In order to illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of a grammar error correction method provided by the invention;

FIG. 2 is a flow chart of a method for determining an error type prior vector according to the present invention;

FIG. 3 is a flow chart of an error detection method according to the present invention;

FIG. 4 is a flow chart of an error correction method according to the present invention;

FIG. 5 is a flow chart of a grammar error correction method according to another embodiment of the present invention;

FIG. 6 is a schematic diagram of a grammar error correction device according to the present invention;

FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Grammar error correction is widely applied in fields such as grammar learning and text proofreading. In modern education in particular, it can automatically correct students' writing so that students can study the grammar errors they make in a targeted manner.

Existing grammar error correction schemes generally fall into two types: error correction schemes based on an expert system and error correction schemes based on a translation mechanism. In a scheme based on an expert system, experts in linguistics draw on their experience to write grammar knowledge into a large number of rules, which serve as templates for subsequent error correction. The input text is then matched against the templates in the rule base by template matching, and if a match succeeds, the matched segment of the input text is considered to contain the corresponding grammar error. A scheme based on a translation mechanism treats the error correction task as a machine translation task: a translation model extracts and encodes sentence structure information and semantic information, decodes the encoded sentence structure features and semantic features, and rewrites the erroneous text into correct text, thereby achieving error correction.

However, in the scheme based on an expert system, building the expert system requires a great deal of labor, which is time-consuming and laborious. The coverage of grammar errors by expert experience is limited, a large number of long-tail errors cannot be covered by the rule base, and the accuracy of grammar error correction is poor. The scheme based on a translation mechanism is an end-to-end scheme in which the translation model directly rewrites the erroneous sentence into a correct sentence according to the sentence semantics. For example, for "there is an apples", the translation model directly outputs the error correction result "there is an apple". Although the error correction result is correct, the model cannot make explicit that "apples" was changed to "apple" because "apples" in the original sentence contains a noun singular/plural error. The error correction result of this method is therefore not interpretable: a grammar learner cannot identify his or her weak knowledge points and can hardly study them in a targeted manner.

In this regard, the embodiment of the invention provides a grammar error correction method. FIG. 1 is a flow chart of a grammar error correction method according to an embodiment of the present invention. As shown in FIG. 1, the method includes:


Step 110, performing error detection on the text to be corrected to obtain an error text segment.

Specifically, the text to be corrected is text that needs to undergo grammar error detection and grammar error correction. The text may be input directly by a user, obtained by performing OCR (Optical Character Recognition) on an image input by the user, or obtained by performing speech recognition on speech input by the user; the embodiment of the present invention does not specifically limit this.

Syntactic structure analysis can be performed on the text to be corrected according to each word segment in it, and error detection can be carried out according to the result of the syntactic structure analysis, so as to obtain the error text segments in the text to be corrected. Each word segment in the text to be corrected can be checked for grammar errors one by one, and an error detection result can be output for each word segment. Here, the error detection result of a word segment may indicate whether the word segment has a grammar error, and may also indicate the position of the word segment within the error segment, such as "error start", "error middle", "error end" and "single error", where "single error" means that the word segment constitutes a grammar error on its own. Based on the error detection result of each word segment, the error text segments in the text to be corrected can be determined. An error text segment may include one or more word segments. If the error detection result of a word segment is "single error", that word segment alone is taken as an error text segment. If several consecutive word segments all have grammar errors, they are combined into one error text segment in the order "error start", "error middle", "error end".
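The grouping rule described above can be sketched as follows. This is a minimal illustration, not the patented detector: the tag names and the function `merge_error_segments` are hypothetical, and the per-word tags are assumed to have already been produced by some error detection model.

```python
# Hypothetical tag names mirroring the four positions described above.
START, MIDDLE, END, SINGLE, OK = (
    "error_start", "error_middle", "error_end", "single_error", "ok",
)

def merge_error_segments(words, tags):
    """Group per-word detection tags into (first_index, last_index, text) segments."""
    segments = []
    open_start = None  # index where the currently open multi-word segment began
    for i, tag in enumerate(tags):
        if tag == SINGLE:
            # A single error forms an error text segment on its own.
            segments.append((i, i, words[i]))
            open_start = None
        elif tag == START:
            open_start = i
        elif tag == END and open_start is not None:
            segments.append((open_start, i, " ".join(words[open_start:i + 1])))
            open_start = None
        elif tag == MIDDLE and open_start is not None:
            continue  # still inside the open segment
        else:
            # Correct word or malformed tag sequence: discard any open segment.
            open_start = None
    return segments

words = "there is an apples in the tree".split()
tags = [OK, OK, OK, SINGLE, OK, OK, OK]
print(merge_error_segments(words, tags))  # [(3, 3, 'apples')]
```

Discarding malformed sequences (for example an "error end" with no preceding "error start") is a design choice of this sketch; the source does not say how such cases are handled.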

Step 120, performing error correction on the error text segment to obtain a corrected text segment corresponding to the error text segment.

Specifically, error correction can be performed on the error text segment according to the syntactic structure analysis result and the semantic information of the text to be corrected, so as to obtain the corresponding corrected text segment. Combining the syntactic structure information and the semantic information of the text to be corrected improves the accuracy of error correction.

Step 130, determining an error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment; wherein the interaction vector is used to characterize the difference and commonality features between the erroneous text segment and the corrected text segment.

Specifically, the corrected text segment is the result of correcting the grammar error in the error text segment. The difference and commonality features between the error text segment and the corrected text segment reveal how the correction was made, that is, which content of the error text segment was retained and which was modified when it was changed into the corrected text segment. From this correction pattern and the modification patterns associated with the various grammar errors, the error type corresponding to the error text segment can be inferred. Here, the error type corresponding to the error text segment refers to the type of grammar error present in the error text segment, which is associated with a specific grammar knowledge point. To check the correctness of the preceding error detection and error correction, an additional error type may also be defined which indicates that the current corrected text segment has low accuracy and should not be adopted. For example, in "there is an apples in the tree", the error text segment is "apples" and the corresponding corrected text segment is "apple". From the difference and commonality features between the error text segment "apples" and the corrected text segment "apple", it can be seen that the modification removes the "s" from "apples", that is, changes the plural form to the singular form; from this modification the error type can be inferred to be a noun singular/plural error.

Therefore, to capture the difference and commonality features of the error text segment and the corrected text segment, the two segments are encoded to obtain their interaction vector, and grammar error types are then classified based on the interaction vector to obtain the error type corresponding to the error text segment. The grammar error correction method provided by the embodiment of the invention is thus interpretable, in that it can indicate which type of grammar error appears in the error text segment. Based on the error type, grammar learners can study the associated grammar knowledge points in a targeted manner, which improves learning efficiency.

According to the method provided by the embodiment of the invention, the error text segment and the corrected text segment are obtained by performing error detection and error correction on the text to be corrected, and the error type corresponding to the error text segment is determined based on the interaction vector between the error text segment and the corrected text segment, so that the grammar error correction method provided by the embodiment of the invention is interpretable.

Based on the above embodiment, the interaction vector between the erroneous text segment and the corrected text segment is determined based on the steps of:

and performing subtraction interaction, multiplication interaction and/or addition interaction on the text vector of the error text segment and the text vector of the corrected text segment to obtain an interaction vector.

Specifically, a text vector of the error text segment and a text vector of the corrected text segment are determined based on the error text segment and the corrected text segment, respectively. Each text vector is determined based on the word vectors of the word segments in the corresponding text segment. Here, the word vectors of the word segments in the corresponding text segment may be fused, for example by vector concatenation, averaging or weighted averaging, to obtain the corresponding text vector; the embodiment of the present invention does not specifically limit this. For example, the text vector v_orig of the error text segment and the text vector v_target of the corrected text segment may be determined using the following formulas:

wherein start_index and end_index are the sequence numbers of the first and last word segments of the error text segment, x_i is the corresponding word segment, lookup_embedding() is the word vector lookup method, end_index is the sequence number of the last word segment of the corrected text segment, and correct_word_i is the corresponding corrected word segment.
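As an illustration of how a text vector can be built from word vectors, the sketch below uses averaging, which is one of the fusion options named above. The toy table `EMBEDDINGS`, its values, and the helper `text_vector` are assumptions made for the example; only the name `lookup_embedding` comes from the text, and its behaviour for unknown words is also assumed.

```python
# Toy embedding table (hypothetical values); real word vectors would come
# from a trained model.
EMBEDDINGS = {
    "apples": [1.0, 3.0],
    "apple": [1.0, 1.0],
    "an": [0.0, 2.0],
}

def lookup_embedding(word):
    """Return the word vector of a word segment (zero vector if unknown)."""
    return EMBEDDINGS.get(word, [0.0, 0.0])

def text_vector(segment_words):
    """Fuse the word vectors of a segment by element-wise averaging."""
    vectors = [lookup_embedding(w) for w in segment_words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

v_orig = text_vector(["apples"])   # text vector of the error text segment
v_target = text_vector(["apple"])  # text vector of the corrected text segment
print(v_orig, v_target)  # [1.0, 3.0] [1.0, 1.0]
```

Averaging keeps the text vector at the same dimension as the word vectors regardless of segment length, which is convenient when the two segments contain different numbers of word segments.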

A subtraction interaction may then be performed on the text vector of the error text segment and the text vector of the corrected text segment to obtain the difference features between the two segments. The subtraction interaction performs vector subtraction on the two text vectors, which weakens the content shared by the error text segment and the corrected text segment and highlights the difference between them. For example, since the subtraction interaction extracts the difference information of the text before and after correction, the subtraction interaction result of "apples" and "apple" is relatively similar to that of "friends" and "friend", and both reflect the change information "delete s".

In addition, the text vector of the error text segment and the text vector of the corrected text segment can be subjected to multiplication interaction and/or addition interaction to acquire the common characteristics between the error text segment and the corrected text segment. The multiplication interaction means that vector multiplication processing is carried out on the text vector of the error text segment and the text vector of the corrected text segment so as to obtain the associated information of the error text segment and the corrected text segment; adding interaction means that vector addition processing is carried out on the text vector of the error text segment and the text vector of the corrected text segment so as to strengthen the same information of the error text segment and the corrected text segment.

According to the embodiment of the invention, the difference characteristics and the commonality characteristics between the error text fragments and the corrected text fragments are obtained in the modes of subtracting interaction, multiplying interaction and adding interaction, and compared with the interaction mode of an attention mechanism, the method has the advantages of small calculated amount and higher interaction efficiency, and is beneficial to improving the efficiency of grammar error correction.

The subtraction interaction result, the multiplication interaction result and/or the addition interaction result are then fused to obtain the interaction vector between the error text segment and the corrected text segment. The interaction vector may be obtained by concatenating, weighted summing or averaging these interaction results, which is not specifically limited in the embodiment of the present invention.

For example, the interaction vector v_interaction may be determined using the following formula:

wherein the operator in the formula denotes vector concatenation.
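A minimal sketch of the three interactions followed by concatenation is given below. The function name `interaction_vector`, the use of element-wise multiplication for the "multiplication interaction", and the order in which the three results are concatenated are assumptions of this example and are not confirmed by the source.

```python
def interaction_vector(v_orig, v_target):
    """Concatenate subtraction, multiplication and addition interactions."""
    sub = [a - b for a, b in zip(v_orig, v_target)]  # difference features
    mul = [a * b for a, b in zip(v_orig, v_target)]  # association (element-wise, assumed)
    add = [a + b for a, b in zip(v_orig, v_target)]  # shared information
    return sub + mul + add  # list concatenation stands in for vector concatenation

print(interaction_vector([1.0, 3.0], [1.0, 1.0]))
# [0.0, 2.0, 1.0, 3.0, 2.0, 4.0]
```

Each interaction costs one pass over the vector, which is consistent with the stated advantage over attention-based interaction in terms of computation.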

Based on any of the above embodiments, step 130 specifically includes:

determining the error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment and the error type prior vector of the error text segment;

the error type prior vector is determined based on an error type of a sample error correction pair matched with an error correction pair in a preset error type library, and the error correction pair is composed of an error text segment and a correction text segment.

Specifically, since most grammar errors are enumerable, an error type library may be built in advance to provide prior information on grammar errors and assist in determining the error type of the error text segment. The error type library can be built from a large number of sample error correction pairs collected in actual application scenarios together with their error types. A sample error correction pair consists of a sample error text segment and the corresponding sample corrected text segment. The data stored in the error type library may take the form <sample error text segment, sample corrected text segment> - error type, for example: <apples, apple> - noun singular/plural error.

Based on the error type of the sample error correction pair in the error type library that matches the error correction pair, the error type prior vector of the error text segment can be determined. The error correction pair consists of the error text segment and the corrected text segment. Here, the error type prior vector is a preliminary judgment, based on the prior information given by the error type library, of the type of grammar error present in the error text segment, and it provides auxiliary information for classification. Determining the error type of the error text segment by combining the interaction vector with the error type prior vector can improve the accuracy of grammar error classification. Specifically, the interaction vector and the error type prior vector can be fused, for example by vector concatenation, averaging or weighted summation, and the fusion result is then used for grammar error classification. For example, the interaction vector and the error type prior vector may be fused using the following formula to obtain a fused vector v_check:

wherein v_prior is the error type prior vector.

When grammar error classification is carried out, a pre-trained grammar error classification model can be utilized to classify based on the fusion vector of the interaction vector and the error type prior vector, and the error type corresponding to the error text segment is obtained. An error classification parameter matrix may also be pre-trained to calculate a score for each possible error type based on the fusion vector of the interaction vector and the error type prior vector, and the error type with the highest score is used as the error type of the error text segment. The error classification parameter matrix can be trained based on a large number of sample error text fragments and corresponding sample error types. For example, taking a total of 46 error types as an example, the following formula may be used to determine the error type of the error text segment:

check_label_index = argmax(W_check v_check)

wherein W_check is the error classification parameter matrix and label_check is the error type of the error text segment. The "not adopted" error type indicates that the error cannot be classified into any of the other 46 error types; in that case the corrected text segment has low accuracy and need not be adopted for correction.
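The scoring step can be sketched as below, assuming that fusion is done by concatenation (one of the options named above) and that each row of the parameter matrix scores one error type. The function `classify_error_type`, the three-label set and all numeric values are hypothetical; a trained W_check would cover the full set of error types.

```python
def classify_error_type(v_interaction, v_prior, w_check, labels):
    """Return the label with the highest score W_check * v_check, plus all scores."""
    v_check = v_interaction + v_prior  # fusion by concatenation (assumed)
    scores = [sum(w * x for w, x in zip(row, v_check)) for row in w_check]
    best = max(range(len(scores)), key=scores.__getitem__)  # argmax over error types
    return labels[best], scores

# Hypothetical label set and parameters, one matrix row per label.
labels = ["noun singular/plural error", "verb tense error", "not adopted"]
w_check = [
    [0.0, 0.5, 1.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
label, scores = classify_error_type([0.0, 2.0], [4.0, 0.0, 0.0], w_check, labels)
print(label)  # noun singular/plural error
```

If the winning label is the "not adopted" type, a caller would discard the corrected text segment rather than report an error type.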

According to the method provided by the embodiment of the invention, the error type prior vector of the error text segment is determined based on the error type of the sample error correction pair matched with the error correction pair in the error type library, and the error type corresponding to the error text segment is determined by combining the interaction vector between the error text segment and the corrected text segment and the error type prior vector of the error text segment, so that the accuracy of grammar error classification is improved.

Based on any of the above embodiments, FIG. 2 is a flow chart of a method for determining an error type prior vector according to an embodiment of the present invention. As shown in FIG. 2, the method includes:

Step 210, matching the error correction pair with each sample error correction pair in the error type library, and taking the error type of the matched sample error correction pair as the prior error type of the error correction pair;

Step 220, determining an error type prior vector based on the prior error type and its frequency of occurrence in the error type library.

Specifically, the error correction pair is matched against the sample error correction pairs in the error type library one by one. If the error text segment and the corrected text segment of the error correction pair are identical to the sample error text segment and the sample corrected text segment of a sample error correction pair, the error correction pair matches that sample error correction pair. The error type of the matched sample error correction pair is then taken as the prior error type of the error correction pair.
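The exact-match rule above amounts to a lookup keyed on the pair of segments. The sketch below assumes the library is held as a list of (sample error segment, sample corrected segment, error type) records; the record contents other than <apples, apple> and the function `prior_error_type` are hypothetical.

```python
# Hypothetical error type library: (sample error segment, sample corrected segment, error type).
ERROR_TYPE_LIBRARY = [
    ("apples", "apple", "noun singular/plural error"),
    ("goed", "went", "verb tense error"),
]

def prior_error_type(error_segment, corrected_segment, library=ERROR_TYPE_LIBRARY):
    """Return the error type of the first sample pair identical to the given pair, else None."""
    for sample_error, sample_corrected, error_type in library:
        if (sample_error, sample_corrected) == (error_segment, corrected_segment):
            return error_type
    return None  # no matching sample pair, so no prior information is available

print(prior_error_type("apples", "apple"))  # noun singular/plural error
```

A linear scan follows the "one by one" wording of the text; a dictionary keyed on the pair would give the same result with constant-time lookup.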

Based on the prior error type, the corresponding error type prior vector can be determined. For example, the prior error type may be encoded with one-hot encoding, where the position corresponding to the prior error type is set to 1 and the positions corresponding to the other error types are set to 0, to obtain the error type prior vector. To distinguish error types with different confidence levels, the prior error type may be weighted by the frequency with which it occurs in the error type library when the error type prior vector is determined, so that prior error types with more occurrences have a greater effect on determining the error type. For example, the encoding may build on one-hot encoding and incorporate the frequency with which the prior error type occurs in the error type library, so that the value at the position corresponding to the prior error type is positively correlated with that frequency. For example, the value at the position corresponding to the prior error type may be determined as follows:

v_prior_i = log(1 + label_count_i)

wherein v_prior_i is the value at the position corresponding to the prior error type in the error type prior vector, and label_count_i is the frequency with which the prior error type occurs in the error type library.

For example, after the error correction pair <apples, apple> is matched against each sample error correction pair in the error type library, the prior error type of the error correction pair is determined to be a noun singular/plural error. The frequency of occurrence of this prior error type in the error type library is then obtained as 9999. Applying the above formula with a base-10 logarithm gives log(1 + 9999) = 4 at the position corresponding to the noun singular/plural error, so the error type prior vector can be determined to be (0, 4, 0).

According to the method provided by the embodiment of the invention, the error type of the matched sample error correction pair in the error type library is taken as the prior error type of the error correction pair, and the error type prior vector is determined based on the prior error type and its frequency of occurrence in the error type library, so that the accuracy of error type classification can be improved.
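The worked example can be reproduced with the short sketch below. The base-10 logarithm is inferred from the numbers in the example (a count of 9999 yielding 4); the function `error_type_prior_vector` and the three-label ordering are assumptions made for illustration.

```python
import math

def error_type_prior_vector(prior_type, label_count, labels):
    """Frequency-weighted one-hot vector: log10(1 + count) at the prior type, 0 elsewhere."""
    weight = math.log10(1 + label_count)
    return [weight if label == prior_type else 0.0 for label in labels]

# Hypothetical label ordering with the noun singular/plural error in the middle.
labels = ["verb tense error", "noun singular/plural error", "article error"]
v_prior = error_type_prior_vector("noun singular/plural error", 9999, labels)
print(v_prior)
```

The logarithm compresses large counts, so a type seen 9999 times is weighted 4 rather than dominating the fused vector by several orders of magnitude.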

Based on any of the above embodiments, fig. 3 is a flowchart of an error detection method according to an embodiment of the present invention, as shown in fig. 3, step 110 specifically includes:

Step 111, determining a syntactic structure vector of the current word segment based on the syntactic context vector of each word segment in the text to be corrected and the syntactic association degree between each word segment and the current word segment;

Step 112, performing error detection on the text to be corrected based on the syntactic structure vector of each word segment.

Here, the syntactic context vector of any word segment characterizes the syntactic information of that word segment and of the other word segments in its context. The syntactic context vector of a word segment may be determined based on the syntactic context vector of the preceding word segment and the word vector of the word segment itself. Specifically, a recurrent neural network (RNN) or one of its variants, such as Long Short-Term Memory (LSTM) or bidirectional Long Short-Term Memory (BiLSTM), may be used to extract the syntactic context vector of each word segment in the text to be corrected. For example, the syntactic context vector of any word segment may be determined using the following formula:

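$$h_i = \mathrm{LSTM}\left(h_{i-1}, \mathrm{v\_word}_i\right)$$

where $h_i$ is the syntactic context vector of the current word segment, $h_{i-1}$ is the syntactic context vector of the previous word segment, and $\mathrm{v\_word}_i$ is the word vector of the current word segment.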

The sentence component of any word segment in the text to be corrected depends not only on the word segment itself but also on the other word segments that are syntactically related to it, and word segments with different degrees of association contribute differently to its syntactic structure representation. Therefore, to further improve the ability to represent the syntactic structure of the text to be corrected, the syntactic context vectors of all word segments in the text to be corrected may be combined according to the syntactic association degree between each word segment and the current word segment, so as to obtain the syntactic structure vector of the current word segment. The syntactic association degree between any word segment and the current word segment is determined based on the syntactic relation between the two. For example, the syntactic context vectors of the word segments in the text to be corrected may be fused by weighted summation to obtain the syntactic structure vector of the current word segment, as shown in the following formula:

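$$\mathrm{v\_parser}_i = \sum_{j=1}^{N} p_{ij}\, h_j$$

where the text to be corrected contains N word segments, $\mathrm{v\_parser}_i$ is the syntactic structure vector of the current word segment, $p_{ij}$ is the syntactic association degree between word segment j and the current word segment, and $h_j$ is the syntactic context vector of word segment j.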

For example, the syntactic relation between the word segments "red" and "apple" is a modifier relation. When determining the syntactic structure vector of "apple", the syntactic association degree assigned to the modifier relation, for example 0.3, may be used as the syntactic association degree of the word segment "red". Here, the syntactic association degree corresponding to each syntactic relation may be preset. The syntactic relation between any word segment and the current word segment can be determined from the syntactic analysis result of the text to be corrected, which may be obtained with a parsing tool such as a Stanford tool.
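The weighted summation can be illustrated with a small sketch. The two-dimensional context vectors and the weight of 1.0 for the current word itself are made-up values for illustration; only the 0.3 weight for the modifier relation comes from the example above.

```python
def syntactic_structure_vector(context_vectors, association):
    """Weighted sum of all syntactic context vectors for one current word.

    context_vectors: list of N vectors (lists of floats), one per word segment.
    association: list of N weights p_ij for the current word segment i.
    """
    dim = len(context_vectors[0])
    out = [0.0] * dim
    for weight, vec in zip(association, context_vectors):
        for k in range(dim):
            out[k] += weight * vec[k]
    return out

# Toy syntactic context vectors for "red" and "apple" (made-up numbers).
context = [[1.0, 0.0], [0.0, 2.0]]
# Weights for the current word "apple": 0.3 for the modifier "red" (from the
# text), 1.0 for the word itself (an assumption of this sketch).
p_apple = [0.3, 1.0]
v_parser_apple = syntactic_structure_vector(context, p_apple)
print(v_parser_apple)
```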

And then, carrying out error detection on the text to be corrected based on the syntactic structure vector of each word in the text to be corrected. Here, a pre-trained error detection model, such as a sequence labeling model, may be used to label each word segment based on its syntax structure vector, so as to obtain an error detection result of each word segment. The error detection result of any word may be error start, error middle, error end, individual error or no error. An error detection parameter matrix can be pre-trained, and the error detection parameter matrix is used for calculating the score of each possible error detection result corresponding to any word according to the syntactic structure vector of the word, and taking the error detection result with the highest score as the error detection result of the word. The error detection parameter matrix can be obtained by training based on a large number of sample texts to be corrected and sample error detection results of each sample word in the sample texts to be corrected. For example, the error detection result of any word can be determined using the following formula:

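$$\mathrm{label\_detection\_index} = \arg\max\left(W_{\mathrm{detection}}\, \mathrm{v\_parser}_i\right)$$

where $W_{\mathrm{detection}}$ is the error detection parameter matrix, $\mathrm{v\_parser}_i$ is the syntactic structure vector of the current word segment, and label_detection_index is the index of the highest-scoring error detection result, which is taken as the error detection result of the current word segment.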
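As a rough illustration of this scoring step, the sketch below multiplies a parameter matrix by a syntactic structure vector and picks the highest-scoring label. The label order, the 5x2 matrix, and the input vectors are made-up values, not trained parameters.

```python
# The five possible error detection results named in the description.
LABELS = ["error start", "error middle", "error end", "individual error", "no error"]

def detect_label(w_detection, v_parser):
    """Score each label as the dot product of a matrix row with v_parser."""
    scores = [sum(w * x for w, x in zip(row, v_parser)) for row in w_detection]
    best_index = max(range(len(scores)), key=scores.__getitem__)  # argmax
    return LABELS[best_index]

# Toy parameter matrix with one row per label (hypothetical values).
W = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.2], [1.0, -1.0], [-1.0, 1.0]]

print(detect_label(W, [0.9, 0.1]))  # individual error
print(detect_label(W, [0.1, 0.9]))  # no error
```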

The embodiment of the invention uses the syntactic structure information in the syntactic structure vector of each word segment to locate grammatical errors, which avoids the problem of excessive modification. For example, for the text to be corrected "he is a beautiful boy", a correction scheme based on a translation mechanism modifies the text according to language fluency, and a translation model would typically change it to "he is a handsome boy" because "handsome" is better suited to modifying "boy" than "beautiful". From a grammatical point of view, however, "he is a beautiful boy" contains no grammatical error, so this change is an excessive modification. The embodiment of the invention locates errors by relying on the syntactic structure information of each word segment in the text to be corrected, which effectively avoids excessive modification and improves the accuracy of grammar correction.

According to the method provided by the embodiment of the invention, the syntactic structure vector of the current word is determined based on the syntactic context vector of each word in the text to be corrected and the syntactic association degree between each word and the current word, and the text to be corrected is subjected to error detection based on the syntactic structure vector of each word, so that the problem of excessive modification can be avoided, and the accuracy of grammar correction is improved.

Based on any of the above embodiments, fig. 4 is a flowchart of an error correction method according to an embodiment of the present invention, as shown in fig. 4, step 120 specifically includes:

Step 121, determining the current error correction vector based on the previous error correction vector of the error text segment and the word vector of the previous corrected word segment;

Step 122, determining the current corrected word segment based on the current error correction vector;

wherein the initial error correction vector is determined based on the semantic context vector and the syntactic structure vector of each word segment in the error text segment.

Specifically, grammar correction must fix the erroneous grammatical component without changing the meaning of the sentence. Therefore, encoding and decoding can be performed on the basis of the semantic information and the syntactic structure information of the text to be corrected, and the corrected word segments corresponding to the error text segment are output one by one, which improves the accuracy of error correction.

Based on the semantic context vector and the syntactic structure vector of each word segment in the error text segment, an initial error correction vector may be determined. The semantic context vector of any word segment characterizes the semantic information of that word segment and of its context in the text to be corrected, so the resulting initial error correction vector characterizes the semantics and the syntactic structure that the error text segment should have in the text to be corrected. The semantic context vector and the syntactic structure vector of each word segment in the error text segment may first be fused into a per-word fusion vector, and the fusion vectors of all word segments in the error text segment may then be fused into the initial error correction vector. Alternatively, the semantic context vectors of the word segments may be fused, the syntactic structure vectors of the word segments may be fused, and the two fusion results may then be fused into the initial error correction vector. The embodiment of the invention does not specifically limit this. For example, the initial error correction vector $\mathrm{v\_correct}_0$ may be determined using the following formula:

Here, start_index is the index of the first word segment of the error text segment, end_index is the index of the last word segment of the error text segment, and $\mathrm{v\_semantic}_i$ and $\mathrm{v\_parser}_i$ are the semantic context vector and the syntactic structure vector of word segment i, respectively.
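One possible fusion, following the first option described above (fuse per word, then fuse across the fragment), is sketched below. The choice of element-wise addition followed by averaging over the span is an assumption made purely for illustration; the embodiment leaves the fusion operator open, and the vectors are made-up values.

```python
def initial_correction_vector(v_semantic, v_parser, start_index, end_index):
    """Add the two vectors of each word, then average over the error span.

    The span is inclusive on both ends. Addition and averaging are
    illustrative choices, not operators prescribed by the description.
    """
    dim = len(v_semantic[0])
    span = range(start_index, end_index + 1)
    fused = [0.0] * dim
    for i in span:
        for k in range(dim):
            fused[k] += v_semantic[i][k] + v_parser[i][k]
    return [value / len(span) for value in fused]

# Toy vectors for a three-word text; the error fragment covers words 1 and 2.
v_sem = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
v_par = [[0.5, 0.5], [1.0, 1.0], [1.5, 1.5]]
v_correct_0 = initial_correction_vector(v_sem, v_par, 1, 2)
print(v_correct_0)  # [5.25, 6.25]
```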

The initial error correction vector is taken as the start of decoding, and the corrected word segments are decoded and output in sequence. That is, the initial error correction vector is used as the current error correction vector, and the current corrected word segment, i.e. the first corrected word segment, is decoded and output; the next error correction vector is then determined based on the current error correction vector and the word vector of the current corrected word segment, and this process is repeated until an end marker, such as <End>, is output. Here, the error correction vector may be updated using an RNN or an LSTM. For example, the next error correction vector $\mathrm{v\_correct}_i$ may be determined using the following formula:

$$\mathrm{v\_correct}_i = \mathrm{LSTM}\left(\mathrm{v\_correct}_{i-1}, \mathrm{lookup\_embedding}\left(\mathrm{x\_correct}_i\right)\right)$$

where $\mathrm{v\_correct}_{i-1}$ is the current error correction vector, $\mathrm{x\_correct}_i$ is the current corrected word segment, and lookup_embedding() is the word vector lookup method.

When determining the current corrected word segment, a pre-trained translation model may be used to decode the current error correction vector. An error correction parameter matrix may be pre-trained and used to calculate a score for each candidate word from the current error correction vector, and the word with the highest score is taken as the current corrected word segment. The error correction parameter matrix may be obtained by training on a large number of sample error text segments and their corresponding sample corrected text segments. For example, the current corrected word segment $\mathrm{x\_correct}_i$ may be determined using the following formula:

$$\mathrm{x\_correct}_i = \mathrm{lookup\_word}\left(\arg\max\left(W_{\mathrm{correct}}\, \mathrm{v\_correct}_{i-1}\right)\right)$$

where $W_{\mathrm{correct}}$ is the error correction parameter matrix, $\mathrm{v\_correct}_{i-1}$ is the current error correction vector, $\arg\max\left(W_{\mathrm{correct}}\, \mathrm{v\_correct}_{i-1}\right)$ yields the index of the word with the highest score at the current time step, and lookup_word() returns the word corresponding to that index.
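The alternation between scoring the current state and updating it can be sketched as a greedy decoding loop. In this sketch `step` stands in for the LSTM update and `score` for the multiplication by the error correction parameter matrix; both, together with the toy vocabulary and the integer state, are hypothetical stand-ins rather than trained components.

```python
def greedy_decode(v_correct_0, step, score, vocab, max_len=10):
    """Emit one word per step until "<End>" is chosen or max_len is reached.

    step(state, word) plays the role of the LSTM update;
    score(state) plays the role of W_correct applied to the state.
    """
    state, output = v_correct_0, []
    for _ in range(max_len):
        scores = score(state)
        word = vocab[max(range(len(scores)), key=scores.__getitem__)]  # argmax
        if word == "<End>":
            break
        output.append(word)
        state = step(state, word)
    return output

# Toy stand-ins: the state is a plain counter instead of a real vector.
VOCAB = ["apple", "on", "<End>"]

def toy_score(state):
    # Prefer "apple" in the initial state, then the end marker.
    return [1.0, 0.0, 0.5] if state == 0 else [0.0, 0.0, 1.0]

def toy_step(state, word):
    return state + 1

print(greedy_decode(0, toy_step, toy_score, VOCAB))  # ['apple']
```

The `max_len` cap is an added safeguard for the sketch so that decoding terminates even if the end marker is never selected.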

According to the method provided by the embodiment of the invention, the current error correction vector is determined based on the last error correction vector in the error text segment and the word vector of the last corrected word segment, and the current corrected word segment is determined based on the current error correction vector, wherein the initial error correction vector is determined based on the semantic context vector and the syntactic structure vector of each word segment in the error text segment, so that the accuracy of error correction is improved.

Based on any of the above embodiments, the semantic context vector of any of the tokens in the erroneous text fragment is determined based on the semantic context vector of the previous token of the token and the syntactic context vector of the token.

Specifically, the meaning that a word segment should have in the text to be corrected is related not only to its context but also to its syntactic information in the text to be corrected. For example, for the word "match", if its syntactic component in the text to be corrected is an object, its meaning is more likely to be "a contest", whereas if its syntactic component is a predicate, its meaning is more likely to be "to correspond to". Therefore, on the basis of the semantic context vector of the previous word segment, the semantic context vector of a word segment can be obtained by encoding together with the syntactic context vector of that word segment, so that its syntactic information is used and the semantic representation capability of the semantic context vector is improved.

The semantic context vector of any word segment may be extracted by a model such as an RNN, LSTM or BiLSTM. For example, the semantic context vector $\mathrm{v\_semantic}_i$ of any word segment may be determined using the following formula:

$$\mathrm{v\_semantic}_i = \mathrm{LSTM}\left(\mathrm{v\_semantic}_{i-1}, h_i\right)$$

where $\mathrm{v\_semantic}_{i-1}$ is the semantic context vector of the previous word segment and $h_i$ is the syntactic context vector of the word segment.

It should be noted that the syntactic context vector, rather than the syntactic structure vector, of the word segment is used here. The syntactic structure vector carries stronger syntactic structure information; if the semantic context vector of a word segment were determined from the semantic context vector of the previous word segment and the syntactic structure vector of the word segment, the overly strong syntactic structure information would weaken the extraction of semantic information and impair the semantic representation capability. By combining the semantic context vector of the previous word segment with the syntactic context vector of the word segment, the embodiment of the invention uses the syntactic information in the syntactic context vector without weakening the extraction of semantic information, which improves the semantic representation capability of the semantic context vector.
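The two stacked recurrences (word vectors to syntactic context vectors, then syntactic context vectors to semantic context vectors) can be sketched as follows. The averaging cell is only a stand-in for an LSTM cell so that the example stays self-contained, and all numbers are made up.

```python
def run_recurrence(cell, inputs, initial):
    """Apply state_i = cell(state_{i-1}, input_i) left to right; return all states."""
    states, state = [], initial
    for x in inputs:
        state = cell(state, x)
        states.append(state)
    return states

def toy_cell(prev, x):
    """Stand-in for an LSTM cell: element-wise average of previous state and input."""
    return [(p + v) / 2 for p, v in zip(prev, x)]

v_word = [[2.0, 0.0], [0.0, 4.0]]   # toy word vectors for a two-word text
zero = [0.0, 0.0]                   # initial state (an assumption of this sketch)

# First pass: syntactic context vectors from the word vectors.
syntactic_context = run_recurrence(toy_cell, v_word, zero)
# Second pass: semantic context vectors, fed by the syntactic context vectors
# rather than by the syntactic structure vectors.
v_semantic = run_recurrence(toy_cell, syntactic_context, zero)
print(syntactic_context)  # [[1.0, 0.0], [0.5, 2.0]]
print(v_semantic)         # [[0.5, 0.0], [0.5, 1.0]]
```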

According to the method provided by the embodiment of the invention, the semantic context vector of the word is determined based on the semantic context vector of the last word of any word in the error text segment and the syntactic context vector of the word, so that the semantic representation capability of the semantic context vector is improved.

Based on any of the above embodiments, Fig. 5 is a schematic flow chart of a grammar error correction method according to another embodiment of the present invention. As shown in Fig. 5, the method includes:

A text to be corrected, such as "there is an apples in the tree", is determined.

Since a character string cannot be analyzed directly, the text to be corrected may first be input to the data preprocessing module for word segmentation. A word segmentation tool, such as a Stanford word segmentation tool, converts the text to be corrected into ["there", "is", "an", "apples", "in", "the", "tree"]. For ease of representation, the word segment sequence is denoted $[x_1, x_2, x_3, \ldots, x_i, \ldots, x_N]$, where $x_i$ is the word segment at position index i and N is the length of the text to be corrected.

Then, the word segment sequence is input to the word meaning representation module, and a word vector is extracted for each word segment. Because the representation capability of the word vectors directly affects the accuracy of the subsequent error detection, error correction and error classification, a word vector table containing the word vector of each word may be pre-trained on service data from the actual application scenario. The word vector sequence of the text to be corrected can then be obtained by table lookup:

$$\mathrm{v\_word}_i = \mathrm{lookup\_embedding}\left(x_i\right)$$

where $\mathrm{v\_word}_i$ is the word vector of the word segment at position index i.
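The preprocessing and table lookup can be sketched as below. Whitespace splitting is only a stand-in for a real word segmentation tool, and the tiny vector table and the fallback to an `<unk>` entry for unseen words are assumptions of this sketch, not details given in the description.

```python
def lookup_embedding(word, table, unk="<unk>"):
    """Return the word vector for `word`, falling back to an unknown-word vector."""
    return table.get(word, table[unk])

text = "there is an apples in the tree"
tokens = text.split()  # whitespace split as a stand-in for a word segmentation tool

# Made-up two-dimensional word vector table; a real table would be pre-trained.
table = {"<unk>": [0.0, 0.0], "apples": [0.2, 0.7], "tree": [0.9, 0.1]}

v_word = [lookup_embedding(x_i, table) for x_i in tokens]
print(tokens)
print(v_word[3])  # [0.2, 0.7]
```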

And inputting the word vector sequence of the text to be corrected into a syntactic structure representation module, and carrying out syntactic structure representation on each word in the text to be corrected. Specifically, the syntax structure characterization module adds the syntax structure information of the text to be corrected on the basis of the BiLSTM to obtain the syntax structure vector of each word, and the method for determining the syntax structure vector described in the above embodiment may be specifically used to determine the syntax structure vector, which is not described herein again.

The syntactic structure vector of each word segment is input to the error detection module for error detection, yielding the error detection result of each word segment in the text to be corrected. Taking "there is an apples in the tree" as an example, the error detection results of the seven word segments are, in order, "no error", "no error", "no error", "individual error", "individual error", "no error" and "no error".

The syntactic context vector for each word segment is input to a semantic representation module. The semantic characterization module determines a semantic context vector for any word segment based on the semantic context vector of the previous word segment of the word segment and the syntactic context vector of the word segment.

Based on the error detection result of each word segment, the error text segments "apples" and "in" are determined and are each input to the error correction module. The error correction module determines an initial error correction vector based on the semantic context vector and the syntactic structure vector of each word segment in the error text segment, and decodes and outputs the corrected word segments in sequence to obtain the corrected text segment corresponding to the error text segment. For the error text segments "apples" and "in", the corrected text segments are "apple" and "on", respectively.
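Turning per-word error detection results into error text fragments can be sketched as a simple span-grouping pass. The label strings follow the five error detection results named earlier in the description; the function name and the tuple layout are hypothetical.

```python
def extract_error_fragments(tokens, labels):
    """Group labelled words into (start_index, end_index, text) error fragments."""
    fragments, start = [], None
    for i, label in enumerate(labels):
        if label == "individual error":
            fragments.append((i, i, tokens[i]))
        elif label == "error start":
            start = i
        elif label == "error end" and start is not None:
            fragments.append((start, i, " ".join(tokens[start:i + 1])))
            start = None
        # "error middle" and "no error" need no action of their own.
    return fragments

tokens = ["there", "is", "an", "apples", "in", "the", "tree"]
labels = ["no error", "no error", "no error",
          "individual error", "individual error", "no error", "no error"]
fragments = extract_error_fragments(tokens, labels)
print(fragments)  # [(3, 3, 'apples'), (4, 4, 'in')]
```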

The error text segments and the corrected text segments are input to the error classification module. The error classification module performs subtraction interaction, multiplication interaction and addition interaction on each error text segment and its corrected text segment to obtain the interaction vector between them. In addition, a preset error type library is used to determine the error type prior vector corresponding to the error text segment. The error type library may be built from the correction data produced when teachers mark students' grammar homework. After the interaction vector and the error type prior vector are fused, error classification is performed on the fusion result to obtain the error type corresponding to the error text segment. For example, the error type corresponding to "apples" is "noun singular/plural error", and the error type corresponding to "in" is "preposition error".
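The three interactions and the fusion with the prior vector can be sketched as follows. Representing each fragment by a single vector, combining the three interactions by concatenation, and fusing with the prior vector by concatenation are all assumptions of this sketch; the numbers are made up.

```python
def interaction_vector(v_error, v_corrected):
    """Concatenate element-wise difference, product and sum of two fragment vectors."""
    diff = [a - b for a, b in zip(v_error, v_corrected)]   # subtraction interaction
    prod = [a * b for a, b in zip(v_error, v_corrected)]   # multiplication interaction
    summ = [a + b for a, b in zip(v_error, v_corrected)]   # addition interaction
    return diff + prod + summ

def fuse(interaction, prior):
    """Fuse by concatenation (one possible choice among many)."""
    return interaction + prior

v_err, v_cor = [1.0, 2.0], [0.5, 2.0]   # toy vectors for an error/corrected pair
features = fuse(interaction_vector(v_err, v_cor), [0.0, 4.0, 0.0])
print(features)
```

The difference term highlights where the two fragments diverge, while the product and sum terms retain what they share, which matches the stated purpose of characterizing both difference and commonality features.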

The syntax error correction apparatus provided by the present invention will be described below, and the syntax error correction apparatus described below and the syntax error correction method described above may be referred to correspondingly to each other.

Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a syntax error correction device according to an embodiment of the present invention, and as shown in fig. 6, the device includes an error detection unit 610, an error correction unit 620, and an error type classification unit 630.

The error detection unit 610 is configured to perform error detection on the text to be corrected, so as to obtain an error text segment;

the error correction unit 620 is configured to perform error correction on the error text segment to obtain a corrected text segment corresponding to the error text segment;

the error type classification unit 630 is configured to determine an error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment; wherein the interaction vector is used to characterize the difference and commonality features between the erroneous text segment and the corrected text segment.

According to the device provided by the embodiment of the invention, the error text segment and the corrected text segment are obtained by performing error detection and error correction on the text to be corrected, and the error type corresponding to the error text segment is determined based on the interaction vector between the error text segment and the corrected text segment, so that the grammar error correction provided by the embodiment of the invention is interpretable.
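How the three units cooperate can be sketched as a small composition of callables. The class name, the method name `run`, and the rule-based stand-ins for each unit are hypothetical; real units would be the trained models described in the method embodiments.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class GrammarCorrectionDevice:
    detect: Callable[[str], List[str]]      # role of error detection unit 610
    correct: Callable[[str], str]           # role of error correction unit 620
    classify: Callable[[str, str], str]     # role of error type classification unit 630

    def run(self, text: str) -> List[Tuple[str, str, str]]:
        """Return (error fragment, corrected fragment, error type) triples."""
        results = []
        for fragment in self.detect(text):
            corrected = self.correct(fragment)
            results.append((fragment, corrected, self.classify(fragment, corrected)))
        return results

# Toy rule-based stand-ins for the three units (illustration only).
fixes = {"apples": ("apple", "noun singular/plural error"),
         "in": ("on", "preposition error")}
device = GrammarCorrectionDevice(
    detect=lambda text: [w for w in text.split() if w in fixes],
    correct=lambda frag: fixes[frag][0],
    classify=lambda frag, cor: fixes[frag][1],
)
report = device.run("there is an apples in the tree")
print(report)
```

Returning the error type next to each correction is what makes the output explainable to a learner, rather than only showing the corrected text.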

Based on any of the above embodiments, the apparatus further includes an interaction vector determination unit configured to:

Based on any of the above embodiments, the error type classification unit 630 is specifically configured to:

According to the device provided by the embodiment of the invention, the error type prior vector of the error text segment is determined based on the error type of the sample error correction pair matched with the error correction pair in the error type library, and the error type corresponding to the error text segment is determined by combining the interaction vector between the error text segment and the corrected text segment and the error type prior vector of the error text segment, so that the accuracy of grammar error classification is improved.

Based on any of the above embodiments, the apparatus further comprises an error type prior vector determination unit configured to:

an error type prior vector is determined based on the prior error type and its frequency of occurrence in the error type library.

According to the device provided by the embodiment of the invention, the error type of the sample error correction pair in the error type library that matches the error correction pair is taken as the a priori error type of the error correction pair, and the error type prior vector is determined based on the a priori error type and its frequency of occurrence in the error type library, so that the accuracy of error type classification can be improved.

Based on any of the above embodiments, the error detection unit 610 is specifically configured to:

and carrying out error detection on the text to be corrected based on the syntactic structure vector of each word.

According to the device provided by the embodiment of the invention, the syntactic structure vector of the current word is determined based on the syntactic context vector of each word in the text to be corrected and the syntactic association degree between each word and the current word, and the text to be corrected is subjected to error detection based on the syntactic structure vector of each word, so that the problem of excessive modification can be avoided, and the accuracy of grammar correction is improved.

Based on any of the above embodiments, the error correction unit 620 is specifically configured to:

determining a current error correction vector based on a last error correction vector in the error text segment and a word vector of a last correction word;

determining a current corrected word segment based on the current error correction vector;

The device provided by the embodiment of the invention determines the current error correction vector based on the last error correction vector in the error text segment and the word vector of the last correction word segmentation, and determines the current correction word segmentation based on the current error correction vector, wherein the initial error correction vector is determined based on the semantic context vector and the syntactic structure vector of each word segmentation in the error text segment, so that the accuracy of error correction is improved.

According to the device provided by the embodiment of the invention, the semantic context vector of the word is determined based on the semantic context vector of the last word of any word in the error text segment and the syntactic context vector of the word, so that the semantic representation capability of the semantic context vector is improved.

Fig. 7 illustrates a physical schematic diagram of an electronic device, as shown in fig. 7, which may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may call logic instructions in memory 730 to perform a syntax error correction method comprising: performing error detection on the text to be corrected to obtain an error text fragment; performing error correction on the error text segment to obtain a corrected text segment corresponding to the error text segment; determining an error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment; wherein the interaction vector is used to characterize the difference and commonality features between the erroneous text segment and the corrected text segment.

Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art or in a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the syntax error correction method provided by the above methods, the method comprising: performing error detection on the text to be corrected to obtain an error text fragment; performing error correction on the error text segment to obtain a corrected text segment corresponding to the error text segment; determining an error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment; wherein the interaction vector is used to characterize the difference and commonality features between the erroneous text segment and the corrected text segment.

In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the syntax error correction method provided above, the method comprising: performing error detection on the text to be corrected to obtain an error text fragment; performing error correction on the error text segment to obtain a corrected text segment corresponding to the error text segment; determining an error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment; wherein the interaction vector is used to characterize the difference and commonality features between the erroneous text segment and the corrected text segment.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A syntax error correction method, comprising:

determining an error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment; wherein the interaction vector is used for characterizing the difference characteristic and the commonality characteristic between the error text segment and the correction text segment;

the determining the error type corresponding to the error text segment based on the interaction vector between the error text segment and the corrected text segment comprises the following steps:

the error type prior vector is determined based on an error type of a sample error correction pair matched with an error correction pair in a preset error type library, and the error correction pair is formed by the error text fragment and the correction text fragment;

the error type prior vector is determined based on the steps of:

2. The grammar correction method of claim 1, wherein the interaction vector between the erroneous text segment and the corrected text segment is determined based on the steps of:

3. The syntax error correction method according to claim 1 or 2, wherein said error detecting the text to be error-corrected comprises:

4. The method of claim 3, wherein said performing error correction on said erroneous text segment comprises:

5. The method of claim 4, wherein the semantic context vector of any word segment in the erroneous text segment is determined based on the semantic context vector of a word segment preceding the any word segment and the syntactic context vector of the any word segment.

6. A syntax error correction apparatus, comprising:

an error type classification unit, configured to determine an error type corresponding to the error text segment based on an interaction vector between the error text segment and the corrected text segment; wherein the interaction vector is used for characterizing the difference characteristic and the commonality characteristic between the error text segment and the correction text segment;

the error type classification unit is specifically configured to:

The error type prior vector determination unit is further included for:

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the syntax error correction method according to any one of claims 1 to 5 when the program is executed by the processor.

8. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the syntax error correction method according to any one of claims 1 to 5.