CN112183094B - Chinese grammar error checking method and system based on multiple text features - Google Patents

Chinese grammar error checking method and system based on multiple text features

Info

Publication number
CN112183094B
CN112183094B (application CN202011209481.9A)
Authority
CN
China
Prior art keywords
grammar
text
information
model
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011209481.9A
Other languages
Chinese (zh)
Other versions
CN112183094A (en)
Inventor
张仰森
黄改娟
王思远
陈若愚
段瑞雪
尤建清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN202011209481.9A priority Critical patent/CN112183094B/en
Publication of CN112183094A publication Critical patent/CN112183094A/en
Application granted granted Critical
Publication of CN112183094B publication Critical patent/CN112183094B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/279 Handling natural language data; Natural language analysis; Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/216 Parsing using statistical methods
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/30 Semantic analysis
    • G06N3/045 Neural networks; Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/084 Learning methods; Backpropagation, e.g. using gradient descent
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Chinese grammar error checking method and system based on multiple text features. The method comprises the following steps: (1) represent the text as vectors using a pre-training model and grammatical prior knowledge respectively to obtain semantic feature vectors and part-of-speech feature vectors, and concatenate the part-of-speech feature vectors with the semantic feature vectors end to end to obtain the vector sequence of the text; (2) extract the feature vector sequence of the text with a Bi-LSTM model; (3) apply attention enhancement based on semantic and part-of-speech collocation information to the feature vector sequence; (4) apply a linear transformation to the attention-enhanced feature vector sequence to obtain a label prediction sequence; (5) apply information enhancement based on word-order relation features to the label prediction sequence; (6) capture constraint information from the information-enhanced label prediction sequence and judge the boundary positions of grammar errors based on that constraint information. Verification shows that the invention achieves a better error checking effect and outperforms other existing methods of its kind.

Description

Chinese grammar error checking method and system based on multiple text features
Technical Field
The invention belongs to the technical field of text recognition, and particularly relates to a Chinese grammar error checking method and system based on multiple text features.
Background
In recent years, grammar error checking methods have mainly fallen into three categories: methods based on rule matching, methods based on statistical models, and methods based on deep learning models.
The rule-matching approach first extracts lexical, syntactic and grammatical features, then builds grammar rule templates, and finally performs error checking by matching the text against those templates. Grammar rules can describe special grammatical structures in a text well, but when the text contains complex structures or informal expressions the rules often fail to characterize the grammatical phenomena accurately; statistical models can alleviate this problem. Yet although statistical models can build more comprehensive rule templates for the complex grammatical phenomena in text, the rules still cannot cover every phenomenon, so the error checking effect cannot be improved further.
With the rapid development of deep learning, a variety of network models have been applied to grammar error checking. Conventional network models capture long-distance semantic information and inter-word constraints well, but they still rely on static text vector representations that cannot represent the semantics of a text adequately, so vector representation based on pre-training models has become a research hotspot in recent years. Devlin et al. [1] proposed BERT, a bidirectional Transformer framework based on the self-attention mechanism, trained on very large datasets; the model has since been widely used for various natural language processing tasks. New pre-training models have appeared continuously, further improving the semantic representation capability of text vectors. Li Zaiwan [2] introduced a pre-training model into the vector representation of text and proposed a grammar error checking model based on Bi-LSTM-CRF; the model improves text semantic representation markedly, captures grammar error features better, and achieves a good error checking effect. These models improve the semantic representation capability of text, but they do not make effective use of grammatical prior knowledge, and the error checking effect still needs improvement.
The following references are referred to herein:
[1] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
[2] LI Zaiwan. Analysis and System Implementation of Chinese Language Disorders Based on Deep Learning[D]. University of Electronic Science and Technology of China, 2020. (in Chinese)
Disclosure of Invention
The invention aims to introduce grammatical prior knowledge into a pre-training model, and provides a Chinese grammar error checking method and system based on multiple text features.
Based on a pre-training model, the semantic information of the text is fused with grammatical prior knowledge, and the two are used jointly for the vector representation of the text so as to enrich its feature information; an attention mechanism and word-order relation features are introduced to strengthen the model's ability to capture text features and thereby improve the error checking effect.
The invention provides a Chinese grammar error checking method based on multiple text features, which comprises the following steps:
(1) represent the text as vectors using a pre-training model and grammatical prior knowledge respectively to obtain semantic feature vectors and part-of-speech feature vectors, and concatenate the part-of-speech feature vectors with the semantic feature vectors end to end to obtain the vector sequence of the text;
(2) extract the feature vector sequence of the text from the vector sequence obtained in step (1) using a Bi-LSTM model;
(3) apply attention enhancement based on semantic and part-of-speech collocation information to the feature vector sequence obtained in step (2);
(4) apply a linear transformation to the attention-enhanced feature vector sequence of step (3) to obtain a label prediction sequence;
(5) apply information enhancement based on word-order relation features to the label prediction sequence obtained in step (4);
(6) capture constraint information from the information-enhanced label prediction sequence of step (5), and judge the boundary positions of grammar errors based on the constraint information.
Further, in step (1), the vector representation of the text with the pre-training model is specifically:
each word in the text is characterized by 3 vectors: a word vector, a segment vector and a position vector; the 3 vectors are summed to obtain the semantic feature vector of each word.
Further, in step (1), the representation of the text with grammatical prior knowledge is specifically:
the text is segmented with a Chinese word segmentation system to obtain the word segmentation result, and the part-of-speech feature vectors are generated by one-hot encoding.
Further, step (3) is specifically:
combining the LSTM outputs at each moment to obtain the feature encoding matrix M = [e_1, e_2, …, e_s]^T = [d_1, d_2, …, d_k], where e_i is the semantic code obtained by concatenating the forward and backward outputs of each time node of the LSTM, s is the number of time steps of the unrolled LSTM, and k is twice the number of LSTM hidden units;
compressing M to obtain the semantic and part-of-speech collocation information vector p = [max(d_1), max(d_2), …, max(d_k)];
applying a linear transformation to the feature vector p to obtain the attention weight W;
using W to apply the weighted update h′_j = W h_j to the hidden-unit output h_j at each LSTM moment, where h_j is the j-th dimension of the hidden-layer output and j = 1, 2, …, r, thereby completing the attention enhancement of the feature vector sequence based on semantic and part-of-speech collocation information.
Further, step (5) is specifically:
traversing the label prediction sequence obtained in step (4), extracting the verbs and adjectives, taking the main nouns in the minimal grammar unit where each verb or adjective is located as its strong associations, and adding them to the strong association set F_nx; if the minimal grammar unit contains no main noun, the corresponding strong association set is empty;
traversing the label prediction sequence obtained in step (4), and adding the main nouns marked as having grammar errors to the set F_ny;
cleaning the strong association sets by F′_nx = F_nx - F_nx ∩ F_ny and F′_ny = F_ny - F_nx ∩ F_ny;
summing the predicted grammar-error label values in the set F′_ny and taking their average to obtain l′_nx, and updating the strong associations l_nx of the set F′_nx to l′_nx = (l_1 + l_2 + … + l_g)/g, where g is the number of characters in F′_ny and l_1, …, l_g are their predicted grammar-error label values;
thereby obtaining the information-enhanced label prediction sequence.
Further, in step (6), the constraint information is captured using a CRF model.
The invention provides a Chinese grammar error checking system based on multiple text features, which comprises:
a vector representation module for representing the text as vectors using a pre-training model and grammatical prior knowledge respectively to obtain semantic feature vectors and part-of-speech feature vectors, and concatenating the part-of-speech feature vectors with the semantic feature vectors end to end to obtain the vector sequence of the text;
a feature extraction module for extracting the feature vector sequence of the text from the vector sequence obtained by the vector representation module using a Bi-LSTM model;
an attention enhancement module for applying attention enhancement based on semantic and part-of-speech collocation information to the feature vector sequence obtained by the feature extraction module;
a linear transformation module for applying a linear transformation to the attention-enhanced feature vector sequence to obtain a label prediction sequence;
an information enhancement module for applying information enhancement based on word-order relation features to the label prediction sequence obtained by the linear transformation module;
a capture module for capturing constraint information from the information-enhanced label prediction sequence and judging the boundary positions of grammar errors based on that constraint information.
Compared with the prior art, the invention has the following advantages and beneficial effects:
In the text vector representation layer, semantic features and part-of-speech collocation features are fused, enriching the feature information of the text; an attention mechanism is introduced into the label prediction layer, allowing the model to assign different weights to different parts of the text and improving the recognition of grammar errors; meanwhile, a word strong-association layer and a post-processing mechanism are introduced, improving the model's ability to acquire word-order relation features and effectively improving its grammar error checking performance.
The method of the invention has been verified on the CGED public dataset, where it shows a better error checking effect and outperforms other existing methods of its kind.
Drawings
FIG. 1 is a framework diagram of a basic model of the invention based on semantic features and part-of-speech collocation features;
FIG. 2 is a vector representation schematic of a BERT model;
FIG. 3 is a schematic diagram of string lexical analysis;
FIG. 4 is a diagram of an information enhancement model framework based on semantic information and part-of-speech collocation information attention;
FIG. 5 is a schematic illustration of misprediction and lexical analysis results;
FIG. 6 is a block diagram of a multi-element text feature based grammar debugging model of the present invention.
Detailed Description
The implementation of the invention constructs four types of models: the first fuses a pre-training model with grammatical prior knowledge at the level of abstract text representation to build a basic model based on semantic features and part-of-speech collocation features; the second introduces an attention mechanism into grammar error checking to build an information enhancement model based on semantic and part-of-speech collocation attention; the third introduces word-order relation features, targeting the incomplete localization of the words involved in word-order errors, to build an information enhancement model based on word-order relation features; and the fourth fuses the first three models to build a grammar error checking model based on multiple text features.
For ease of understanding, the principles and embodiments of the four types of models will be described below, respectively.
(I) Basic model based on semantic features and part-of-speech collocation features
Referring to FIG. 1, the basic model mainly consists of a vector representation layer, a label prediction layer and a constraint information capture layer. The invention optimizes the vector representation layer of the basic model so that the model can acquire richer text features.
1.1 vector representation layer
Analysis of texts containing various grammar errors shows that the semantic information and grammatical prior knowledge of a text contain rich features for identifying those errors. Therefore, the invention constructs a vector representation method that fuses a pre-training model with grammatical prior knowledge.
1) Semantic feature vector
In a specific implementation, the pre-training model can be the BERT language model published by Google. The core of BERT is a bidirectional Transformer encoder, and the model is trained by predicting masked words and sentence-pair relations. The masked-word prediction task uses the Masked Language Model (MLM) strategy: the distance between words is not limited, and the model learns multi-level context information, i.e. global semantic information, in order to predict the masked words, achieving a deep bidirectional vector representation of words.
When representing a text as vectors, the BERT model uses the sum of three vectors as the final vector representation of the text, i.e. the semantic feature vector. The summation process is shown in FIG. 2: each word is characterized by 3 vectors, namely a word vector, a segment vector and a position vector. The word vector is the encoding of the target word looked up in the vocabulary; the segment vector encodes the position of the target sentence within the text; the position vector encodes the position of the target word within the sentence. Note that before each sentence is encoded, [CLS] and [SEP] tokens are added at its beginning and end.
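As a minimal sketch of this three-vector summation, the PyTorch module below builds the word, segment and position embedding tables and sums them; the vocabulary size, maximum length and segment count are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class BertStyleEmbedding(nn.Module):
    """Sum of word, segment and position vectors as described above.
    vocab_size, max_len and n_segments are illustrative assumptions."""
    def __init__(self, vocab_size=21128, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # word vector (vocabulary lookup)
        self.seg = nn.Embedding(n_segments, hidden)   # segment vector (sentence position)
        self.pos = nn.Embedding(max_len, hidden)      # position vector (word position)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len), already wrapped in [CLS]/[SEP]
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
```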
2) Part-of-speech grammar priori knowledge
Parts of speech contain the syntactic structure features of a text while also reflecting the semantic information of its words, so part of speech is fed into the model as grammatical prior knowledge. When a text contains grammar errors, its grammatical structure deviates from the norm and the original syntactic structure of the text sequence is broken; grammar errors can also cause problems such as extra words and disordered word order, producing string grammar errors and word-order grammar errors. String grammar errors often leave scattered single characters in the word segmentation result. As shown in FIG. 3, when the character "中" (middle) of the string is miswritten as "忠" (loyal), "中国" (China) turns into the two isolated single characters "忠" and "国", the scattered-single-character phenomenon appears, and the part-of-speech sequence changes as well; word-order grammar errors likewise change the part-of-speech sequence of the text. Therefore, parts of speech are introduced into the vector representation of the text so that the model can capture part-of-speech collocation features, further improving the error checking effect.
In a specific implementation, the NLPIR system of the Chinese Academy of Sciences (a Chinese word segmentation system) can be used as the lexical analysis tool. NLPIR tags the parts of speech of words into 22 major classes and 75 minor classes, and a corresponding part-of-speech tagging dictionary is built on this classification system. First the word segmentation result of the text is obtained; then the part-of-speech feature vectors (x′_1, x′_2, …, x′_q) are generated by one-hot encoding, where q is the sequence length of the text, and they are concatenated with the semantic feature vectors (x_1, x_2, …, x_q) to obtain the final vector representation sequence (x_1, x_2, …, x_q), (x′_1, x′_2, …, x′_q).
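A short sketch of this step is given below. Table 1 lists a 778-dimensional word vector, which would be consistent with the 768-dimensional semantic vector plus a 10-dimensional part-of-speech code, but the patent does not state the encoding size, so n_tags = 10 is an assumption.

```python
import torch
import torch.nn.functional as F

def pos_onehot(pos_tag_ids: torch.Tensor, n_tags: int) -> torch.Tensor:
    """One-hot part-of-speech features; tag ids come from the lexical analyzer,
    with each character inheriting the tag of the word containing it."""
    return F.one_hot(pos_tag_ids, num_classes=n_tags).float()

# x: semantic feature vectors from the pre-training model; the final
# representation concatenates them with the part-of-speech vectors end to end.
x = torch.randn(2, 10, 768)                 # (batch, q, 768)
pos_ids = torch.randint(0, 10, (2, 10))     # n_tags = 10 is an assumption
x_cat = torch.cat([x, pos_onehot(pos_ids, 10)], dim=-1)  # (batch, q, 778)
```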
1.2 Label prediction layer
The label prediction layer uses a Bi-LSTM model to encode the contextual semantic information and part-of-speech collocation information of the text, so that the model can capture semantic features and part-of-speech collocation features and thereby improve the recognition of string grammar errors. The output of this layer is the feature vector sequence (h_1, h_2, …, h_r), where r is the dimension of the hidden layer, from which the label prediction sequence (l_1, l_2, …, l_q) is obtained by a linear transformation.
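A sketch of this layer follows; the two-layer depth and the 250-dimensional output (hidden size 125 per direction) follow Table 1, while the tag count is an assumption.

```python
import torch.nn as nn

class LabelPredictionLayer(nn.Module):
    """Bi-LSTM encoder plus linear tag projection. Two layers and a 250-dim
    output (125 per direction) follow Table 1; n_tags is an assumption."""
    def __init__(self, in_dim=778, hidden=125, n_tags=5):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_tags)

    def forward(self, x):
        h, _ = self.bilstm(x)     # (batch, q, 250): feature vector sequence
        return h, self.proj(h)    # features and per-character label predictions
```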
1.3 constraint information Capture layer
A CRF is used to capture the constraint information between words, which is then used to judge the boundary positions of grammar errors; this effectively improves the accuracy of grammar error position recognition and thus the error checking effect of the model.
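As one possible realization of this layer (not necessarily the one used in the patent), the third-party pytorch-crf package provides a CRF with exactly this role; the tag count below is illustrative.

```python
# pip install pytorch-crf
import torch
from torchcrf import CRF

crf = CRF(num_tags=5, batch_first=True)   # tag count is illustrative
emissions = torch.randn(2, 10, 5)         # label prediction scores from the model
tags = torch.randint(0, 5, (2, 10))       # gold labels for training
loss = -crf(emissions, tags)              # negative log-likelihood for training
best_paths = crf.decode(emissions)        # constrained best tag sequences
```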
(II) Information enhancement model based on semantic and part-of-speech collocation attention
Further analysis of texts with grammar errors shows that, when judging whether a grammar error exists at a given position, the useful information available from different words in the text differs. Accordingly, when identifying a grammar error it is desirable that the parts related to the error receive higher weights and the unrelated parts receive lower weights. An attention mechanism can therefore selectively allocate the limited attention to the more important information, so that the parts most correlated with the grammar error carry higher weights; the feature vectors of the text are updated accordingly and the error checking effect of the model improves.
A framework diagram of the information enhancement model of the invention is shown in FIG. 4. Attention based on semantic and part-of-speech collocation information computes the weights mainly by analyzing the relations inside the sequence, applies the attention-weighted update to the Bi-LSTM output, and strengthens the parts of the text related to grammar errors. In the LSTM, the forward and backward outputs of each time node are concatenated to obtain the semantic code e_i, which contains both the preceding and the following context of the current moment. Combining the LSTM outputs at each moment yields the feature encoding matrix of the text's semantic features and part-of-speech collocation features, M = [e_1, e_2, …, e_s]^T = [d_1, d_2, …, d_k], where s is the number of time steps of the unrolled LSTM and k is twice the number of LSTM hidden units. Compressing the feature encoding matrix gives the semantic and part-of-speech collocation information vector p, as in equation (1):

p = [max(d_1), max(d_2), …, max(d_k)]   (1)

The attention weight W is obtained by a linear transformation of the feature vector p, as in equation (2):

W = Linear(p)   (2)

W is used to apply the weighted update to the hidden-unit output h_j at each LSTM moment, as in equation (3):

h′_j = W h_j   (3)

where h_j is the output of the j-th dimension of the hidden layer, j = 1, 2, …, r. The weighted h′_j finally serves as the output at each moment.
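The sketch below implements equations (1)-(3). Since the patent leaves the exact shape of W open, it is read here as a single scalar gate per sentence produced by the linear map; this is an assumption.

```python
import torch
import torch.nn as nn

class CollocationAttention(nn.Module):
    """Equations (1)-(3): max-pool the feature encoding matrix M column-wise,
    map the pooled vector p to a weight W, and rescale the hidden outputs."""
    def __init__(self, k: int):
        super().__init__()
        self.linear = nn.Linear(k, 1)       # equation (2): W = Linear(p)

    def forward(self, M):                   # M: (batch, s, k) Bi-LSTM outputs
        p = M.max(dim=1).values             # equation (1): max over time steps
        W = self.linear(p)                  # (batch, 1) scalar gate (assumed shape)
        return W.unsqueeze(1) * M           # equation (3): h'_j = W * h_j
```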
(III) Information enhancement model based on word-order relation features
The information enhancement model based on semantic and part-of-speech collocation attention checks string grammar errors well, but its effect on word-order grammar errors is poor. This is because word-order errors behave differently from string errors. For example: "With economic development, the living standard of the people has improved." When the words of the clause "the living standard of the people has improved" are written in a scrambled order, the grammar error arises from the confusion of the order of several words, and such information is hard to capture with the error checking model above alone. Specifically, after extracting the contextual semantic features and part-of-speech collocation features, the grammar error checking model predicts character labels; the prediction focuses on single characters and does not sufficiently extract the word-order relations between words. Consequently, when more than two words are involved the model can usually identify only one or two of them, and the error checking effect is poor. To address this, word-order relation features and the attention mechanism are introduced to improve the model's ability to recognize word-order grammar errors involving several words.
The word-order relation features mainly characterize the collocation between two pairs of parts of speech: noun with verb, and adjective with noun. Nouns and verbs generally describe who initiates and who bears an action, while adjectives and nouns generally describe a modification relation. The concept of a minimal grammar unit is also introduced: the text between two punctuation marks is taken as one minimal grammar unit. For example, the comma in "With economic development, the living standard of the people has improved." divides the sentence into two minimal grammar units. Within a minimal grammar unit, common nouns are taken as main nouns, and the main nouns that stand in a word-order relation with a verb or adjective are taken as that word's strong associations. If a verb or adjective is marked as having a word-order grammar error, the words in its strong association set are marked together with it. In the word segmentation results, v denotes a common verb, u an auxiliary word, n a noun, and wj a punctuation mark. As shown in FIG. 5, "up", "raised" and "people" are marked as having grammar errors; following the idea above, since the verb "raised" in this grammar unit is marked as erroneous, its corresponding strong associations "people" and "living standard" are marked as having grammar errors together with it.
Therefore, a word strong-association set layer is added after the attention layer of the information enhancement model based on semantic and part-of-speech collocation attention. It first traverses the output sequence of the attention layer, finds the characters marked as word-order errors in the current sequence, and then builds the strong association sets within each minimal grammar unit. Next, the characters in the strong association sets that were not marked as word-order errors are screened out, and their word-order-error prediction scores are updated according to equations (4)-(6). Finally, the updated character label score sequence (l′_1, l′_2, …, l′_q) is output.
The strong association sets are built as follows: first, the strong associations of each verb and adjective are extracted; the main nouns (i.e. common nouns) in the minimal grammar unit where the verb or adjective is located are taken as its strong associations and added, together with the verb or adjective, to the strong association set F_a; if the minimal grammar unit contains no main noun, the strong association set is empty. The strong association sets are then cleaned: the label prediction sequence of the characters in the text is taken from the label prediction layer, the main nouns marked as having grammar errors are added to the set F_b, and F_a is cleaned against F_b to obtain F′_a. The cleaning is given by equations (4) and (5):

F′_a = F_a - F_a ∩ F_b   (4)
F′_b = F_b - F_b ∩ F_a   (5)

The label prediction value of a strong association is computed as follows: the predicted grammar-error label values in the set F′_b are summed and averaged to obtain l′_a, and the strong association l_a in the set F′_a is updated to l′_a, as in equation (6):

l′_a = (l^b_1 + l^b_2 + … + l^b_g)/g   (6)

where l_a is the attention-layer output being updated, g is the number of characters in the set F′_b, and l^b_1, …, l^b_g are the predicted probability values of the characters in F′_b.
A post-processing mechanism is introduced at the output of the constraint information capture layer: when several word-order errors are marked, if consecutive single characters appear between them and their total number is below a specified threshold, those characters are also marked as word-order grammar errors, further improving the accuracy of grammar error recognition.
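A sketch of this post-processing rule follows, with the threshold of 3 chosen later in the experiments; the tag name 'W' is again an assumption.

```python
def postprocess_word_order(tags, threshold=3):
    """Fill short gaps between word-order-error marks: if fewer than
    `threshold` unmarked characters sit between two 'W' marks, mark them too."""
    n, i = len(tags), 0
    while i < n:
        if tags[i] == 'W':
            j = i + 1
            while j < n and tags[j] != 'W':
                j += 1                       # scan to the next 'W' mark
            if j < n and 0 < j - i - 1 < threshold:
                for k in range(i + 1, j):    # gap shorter than threshold: fill it
                    tags[k] = 'W'
            i = j
        else:
            i += 1
    return tags

print(postprocess_word_order(['W', 'O', 'O', 'W']))  # ['W', 'W', 'W', 'W']
```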
(IV) Grammar error checking model based on multiple text features
The three models above are fused to construct a grammar error checking model based on multiple text features; the model framework is shown in FIG. 6.
Specifically, the text is represented as vectors by the method that fuses the pre-training model with grammatical prior knowledge, yielding the vector sequence (x_1, x_2, …, x_q), (x′_1, x′_2, …, x′_q). First, the Bi-LSTM layer extracts the contextual semantic features and part-of-speech collocation features of the text, giving the vector sequence (h_1, h_2, …, h_r). Then the information enhancement model based on semantic and part-of-speech collocation attention enhances (h_1, h_2, …, h_r); the updated vector sequence (h′_1, h′_2, …, h′_r) is computed with equation (3), and the label prediction sequence (l_1, l_2, …, l_q) is obtained by a linear transformation. Next, using the capability of the information enhancement model based on word-order relation features to capture those features, a word strong-association set computation layer is introduced and word-order relation information is added to the sequence, giving (l′_1, l′_2, …, l′_q). Finally, the CRF layer marks the grammar error positions, producing the final grammar error recognition sequence (y_1, y_2, …, y_q).
Examples
The feasibility and advantageous effects of the invention are verified below with reference to an embodiment.
The present embodiment is validated on the public datasets of the Chinese Grammatical Error Diagnosis (CGED) shared task (2016-2018). The training set of the model is the CGED2016 training set; the test sets are the CGED2016, 2017 and 2018 test sets. The model is evaluated on the three test sets separately, and the highest value is taken as its final performance. The experimental results use precision (P), recall (R) and their combined measure, the F1 value, as the criteria of model performance. P is the percentage of correctly identified grammar errors among all identified grammar errors; R is the percentage of correctly identified grammar errors among all grammar errors in the data; F1 is the harmonic mean of P and R, reflecting the overall performance of the model.
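For concreteness, the three measures can be computed as below over sets of identified and gold-standard errors; representing an error as a (start, end, type) span is an assumption.

```python
def prf1(identified: set, gold: set):
    """P over identified errors, R over gold errors, F1 their harmonic mean."""
    tp = len(identified & gold)
    p = tp / len(identified) if identified else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(prf1({(3, 5, 'W'), (8, 9, 'S')}, {(3, 5, 'W')}))  # (0.5, 1.0, 0.666...)
```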
The comparison models selected in this embodiment are:
(1) LSTM model (see P. L. Chen, S. H. Wu, L. P. Chen, et al. Improving the Selection Error Recognition in a Chinese Grammar Error Detection System[C]// IEEE International Conference on Information Reuse & Integration. IEEE, 2016.):
the model regards grammar debugging task as sequence labeling task, long-distance semantic information of text is obtained by using long-term and short-term memory network, and then grammar debugging is carried out.
(2) AL_I_NLP model (see Y. Yang, P. J. Xie, J. Tao, et al. Alibaba at IJCNLP-2017 Task 1: Embedding Grammatical Features into LSTMs for Chinese Grammatical Error Diagnosis Task[C]// IJCNLP, 2017, 41.):
the model provides a combined model of a two-way long-short-term memory network and a conditional random field, and simultaneously adds features such as part of speech, syntax and the like, thereby improving the capability of the model for identifying long-distance grammar errors.
(3) W_POS model (see LI Zaiwan. Analysis and System Implementation of Chinese Language Disorders Based on Deep Learning[D]. University of Electronic Science and Technology of China, 2020.):
the model splices word vectors, part-of-speech vectors and the like into input vectors, so that the input vectors can represent more text information.
(4) HFL model (see R. J. Fu, Z. Q. Pei, J. F. Gong, et al. Chinese Grammatical Error Diagnosis using Statistical and Prior Knowledge driven Features with Probabilistic Ensemble Enhancement[C]// NLPTEA, 2018, 52-59.):
the model further integrates word statistical characteristics and priori grammar knowledge on the basis of an AL_I_NLP model, and performs post-processing on model output, so that grammar error checking effect is improved.
(5) BERT model (see LI Zaiwan. Analysis and System Implementation of Chinese Language Disorders Based on Deep Learning[D]. University of Electronic Science and Technology of China, 2020.):
the model carries out vector representation of the text through the BERT model, and realizes grammar debugging through a two-way long-short-term memory network and a conditional random field model.
The embodiment of the invention implements the model with the deep learning framework PyTorch and trains and tunes it with batched text. The experimental environment is a single RTX 2080 Ti, and the pre-trained word embedding dimension is 768. The model parameters are tuned in a distributed fashion with the Hyperopt library to obtain the optimal parameter set; the selected parameters are as follows: Adam is used as the optimizer, the initial learning rate is set to 0.00005, the learning rate decay factor to 0.00001, and the batch size to 20. For the choice of the dropout value, experiments on the validation set show that dropout = 0.4 gives the highest F1 value with a small number of training rounds, so 0.4 is chosen as the final value.
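The reported training configuration can be set up as below. Reading the "learning rate decay factor" as Adam's weight_decay is an assumption (it could equally denote a scheduler setting), and the stand-in model is illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(778, 5)   # stand-in for the full error checking model
optimizer = torch.optim.Adam(model.parameters(), lr=0.00005,
                             weight_decay=0.00001)  # decay factor read as weight decay
dropout = nn.Dropout(p=0.4)  # applied inside the model in practice
BATCH_SIZE = 20
```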
Meanwhile, the influence of the number of Bi-LSTM layers on the model is compared and analyzed; the experiments show that a two-layer network captures the semantic features of the text better. For the post-processing threshold of the word-order error checking model, statistics on the segmented data containing word-order errors show that most of the words involved are auxiliary words, adverbs and prepositions whose length does not exceed 3, so the threshold is set to 3. The parameter settings of the model are shown in Table 1.
Table 1 Experimental parameter settings

Parameter name           Parameter value
Bi-LSTM layers           2
Word vector dimension    778
Batch size               20
Learning rate            0.00005
dropout                  0.4
Bi-LSTM output dimension 250
In the embodiment of the invention, comparative experiments are first run on the text vector representation method fusing semantic and part-of-speech feature vectors, on the attention mechanism, and on the introduced word-order relation features. The baseline is a character-based Bi-LSTM-CRF model; the BP model denotes the grammar error checking model fusing semantic and part-of-speech collocation features, BP_A denotes the information enhancement model with the attention mechanism added, and BP_A_N denotes the information enhancement model with the word-order relation features further added. The effect of adding the different features on grammar error recognition is analyzed through these experiments, whose results are shown in Tables 2-3.
Table 2 Influence of the features on string grammar error recognition
(table data rendered as an image in the original document)
Table 3 Influence of the features on word-order grammar error recognition
(table data rendered as an image in the original document)
The comparative results show that the BP model, which generates text vectors with the representation method fusing semantic and part-of-speech features, strengthens the model's capture of semantic and part-of-speech collocation features, greatly improves the recognition of string grammar errors, and also raises the recall of word-order grammar errors considerably. On this basis, BP_A introduces the attention mechanism, letting the model assign different weights to different parts of the text; the experiments show that the recognition of string grammar errors improves further while word-order errors show no obvious change, which confirms the earlier assumption that the model lacks the ability to capture the word-order relation features in the text. Addressing this problem, the BP_A_N model adds the word strong-association set computation layer and introduces the word-order relation features; through the strong-association layer, the multiple words of a word-order error are marked simultaneously, raising the recall of word-order grammar errors and demonstrating the effectiveness of the model.
This embodiment runs experiments and analyses following the experimental procedure and evaluation method of the CGED dataset, comparing models along the three dimensions of precision, recall and F1 value. The comparison with the previous models is shown in Table 4.
Table 4 Results of the experimental comparison with previous models
(table data rendered as an image in the original document)
The experiments of the LSTM, AL_I_NLP, W_POS, BERT and HFL models are all based on the CGED dataset, and their results are those reported in the corresponding papers. On the CGED dataset, the BP_A_N model of the invention outperforms the other models on the performance indicators, demonstrating the effectiveness of the method.
The BP model introduces semantic features and part-of-speech collocation features simultaneously and combines them to generate the vector representation of the text; compared with the LSTM, AL_I_NLP, BERT and W_POS models it improves greatly in both precision and recall. On this basis, BP_A uses the attention mechanism for information enhancement, and the model can give more weight to the characters that provide more information for recognizing a grammar error. For example, in "I am a loyal-country person" (where the character "中" of "中国", China, is miswritten as "忠", loyal), when judging whether "忠" has a grammar error, the adjacent "国" obtained from the semantic and part-of-speech collocation features helps the judgment most, so the attention computation gives "国" more weight and "忠" is then identified as having a grammar error. The experimental results show a large improvement in recognition precision, proving the effectiveness of the method.
BP_A_N further introduces the word-order relation features, enabling the model to identify long-distance word-order errors. For example: "should not pose health problems to others." When the model identifies "to others" as a word-order grammar error, the strong-association layer computation lets "health problems" be marked together with it, and the post-processing marks the characters in between as well, achieving the correct labeling of the word-order error and raising the recall of the model. However, this method can also produce misjudgments. For example: "to convert it into power that motivates the development of our society." The model marks "the development of our society" as a word-order grammar error, and through the strong-association computation "society" is also marked, causing a misjudgment. The method therefore raises the model's recall on word-order grammar errors at the cost of some precision. Compared with HFL, however, the model balances precision and recall well and is simpler.
Those of ordinary skill in the art will appreciate that, in the embodiments of the method of the invention, the sequence numbers of the steps do not fix their order, and changing that order without creative effort also falls within the scope of the invention. The examples described herein are intended to help the reader understand the practice of the invention; it should be understood that the scope of the invention is not limited to such specific statements and examples. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.

Claims (6)

1. A Chinese grammar error checking method based on multiple text features, characterized by comprising the following steps:
(1) represent the text as vectors using a pre-training model and grammatical prior knowledge respectively to obtain semantic feature vectors and part-of-speech feature vectors, and concatenate the part-of-speech feature vectors with the semantic feature vectors end to end to obtain the vector sequence of the text;
(2) extract the feature vector sequence of the text from the vector sequence obtained in step (1) using a Bi-LSTM model;
(3) apply attention enhancement based on semantic and part-of-speech collocation information to the feature vector sequence obtained in step (2);
(4) apply a linear transformation to the attention-enhanced feature vector sequence of step (3) to obtain a label prediction sequence;
(5) apply information enhancement based on word-order relation features to the label prediction sequence obtained in step (4);
(6) capture constraint information from the information-enhanced label prediction sequence of step (5), and judge the boundary positions of grammar errors based on the constraint information;
wherein step (5) comprises the following steps:
traversing the label prediction sequence obtained in step (4), extracting the verbs and adjectives, taking the main nouns in the minimal grammar unit where each verb or adjective is located as its strong associations, and adding them to the strong association set F_nx; if the minimal grammar unit contains no main noun, the corresponding strong association set is empty;
traversing the label prediction sequence obtained in step (4), and adding the main nouns marked as having grammar errors to the set F_ny;
cleaning the strong association sets by F′_nx = F_nx - F_nx ∩ F_ny and F′_ny = F_ny - F_nx ∩ F_ny;
summing the predicted grammar-error label values in the set F′_ny and taking their average to obtain l′_nx, and updating the strong associations l_nx of the set F′_nx to l′_nx = (l_1 + l_2 + … + l_g)/g, where g is the number of characters in F′_ny and l_1, …, l_g are their predicted grammar-error label values;
thereby obtaining the information-enhanced label prediction sequence.
2. The Chinese grammar error checking method based on multiple text features according to claim 1, characterized in that:
in step (1), the text is represented as vectors using the pre-training model, specifically:
each word in the text is characterized by 3 vectors: a word vector, a segment vector and a position vector; the 3 vectors are summed to obtain the semantic feature vector of each word.
3. The Chinese grammar error checking method based on multiple text features according to claim 1, characterized in that:
in step (1), the text is represented using grammatical prior knowledge, specifically:
the text is segmented with a Chinese word segmentation system to obtain the word segmentation result, and the part-of-speech feature vectors are generated by one-hot encoding.
4. The Chinese grammar error checking method based on multiple text features according to claim 1, characterized in that:
step (3) is specifically:
combining the LSTM outputs at each moment to obtain the feature encoding matrix M = [e_1, e_2, …, e_s]^T = [d_1, d_2, …, d_k], where e_i is the semantic code obtained by concatenating the forward and backward outputs of each time node of the LSTM, s is the number of time steps of the unrolled LSTM, and k is twice the number of LSTM hidden units;
compressing M to obtain the semantic and part-of-speech collocation information vector p = [max(d_1), max(d_2), …, max(d_k)];
applying a linear transformation to the feature vector p to obtain the attention weight W;
using W to apply the weighted update h′_j = W h_j to the hidden-unit output h_j at each LSTM moment, where h_j is the j-th dimension of the hidden-layer output and j = 1, 2, …, r, thereby completing the attention enhancement of the feature vector sequence based on semantic and part-of-speech collocation information.
5. The Chinese grammar error checking method based on multiple text features according to claim 1, characterized in that:
in step (6), the constraint information is captured using a CRF model.
6. A Chinese grammar error checking system based on multiple text features, characterized by comprising:
a vector representation module for representing the text as vectors using a pre-training model and grammatical prior knowledge respectively to obtain semantic feature vectors and part-of-speech feature vectors, and concatenating the part-of-speech feature vectors with the semantic feature vectors end to end to obtain the vector sequence of the text;
a feature extraction module for extracting the feature vector sequence of the text from the vector sequence obtained by the vector representation module using a Bi-LSTM model;
an attention enhancement module for applying attention enhancement based on semantic and part-of-speech collocation information to the feature vector sequence obtained by the feature extraction module;
a linear transformation module for applying a linear transformation to the attention-enhanced feature vector sequence to obtain a label prediction sequence;
an information enhancement module for applying information enhancement based on word-order relation features to the label prediction sequence obtained by the linear transformation module;
a capture module for capturing constraint information from the information-enhanced label prediction sequence and judging the boundary positions of grammar errors based on that constraint information;
wherein the information enhancement based on word-order relation features applied to the label prediction sequence obtained by the linear transformation module comprises the following steps:
traversing the label prediction sequence obtained by the linear transformation module, extracting the verbs and adjectives, taking the main nouns in the minimal grammar unit where each verb or adjective is located as its strong associations, and adding them to the strong association set F_nx; if the minimal grammar unit contains no main noun, the corresponding strong association set is empty;
traversing the label prediction sequence obtained by the linear transformation module, and adding the main nouns marked as having grammar errors to the set F_ny;
cleaning the strong association sets by F′_nx = F_nx - F_nx ∩ F_ny and F′_ny = F_ny - F_nx ∩ F_ny;
summing the predicted grammar-error label values in the set F′_ny and taking their average to obtain l′_nx, and updating the strong associations l_nx of the set F′_nx to l′_nx = (l_1 + l_2 + … + l_g)/g, where g is the number of characters in F′_ny and l_1, …, l_g are their predicted grammar-error label values;
thereby obtaining the information-enhanced label prediction sequence.
CN202011209481.9A 2020-11-03 2020-11-03 Chinese grammar error checking method and system based on multiple text features Active CN112183094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011209481.9A CN112183094B (en) 2020-11-03 2020-11-03 Chinese grammar error checking method and system based on multiple text features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011209481.9A CN112183094B (en) 2020-11-03 2020-11-03 Chinese grammar error checking method and system based on multiple text features

Publications (2)

Publication Number Publication Date
CN112183094A CN112183094A (en) 2021-01-05
CN112183094B (en) 2023-06-16

Family

ID=73917826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011209481.9A Active CN112183094B (en) Chinese grammar error checking method and system based on multiple text features

Country Status (1)

Country Link
CN (1) CN112183094B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362809B (en) * 2021-07-02 2023-02-21 上海淇玥信息技术有限公司 Voice recognition method and device and electronic equipment
CN113392649B (en) * 2021-07-08 2023-04-07 上海浦东发展银行股份有限公司 Identification method, device, equipment and storage medium
CN113609824A (en) * 2021-08-10 2021-11-05 上海交通大学 Multi-turn dialog rewriting method and system based on text editing and grammar error correction
CN113836286B (en) * 2021-09-26 2024-04-05 南开大学 Community orphan older emotion analysis method and system based on question-answer matching
CN114610891B (en) * 2022-05-12 2022-07-22 湖南工商大学 Law recommendation method and system for unbalanced judicial officials document data
CN116070595B (en) * 2023-03-07 2023-07-04 深圳市北科瑞讯信息技术有限公司 Speech recognition text error correction method and device, electronic equipment and storage medium
CN116070629A (en) * 2023-04-06 2023-05-05 北京蜜度信息技术有限公司 Chinese text word order checking method, system, storage medium and electronic equipment
CN117350283A (en) * 2023-10-11 2024-01-05 西安栗子互娱网络科技有限公司 Text defect detection method, device, equipment and storage medium


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0409425A2 (en) * 1989-07-15 1991-01-23 Keechung Kim Method and apparatus for translating language
CN103136196A (en) * 2008-04-18 2013-06-05 上海触乐信息科技有限公司 Methods used for inputting text into electronic device and correcting error
CN106775935A (en) * 2016-12-01 2017-05-31 携程旅游网络技术(上海)有限公司 The analytic method and its device and computer system of interpreted languages
CN106776549A (en) * 2016-12-06 2017-05-31 桂林电子科技大学 A kind of rule-based english composition syntax error correcting method
CN109948152A (en) * 2019-03-06 2019-06-28 北京工商大学 A kind of Chinese text grammer error correcting model method based on LSTM
CN110717334A (en) * 2019-09-10 2020-01-21 上海理工大学 Text emotion analysis method based on BERT model and double-channel attention
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Jacob Devlin et al.; arXiv; 1-16 *
Multi-task Learning for Chinese Word Usage Errors Detection; Jinbin Zhang et al.; arXiv; 1-4 *
Research on Semantic Error Detection Methods for Chinese Text (in Chinese); Zhang Yangsen et al.; Chinese Journal of Computers; 911-924 *
Research on a Multi-feature Chinese Text Proofreading Algorithm (in Chinese); Li Jianhua et al.; Computer Engineering and Science; 93-96 *
A Survey of Automatic Text Proofreading Techniques (in Chinese); Zhang Yangsen et al.; Application Research of Computers; 8-12 *

Also Published As

Publication number Publication date
CN112183094A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112183094B (en) Chinese grammar error checking method and system based on multiple text features
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
CN112733533B (en) Multi-modal named entity recognition method based on BERT model and text-image relation propagation
CN114580382A (en) Text error correction method and device
CN117151220B (en) Entity link and relationship based extraction industry knowledge base system and method
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
CN112612871A (en) Multi-event detection method based on sequence generation model
CN115048447A (en) Database natural language interface system based on intelligent semantic completion
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN110134950A (en) A kind of text auto-collation that words combines
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN115510863A (en) Question matching task oriented data enhancement method
CN114611520A (en) Text abstract generating method
Göker et al. Neural text normalization for turkish social media
CN111274354B (en) Referee document structuring method and referee document structuring device
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
Nguyen et al. Are word boundaries useful for unsupervised language learning?
CN116681061A (en) English grammar correction technology based on multitask learning and attention mechanism
Saetia et al. Semi-supervised Thai Sentence segmentation using local and distant word representations
CN114036246A (en) Commodity map vectorization method and device, electronic equipment and storage medium
CN114330350A (en) Named entity identification method and device, electronic equipment and storage medium
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant