CN108959260A - A Chinese grammar error detection method based on text-contextualized word vectors - Google Patents

A Chinese grammar error detection method based on text-contextualized word vectors

Info

Publication number
CN108959260A
CN108959260A CN201810735068.2A
Authority
CN
China
Prior art keywords
text
word vector
word
neural network
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810735068.2A
Other languages
Chinese (zh)
Other versions
CN108959260B (en)
Inventor
李思
赵建博
李明正
徐雅静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201810735068.2A priority Critical patent/CN108959260B/en
Publication of CN108959260A publication Critical patent/CN108959260A/en
Application granted granted Critical
Publication of CN108959260B publication Critical patent/CN108959260B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a Chinese grammar error detection method and device, belonging to the field of information processing. The method comprises: mapping each word of the input text to a word vector and concatenating the word vectors into a text matrix; using a recurrent neural network to form a mask indicating the importance of each component of the word vectors; rebuilding the text matrix; extracting contextual information with a recurrent neural network; computing an error score for each word with a feedforward neural network; and inferring error positions from the error scores. By grounding detection in text-contextualized word vectors, the invention improves the effect of Chinese grammar error detection and has great practical value.

Description

A Chinese grammar error detection method based on text-contextualized word vectors
Technical field
The present invention relates to the field of information processing, and in particular to a neural-network-based Chinese grammar error detection method.
Background technique
Chinese grammar error detection is a relatively new task in Chinese natural language processing. Its goal is to judge whether a sentence written by a non-native speaker of Chinese contains errors and to provide error information.
The most common approach to Chinese grammar error detection at present is to treat it as a supervised sequence labeling task. Commonly used models for grammatical error detection include N-gram models and recurrent neural networks. However, these models rely heavily on hand-crafted features and require considerable manual feature engineering. Recently, because neural networks can learn textual features on their own and thereby replace complex hand-crafted features, much work has attempted to apply neural networks to Chinese grammar error detection. However, most of this work does not make good use of the information expressed by Chinese vocabulary and ignores the fact that the same word may have different meanings in different texts. To solve this problem, the present invention uses a recurrent neural network to obtain a mask indicating the importance of each component of a word vector, and then applies a recurrent neural network again, achieving better error detection results.
Summary of the invention
To solve the existing technical problems, the present invention provides a neural-network-based Chinese grammar error detection method. The scheme is as follows:
Step 1: each word of the input text is mapped to a word vector, so that the input text is mapped to a text matrix.
Step 2: the text matrix is processed by a recurrent neural network to obtain a mask indicating the importance of each word vector component in the current text.
Step 3: the text matrix is processed with this mask to obtain a text matrix formed from the rebuilt word vectors.
Step 4: the rebuilt text matrix is fed into a recurrent neural network to obtain a feature representation of each word vector in the text.
Step 5: the feature representation of each word vector is passed through a feedforward neural network to obtain an error score for each word.
Step 6: error positions are inferred from the error scores of all words at the level of the whole text, yielding information on the erroneous words.
Brief description of the drawings
Fig. 1 is the network structure of the Chinese grammar error detection method provided by the invention.
Fig. 2 is the internal structure of a long short-term memory network unit.
Specific embodiment
Embodiments of the present invention are described in more detail below.
Fig. 1 shows the network structure of the error detection method provided by the invention, which includes:
Step S1: word-vectorize the input text;
Step S2: a recurrent neural network forms a mask indicating the importance of each word vector component;
Step S3: rebuild the text matrix;
Step S4: a recurrent neural network extracts contextual information;
Step S5: a feedforward neural network computes an error score for each word;
Step S6: error positions are inferred from the error scores;
Each step is described in detail below:
Step S1: word-vectorize the input text. The invention first builds a mapping dictionary from words to word indices and maps each word in the text to its corresponding word index. A word vector matrix is then built, in which each row corresponds to one word index and represents one word vector, so the matrix maps word indices to word vectors. The word vectors of the words in the text are concatenated to form the text matrix. Assuming the Chinese vocabulary contains N words, the word vector matrix can be expressed as an N*d matrix, where d is the dimension of the word vectors. After vectorization, the whole input text can be expressed as the text matrix x:
$x = x_1 \oplus x_2 \oplus \cdots \oplus x_n$
where $x_i$ is the word vector of the i-th word in the text, n is the text length, i.e. the number of words in the text, and $\oplus$ denotes column-wise concatenation of the vectors.
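To make this step concrete, the following is a minimal PyTorch sketch, an illustration under assumed names and sizes rather than the patented implementation: a toy word-to-index dictionary, an N*d embedding matrix, and the lookup that turns the input text into the text matrix x.

```python
import torch
import torch.nn as nn

# Assumed toy vocabulary: word -> word index (0 reserved for unknown words)
vocab = {"<unk>": 0, "我": 1, "喜欢": 2, "学习": 3, "中文": 4}
d = 8                                        # assumed word-vector dimension
embedding = nn.Embedding(len(vocab), d)      # the N*d word vector matrix

words = ["我", "喜欢", "学习", "中文"]          # segmented input text
ids = torch.tensor([[vocab.get(w, 0) for w in words]])   # word indices, shape (1, n)

# Text matrix x: the word vectors of the n words, concatenated
x = embedding(ids)                           # shape (1, n, d)
print(x.shape)                               # torch.Size([1, 4, 8])
```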
Step S2: a recurrent neural network forms the mask indicating the importance of each word vector component. Recurrent neural networks can extract the contextual information of text effectively and are widely used in text processing, for example in Chinese word segmentation and text classification. Compared with N-gram methods based on statistical learning, recurrent neural networks can attend to dependencies over longer spans and better capture the global information of a document. Traditional recurrent neural networks suffer from vanishing and exploding gradients, and the long short-term memory network (LSTM) solves this problem well: its input gate, forget gate and output gate allow it to control long-range dependencies more effectively. Current Chinese grammar error detection methods do not fully take into account the limitation of a single word vector in expressing word meaning; the same word may have different meanings in different texts. Therefore, a recurrent neural network is used here to form a mask indicating the importance of each component of the word vector, and the word vectors are rebuilt with this mask.
Fig. 2 shows the cell structure of a long short-term memory network; at time t it can be described as:
$i_t = \sigma(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i)$
$f_t = \sigma(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f)$
$o_t = \sigma(W_o \cdot x_t + U_o \cdot h_{t-1} + b_o)$
$\tilde{C}_t = \tanh(W_c \cdot x_t + U_c \cdot h_{t-1} + b_c)$
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
$h_t = o_t \odot \tanh(C_t)$
where x is the input vector, C is the memory cell, i is the input gate, f is the forget gate, and o is the output gate. $\sigma$ is the sigmoid activation function, $\odot$ denotes element-wise multiplication, and $\cdot$ denotes matrix multiplication. W and U are the weight matrices of the input and the hidden layer respectively, and b is the bias. $\tilde{C}_t$ is the candidate value of the memory cell, determined jointly by the current input and the previous hidden state. $C_t$ is the combined effect of the input gate acting on the candidate value and the forget gate acting on the memory cell value of the previous moment. The mask indicating the importance of each word vector component is given by the output of the long short-term memory network at the last position, which contains information about the entire text from beginning to end. The computation of the mask can be described as:
$mask = h_n$
where mask is the mask indicating the importance of each component of the word vectors.
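As a sketch of this step, an illustration under assumed names and sizes rather than the original code, a unidirectional LSTM can be run over the text matrix and its output at the last position taken as the mask, mirroring mask = h_n:

```python
import torch
import torch.nn as nn

d = 8                                     # word-vector dimension (assumed)
mask_lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

x = torch.randn(1, 4, d)                  # text matrix from step S1: (batch, n, d)

# Run the LSTM over the whole text; its output at the last position plays
# the role of mask = h_n, one importance weight per word-vector component.
outputs, _ = mask_lstm(x)                 # (1, n, d)
mask = outputs[:, -1, :]                  # (1, d)
```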
Step S3: rebuild the text matrix. Step S1 produced the text matrix formed by concatenating the word vectors, and Step S2 produced the mask indicating the importance of each word vector component in the current text. We consider each component of a word vector to represent a different aspect of the word's meaning, and the meaning of the same word may differ greatly between texts. The word vectors are therefore reconstructed as the element-wise product of the mask and each word vector. The mask indicates which aspects of word meaning the current text emphasizes: components suited to the current text are amplified, and components not expressed in the text are reduced.
The reconstructed word vectors are concatenated again to form the text matrix of rebuilt word vectors.
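A minimal sketch of the rebuilding step, assuming the shapes used in the previous sketches; each word vector is reweighted component by component with the mask:

```python
import torch

x = torch.randn(1, 4, 8)           # text matrix from step S1: (batch, n, d)
mask = torch.randn(1, 8)           # component-importance mask from step S2: (batch, d)

# Element-wise product of the mask with every word vector: components that
# matter in this text are amplified, the others are reduced.
x_rebuilt = x * mask.unsqueeze(1)  # broadcast over the n words -> (1, n, d)
```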
Step S4: a recurrent neural network extracts contextual information. The text matrix of rebuilt word vectors obtained in Step S3 is processed by a recurrent neural network to obtain the feature representation of each word vector in the text. A unidirectional long short-term memory network can extract forward information but cannot extract backward information well, so a bidirectional long short-term memory network is used to extract the contextual information of the text. A bidirectional long short-term memory network has memory units in two directions, extracting forward and backward text information respectively. The output vectors of the forward and backward units are concatenated as the feature representation of each word vector in the text:
$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$
where $\overrightarrow{h_t}$ is the output of the forward long short-term memory network at time t and $\overleftarrow{h_t}$ is the output of the backward long short-term memory network at time t; the two vectors are concatenated directly to form the feature representation of the word vector at time t.
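For illustration only (variable names and sizes are assumptions of this sketch), a bidirectional LSTM over the rebuilt text matrix yields, for each word, the concatenation of the forward and backward outputs:

```python
import torch
import torch.nn as nn

d = 8
bilstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True, bidirectional=True)

x_rebuilt = torch.randn(1, 4, d)   # rebuilt text matrix from step S3

# For every position t, PyTorch concatenates the forward and backward outputs,
# giving the per-word feature h_t = [h_t(forward); h_t(backward)] of size 2*d.
h, _ = bilstm(x_rebuilt)           # (1, n, 2*d)
```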
Step S5: a feedforward neural network computes an error score for each word. Step S4 produced the feature representation of each word vector in the text after processing by the recurrent neural network. The error score of each word is computed from the feature representation of its word vector:
$score_t = h_t \cdot W$
where W is the weight used to compute the score. Computing the scores of all words in the text yields the error scores of all words at the text level.
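A minimal sketch of the scoring layer, with a single linear layer standing in for the feedforward network (names and sizes are assumptions): score_t = h_t · W applied to every word.

```python
import torch
import torch.nn as nn

h = torch.randn(1, 4, 16)              # per-word features from the bidirectional LSTM: (batch, n, 2*d)

# One linear layer implements score_t = h_t . W, giving one error score per word.
scorer = nn.Linear(16, 1, bias=False)
scores = scorer(h).squeeze(-1)         # (1, n)
```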
Step S6: infer error positions from the error scores. Taking the two labels correct and erroneous as an example, correct is labeled 0 and erroneous is labeled 1. The larger a word's error score, the more likely that word is erroneous. Error positions in the text are judged from the scores, and error information is provided.
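As an illustration only (the 0.5 threshold is an assumption of this sketch, not part of the patent), the positions whose error score exceeds a threshold can be reported as erroneous words:

```python
import torch

scores = torch.tensor([[0.1, 0.8, 0.2, 0.6]])   # per-word error scores from step S5

# Binary labels: 0 = correct, 1 = erroneous; larger scores mean more likely erroneous.
labels = (scores > 0.5).long()                  # (1, n)
error_positions = labels.nonzero(as_tuple=False)[:, 1].tolist()
print(error_positions)                          # [1, 3]
```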
The specific embodiments of the proposed Chinese grammar error detection method based on text-contextualized word vectors and of its modules have been described above with reference to the accompanying drawings. From the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with a necessary general-purpose hardware platform.
According to the idea of the present invention, changes may be made to the specific implementation and the scope of application. In conclusion, the content of this description should not be construed as limiting the invention.
The embodiments described above do not limit the protection scope of the invention. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (10)

1. A Chinese grammar error detection method based on text-contextualized word vectors, characterized in that the method comprises the following structure and steps:
(1) word-vectorizing the input text: mapping the words of the input text, converting each word into its corresponding word vector, and thereby numerically converting the segmented input text into a text matrix formed by concatenating the word vectors of the individual words;
(2) forming, with a recurrent neural network, a mask indicating the importance of each word vector component: processing the text matrix obtained in step (1), where the recurrent neural network processes contextual information, to obtain a mask indicating the importance of each word vector component in the current text;
(3) rebuilding the text matrix: processing the word vectors obtained in step (1) with the mask obtained in step (2), which indicates the importance of each word vector component in the current text, to obtain the text matrix formed from the rebuilt word vectors;
(4) extracting contextual information with a recurrent neural network: processing the text matrix of rebuilt word vectors obtained in step (3), where the recurrent neural network extracts contextual information, to obtain the feature representation of each word vector in the text;
(5) computing an error score for each word with a feedforward neural network: processing the feature representation of each word vector in the text obtained in step (4), the feature representation being passed through a feedforward neural network to obtain the error score of each word in the text;
(6) inferring error positions from the error scores: processing the error scores obtained in step (5) and drawing inferences from the error scores of the individual words at the level of the whole text, to obtain information on the erroneous words.
2. The method as described in claim 1, characterized in that step (1) specifically comprises:
(1.1) initializing the mapping dictionary from words to word indices and the word vector matrix;
(1.2) mapping each word to its corresponding word index via the mapping dictionary;
(1.3) obtaining the corresponding word vector from the word vector matrix via the word index of each word;
(1.4) concatenating the word vectors to obtain the text matrix formed by concatenating the word vectors of the individual words.
3. The method as described in claim 1, characterized in that step (2) specifically comprises:
(2.1) initializing the parameters of the recurrent neural network;
(2.2) processing the text matrix obtained in step (1) with the recurrent neural units to obtain an output matrix;
(2.3) processing the output matrix to obtain the mask indicating the importance of each word vector component in the current text.
4. The method as described in claim 1, characterized in that the recurrent neural network of step (2) is a long short-term memory network.
5. The method as described in claim 3, characterized in that step (2.3) specifically comprises:
processing the output matrix obtained in step (2.2) and taking the last-layer output of the output matrix as the mask indicating the importance of each word vector component in the current text.
6. The method as described in claim 1, characterized in that step (3) specifically comprises:
taking the element-wise product of the mask obtained in step (2), which indicates the importance of each word vector component in the current text, with each word vector in the text matrix obtained in step (1), to obtain the text matrix formed from the rebuilt word vectors.
7. The method as described in claim 1, characterized in that step (4) specifically comprises:
(3.1) initializing the parameters of the recurrent neural network;
(3.2) processing the text matrix of rebuilt word vectors obtained in step (3) with the recurrent neural network to obtain the feature representation of each word vector in the text.
8. The method as described in claim 1, characterized in that the recurrent neural network of step (4) is a bidirectional long short-term memory network.
9. The method as described in claim 1, characterized in that step (5) specifically comprises:
(4.1) initializing the parameters of the feedforward neural network;
(4.2) inputting the feature representation of each word vector in the text obtained in step (4) into the feedforward neural network, to obtain the error score of each word in the text.
10. The method as described in claim 1, characterized in that step (6) specifically comprises:
processing the error scores of the words in the text obtained in step (5); at the level of the whole text, the word with the highest error score is considered erroneous, yielding information on the erroneous word.
CN201810735068.2A 2018-07-06 2018-07-06 A Chinese grammar error detection method based on text-contextualized word vectors Active CN108959260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810735068.2A CN108959260B (en) 2018-07-06 2018-07-06 A Chinese grammar error detection method based on text-contextualized word vectors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810735068.2A CN108959260B (en) 2018-07-06 2018-07-06 A Chinese grammar error detection method based on text-contextualized word vectors

Publications (2)

Publication Number Publication Date
CN108959260A true CN108959260A (en) 2018-12-07
CN108959260B CN108959260B (en) 2019-05-28

Family

ID=64486041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810735068.2A Active CN108959260B (en) 2018-07-06 2018-07-06 A Chinese grammar error detection method based on text-contextualized word vectors

Country Status (1)

Country Link
CN (1) CN108959260B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377882A (en) * 2019-07-17 2019-10-25 标贝(深圳)科技有限公司 Method, apparatus, system and storage medium for determining the pinyin of a text
CN110889284A (en) * 2019-12-04 2020-03-17 成都中科云集信息技术有限公司 Multi-task learning Chinese language disease diagnosis method based on bidirectional long-time and short-time memory network
CN111767718A (en) * 2020-07-03 2020-10-13 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN111950292A (en) * 2020-06-22 2020-11-17 北京百度网讯科技有限公司 Training method of text error correction model, and text error correction processing method and device
CN112364631A (en) * 2020-09-21 2021-02-12 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning
US20210216725A1 (en) * 2020-01-14 2021-07-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing information

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
US20150248608A1 (en) * 2014-02-28 2015-09-03 Educational Testing Service Deep Convolutional Neural Networks for Automated Scoring of Constructed Responses
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150248608A1 (en) * 2014-02-28 2015-09-03 Educational Testing Service Deep Convolutional Neural Networks for Automated Scoring of Constructed Responses
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FEILIANG REN ET AL: "A dynamic weighted method with support vector machines for Chinese word segmentation", 《2005 INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING》 *
陈珂 et al.: "Sentiment analysis of Chinese microblogs based on multi-channel convolutional neural networks" (基于多通道卷积神经网络的中文微博情感分析), Journal of Computer Research and Development (《计算机研究与发展》) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377882A (en) * 2019-07-17 2019-10-25 标贝(深圳)科技有限公司 Method, apparatus, system and storage medium for determining the pinyin of a text
CN110377882B (en) * 2019-07-17 2023-06-09 标贝(深圳)科技有限公司 Method, apparatus, system and storage medium for determining pinyin of text
CN110889284A (en) * 2019-12-04 2020-03-17 成都中科云集信息技术有限公司 Multi-task learning Chinese language disease diagnosis method based on bidirectional long-time and short-time memory network
CN110889284B (en) * 2019-12-04 2023-04-07 成都中科云集信息技术有限公司 Multi-task learning Chinese language sickness diagnosis method based on bidirectional long-time and short-time memory network
US20210216725A1 (en) * 2020-01-14 2021-07-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing information
US11775776B2 (en) * 2020-01-14 2023-10-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing information
CN111950292A (en) * 2020-06-22 2020-11-17 北京百度网讯科技有限公司 Training method of text error correction model, and text error correction processing method and device
CN111950292B (en) * 2020-06-22 2023-06-27 北京百度网讯科技有限公司 Training method of text error correction model, text error correction processing method and device
CN111767718A (en) * 2020-07-03 2020-10-13 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN111767718B (en) * 2020-07-03 2021-12-07 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN112364631A (en) * 2020-09-21 2021-02-12 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning
CN112364631B (en) * 2020-09-21 2022-08-02 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning

Also Published As

Publication number Publication date
CN108959260B (en) 2019-05-28

Similar Documents

Publication Publication Date Title
CN108959260B (en) A Chinese grammar error detection method based on text-contextualized word vectors
Song et al. Mass: Masked sequence to sequence pre-training for language generation
CN107291693B (en) Semantic calculation method for improved word vector model
CN108984525B (en) A kind of Chinese grammer error-detecting method based on the term vector that text information is added
CN106294322A (en) A kind of Chinese based on LSTM zero reference resolution method
CN110555083B (en) Non-supervision entity relationship extraction method based on zero-shot
CN110334219A (en) The knowledge mapping for incorporating text semantic feature based on attention mechanism indicates learning method
CN109697285A (en) Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness
CN109726389A (en) A kind of Chinese missing pronoun complementing method based on common sense and reasoning
CN109299462A (en) Short text similarity calculating method based on multidimensional convolution feature
CN113553440B (en) Medical entity relationship extraction method based on hierarchical reasoning
CN109492223A (en) A kind of Chinese missing pronoun complementing method based on ANN Reasoning
CN110175221A (en) Utilize the refuse messages recognition methods of term vector combination machine learning
CN108345583A (en) Event recognition and sorting technique based on multi-lingual attention mechanism and device
CN109766553A (en) A kind of Chinese word cutting method of the capsule model combined based on more regularizations
Zhang et al. Multi-gram CNN-based self-attention model for relation classification
Narayan et al. Neural network based parts of speech tagger for hindi
CN114254645A (en) Artificial intelligence auxiliary writing system
CN113297374B (en) Text classification method based on BERT and word feature fusion
CN108932222A (en) A kind of method and device obtaining the word degree of correlation
CN111723572B (en) Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM
CN116702760A (en) Geographic naming entity error correction method based on pre-training deep learning
pal Singh et al. Naive Bayes classifier for word sense disambiguation of Punjabi language
CN115659981A (en) Named entity recognition method based on neural network model
CN113177120B (en) Quick information reorganizing method based on Chinese text classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant