CN109614612A - A Chinese text error correction method based on seq2seq+attention - Google Patents

A Chinese text error correction method based on seq2seq+attention

Info

Publication number
CN109614612A
Authority
CN
China
Prior art keywords
character
layer
vector
moment
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811441697.0A
Other languages
Chinese (zh)
Inventor
李石君
邓永康
杨济海
余伟
余放
李宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201811441697.0A priority Critical patent/CN109614612A/en
Publication of CN109614612A publication Critical patent/CN109614612A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a Chinese text error correction method based on seq2seq+attention. It belongs to the research area of data quality and involves technical fields such as RNNs, bidirectional RNNs, LSTM, seq2seq, and the attention mechanism. A seq2seq+attention neural network model is constructed, mainly for communication equipment maintenance records; the model is trained with the Adam optimization method, and the trained model is then used to perform the error correction task. The neural network model used in this method can be widely applied to text error correction in other fields, which to a certain extent avoids redesigning the model.

Description

A Chinese text error correction method based on seq2seq+attention
Technical field
The invention belongs to the technical field of Chinese text error correction, and relates in particular to error correction of the communication equipment maintenance records generated in a power communication management system.
Background art
The main research objects, key technologies, and practical application value involved in this field include the following:
Power communication management system: the dedicated power communication network system that serves as an important support of the smart grid. The communication management system "SG-TMS" is deployed at two levels (headquarters and provincial companies) and applied at four levels (headquarters, branches, provincial companies, and city and county companies). Through standardized project construction and continuous improvement of system functions, "SG-TMS" has been deeply integrated into the daily work of tens of thousands of power communication professionals and has comprehensively collected the construction, operation, and management data of tens of thousands of devices over the past several years. The accumulated massive power communication data, together with data from numerous external systems, form the basis for carrying out big data analysis.
Equipment maintenance records: the information management system of smart grid communication has accumulated a large amount of maintenance data, operation-mode data, and work-log data. These include standardized structured data such as maintenance type and execution date, semi-structured data similar to logs, and unstructured data such as operation-mode descriptions, "three measures and one plan" documents, and pictures. Through in-depth analysis and mining of these process and result data, management rules can be summarized so as to optimize and rationally adjust the existing system and management approach. Big data techniques can also be used to realize automatic machine pre-review of procedures such as operation modes and "three measures and one plan" work, intelligent completion of missing entries, and automatic assisted error correction of records, thereby reducing the labor intensity of managers and improving the efficiency of operation-mode and maintenance approval and the standardization of records.
Character granularity: there are two main reasons for working at character granularity. First, because equipment maintenance records contain wrong or missing characters, word segmentation results are inaccurate, so the error correction task is not suitable for word granularity. Second, with a given fixed vocabulary, word-based error correction cannot handle out-of-vocabulary (OOV) words.
RNN: an RNN is a sequence model that captures the dynamics of a sequence through cycles among the network nodes. Unlike a standard feed-forward neural network, an RNN can retain state information over a context window of arbitrary length. Although RNNs are difficult to train and usually contain millions of parameters, recent advances in optimization techniques and parallel computation allow them to be learned successfully at scale.
Bidirectional RNN: in a classical recurrent neural network, state is propagated in one direction, from front to back. In some problems, however, the output at the current time step is related not only to the preceding states but also to the following states, and a bidirectional RNN is needed to solve such problems. For example, in the text error correction task of the present invention, predicting a missing character in a sentence requires not only the preceding context but also the content that follows; this is where a bidirectional RNN is effective.
LSTM: Long Short-Term Memory network. An RNN runs into great difficulty when handling long-term dependencies (nodes that are far apart in the sequence), because computing the connection between distant nodes involves repeated multiplication of Jacobian matrices, which causes gradients to vanish or explode. To solve this problem, Sepp Hochreiter et al. [3] proposed the LSTM model, which adds an input gate, a forget gate, and an output gate so that the weight of the self-loop becomes variable; thus, even with fixed model parameters, the integration scale at different time steps can change dynamically, avoiding the problem of vanishing or exploding gradients.
seq2seq: seq2seq is a network with an Encoder-Decoder structure whose input is a sequence and whose output is also a sequence. The Encoder turns a variable-length input sequence into a fixed-length vector representation, and the Decoder turns this fixed-length vector into a variable-length target sequence.
Attention mechanism: a neural network model with an encoder-decoder structure has to express the necessary information of the input sequence as a vector of fixed length, and it is then difficult to retain all of the necessary information when the input sequence is very long, especially when it is longer than the sequences in the training data. Dzmitry Bahdanau et al. use an attention mechanism so that, every time a word is generated, the model finds the most relevant set of words in the input sequence, and then predicts the next target word according to the current context vector and all previously generated words.
Summary of the invention
For the detection and correction of errors in communication equipment maintenance records, the invention proposes a seq2seq neural network model based on the attention mechanism. The steps for realizing maintenance record error correction are as follows:
A Chinese text error correction method based on seq2seq+attention, characterized by comprising the following steps:
Step 1, text preprocessing: first, the maintenance records in the database are read with Python and all the content in the document files is extracted; Chinese sentence segmentation is then performed with regular expressions and the result is stored in a text file, with each line corresponding to one sentence; at the same time, the manually annotated correct text is stored in another text file, in one-to-one correspondence with the original document. The proprietary symbols of the power communication field and the common Chinese character table are recorded and together constitute the character list;
Step 2, construct the seq2seq neural network model based on attention, specifically including:
Step 2.1, construct the Encoder module, which includes an Embedding Layer and M bidirectional LSTM layers, in which:
Layer one, the Embedding Layer: its input is the one-hot encoding of the current character, which can be obtained from the character list formed in step 1; its output is the character vector of the current character, i.e.:

e_t = E^T · x_t

where x_t is the one-hot encoding of the character input at time t and is a v-dimensional vector, v being the total number of characters in the character list obtained in step 1; E is the character vector matrix of size v × d (in the specific implementation d is a number between 100 and 200 and represents the dimension of each character vector); the matrix E is a model parameter obtained by training; e_t is the character vector of the character input at time t;
In the specific implementation, the tf.nn.embedding_lookup function in TensorFlow is used to obtain the character vectors;
Layer two, the M bidirectional LSTM layers: the basic unit is the LSTM. For each direction, the hidden state of the LSTM in layer j at time t is computed with the standard LSTM update:

i_t^j = σ(W_i^j h_t^(j-1) + U_i^j h_(t-1)^j + b_i^j)
f_t^j = σ(W_f^j h_t^(j-1) + U_f^j h_(t-1)^j + b_f^j)
o_t^j = σ(W_o^j h_t^(j-1) + U_o^j h_(t-1)^j + b_o^j)
g_t^j = tanh(W_c^j h_t^(j-1) + U_c^j h_(t-1)^j + b_c^j)
c_t^j = f_t^j ⊙ c_(t-1)^j + i_t^j ⊙ g_t^j
h_t^j = o_t^j ⊙ tanh(c_t^j)

where h_t^0 = e_t is the input character vector; the hidden state h_0^j and the cell state c_0^j are initialized as zero vectors; σ(x) = 1/(1+e^(-x)); tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)); the weight matrices W, U and bias vectors b are model parameters obtained by training; and ⊙ denotes element-wise multiplication;

Applying these formulas from the first character to the last gives the hidden state vector propagated from front to back, and applying them from the last character to the first gives the hidden state vector propagated from back to front; the output of the Bi-LSTM in layer j at time t is the concatenation of the two;
In the specific implementation, this can be realized with the BasicLSTMCell function in TensorFlow;
Step 2.2, construct the Decoder module: the Decoder module is an M-layer unidirectional LSTM language model, and the initial state vector of each layer comes from the Encoder. Denote by s_t^M the output of the M-th (top) layer of the Decoder at time t; after a softmax transformation, the probability of each character is obtained as follows:

p_t = softmax(W_p · s_t^M + b_p)

The loss function is the cross entropy:

loss = - Σ_{t=1..P} y_t^T · log(p_t)

where y_t is the one-hot encoding of the character output at time t, and P is the length of the output sequence;
Step 2.3, construct the attention module: from the Encoder module, the hidden state vectors of the last layer at all time steps, h_1, h_2, ..., h_Q, are obtained, where Q is the total number of input characters; from the Decoder module, the hidden state vector of the first layer at time t-1, s_(t-1), is obtained. The attention vector is computed as follows:

w_i = score(s_(t-1), h_i), i = 1, ..., Q, where score(·,·) measures the similarity between s_(t-1) and h_i
α = softmax(w)
β = Σ_{i=1..Q} α_i · h_i

The context vector β is added in when computing the hidden state of each Decoder layer at time t;
Step 3, model training with the Adam optimization method: after the neural network model is constructed, the whole data flow from input to output exists in the form of a computation graph in TensorFlow, and training is carried out iteratively with tf.train.AdamOptimizer().minimize() to find the optimal parameter values;
Step 4, perform the error correction task, specifically:
In the Decoder stage, inference is carried out with beam search with a beam size of 2: at each step the two characters with the highest probability are selected and used as the input of the next prediction; when the end-of-sentence token <EOS> is encountered, inference stops and the output sequence is obtained.
Therefore, the present invention has the following advantages:
(1) Compared with existing rule-based error correction schemes, the end-to-end deep model avoids manual feature extraction and reduces manual workload; moreover, models of the RNN family have a strong ability to fit text tasks.
(2) Adding the attention mechanism to the seq2seq model gives better results on long texts; the model converges more easily and performs better.
Description of the drawings
Fig. 1 is the overall framework of the neural network.
Specific embodiment
In specific implementation, the technical solution provided by the present invention can be implemented by those skilled in the art as an automatically running process using computer software technology. The technical solution of the present invention is described in detail below in conjunction with the drawings and embodiments.
Step 1: Text preprocessing
The maintenance records in the database are read with the relevant tools in Python and all the content in the document files is extracted; Chinese sentence segmentation is then performed with regular expressions and the result is stored in a text file, with each line corresponding to one sentence; at the same time, the manually annotated correct text is stored in another text file, in one-to-one correspondence with the original document. The proprietary symbols of the power communication field and the common Chinese character table are recorded and together constitute the character list used in the present invention.
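Purely as an illustration, the following is a minimal sketch of this preprocessing step in Python. The punctuation set, file names, and special symbols are assumptions; the patent only specifies regular-expression sentence segmentation, one sentence per line, and a character list built from domain symbols plus the characters in the corpus.

```python
# -*- coding: utf-8 -*-
import re

# Sentence-final punctuation used for splitting (an assumed set).
SENTENCE_END = u'。！？；'

def split_sentences(text):
    """Split raw maintenance-record text into Chinese sentences with a regular expression."""
    pieces = re.findall(u'[^' + SENTENCE_END + u']+[' + SENTENCE_END + u']?', text)
    return [p.strip() for p in pieces if p.strip()]

def build_char_list(sentences, domain_symbols):
    """Character list = proprietary power-communication symbols + all characters seen in the corpus."""
    chars = set(domain_symbols)
    for sent in sentences:
        chars.update(sent)
    return sorted(chars)

if __name__ == '__main__':
    # 'maintenance_records.txt' and 'sentences.txt' are hypothetical file names.
    with open('maintenance_records.txt', encoding='utf-8') as f:
        sentences = split_sentences(f.read())
    with open('sentences.txt', 'w', encoding='utf-8') as f:
        f.write(u'\n'.join(sentences))
    char_list = build_char_list(sentences, domain_symbols=[u'<EOS>', u'<PAD>', u'<UNK>'])
```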
Step 2: Construct the seq2seq neural network model based on attention
(1) Encoder module
The Encoder module mainly includes two parts: an Embedding Layer and M bidirectional LSTM layers.
The input of the Embedding Layer is the one-hot encoding of the current character, which can be obtained from the character list formed in step 1. The output of the Embedding Layer is the character vector of the current character, i.e.:

e_t = E^T · x_t

where x_t is the one-hot encoding of the character input at time t and is a v-dimensional vector, v being the total number of characters in the character list obtained in step 1. E is the character vector matrix of size v × d; in the specific implementation d is a number between 100 and 200 and represents the dimension of each character vector; the matrix E is a model parameter obtained by training. e_t is the character vector of the character input at time t.
In the specific implementation, the tf.nn.embedding_lookup function in TensorFlow is used to obtain the character vectors.
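A minimal sketch of this lookup under TensorFlow 1.x conventions (the API name matches the one cited in the text); the placeholder names and the example values of v and d are assumptions.

```python
import tensorflow as tf  # TensorFlow 1.x style, as suggested by the APIs named in the patent

vocab_size = 5000      # v: total number of characters in the character list (example value)
embed_dim = 128        # d: character-vector dimension, between 100 and 200 per the text

# Integer character ids replace explicit one-hot vectors; the lookup is
# mathematically equivalent to E^T · x_t for a one-hot x_t.
char_ids = tf.placeholder(tf.int32, shape=[None, None], name='char_ids')   # [batch, time]
E = tf.get_variable('char_embedding', shape=[vocab_size, embed_dim])        # trained with the model
e = tf.nn.embedding_lookup(E, char_ids)                                      # [batch, time, d]
```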
The basic unit of the M bidirectional LSTM layers is the LSTM. For each direction, the hidden state of the LSTM in layer j at time t is computed with the standard LSTM update:

i_t^j = σ(W_i^j h_t^(j-1) + U_i^j h_(t-1)^j + b_i^j)
f_t^j = σ(W_f^j h_t^(j-1) + U_f^j h_(t-1)^j + b_f^j)
o_t^j = σ(W_o^j h_t^(j-1) + U_o^j h_(t-1)^j + b_o^j)
g_t^j = tanh(W_c^j h_t^(j-1) + U_c^j h_(t-1)^j + b_c^j)
c_t^j = f_t^j ⊙ c_(t-1)^j + i_t^j ⊙ g_t^j
h_t^j = o_t^j ⊙ tanh(c_t^j)

where h_t^0 = e_t is the input character vector; the hidden state h_0^j and the cell state c_0^j are initialized as zero vectors; σ(x) = 1/(1+e^(-x)); tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)); the weight matrices W, U and bias vectors b are model parameters obtained by training; and ⊙ denotes element-wise multiplication.

Applying these formulas from the first character to the last gives the hidden state vector propagated from front to back, and applying them from the last character to the first gives the hidden state vector propagated from back to front; the output of the Bi-LSTM in layer j at time t is the concatenation of the two.
In the specific implementation, this can be realized with the BasicLSTMCell function in TensorFlow.
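A minimal sketch of one way to stack M bidirectional LSTM layers from BasicLSTMCell using tf.nn.bidirectional_dynamic_rnn; the exact wiring is not prescribed by the patent, so the layer structure below is an assumption.

```python
import tensorflow as tf

def build_encoder(inputs, num_layers, hidden_size, sequence_length):
    """inputs: [batch, time, d] character vectors from the Embedding Layer."""
    layer_input = inputs
    final_states = []
    for j in range(num_layers):
        with tf.variable_scope('encoder_layer_%d' % j):
            cell_fw = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
            cell_bw = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
            (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
                cell_fw, cell_bw, layer_input,
                sequence_length=sequence_length, dtype=tf.float32)
            # Output of layer j at each time step: concatenation of the
            # forward and backward hidden states.
            layer_input = tf.concat([out_fw, out_bw], axis=-1)
            final_states.append((state_fw, state_bw))
    # layer_input now holds the last layer's states h_1..h_Q used by the attention module;
    # final_states are used to initialize the corresponding Decoder layers.
    return layer_input, final_states
```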
(2) Decoder module
The Decoder module is an M-layer unidirectional LSTM language model, and the initial state vector of each layer comes from the Encoder. Denote by s_t^M the output of the M-th (top) layer of the Decoder at time t; after a softmax transformation, the probability of each character is obtained as follows:

p_t = softmax(W_p · s_t^M + b_p)

The loss function is the cross entropy:

loss = - Σ_{t=1..P} y_t^T · log(p_t)

where y_t is the one-hot encoding of the character output at time t, and P is the length of the output sequence.
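A minimal sketch of the output projection, softmax, and cross-entropy loss, assuming the top-layer Decoder outputs are already available as a tensor; the projection layer and the padding mask are assumed implementation details not spelled out in the patent.

```python
import tensorflow as tf

def decoder_loss(decoder_top_outputs, target_ids, vocab_size, target_mask):
    """decoder_top_outputs: [batch, P, hidden] outputs of the M-th Decoder layer.
    target_ids: [batch, P] gold character ids (integer form of the one-hot y_t).
    target_mask: [batch, P] 1.0 for real characters, 0.0 for padding.
    """
    logits = tf.layers.dense(decoder_top_outputs, vocab_size, name='output_projection')
    # p_t = softmax(logits_t); the cross entropy below equals -y_t^T log(p_t).
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target_ids, logits=logits)
    loss = tf.reduce_sum(ce * target_mask) / tf.reduce_sum(target_mask)
    return loss
```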
(3) attention module
From the Encoder module, the hidden state vectors of the last layer at all time steps, h_1, h_2, ..., h_Q, are obtained, where Q is the total number of input characters. From the Decoder module, the hidden state vector of the first layer at time t-1, s_(t-1), is obtained. The attention vector is computed as follows:

w_i = score(s_(t-1), h_i), i = 1, ..., Q, where score(·,·) measures the similarity between s_(t-1) and h_i
α = softmax(w)
β = Σ_{i=1..Q} α_i · h_i

The context vector β is added in when computing the hidden state of each Decoder layer at time t.
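A minimal sketch of the attention computation; a dot-product score between s_(t-1) and each h_i is assumed here, since the patent does not spell out the exact score function.

```python
import tensorflow as tf

def attention_context(encoder_states, decoder_state):
    """encoder_states: [batch, Q, hidden] last-layer Encoder states h_1..h_Q.
    decoder_state:  [batch, hidden] first-layer Decoder state s_(t-1).
    Returns the context vector beta fed into every Decoder layer at time t.
    """
    # w_i = s_(t-1) · h_i  (dot-product score, an assumption)
    w = tf.reduce_sum(encoder_states * tf.expand_dims(decoder_state, 1), axis=-1)   # [batch, Q]
    alpha = tf.nn.softmax(w)                                                          # [batch, Q]
    beta = tf.reduce_sum(tf.expand_dims(alpha, -1) * encoder_states, axis=1)         # [batch, hidden]
    return beta
```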
Step 3: Model training with the Adam optimization method
After the neural network model is constructed, the whole data flow from input to output exists in the form of a computation graph in TensorFlow, and training is carried out iteratively with tf.train.AdamOptimizer().minimize() to find the optimal parameter values.
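A minimal sketch of the training loop, assuming the loss tensor and input placeholders built in the sketches above; the batch generator, batch size, epoch count, and learning rate are illustrative, not specified by the patent.

```python
import tensorflow as tf

# 'loss', 'char_ids', and 'target_ids' are assumed to be the tensors built in the
# graph-construction sketches above; 'get_batches' and 'train_data' are hypothetical
# helpers for the preprocessed sentence pairs.
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
num_epochs = 10   # illustrative value
batch_size = 64   # illustrative value

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(num_epochs):
        for src_batch, tgt_batch in get_batches(train_data, batch_size):
            _, batch_loss = sess.run(
                [train_op, loss],
                feed_dict={char_ids: src_batch, target_ids: tgt_batch})
```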
Step 4: Perform the error correction task
In the Decoder stage, inference is carried out with beam search with a beam size of 2: at each step the two characters with the highest probability are selected and used as the input of the next prediction; when the end-of-sentence token <EOS> is encountered, inference stops and the output sequence is obtained.
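A minimal sketch of beam-search decoding with beam size 2 in plain Python; decode_step is a hypothetical helper wrapping one step of the trained Decoder, and the start and <EOS> token ids are assumptions.

```python
import numpy as np

def beam_search_decode(decode_step, start_id, eos_id, beam_size=2, max_len=100):
    """decode_step(prefix) -> np.ndarray of log-probabilities over the character list
    for the next character, given the partial output 'prefix' (hypothetical helper)."""
    beams = [([start_id], 0.0)]            # (partial sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = decode_step(seq)
            # keep the beam_size most probable next characters for this hypothesis
            for char_id in np.argsort(log_probs)[-beam_size:]:
                candidates.append((seq + [int(char_id)], score + float(log_probs[char_id])))
        # prune back down to beam_size hypotheses overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos_id:
                finished.append((seq, score))   # stop expanding once <EOS> is produced
            else:
                beams.append((seq, score))
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0]
```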
The specific embodiment described herein is only an example of the spirit of the present invention. Those skilled in the art to which the present invention belongs can modify, supplement, or substitute the described specific embodiment in similar ways without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Claims (1)

1. A Chinese text error correction method based on seq2seq+attention, characterized by comprising the following steps:
Step 1, text preprocessing: first, the maintenance records in the database are read with Python and all the content in the document files is extracted; Chinese sentence segmentation is then performed with regular expressions and the result is stored in a text file, with each line corresponding to one sentence; at the same time, the manually annotated correct text is stored in another text file, in one-to-one correspondence with the original document. The proprietary symbols of the power communication field and the common Chinese character table are recorded and together constitute the character list;
Step 2, construct the seq2seq neural network model based on attention, specifically including:
Step 2.1, construct the Encoder module, which includes an Embedding Layer and M bidirectional LSTM layers, in which:
Layer one, the Embedding Layer: its input is the one-hot encoding of the current character, which can be obtained from the character list formed in step 1; its output is the character vector of the current character, i.e.:

e_t = E^T · x_t

where x_t is the one-hot encoding of the character input at time t and is a v-dimensional vector, v being the total number of characters in the character list obtained in step 1; E is the character vector matrix of size v × d (in the specific implementation d is a number between 100 and 200 and represents the dimension of each character vector); the matrix E is a model parameter obtained by training; e_t is the character vector of the character input at time t;
In the specific implementation, the tf.nn.embedding_lookup function in TensorFlow is used to obtain the character vectors;
Layer two, the M bidirectional LSTM layers: the basic unit is the LSTM. For each direction, the hidden state of the LSTM in layer j at time t is computed with the standard LSTM update:

i_t^j = σ(W_i^j h_t^(j-1) + U_i^j h_(t-1)^j + b_i^j)
f_t^j = σ(W_f^j h_t^(j-1) + U_f^j h_(t-1)^j + b_f^j)
o_t^j = σ(W_o^j h_t^(j-1) + U_o^j h_(t-1)^j + b_o^j)
g_t^j = tanh(W_c^j h_t^(j-1) + U_c^j h_(t-1)^j + b_c^j)
c_t^j = f_t^j ⊙ c_(t-1)^j + i_t^j ⊙ g_t^j
h_t^j = o_t^j ⊙ tanh(c_t^j)

where h_t^0 = e_t is the input character vector; the hidden state h_0^j and the cell state c_0^j are initialized as zero vectors; σ(x) = 1/(1+e^(-x)); tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)); the weight matrices W, U and bias vectors b are model parameters obtained by training; and ⊙ denotes element-wise multiplication;

Applying these formulas from the first character to the last gives the hidden state vector propagated from front to back, and applying them from the last character to the first gives the hidden state vector propagated from back to front; the output of the Bi-LSTM in layer j at time t is the concatenation of the two;
In the specific implementation, this can be realized with the BasicLSTMCell function in TensorFlow;
Step 2.2, construct the Decoder module: the Decoder module is an M-layer unidirectional LSTM language model, and the initial state vector of each layer comes from the Encoder. Denote by s_t^M the output of the M-th (top) layer of the Decoder at time t; after a softmax transformation, the probability of each character is obtained as follows:

p_t = softmax(W_p · s_t^M + b_p)

The loss function is the cross entropy:

loss = - Σ_{t=1..P} y_t^T · log(p_t)

where y_t is the one-hot encoding of the character output at time t, and P is the length of the output sequence;
Step 2.3, construct the attention module: from the Encoder module, the hidden state vectors of the last layer at all time steps, h_1, h_2, ..., h_Q, are obtained, where Q is the total number of input characters; from the Decoder module, the hidden state vector of the first layer at time t-1, s_(t-1), is obtained. The attention vector is computed as follows:

w_i = score(s_(t-1), h_i), i = 1, ..., Q, where score(·,·) measures the similarity between s_(t-1) and h_i
α = softmax(w)
β = Σ_{i=1..Q} α_i · h_i

The context vector β is added in when computing the hidden state of each Decoder layer at time t;
Step 3, model training with the Adam optimization method: after the neural network model is constructed, the whole data flow from input to output exists in the form of a computation graph in TensorFlow, and training is carried out iteratively with tf.train.AdamOptimizer().minimize() to find the optimal parameter values;
Step 4, perform the error correction task, specifically:
In the Decoder stage, inference is carried out with beam search with a beam size of 2: at each step the two characters with the highest probability are selected and used as the input of the next prediction; when the end-of-sentence token <EOS> is encountered, inference stops and the output sequence is obtained.
CN201811441697.0A 2018-11-29 2018-11-29 A kind of Chinese text error correction method based on seq2seq+attention Pending CN109614612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811441697.0A CN109614612A (en) 2018-11-29 2018-11-29 A kind of Chinese text error correction method based on seq2seq+attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811441697.0A CN109614612A (en) 2018-11-29 2018-11-29 A kind of Chinese text error correction method based on seq2seq+attention

Publications (1)

Publication Number Publication Date
CN109614612A true CN109614612A (en) 2019-04-12

Family

ID=66005393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811441697.0A Pending CN109614612A (en) 2018-11-29 2018-11-29 A kind of Chinese text error correction method based on seq2seq+attention

Country Status (1)

Country Link
CN (1) CN109614612A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188353A (en) * 2019-05-28 2019-08-30 百度在线网络技术(北京)有限公司 Text error correction method and device
CN110705399A (en) * 2019-09-19 2020-01-17 安徽七天教育科技有限公司 Method for automatically identifying mathematical formula
CN110866390A (en) * 2019-10-15 2020-03-06 平安科技(深圳)有限公司 Method and device for recognizing Chinese grammar error, computer equipment and storage medium
CN111175852A (en) * 2019-12-27 2020-05-19 中国电子科技集团公司第十四研究所 Airport fog forecast early warning method based on long-time memory algorithm
CN114548080A (en) * 2022-04-24 2022-05-27 长沙市智为信息技术有限公司 Chinese wrong character correction method and system based on word segmentation enhancement

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059869A1 (en) * 2006-09-01 2008-03-06 The Regents Of The University Of California Low cost, high performance error detection and correction
CN107769972A (en) * 2017-10-25 2018-03-06 武汉大学 A kind of power telecom network equipment fault Forecasting Methodology based on improved LSTM
CN108170686A (en) * 2017-12-29 2018-06-15 科大讯飞股份有限公司 Text interpretation method and device
CN108491372A (en) * 2018-01-31 2018-09-04 华南理工大学 A kind of Chinese word cutting method based on seq2seq models

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059869A1 (en) * 2006-09-01 2008-03-06 The Regents Of The University Of California Low cost, high performance error detection and correction
CN107769972A (en) * 2017-10-25 2018-03-06 武汉大学 A kind of power telecom network equipment fault Forecasting Methodology based on improved LSTM
CN108170686A (en) * 2017-12-29 2018-06-15 科大讯飞股份有限公司 Text interpretation method and device
CN108491372A (en) * 2018-01-31 2018-09-04 华南理工大学 A kind of Chinese word cutting method based on seq2seq models

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XU JUN et al.: "Chinese error correction task based on the seq2seq model", People's Daily Online (人民网) *
YANG LIN: "Research and implementation of English writing error detection and correction based on recurrent neural networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188353A (en) * 2019-05-28 2019-08-30 百度在线网络技术(北京)有限公司 Text error correction method and device
CN110705399A (en) * 2019-09-19 2020-01-17 安徽七天教育科技有限公司 Method for automatically identifying mathematical formula
CN110866390A (en) * 2019-10-15 2020-03-06 平安科技(深圳)有限公司 Method and device for recognizing Chinese grammar error, computer equipment and storage medium
CN110866390B (en) * 2019-10-15 2022-02-11 平安科技(深圳)有限公司 Method and device for recognizing Chinese grammar error, computer equipment and storage medium
CN111175852A (en) * 2019-12-27 2020-05-19 中国电子科技集团公司第十四研究所 Airport fog forecast early warning method based on long-time memory algorithm
CN114548080A (en) * 2022-04-24 2022-05-27 长沙市智为信息技术有限公司 Chinese wrong character correction method and system based on word segmentation enhancement

Similar Documents

Publication Publication Date Title
CN109614612A (en) A kind of Chinese text error correction method based on seq2seq+attention
CN107203511B (en) Network text named entity identification method based on neural network probability disambiguation
US20220147836A1 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN113239186B (en) Graph convolution network relation extraction method based on multi-dependency relation representation mechanism
CN106682089A (en) RNNs-based method for automatic safety checking of short message
US20220300546A1 (en) Event extraction method, device and storage medium
CN105976056A (en) Information extraction system based on bidirectional RNN
CN112035841B (en) Intelligent contract vulnerability detection method based on expert rules and serialization modeling
CN109408574B (en) Complaint responsibility confirmation system based on text mining technology
CN115131627B (en) Construction and training method of lightweight plant disease and pest target detection model
CN109783644A (en) A kind of cross-cutting emotional semantic classification system and method based on text representation study
CN111709244A (en) Deep learning method for identifying causal relationship of contradictory dispute events
CN116089873A (en) Model training method, data classification and classification method, device, equipment and medium
CN110362797A (en) A kind of research report generation method and relevant device
CN108920446A (en) A kind of processing method of Engineering document
CN115688920A (en) Knowledge extraction method, model training method, device, equipment and medium
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN115062617A (en) Task processing method, device, equipment and medium based on prompt learning
US20230401390A1 (en) Automatic concrete dam defect image description generation method based on graph attention network
CN112559741A (en) Nuclear power equipment defect recording text classification method, system, medium and electronic equipment
Li et al. Evaluating BERT on cloud-edge time series forecasting and sentiment analysis via prompt learning
CN115827865A (en) Method and system for classifying objectionable texts by fusing multi-feature map attention mechanism
CN113626537B (en) Knowledge graph construction-oriented entity relation extraction method and system
CN115587192A (en) Relationship information extraction method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190412