CN111985220A - End-to-end judicial literature automatic proofreading method based on deep learning - Google Patents

End-to-end judicial literature automatic proofreading method based on deep learning

Info

Publication number
CN111985220A
CN111985220A
Authority
CN
China
Prior art keywords
model
encoder
decoder
likelihood
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010751655.8A
Other languages
Chinese (zh)
Inventor
朱海麒 (Zhu Haiqi)
姜峰 (Jiang Feng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202010751655.8A
Publication of CN111985220A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses an end-to-end judicial literature automatic proofreading method based on deep learning, and belongs to the technical field of natural language processing. The automatic proofreading method comprises the following steps: step one, a Transformer model structure is provided; step two, the Transformer model is trained so as to maximize the likelihood of the model on the training data S; and step three, a length penalty term is introduced into the likelihood obtained in step two to obtain a decoding strategy. The invention uses the Transformer, an encoder-decoder model based on the self-attention mechanism, which effectively avoids the shortcomings of recurrent and convolutional neural networks, and the proposed method far outperforms encoder-decoder models based on recurrent neural networks and convolutional neural networks.

Description

End-to-end judicial literature automatic proofreading method based on deep learning
Technical Field
The invention relates to an end-to-end judicial literature automatic proofreading method based on deep learning, and belongs to the technical field of natural language processing.
Background
With the gradual advance of informatization in the judicial field, a large number of judicial documents are being produced. Faced with this mass of judicial document text, manually written judicial documents contain certain implicit grammatical errors, which poses a serious challenge to traditional manual proofreading. Correcting grammatical errors makes the text more fluent and easier to read; since judicial documents serve as carriers of law enforcement, grammatical or logical errors can have a great impact, yet processing such large volumes of text by manual proofreading alone is clearly unrealistic, so text correction technology has attracted increasing attention in recent years.
Compared with tasks such as machine translation, automatic question answering, and dialogue, the grammatical error correction task is relatively niche and its research community is not large. Generally speaking, grammatical error correction research has passed through three methodological stages: methods based on manual rules, methods based on statistical classifiers, and methods based on machine translation. In recent years, the development of deep learning has led to the application of a series of end-to-end learning methods in the field of Natural Language Processing (NLP), and Machine Translation, as a hot research problem in NLP, has also gradually shifted from the original Statistical Machine Translation (SMT) methods to a series of Neural Machine Translation (NMT) methods, such as RNN seq2seq models, attention-mechanism models, the ConvS2S model, and the self-attention-based Transformer model. Therefore, we roughly divide grammatical error correction methods into four kinds: rule-based methods, statistical-classifier-based methods, SMT-based methods, and NMT-based methods.
The key drawback of the manual rule-based approach is that it cannot cover all error types in the text. For some grammatical errors involving complex context, such as word collocation errors, it is almost impossible to enumerate all the correction rules. Furthermore, the correction rules usually need to be written by human language experts, which undoubtedly consumes considerable cost.
The classifier-based method has seldom been tried in the Chinese text error correction task, mainly because of Chinese word boundaries and the huge Chinese character set. Strictly speaking, Chinese has no word boundaries: there are no explicit delimiters between words, individual words are very short, and there is no large inventory of fixed phrases as in English. Therefore, in correcting Chinese grammatical errors, contextual factors must be taken into account, which causes great difficulty for classifier methods.
The rule-based method is one of the most widely used methods at present; although it has the advantages of high accuracy, no need for corpus annotation, and good interpretability, its key defect cannot be compensated for, and the process of formulating rules consumes considerable manpower and material resources. The classifier-based method needs to construct different classifiers and confusion sets for different error types, whereas the SMT-based method can automatically learn confusion sets from parallel data without other linguistic input, and a single SMT model can correct multiple error types, so it is better at correcting complex errors. Despite these advantages, the SMT-based approach relies on large-scale manually annotated parallel corpora, whereas the classifier-based approach can learn models from unannotated corpora. In addition, SMT-based GEC systems are limited in generalization capability and cannot effectively access broader source and target contexts, so researchers have applied NMT methods to the task of automatic text proofreading and proposed a series of RNN seq2seq-based models.
In addition, at present there is no automatic Chinese text proofreading method designed specifically for judicial documents, which is exactly what the informatization of the judicial field requires. General-purpose automatic Chinese text proofreading methods often fail to recognize proper nouns and legal terms in the judicial field, which causes great trouble for automatic text proofreading. As a result, the proofreading accuracy is not ideal, and such methods cannot be used normally in practice.
At present, the mainstream approach is to regard grammatical error correction as a monolingual translation task: the error correction process "translates" an erroneous sentence into a correct one. With the progress of neural machine translation, a large number of methods and models have been migrated into the grammatical error correction task; however, before 2018, RNN-based seq2seq models failed to surpass systems based on statistical machine translation, and the multi-layer CNN-based seq2seq model adopted by Chollampatt et al. became the first neural network method to exceed statistical machine translation systems. On this basis, the invention applies the more advanced Transformer model to the Chinese grammatical error correction task and obtains better results.
Disclosure of Invention
The invention aims to provide an end-to-end judicial literature automatic proofreading method based on deep learning, so as to address the deficiencies of existing automatic text proofreading methods.
An end-to-end judicial literature automatic proofreading method based on deep learning comprises the following steps:
step one, a Transformer model structure is provided;
step two, training the Transformer model so as to maximize the likelihood of the model on the training data S;
and step three, introducing a length penalty term into the likelihood obtained in step two to obtain a decoding strategy.
Further, the Transformer model structure comprises an encoder and a decoder, wherein the encoder and the decoder each comprise 6 identical layers, and each layer comprises a self-attention sublayer and a feed-forward neural network sublayer.
Further, there is an encoder-decoder attention sublayer between the encoder and the decoder; residual connections are applied at the output of each sublayer, followed by layer normalization within the same layer.
Furthermore, in the second step, in order to utilize the position information of the symbols in the sequence, a position encoding is added to the input embedding; the dimension of the position encoding is the same as the hidden dimension d_model of the Transformer model. The specific calculation formula is:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position index of a symbol in the sequence, i indexes the components of the position-encoding vector, and the embedding is the output of the previous sublayer, which supplies the query, key, and value vectors.
Further, in step two, when training the Transformer model, maximum likelihood estimation is used, and the goal is to maximize the likelihood of the model on the training data S:
θ* = argmax_θ Σ_(x,y)∈S log p(y | x; θ).
Further, in step three, specifically, given an input erroneous sentence x, the Transformer model generates the target-side corrected sentence y_cor using beam search decoding. At each time step, the k candidate sentences with the highest scores are retained; a length penalty term is introduced into the original likelihood score to obtain the decoding strategy, and the calculation formula is as follows:
LP(y) = (5 + |y|)^α / (5 + 1)^α
score(y | x) = log p(y | x) / LP(y)
where LP is the length penalty, α is a hyper-parameter, and score is the likelihood score.
The main advantages of the invention are: the invention provides an end-to-end judicial literature automatic proofreading method based on deep learning, which uses the Transformer, an encoder-decoder model based on the self-attention mechanism, to effectively avoid the shortcomings of recurrent and convolutional neural networks. In the Transformer, any position in the sequence can interact with all other positions through the self-attention mechanism, which makes modeling long-distance dependencies very direct; furthermore, the degree of parallelism of computation in the Transformer is very high, and the computations at all positions in the sequence can be performed simultaneously. Currently, in machine translation, the Transformer performs far better than encoder-decoder models based on recurrent or convolutional neural networks. Experimental results show that the method of the present invention far outperforms encoder-decoder models based on recurrent neural networks and convolutional neural networks.
Drawings
FIG. 1 shows the structure of a Transformer model.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention draws on the Transformer encoder-decoder model, which is clearly superior to other structures (recurrent neural networks and convolutional neural networks) in the machine translation task, casts grammatical error correction as a monolingual machine translation task, and describes in detail the specific structure of the Transformer model, the model loss function, and the training criterion.
An end-to-end judicial literature automatic proofreading method based on deep learning comprises the following steps:
step one, a Transformer model structure is provided;
step two, training the Transformer model so as to maximize the likelihood of the model on the training data S;
and step three, introducing a length penalty term into the likelihood obtained in step two to obtain a decoding strategy.
Further, the Transformer model structure comprises an encoder and a decoder, wherein the encoder and the decoder each comprise 6 identical layers, and each layer comprises a self-attention sublayer and a feed-forward neural network sublayer.
Specifically, the Transformer model, like a conventional encoder-decoder model, comprises an encoder and a decoder; FIG. 1 shows the structure of the Transformer model. Given a source-side erroneous sentence x = {x1, x2, ..., xm}, the Transformer encoder encodes x into a set of implicit state representations e = {e1, e2, ..., em} in a continuous space; based on this representation, the Transformer decoder generates the target-side corrected sentence y = {y1, y2, ..., yw} time step by time step.
Further, there is an encoder-decoder attention sublayer between the encoder and the decoder; residual connections are applied at the output of each sublayer, followed by layer normalization.
Specifically, as in a standard Transformer, the encoder and the decoder each comprise 6 identical layers; each layer in the encoder consists of a self-attention sublayer and a feed-forward neural network sublayer. The input first passes through the self-attention sublayer, after which the same feed-forward network is applied to the output of the self-attention sublayer at every position. The layers on the decoder side also contain a self-attention sublayer and a feed-forward neural network sublayer, and between these two there is an additional encoder-decoder attention sublayer, similar in structure to the attention layer in a typical recurrent neural network encoder-decoder model. At the output of every sublayer of the encoder and the decoder, residual connections are applied and layer normalization is performed.
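For concreteness, the following is a minimal sketch of how such a 6-layer encoder-decoder could be assembled with PyTorch. It is not the patented implementation: the class name GECTransformer, the hyper-parameter values, and the use of torch.nn.Transformer are illustrative assumptions.

```python
# Minimal sketch, assuming PyTorch and a shared subword vocabulary; all names
# and hyper-parameters are illustrative, not the patented implementation.
import math
import torch
import torch.nn as nn

class GECTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6, dim_ff=2048):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        # 6 identical encoder layers and 6 identical decoder layers, each with
        # self-attention and feed-forward sublayers plus residual connections
        # and layer normalization, as described above.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, batch_first=True)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # src_ids: (batch, m) erroneous sentence; tgt_ids: (batch, w) shifted corrected sentence.
        # Position encodings (described below) would be added to these embeddings.
        src = self.embed(src_ids) * math.sqrt(self.d_model)
        tgt = self.embed(tgt_ids) * math.sqrt(self.d_model)
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out_proj(hidden)  # (batch, w, vocab_size) logits over corrected tokens
```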
When encoding the symbol at one position in a sequence, the self-attention mechanism allows it to interact with the symbols at all other positions in the sequence, helping to generate a better encoding for that symbol. In the self-attention sublayer, multi-head attention is applied, and the query, key, and value vectors all come from the output of the previous sublayer (or directly from the input embedding).
The encoder-decoder attention sublayer is similar to the attention layer in a typical recurrent neural network encoder-decoder model: the query vector q comes from the output of the previous sublayer in the decoder, while the set of key vectors K and the set of value vectors V both come from the output of the encoder.
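The scaled dot-product attention underlying both the self-attention and the encoder-decoder attention sublayers can be sketched as follows; in self-attention Q, K, and V all come from the previous sublayer, while in encoder-decoder attention Q comes from the decoder and K, V from the encoder output. The function name and tensor shapes below are illustrative assumptions.

```python
# Illustrative sketch of scaled dot-product attention (single head); the
# multi-head variant applies this in parallel over several learned projections.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q: (batch, len_q, d_k), k: (batch, len_k, d_k), v: (batch, len_k, d_v)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)  # each query position attends over all key positions
    return torch.matmul(weights, v)          # (batch, len_q, d_v)
```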
Furthermore, in the second step, since the Transformer model does not contain any recurrent structure, a position encoding is added to the input embedding in order to utilize the position information of the symbols in the sequence; the dimension of the position encoding is the same as the hidden dimension d_model of the Transformer model. The specific calculation formula is:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position index of a symbol in the sequence, i indexes the components of the position-encoding vector, and the embedding is the output of the previous sublayer, which supplies the query, key, and value vectors.
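A minimal sketch of this sinusoidal position encoding, assuming PyTorch and an even d_model, is shown below; the function name and the max_len parameter are illustrative assumptions.

```python
# Sketch of the sinusoidal position encoding described above (assumes even d_model).
import torch

def sinusoidal_position_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # the 2i in the formula
    angle = pos / (10000.0 ** (two_i / d_model))                    # pos / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even components use sine
    pe[:, 1::2] = torch.cos(angle)   # odd components use cosine
    return pe                        # added to the input embeddings, same dimension d_model
```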
Further, in step two, when training the Transformer model, maximum likelihood estimation is used, and the goal is to maximize the likelihood of the model on the training data S:
θ* = argmax_θ Σ_(x,y)∈S log p(y | x; θ).
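In practice this objective is typically implemented as a token-level cross-entropy loss, which equals the negative log-likelihood. The following is a sketch of one training step under that assumption; model, optimizer, and pad_id are illustrative names, not elements disclosed in the filing.

```python
# Sketch of one maximum-likelihood training step: minimizing the token-level
# cross-entropy maximizes the sum over (x, y) in S of log p(y | x; theta).
import torch.nn as nn

def train_step(model, optimizer, src_ids, tgt_ids, pad_id=0):
    decoder_input, gold = tgt_ids[:, :-1], tgt_ids[:, 1:]       # teacher forcing
    logits = model(src_ids, decoder_input)                      # (batch, len, vocab)
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)          # per-token -log p(y | x; theta)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), gold.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```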
Further, in step three, specifically, given an input erroneous sentence x, the Transformer model generates the target-side corrected sentence y_cor using beam search decoding. At each time step, the k candidate sentences with the highest scores are retained; a length penalty term is introduced into the original likelihood score to obtain the decoding strategy, and the calculation formula is as follows:
LP(y) = (5 + |y|)^α / (5 + 1)^α
score(y | x) = log p(y | x) / LP(y)
where LP is the length penalty term; an overly long sentence causes grammatical quality problems such as redundancy, so overly long sentences are penalized. α is a hyper-parameter whose reasonable range is obtained through experiments. score is the likelihood score: a likelihood score is computed for all the obtained candidate sentences, the sentences are then sorted by likelihood score, and the first k sentences are taken as our results.
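A small sketch of this length-penalized re-ranking is given below, assuming the GNMT-style penalty written above; the constant 5, the default α = 0.6, and the function names are illustrative assumptions, not values disclosed in the filing.

```python
# Sketch: re-rank beam-search candidates by likelihood score divided by the
# length penalty, then keep the top-k corrected sentences.
def length_penalty(length: int, alpha: float = 0.6) -> float:
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)

def rerank(candidates, alpha: float = 0.6, k: int = 5):
    """candidates: list of (tokens, log_likelihood) pairs produced by the beam."""
    scored = [(toks, loglik / length_penalty(len(toks), alpha)) for toks, loglik in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)   # highest penalized score first
    return scored[:k]
```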
Specifically, in the experiments, a series of methods based on different models are selected for comparison; for fairness, results of methods that use additional corpus resources are omitted. The baseline systems are introduced as follows: the method of Chollampatt et al., which stacks multi-layer convolutional neural networks as an encoder-decoder model (MLConv), together with their model ensemble (MLConv (4ens)) and a system that re-ranks the candidate correction sentences output by the grammatical error correction model using edit-distance features (MLConv (4ens) + EO); the statistical machine translation system of Chollampatt et al. (SMT + NNJM), which uses a word-level statistical machine translation model and integrates a neural network model and a language model; and the system of Yuan et al. based on a recurrent neural network encoder-decoder model with an attention mechanism (RNNSearch + align), which additionally handles spelling errors by means of an unsupervised word alignment model. These baseline models are retrained on the experimental data selected here, and results similar to those reported in the original papers are obtained.
In terms of data fusion, corresponding tests are carried out on the Transformer model, and the best effect is obtained with the NLPCC + HSK + judicial data fusion scheme. The experiments show that the scale of the data influences the effect of the grammatical error correction method to a certain extent. Therefore, in the subsequent model comparison experiments, the NLPCC + HSK + judicial data fusion scheme is selected as the training data set, the judicial test data partitioned in advance is used entirely as the test set, and the following experimental results are obtained.
Table 1. Model comparison experimental results (the table is provided as an image in the original publication).
From the experimental results, our Transformer-based method outperforms the others to some extent; MLConv (4ens) + EO is closest in performance to the Transformer, but the single-model MLConv is about 5 points lower than the Transformer. The other methods based on recurrent neural networks are inferior to the multi-layer convolutional neural network methods; we conjecture that this is probably because convolutional neural networks are excellent at capturing local features, and most grammatical errors occur in a local part of a sentence.
In summary, treating grammatical error correction of judicial documents as a machine translation problem and applying the Transformer model is feasible, and the effect can indeed exceed existing grammatical error correction methods. Moreover, the need to manually craft a large number of rules, as in rule-based grammatical error correction methods, is avoided, although the limited corpus scale in this specialized domain may still be one of the main factors limiting further improvement of the method.
One specific embodiment is set forth below:
In the experiments, a series of methods based on different models are selected for comparison; for fairness, results of methods that use additional corpus resources are omitted. The baseline systems are introduced as follows: the method of Chollampatt et al., which stacks multi-layer convolutional neural networks as an encoder-decoder model (MLConv), together with their model ensemble (MLConv (4ens)) and a system that re-ranks the candidate correction sentences output by the grammatical error correction model using edit-distance features (MLConv (4ens) + EO); the statistical machine translation system of Chollampatt et al. (SMT + NNJM), which uses a word-level statistical machine translation model and integrates a neural network model and a language model; and the system of Yuan et al. based on a recurrent neural network encoder-decoder model with an attention mechanism (RNNSearch + align), which additionally handles spelling errors by means of an unsupervised word alignment model. These baseline models are retrained on the experimental data selected here, and results similar to those reported in the original papers are obtained.
In terms of data fusion, corresponding tests are carried out on the Transformer model, and for professional documents the best effect is obtained with the NLPCC + HSK + judicial data fusion scheme. The experiments show that the scale of the data influences the effect of the grammatical error correction method to a certain extent. Therefore, in the subsequent model comparison experiments, the NLPCC + HSK + judicial data fusion scheme is selected as the training data set, the judicial test data partitioned in advance is used entirely as the test set, and the following experimental results are obtained.
Table 2. Model comparison experimental results on common Chinese text (the table is provided as an image in the original publication).
Table 3. Model comparison experimental results on professional documents (the table is provided as an image in the original publication).
From the experimental results, our Transformer-based method outperforms the others to some extent; MLConv (4ens) + EO is closest in performance to the Transformer, but the single-model MLConv is about 5 points lower than the Transformer. The other methods based on recurrent neural networks are inferior to the multi-layer convolutional neural network methods; we conjecture that this is probably because convolutional neural networks are excellent at capturing local features, and most grammatical errors occur in a local part of a sentence.

Claims (6)

1. An end-to-end judicial literature automatic proofreading method based on deep learning is characterized by comprising the following steps:
step one, a Transformer model structure is provided;
step two, training the Transformer model so as to maximize the likelihood of the model on the training data S;
and step three, introducing a length penalty term into the likelihood obtained in step two to obtain a decoding strategy.
2. The method of claim 1, wherein the Transformer model structure comprises an encoder and a decoder, each of the encoder and the decoder comprises 6 identical layers, and each layer comprises a self-attention sublayer and a feed-forward neural network sublayer.
3. The method for automatic proofreading of end-to-end judicial documents based on deep learning according to claim 2, characterized in that there is also an encoder-decoder attention sublayer between the encoder and the decoder, residual connections are applied at the output of each sublayer, and layer normalization is performed within the same layer.
4. The method for automatically proofreading an end-to-end judicial literature based on deep learning of claim 1, wherein in the second step, in order to utilize the position information of the symbols in the sequence, a position encoding is added to the input embedding, the dimension of the position encoding is the same as the hidden dimension d_model of the Transformer model, and the specific calculation formula is:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position index of a symbol in the sequence, i indexes the components of the position-encoding vector, and the embedding is the output of the previous sublayer, which supplies the query, key, and value vectors.
5. The method for automatically proofreading the end-to-end judicial documents based on deep learning of claim 1, wherein in step two, when training the Transformer model, maximum likelihood estimation is used, and the goal is to maximize the likelihood of the model on the training data S:
θ* = argmax_θ Σ_(x,y)∈S log p(y | x; θ).
6. The method of claim 1, wherein in step three, specifically, given an input erroneous sentence x, the Transformer model generates the target-side corrected sentence y_cor using beam search decoding; at each time step, the k candidate sentences with the highest scores are retained, a length penalty term is introduced into the original likelihood score to obtain the decoding strategy, and the calculation formula is as follows:
LP(y) = (5 + |y|)^α / (5 + 1)^α
score(y | x) = log p(y | x) / LP(y)
where LP is the length penalty, α is a hyper-parameter, and score is the likelihood score.
CN202010751655.8A 2020-07-30 2020-07-30 End-to-end judicial literature automatic proofreading method based on deep learning Pending CN111985220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010751655.8A CN111985220A (en) 2020-07-30 2020-07-30 End-to-end judicial literature automatic proofreading method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010751655.8A CN111985220A (en) 2020-07-30 2020-07-30 End-to-end judicial literature automatic proofreading method based on deep learning

Publications (1)

Publication Number Publication Date
CN111985220A (en) 2020-11-24

Family

ID=73445616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010751655.8A Pending CN111985220A (en) 2020-07-30 2020-07-30 End-to-end judicial literature automatic proofreading method based on deep learning

Country Status (1)

Country Link
CN (1) CN111985220A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11747970B2 (en) 2021-09-23 2023-09-05 International Business Machines Corporation Interactive graphical display of multiple overlapping hypotheses or document versions

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710985A (en) * 2009-12-11 2010-05-19 哈尔滨工业大学 Image brightness compensation method for image coding
CN110489521A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Text categories detection method, device, electronic equipment and computer-readable medium
CN110737764A (en) * 2019-10-24 2020-01-31 西北工业大学 personalized dialogue content generating method
CN111160050A (en) * 2019-12-20 2020-05-15 沈阳雅译网络技术有限公司 Chapter-level neural machine translation method based on context memory network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710985A (en) * 2009-12-11 2010-05-19 哈尔滨工业大学 Image brightness compensation method for image coding
CN110489521A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Text categories detection method, device, electronic equipment and computer-readable medium
CN110737764A (en) * 2019-10-24 2020-01-31 西北工业大学 personalized dialogue content generating method
CN111160050A (en) * 2019-12-20 2020-05-15 沈阳雅译网络技术有限公司 Chapter-level neural machine translation method based on context memory network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Kang Xiaomian et al., "Neural Machine Translation Incorporating Discourse-Structure Position Encoding" (融合篇章结构位置编码的神经机器翻译), Journal of Intelligent Science and Technology (《智能科学与技术学报》) *
Zhu Chenguang, "Machine Reading Comprehension" (《机器阅读理解》), China Machine Press (机械工业出版社), 30 March 2020 *
Deng Junfeng et al., "Grammatical Error Correction Based on Back-translation" (基于Back-translation的语法错误纠正), Intelligent Computer and Applications (《智能计算机与应用》) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11747970B2 (en) 2021-09-23 2023-09-05 International Business Machines Corporation Interactive graphical display of multiple overlapping hypotheses or document versions

Similar Documents

Publication Publication Date Title
CN108920473A (en) A kind of data enhancing machine translation method based on similar word and synonym replacement
Brown et al. Analysis, statistical transfer, and synthesis in machine translation
US20050289463A1 (en) Systems and methods for spell correction of non-roman characters and words
Szarvas et al. A highly accurate Named Entity corpus for Hungarian
CN110276069A (en) A kind of Chinese braille mistake automatic testing method, system and storage medium
Dien et al. POS-tagger for English-Vietnamese bilingual corpus
Zhang et al. Named entity recognition method in health preserving field based on BERT
CN114662476A (en) Character sequence recognition method fusing dictionary and character features
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
Roy et al. Unsupervised context-sensitive bangla spelling correction with character n-gram
Zhang et al. Design and implementation of Chinese Common Braille translation system integrating Braille word segmentation and concatenation rules
Darwish et al. Effective multi-dialectal arabic POS tagging
CN111274827B (en) Suffix translation method based on multi-target learning of word bag
CN111985220A (en) End-to-end judicial literature automatic proofreading method based on deep learning
Feng et al. Multi-level cross-lingual attentive neural architecture for low resource name tagging
CN111382583A (en) Chinese-Uygur name translation system with mixed multiple strategies
Deksne Bidirectional lstm tagger for latvian grammatical error detection
Saengthongpattana et al. Thai-english and english-thai translation performance of transformer machine translation
Duan et al. Pinyin as a feature of neural machine translation for Chinese speech recognition error correction
Narayan et al. Machine translation using quantum neural network for simple sentences
Athanaselis et al. A corpus based technique for repairing ill-formed sentences with word order errors using co-occurrences of n-grams
Rapp A Part-of-Speech-Based Search Algorithm for Translation Memories.
Getachew et al. Ge'ez-English Bi-Directional Neural Machine Translation Using Transformer
Yazar et al. Low-Resource Neural Machine Translation: A Systematic Literature Review
CN114328848B (en) Text processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20201124)