CN112699218A

CN112699218A - Model establishing method and system, paragraph label obtaining method and medium

Info

Publication number: CN112699218A
Application number: CN202011605780.4A
Authority: CN
Inventors: 翁洋; 李鑫; 王竹; the other inventors requested that their names not be disclosed
Original assignee: Chengdu Shuzhilian Technology Co Ltd
Current assignee: Chengdu Shuzhilian Technology Co Ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2021-04-23

Abstract

The invention discloses a model establishing method and system, a paragraph label obtaining method, and a medium, relating to the field of transfer learning in natural language processing. The method comprises the following steps: collecting all referee document data from a database to obtain pre-training data; defining paragraph labels of different types of referee documents; marking paragraph labels on different types of referee documents to obtain training data; constructing a referee document structured model; pre-training the model; training the pre-trained referee document structured model using the training data; and debugging the trained referee document structured model to obtain a final referee document structured model. The input of the referee document structured model is referee document text data, with a task prefix added to each paragraph of the input referee document, and the output of the model is the paragraph label text data of the referee document. After training, the model established by the method can predict the paragraph labels of any type of referee document.

Description

Model establishing method and system, paragraph label obtaining method and medium

Technical Field

The invention relates to the field of transfer learning in natural language processing, and in particular to a method and a system for building a referee document structured model, and a method and a medium for obtaining referee document paragraph labels.

Background

By December 2019, over eight million referee documents had been made available on the internet, providing massive data resources for the practice and research of legal artificial intelligence.

A referee document is a judicial product that records the process of judicial activity and sets out the rights and obligations of the parties. It is an important resource for research on legal text information, and it provides important indicators for legal artificial intelligence applications based on referee documents, such as classification and recommendation, judgment result prediction, and intelligent question answering. However, referee documents are published essentially as plain text, which is typical unstructured data, making it difficult to accurately identify and extract information from them. Machine learning algorithms are therefore needed to structure referee documents. Structuring a referee document means marking each paragraph of the document with a corresponding label, thereby converting the plain-text referee document into structured data with paragraph labels; this is a multi-class text classification task. Establishing a label system for referee document paragraphs provides basic support for further information extraction tasks, such as dispute focus extraction, case element extraction, and entity recognition and relation extraction on case facts.

However, given the many types of referee documents, manual structuring not only consumes a great deal of time and effort but may also produce poor results, because standards are difficult to unify and the process is difficult to control. Referee documents come in many types: by case type, they are divided into six categories, namely criminal cases, civil cases, administrative cases, compensation cases, enforcement cases, and other cases; by trial procedure, they can be divided into first instance, second instance, retrial, and so on. Each type of referee document contains nine major parts: title, header, facts, reasons, judgment basis, main text of the judgment, tail, signature, and appendix. For different types of referee documents, the paragraphs can be further subdivided, according to the judgment information they describe, into different paragraph labels such as "plaintiff's claims", "defendant's defense", "facts found by the court", and "notice of the right to appeal". In terms of content distribution, different types of referee documents contain both nearly identical text paragraphs and very different paragraph types; that is, their paragraph labels overlap but are not identical. This diversity of paragraph labels makes the structuring tasks for the various types of referee documents very complicated, and they cannot be completed effectively with traditional text classification methods.

Traditional text classification methods cannot effectively use the information shared among multiple types of referee documents, which wastes information. For example, for civil first-instance ordinary-procedure referee documents, researchers have proposed a paragraph text classification method based on the contextual semantic features of referee document paragraphs. With BERT as the encoding layer and a CRF (conditional random field) modeling the relationships between paragraph labels, the method learns both the semantic information of the paragraph text and the contextual relationships between paragraphs within a complete referee document, and achieves good classification results. Although the method learns the semantic information of civil referee document paragraphs, part of its effectiveness comes from learning the correlations between the specific context labels of civil referee documents. Because the paragraph labels defined for civil referee documents differ from those of other types of referee documents, the model cannot be used directly to predict paragraph classes for other types of referee documents.

Traditional "convolutional/recurrent neural network + fully connected classification layer" models, represented by TextCNN and LSTM and their derivatives, have limited ability to model long texts and weak ability to represent complex texts, which directly limits their classification ability for referee document structuring. Although "pre-trained language model + fully connected classification layer" approaches, represented by BERT and XLNet, have made great progress at the text representation level, each type of classification task requires a completely independent fine-tuning process, which wastes the information common to multiple types of referee documents and adds complexity.

Disclosure of Invention

Research on the background art shows that it is necessary to explore a unified machine learning method, based on the applicable specifications, that completes the structuring task for all types of referee documents and converts referee documents of various types and forms into structured text data that is more standardized and easier for machines to recognize.

The invention abandons the traditional fully connected classifier and studies a unified structuring framework for referee documents, so that the framework can adapt to referee documents of various types and of multi-level trial procedures. By making maximum use of the labeled data, the framework can adapt to new types of referee documents and achieve higher structuring accuracy.

The purpose of the invention is to establish a unified multi-task transfer learning framework for text classification and to study a structured deep learning model oriented to multiple types of referee documents. After training, the model can predict the paragraph labels of any type of referee document.

In order to achieve the above object, the present invention provides a method for building a structured model of a referee document, the method comprising:

collecting all referee document data from a database to obtain pre-training data;

defining paragraph labels of different types of referee documents;

marking paragraph labels of different types of referee documents to obtain training data;

constructing a structured referee document model;

constructing a pre-training task, and pre-training the referee document structured model so that the input sequence of the model is converted into word vector input; wherein the pre-training data is all referee documents in the database;

training the pre-trained referee document structured model using the training data to obtain a trained referee document structured model;

debugging the trained structured referee document model to obtain a final structured referee document model;

the input of the structured referee document model is referee document text data, a task prefix is added to the paragraph of the input referee document, and the output of the structured referee document model is paragraph label text data of the referee document.

The method establishes a multi-type referee document structuring method based on a unified text-to-text transfer learning classification model, so that it can adapt to referee document structuring tasks for various document types and multi-level trial procedures. By adopting a text-to-text form and a powerful multi-head attention mechanism for encoding and decoding, the method processes multiple types of referee documents uniformly, makes use of the information shared among them, reduces information waste, and reduces the amount of labeled data required by the deep learning model. It also removes the need to train separate classification algorithms for different types of referee documents, which greatly saves manpower, material resources, and time.

In the method, the referee document structured model adopts a transformer structure.

The referee document structured model is pre-trained using the transformer structure in the training stage, which helps the model capture deep bidirectional semantic information of the input text and makes it convenient to add a task prefix to the input text. By identifying the different tasks in text-to-text form, the model can complete the structuring tasks for multiple types of referee documents at the same time.

The structured model of the referee document in the method comprises an encoder and a decoder;

the referee document structured model maps an input sequence to a word vector sequence and passes the word vector sequence to the encoder; the encoder passes its output to the decoder; the output of the decoder passes through a softmax layer, and the softmax layer weights are shared with the input embedding matrix;

the encoder and the decoder each comprise a plurality of preset structures, and each preset structure comprises two substructures: a self-attention layer and a feed-forward network; layer normalization is applied to the input of each substructure, and after layer normalization a residual connection adds the input of each substructure to its output; dropout is applied to the feed-forward network, the residual connections, the self-attention weights, and the input and output of the transformer structure.
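The wiring of one substructure described above (layer normalization on the input, then the sub-layer, then a residual addition) can be sketched roughly as follows. This is only an illustration on plain Python lists: the names `layer_norm`, `residual_block` and `toy_feed_forward` are hypothetical, the normalization has no learned parameters, and dropout is left out so that the result is deterministic.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Normalize the input, run the sub-layer, then add the original input back."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def toy_feed_forward(x: Vector) -> Vector:
    """Stand-in for a feed-forward network (assumption): element-wise ReLU."""
    return [max(0.0, v) for v in x]


out = residual_block([1.0, 2.0, 3.0], toy_feed_forward)
```

A self-attention layer would be passed to `residual_block` in the same way as `toy_feed_forward`; only the sub-layer changes, not the wrapper.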

In the method, the decoder output passes through a softmax layer, and the softmax layer weight is shared with an input embedding matrix.
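Sharing the softmax layer weights with the input embedding matrix means that the same matrix is used both to look up input vectors and to score output terms. A minimal sketch, assuming a toy three-entry vocabulary with two-dimensional vectors (the matrix values and the function names `embed` and `output_probs` are illustrative only):

```python
import math
from typing import List

Matrix = List[List[float]]

# Hypothetical toy embedding matrix: one row per vocabulary entry.
EMBEDDING: Matrix = [
    [1.0, 0.0],  # id 0
    [0.0, 1.0],  # id 1
    [1.0, 1.0],  # id 2
]


def embed(token_id: int) -> List[float]:
    """Input side: look up the word vector of a vocabulary entry."""
    return EMBEDDING[token_id]


def output_probs(hidden: List[float]) -> List[float]:
    """Output side: softmax over logits computed with the same matrix."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in EMBEDDING]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = output_probs([2.0, 2.0])
```

Because the two sides share one matrix, an update to a word vector during training changes both how that term is read and how it is predicted.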

The referee document structured model in the method is trained as follows: a maximum likelihood objective function is defined, and the parameters of the encoder, the decoder, and the word vectors are optimized through neural network back-propagation, so that the referee document structured model learns the semantic information of each paragraph of the referee document.
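For a text-to-text model, a maximum likelihood objective is commonly written as the negative log-likelihood of the target label terms given the input. The sketch below assumes the model has already produced one probability distribution per output step; the function name `sequence_nll` and the numbers are illustrative, not taken from the invention.

```python
import math
from typing import Sequence


def sequence_nll(step_probs: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Negative log-likelihood of the target sequence under per-step distributions."""
    return -sum(math.log(probs[target]) for probs, target in zip(step_probs, target_ids))


# Two output steps over a two-entry vocabulary; the targets are ids 0 and 1.
loss = sequence_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Minimizing this quantity by back-propagation is equivalent to maximizing the likelihood of the correct label text.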

In the method, word segmentation and cleaning are performed on the text data of the different types of referee documents to obtain pre-training data;

a word-level pre-training task is constructed using a masked language model, in which a preset proportion of terms is randomly selected from the input sequence of the referee document structured model and removed;

each run of consecutive removed terms is replaced by a single specific term;

and the referee document structured model is pre-trained on the pre-training data to predict the specific terms, thereby completing the pre-training task and obtaining word vectors applicable to the legal field.
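The masking steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `corrupt_spans` is hypothetical, the sentinel naming `<extra_id_N>` follows the convention of the T5 framework rather than anything fixed by the invention, and the target format (sentinel followed by the removed terms) is likewise an assumption.

```python
import random
from typing import List, Tuple


def corrupt_spans(tokens: List[str], ratio: float = 0.15, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Remove a random subset of terms; each consecutive removed run becomes one sentinel."""
    rng = random.Random(seed)
    n_remove = min(len(tokens), max(1, round(len(tokens) * ratio)))
    removed = set(rng.sample(range(len(tokens)), n_remove))

    inputs: List[str] = []
    targets: List[str] = []
    sentinel = 0
    for i, tok in enumerate(tokens):
        if i in removed:
            if i == 0 or (i - 1) not in removed:
                # Start of a new removed run: emit one sentinel on both sides.
                marker = f"<extra_id_{sentinel}>"
                inputs.append(marker)
                targets.append(marker)
                sentinel += 1
            targets.append(tok)  # the model must predict the removed term
        else:
            inputs.append(tok)
    return inputs, targets


words = "this court holds that the defendant shall compensate the plaintiff".split()
corrupted, expected = corrupt_spans(words, ratio=0.3, seed=1)
```

The corrupted sequence is what the encoder sees, and the expected sequence is what the decoder is trained to generate.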

The referee document structured model in the method is debugged as follows: the number of layers of the encoder and the decoder and the mask proportion used in pre-training are varied, and the best-performing referee document structured model is selected for application in the final real-world scenario.
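One simple way to organize this kind of debugging is an exhaustive search over the candidate settings. The sketch below is only an assumption about how such a search might be arranged: `select_best_config` is a hypothetical helper, and `fake_validation_score` is a placeholder that returns made-up numbers instead of training and validating a real model.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple


def select_best_config(
    layer_counts: Iterable[int],
    mask_ratios: Iterable[float],
    evaluate: Callable[[int, float], float],
) -> Tuple[Tuple[int, float], float]:
    """Return the (layers, mask_ratio) pair with the highest validation score."""
    scores: Dict[Tuple[int, float], float] = {
        (n, r): evaluate(n, r) for n, r in product(layer_counts, mask_ratios)
    }
    best = max(scores, key=lambda config: scores[config])
    return best, scores[best]


def fake_validation_score(num_layers: int, mask_ratio: float) -> float:
    """Placeholder scorer, NOT real results; a real run would train and validate a model."""
    return -abs(num_layers - 6) - abs(mask_ratio - 0.15)


best_config, best_score = select_best_config([4, 6, 8], [0.10, 0.15, 0.20], fake_validation_score)
```

In practice `evaluate` would pre-train and fine-tune a model with the given settings and return a metric measured on held-out labeled paragraphs.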

The invention also provides a system for establishing the structured model of the referee document, which comprises:

the definition unit is used for defining paragraph labels of different types of referee documents;

the marking unit is used for marking paragraph labels of different types of referee documents to obtain training data;

the building unit is used for constructing a referee document structured model;

the pre-training unit is used for constructing a pre-training task and pre-training the referee document structured model so that the input sequence of the model is converted into word vector input;

the training unit is used for training the pre-trained referee document structured model using the training data to obtain a trained referee document structured model;

the debugging unit is used for debugging the trained structured model of the referee document to obtain a final structured model of the referee document;

The invention also provides a method for obtaining referee document paragraph labels, which comprises the following steps:

acquiring referee document data to be processed;

adding a task prefix to a paragraph of referee document data to be processed to obtain input data;

inputting input data into a structured referee document model established by the structured referee document model establishing method, and outputting paragraph labels of the referee document to be processed by the structured referee document model.
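The label obtaining steps above can be sketched as a thin wrapper around a trained text-to-text model. Everything named here is hypothetical: `generate` stands for whatever function runs the trained model on one input string, `stub_model` is a keyword rule used only to make the sketch runnable, and the prefix format (document type followed by a colon) is an assumption.

```python
from typing import Callable, List


def add_task_prefix(document_type: str, paragraph: str) -> str:
    """Prepend the task prefix (here, the document type) to one paragraph."""
    return f"{document_type}: {paragraph}"


def label_paragraphs(
    document_type: str,
    paragraphs: List[str],
    generate: Callable[[str], str],
) -> List[str]:
    """Prefix every paragraph and ask the text-to-text model for its label text."""
    return [generate(add_task_prefix(document_type, p)).strip() for p in paragraphs]


def stub_model(text: str) -> str:
    """Stand-in for a trained model (assumption): a keyword rule, not a real predictor."""
    return "mediation result" if "mediation agreement" in text else "court reasoning"


labels = label_paragraphs(
    "mediation document",
    ["A mediation agreement was reached ...", "This court holds ..."],
    stub_model,
)
```

Because the label comes back as text, the same wrapper works for every document type; only the prefix changes.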

The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the referee document structured model building method.

The invention also provides a structured model building device for the referee document, which comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the processor realizes the steps of the structured model building method for the referee document when executing the computer program.

One or more technical schemes provided by the invention at least have the following technical effects or advantages:

The method addresses the structuring task for all types of referee documents in the judicial field, and can automatically determine the paragraph labels of referee documents of any category using a single unified model. Adopting a text-to-text structure in the model has several beneficial effects for the multi-type referee document structuring task.

First, the invention uses a powerful multi-head self-attention structure in both the encoding module and the decoding module, so the advantages of the Transformer can be exploited both in encoding the input text and in decoding the generated text. In the encoding stage, the multi-head self-attention structure helps the model capture deep bidirectional semantic information of the input text, which is especially important for relatively complex referee documents; the decoding module generates the corresponding category names, thereby structuring the paragraphs.

Secondly, the invention aims to construct a unified framework for structuring multiple types of referee documents, for which the text-to-text model structure is naturally convenient. Conventional text classification models typically use, as the final classification layer, a softmax layer whose number of output neurons equals the number of classes, to generate the predicted probability of each class. A softmax layer with a fixed number of output neurons can serve only one specific classification task; extending it to other classification tasks requires constructing a completely new classification layer, which amounts to retraining or fine-tuning the model and adds considerable complexity. The text-to-text model structure, by contrast, can output the paragraph classes of all types of referee documents without training different models for different types.

Thirdly, with this framework the model can be pre-trained on a huge unsupervised legal corpus organized at the document level; unsupervised training on a huge corpus is a natural advantage in the field of NLP (natural language processing). The invention can use a legal corpus for continual multi-task learning of the model, for example using a masked language model and word-sentence relationship prediction to construct word-level pre-training tasks, and using sentence reordering and sentence distance prediction to construct sentence-level pre-training tasks.

Fourthly, in the fine-tuning stage the model learns the semantic information of similar paragraphs in different types of referee documents at the same time, which effectively provides the model with more data under the same conditions, thereby improving the fitting ability of the model and reducing the generalization error. Deep learning models normally need a large amount of labeled data to discover the deep regularities in the data, but this unified learning approach can reduce the amount of labeling required for some types of referee documents.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention;

FIG. 1 is a schematic flow chart of a structured modeling method for referee documents;

FIG. 2 is a schematic diagram of the composition of a structured referee document model building system.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflicting with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described and thus the scope of the present invention is not limited by the specific embodiments disclosed below.

Example one

Fig. 1 is a schematic flow diagram of the referee document structured model building method. According to the first embodiment of the present invention, a referee document structured model building method is provided, comprising:

defining paragraph labels of different types of referee documents;

marking paragraph labels on the different types of referee documents to obtain training data;

constructing a structured referee document model;

constructing a pre-training task, and pre-training the referee document structured model so that the input sequence of the model is converted into word vector input;

The specific implementation process of the method is as follows:

First, legal experts define paragraph labels for the different types of referee documents that both conform to the Supreme People's Court's specifications on the style of referee documents and match the terms generally used in judicial practice. Paragraph labels are then annotated on a large number of civil referee documents and a small number of referee documents of other case types, for use in training the model. Manual paragraph labeling of referee documents is time-consuming and labor-intensive, and the labeling process involves many uncontrollable factors; using the model of the invention reduces the amount of labeling needed for the other case types.

Then, a referee document structured model is established through the following modeling steps and used to structure referee documents:

1. Text vectorization: this step operates on all referee document text data in the database. (Pre-training uses all unlabeled referee document data in the database; the subsequent training uses labeled data from the types of referee documents that the model user wishes to structure.) After the data is segmented and cleaned, a masked language model is used to construct a word-level pre-training task: 15% of the terms are randomly selected from the input sequence and removed. Other proportions may be used in practical applications; the invention does not specifically limit the proportion. Each run of consecutive removed terms is replaced with a single specific term. The referee document structured model is then trained to predict the specific terms, which completes the pre-training task and yields word vectors suitable for the legal field.

2. Establishing a unified model: inspired by Transformer-based machine translation and the T5 pre-training framework, the method treats the original text classification task as a machine translation task and generates the predicted category with a model whose input and output are both text. Specifically, this is a unified model for structuring different categories of referee documents; by adding a task-specific text prefix (task-specific prefix) to the input text, the invention lets the model identify the task to be performed, for example:

First-instance judgment: This court holds … … → court's reasoning

Retrial judgment: This court holds … … → retrial court's reasoning

Mediation document: A mediation agreement was reached … … → mediation result
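Prefixed examples of this kind can be produced from annotated paragraphs with a small conversion step. The following is a sketch only: the class `LabeledParagraph`, the function `to_text_pairs`, the prefix format, and the English type and label strings are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LabeledParagraph:
    document_type: str  # used as the task prefix
    text: str           # paragraph text
    label: str          # target label, emitted as plain text


def to_text_pairs(samples: List[LabeledParagraph]) -> List[Tuple[str, str]]:
    """Turn annotated paragraphs of any document type into (input text, target text) pairs."""
    return [(f"{s.document_type}: {s.text}", s.label) for s in samples]


samples = [
    LabeledParagraph("first-instance judgment", "This court holds ...", "court reasoning"),
    LabeledParagraph("mediation document", "A mediation agreement was reached ...", "mediation result"),
]
pairs = to_text_pairs(samples)
```

Paragraphs from every document type end up in one list of text pairs, so a single model can be fine-tuned on all of them together.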

The model adopts the basic Transformer structure. The input sequence is first mapped to a sequence of word vectors, which is then passed to the encoder. The encoder consists of several blocks, each of which contains two substructures: a self-attention layer and a feed-forward network. Layer normalization is applied to the input of each substructure. After layer normalization, a residual connection adds the input of each substructure to its output. Dropout is applied within the feed-forward network, on the residual connections, on the self-attention weights, and at the input and output of the entire Transformer structure. The decoder is similar in structure to the encoder, except that its attention computation also takes in the output of the encoder. The self-attention mechanism in the decoder also uses a form of autoregressive attention that only allows the model to attend to past outputs. All attention mechanisms in the Transformer structure are split into separate "heads", whose outputs are concatenated before further processing. The output of the last decoder block passes through a softmax layer whose weights are shared with the input embedding matrix. From the sequence of softmax outputs, the model obtains a string of text as its prediction of the paragraph class.
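Two of the mechanisms mentioned above, autoregressive attention in the decoder and the splitting of attention into heads, can be illustrated with a few helper functions. These are hypothetical utilities on plain Python lists, intended only to show the shapes involved, not the attention computation itself.

```python
from typing import List


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def split_heads(vector: List[float], num_heads: int) -> List[List[float]]:
    """Split one vector into equally sized per-head slices."""
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]


def concat_heads(heads: List[List[float]]) -> List[float]:
    """Concatenate per-head outputs back into one vector."""
    return [value for head in heads for value in head]


mask = causal_mask(3)
heads = split_heads([1.0, 2.0, 3.0, 4.0], num_heads=2)
```

Each row of the mask corresponds to one output position and permits attention only to that position and earlier ones; each head processes its own slice before the slices are concatenated again.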

3. Model training: a maximum likelihood objective function is defined, and the parameters of the encoder, the decoder, and the word vectors are optimized through neural network back-propagation, so that the model learns the semantic information of each paragraph of the referee document more accurately.
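Maximum likelihood training of an encoder-decoder model is usually carried out with teacher forcing, in which the decoder is fed the correct previous terms and asked to predict the next one. The text does not state this detail, so the sketch below is an assumption; `teacher_forcing_pairs` and the start id 0 are hypothetical.

```python
from typing import List, Tuple


def teacher_forcing_pairs(target_ids: List[int], start_id: int = 0) -> List[Tuple[List[int], int]]:
    """For each step, pair the decoder's visible prefix with the term it should predict."""
    decoder_inputs = [start_id] + target_ids[:-1]  # targets shifted right by one
    return [(decoder_inputs[: i + 1], target_ids[i]) for i in range(len(target_ids))]


# A label made of three term ids, for example the pieces of one label name.
steps = teacher_forcing_pairs([7, 8, 9])
```

Summing the negative log-probability of each expected term over these steps gives the maximum likelihood objective for one training example.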

4. Model debugging: the number of layers of the encoder and the decoder and the mask proportion used in pre-training are varied, and the best-performing model is selected for application in the final real-world scenario.

Through the above steps, the invention can use a single unified referee document structured model: a task prefix is added to a paragraph of any type of referee document, the prefixed paragraph is input to the model, and the label of the paragraph is obtained.

The model is pre-trained on the legal text data of all referee documents and then fine-tuned on prefixed paragraph data from all types of referee documents; the framework provides a consistent maximum likelihood training objective for both pre-training and fine-tuning. Meanwhile, the model identifies the specific task at hand, that is, the specific document type, through the text prefix, so that a single model serves multiple purposes.

Example two

Fig. 2 is a schematic composition diagram of the referee document structured model building system. The second embodiment of the present invention provides a referee document structured model building system, which includes:

the collecting unit is used for collecting all referee document data from the database to obtain pre-training data;

Example three

The third embodiment of the invention provides a method for obtaining a label of a paragraph of a referee document, which comprises the following steps:

acquiring referee document data to be processed;

Example four

The fourth embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the method for building a structured model of a referee document are implemented.

If the referee document structured model building device is implemented in the form of a software functional unit and sold or used as a separate product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods in the embodiments of the present invention may also be implemented by a computer program stored in a computer-readable storage medium; when executed by a processor, the computer program implements the steps of the method embodiments. The computer program comprises computer program code, which may be in object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately expanded or restricted according to the requirements of legislation and patent practice in the relevant jurisdiction.

Example five

The fifth embodiment of the present invention provides an apparatus for building a structured model of a referee document, which comprises a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the steps of the method for building the structured model of the referee document when executing the computer program.

The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory can be used to store the computer program and/or modules, and the processor implements the various functions of the referee document structured model building device of the invention by running or executing the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like. Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart memory card, a secure digital card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for establishing a referee document structured model, characterized by comprising the following steps:

defining paragraph labels of different types of referee documents;

constructing a structured referee document model;

2. The method for building a structured referee document model according to claim 1, wherein the structured referee document model adopts a transformer structure.

3. The referee document structured model establishment method according to claim 2, wherein the referee document structured model comprises an encoder and a decoder;

the referee document structured model maps an input sequence to a word vector sequence and passes the word vector sequence to the encoder; the encoder passes its output to the decoder; the output of the decoder passes through a softmax layer, and the softmax layer weights are shared with the input embedding matrix;

4. The referee document structured model building method according to claim 3, wherein the decoder output passes through a softmax layer, and softmax layer weights are shared with the input embedding matrix.

5. The method for building a structured referee document model according to claim 1, wherein the structured referee document model is trained as follows: a maximum likelihood objective function is defined, and the parameters of the encoder, the decoder, and the word vectors are optimized through neural network back-propagation, so that the structured referee document model learns the semantic information of each paragraph of the referee document.

6. The method for building a structured referee document model according to claim 1, wherein:

word segmentation and cleaning are performed on the text data of different types of referee documents to obtain pre-training data;

each run of consecutive removed terms is replaced by a single specific term;

and the referee document structured model is pre-trained on the pre-training data to predict the specific terms, thereby completing the pre-training task and obtaining word vectors suitable for the legal field.

7. The method for building a structured referee document model according to claim 1, wherein the structured referee document model is debugged as follows: the number of layers of the encoder and the decoder and the mask proportion used in pre-training are varied, and the best-performing structured referee document model is selected for application in the final real-world scenario.

8. A system for building a referee document structured model, characterized by comprising:

9. A method for obtaining referee document paragraph labels, the method comprising:

acquiring referee document data to be processed;

inputting input data into a structured referee document model established by the structured referee document model establishing method according to any one of claims 1-7, and outputting paragraph labels of the referee document to be processed by the structured referee document model.

10. A computer-readable storage medium, in which a computer program is stored, which, when executed by a processor, carries out the steps of the referee document structured model building method according to any one of claims 1 to 7.