CN110837558A - Judgment document entity relation extraction method and system - Google Patents

Judgment document entity relation extraction method and system

Info

Publication number
CN110837558A
CN110837558A
Authority
CN
China
Prior art keywords
entity
vector
training
model bert
relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911082739.0A
Other languages
Chinese (zh)
Other versions
CN110837558B (en)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xingyun Law Technology Co Ltd
Original Assignee
Chengdu Xingyun Law Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Xingyun Law Technology Co Ltd
Priority to CN201911082739.0A
Publication of CN110837558A
Application granted
Publication of CN110837558B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a system for extracting entity relationships from judgment documents. The method comprises the following steps: setting a format for sample data; converting training samples into that format to obtain format-converted training samples; training a pre-trained BERT model with the format-converted training samples to obtain a trained BERT model; and inputting a judgment-document sentence whose entity relationship is to be extracted into the trained BERT model, which outputs the relationship between two entities in the sentence. For the same data volume, the method and system perform better than other relation extraction models. They concatenate sentence information, entity position information, and entity type information in the middle of the model, which matches the natural (entity, relation, entity) triple structure of relation extraction.

Description

Judgment document entity relation extraction method and system
Technical Field
The invention relates to the field of information extraction in natural language processing, and in particular to a method and system for extracting entity relationships from judgment documents.
Background
Currently, common relation extraction approaches include rule-based extraction and extraction based on deep learning methods such as CNNs. Because of the diversity of legal relationships (civil cases alone involve more than 800 types) and the diversity of textual expression, rule-based relation extraction is not applicable in the judgment document domain. Relation extraction based on deep learning methods such as CNNs depends on large amounts of labeled data, which requires many professionals specialized in the various case subdomains; annotation is therefore inefficient and places high demands on annotators. At present, there is no typical solution for entity relation extraction in the judgment document domain.
Disclosure of Invention
The invention aims to extract the relationship between two entities in a judgment document. A judgment document describes the facts of a case in a specific domain; converting its unstructured case descriptions into a representation a program can process is an important step toward case matching. The invention extracts entity relationships from the case facts of judgment documents, supporting the description of cases in triple form.
To achieve the above object, one aspect of the present invention provides a method for extracting entity relationships from judgment documents, the method comprising:
setting a format for sample data;
converting the training samples into the format of the sample data to obtain format-converted training samples;
training a pre-trained BERT model with the format-converted training samples to obtain a trained BERT model;
inputting the judgment-document sentence whose entity relationship is to be extracted into the trained BERT model, and outputting the relationship between the two entities in the sentence.
Preferably, the structure of the pre-training model bert is as follows:
the input to the model is a token sequence that unambiguously represents a single referee's sentence or a pair of referee's sentences. For a given token, its input representation is constructed by summing the corresponding token, segment, and position elements, resulting in a 768-dimensional vector for the token length representation through a 12-layer bi-directional self-orientation process.
Preferably, the format of the sample data is: (sentence, (entity 1, start position 1, end position 1, entity type 1), (entity 2, start position 2, end position 2, entity type 2), relationship between entity 1 and entity 2). It represents the relationship of a known pair of entities in a sentence; the entity position information is added for disambiguation when several identical entity mentions occur in the sentence. For example, for the sentence "Zhang San drove the car that Zhang San had bought along Tianfu Avenue", the sample (sentence, (Zhang San, 1, 2, natural person), (car, 15, 18, motor vehicle), drive) states that the relationship between the entity "Zhang San" and the entity "car" is "drive". There are two mentions of "Zhang San" in the sentence, and the position information distinguishes them: the first "Zhang San" has the "drive" relationship with "car", while the second "Zhang San" has the "purchase" relationship with it.
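A minimal Python sketch of this sample format follows. The sentence, span indices, and type names mirror the patent's example but are illustrative; in particular, the span of the second "Zhang San" mention is a hypothetical value, and the English rendering of the Chinese example sentence is approximate.

```python
# One training sample per known entity pair, in the format
# (sentence, (ent1, start1, end1, type1), (ent2, start2, end2, type2), relation).
# Two samples share the sentence; the position fields disambiguate the two
# mentions of "Zhang San", as described above.
sentence = "Zhang San drove the car that Zhang San had bought along Tianfu Avenue"

sample_drive = (sentence, ("Zhang San", 1, 2, "natural person"),
                ("car", 15, 18, "motor vehicle"), "drive")
sample_purchase = (sentence, ("Zhang San", 8, 9, "natural person"),  # second mention; span illustrative
                   ("car", 15, 18, "motor vehicle"), "purchase")

def relation_of(sample):
    """Return (ent1 mention, ent1 span, ent2 mention, relation) for a sample."""
    _, (e1, s1, e1_end, _), (e2, _, _, _), rel = sample
    return e1, (s1, e1_end), e2, rel
```

Note that the two samples differ only in the entity-1 span and the relation label, which is exactly why the positions are part of the format.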
Preferably, the training process of the model is as follows:
inputting the format-converted training samples into the pre-trained BERT model;
extracting the output vector at the CLS position of the pre-trained BERT model as the vector representation of the judgment-document sentence, denoted sentenceEmbedding;
extracting, from the output sequence of the pre-trained BERT model, the vector sequence spanning the start position to the end position of entity 1, the mean of this vector sequence being used as the vector representation of entity 1, denoted ent1Embedding;
extracting, from the output sequence of the pre-trained BERT model, the vector sequence spanning the start position to the end position of entity 2, the mean of this vector sequence being used as the vector representation of entity 2, denoted ent2Embedding;
concatenating the sentenceEmbedding, ent1Embedding and ent2Embedding vectors in order to obtain the first concatenated vector;
transforming the first concatenated vector to obtain a transformed vector representation, denoted vector1;
converting the entity type of entity 1 into a vector representation, denoted ent1TypeEmbedding;
converting the entity type of entity 2 into a vector representation, denoted ent2TypeEmbedding;
concatenating vector1, ent1TypeEmbedding and ent2TypeEmbedding in order to obtain the second concatenated vector;
and classifying the second concatenated vector with a softmax layer to obtain the trained BERT model.
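The span-pooling and first-concatenation steps above can be sketched as follows. The 4-dimensional output vectors stand in for BERT's 768-dimensional ones and are invented for illustration.

```python
# Sketch of the steps above: take the CLS output as the sentence vector,
# mean-pool each entity's span in the output sequence, and concatenate.
# Toy 4-dim vectors stand in for BERT's 768-dim outputs.

def mean_pool(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical BERT output sequence, one vector per token; index 0 is [CLS].
outputs = [
    [1.0, 1.0, 1.0, 1.0],  # [CLS]
    [0.2, 0.4, 0.6, 0.8],  # token 1 -- entity 1 spans tokens 1..2
    [0.4, 0.2, 0.2, 0.0],  # token 2
    [0.5, 0.5, 0.5, 0.5],  # token 3 -- entity 2 spans token 3
]

sentence_embedding = outputs[0]           # sentenceEmbedding (CLS vector)
ent1_embedding = mean_pool(outputs[1:3])  # ent1Embedding: mean over entity 1's span
ent2_embedding = mean_pool(outputs[3:4])  # ent2Embedding: mean over entity 2's span

# First concatenation: sentenceEmbedding ++ ent1Embedding ++ ent2Embedding.
first_concat = sentence_embedding + ent1_embedding + ent2_embedding
```

In the real model the first concatenated vector would then pass through fully connected layers before the entity-type embeddings are appended.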
Preferably, the sentences in the method are sentences from judgment documents.
In another aspect, the invention also provides a judgment document entity relationship extraction system, comprising:
a sample data format setting unit, for setting the format of the sample data;
a training sample format conversion unit, for converting the training samples into the format of the sample data to obtain format-converted training samples;
a model training unit, for training a pre-trained BERT model with the format-converted training samples to obtain a trained BERT model;
and an entity relationship extraction unit, for inputting the judgment-document sentence whose entity relationship is to be extracted into the trained BERT model and outputting the relationship between the two entities in the sentence.
Further, the structure of the pre-trained BERT model is as follows:
the input to the model is a token sequence that unambiguously represents a single judgment-document sentence or a pair of such sentences; for a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings, and this representation then passes through 12 layers of bidirectional self-attention, yielding a 768-dimensional output vector for each token. Further, the format of the sample data is: (sentence, (entity 1, start position 1, end position 1, entity type 1), (entity 2, start position 2, end position 2, entity type 2), relationship between entity 1 and entity 2), which represents the relationship of a known pair of entities in a sentence; the entity position information is added for disambiguation when several identical entity mentions occur in the sentence. For example, for the sentence "Zhang San drove the car that Zhang San had bought along Tianfu Avenue", the sample (sentence, (Zhang San, 1, 2, natural person), (car, 15, 18, motor vehicle), drive) states that the relationship between the entity "Zhang San" and the entity "car" is "drive"; there are two mentions of "Zhang San" in the sentence, and the position information distinguishes them, the first "Zhang San" having the "drive" relationship with "car" and the second the "purchase" relationship.
Further, the model training unit trains the model as follows:
inputting the format-converted training samples into the pre-trained BERT model;
extracting the output vector at the CLS position of the pre-trained BERT model as the vector representation of the judgment-document sentence, denoted sentenceEmbedding;
extracting, from the output sequence of the pre-trained BERT model, the vector sequence spanning the start position to the end position of entity 1, the mean of this vector sequence being used as the vector representation of entity 1, denoted ent1Embedding;
extracting, from the output sequence of the pre-trained BERT model, the vector sequence spanning the start position to the end position of entity 2, the mean of this vector sequence being used as the vector representation of entity 2, denoted ent2Embedding;
concatenating the sentenceEmbedding, ent1Embedding and ent2Embedding vectors in order to obtain the first concatenated vector;
transforming the first concatenated vector to obtain a transformed vector representation, denoted vector1;
converting the entity type of entity 1 into a vector representation, denoted ent1TypeEmbedding;
converting the entity type of entity 2 into a vector representation, denoted ent2TypeEmbedding;
concatenating vector1, ent1TypeEmbedding and ent2TypeEmbedding in order to obtain the second concatenated vector;
and classifying the second concatenated vector with a softmax layer to obtain the trained BERT model.
Further, the sentences in the system are sentences from judgment documents.
One or more technical solutions provided by the invention have at least the following technical effects or advantages:
The method and system achieve good results with small samples. They adopt transfer learning with a pre-trained BERT model, and in small-sample experiments they perform far better than other relation extraction models (CNN, PCNN, LSTM).
For the same data volume (small, medium, or large samples), the method and system also outperform other relation extraction models (BERT+CNN, CNN, PCNN, LSTM). They concatenate sentence information, entity position information, and entity type information in the middle of the model, which matches the natural triple structure of relation extraction.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention;
FIG. 1 is a flow chart of the judgment document entity relationship extraction method of the present invention;
FIG. 2 is a schematic structural diagram of the pre-trained BERT model in the present invention;
FIG. 3 is a schematic diagram of the judgment document entity relationship extraction system of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflicting with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described and thus the scope of the present invention is not limited by the specific embodiments disclosed below.
Referring to fig. 1, an embodiment of the present invention provides a method for extracting entity relationships from judgment documents, the method comprising:
setting a format for sample data;
converting the training samples into the format of the sample data to obtain format-converted training samples;
training a pre-trained BERT model with the format-converted training samples to obtain a trained BERT model;
inputting the judgment-document sentence whose entity relationship is to be extracted into the trained BERT model, and outputting the relationship between the two entities in the sentence.
The method aims to extract the relationship between two entities in a judgment-document sentence. It comprises the following three steps:
Step 1: Data preparation. The format of a sample is defined as (sentence, (ent1, start1, end1, type1), (ent2, start2, end2, type2), relationship), where sentence is the sentence, ent1 and ent2 are entity 1 and entity 2, start1 and start2 are the start positions of the entities in the sentence, end1 and end2 their end positions, type1 and type2 their entity types, and relationship is the relationship between entity 1 and entity 2.
Step 2: Data processing. The input data are converted into the input format required by the pre-trained BERT model.
Step 3: Model training and entity relationship extraction. For the detailed structure of the model, see fig. 2. The training process comprises the following steps:
1. Input the format-converted data into the pre-trained BERT model.
2. Extract the output vector at the CLS position of the BERT model as the vector representation of the sentence, denoted sentenceEmbedding; it represents the meaning of the sentence.
3. Extract the vectors corresponding to positions start1 through end1 of ent1 in the output sequence of the BERT model, then average them to obtain the vector representation of ent1, denoted ent1Embedding.
4. Extract the vectors corresponding to positions start2 through end2 of ent2 in the output sequence, then average them to obtain the vector representation of ent2, denoted ent2Embedding.
5. Concatenate sentenceEmbedding, ent1Embedding, and ent2Embedding in order to obtain the first concatenated vector.
6. Transform the first concatenated vector through several fully connected layers to obtain a new vector representation, vector1.
7. Convert type1, the entity type of ent1, into a vector representation, ent1TypeEmbedding.
8. Convert type2, the entity type of ent2, into a vector representation, ent2TypeEmbedding.
9. Concatenate vector1, ent1TypeEmbedding, and ent2TypeEmbedding in order to obtain the second concatenated vector.
10. Apply a softmax layer to the second concatenated vector to classify it and obtain the corresponding relationship.
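Steps 9 and 10 can be sketched as below. The relation labels, the toy weight matrix, and the vector sizes are invented for illustration; a real classification head would use learned weights over the full concatenated dimension.

```python
import math

# Sketch of steps 9-10: concatenate vector1 with the two entity-type
# embeddings, score each relation label, and apply softmax.
# Labels, weights, and dimensions are illustrative only.

RELATIONS = ["drive", "purchase", "no_relation"]

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(vector1, ent1_type_emb, ent2_type_emb, weights):
    x = vector1 + ent1_type_emb + ent2_type_emb           # second concatenation
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]
    probs = softmax(scores)
    return RELATIONS[probs.index(max(probs))], probs

# Toy weight matrix: one row per relation label, width 6 = concatenated length.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
]
label, probs = classify([0.9, 0.1], [0.2, 0.3], [0.1, 0.1], weights)
```

During training, the cross-entropy between these probabilities and the labeled relation would drive the parameter updates; at inference time the argmax label is the extracted relationship.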
Referring to fig. 3, an embodiment of the present invention provides a judgment document entity relationship extraction system, comprising:
a sample data format setting unit, for setting the format of the sample data;
a training sample format conversion unit, for converting the training samples into the format of the sample data to obtain format-converted training samples;
a model training unit, for training a pre-trained BERT model with the format-converted training samples to obtain a trained BERT model;
and an entity relationship extraction unit, for inputting the judgment-document sentence whose entity relationship is to be extracted into the trained BERT model and outputting the relationship between the two entities in the sentence.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A method for extracting entity relationships from judgment documents, characterized in that the method comprises:
setting a format for sample data;
converting the training samples into the format of the sample data to obtain format-converted training samples;
training a pre-trained BERT model with the format-converted training samples to obtain a trained BERT model;
inputting the judgment-document sentence whose entity relationship is to be extracted into the trained BERT model, and outputting the relationship between the two entities in the sentence.
2. The method for extracting judgment document entity relationships according to claim 1, wherein the structure of the pre-trained BERT model is as follows:
the input to the pre-trained BERT model is a token sequence that unambiguously represents a single judgment-document sentence or a pair of such sentences; for a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings, which then passes through 12 layers of bidirectional self-attention, yielding a 768-dimensional output vector for each token.
3. The method according to claim 1, wherein the sample data format is: (sentence, (entity 1, start position 1, end position 1, entity type 1), (entity 2, start position 2, end position 2, entity type 2), relationship between entity 1 and entity 2).
4. The method for extracting judgment document entity relationships according to claim 3, wherein the training process of the pre-trained BERT model is as follows:
inputting the format-converted training samples into the pre-trained BERT model;
extracting the output vector at the CLS position of the pre-trained BERT model as the vector representation of the judgment-document sentence, denoted sentenceEmbedding;
extracting, from the output sequence of the pre-trained BERT model, the vector sequence spanning the start position to the end position of entity 1, the mean of this vector sequence being used as the vector representation of entity 1, denoted ent1Embedding;
extracting, from the output sequence of the pre-trained BERT model, the vector sequence spanning the start position to the end position of entity 2, the mean of this vector sequence being used as the vector representation of entity 2, denoted ent2Embedding;
concatenating the sentenceEmbedding, ent1Embedding and ent2Embedding vectors in order to obtain the first concatenated vector;
transforming the first concatenated vector to obtain a transformed vector representation, denoted vector1;
converting the entity type of entity 1 into a vector representation, denoted ent1TypeEmbedding;
converting the entity type of entity 2 into a vector representation, denoted ent2TypeEmbedding;
concatenating vector1, ent1TypeEmbedding and ent2TypeEmbedding in order to obtain the second concatenated vector;
and classifying the second concatenated vector with a softmax layer to obtain the trained BERT model.
5. A judgment document entity relationship extraction system, characterized in that the system comprises:
a sample data format setting unit, for setting the format of the sample data;
a training sample format conversion unit, for converting the training samples into the format of the sample data to obtain format-converted training samples;
a model training unit, for training a pre-trained BERT model with the format-converted training samples to obtain a trained BERT model;
and an entity relationship extraction unit, for inputting the judgment-document sentence whose entity relationship is to be extracted into the trained BERT model and outputting the relationship between the two entities in the sentence.
6. The judgment document entity relationship extraction system according to claim 5, wherein the structure of the pre-trained BERT model is as follows:
the input to the pre-trained BERT model is a token sequence that unambiguously represents a single judgment-document sentence or a pair of such sentences; for a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings, which then passes through 12 layers of bidirectional self-attention, yielding a 768-dimensional output vector for each token.
7. The system according to claim 5, wherein the sample data format is: (sentence, (entity 1, start position 1, end position 1, entity type 1), (entity 2, start position 2, end position 2, entity type 2), relationship between entity 1 and entity 2).
8. The system for extracting judgment document entity relationships according to claim 7, wherein the model training unit trains the pre-trained BERT model as follows:
inputting the format-converted training samples into the pre-trained BERT model;
extracting the output vector at the CLS position of the pre-trained BERT model as the vector representation of the judgment-document sentence, denoted sentenceEmbedding;
extracting, from the output sequence of the pre-trained BERT model, the vector sequence spanning the start position to the end position of entity 1, the mean of this vector sequence being used as the vector representation of entity 1, denoted ent1Embedding;
extracting, from the output sequence of the pre-trained BERT model, the vector sequence spanning the start position to the end position of entity 2, the mean of this vector sequence being used as the vector representation of entity 2, denoted ent2Embedding;
concatenating the sentenceEmbedding, ent1Embedding and ent2Embedding vectors in order to obtain the first concatenated vector;
transforming the first concatenated vector to obtain a transformed vector representation, denoted vector1;
converting the entity type of entity 1 into a vector representation, denoted ent1TypeEmbedding;
converting the entity type of entity 2 into a vector representation, denoted ent2TypeEmbedding;
concatenating vector1, ent1TypeEmbedding and ent2TypeEmbedding in order to obtain the second concatenated vector;
and classifying the second concatenated vector with a softmax layer to obtain the trained BERT model.
CN201911082739.0A 2019-11-07 2019-11-07 Judgment document entity relation extraction method and system Active CN110837558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911082739.0A CN110837558B (en) 2019-11-07 2019-11-07 Judgment document entity relation extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911082739.0A CN110837558B (en) 2019-11-07 2019-11-07 Judgment document entity relation extraction method and system

Publications (2)

Publication Number Publication Date
CN110837558A (en) 2020-02-25
CN110837558B (en) 2022-04-15

Family

ID=69574630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911082739.0A Active CN110837558B (en) 2019-11-07 2019-11-07 Judgment document entity relation extraction method and system

Country Status (1)

Country Link
CN (1) CN110837558B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10170114B2 (en) * 2013-05-30 2019-01-01 Promptu Systems Corporation Systems and methods for adaptive proper name entity recognition and understanding
CN109933789A (en) * 2019-02-27 2019-06-25 中国地质大学(武汉) A kind of judicial domain Relation extraction method and system neural network based
CN110032648A (en) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 A kind of case history structuring analytic method based on medical domain entity
CN110059320A (en) * 2019-04-23 2019-07-26 腾讯科技(深圳)有限公司 Entity relation extraction method, apparatus, computer equipment and storage medium
CN110287334A (en) * 2019-06-13 2019-09-27 淮阴工学院 A kind of school's domain knowledge map construction method based on Entity recognition and attribute extraction model
CN110347802A (en) * 2019-07-17 2019-10-18 北京金山数字娱乐科技有限公司 A kind of text analyzing method and device
US20190325029A1 (en) * 2018-04-18 2019-10-24 HelpShift, Inc. System and methods for processing and interpreting text messages


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HAINAN CHEN et al.: "An automatic literature knowledge graph and reasoning network modeling framework based on ontology and natural language processing", Advanced Engineering Informatics *
YANG Piao et al.: "Chinese named entity recognition method based on BERT embedding", Computer Engineering (《计算机工程》) *
WANG Lisheng: "Research on machine reading comprehension based on multi-level semantic representation and hybrid attention mechanism", China Master's Theses Full-text Database (Information Science and Technology) *
ZUMURETIGULI Kuerban et al.: "Comparative analysis of Chinese named entity recognition models", Modern Computer (《现代计算机》) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563374A (en) * 2020-03-23 2020-08-21 北京交通大学 Personnel social relationship extraction method based on judicial official documents
CN111476034A (en) * 2020-04-07 2020-07-31 同方赛威讯信息技术有限公司 Legal document information extraction method and system based on combination of rules and models
CN111539490A (en) * 2020-06-19 2020-08-14 支付宝(杭州)信息技术有限公司 Business model training method and device
CN111539490B (en) * 2020-06-19 2020-10-16 支付宝(杭州)信息技术有限公司 Business model training method and device
CN113569046A (en) * 2021-07-19 2021-10-29 北京华宇元典信息服务有限公司 Judgment document character relation identification method and device and electronic equipment

Also Published As

Publication number Publication date
CN110837558B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN110837558B (en) Judgment document entity relation extraction method and system
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN110162636B (en) D-LSTM-based emotion reason identification method
CN112116075B (en) Event extraction model generation method and device, text event extraction method and device
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN108664474B (en) Resume analysis method based on deep learning
CN111259141A (en) Social media corpus emotion analysis method based on multi-model fusion
CN110276396B (en) Image description generation method based on object saliency and cross-modal fusion features
CN109783636A (en) A kind of car review subject distillation method based on classifier chains
CN114398889A (en) Video text summarization method, device and storage medium based on multi-modal model
CN115658955B (en) Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN112749283A (en) Entity relationship joint extraction method for legal field
CN114170411A (en) Picture emotion recognition method integrating multi-scale information
CN115687571A (en) Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
CN112131881A (en) Information extraction method and device, electronic equipment and storage medium
CN115374786A (en) Entity and relationship combined extraction method and device, storage medium and terminal
CN116246279A (en) Graphic and text feature fusion method based on CLIP background knowledge
CN113761377A (en) Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN117115505A (en) Emotion enhancement continuous training method combining knowledge distillation and contrast learning
CN116450829A (en) Medical text classification method, device, equipment and medium
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN115600561A (en) Webpage structuring method, equipment and storage medium fusing rules and small samples
WO2022251720A1 (en) Character-level attention neural networks
CN114510561A (en) Answer selection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant