CN112800764A - Entity extraction method in the legal field based on the Word2Vec-BiLSTM-CRF model

Info

Publication number: CN112800764A
Application number: CN202011620453.6A
Authority: CN
Inventors: 李参宏
Original assignee: Jiangsu Netmarch Technologies Co ltd
Current assignee: Jiangsu Netmarch Technologies Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-05-14
Anticipated expiration: 2040-12-31
Also published as: CN112800764B

Abstract

The invention discloses a named entity recognition method for the legal field based on Word2Vec-BiLSTM-CRF, which specifically comprises the following steps: acquiring original data in the legal field and preprocessing it to obtain training corpus data; inputting the training corpus data into the Word2Vec algorithm combined with a CBOW model to obtain word vectors for the legal field; labeling the preprocessed training corpus data by combining template matching with patterns in the Chinese corpus such as the enumeration comma (顿号, "、") to obtain a labeled corpus; using Bi-LSTM as the encoding layer of the model, with the labeled corpus combined with the word vectors as its input, and outputting text semantic features; and using the text semantic features produced by the Bi-LSTM layer as the input of the CRF, which finally outputs the named entity recognition result. The method identifies the many types of entities found in legal documents, achieves fine-grained characterization of legal-domain entities, and structures legal-domain data, which is significant for further mining the relationships between different entities in the legal field.

Description

Entity extraction method in the legal field based on the Word2Vec-BiLSTM-CRF model

Technical Field

The invention relates to the field of named entity recognition, in particular to an entity extraction method in the legal field based on a Word2Vec-BiLSTM-CRF model.

Background

In the legal field, the named entities involved are numerous and complex, whether during case investigation or at trial. The most common of these entities are the elements of the course of a case, such as persons (criminal suspects, victims), time, place, motive, and events. Each case element has different characteristics and forms of expression in the context of different criminal statutes and charges.

The legal field contains a wide variety of entities, each of which may be expressed in different forms. Identifying named entities with different forms of expression by a uniform method, achieving fine-grained characterization of legal-domain entities, structuring legal-domain data, and further mining the relationships between different entities in the legal field are therefore of great significance.

Chinese patent publication No. CN110807084A, published on February 18, 2020, discloses a patent term relationship extraction method based on Bi-LSTM with an attention mechanism and a keyword strategy, which includes the following steps: step 1): preprocessing the patent text, identifying term features, adding position information, obtaining category keyword features through an improved TextRank algorithm, and forming a vector matrix; step 2): feeding the vector matrix into a Bi-LSTM model and using an attention mechanism to obtain the global features of the text; step 3): using a max-pooling layer to select the key features of each sentence as local features; step 4): fusing the global and local features; step 5): outputting the classification result with a softmax classifier. That invention addresses the long-distance dependency problem of traditional deep learning methods in patent term relationship extraction; according to its comparative experiments, it outperforms existing methods and meets the needs of practical application well.

Because named entities in patents are simple and uniform compared with those in the legal field, that method can extract patent terms, but it cannot be applied effectively to the legal field, where named entities are complex. No effective recognition method exists for mining legal-domain entities, and extraction results are poor.

Therefore, it is necessary to provide a new extraction method to solve the above problems.

Disclosure of Invention

In order to solve the problems described in the background, the invention provides a named entity recognition method for the legal field based on Word2Vec-BiLSTM-CRF, which can mine the relationships between different entities in the legal field.

To achieve this purpose, the invention provides the following technical solution: a named entity recognition method for the legal field based on Word2Vec-BiLSTM-CRF, which specifically comprises the following steps:

Step A: acquiring original data in the legal field and preprocessing it to obtain training corpus data;

Step B: inputting the corpus data obtained in step A into the Word2Vec algorithm combined with a CBOW model, to obtain word vectors for the legal field;

Step C: labeling the training corpus data preprocessed in step A by combining template matching with patterns in the Chinese corpus such as the enumeration comma (顿号, "、"), to obtain a labeled corpus, specifically:

step C1: constructing a label system according to the specific entities contained in the legal field, using the BIO labeling scheme, in which a B label marks the beginning of an entity, an I label marks a non-beginning part of an entity, and an O label marks a non-entity part;

step C2: constructing an initial entity library for the legal field;

step C3: traversing the training corpus data set to obtain the set of sentences that conform to the enumeration-comma and similar patterns;

step C4: using the enumeration-comma and similar patterns to match synonyms and parallel words of the entities in the initial entity library, and expanding the entity library with them;

step C5: labeling entities in the training corpus data by template matching against the entities in the legal entity library;

step C6: checking the labeled training corpus data obtained in step C5 by manual screening, correcting and supplementing entity labels, updating the entity library, and finally obtaining correctly labeled training corpus data;

Step D: using Bi-LSTM as the encoding layer of the model, with the labeled corpus obtained in step C combined with the word vectors obtained in step B as the input of the encoding layer, and outputting text semantic features;

Step E: using the text semantic features produced by the Bi-LSTM layer in step D as the input of the CRF, which finally outputs the named entity recognition result.

Step B specifically comprises: constructing a stop-word list specific to the legal field, and performing word segmentation and stop-word removal on the training corpus data obtained in step A using the Chinese word segmentation tools jieba and ltp; and converting the semantic information contained in the vocabulary into n-dimensional word vectors using the Word2Vec algorithm combined with a CBOW model, to obtain word vectors specific to the legal field.

Compared with the prior art, the entity extraction method in the legal field based on Word2Vec-BiLSTM-CRF has the following beneficial effects: it identifies the many types of entities found in legal documents, achieves fine-grained characterization of legal-domain entities, and structures legal-domain data, which is significant for further mining the relationships between different entities in the legal field.

Drawings

FIG. 1 is a schematic flow chart of the entity extraction method in the legal field based on Word2Vec-BiLSTM-CRF according to the invention.

FIG. 2 is a flow chart of obtaining the labeled corpus according to the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention is further explained below with reference to the drawings and the embodiments.

Referring to FIG. 1, the invention provides an entity extraction method in the legal field based on Word2Vec-BiLSTM-CRF, which specifically comprises the following steps:

Step A: acquiring original data in the legal field and preprocessing it to obtain training corpus data, which comprises the following steps:

step A1: acquiring original data in the legal field from the Internet by combining crawler technology with manual screening, including case statements, litigation reports, judgment documents and the like;

step A2: performing preliminary cleaning and noise reduction on the resulting semi-structured or unstructured multi-source data to obtain usable data.
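As an illustration of step A2, the following is a minimal sketch of one possible cleaning pass over crawled text. The text does not specify the cleaning rules; stripping markup, normalizing Unicode, collapsing whitespace, and dropping duplicates are assumptions made here, and the function names `clean_document` and `clean_corpus` are hypothetical.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # leftover HTML/XML tags
SPACE_RE = re.compile(r"\s+")     # runs of whitespace, including newlines


def clean_document(raw: str) -> str:
    """Return a normalized plain-text version of one crawled document."""
    text = html.unescape(raw)                   # decode entities such as &nbsp;
    text = TAG_RE.sub(" ", text)                # drop markup
    # NFKC folds full-width letters, digits and spaces; note that it also
    # folds full-width punctuation such as "，" to ",".
    text = unicodedata.normalize("NFKC", text)
    return SPACE_RE.sub(" ", text).strip()


def clean_corpus(raw_docs):
    """Clean every document, dropping empties and exact duplicates (order kept)."""
    seen, cleaned = set(), []
    for raw in raw_docs:
        text = clean_document(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

A real pipeline would likely add source-specific rules (for example, removing page headers from judgment documents), which are outside what the text describes.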

Step B: training word vectors for the legal domain, by inputting the corpus data obtained in step A into the Word2Vec algorithm combined with a CBOW model to obtain word vectors for the legal field; this comprises the following steps:

step B1: constructing a stop-word list for the legal field, and performing word segmentation and stop-word removal on the training corpus data using Chinese word segmentation tools such as jieba and ltp;

step B2: obtaining word vectors for the legal field using the Word2Vec algorithm;

step B3: the Word2Vec algorithm uses the CBOW model to convert semantic information into n-dimensional vectors. The input of the CBOW model is the word vectors of the context words surrounding a given target word, and the output is the word vector of that target word, so the semantic information of the context is well preserved.

Step C: for the training corpus data preprocessed in step A, constructing an initial entity library for the legal field and labeling the data by combining template matching with patterns in the Chinese corpus such as the enumeration comma (顿号, "、"), to obtain a labeled corpus; using these patterns effectively reduces manual labeling work; this comprises the following steps:

step C1: constructing a label system for named entities in the legal field, wherein the named entities comprise the types, components and characteristics of laws; the BIO labeling scheme is used, in which a B label marks the beginning of an entity, an I label marks a non-beginning part of an entity, and an O label marks a non-entity part;

step C2: manually constructing an initial entity library in the legal field;

step C3: traversing the training corpus data set to obtain the set of sentences that conform to the enumeration-comma and similar patterns;

In the Chinese corpus, the enumeration comma (顿号, "、") is mainly used to list words of the same kind. It is therefore assumed that when an entity in the corpus appears immediately before or after an enumeration comma, the words listed in parallel with it are usually words of the same kind as, or synonyms of, that entity, and can be added to the entity library as entities. This is called the enumeration-comma pattern.

The enumeration-comma pattern is not limited to entities joined directly by an enumeration comma; it generally also covers several related forms of expression.

step C4: using the enumeration-comma and similar patterns to match synonyms, parallel words and the like of the entities in the initial entity library, and expanding the entity library with them;

step C5: labeling entities in the training corpus data by template matching against the entities in the legal entity library;

step C6: checking the labeled training corpus data obtained in step C5 by manual screening, correcting and supplementing entity labels, updating the entity library, and finally obtaining correctly labeled training corpus data.
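Steps C3 to C5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented procedure: sentences are assumed to be pre-segmented, each enumerated item is assumed to be a single token, template matching is approximated by longest-match dictionary lookup, and the type names `LOC` and `CHARGE` as well as the function names are hypothetical.

```python
ENUM_COMMA = "、"


def expand_library(library, tokenized_sentences):
    """Add tokens listed alongside a known entity (joined by '、') to the library."""
    expanded = dict(library)
    for tokens in tokenized_sentences:
        i = 0
        while i < len(tokens):
            # Collect one enumeration run: item 、 item 、 item ...
            run = [tokens[i]]
            j = i + 1
            while j + 1 < len(tokens) and tokens[j] == ENUM_COMMA:
                run.append(tokens[j + 1])
                j += 2
            if len(run) > 1:
                types = {expanded[t] for t in run if t in expanded}
                if len(types) == 1:  # expand only when the type is unambiguous
                    label = types.pop()
                    for t in run:
                        expanded.setdefault(t, label)
            i = j
    return expanded


def bio_label(tokens, library, max_span=4):
    """Tag tokens with B-/I-/O labels by longest-match lookup in the library."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match_len = 0
        for span in range(min(max_span, len(tokens) - i), 0, -1):
            if "".join(tokens[i:i + span]) in library:
                match_len = span
                break
        if match_len:
            kind = library["".join(tokens[i:i + match_len])]
            labels[i] = "B-" + kind
            for k in range(i + 1, i + match_len):
                labels[k] = "I-" + kind
            i += match_len
        else:
            i += 1
    return labels


# Hypothetical initial entity library and corpus sentence.
initial = {"北京": "LOC", "盗窃罪": "CHARGE"}
corpus = [["被告人", "先后", "在", "北京", "、", "上海", "、", "广州", "等", "地", "作案"]]

library = expand_library(initial, corpus)
labels = bio_label(["张三", "在", "广州", "犯", "盗窃", "罪"], library)
```

The expanded library and the resulting labels would still go through the manual check of step C6, since parallel items are only usually, not always, of the same entity type.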

Step D: using a Bi-LSTM model as the encoding layer, with X = (x_1, x_2, x_3, ..., x_n) as the input to the encoding layer, where x_i is the legal-domain word vector, trained in step B, that corresponds to the i-th word of the training corpus data labeled in step C;

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

h_t = o_t * tanh(C_t)

{h_0, h_1, ..., h_n} = {[h_{L0}, h_{Rn}], [h_{L1}, h_{R(n-1)}], ..., [h_{Ln}, h_{R0}]}

Bi-LSTM can effectively use past features (through forward states) and future features (through backward states) within a specified time horizon. The bidirectional LSTM network is trained with backpropagation through time.
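The gate equations and the forward/backward concatenation can be illustrated with a small pure-Python sketch. The text lists only f_t, i_t, o_t and h_t; the candidate state and the cell-state update used below (C_t = f_t * C_{t-1} + i_t * tanh(W_c · [h_{t-1}, x_t] + b_c)) are the standard LSTM form and are an assumption here, as are the random weights, the dimensions, and all function names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W·v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM step over the concatenation [h_{t-1}, x_t]."""
    z = h_prev + x_t
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]
    # Assumed standard candidate state and cell update (not listed in the text).
    g = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def init_params(hidden, embed, rng):
    """Small random weights and zero biases for the four gates f, i, o, c."""
    params = {}
    for gate in "fioc":
        params["W_" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(hidden + embed)]
            for _ in range(hidden)
        ]
        params["b_" + gate] = [0.0] * hidden
    return params


def run_direction(params, xs, hidden):
    """Run one LSTM over the sequence and collect every hidden state."""
    h, c, out = [0.0] * hidden, [0.0] * hidden, []
    for x in xs:
        h, c = lstm_step(params, h, c, x)
        out.append(h)
    return out


def bilstm(fwd, bwd, xs, hidden):
    """Concatenate forward states with backward states aligned by position."""
    left = run_direction(fwd, xs, hidden)
    right = run_direction(bwd, xs[::-1], hidden)[::-1]
    return [l + r for l, r in zip(left, right)]


rng = random.Random(0)
EMBED, HIDDEN = 4, 2
xs = [[rng.uniform(-1, 1) for _ in range(EMBED)] for _ in range(3)]
features = bilstm(init_params(HIDDEN, EMBED, rng), init_params(HIDDEN, EMBED, rng), xs, HIDDEN)
```

Each element of `features` pairs the forward state at a position with the backward state that has read the rest of the sentence, which is what the concatenation [h_{Lt}, h_{R(n-t)}] expresses. In practice a deep learning framework would supply this layer and its training.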

Step E: inputting the feature vectors produced by the Bi-LSTM layer into a CRF layer to obtain a score for each label of each word;

The CRF layer can make effective use of sentence-level label information and imposes constraints that ensure the final prediction is valid, which supports further mining of the relationships between different entities in the legal field; these constraints are learned automatically by the CRF layer from the training data. Specifically:

the sentence in which entities are to be identified is expressed as follows, where x_i denotes the i-th word in the sentence:

X = (x_1, x_2, ..., x_n);

the corresponding labels of the sentence are:

Y = (y_1, y_2, ..., y_n);

a scoring function is then defined for the sentence and a candidate label sequence, in terms of two matrices. A is the transition score matrix, and A_{i,j} is the score of a transition from label i to label j; y_0 and y_n are the start and end tags of the sentence, so the dimension of A is (k+2) × (k+2), where k is the number of labels. P is the score matrix output by the Bi-LSTM network, with dimension n × k, and P_{i,j} is the score of the j-th label for the i-th word in the sentence.

The aim is to obtain a maximum value of the scoring function.
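Finding the label sequence that maximizes the score can be sketched with Viterbi decoding. The scoring function is assumed here to be the sum of transition scores from A (including the start and end tags) and per-word scores from P, which matches the matrix definitions above but is not spelled out in this text; the tag indices, the score values, and the function names are hypothetical.

```python
from itertools import product


def path_score(A, P, tags, start, end):
    """Score of one tag sequence: transitions (incl. start/end) plus word scores."""
    score = A[start][tags[0]] + A[tags[-1]][end]
    score += sum(A[a][b] for a, b in zip(tags, tags[1:]))
    score += sum(P[i][t] for i, t in enumerate(tags))
    return score


def viterbi(A, P, start, end):
    """Return (best_score, best_tags) maximizing path_score by dynamic programming."""
    n, k = len(P), len(P[0])
    best = [A[start][t] + P[0][t] for t in range(k)]
    back = []
    for i in range(1, n):
        prev, best, ptr = best, [], []
        for t in range(k):
            s, arg = max((prev[u] + A[u][t], u) for u in range(k))
            best.append(s + P[i][t])
            ptr.append(arg)
        back.append(ptr)
    score, last = max((best[t] + A[t][end], t) for t in range(k))
    tags = [last]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return score, tags[::-1]


# Hypothetical scores for k = 3 tags (0 = B, 1 = I, 2 = O) and n = 3 words.
# A is (k + 2) x (k + 2): rows are "from", columns are "to"; 3 = start, 4 = end.
K, START, END = 3, 3, 4
A = [
    [0.0, 1.0, 0.0, -10.0, 0.0],       # from B
    [0.0, 0.5, 0.0, -10.0, 0.0],       # from I
    [0.5, -10.0, 0.5, -10.0, 0.0],     # from O: O -> I is heavily penalized
    [0.0, -10.0, 0.0, -10.0, -10.0],   # from start: a sentence cannot begin with I
    [-10.0, -10.0, -10.0, -10.0, -10.0],
]
P = [[0.1, 0.2, 2.0], [1.5, 1.4, 0.3], [0.2, 1.0, 0.9]]

best_score, best_tags = viterbi(A, P, START, END)
# Exhaustive check over all k**n sequences, feasible only for tiny examples.
brute = max(path_score(A, P, list(t), START, END) for t in product(range(K), repeat=len(P)))
```

The large negative entries in A play the role of the learned constraints mentioned above: they make invalid sequences such as O followed by I score poorly.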

For a given sentence X, the probability of obtaining a label sequence Y is defined over Y_X, the set of all possible label sequences for X. That is, every candidate label sequence for the sentence has a score and a probability, and the goal is to maximize the probability of the true sequence for the sentence.
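The scoring and probability formulas themselves do not appear in this text. Under the definitions of A, P and Y_X given above, the usual BiLSTM-CRF formulation would read as follows; this is a reconstruction offered as an assumption, and it writes the end tag as y_{n+1} so that it does not collide with the label of the n-th word, whereas the text names y_0 and y_n as the start and end tags.

```latex
s(X, Y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i}

p(Y \mid X) = \frac{\exp\bigl(s(X, Y)\bigr)}{\sum_{\tilde{Y} \in Y_X} \exp\bigl(s(X, \tilde{Y})\bigr)}
```

The first sum collects transition scores along the sequence and the second collects the Bi-LSTM score of each word's label; the probability is a softmax of the score over all candidate sequences in Y_X.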

In addition, a loss function is provided and minimized. It is obtained by transforming the probability above and is expressed by a likelihood formula.
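The loss and likelihood formulas are likewise missing from this text. Writing s(X, Y) for the scoring function, p(Y | X) for the probability of label sequence Y, and Y_X for the set of all candidate label sequences, the conventional choice is the negative log-likelihood of the true sequence; the following is a sketch under that assumption rather than the patent's own expression.

```latex
\log p(Y \mid X) = s(X, Y) - \log \sum_{\tilde{Y} \in Y_X} \exp\bigl(s(X, \tilde{Y})\bigr)

\mathcal{L} = -\log p(Y \mid X)
```

Minimizing this loss is equivalent to maximizing the probability of the true label sequence, which agrees with the stated training goal.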

Finally, the named entities identified in the course of a case, such as persons, motives, and events, are output. In this way, the many types of entities in legal documents are identified, fine-grained characterization of legal-domain entities and structuring of legal-domain data are achieved, and the relationships between different entities in the legal field can be mined further.

The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims

1. A named entity recognition method for the legal field based on Word2Vec-BiLSTM-CRF, characterized by comprising the following steps:

step A: acquiring original data in the legal field and preprocessing the data to obtain training corpus data;

step B: inputting the corpus data obtained in step A into the Word2Vec algorithm combined with a CBOW model, to obtain word vectors for the legal field;

step C: labeling the training corpus data preprocessed in step A by combining template matching with patterns in the Chinese corpus such as the enumeration comma (顿号, "、"), to obtain a labeled corpus, specifically:

step C1: constructing a label system according to a specific entity contained in the legal field, wherein a BIO labeling mode is adopted, a B label is used as the beginning of the entity, an I label represents a non-beginning part of the entity, and an O label represents a non-entity part;

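step C2: constructing an initial entity library for the legal field;

step C3: traversing the training corpus data set to obtain the set of sentences that conform to the enumeration-comma and similar patterns;

step C4: using the enumeration-comma and similar patterns to match synonyms and parallel words of the entities in the initial entity library, and expanding the entity library with them;

step C5: labeling entities in the training corpus data by template matching against the entities in the legal entity library;

step C6: checking the labeled training corpus data obtained in step C5 by manual screening, correcting and supplementing entity labels, updating the entity library, and finally obtaining correctly labeled training corpus data;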

step D: using Bi-LSTM as the encoding layer of the model, with the labeled corpus obtained in step C combined with the word vectors obtained in step B as the input of the encoding layer, and outputting text semantic features;

step E: using the text semantic features produced by the Bi-LSTM layer in step D as the input of the CRF, which finally outputs the named entity recognition result.

2. The named entity recognition method for the legal field based on Word2Vec-BiLSTM-CRF as claimed in claim 1, wherein step B specifically comprises the following steps:

step B1: constructing a stop-word list specific to the legal field, and performing word segmentation and stop-word removal on the training corpus data obtained in step A using the Chinese word segmentation tools jieba and ltp;

step B2: converting the semantic information contained in the vocabulary into n-dimensional word vectors using the Word2Vec algorithm combined with a CBOW model, to obtain word vectors specific to the legal field.

3. The named entity recognition method for the legal field based on Word2Vec-BiLSTM-CRF as claimed in claim 1, wherein in step E the sentence in which entities are to be identified is expressed as:

X = (x_1, x_2, ..., x_n);

the corresponding labels of the sentence are:

Y = (y_1, y_2, ..., y_n);

a scoring function is determined for the sentence in which entities are to be identified, and its maximum value is obtained, wherein A is the transition score matrix and A_{i,j} is the score of a transition from label i to label j; y_0 and y_n are the start and end tags of the sentence, so the dimension of A is (k+2) × (k+2); P is the score matrix output by the Bi-LSTM network, with dimension n × k, and P_{i,j} is the score of the j-th label for the i-th word in the sentence;

for a given sentence X, the probability of the corresponding true label sequence is maximized, where Y_X denotes all possible label sequences for sentence X.

4. The named entity recognition method for the legal field based on Word2Vec-BiLSTM-CRF as claimed in claim 3, wherein a loss function is provided and its minimum value is obtained, the loss function being transformed to the following equation: