CN113221567A - Judicial domain named entity and relationship combined extraction method

Judicial domain named entity and relationship combined extraction method

Info

Publication number
CN113221567A
CN113221567A (application CN202110505601.8A)
Authority
CN
China
Prior art keywords
entity
relationship
relationships
vector
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110505601.8A
Other languages
Chinese (zh)
Inventor
毛松
李振伟
程佳
张文静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING AEROSPACE INTELLIGENCE AND INFORMATION INSTITUTE
Original Assignee
BEIJING AEROSPACE INTELLIGENCE AND INFORMATION INSTITUTE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING AEROSPACE INTELLIGENCE AND INFORMATION INSTITUTE filed Critical BEIJING AEROSPACE INTELLIGENCE AND INFORMATION INSTITUTE
Priority to CN202110505601.8A
Publication of CN113221567A
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions

Abstract

The invention discloses a named entity and relationship joint extraction method for the judicial field. The method is an entity-relationship extraction approach that combines a BILSTM network and an attention mechanism on top of a BERT pre-trained language model; it realizes joint learning of the two tasks through parameter sharing and fully exploits the connection between the tasks to optimize the results. A BERT pre-trained language model is first selected to train word vectors and convert the data set into word-vector form; a BILSTM neural network is then used to obtain more complete contextual feature information and thereby extract deep word-vector features of the text; finally, a softmax classifier assigns a category label to each character to realize entity recognition, while an attention mechanism judges the association between the current character and the preceding characters to realize joint extraction of entities and multiple relationships.

Description

Judicial domain named entity and relationship combined extraction method
Technical Field
The invention relates to the technical field of information extraction, in particular to a named entity and relationship combined extraction method in the judicial field.
Background
With the rapid development of the Internet and the explosive growth of information, how to efficiently acquire the required information has become a hot research problem, and information extraction technology has emerged in response. Information extraction can be subdivided into three subtasks: named entity recognition, entity relationship extraction, and event extraction. Semantic triples obtained through entity recognition and entity relationship extraction are an important precondition for constructing knowledge graphs and understanding natural language. The judicial field is a typical knowledge-intensive industry; in the big-data era of information explosion, judicial work produces laws and regulations, guiding cases, legal documents and the like, so substantial and massive judicial data is available to the public, the parties, and judicial authorities. Texts in the judicial field are mainly professional descriptions, written by legal personnel, of the past conduct of offenders or suspects, and they contain a large number of entities and entity relationships related to case details. The traditional approach of extracting, integrating, and managing this information purely by hand can no longer meet current information extraction needs. Therefore, designing models that extract information automatically has become a hot issue in the judicial industry, and effectively recognizing named entities and classifying relationships in judicial data is a key step toward automated adjudication. Existing related studies usually separate the two subtasks of entity extraction and relationship extraction and run them in a pipeline. Although the pipeline framework is flexible in integrating different data sources and learning algorithms, it has certain problems: the connection between the two tasks is lost, which leads to error propagation, and the common overlapping relationships among judicial entities are not considered.
Therefore, how to take the connection between the tasks and the overlapping relationships among judicial named entities into account, eliminate error propagation, and jointly extract judicial named entities and their relationships is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a named entity and relationship joint extraction method for the judicial field. The method is an entity-relationship extraction approach that combines a BILSTM network and an attention mechanism on top of a BERT pre-trained language model: the text feature parameters extracted by the bottom-layer BILSTM network are shared, joint learning of the two tasks is realized through parameter sharing, the connection between the tasks is fully exploited to optimize the results, and the interaction information between the tasks is used to improve the performance of entity recognition and relationship extraction without splitting the model.
In order to achieve the purpose, the invention adopts the following technical scheme:
a judicial domain named entity and relationship combined extraction method comprises the following steps:
Step 1: selecting judicial domain data, setting a labeling strategy, labeling the judicial domain data according to the labeling strategy, and constructing a relation extraction data set;
Step 2: inputting the relation extraction data set into a BERT pre-trained language model to obtain a vector representation of each character in the data set and generate a word vector sequence;
Step 3: inputting the word vector sequence into a BILSTM network to extract feature information and obtain semantic vectors;
Step 4: classifying the semantic vectors with a softmax classifier to obtain the entity label of each character, thereby realizing entity recognition;
Step 5: extracting the relationship labels between entities from the semantic vectors with an attention mechanism, thereby realizing entity relationship extraction.
Preferably, the labeling strategy specifies that an entity label is composed of four parts: an entity boundary, an entity category, a relationship category, and an entity position. The entity boundary follows the BIO tagging principle; the entity categories include legal documents, legal subjects, legal objects and legal facts; the judicial relationships comprise construction, subject, reason, result, tool, mode, place, time, purpose, engagement, reception and parallel relationships; the entity position is denoted 1, 2 or M, where 1 indicates that the character belongs to the 1st entity in the relationship, 2 indicates the 2nd entity, and M indicates that the character participates in overlapping relationships and occupies different positions in them.
Preferably, the BERT pre-trained language model adopts a bidirectional Transformer as the encoder, represents the specific semantics of a character in context according to the semantic relationships of that context, and adds a residual network and layer normalization; specifically,
the attention unit is calculated as:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V        (1)
where Q, K and V are the input word-vector matrices, namely the Query, Keys and Value matrices; d_k is the input vector dimension; QK^T, the dot product of the Q and K input word-vector matrices, expresses the relationships between the input word vectors and determines how much attention is paid to the other parts of the input sentence when the word at a given position is encoded. After scaling by √d_k, softmax normalization yields the weight representation, and the current output is the weighted sum of all word vectors in the sentence, so that the representation of each word contains information about the other words in the sentence; it is context-dependent and more global than a traditional word-vector representation;
a multi-head (MultiHead) mode is adopted to expand the model's ability to focus on different positions, with the formula:
MultiHead(Q, K, V) = Concat(head_1, ..., head_k)W_0        (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)        (3)
where W_0 is an additional weight matrix; head_i denotes each subspace of the multi-head self-attention model; W_i^Q, W_i^K and W_i^V are the matrix mapping parameters obtained by training for the Query, Keys and Value matrices, respectively; and i denotes the order of the characters input at different moments, with i = 1 for the first character, and so on;
a residual network and layer normalization are added to alleviate the degradation problem:
LN(x_i) = α · (x_i − μ) / √(σ² + ε) + β        (4)
FFN = max(0, xW_1 + b_1)W_2 + b_2        (5)
where α and β are parameters to be learned; μ and σ are the mean and variance of the input layer; W_1 and W_2 are weight matrices; and b_1 and b_2 are bias vectors. The input of the BERT pre-trained language model is the labeled judicial text of the relation extraction data set, and the output is a distributed representation of each word in the text; the aim of the BERT pre-trained language model is to convert the judicial text into a low-dimensional, dense vector representation.
Preferably, both the encoding layer and the decoding layer of the BILSTM network adopt a bidirectional LSTM structure; each sentence is processed in order and in reverse order to obtain two different hidden-layer representations, which are then concatenated into the final hidden-layer representation;
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
O_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = O_t ⊙ tanh(C_t)
where h_{t−1} is the contextual input (the previous hidden state); x_t is the current input; σ is the element-wise Sigmoid function; b_f, b_i, b_c and b_o are the bias vectors of the forget gate, input gate, cell state and output gate, respectively; C̃_t is the candidate cell memory; ⊙ is the element-wise product; W_f, W_i, W_c and W_o are the weight matrices of the forget gate, input gate, cell state and output gate, respectively; tanh is the hyperbolic tangent function; f_t, i_t, C_t and O_t denote the forget gate, input gate, cell state and output gate, respectively; C_t is the new cell state obtained by updating the previous cell state C_{t−1}; and h_t is the hidden-layer representation.
The forward hidden-layer representation h_t^f and the reverse hidden-layer representation h_t^b are concatenated into the encoded information of the single character t, denoted h_t' = [h_t^f ; h_t^b]; h_t' denotes the final output of the BILSTM layer, and for the decoding layer this output, h_t^d, is the semantic vector obtained by analyzing the context information at time t.
Preferably, a softmax classifier classifies the label corresponding to each character over the label sets of the four entity categories of legal documents, legal subjects, legal objects and legal facts, with label probability y_t:
y_t = softmax(U · h_t^d + b)
where h_t^d is the semantic vector extracted by the BILSTM network; U is a parameter matrix; b is a bias term. U and b are obtained through network iteration.
Preferably, in step 5, when relationships are extracted for the character input at time t, the corresponding semantic vector h_t^d is taken as the input of the attention mechanism, and the association between the character at time t and the preceding characters that are not labeled O is judged. The relation extraction is calculated as:
s_{≤t} = tanh(W_1 · H_{≤t}^d + W_2 · h_t^d + b_{≤t})
p_t = softmax(s_{≤t} + b_t)
where H_{≤t}^d denotes the stack of the semantic vectors decoded and output by the BILSTM decoding layer at time t and at all preceding moments; b_{≤t} and b_t are bias vectors; W_1 and W_2 are weight matrices; and p_t denotes the probability of association between characters.
According to the above technical scheme, compared with the prior art, the invention discloses a named entity and relationship joint extraction method for the judicial field, which is an entity-relationship extraction algorithm combining a BILSTM network and an attention mechanism on top of a BERT pre-trained language model; it realizes joint learning of the two tasks through parameter sharing and fully exploits the connection between the tasks to optimize the results. A BERT pre-trained language model is selected to train word vectors and convert the data set into word-vector form; a BILSTM neural network is then used to obtain more complete contextual feature information and thereby extract deep word-vector features of the text; finally, a softmax classifier assigns a category label to each character to realize entity recognition, while an attention mechanism judges the association between the current character and the preceding characters to realize joint extraction of entities and multiple relationships.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic diagram of the method of the present invention;
FIG. 2 is a schematic illustration of an exemplary annotation provided by the present invention;
FIG. 3 is a schematic diagram of a BERT pre-training language model provided by the present invention;
FIG. 4 is a diagram of a Transformer coding unit according to the present invention;
fig. 5 is a schematic diagram of a data acquisition and preprocessing process provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a named entity and relationship combined extraction method in the judicial field.
1. When an entity is labeled, a label consists of four parts: the entity boundary, the entity category, the relationship category, and the entity position.
1) Entity boundary. The entity boundary follows the BIO tagging principle: B marks the first character of an entity, I marks a middle or final character of an entity, and non-entity characters in the text are uniformly marked O;
2) Entity category and relationship category labels. Following the guidance of practitioners in the field, the entities to be extracted are divided into four categories: legal documents, legal subjects, legal objects, and legal facts, with {DOC, SUB, OBJ, FAC} used as the type tags of the four entity-name categories. The labeled entity classes are shown in Table 1 below:
TABLE 1 Entity categories
Legal document: DOC; legal subject: SUB; legal object: OBJ; legal fact: FAC
The relationship category labels are subdivided into 12 types: the construction relationship (agent), the subject relationship (task), the reason relationship (cause), the result relationship (outcome), the tool relationship (tool), the mode relationship (means), the place relationship (place), the time relationship (time), the purpose relationship (purpose), the engagement relationship (engage), the reception relationship (stress), and the parallel relationship (juxtap), as summarized in Table 2 below:
TABLE 2 Judicial concept relationship classification
agent, task, cause, outcome, tool, means, place, time, purpose, engage, stress, juxtap
3) Entity position. The entity position indicates where the entity stands in the relationship and is denoted 1, 2 or M, where 1 means the character belongs to the 1st entity in the relationship, 2 means the 2nd entity, and M means the character participates in overlapping relationships and occupies different positions in them. The final relationship extraction result can be represented as a triple {entity 1, relationship category, entity 2}; FIG. 2 shows an annotated example sentence illustrating judicial entity and relationship labeling.
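As an illustration of the above labeling strategy, a minimal Python sketch is given below; the hyphen-joined serialization of the four label parts and the example characters are assumptions chosen for demonstration, since this description specifies the components of a label but not their exact written form.

# Minimal sketch of the four-part character tagging scheme described above.
# The hyphen-joined form "B-SUB-agent-1" is an assumed serialization for
# illustration; the description only specifies the four components.

def make_tag(boundary: str, category: str, relation: str, position: str) -> str:
    """boundary: B/I/O; category: DOC/SUB/OBJ/FAC;
    relation: one of the 12 relationship tags; position: 1/2/M."""
    if boundary == "O":              # non-entity characters carry only the tag O
        return "O"
    return f"{boundary}-{category}-{relation}-{position}"

# Hypothetical example: the first entity (a legal subject) of an "agent" relation.
chars = ["张", "三"]
tags = [make_tag("B", "SUB", "agent", "1"), make_tag("I", "SUB", "agent", "1")]
print(list(zip(chars, tags)))   # [('张', 'B-SUB-agent-1'), ('三', 'I-SUB-agent-1')]

# The final extraction result is expressed as a triple (values are hypothetical):
triple = {"entity 1": "张三", "relationship": "agent", "entity 2": "某地"}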
2. BERT pre-training language model
In order to avoid boundary-segmentation errors to the greatest extent, character-level tagging is chosen, i.e., the character is used as the basic input unit, and a BERT pre-trained language model is introduced to effectively integrate semantic information.
The BERT pre-trained language model further increases the generalization ability of the word-vector model and fully captures character-level, word-level, sentence-level, and even inter-sentence relationship features. Its structure is shown in FIG. 3. To fuse the context on both the left and right sides of a character, BERT uses a bidirectional Transformer as its encoder, where "bidirectional" means that when the model processes a character it can represent the specific semantics of that character in context according to the semantic relationships of the context.
The Transformer encoding unit, shown in FIG. 4, is the most important part of BERT; the Transformer models a piece of text entirely with the attention mechanism. The most important module of the encoding unit is the self-attention part, calculated as:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V        (1)
where Q, K and V are the input word-vector matrices and d_k is the input vector dimension; QK^T computes the relationships between the input word vectors. After scaling by √d_k, softmax normalization yields the weight representation, and the current output is the weighted sum of all word vectors in the sentence, so that the representation of each word contains information about the other words in the sentence; it is context-dependent and more global than a traditional word-vector representation;
a multi-head (MultiHead) mode is adopted to expand the model's ability to focus on different positions, with the formula:
MultiHead(Q, K, V) = Concat(head_1, ..., head_k)W_0        (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)        (3)
where W_0 is an additional weight matrix; head_i denotes each subspace of the multi-head self-attention model, which can be understood as splitting the model into multiple heads, i.e., multiple subspaces, with head_i being the computation in each subspace; W_i^Q, W_i^K and W_i^V are the matrix mapping parameters obtained by training for the Query, Keys and Value matrices, respectively; and i denotes the order of the characters input at different moments, with i = 1 for the first character, and so on;
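For concreteness, a minimal NumPy sketch of formulas (1) to (3) follows; the dimensions used (sentence length 6, model width 768, 12 heads of width 64) are illustrative assumptions rather than values fixed by this description.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # formula (1): softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head(X, Wq, Wk, Wv, W0, n_heads):
    # formulas (2)-(3): project per head, attend, concatenate, project with W0
    heads = []
    for i in range(n_heads):
        Qi, Ki, Vi = X @ Wq[i], X @ Wk[i], X @ Wv[i]
        heads.append(attention(Qi, Ki, Vi))
    return np.concatenate(heads, axis=-1) @ W0

# Illustrative shapes: a sentence of 6 characters, model width 768, 12 heads of width 64.
rng = np.random.default_rng(0)
n, d, h, dh = 6, 768, 12, 64
X = rng.standard_normal((n, d))
Wq = rng.standard_normal((h, d, dh)); Wk = rng.standard_normal((h, d, dh))
Wv = rng.standard_normal((h, d, dh)); W0 = rng.standard_normal((h * dh, d))
out = multi_head(X, Wq, Wk, Wv, W0, h)   # shape (6, 768)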
A residual network and layer normalization are added to alleviate the degradation problem:
LN(x_i) = α · (x_i − μ) / √(σ² + ε) + β        (4)
FFN = max(0, xW_1 + b_1)W_2 + b_2        (5)
where α and β are parameters to be learned, and μ and σ are the mean and variance of the input layer; W_1 and W_2 are weight matrices; b_1 and b_2 are bias vectors. The input of the BERT pre-trained language model is the labeled judicial text of the relation extraction data set, and its output is a distributed representation of each word in the text; the aim of the BERT pre-trained language model is to convert the judicial text into a low-dimensional, dense vector representation.
The BERT pre-training language model can make full use of information on the left and right sides of a word to obtain better distributed representation of the word.
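As a usage sketch only, character vectors can be obtained from a Chinese BERT checkpoint with the Hugging Face transformers library as shown below; the checkpoint name bert-base-chinese and the example sentence are assumptions, since this description does not prescribe a particular implementation.

# Sketch (not code from the filing): character-level vectors from a Chinese BERT
# checkpoint via the Hugging Face transformers library.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "张三在某地持刀抢劫"          # hypothetical judicial-style sentence
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# last_hidden_state: (1, sequence length including [CLS]/[SEP], 768);
# Chinese text is tokenized one character per token for this sentence.
char_vectors = outputs.last_hidden_state[0, 1:-1]   # drop [CLS] and [SEP]
print(char_vectors.shape)                            # torch.Size([9, 768])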
3. BILSTM network
Both the encoding layer and the decoding layer adopt a bidirectional LSTM (BiLSTM) structure. The bidirectional LSTM processes each sentence in order (starting from the first character and recursing left to right) and in reverse order (starting from the last character and recursing right to left) to obtain two different sets of hidden-layer representations, which are then concatenated into the final hidden-layer representation.
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
O_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = O_t ⊙ tanh(C_t)
where h_{t−1} is the contextual input (the previous hidden state); x_t is the current input; σ is the element-wise Sigmoid function; b_f, b_i, b_c and b_o are the bias vectors of the forget gate, input gate, cell state and output gate, respectively; C̃_t is the candidate cell memory; ⊙ is the element-wise product; W_f, W_i, W_c and W_o are the weight matrices of the forget gate, input gate, cell state and output gate, respectively; tanh is the hyperbolic tangent function; f_t, i_t, C_t and O_t denote the forget gate, input gate, cell state and output gate, respectively; C_t is the new cell state obtained by updating the previous cell state C_{t−1}; and h_t is the hidden-layer representation.
The forward hidden-layer representation h_t^f and the reverse hidden-layer representation h_t^b are concatenated into the encoded information of the single character t, denoted h_t' = [h_t^f ; h_t^b]; h_t' denotes the final output of the bidirectional LSTM layer, and for the decoding layer this output is the semantic vector obtained by analyzing the context information at time t.
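A minimal PyTorch sketch of the stacked bidirectional LSTM encoding and decoding layers described above is given below; the hidden sizes are illustrative assumptions, not values fixed by this description.

import torch
import torch.nn as nn

class BiLSTMEncoderDecoder(nn.Module):
    """Sketch of the two stacked bidirectional LSTM layers; sizes are illustrative."""
    def __init__(self, input_dim=768, hidden_dim=256):
        super().__init__()
        # bidirectional=True runs the forward and reverse passes, and the output
        # already concatenates the two hidden-layer representations.
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.decoder = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True,
                               bidirectional=True)

    def forward(self, char_vectors):          # (batch, seq_len, input_dim)
        enc, _ = self.encoder(char_vectors)   # h_t' : (batch, seq_len, 2*hidden)
        dec, _ = self.decoder(enc)            # h_t^d: (batch, seq_len, 2*hidden)
        return dec

model = BiLSTMEncoderDecoder()
semantic = model(torch.randn(1, 9, 768))      # e.g. BERT vectors for 9 characters
print(semantic.shape)                          # torch.Size([1, 9, 512])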
4. Entity identification and relationship extraction
In entity recognition, as shown in FIG. 2, the text sequence in the encoding process is recorded as S = (x_1, x_2, ..., x_n). For the t-th character, i.e., the character input at time t, the BERT pre-trained language model yields the vector representation of each character, E = (e_1, e_2, ..., e_n), where the word vector of the t-th character is e_t; the BILSTM encoding layer then extracts the corresponding feature information. Analyzing the text features in the same way, this encoded feature is taken as the input of the decoding-layer BILSTM at time t, the semantic information of the judicial text obtained by the forward and backward LSTM passes of the decoding layer is recorded as the forward and reverse hidden representations, and these are finally concatenated into the final semantic information h_t^d, i.e., the semantic vector obtained by the decoding-layer BILSTM network from the context information at time t. Finally, a softmax classifier classifies the label of each character over the four entity-category label sets, with label probability y_t:
y_t = softmax(U · h_t^d + b)
where h_t^d is the semantic vector extracted by the BILSTM network, U is a parameter matrix, and b is a bias term.
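The classification step y_t = softmax(U · h_t^d + b) can be sketched in PyTorch as follows; restricting the output to the 4 coarse entity categories is a simplification, since the full tag set of the labeling strategy would simply enlarge the output dimension.

import torch
import torch.nn as nn

# Sketch of the entity-classification step y_t = softmax(U * h_t^d + b).
NUM_LABELS = 4                                  # DOC, SUB, OBJ, FAC (simplified tag set)
classifier = nn.Linear(512, NUM_LABELS)         # U (weight matrix) and b (bias term)

semantic = torch.randn(1, 9, 512)               # h_t^d from the BILSTM decoding layer
label_probs = torch.softmax(classifier(semantic), dim=-1)   # y_t for each character
predicted = label_probs.argmax(dim=-1)          # entity label index per character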
In relationship extraction, the attention mechanism is used to extract the relationships between judicial-domain entities. When extracting relationships for the t-th character (i.e., the character input at time t), the semantic vector h_t^d obtained by BILSTM decoding is taken as the input of the attention mechanism, and the association between the character at time t and the preceding characters that are not labeled O is judged.
As shown in FIG. 2, when t = 6 it is only necessary to determine whether the character at the current moment, "hold", is related to the characters "Zhang" and "San" (the two characters of the name Zhang San); no judgment is needed for O-labeled characters such as "at", "certain" and "place". The specific calculation in relation extraction is:
s_{≤t} = tanh(W_1 · H_{≤t}^d + W_2 · h_t^d + b_{≤t})
p_t = softmax(s_{≤t} + b_t)
where H_{≤t}^d denotes the stack of the semantic features output by the decoding layer at time t and at all preceding moments, b_{≤t} and b_t are bias vectors, W_1 and W_2 are weight matrices, and p_t denotes the probability of association between characters. The obtained scores capture not only the probability of an association between the character at time t and each preceding character but also the type of the relationship, thereby realizing the extraction of relationship labels between judicial-domain entities.
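A PyTorch sketch of this relation-scoring step is given below. It follows one reading of the formulas rendered above: for the character at time t, each position up to t receives a distribution over relationship types, so that both the existence and the type of an association can be read off. The exact parameterization in the original Chinese filing may differ, and the relation-label count of 13 (12 relationships plus a "none" class) is an assumption.

import torch
import torch.nn as nn

class RelationAttention(nn.Module):
    """Sketch: score the association of the character at time t with each
    character at positions <= t; one plausible reading of the formulas above."""
    def __init__(self, dim=512, num_relations=13):    # 12 relations + "none" (assumed)
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=True)       # applied to H_{<=t}^d
        self.W2 = nn.Linear(dim, dim, bias=False)      # applied to h_t^d
        self.scorer = nn.Linear(dim, num_relations)    # relation-type scores

    def forward(self, H, t):                           # H: (seq_len, dim)
        context = H[: t + 1]                           # H_{<=t}^d
        current = H[t].unsqueeze(0)                    # h_t^d
        hidden = torch.tanh(self.W1(context) + self.W2(current))
        scores = self.scorer(hidden)                   # (t+1, num_relations)
        return torch.softmax(scores, dim=-1)           # per-position relation distribution

H = torch.randn(9, 512)                                # decoded semantic vectors
probs = RelationAttention()(H, t=5)                    # relations of the 6th character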
5. Verification analysis
The selected judicial-domain corpus comes from two sources. The first is judicial-domain web content collected with a web crawler, including the case information disclosure network of the People's Procuratorate, China Judgements Online, guiding cases for trial work of the Supreme People's Court, and cases published in the Gazette of the Supreme People's Court; 294 articles were selected, totaling about 160,000 characters. The second is a judicial-domain dictionary, the compilation of China's laws currently in force, comprising 2,000,000 law articles whose contents include the Constitution, the Criminal Law and the National Flag Law. After appropriate data preprocessing and manual labeling, a corpus of property disputes was constructed (the data preprocessing stage removes special characters such as spaces and empty data from the text, discards some overly short texts of little training value, segments the text with the jieba word-segmentation tool, and uses the Chinese Academy of Sciences NLPIR Chinese word-segmentation tool together with a Chinese stop-word list for name recognition, text segmentation, part-of-speech tagging and preprocessing so as to delete stop words and the like), and 10,000 sentences were finally selected for the experiments. Five judicial relationship types were extracted and classified: the subject (task), time, tool, mode and result relationships, with 2,000 instances of each. Half of the data is used as the training set for model training, and the other half as the test set for evaluating the performance of the method. The data acquisition and preprocessing process is shown in FIG. 5.
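A minimal sketch of the text-cleaning and segmentation steps mentioned above follows; the stop-word set is left empty as a placeholder, since this description does not fix a particular stop-word list.

# Sketch of the preprocessing steps: strip whitespace, segment with jieba,
# and remove stop words. The stop-word set here is a placeholder.
import re
import jieba

def preprocess(text, stopwords):
    text = re.sub(r"\s+", "", text)            # drop spaces and blank runs
    tokens = jieba.lcut(text)                   # Chinese word segmentation
    return [tok for tok in tokens if tok not in stopwords]

stopwords = set()                               # e.g. loaded from a stop-word list
print(preprocess("张三 在某地 持刀抢劫", stopwords))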
(1) Evaluation index
The evaluation indexes for named entity recognition and relationship extraction are the precision P, the recall R and the F1 value. Precision and recall evaluate the entity-relationship extraction effect from the two different angles of exactness and completeness. For the entity-relationship extraction task, precision P and recall R influence each other and are complementary, so the F1 value is used to combine the information of both. The specific calculation is as follows, where T_p is the number of entities the model identifies correctly, F_p is the number of unrelated entities the model identifies, and F_n is the number of related entities the model fails to detect:
P = T_p / (T_p + F_p)
R = T_p / (T_p + F_n)
F1 = 2 × P × R / (P + R)
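These three indexes can be computed directly from the counts T_p, F_p and F_n, as in the short sketch below; the example counts are illustrative only, not results reported in this description.

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision P, recall R and the F1 value from the counts above."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Illustrative counts (not results from the experiments described here):
print(precision_recall_f1(tp=850, fp=120, fn=150))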
(2) Comparison experiments
In order to effectively verify the rationality of the method and demonstrate the necessity of each of its modules, the following comparison experiments were set up after obtaining the relevant data of the method in simulation experiments:
1) Pipeline CNN model: relations are extracted with a CNN; a convolutional deep neural network (CDNN) is used to extract lexical and sentence-level features, and all word tokens are used as input;
2) Pipeline LSTM-Attention model: the model adopts a bidirectional LSTM neural network with an added Attention mechanism, which avoids the complex feature engineering of traditional work and achieves relatively good results on this task;
3) Joint model: the entity recognition part adds a softmax output on top of an LSTM decoding layer, and the relation extraction part consists of a CNN layer and a max-pooling layer;
4) Multi-head joint model: differs from the base joint model by a shared BERT layer.
Performance evaluation experiments were carried out on the test set for each model, and the data were compared by combining the results of the three experiments; the comparison results are shown in Table 3 below.
TABLE 3 judicial entity relationship extraction results
As can be seen from the experimental results in Table 3, the precision, recall and F1 values generally improve, in three respects:
1. and comparing the combined model with the pipeline model, wherein the combined model is higher than the pipeline model, and the combined model is superior to the pipeline model.
2. Comparing the joint model with the multi-head joint model, the F1 value of the multi-head joint model is higher by 1.61, indicating that the multi-head design in the multi-head joint model has certain advantages.
3. The named entity recognition part of the proposed method is 0.94 higher in F1 value than the multi-head joint model, and the relation extraction part is 3.72 higher, showing that the proposed method performs better on both the entity recognition and the relation extraction tasks.
The method effectively solves the problem that a single word vector cannot capture the multiple senses of a word, and achieves a better effect on the data set; in the relation extraction stage, a new joint model is constructed by comparing and examining the advantages and disadvantages of the pipeline and joint approaches to relation extraction, so that entities and relationships can be extracted simultaneously, overcoming the error propagation of the pipeline model and its neglect of the connection between the two subtasks. The technical solution of the invention was partly supported and guided by the China National Key Research and Development Program (2018YFC0832200; 2018YFC0832201).
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A judicial domain named entity and relationship combined extraction method is characterized by comprising the following steps:
step 1: selecting judicial domain data, setting a labeling strategy, labeling the judicial domain data according to the labeling strategy, and constructing a relation extraction data set;
step 2: inputting the relation extraction data set into a BERT pre-training language model to obtain vector representation of each character in the relation extraction data set, and generating a word vector sequence;
step 3: inputting the word vector sequence into a BILSTM network to extract feature information and obtain semantic vectors;
step 4: classifying the semantic vectors with a softmax classifier to obtain the entity label of each character;
step 5: extracting the relationship labels between entities from the semantic vectors with an attention mechanism.
2. The judicial domain named entity and relationship joint extraction method according to claim 1, wherein the labeling strategy specifies that an entity label is composed of four parts: an entity boundary, an entity category, a relationship category, and an entity position; the entity boundary follows the BIO tagging principle; the entity categories include legal documents, legal subjects, legal objects and legal facts; the judicial relationships comprise construction, subject, reason, result, tool, mode, place, time, purpose, engagement, reception and parallel relationships; the entity position is denoted 1, 2 or M, wherein 1 indicates that the character belongs to the 1st entity in the relationship, 2 indicates the 2nd entity, and M indicates that the character participates in overlapping relationships and occupies different positions in them.
3. The judicial domain named entity and relationship joint extraction method according to claim 1, wherein the BERT pre-trained language model adopts a bidirectional Transformer as the encoder, represents the specific semantics of a character in context according to the semantic relationships of that context, and adds a residual network and layer normalization; specifically,
the attention unit is calculated as:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V        (1)
where Q, K and V are the input word-vector matrices, namely the Query, Keys and Value matrices; d_k is the input vector dimension; QK^T, the dot product of the Q and K input word-vector matrices, gives scores expressing the relationships between the input word vectors; after scaling by √d_k, softmax normalization yields the weight representation, and the output is the weighted sum of all word vectors in the sentence;
a multi-head (MultiHead) mode is adopted, with the formula:
MultiHead(Q, K, V) = Concat(head_1, ..., head_k)W_0        (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)        (3)
where W_0 is an additional weight matrix; head_i denotes each subspace of the multi-head self-attention model; W_i^Q, W_i^K and W_i^V are the matrix mapping parameters obtained by training for the Query, Keys and Value matrices, respectively, and i denotes the order of the characters input at different moments;
a residual network and layer normalization are added to alleviate the degradation problem:
LN(x_i) = α · (x_i − μ) / √(σ² + ε) + β        (4)
FFN = max(0, xW_1 + b_1)W_2 + b_2        (5)
where α and β are parameters to be learned; μ and σ denote the mean and variance of the input layer, respectively; W_1 and W_2 are weight matrices; and b_1 and b_2 are bias vectors.
4. The judicial domain named entity and relationship joint extraction method according to claim 2, wherein the coding layer and the decoding layer in the BILSTM network both adopt a bidirectional LSTM structure, each sentence is respectively calculated in sequence and in reverse order to obtain two different hidden layer representations, and then the final hidden layer representation is obtained by vector splicing;
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
O_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = O_t ⊙ tanh(C_t)
where h_{t−1} is the contextual input (the previous hidden state); x_t is the current input; σ is the element-wise Sigmoid function; b_f, b_i, b_c and b_o are the bias vectors of the forget gate, input gate, cell state and output gate, respectively; C̃_t is the candidate cell memory; ⊙ is the element-wise product; W_f, W_i, W_c and W_o are the weight matrices of the forget gate, input gate, cell state and output gate, respectively; tanh is the hyperbolic tangent function; f_t, i_t, C_t and O_t denote the forget gate, input gate, cell state and output gate, respectively; C_t is the new cell state obtained by updating the previous cell state C_{t−1}; and h_t is the hidden-layer representation;
the forward hidden-layer representation h_t^f and the reverse hidden-layer representation h_t^b are concatenated into the encoded information of the single character t, denoted h_t' = [h_t^f ; h_t^b]; h_t' denotes the final output of the BILSTM layer, and for the decoding layer this output, h_t^d, is the semantic vector obtained by analyzing the context information at time t.
5. The method for jointly extracting named entities and relationships in the judicial domain according to claim 4, wherein a softmax classifier classifies the label corresponding to each character over the label sets of the 4 entity categories of legal documents, legal subjects, legal objects and legal facts, with label probability y_t:
y_t = softmax(U · h_t^d + b)
where h_t^d is the semantic vector extracted by the BILSTM network; U is a parameter matrix; and b is a bias term.
6. The method for jointly extracting named entities and relationships in the judicial domain according to claim 5, wherein in step 5, when relationships are extracted for the character input at time t, the corresponding semantic vector h_t^d is taken as the input of the attention mechanism, and the association between the character at time t and the preceding characters that are not labeled O is judged; the relation extraction is specifically calculated as:
s_{≤t} = tanh(W_3 · H_{≤t}^d + W_4 · h_t^d + b_{≤t})
p_t = softmax(s_{≤t} + b_t)
where H_{≤t}^d denotes the stack of the semantic vectors decoded and output by the BILSTM decoding layer at time t and at all preceding moments; b_{≤t} and b_t are bias vectors; W_3 and W_4 are weight matrices; and p_t denotes the probability of association between characters.
CN202110505601.8A 2021-05-10 2021-05-10 Judicial domain named entity and relationship combined extraction method Pending CN113221567A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110505601.8A CN113221567A (en) 2021-05-10 2021-05-10 Judicial domain named entity and relationship combined extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110505601.8A CN113221567A (en) 2021-05-10 2021-05-10 Judicial domain named entity and relationship combined extraction method

Publications (1)

Publication Number Publication Date
CN113221567A true CN113221567A (en) 2021-08-06

Family

ID=77094209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110505601.8A Pending CN113221567A (en) 2021-05-10 2021-05-10 Judicial domain named entity and relationship combined extraction method

Country Status (1)

Country Link
CN (1) CN113221567A (en)


Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779260A (en) * 2021-08-12 2021-12-10 华东师范大学 Domain map entity and relationship combined extraction method and system based on pre-training model
CN113821636A (en) * 2021-08-27 2021-12-21 上海快确信息科技有限公司 Financial text joint extraction and classification scheme based on knowledge graph
TWI807400B (en) * 2021-08-27 2023-07-01 台達電子工業股份有限公司 Apparatus and method for generating an entity-relation extraction model
CN113849597A (en) * 2021-08-31 2021-12-28 艾迪恩(山东)科技有限公司 Illegal advertising word detection method based on named entity recognition
CN113849597B (en) * 2021-08-31 2024-04-30 艾迪恩(山东)科技有限公司 Illegal advertisement word detection method based on named entity recognition
CN113609868A (en) * 2021-09-01 2021-11-05 首都医科大学宣武医院 Multi-task question-answer driven medical entity relationship extraction method
CN113761106A (en) * 2021-09-08 2021-12-07 上海快确信息科技有限公司 Self-attention-enhanced bond transaction intention recognition system
CN115329756A (en) * 2021-10-21 2022-11-11 盐城金堤科技有限公司 Execution subject extraction method and device, storage medium and electronic equipment
CN113987150A (en) * 2021-10-29 2022-01-28 深圳前海环融联易信息科技服务有限公司 Bert-based multi-layer attention mechanism relation extraction method
CN114444506B (en) * 2022-01-11 2023-05-02 四川大学 Relation triplet extraction method for fusing entity types
CN114444506A (en) * 2022-01-11 2022-05-06 四川大学 Method for extracting relation triple fusing entity types
CN114372138A (en) * 2022-01-11 2022-04-19 国网江苏省电力有限公司信息通信分公司 Electric power field relation extraction method based on shortest dependence path and BERT
CN114841122A (en) * 2022-01-25 2022-08-02 电子科技大学 Text extraction method combining entity identification and relationship extraction, storage medium and terminal
CN114548099A (en) * 2022-02-25 2022-05-27 桂林电子科技大学 Method for jointly extracting and detecting aspect words and aspect categories based on multitask framework
CN114548099B (en) * 2022-02-25 2024-03-26 桂林电子科技大学 Method for extracting and detecting aspect words and aspect categories jointly based on multitasking framework
CN115169326A (en) * 2022-04-15 2022-10-11 山西长河科技股份有限公司 Chinese relation extraction method, device, terminal and storage medium
CN115358239A (en) * 2022-08-17 2022-11-18 北京中科智加科技有限公司 Named entity and relationship recognition method and storage medium
CN115358239B (en) * 2022-08-17 2023-08-22 北京中科智加科技有限公司 Named entity and relationship recognition method and storage medium
CN116151243A (en) * 2023-04-23 2023-05-23 昆明理工大学 Entity relation extraction method based on type correlation characterization
CN116501830A (en) * 2023-06-29 2023-07-28 中南大学 Method and related equipment for jointly extracting overlapping relation of biomedical texts
CN116501830B (en) * 2023-06-29 2023-09-05 中南大学 Method and related equipment for jointly extracting overlapping relation of biomedical texts

Similar Documents

Publication Publication Date Title
CN113221567A (en) Judicial domain named entity and relationship combined extraction method
CN112163416B (en) Event joint extraction method for merging syntactic and entity relation graph convolution network
Dahouda et al. A deep-learned embedding technique for categorical features encoding
US20220147836A1 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN110990564B (en) Negative news identification method based on emotion calculation and multi-head attention mechanism
CN110196906B (en) Deep learning text similarity detection method oriented to financial industry
CN110321563A (en) Text emotion analysis method based on mixing monitor model
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN110889786A (en) Legal action insured advocate security use judging service method based on LSTM technology
CN113962219A (en) Semantic matching method and system for knowledge retrieval and question answering of power transformer
CN113168499A (en) Method for searching patent document
CN113704546A (en) Video natural language text retrieval method based on space time sequence characteristics
CN113196277A (en) System for retrieving natural language documents
CN113011161A (en) Method for extracting human and pattern association relation based on deep learning and pattern matching
CN112559723A (en) FAQ search type question-answer construction method and system based on deep learning
CN115759092A (en) Network threat information named entity identification method based on ALBERT
CN113919366A (en) Semantic matching method and device for power transformer knowledge question answering
CN113704396A (en) Short text classification method, device, equipment and storage medium
Kovvuri et al. Pirc net: Using proposal indexing, relationships and context for phrase grounding
CN112989830B (en) Named entity identification method based on multiple features and machine learning
CN113239663A (en) Multi-meaning word Chinese entity relation identification method based on Hopkinson
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN115455202A (en) Emergency event affair map construction method
CN114547237A (en) French recommendation method fusing French keywords
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination