CN112214610A - Entity relation joint extraction method based on span and knowledge enhancement - Google Patents
- Publication number
- CN112214610A (application number CN202011021524.0A)
- Authority
- CN
- China
- Prior art keywords
- entity
- span
- relationship
- graph
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/367—Ontology
- G06F16/35—Clustering; Classification
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
- G06F40/205—Parsing
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses an entity relationship joint extraction method based on span and knowledge enhancement, belonging to the technical fields of information extraction and natural language processing. First, a sample data set is constructed and labeled. Entity recognition and relation classification are then performed: for the labeled data, a pre-trained language model maps words from a high-dimensional discrete space to low-dimensional continuous vectors; a span-based model performs span identification, span filtering, and relation classification; a graph-based model converts relation classification into graph classification and introduces syntactic dependency relations to assist relation judgment. Finally, the outputs of the span-based model and the graph-based model are jointly trained to identify the entities contained in the data and the relationships among them. By introducing syntactic information such as dependency relations into an end-to-end neural network model, the invention can effectively identify overlapping relations and improve the accuracy of joint entity-relation extraction.
Description
Technical Field
The invention belongs to the technical field of information extraction and natural language processing, and particularly relates to an entity relationship joint extraction method based on span and knowledge enhancement.
Background
Extracting entities and their interrelationships plays a crucial role in understanding text. Specifically, named entity recognition identifies entities with specific meanings in a text and determines their types (person names, place names, organization names, proper nouns, and the like), while relation classification determines the type of relationship existing between a given pair of entities. Both are critical for determining the structure of a text for use in downstream tasks such as knowledge graph construction and knowledge-based question answering.
The traditional entity-relation extraction method is a pipeline: named entity recognition and relation classification are treated as two independent subtasks, so that, given a passage of text, the entities in it are identified first and the relationship types among the identified entities are then judged. Although the pipeline method is easy to implement, errors propagate through the process: a mistake made during named entity recognition degrades the subsequent relation classification. To address this, recent research has proposed joint entity-relation extraction methods that fully mine the latent dependencies between entities and their relations, so that the two tasks of named entity recognition and relation classification reinforce each other. Although joint extraction effectively alleviates the error-propagation problem of the pipeline approach, it places high demands on annotation: a large amount of high-quality labeled data is needed to train the model, and annotating data in a specific domain is time-consuming and difficult. Meanwhile, existing end-to-end neural network methods for entity-relation extraction cannot fully mine syntactic and semantic information within sentences, and data labeled under schemes such as BIO/BILOU ignore phenomena such as overlapping relations and multiple labels, which degrades extraction quality.
Disclosure of Invention
The technical problem is as follows: aiming at the poor extraction performance of existing entity-relation extraction methods, the invention provides an entity relationship joint extraction method based on span and knowledge enhancement, which introduces syntactic information such as dependency relations into an end-to-end neural network model, identifies overlapping relations, and thereby improves entity-relation extraction accuracy.
The technical scheme is as follows: the invention discloses a span and knowledge enhancement based entity relationship joint extraction method, which comprises the following steps:
s1: building a data set
Collecting data of a specific field, cleaning the collected data and constructing a data set of the field;
s2: annotating data
Randomly selecting a number of items from the data set for manual labeling, and automatically labeling the remaining items in the data set with regular-expression templates;
s3: entity identification and relationship classification
For the labeled data, mapping words in a high-dimensional discrete space to low-dimensional continuous space vectors by using a pre-training language model, and embedding codes;
performing span identification, filtering and relationship classification by using a span-based model;
converting the relationship classification into a graph classification by using a graph-based model, and introducing a syntactic dependency relationship so as to assist the relationship judgment classification;
and performing joint training on the output result of the span-based model and the output result of the graph-based model, and identifying entities contained in the data and relationships among the entities.
Further, in step S2, when the data is manually labeled, the entity location information, the entity type, and the relationship between the entities of the data are labeled.
Further, in step S2, when the regular templates are used to label data automatically, the entity types and inter-entity relationships are preset; regular templates are written for the domain of the data set using the knowledge of domain experts, and the preset entity types and inter-entity relationships are labeled in the data by template matching.
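As a hedged illustration of the template-matching step, the sketch below labels relation triples with regular expressions. The template patterns, relation names, and example sentence are invented stand-ins, not the patent's actual templates (which target Chinese military news); only the mechanism — named groups yielding entities plus character offsets for position labeling — reflects the described approach.

```python
import re

# Illustrative regular templates (hypothetical, not the patent's 119 expressions):
# each entry maps a relation type to a pattern with named groups for the
# head and tail entities.
TEMPLATES = [
    ("deployed", re.compile(
        r"(?P<head>[A-Za-z0-9-]+) (?:is|was) deployed (?:at|in) (?P<tail>[A-Za-z ]+)")),
    ("owned", re.compile(
        r"(?P<head>[A-Za-z ]+?) owns (?P<tail>[A-Za-z0-9- ]+)")),
]

def auto_label(sentence):
    """Return relation labels matched by any template, with character
    offsets so entity positions can be annotated as in manual labeling."""
    labels = []
    for rel, pat in TEMPLATES:
        for m in pat.finditer(sentence):
            labels.append({
                "relation": rel,
                "head": m.group("head"), "head_span": m.span("head"),
                "tail": m.group("tail"), "tail_span": m.span("tail"),
            })
    return labels
```

In use, templates that over- or under-match would be revised against the manually labeled subset, as the iteration step below describes.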
Further, in step S3, the pre-trained language model is a BERT model, which obtains vector representations of words by effectively encoding context information.
Further, in step S3, the span-based model comprises an entity classifier, a span filter, and a relation classifier: the entity classifier judges and classifies entities, the span filter filters out spans that are not entities, and the relation classifier then judges the entity-relation type for classification.
Further, the method for classifying the entities by using the entity classifier comprises the following steps:
The candidate span embeddings $(t_i, t_{i+1}, \dots, t_{i+k})$ encoded by the pre-trained language model are input into the entity classifier, and a max-pooling operation yields the span representation $f(t_i, t_{i+1}, \dots, t_{i+k})$; this is concatenated with the special classification vector $t_{cls}$ obtained from the BERT encoding and the vector $t_{width}$ encoding the span width, giving the final entity representation:

$$e(s_i) = [f(t_i, t_{i+1}, \dots, t_{i+k});\ t_{width};\ t_{cls}]$$

where $i$ and $k$ are indices. The concatenated result $e(s_i)$ is fed into a fully connected layer and activated with softmax to obtain the probability distribution over entity types:

$$P(e_i \mid s_i) = \mathrm{softmax}(W_i\, e(s_i) + b_i)$$

where $e_i$ denotes an entity type, $W_i$ is a weight matrix, $b_i$ is a bias, and $s_i$ denotes the $i$-th span; the entity type is judged and classified from this probability distribution.
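The span representation and softmax classification above can be sketched with toy dimensions; the sizes and random weights below are illustrative stand-ins, not trained parameters of the described model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entity_type_probs(span_tokens, t_cls, t_width, W, b):
    """Max-pool the span's token embeddings, concatenate with the width
    embedding and the [CLS] vector, and apply a linear layer + softmax."""
    pooled = span_tokens.max(axis=0)   # f(t_i, ..., t_{i+k})
    x = np.concatenate([pooled, t_width, t_cls])
    return softmax(W @ x + b)

d = 8                                  # toy hidden size (BERT uses 768)
span = rng.normal(size=(3, d))         # a span of k+1 = 3 token embeddings
t_cls, t_width = rng.normal(size=d), rng.normal(size=d)
num_types = 5                          # entity types, including "none"
W, b = rng.normal(size=(num_types, 3 * d)), np.zeros(num_types)
probs = entity_type_probs(span, t_cls, t_width, W, b)
```

The resulting distribution is what the span filter below inspects: a span whose argmax lands on the "none" class is discarded.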
Further, the span filter works as follows: in the entity-type probability distribution obtained from the entity classifier, if the "none" type has the highest probability, the span is classified as "none", judged not to be an entity, and filtered out.
Further, the method for judging the relationship type by using the relationship classifier comprises the following steps:
The span representations $e(s_i)$ and $e(s_j)$ obtained from the entity classifier are concatenated with the encoded representation $c(s_i, s_j)$ of the context between the two spans, obtained by embedding and encoding, to form the relation representation. Because the relation between an entity pair is directed, every entity pair has two opposite relation representations:

$$r_{i,j} = [e(s_i);\ c(s_i, s_j);\ e(s_j)],\qquad r_{j,i} = [e(s_j);\ c(s_j, s_i);\ e(s_i)]$$

where $r_{i,j}$ and $r_{j,i}$ denote the relations between the $i$-th and $j$-th entities, with $i$ and $j$ being indices.

Each relation representation is fed into a fully connected layer and activated with a sigmoid function to obtain the probability distribution over relation types:

$$P(r_{i,j}) = \sigma(W_{i,j}\, r_{i,j} + b_{i,j}),\qquad P(r_{j,i}) = \sigma(W_{j,i}\, r_{j,i} + b_{j,i})$$

where $W_{i,j}$ and $W_{j,i}$ denote weights, $b_{i,j}$ and $b_{j,i}$ denote biases, and $\sigma(\cdot)$ is the sigmoid function; the relation types between entity pairs are judged and classified from the obtained probability distribution.
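A minimal sketch of the directed, sigmoid-activated relation scoring follows; dimensions and weights are toy stand-ins. Per-relation sigmoids (rather than a softmax across relations) allow multiple relation types to fire for one pair, which is what permits overlapping relations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relation_probs(e_head, e_tail, ctx, W, b):
    """Score the directed representation [e(head); context; e(tail)] with
    an independent sigmoid per relation type (multi-label)."""
    r = np.concatenate([e_head, ctx, e_tail])
    return sigmoid(W @ r + b)

rng = np.random.default_rng(1)
d, num_rel = 8, 4                      # toy sizes
e_i, e_j, ctx = (rng.normal(size=d) for _ in range(3))
W, b = rng.normal(size=(num_rel, 3 * d)), np.zeros(num_rel)
p_ij = relation_probs(e_i, e_j, ctx, W, b)   # candidate r_{i,j}
p_ji = relation_probs(e_j, e_i, ctx, W, b)   # reversed pair r_{j,i}
```

Because the head and tail slots are concatenated in order, `p_ij` and `p_ji` generally differ, reflecting the asymmetry of the relation noted in the text.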
Further, the method for judging and classifying the entity relationship by using the model based on the graph comprises the following steps:
A dependency parse tree of the sentence is obtained with the HanLP natural language processing toolkit and converted into an adjacency matrix, yielding the input graph $G_i$ of the graph-based model. The input graph $G_i$ is then fed into a graph-convolutional neural network model (GIN, implemented with CogDL), which obtains a vector representation $h_{G_i}$ of the whole graph by iteratively learning the features of neighboring nodes.

The graph representation $h_{G_i}$ is fed into a fully connected layer and activated with softmax to obtain the probability distribution over graph classes:

$$P(g \mid G_i) = \mathrm{softmax}(W_g\, h_{G_i} + b_g)$$

where $W_g$ denotes a weight matrix and $b_g$ a bias; the entity relationship is judged and classified from this graph-classification probability distribution.
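The tree-to-matrix conversion and neighbor aggregation can be illustrated with a NumPy stand-in. The real pipeline uses HanLP parses and CogDL's GIN; here the parse is a hand-written head-index list and the GIN update is reduced to its characteristic $(1+\epsilon)\,h_v + \sum_{u \in \mathcal{N}(v)} h_u$ sum, with identity weights, purely for illustration.

```python
import numpy as np

def gin_layer(A, H, eps=0.0):
    """One GIN-style step: (1 + eps) * h_v + sum of neighbor features,
    followed by ReLU (the learnable MLP is omitted in this sketch)."""
    return np.maximum((1 + eps) * H + A @ H, 0.0)

# Toy dependency tree for a 4-token sentence, written as head indices
# (token -> index of its syntactic head); -1 marks the root.
heads = [1, -1, 3, 1]
n = len(heads)
A = np.zeros((n, n))
for child, head in enumerate(heads):
    if head >= 0:                      # undirected adjacency from the tree
        A[child, head] = A[head, child] = 1.0

H = np.eye(n)                          # one-hot node features
H = gin_layer(A, gin_layer(A, H))      # two rounds of neighbor mixing
graph_vec = H.sum(axis=0)              # sum readout: whole-graph vector
```

The `graph_vec` readout plays the role of $h_{G_i}$, which the model then passes through a linear layer and softmax.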
Further, the method for jointly training the output of the span-based model and the output of the graph-classification model, and identifying the entities contained in the data and the relationship types among them, comprises the following steps:
The entity-recognition loss $\gamma_e$ of the span-based model is obtained with the cross-entropy loss function:

$$\gamma_e = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c_e=1}^{M_e} y_{c_e}\,\log(p_{c_e})$$

where $M_e$ is the number of entity types; $y_{c_e}$ is an indicator variable taking the value 1 if the class equals the sample's class and 0 otherwise; $p_{c_e}$ is the predicted probability that the observed entity span belongs to entity class $c_e$; $N$ is the total number of samples in the data set; and $e$ marks the entity task.

The relation-classification loss $\gamma_r$ of the span-based model is obtained with the BCEWithLogits loss function:

$$\gamma_r = -\frac{1}{N}\sum_{n=1}^{N}\big[\,y_r \log \sigma(x_r) + (1-y_r)\log(1-\sigma(x_r))\,\big]$$

where $y_r$ is an indicator variable denoting whether the predicted relation type equals the sample's type, and $r$ marks the relation task. The graph-classification loss $\gamma_g$ of the graph-based model is obtained with the cross-entropy loss function:

$$\gamma_g = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c_g=1}^{M_g} y_{c_g}\,\log(p_{c_g})$$

where $M_g$ is the number of relation types; $y_{c_g}$ is an indicator variable; $p_{c_g}$ is the predicted probability that the observed graph belongs to class $c_g$; and $g$ marks the graph-classification task.

Joint training is performed with the combined loss $\gamma$:

$$\gamma = \gamma_e + \gamma_r + f(\cdot)\,\gamma_g,\qquad f(x) = \frac{x}{N}$$

where $f(\cdot)$ is a linear function, $x$ denotes the number of input samples, and $N$ denotes the total number of samples in the data set.
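The three-term joint loss can be sketched numerically. The toy probabilities and logits below are invented, and the weighting $f(x) = x/N$ is an assumption read from the description of $f(\cdot)$ as a linear function of the input sample count over the data-set size.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean categorical cross-entropy over N samples (gamma_e / gamma_g)."""
    return -float(np.mean(np.log(probs[np.arange(len(labels)), labels])))

def bce_with_logits(logits, targets):
    """Numerically stable BCEWithLogits (gamma_r), as in PyTorch:
    max(x, 0) - x*t + log(1 + exp(-|x|))."""
    return float(np.mean(np.maximum(logits, 0) - logits * targets
                         + np.log1p(np.exp(-np.abs(logits)))))

def joint_loss(gamma_e, gamma_r, gamma_g, x, N):
    """gamma = gamma_e + gamma_r + f(x) * gamma_g, with f(x) = x / N
    (assumed linear weighting)."""
    return gamma_e + gamma_r + (x / N) * gamma_g

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])   # toy predictions
labels = np.array([0, 1])
g_e = cross_entropy(probs, labels)
g_r = bce_with_logits(np.array([2.0, -1.0]), np.array([1.0, 0.0]))
g_g = cross_entropy(probs, labels)
total = joint_loss(g_e, g_r, g_g, x=32, N=64)
```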
Has the advantages that: compared with the prior art, the invention has the following advantages:
the invention discloses a span and knowledge enhancement based entity relationship joint extraction method, which is used for solving entity relationship joint extraction in a specific field. The method is composed of a span-based model and a graph-based model, wherein the span-based model can perform entity identification and relationship classification by using context expression in a text, and the graph-based model performs a graph classification task by using a syntax tree obtained by syntactic dependency analysis so as to effectively judge the relationship type. The model of the invention can introduce syntax information such as dependency relationship and the like into the end-to-end neural network model, thereby effectively identifying the overlapping relationship and improving the accuracy of entity relationship joint extraction.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is an exemplary diagram of a manual annotation in an embodiment of the present invention;
FIG. 3 is a flowchart of a process for automatic tagging in an embodiment of the present invention;
FIG. 4 is an exemplary diagram of a canonical template in an embodiment of the invention;
FIG. 5 is a model diagram of entity relationship joint extraction according to the present invention.
Detailed Description
The invention is further described with reference to the following examples and the accompanying drawings.
Referring to fig. 1, the entity relationship joint extraction method based on span and knowledge enhancement of the present invention includes:
s1: building a data set
In the embodiment of the present invention, crawler software or a crawling program is used to crawl news texts from a portal website; in other embodiments the data set may instead consist of data accumulated by an enterprise or collected in other ways. After enough data has been collected, it is cleaned: data that does not meet the requirements is removed, completing the construction of the data set.
For example, in one embodiment of the present invention directed at news in the military field, the military-news pages of a certain portal website were crawled and 840,000 military news articles were collected; articles irrelevant to the military field were filtered out using military-domain keywords, leaving 85,000 articles from which the data set was constructed.
S2: annotating data
Data labeling comprises manual labeling and automatic labeling. Manual labeling can make full use of expert experience, so its accuracy is relatively high; however, because the data set is large, labeling cannot be completed entirely by hand, and automatic labeling is needed to improve labeling efficiency.
In the embodiment of the invention, a number of items in the data set are randomly selected for manual labeling, and the rest are labeled automatically. During manual labeling, the position information, entity types, and entity-relationship types must be annotated. The entity types and entity-relationship types are preset before labeling; for example, for the military-domain data set, the preset entity types include: equipment, person, organization, place name, military activity, job title, and combat-readiness engineering; the preset entity-relationship types include: deployed, held, owned, and located. 338 articles were randomly extracted from the military-domain data set, and experts in the military field were invited to label them. The experts labeled the extracted data manually according to the preset entity and relationship types, assigned a unique number to each entity appearing in an article, and marked each entity's position by its start and end offsets in the article. FIG. 2 shows a labeling example for manual annotation.
For data that is not manually labeled, in the embodiment of the present invention, labeling is performed by using a regular template, and a flow of labeling by using a regular template is shown in fig. 3:
(1) Define the entity and relationship types. In one embodiment of the invention, because entities and relations in the military field are highly complex, domain experts closely matched to the specialty were consulted when designing the type system, and the types were formulated according to the common content of the data set, so that current mainstream military entity and relation types are summarized more accurately and the extracted relation triples can be added to the construction of a military knowledge graph.
(2) Randomly extract 100 military news texts from the data set and manually write corresponding regular expressions for the relations and entities in each; then test the regular expressions on the 338 manually labeled military news texts, and supplement the missing regular expressions according to the recall value. Note that in other embodiments a different amount of data may be extracted for writing regular expressions.
(3) Iterate: return to step (2) and repeat until the precision and recall of the regular-expression extraction reach their thresholds. The process then ends, and the finished regular expressions are used to extract the corresponding entities and relations from the data set and to label the data.
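The precision and recall that drive this iteration can be computed by exact-matching extracted triples against the gold triples; the helper and example triples below are illustrative, not the patent's evaluation code.

```python
def precision_recall(extracted, gold):
    """Precision and recall of regex-extracted triples against manually
    labeled gold triples, exact-matched on (relation, head, tail)."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)         # triples found in both sets
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical triples for illustration
gold = [("deployed", "F-15", "Kadena"), ("owns", "USAF", "F-15")]
extracted = [("deployed", "F-15", "Kadena"), ("deployed", "B-2", "Guam")]
p, r = precision_recall(extracted, gold)
```

Low recall signals missing templates to supplement (step 2); low precision signals templates that over-match and need tightening.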
In the implementation of the invention, 119 relational regular expressions were designed in total; FIG. 4 shows an example of a written regular template. The matching results of the regular templates on the labeled data set were analyzed, and a regex-extracted relation was matched to a manually labeled relation statement according to two criteria: the type predefined by the relational regular expression is the same as the manually labeled type, or the head and tail entities of the manually labeled relation sentence appear in the sentence extracted by the regular expression.
After data labeling is finished, the manually labeled data and the automatically labeled data are mixed and shuffled, and entity recognition and relation classification then proceed as described below.
S3: entity identification and relationship classification
For the labeled data, mapping words in a high-dimensional discrete space to low-dimensional continuous space vectors by using a pre-training language model, and embedding codes; performing span identification, filtering and relationship classification by using a span-based model; converting the relationship classification into a graph classification by using a graph-based model, and introducing a syntactic dependency relationship so as to assist the relationship classification; and performing joint training on the output result of the span-based model and the output result of the graph-based model, and identifying entities contained in the data and relationships among the entities.
In the embodiment of the invention, the pre-trained language model is the Chinese BERT model released by Google, which maps words from a high-dimensional discrete space to low-dimensional continuous vectors for embedding and encoding. The BERT model is a multi-layer bidirectional Transformer that obtains vector representations of words by effectively encoding context information. For example, given a sentence containing $n$ words, the BERT-based embedding module outputs a word-vector sequence of length $n+1$, $\{t_{cls}, t_1, t_2, \dots, t_n\}$: the BERT model prepends a special classification vector $t_{cls}$ that covers the information of the whole sentence.
The span-based model comprises an entity classifier, a span filter, and a relation classifier: the entity classifier performs entity classification on the output of the BERT model, the span filter filters out non-entity spans, and the relation classifier then judges and classifies entity relations.
After obtaining the BERT-based text vector representation, the span-based model obtains spans using an optimized negative-sampling scheme: a span not present in the labeled entity list is defined as a negative sample. For example, for the character sequence (U, S, F, -, 1, 5, fighter, plane), detectable entities include (US), (US F), (F-15 fighter), and so on. Unlike the prior art, the span-based model of the invention does not perform beam search over entity and relation hypotheses; instead it sets a maximum value $N_e$, i.e., at most $N_e$ candidates are chosen among all possible entities, and samples not labeled as positive examples in the training set are marked as negative examples. Unlike existing span-based models, the invention proposes a new way to select negative examples. First, a set $S$ of military entities is created that contains as many entities from the data set as possible (labeled data plus the results of regular-expression entity extraction). Sentences are segmented with the jieba word-segmentation software, yielding all possible candidate entities along with the part of speech of each segment; for example, from "I at Beijing Tiananmen" the three candidates "I", "Beijing", and "Tiananmen" can be obtained. Candidates are first filtered by part of speech, keeping only nouns; the nouns are then scored by similarity against the entities in set $S$, the highest similarity value is taken as the score of the segmentation result, and candidates are finally sorted so that higher similarity means higher priority as a negative example. If the segmentation results do not reach $N_e$, they are padded with random spans, where a random span may select a candidate of length 2-10.
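A simplified sketch of this negative-span selection follows. It is a stand-in under stated assumptions: the similarity scoring against the entity set is reduced to literal lexicon membership, and the tokens, lexicon, and sizes are invented for illustration; the real method uses jieba segmentation, part-of-speech filtering, and a similarity measure.

```python
import random

def sample_negative_spans(tokens, positive_spans, entity_set, max_neg,
                          min_len=2, max_len=10, seed=0):
    """Pick up to max_neg negative spans: prefer spans resembling known
    entities (lexicon membership stands in for similarity scoring here),
    then pad with random spans of length min_len..max_len."""
    rng = random.Random(seed)
    candidates = []
    for i in range(len(tokens)):
        for k in range(min_len, max_len + 1):
            span = (i, i + k)
            if span[1] <= len(tokens) and span not in positive_spans:
                candidates.append(span)
    # score 1 if the span text is in the entity lexicon, else 0;
    # stable sort keeps lexicon matches first
    scored = sorted(candidates,
                    key=lambda s: " ".join(tokens[s[0]:s[1]]) in entity_set,
                    reverse=True)
    negatives = scored[:max_neg]
    while len(negatives) < max_neg and candidates:   # pad with random spans
        negatives.append(rng.choice(candidates))
    return negatives

tokens = ["the", "F-15", "fighter", "jet", "landed"]
negs = sample_negative_spans(tokens, positive_spans={(1, 3)},
                             entity_set={"fighter jet"}, max_neg=4)
```

The lexicon-matching span `(2, 4)` ("fighter jet") ranks first, mirroring the idea that unlabeled spans resembling real entities make the most informative negatives.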
For example, in the military corpus studied in the embodiment of the invention, entity lengths fall essentially within this range, so spans that match the characteristics of military entities but are not labeled can be selected as negative examples, improving the training of the model.
After the span-based model has selected the candidate entities, their vector representations are processed. The vector representation of an entity consists of three parts: the vector representations of the tokens the entity contains (see FIG. 5; the entity's words are mapped to their ids in the pre-trained model's dictionary), the width embedding (see FIG. 5), and the special mark CLS (see FIG. 5).
Therefore, the method for classifying the entities by using the entity classifier comprises the following steps:
The candidate span embeddings $(t_i, t_{i+1}, \dots, t_{i+k})$ encoded by the pre-trained model are input into the entity classifier, and a max-pooling operation yields the span representation $f(t_i, t_{i+1}, \dots, t_{i+k})$. The width embedding is an embedding matrix learned during training (the matrix contains word features): an entity of width $k+1$ contains $k+1$ tokens, so its width embedding is indexed by $k+1$, and the vector of width $k+1$ looked up in the width matrix is denoted $t_{width}$, i.e., the vector encoding the span width. The special classification symbol CLS is generated by the BERT model and covers the global information of the input sentence; BERT encoding yields the special classification vector $t_{cls}$. The span representation $f(t_i, t_{i+1}, \dots, t_{i+k})$, the BERT classification vector $t_{cls}$, and the width vector $t_{width}$ are concatenated to obtain the final entity representation:

$$e(s_i) = [f(t_i, t_{i+1}, \dots, t_{i+k});\ t_{width};\ t_{cls}]$$

where $i$ and $k$ are indices. The concatenated result $e(s_i)$ is fed into a fully connected layer and activated with softmax, where the entity types include a "none" type, giving the probability distribution over entity types:

$$P(e_i \mid s_i) = \mathrm{softmax}(W_i\, e(s_i) + b_i)$$

where $e_i$ denotes an entity type, $W_i$ is a weight, $b_i$ is a bias, and $s_i$ denotes the $i$-th span; the entity type is judged from this probability distribution.
The span filter filters spans according to the probability distribution of entity types produced by the entity classifier, removing non-entity spans: during filtering, if the "none" type has the highest probability in the distribution, the span is identified as the "none" type, i.e. judged not to be an entity, and is filtered out.
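A minimal sketch of this filtering rule, assuming the "none" type sits at a known index of the probability vector (index 0 here is an arbitrary choice for the example):

```python
import numpy as np

NONE_IDX = 0  # assumed position of the "none" type in the distribution

def filter_spans(spans, prob_dists, none_idx=NONE_IDX):
    """Keep only spans whose most probable type is not 'none'."""
    kept = []
    for span, probs in zip(spans, prob_dists):
        if int(np.argmax(probs)) != none_idx:
            kept.append(span)
    return kept

spans = [(0, 2), (3, 3), (4, 6)]
probs = [np.array([0.1, 0.8, 0.1]),   # entity -> kept
         np.array([0.7, 0.2, 0.1]),   # "none" most probable -> filtered
         np.array([0.2, 0.3, 0.5])]   # entity -> kept
print(filter_spans(spans, probs))  # [(0, 2), (4, 6)]
```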
The relation classifier performs entity-relation classification, constructing and classifying relations for all possible entity pairs. First, at most N_r entities are randomly selected from the candidate entities to form the relation set. For an entity pair (s_1, s_2), the relation vector representation consists of two parts. One part is the head- and tail-entity vector representations obtained in the span identification stage; the span-encoded representations e(s_1) and e(s_2) are obtained from the entity classifier. The other part is a textual feature: besides entity features, relation extraction can also rely on features of the text. The present invention does not use CLS as the textual feature; instead, the text between the two entities is max-pooled, preserving the context information between the entity pair, and the encoded vector representation c(s_1, s_2) of the textual feature is obtained from the embedded encoding. If there is no text between the two entities, c(s_1, s_2) is set to 0. Since the relation of an entity pair is often asymmetric and the head and tail entities of a relation cannot be swapped, each entity pair yields two opposite relation representations:

r_{i,j} = e(s_i) ∘ c(s_i, s_j) ∘ e(s_j)
r_{j,i} = e(s_j) ∘ c(s_j, s_i) ∘ e(s_i)

where r_{i,j} and r_{j,i} respectively denote the relation representations between the i-th and j-th entities, and i and j denote indices.
The relation representation is input into a fully connected layer and activated by the sigmoid function to obtain the probability distribution of the relation types:

ŷ_{i,j} = σ(W_{i,j} · r_{i,j} + b_{i,j}),  ŷ_{j,i} = σ(W_{j,i} · r_{j,i} + b_{j,i})

where W_{i,j} and W_{j,i} denote weights, b_{i,j} and b_{j,i} denote biases, and σ(·) denotes the sigmoid function; the relation type between the entity pair is determined from the resulting probability distribution.
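The two-part relation representation and its sigmoid scoring can be sketched as follows. Dimensions and the number of relation types are invented, and plain NumPy random vectors stand in for the trained encoder:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relation_representation(head, ctx, tail):
    # r_{i,j} = e(s_i) o c(s_i, s_j) o e(s_j)  (concatenation)
    return np.concatenate([head, ctx, tail])

rng = np.random.default_rng(1)
d = 6
e1, e2 = rng.normal(size=d), rng.normal(size=d)   # span-encoded entity vectors
between = rng.normal(size=(4, d))                 # embeddings of the text between them
ctx = between.max(axis=0)                         # max-pooled context c(s_i, s_j)
# if no text lies between the entities, ctx would be np.zeros(d) instead

r_ij = relation_representation(e1, ctx, e2)
r_ji = relation_representation(e2, ctx, e1)       # the opposite direction
W = rng.normal(size=(3, r_ij.size))               # 3 relation types (hypothetical)
b = np.zeros(3)
probs = sigmoid(W @ r_ij + b)                     # independent per-type probabilities
print(r_ij.shape, probs.shape)
```

Note that the sigmoid (rather than softmax) lets several relation types fire independently for one pair.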
The graph-based model converts relation classification into a graph classification problem and introduces syntactic dependency analysis to assist relation classification, thereby effectively alleviating the inability of end-to-end neural network models to mine syntactic information.
The graph-based model comprises the dependency parse tree, the graph neural network and the graph classifier; the method of using it to assist relation classification is as follows. For any input sentence, the HanLP natural language processing tool is used to obtain its dependency parse tree, which is converted into an adjacency matrix to obtain the input graph G_i of the graph-based model. More specifically, for the words at each node of the tree, the word vectors obtained from the BERT model are summed to form the node label; the dependency relation types between words serve as edge labels; and the relation type of the whole sentence serves as the graph label. The input graph G_i is then fed into a GIN (Graph Isomorphism Network) model implemented with CogDL (a graph-learning toolkit), and the features of neighboring nodes are learned over multiple iterations to obtain the representation vector h_G of the whole graph.
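The conversion of a dependency parse into the model's input graph can be sketched as follows; the edges of the toy tree are invented, since in practice they would come from HanLP's dependency parser:

```python
import numpy as np

# hypothetical dependency edges (head, dependent) for a 4-word sentence;
# real edges would come from the HanLP dependency parser
edges = [(1, 0), (1, 2), (2, 3)]
n_words = 4

def tree_to_adjacency(edges, n):
    """Convert dependency-tree edges into the adjacency matrix of the input graph."""
    A = np.zeros((n, n), dtype=int)
    for head, dep in edges:
        A[head, dep] = A[dep, head] = 1   # treat edges as undirected for the GNN
    return A

A = tree_to_adjacency(edges, n_words)
print(A)
```

The node labels (summed BERT word vectors) and edge labels (dependency types) would be attached alongside this matrix when building the CogDL graph object.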
The representation vector h_G of the graph is input into a fully connected layer and activated with softmax to obtain the probability distribution of the graph classification:

ŷ_g = softmax(W_g · h_G + b_g)

where W_g is the weight and b_g is the bias; the relation is determined and classified according to the probability distribution of the graph classification.
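A single GIN-style message-passing iteration plus a sum readout and softmax classification, sketched in NumPy under invented dimensions (a real GIN layer uses a learned multi-layer MLP; a linear map with ReLU stands in here):

```python
import numpy as np

def gin_layer(A, H, W, eps=0.0):
    # GIN update: h_v' = MLP((1 + eps) * h_v + sum of neighbor features)
    agg = (1.0 + eps) * H + A @ H     # self term + neighbor sum
    return np.maximum(agg @ W, 0.0)   # linear map + ReLU stands in for the MLP

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # toy 3-node graph
H = rng.normal(size=(3, 4))                      # initial node labels
W = rng.normal(size=(4, 4))
H1 = gin_layer(A, H, W)                          # one message-passing iteration
h_G = H1.sum(axis=0)                             # readout: whole-graph vector
Wg = rng.normal(size=(5, 4)); bg = np.zeros(5)   # 5 relation types (hypothetical)
y_g = softmax(Wg @ h_G + bg)                     # graph-classification distribution
print(h_G.shape, round(float(y_g.sum()), 6))
```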
The method for performing joint training on the output result of the span-based model and the output result of the graph-based model and identifying the entities contained in the data and the relationship types among the entities comprises the following steps:
The entity identification loss γ_e of the span-based model is obtained by using the cross-entropy loss function:

γ_e = -(1/N) Σ_{n=1}^{N} Σ_{c_e=1}^{M_e} y_{c_e} log(p_{c_e})

where M_e is the number of entity types; y_{c_e} is an indicator variable taking the value 0 or 1, equal to 1 if the class is the same as the sample class and 0 otherwise; p_{c_e} is the predicted probability that the observed entity span belongs to entity class c_e; N is the total number of samples in the dataset; and e is the identifier of the entity.
The relation classification loss γ_r of the span-based model is obtained by using the BCEWithLogitsLoss function:

γ_r = -(1/N) Σ_{n=1}^{N} [ y_r log(p_r) + (1 - y_r) log(1 - p_r) ]

where y_r is an indicator variable denoting whether the predicted relation class is the same as the sample class; p_r is the predicted probability of the relation obtained through the sigmoid; N is the total number of samples in the dataset; and r is the identifier of the relation.
The graph classification loss γ_g of the graph-based model is obtained by using the cross-entropy loss function:

γ_g = -Σ_{c_g=1}^{M_g} y_{c_g} log(p_{c_g})

where M_g is the number of relation types; y_{c_g} is an indicator variable; p_{c_g} is the predicted probability that the observed graph belongs to class c_g; and g is the identifier of the relation in the graph classification;
performing joint training by using the following formula to obtain joint loss gamma:
γ = γ_e + γ_r + f(·)·γ_g

where f(·) is a linear function. In a preferred embodiment of the invention, the linear function is taken as f(x) = x/N, where x represents the number of input samples and N represents the total number of samples in the dataset.
Through joint training, entities contained in the sentences and relationship types among the entities are identified.
Based on the method of the present invention, a specific application example is given.
First, military news webpages from a representative website are crawled, obtaining 840,000 military news articles. Articles irrelevant to the military field or containing no military relations are filtered out based on military-domain keywords, finally yielding 85,000 articles and completing construction of the dataset. Then 338 articles are randomly sampled, and domain experts are invited to label them manually. Articles without manual labels are labeled automatically with regular-expression templates; 119 relational regular expressions are designed in total to automatically label the dataset. Finally, the dataset is randomly divided into a training set and a test set at a ratio of 10:1. The parameters of the model in the present invention are set as in Table 1.
TABLE 1 parameter settings in the model
To demonstrate the superiority of the model of the present invention, its results are compared with existing models. Table 2 lists the evaluation results of the different models, compared on three metrics: precision, recall and F1 score.
TABLE 2 evaluation results of different models
In Table 2, the result in row 1 does not use the graph-based model, and rows 2 to 4 give the results of hybrid models. Comparing the model of the present invention with different GNN (Graph Neural Network) variants, it can be found that the performance of each model differs. Although the SortPool model performs well on the graph classification task, it brings no improvement in the F1 score of the relation prediction task compared with the single model. Likewise, SpERT+PATCHY-SAN performs unremarkably in both graph classification and relation extraction. The observation that the model of the present invention achieves the highest F1 scores in graph classification, entity identification and relation classification shows that introducing specific external knowledge through the graph-based model can improve performance.
TABLE 3 comparison of results of different joint extraction methods
To jointly train the span-based model and the graph-based model, the entity identification loss γ_e of the span-based model, the relation classification loss γ_r of the span-based model, and the graph classification loss γ_g of the graph-based model need to be aggregated. Table 3 shows the corresponding extraction results for three different combination methods. The results show that, besides multiplication, addition and the linear function can also be trained jointly with accuracy. Meanwhile, with the linear function f(x) = x/N, the model obtains F1 scores of 76.60 and 58.57 in entity identification and relation classification respectively, higher than the other two combination methods.
The entity-relation joint extraction method based on span and knowledge enhancement provided by the invention solves the problem of joint entity-relation extraction in a specific field. The method is composed of a span-based model and a graph-based model: the span-based model performs entity identification and relation classification using contextual representations of the text, while the graph-based model performs a graph classification task using the syntax tree obtained by syntactic dependency analysis, so as to effectively judge the relation type. The model of the invention introduces syntactic information such as dependency relations into the end-to-end neural network model, thereby effectively identifying overlapping relations and improving the accuracy of joint entity-relation extraction.
The above examples are only preferred embodiments of the present invention. It should be noted that various modifications and equivalents may be made by those skilled in the art without departing from the spirit of the invention, and all such modifications and equivalents shall fall within the scope of the invention as defined by the claims.
Claims (10)
1. A span and knowledge enhancement based entity relation joint extraction method is characterized by comprising the following steps:
s1: building a data set
Collecting data of a specific field, cleaning the collected data and constructing a data set of the field;
s2: annotating data
Randomly selecting a plurality of data in the data set, manually marking, and automatically marking the data which are not manually marked in the data set by using a regular template;
s3: entity identification and relationship classification
For the labeled data, mapping words in a high-dimensional discrete space to low-dimensional continuous space vectors by using a pre-training language model, and embedding codes;
performing span identification, filtering and relationship classification by using a span-based model;
converting the relationship classification into a graph classification by using a graph-based model, and introducing a syntactic dependency relationship so as to assist the relationship judgment classification;
and performing joint training on the output result of the span-based model and the output result of the graph-based model, and identifying entities contained in the data and relationships among the entities.
2. The method for extracting entity relationships based on span and knowledge enhancement as claimed in claim 1, wherein in step S2, when the data is labeled manually, the entity location information, the entity type and the relationships between the entities of the data are labeled.
3. The entity relationship joint extraction method based on span and knowledge enhancement as claimed in claim 1, wherein in step S2, when the regular template is used to label the data automatically, the entity types and the relations between entities are preset according to the domain of the dataset; the regular templates are compiled using knowledge provided by domain experts, and the preset entity types and inter-entity relations are marked in the data by template matching.
4. The method for entity relationship joint extraction based on span and knowledge enhancement as claimed in claim 1, wherein in step S3, the pre-trained language model employs a BERT model to obtain the vector representation of the word by effectively encoding context information.
5. The method for extracting entity relationship based on span and knowledge enhancement as claimed in claim 1, wherein in step S3, the span-based model includes an entity classifier, a span filter and a relationship classifier, the entity classifier is used to classify the entity by judgment, the span filter is used to filter out non-entity spans, and then the relationship classifier is used to classify the entity relationship type by judgment.
6. The method of claim 5, wherein the entity classifier is used to classify the entities according to the following steps:
The embeddings t_i, t_{i+1}, ..., t_{i+k} of a candidate span, encoded by the pre-trained network model, are input into the entity classifier; a vector representation f(t_i, t_{i+1}, ..., t_{i+k}) of the entity is obtained by one max-pooling operation and concatenated with the special classification vector t_cls obtained by BERT-model encoding and the vector t_width encoding the span width, obtaining the final vector representation of the entity:

s_i = f(t_i, t_{i+1}, ..., t_{i+k}) ∘ t_cls ∘ t_width

where i and k both denote indices and ∘ denotes concatenation; the concatenated result s_i is then input into a fully connected layer and activated by softmax to obtain the probability distribution of the entity type:

e_i = softmax(W_i · s_i + b_i)

where e_i denotes the entity-type distribution, W_i denotes a weight, b_i denotes a bias, and s_i denotes the representation of the i-th span; the entity type is determined and classified through the probability distribution.
7. The entity relationship joint extraction method based on span and knowledge enhancement as claimed in claim 5 or 6, wherein the method for filtering spans with the span filter is as follows: in the probability distribution of entity types obtained by the entity classifier, if the probability value of the "none" type is the highest, the span is identified as the "none" type, judged not to be an entity, and filtered out.
8. The entity relationship joint extraction method based on span and knowledge enhancement as claimed in claim 7, wherein the method for judging the relationship type by using the relationship classifier comprises:
the span-encoded vectors e(s_i) and e(s_j) obtained by the entity classifier and the encoded vector representation c(s_i, s_j), obtained by embedded encoding of the context between the two spans, are concatenated to obtain the relation representation; since the relation between an entity pair is directional, each entity pair has two opposite relation representations:

r_{i,j} = e(s_i) ∘ c(s_i, s_j) ∘ e(s_j),  r_{j,i} = e(s_j) ∘ c(s_j, s_i) ∘ e(s_i)

where r_{i,j} and r_{j,i} respectively denote the relation representations between the i-th and j-th entities, and i and j denote indices;

the relation representation is input into a fully connected layer and activated by the sigmoid function to obtain the probability distribution of the relation types:

ŷ_{i,j} = σ(W_{i,j} · r_{i,j} + b_{i,j}),  ŷ_{j,i} = σ(W_{j,i} · r_{j,i} + b_{j,i})

where W_{i,j} and W_{j,i} denote weights, b_{i,j} and b_{j,i} denote biases, and σ(·) denotes the sigmoid function; the relation type between the entity pair is determined and classified through the obtained probability distribution of relation types.
9. The method for extracting entity relationship based on span and knowledge enhancement as claimed in claim 1, wherein the method for judging and classifying entity relationship with the aid of graph-based model comprises:
utilizing the HanLP natural language processing tool to obtain the dependency parse tree of the sentence, converting the dependency parse tree into an adjacency matrix, and obtaining the input graph G_i of the graph-based model; then inputting the graph G_i into the graph neural network model GIN implemented with CogDL, and obtaining the representation vector h_G of the whole graph by iteratively learning the features of neighboring nodes;

inputting the representation vector h_G of the graph into a fully connected layer and activating with softmax to obtain the probability distribution of the graph classification:

ŷ_g = softmax(W_g · h_G + b_g)
10. The method for extracting entity relationship based on span and knowledge enhancement as claimed in claim 1, wherein the method for performing joint training on the output result based on the span model and the output result based on the graph classification model and identifying the entity included in the data and the relationship type between the entities comprises:
obtaining the entity identification loss γ_e of the span-based model by using the cross-entropy loss function:

γ_e = -(1/N) Σ_{n=1}^{N} Σ_{c_e=1}^{M_e} y_{c_e} log(p_{c_e})

where M_e is the number of entity types; y_{c_e} is an indicator variable taking the value 0 or 1, equal to 1 if the class is the same as the sample class and 0 otherwise; p_{c_e} is the predicted probability that the observed entity span belongs to entity class c_e; N denotes the total number of samples in the dataset; and e is the identifier of the entity;

obtaining the relation classification loss γ_r of the span-based model by using the BCEWithLogitsLoss function:

γ_r = -(1/N) Σ_{n=1}^{N} [ y_r log(p_r) + (1 - y_r) log(1 - p_r) ]

where y_r is an indicator variable denoting whether the predicted relation class is the same as the sample class; p_r is the predicted probability of the relation obtained through the sigmoid; and r is the identifier of the relation;

obtaining the graph classification loss γ_g of the graph-based model by using the cross-entropy loss function:

γ_g = -Σ_{c_g=1}^{M_g} y_{c_g} log(p_{c_g})

where M_g is the number of relation types; y_{c_g} is an indicator variable; p_{c_g} is the predicted probability that the observed graph belongs to class c_g; and g is the identifier of the relation in the graph classification;
performing joint training by using the following formula to obtain joint loss gamma:
γ = γ_e + γ_r + f(·)·γ_g
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011021524.0A CN112214610B (en) | 2020-09-25 | 2020-09-25 | Entity relationship joint extraction method based on span and knowledge enhancement |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011021524.0A CN112214610B (en) | 2020-09-25 | 2020-09-25 | Entity relationship joint extraction method based on span and knowledge enhancement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112214610A true CN112214610A (en) | 2021-01-12 |
CN112214610B CN112214610B (en) | 2023-09-08 |
Family
ID=74052289
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011021524.0A Active CN112214610B (en) | 2020-09-25 | 2020-09-25 | Entity relationship joint extraction method based on span and knowledge enhancement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112214610B (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112989835A (en) * | 2021-04-21 | 2021-06-18 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Extraction method of complex medical entities |
CN113051356A (en) * | 2021-04-21 | 2021-06-29 | 深圳壹账通智能科技有限公司 | Open relationship extraction method and device, electronic equipment and storage medium |
CN113094513A (en) * | 2021-04-08 | 2021-07-09 | 北京工商大学 | Span representation-based end-to-end menu information extraction method and system |
CN113204615A (en) * | 2021-04-29 | 2021-08-03 | 北京百度网讯科技有限公司 | Entity extraction method, device, equipment and storage medium |
CN113240443A (en) * | 2021-05-28 | 2021-08-10 | 国网江苏省电力有限公司营销服务中心 | Entity attribute pair extraction method and system for power customer service question answering |
CN113411549A (en) * | 2021-06-11 | 2021-09-17 | 上海兴容信息技术有限公司 | Method for judging whether business of target store is normal or not |
CN113536795A (en) * | 2021-07-05 | 2021-10-22 | 杭州远传新业科技有限公司 | Method, system, electronic device and storage medium for entity relation extraction |
CN113627185A (en) * | 2021-07-29 | 2021-11-09 | 重庆邮电大学 | Entity identification method for liver cancer pathological text naming |
CN113779260A (en) * | 2021-08-12 | 2021-12-10 | 华东师范大学 | Domain map entity and relationship combined extraction method and system based on pre-training model |
CN113791791A (en) * | 2021-09-01 | 2021-12-14 | 中国船舶重工集团公司第七一六研究所 | Business logic code-free development method based on natural language understanding and conversion |
CN114611497A (en) * | 2022-05-10 | 2022-06-10 | 北京世纪好未来教育科技有限公司 | Training method of language diagnosis model, language diagnosis method, device and equipment |
CN114881038A (en) * | 2022-07-12 | 2022-08-09 | 之江实验室 | Chinese entity and relation extraction method and device based on span and attention mechanism |
CN115599902A (en) * | 2022-12-15 | 2023-01-13 | 西南石油大学(Cn) | Oil-gas encyclopedia question-answering method and system based on knowledge graph |
US20230153533A1 (en) * | 2021-11-12 | 2023-05-18 | Adobe Inc. | Pre-training techniques for entity extraction in low resource domains |
CN117131198A (en) * | 2023-10-27 | 2023-11-28 | 中南大学 | Knowledge enhancement entity relationship joint extraction method and device for medical teaching library |
CN117744657A (en) * | 2023-12-26 | 2024-03-22 | 广东外语外贸大学 | Medicine adverse event detection method and system based on neural network model |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019839B (en) * | 2018-01-03 | 2021-11-05 | 中国科学院计算技术研究所 | Medical knowledge graph construction method and system based on neural network and remote supervision |
US10706045B1 (en) * | 2019-02-11 | 2020-07-07 | Innovaccer Inc. | Natural language querying of a data lake using contextualized knowledge bases |
CN110597998A (en) * | 2019-07-19 | 2019-12-20 | 中国人民解放军国防科技大学 | Military scenario entity relationship extraction method and device combined with syntactic analysis |
CN111339774B (en) * | 2020-02-07 | 2022-11-29 | 腾讯科技(深圳)有限公司 | Text entity relation extraction method and model training method |
- 2020-09-25 CN CN202011021524.0A patent/CN112214610B/en active Active
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113094513A (en) * | 2021-04-08 | 2021-07-09 | 北京工商大学 | Span representation-based end-to-end menu information extraction method and system |
CN113094513B (en) * | 2021-04-08 | 2023-08-15 | 北京工商大学 | Span representation-based end-to-end menu information extraction method and system |
CN112989835B (en) * | 2021-04-21 | 2021-10-08 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Extraction method of complex medical entities |
CN113051356A (en) * | 2021-04-21 | 2021-06-29 | 深圳壹账通智能科技有限公司 | Open relationship extraction method and device, electronic equipment and storage medium |
CN112989835A (en) * | 2021-04-21 | 2021-06-18 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Extraction method of complex medical entities |
CN113204615A (en) * | 2021-04-29 | 2021-08-03 | 北京百度网讯科技有限公司 | Entity extraction method, device, equipment and storage medium |
CN113204615B (en) * | 2021-04-29 | 2023-11-24 | 北京百度网讯科技有限公司 | Entity extraction method, device, equipment and storage medium |
CN113240443B (en) * | 2021-05-28 | 2024-02-06 | 国网江苏省电力有限公司营销服务中心 | Entity attribute pair extraction method and system for power customer service question and answer |
CN113240443A (en) * | 2021-05-28 | 2021-08-10 | 国网江苏省电力有限公司营销服务中心 | Entity attribute pair extraction method and system for power customer service question answering |
CN113411549A (en) * | 2021-06-11 | 2021-09-17 | 上海兴容信息技术有限公司 | Method for judging whether business of target store is normal or not |
CN113411549B (en) * | 2021-06-11 | 2022-09-06 | 上海兴容信息技术有限公司 | Method for judging whether business of target store is normal or not |
CN113536795A (en) * | 2021-07-05 | 2021-10-22 | 杭州远传新业科技有限公司 | Method, system, electronic device and storage medium for entity relation extraction |
CN113536795B (en) * | 2021-07-05 | 2022-02-15 | 杭州远传新业科技有限公司 | Method, system, electronic device and storage medium for entity relation extraction |
CN113627185A (en) * | 2021-07-29 | 2021-11-09 | 重庆邮电大学 | Entity identification method for liver cancer pathological text naming |
CN113779260A (en) * | 2021-08-12 | 2021-12-10 | 华东师范大学 | Domain map entity and relationship combined extraction method and system based on pre-training model |
CN113791791B (en) * | 2021-09-01 | 2023-07-25 | 中国船舶重工集团公司第七一六研究所 | Business logic code-free development method based on natural language understanding and conversion |
CN113791791A (en) * | 2021-09-01 | 2021-12-14 | 中国船舶重工集团公司第七一六研究所 | Business logic code-free development method based on natural language understanding and conversion |
US20230153533A1 (en) * | 2021-11-12 | 2023-05-18 | Adobe Inc. | Pre-training techniques for entity extraction in low resource domains |
CN114611497B (en) * | 2022-05-10 | 2022-08-16 | 北京世纪好未来教育科技有限公司 | Training method of language diagnosis model, language diagnosis method, device and equipment |
CN114611497A (en) * | 2022-05-10 | 2022-06-10 | 北京世纪好未来教育科技有限公司 | Training method of language diagnosis model, language diagnosis method, device and equipment |
CN114881038A (en) * | 2022-07-12 | 2022-08-09 | 之江实验室 | Chinese entity and relation extraction method and device based on span and attention mechanism |
CN115599902A (en) * | 2022-12-15 | 2023-01-13 | 西南石油大学(Cn) | Oil-gas encyclopedia question-answering method and system based on knowledge graph |
CN117131198A (en) * | 2023-10-27 | 2023-11-28 | 中南大学 | Knowledge enhancement entity relationship joint extraction method and device for medical teaching library |
CN117131198B (en) * | 2023-10-27 | 2024-01-16 | 中南大学 | Knowledge enhancement entity relationship joint extraction method and device for medical teaching library |
CN117744657A (en) * | 2023-12-26 | 2024-03-22 | 广东外语外贸大学 | Medicine adverse event detection method and system based on neural network model |
Also Published As
Publication number | Publication date |
---|---|
CN112214610B (en) | 2023-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112214610B (en) | Entity relationship joint extraction method based on span and knowledge enhancement | |
CN110597735B (en) | Software defect prediction method for open-source software defect feature deep learning | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
CN111639171B (en) | Knowledge graph question-answering method and device | |
CN107133220B (en) | Geographic science field named entity identification method | |
CN107766324B (en) | Text consistency analysis method based on deep neural network | |
CN107729468B (en) | answer extraction method and system based on deep learning | |
CN107463658B (en) | Text classification method and device | |
CN107315738B (en) | A kind of innovation degree appraisal procedure of text information | |
CN109189767B (en) | Data processing method and device, electronic equipment and storage medium | |
CN103955451A (en) | Method for judging emotional tendentiousness of short text | |
CN107679110A (en) | The method and device of knowledge mapping is improved with reference to text classification and picture attribute extraction | |
CN112256939A (en) | Text entity relation extraction method for chemical field | |
CN110377690B (en) | Information acquisition method and system based on remote relationship extraction | |
CN111159342A (en) | Park text comment emotion scoring method based on machine learning | |
CN111984790B (en) | Entity relation extraction method | |
CN108304382A (en) | Mass analysis method based on manufacturing process text data digging and system | |
CN112257441A (en) | Named entity identification enhancement method based on counterfactual generation | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
CN111274494B (en) | Composite label recommendation method combining deep learning and collaborative filtering technology | |
CN115146062A (en) | Intelligent event analysis method and system fusing expert recommendation and text clustering | |
CN115659947A (en) | Multi-item selection answering method and system based on machine reading understanding and text summarization | |
CN106815209B (en) | Uygur agricultural technical term identification method | |
CN110245234A (en) | A kind of multi-source data sample correlating method based on ontology and semantic similarity | |
CN114547232A (en) | Nested entity identification method and system with low labeling cost |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||