CN112214610B - Entity relationship joint extraction method based on span and knowledge enhancement - Google Patents

Info

Publication number: CN112214610B
Application number: CN202011021524.0A
Authority: CN (China)
Prior art keywords: entity, span, relationship, graph, classification
Legal status: Active (granted)
Other versions: CN112214610A
Other languages: Chinese (zh)
Inventors: 张骁雄, 刘姗姗, 丁鲲, 张雨豪, 张慧, 刘茗, 蒋国权, 漆桂林, 周晓磊
Current assignee: National University of Defense Technology
Original assignee: National University of Defense Technology
Events: application filed by National University of Defense Technology; priority to CN202011021524.0A; publication of CN112214610A; application granted; publication of CN112214610B

Classifications

    • G06F16/367: Information retrieval of unstructured textual data; creation of semantic tools; ontology
    • G06F16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F17/18: Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06F40/205: Natural language analysis; parsing
    • G06F40/284: Recognition of textual entities; lexical analysis, e.g. tokenisation or collocates
    • G06F40/289: Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change


Abstract

The invention discloses an entity relationship joint extraction method based on span and knowledge enhancement, belonging to the technical fields of information extraction and natural language processing. First, a sample data set is constructed and annotated. Entity recognition and relationship classification are then carried out: for the annotated data, a pre-trained language model maps words from a high-dimensional discrete space to low-dimensional continuous space vectors; a span-based model performs span identification, filtering and relationship classification; a graph-based model converts the relationship classification into graph classification and introduces syntactic dependency relations to assist relationship classification; and the output of the span-based model and the output of the graph-based model are jointly trained to identify the entities contained in the data and the relationships among them. By introducing dependency relations and other syntactic information into an end-to-end neural network model, the invention effectively identifies overlapping relations and improves the accuracy of entity relationship joint extraction.

Description

Entity relationship joint extraction method based on span and knowledge enhancement
Technical Field
The invention belongs to the technical field of information extraction and natural language processing, and particularly relates to a span and knowledge enhancement-based entity relationship joint extraction method.
Background
Extracting entities and their inherent relationships plays a vital role in understanding text. Specifically, named entity recognition and relationship classification are critical for judging text structure in downstream tasks such as knowledge graph construction and knowledge-based question answering. Named entity recognition refers to recognizing entities with specific meanings in text and judging their types (person names, place names, organization names, proper nouns, etc.); relationship classification refers to judging the type of relationship existing between a given pair of entities.
The traditional entity relation extraction method is a pipeline: named entity recognition and relation classification are treated as two independent subtasks, so that, given a text, the entities in it are first identified and the relation types among the identified entities are then judged. Although the pipeline method is easy to implement, it is prone to error propagation: errors made during named entity recognition degrade the subsequent relation classification. To address this problem, recent research has proposed joint entity relation extraction methods, which fully mine the latent dependencies between entities and relations so that the two tasks of named entity recognition and relation classification reinforce each other. Although joint extraction effectively alleviates the error-propagation problem of the pipeline method, it places high demands on data-set annotation, and a large amount of high-quality labeled data is required to train the model. However, labeling data in a particular field is time-consuming and difficult. Meanwhile, existing end-to-end neural entity relation extraction methods cannot fully mine the syntactic and semantic information within sentences, and data sets annotated under tagging schemes such as BIO/BILOU ignore phenomena such as overlapping relations and multiple labels, which also degrades the entity relation extraction results.
Disclosure of Invention
Technical problems: aiming at the poor extraction performance of existing entity relation extraction methods, the invention provides an entity relationship joint extraction method based on span and knowledge enhancement, which introduces syntactic information such as dependency relations into an end-to-end neural network model to identify overlapping relations, thereby improving the accuracy of entity relation extraction.
The technical scheme is as follows: the invention relates to a span and knowledge enhancement-based entity relationship joint extraction method, which comprises the following steps:
s1: constructing a dataset
Collecting data of a specific field, cleaning the collected data, and constructing a data set of the field;
s2: labeling data
Randomly selecting a plurality of data in the data set, manually marking the data, and automatically marking the data which are not manually marked in the data set by using a regular template;
s3: entity identification and relationship classification
Mapping words in a high-dimensional discrete space to low-dimensional continuous space vectors by using a pre-training language model for the marked data, and embedding codes;
performing span identification, filtering and relationship classification by using a span-based model;
converting the relationship classification into graph classification by using a graph-based model, and introducing syntactic dependency relationship so as to assist the relationship judgment classification;
and performing joint training on the output result of the span-based model and the output result of the graph-based model, and identifying entities and relationships among the entities contained in the data.
Further, in step S2, when the data are manually annotated, the entity position information, the entity types, and the relationships among the entities are marked.
Further, in step S2, when regular templates are used to automatically label the data, entity types and entity relationships are preset; according to the field to which the data set belongs, the regular templates are written with the knowledge of domain experts, and the preset entity types and relationships in the data are labeled by template matching.
Further, in step S3, the pre-training language model uses a BERT model to obtain a vector representation of the word through efficient encoding of the context information.
Further, in step S3, the span-based model includes an entity classifier, a span filter and a relationship classifier: the entity classifier determines and classifies entities, the span filter filters out non-entity spans, and the relationship classifier determines and classifies the relationship types between entities.
Further, the method for classifying the entities by using the entity classifier comprises the following steps:
candidate span { t } to be embedded with code by pre-trained network model i ,t i+1 ,...,t i+k Inputting into an entity classifier, and performing primary maximum pooling to obtain a vector representation f (t) i ,t i+1 ,...,t i+k ) And is encoded with a BERT model to obtain a special classification vector t cls Vector t encoding span width width Splicing to obtain vector representation of the final entity:
wherein i and k each represent a sequence number, and then splicing the spliced resultsInputting into a full connection layer and activating by softmax to obtain probability distribution of entity types:
wherein ,ei Representing entity type, W i Weight, b i For biasing, s i Representing the ith span, and judging and classifying the entity types through the probability distribution.
Further, the span filtering method using the span filter is as follows: if the "none" type has the highest probability value in the entity-type distribution produced by the entity classifier, the span is identified as the "none" type, judged not to be an entity, and filtered out.
Further, the method for judging the relationship type by using the relationship classifier comprises the following steps:
representing span-coded vectors obtained by entity classifier and />Coding vector representation +_obtained by embedded coding with context between two spans>The relation expression is obtained by splicing, and as the relation among the entity pairs is opposite, two opposite relation expressions exist among all the entity pairs, namely:
wherein ,ri,j and rj,i Respectively representing the relation between the ith entity and the jth entity, wherein i and j represent serial numbers;
inputting the relation expression into a full connection layer, and activating through a sigmoid function to obtain probability distribution of relation types:
wherein ,Wi,j and Wj,i Representing weights, b i,j and bj,i And (3) representing bias, judging and classifying the relationship types among the entity pairs through the obtained probability distribution of the relationship types, wherein sigma (·) represents a function.
Further, the method for judging and classifying the entity relationship by using the model assistance based on the graph comprises the following steps:
obtaining a dependency analysis tree of sentences by utilizing a HanLP natural language processing tool, and converting the dependency analysis tree into an adjacent matrix to obtain an input graph G based on a graph model i The method comprises the steps of carrying out a first treatment on the surface of the Then will input the graph G i Input into a graph convolution neural network model GIN realized by CogDL, and obtain vector representation of the whole graph through multiple iterative learning of the characteristics of neighbor nodes
Representing vectors of a graphThe probability distribution of the graph classification is obtained by inputting into a fully connected layer and activating by softmax:
wherein ,representing weights +.>And representing the bias, and judging and classifying the entity relationship by using probability distribution of graph classification.
Further, the method for jointly training the output of the span-based model and the output of the graph-based model and identifying the entities contained in the data and the relation types among them is as follows:
entity recognition loss gamma for span-based models using cross entropy loss functions e
wherein ,Me Is the number of entity types;to indicate a variable, the value is 0 or 1, if the class and sample class are the same as 1, otherwise 0; />For observing that the entity span belongs to the entity class c e Is used for predicting the probability of (1); n represents the total number of samples in the dataset, e is the identity of the entity;
obtaining a relationship classification loss gamma of a span-based model using a BCEWithLogits loss function r
wherein ,yr For indicating variables, representing whether the predicted relationship category is the same as the sample category, r is the identity of the relationship; obtaining graph classification loss gamma for graph-based models using cross entropy loss functions g
wherein ,Mg Is the number of relationship types;is an indicator variable; />To observe that the graph belongs to the category c g G is a relation identifier in the graph classification;
the joint training is performed by using the following formula to obtain the joint loss gamma:
γ=γ er +f(·)γ g
wherein f (·) is a linear function, and the linear function f (·) is takenWhere x represents the number of samples entered and N represents the sum of samples in the dataset.
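The loss combination above can be checked numerically. The sketch below assumes the linear schedule f(x) = x/N suggested by the description; the per-sample probabilities are toy values, not model outputs.

```python
import math

# Numeric check of the joint loss γ = γ_e + γ_r + f(·)·γ_g, assuming the
# linear schedule f(x) = x/N suggested by the description. The per-sample
# probabilities below are toy values, not model outputs.

def cross_entropy(y_onehot, p):
    return -sum(y * math.log(q) for y, q in zip(y_onehot, p))

def bce(y, p):                                  # binary cross-entropy on one label
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def joint_loss(g_e, g_r, g_g, x, N):
    return g_e + g_r + (x / N) * g_g            # f(x) = x/N weights the graph term

g_e = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])    # entity loss, one sample
g_r = bce(1, 0.9)                                   # relation loss, one sample
g_g = cross_entropy([1, 0], [0.7, 0.3])             # graph loss, one sample
loss_early = joint_loss(g_e, g_r, g_g, x=100, N=1000)
loss_late = joint_loss(g_e, g_r, g_g, x=1000, N=1000)
```

Under this schedule the graph-classification term is phased in as training proceeds, so the joint loss grows with x for the same per-task losses.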
The beneficial effects are that: compared with the prior art, the invention has the following advantages:
the invention discloses a span and knowledge enhancement-based entity relationship joint extraction method, which is used for solving the problem of entity relationship joint extraction in a specific field. The method comprises a span-based model and a graph-based model, wherein the span-based model can utilize context representation in text to perform entity identification and relationship classification, and the graph-based model utilizes a syntax tree obtained by syntactic dependency analysis to perform graph classification tasks so as to effectively judge relationship types. The model of the invention can introduce the syntax information such as the dependency relationship and the like into the end-to-end neural network model, thereby effectively identifying the overlapping relationship and improving the entity relationship joint extraction accuracy.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is an exemplary diagram of an artificial annotation in an embodiment of the invention;
FIG. 3 is a flow chart of a process for automatic labeling in an embodiment of the invention;
FIG. 4 is an exemplary diagram of a canonical template in an embodiment of the invention;
FIG. 5 is a diagram of a model of entity-relationship joint extraction according to the present invention.
Detailed Description
The invention is further illustrated by the following examples and the accompanying drawings.
Referring to fig. 1, the span and knowledge enhancement-based entity relationship joint extraction method of the present invention includes:
s1: constructing a dataset
Constructing the data set means building a data set for the field of interest, and enough data must be collected before construction. In this embodiment, crawler software or a crawling program is used to crawl news texts from a web portal; in other embodiments, data accumulated by enterprises or collected in other ways can also be used. After enough data are collected, they are cleaned, and data that do not meet the requirements are discarded, completing the construction of the data set.
For example, in one embodiment of the invention, for news in the military field, a military news page of a web portal was crawled, collecting 840 000 military news articles in total; articles irrelevant to the military field or containing no military relations were filtered out using military-field keywords, leaving 85 000 articles, so that a data set containing 85 000 articles was constructed.
S2: labeling data
Data annotation comprises manual annotation and automatic annotation. Manual annotation makes full use of expert experience and is relatively accurate, but because the data set is large, annotation cannot be completed manually alone; automatic annotation is needed to improve annotation efficiency.
In this embodiment, some data in the data set are randomly selected for manual annotation, and the remaining data are annotated automatically. During manual annotation, the entity position information, entity types and entity relationship types must be marked. The entity types and entity relationship types are preset before annotation; for example, for the military-field data set, the preset entity types include equipment, person, organization, place name, military operation, job title and combat-readiness project, and the preset relationship types include deployment, hold, own and located. From the military data set, 338 articles were randomly extracted, and experts in the military field were invited to annotate them. The experts annotated the extracted data manually according to the preset entity and relationship types, assigned each entity appearing in an article a specific number, and marked entity positions by the start and end positions of the entity in the article. Fig. 2 shows an annotation example for manual annotation.
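A minimal sketch of what one annotated record might look like, with numbered entities, start/end positions and relation triples; the field names and the example sentence are illustrative assumptions, not the embodiment's actual annotation schema (cf. fig. 2):

```python
# Hypothetical annotation record for one sentence; field names and example
# values are illustrative assumptions, not the patent's actual format.
annotation = {
    "text": "The unit deploys F-15 fighters at the airbase.",
    "entities": [
        {"id": 0, "type": "organization", "start": 0, "end": 8},    # "The unit"
        {"id": 1, "type": "equipment",    "start": 17, "end": 30},  # "F-15 fighters"
        {"id": 2, "type": "place name",   "start": 38, "end": 45},  # "airbase"
    ],
    "relations": [
        {"head": 0, "type": "deploy",  "tail": 1},
        {"head": 1, "type": "located", "tail": 2},
    ],
}

# Recover entity surface forms from the start/end offsets.
spans = [annotation["text"][e["start"]:e["end"]] for e in annotation["entities"]]
```

Storing entities by offset rather than by tag sequence is what lets span-based annotation represent overlapping entities that BIO-style tagging cannot.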
For data that are not manually annotated, this embodiment labels them with regular templates; the labeling workflow using regular expressions is shown in fig. 3:
(1) Because entities and relations in the military field are complex, the entity and relation types were formulated in discussion with domain experts whose specialty fits the field, based on the common content of the data set. In this way the currently mainstream military entity and relation types can be summarized more accurately, and the extracted relation triples can be added to the construction of a military knowledge graph.
(2) 100 military news texts are randomly extracted from the data set, and corresponding regular expressions are written manually for the relations and entities in each text. The regular expressions are then tested on the 338 manually annotated military news texts, and missing regular expressions are supplemented according to the recall value. Note that in other embodiments, other amounts of data may be extracted for writing regular expressions.
(3) Return to step (2) and repeat it until the precision and recall of the regular extraction reach a threshold. The workflow then ends, and the perfected regular expressions are used to extract the corresponding entities and relations from the data set and annotate the data.
In this embodiment, 119 relation regular expressions were designed; fig. 4 shows an example of a written regular template. Analyzing the matching results of the regular templates on the annotated data set, whether a relation rule successfully extracts a manually annotated relation statement is determined by two criteria: the predefined type of the relation regular expression is the same as the manually annotated type, or the head and tail entities of the manually annotated relation statement appear in the statement extracted by the regular expression.
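As a minimal illustration of such a relation regular template, the hypothetical pattern below extracts a "deploy" triple from an English sentence; the embodiment's 119 expressions target Chinese military news and are not reproduced here.

```python
import re

# Hypothetical relation regular template: extract (organization, deploy,
# equipment) triples from sentences of the form "<X> deploys <Y>". The
# pattern, relation name and sentence are illustrative assumptions.
DEPLOY_PATTERN = re.compile(
    r"(?P<head>[A-Z][\w\s]*?)\s+deploys\s+(?P<tail>[\w\-]+(?:\s[\w\-]+)?)"
)

def extract_deploy(sentence):
    """Return (head, relation, tail) triples matched by the template."""
    return [(m.group("head").strip(), "deploy", m.group("tail").strip())
            for m in DEPLOY_PATTERN.finditer(sentence)]

triples = extract_deploy("The Navy deploys F-15 fighters near the coast.")
```

Named groups give the head and tail entity mentions directly, which is what lets the template-matching step emit position-annotated triples.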
After the data are annotated, the manually annotated and automatically annotated data are mixed and shuffled, and then used for entity recognition and relationship classification.
S3: entity identification and relationship classification
Mapping words in a high-dimensional discrete space to low-dimensional continuous space vectors by using a pre-training language model for the marked data, and embedding codes; performing span identification, filtering and relationship classification by using a span-based model; converting the relationship classification into graph classification by using a graph-based model, and introducing syntactic dependency relationship so as to assist the relationship classification; and performing joint training on the output result of the span-based model and the output result of the graph-based model, and identifying entities and relationships among the entities contained in the data.
In this embodiment, the pre-trained language model is the Chinese BERT model released by Google, which maps words from a high-dimensional discrete space to low-dimensional continuous space vectors and produces embedded encodings. The BERT model is a multi-layer bidirectional Transformer structure that efficiently encodes context information to obtain vector representations of words. For example, after a sentence containing n words is input into the BERT-based embedding module, a word-vector sequence {t_cls, t_1, t_2, ..., t_n} of length n+1 is obtained; the BERT model prepends to the sequence a special classification vector t_cls that covers the information of the whole sentence.
The span-based model comprises an entity classifier, a span filter and a relation classifier, wherein the entity classifier is used for carrying out entity classification on the output of the BERT model, the span filter is used for filtering the span of non-entity, and then the relation classifier is used for judging and classifying the entity relation.
After obtaining the BERT-based text vector representations, the span-based model obtains spans through an optimized negative-sampling scheme, defining spans not in the annotated entity list as negative samples. For example, for the sentence "US F-15 fighter", the spans that may be detected include "US", "US F", "F-15 fighter", and so forth. Unlike prior work, the span-based model of the invention does not perform beam search over entity and relation hypotheses; instead, it sets a maximum N_e, i.e. at most N_e entities are chosen among all possible entities, and samples not labeled as positive examples in the training set are marked as negative examples. Unlike existing span-based models, the invention proposes a new way of selecting negative examples. First, a military entity set S is built, containing as many entities of the data set as possible (the annotated data plus the results of regular entity extraction). The text is then segmented with the jieba word-segmentation tool, which yields all possible candidate words together with their parts of speech; for example, segmenting a sentence such as "I love Beijing Tiananmen" yields candidates such as "I", "Beijing" and "Tiananmen". The candidates are first filtered by part of speech, keeping only nouns; the similarity between each noun and the entities in S is then computed, and the highest similarity value is taken as the score of that segmentation result. Finally, negative examples are selected in order of decreasing similarity; if N_e negatives cannot be reached, the remaining segmentation results are used to fill the quota, and in the random span-selection step, spans of length 2-10 are selected.
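The negative-example selection above can be sketched as follows. jieba segmentation and the embodiment's similarity measure are stubbed out: candidates arrive pre-segmented with parts of speech, and character-level Jaccard overlap stands in for the similarity computation (both illustrative assumptions).

```python
# Sketch of the negative-span selection: keep only unlabeled nouns, score
# each against the entity set S, and take the n_e highest-scoring ones.
# Character-level Jaccard overlap is an illustrative stand-in for the
# embodiment's similarity measure; jieba segmentation is stubbed out.

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_negative_spans(candidates, entity_set, gold_entities, n_e):
    """Keep the n_e unlabeled noun candidates most similar to entity set S."""
    scored = []
    for word, pos in candidates:
        if pos != "n" or word in gold_entities:    # only nouns not labeled positive
            continue
        score = max((jaccard(word, e) for e in entity_set), default=0.0)
        scored.append((score, word))
    scored.sort(key=lambda t: t[0], reverse=True)  # higher similarity first
    return [w for _, w in scored[:n_e]]

negatives = select_negative_spans(
    [("F-16", "n"), ("deploys", "v"), ("airbase", "n"), ("F-15", "n")],
    entity_set={"F-15 fighter", "airfield"},        # military entity set S (toy)
    gold_entities={"F-15"},                         # annotated positives
    n_e=2,
)
```

Ranking by similarity to known entities yields hard negatives that resemble real military entities, which is the stated motivation for this sampling scheme.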
In the military corpus studied in this embodiment, entity lengths basically fall within this range, and selecting unannotated spans that better match the characteristics of military entities as negative examples gives the model a better training effect.
After selecting the possible entities, the span-based model processes their vector representations. The vector representation of an entity consists of three parts: the vector representation of the tokens the entity contains (see fig. 5, i.e. mapping the entity's words to their ids in the dictionary of the pre-trained model), a width embedding (see fig. 5), and the special token CLS (see fig. 5).
Therefore, the method for classifying the entities by using the entity classifier comprises the following steps:
candidate span { t } to be embedded with code by pre-trained network model i ,t i+1 ,...,t i+k Inputting into an entity classifier, and performing primary maximum pooling to obtain a vector representation f (t) i ,t i+1 ,...,t i+k ). Width embedding is an embedding matrix learned in training (the matrix contains the features of words), namely that the width of an entity is k+1, which means that k+1 tokens are contained in the entity, and then the width of the entity is embedded as a vector expression t with k+1 as a subscript and obtained by indexing in the width matrix width I.e. vectors encoding the span width. The special mark symbol CLS is generated by a BERT model, covers the global information of the input sentence, and the BERT model codes to obtain a special classification vector t cls . Representing the vector of the entity by f (t i ,t i+1 ,...,t i+k ) Special classification vector t encoded with BERT model cls Vector t encoding span width width Splicing to obtain vector representation of the final entity:
wherein i and k each represent a sequence number, and then splicing the spliced resultsInput into a fully connected layer and activated by softmax, resulting in types of entities, including no type "none", and resulting in probability distribution of entity types:
wherein ,ei Representing entity type, W i Weight, b i For biasing, s i Representing the ith span, and judging the entity type through the probability distribution.
During filtering, if the "none" type has the highest probability value in the distribution, the span is identified as the "none" type, i.e. judged not to be an entity, and is filtered out.
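A toy forward pass of the entity classifier and the "none" filter, with hand-picked weights in place of trained parameters:

```python
import math

# Toy forward pass of the span entity classifier: max-pool the span's token
# vectors, concatenate the CLS vector and the width embedding, apply a linear
# layer + softmax, and filter out spans whose most probable type is "none".
# All vectors and weights are hand-picked toy values, not trained parameters.

TYPES = ["none", "equipment", "person"]

def max_pool(vectors):                        # f(t_i, ..., t_{i+k})
    return [max(col) for col in zip(*vectors)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify_span(token_vecs, cls_vec, width_embed, W, b):
    x = max_pool(token_vecs) + cls_vec + width_embed     # concatenation x_{s_i}
    logits = [sum(w * xi for w, xi in zip(row, x)) + bi
              for row, bi in zip(W, b)]
    probs = softmax(logits)
    return TYPES[probs.index(max(probs))], probs

token_vecs = [[1.0, 0.0], [0.5, 1.0]]         # span of two embedded tokens
cls_vec, width_embed = [0.2, 0.1], [0.3]
W = [[0.0] * 5, [1.0, 1.0, 0.0, 0.0, 0.0], [-1.0, -1.0, 0.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]
label, probs = classify_span(token_vecs, cls_vec, width_embed, W, b)
keep = label != "none"                        # the span filter
```

Max-pooling makes the span representation independent of span length, while the width embedding reintroduces the length signal explicitly.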
The relationship classifier is used for classifying entity relationships, and all possible entity pairs are constructed and classified. First randomly selecting a maximum of N from possible entities r The relationship sets are composed for the entities. For a pair(s) 1 ,s 2 ) The relation vector representation of the constituted entity is composed of two parts, one part is the head-tail entity vector representation obtained by the span identification part, and the span coding representation can be obtained by the entity classifier and />The other part is a text feature. In addition to the physical features, the relation extraction also relies on text features. In the invention, CLS is not selected as text feature, but text between two entities is maximally pooled, context information between entity pairs is reserved, and coding of text feature is obtained by embedding codingCode vector representation +.>If there is no text between the two entities +.>Will be set to 0. Since the relationship of entity pairs tends to be asymmetric, the head-to-tail entities of the relationship cannot be reversed, so that each entity pair will be represented by two opposite relationships:
where r_{i,j} and r_{j,i} respectively denote the two directed relation representations between the i-th entity and the j-th entity, and i and j are indices.
The relation representation is input into a fully connected layer and activated by a sigmoid function to obtain the probability distribution of relation types:

P(r_{i,j}) = σ(W_{i,j} · r_{i,j} + b_{i,j}),  P(r_{j,i}) = σ(W_{j,i} · r_{j,i} + b_{j,i})
where W_{i,j} and W_{j,i} denote weights, b_{i,j} and b_{j,i} denote biases, and σ(·) denotes the sigmoid function; the relation type between the entity pair is determined from the resulting probability distribution.
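The two directed relation representations can be sketched as below. This is an illustrative reduction with toy dimensions; the helper names and the per-type sigmoid scoring layer are assumptions, not the invention's exact implementation.

```python
# Relation representation sketch: [head span; max-pooled context; tail span],
# with the context vector set to zeros when no text lies between the entities.
import math

def max_pool(vectors, dim):
    if not vectors:                    # no text between the two entities -> 0 vector
        return [0.0] * dim
    return [max(v[d] for v in vectors) for d in range(dim)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relation_repr(head_vec, tail_vec, context_tokens):
    ctx = max_pool(context_tokens, len(head_vec))
    return head_vec + ctx + tail_vec   # r_{i,j}; swap head/tail for r_{j,i}

def classify_relation(rel_vec, W, b):
    """One sigmoid score per relation type (BCE-style multi-label scoring)."""
    return [sigmoid(sum(w * x for w, x in zip(row, rel_vec)) + b_r)
            for row, b_r in zip(W, b)]

head = [0.2, -0.1]
tail = [0.3, 0.4]
ctx_tokens = [[0.1, 0.9], [0.5, -0.2]]
r_ij = relation_repr(head, tail, ctx_tokens)
r_ji = relation_repr(tail, head, ctx_tokens)
print(r_ij != r_ji)                    # the two directions differ, as required
```

Because concatenation is order-sensitive, r_{i,j} and r_{j,i} are distinct vectors, which is what lets the classifier distinguish head from tail.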
The graph-based model converts relation classification into a graph classification problem and introduces syntactic dependency analysis to assist relation classification, effectively alleviating the inability of end-to-end neural network models to mine syntactic information.
The method of judging and classifying relations with the graph-based model is as follows: for any input sentence, a dependency parse tree is obtained with the HanLP natural language processing tool and converted into an adjacency matrix, giving the input graph G_i of the graph-based model. More specifically, the word vectors produced by the BERT model for each node of the tree are summed to serve as node labels, the dependency relation types between words serve as edge labels, and the relation type of the whole sentence serves as the graph label. The input graph G_i is then fed into GIN (Graph Isomorphism Network), a graph convolutional neural network model implemented with the CogDL toolkit, which learns neighbor-node features over multiple iterations to obtain the representation vector h_{G_i} of the whole graph.
The graph vector h_{G_i} is input into a fully connected layer and activated by softmax to obtain the probability distribution of the graph classification:

P(g_i) = softmax(W_g · h_{G_i} + b_g)
where W_g denotes a weight and b_g a bias; the relation is judged and classified through the probability distribution of the graph classification.
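The graph classification step can be sketched with a minimal GIN-style aggregation. This is a simplification for illustration only: the per-layer MLP is reduced to identity, the dimensions are toy values, and nothing here reflects CogDL's actual API.

```python
# GIN-style sketch: each iteration updates a node as (1 + eps) * h_v plus the sum
# of its neighbors' features; a sum readout gives the graph vector h_G, which a
# softmax layer scores into relation types.
import math

def gin_layer(adj, feats, eps=0.0):
    n, dim = len(feats), len(feats[0])
    out = []
    for v in range(n):
        agg = [(1 + eps) * feats[v][d] for d in range(dim)]
        for u in range(n):
            if adj[v][u]:                  # u is a neighbor in the dependency graph
                for d in range(dim):
                    agg[d] += feats[u][d]
        out.append(agg)
    return out

def graph_readout(feats):
    """Sum readout over all nodes -> graph-level representation h_G."""
    return [sum(f[d] for f in feats) for d in range(len(feats[0]))]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# toy 3-node graph (adjacency derived from a dependency parse tree)
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
for _ in range(2):                          # two GIN iterations
    feats = gin_layer(adj, feats)
h_g = graph_readout(feats)
W = [[0.5, -0.2], [0.1, 0.3]]               # toy weights for 2 relation types
probs = softmax([sum(w * x for w, x in zip(row, h_g)) for row in W])
print(len(probs))
```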
The method for carrying out joint training on the output result of the span-based model and the output result of the graph-based model and identifying the entities and the relationship types among the entities included in the data comprises the following steps:
The entity recognition loss γ_e of the span-based model is obtained with a cross-entropy loss function:

γ_e = -(1/N) Σ_{n=1}^{N} Σ_{c_e=1}^{M_e} y_{c_e} log(p_{c_e})
where M_e is the number of entity types; y_{c_e} is an indicator variable taking value 0 or 1, equal to 1 if the predicted class matches the sample class and 0 otherwise; p_{c_e} is the predicted probability that the observed entity span belongs to entity class c_e; and e is the identity of the entity.
The relation classification loss γ_r of the span-based model is obtained with a BCEWithLogits loss function:

γ_r = -(1/N) Σ_{n=1}^{N} [y_r log(p_r) + (1 - y_r) log(1 - p_r)]
where y_r is an indicator variable denoting whether the predicted relation class is the same as the sample class; p_r is the predicted probability of the relation; N denotes the total number of samples in the dataset; and r is the identity of the relation.
The graph classification loss γ_g of the graph-based model is obtained with a cross-entropy loss function:

γ_g = -(1/N) Σ_{n=1}^{N} Σ_{c_g=1}^{M_g} y_{c_g} log(p_{c_g})
where M_g is the number of relation types; y_{c_g} is an indicator variable; p_{c_g} is the predicted probability that the observed graph belongs to class c_g; and g is the relation identifier in the graph classification.
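The three losses can be illustrated numerically. The probability values below are invented for demonstration; the cross-entropy and BCE-with-logits forms are standard definitions, used here as a sketch of what the training objective computes per sample.

```python
# Toy per-sample losses: cross-entropy for entity recognition and graph
# classification, BCE-with-logits for relation classification.
import math

def cross_entropy(y_onehot, probs):
    """-sum over classes of y * log(p); only the true class contributes."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y)

def bce_with_logits(y, logit):
    """Sigmoid applied to the raw logit, then binary cross-entropy."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

gamma_e = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])  # entity loss, true class 1
gamma_g = cross_entropy([1, 0], [0.6, 0.4])          # graph-classification loss
gamma_r = bce_with_logits(1, 2.0)                    # relation loss, positive pair
print(round(gamma_e, 4), round(gamma_r, 4))
```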
the joint training is performed by using the following formula to obtain the joint loss gamma:
γ = γ_e + γ_r + f(·)·γ_g
where f(·) is a linear function. In a preferred embodiment of the invention, the linear function is taken as f(x) = x/N, where x is the number of input samples and N is the total number of samples in the dataset.
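The joint loss combination is a one-liner; the numbers below are illustrative only:

```python
# Joint loss sketch: gamma = gamma_e + gamma_r + f(x) * gamma_g,
# with the preferred linear weighting f(x) = x / N (input samples over dataset size).
def joint_loss(gamma_e, gamma_r, gamma_g, x, N):
    f = x / N
    return gamma_e + gamma_r + f * gamma_g

loss = joint_loss(gamma_e=0.8, gamma_r=0.5, gamma_g=0.4, x=32, N=320)
print(round(loss, 4))
```

With a batch of 32 samples out of 320, the graph loss is down-weighted by 0.1, so the span-based losses dominate early aggregation.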
The entities contained in the sentences and the relation types among them are identified through this joint training.
Based on the method of the invention, a specific application example is given.
First, military news web pages of representative websites were crawled, yielding 840,000 military news articles. Articles irrelevant to the military field or containing no military relations were filtered out using military-domain keywords, leaving 85,000 articles and completing the dataset construction. Then 338 articles were randomly sampled, and domain experts were invited to annotate them manually. Articles without manual annotation were labeled automatically with regular-expression templates: 119 relation regular expressions were designed to auto-label the dataset. Finally, the dataset was randomly divided into a training set and a test set at a ratio of 10:1. The parameter settings of the model are shown in Table 1.
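Regular-template auto-labeling can be sketched as follows. The pattern, sentence, and relation name are invented for illustration; the method's real 119 templates are domain-specific and not reproduced here.

```python
# Hypothetical relation template: "<HEAD> deployed <TAIL>" -> relation "deploys".
# Named groups extract the head and tail entities; the template implies the relation.
import re

pattern = re.compile(r"(?P<head>[A-Z][\w ]+?) deployed (?P<tail>[A-Z][\w-]+)")

def auto_label(sentence):
    m = pattern.search(sentence)
    if not m:
        return None
    return {"head": m.group("head"), "relation": "deploys", "tail": m.group("tail")}

triple = auto_label("The Navy deployed F-35 fighters last week.")
print(triple)
```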
Table 1. Parameter settings of the model
To demonstrate the superiority of the model, its results are compared with existing models. Table 2 lists the evaluation results of the different models on three metrics: precision, recall, and F1 score.
Table 2. Evaluation results of different models
In Table 2, line 1 is the result without the graph-based model, and lines 2 to 4 are the results of hybrid models. Comparing the proposed model with different GNN (Graph Neural Network) variants shows that the models perform differently. Although the SortPool model performs well on graph classification tasks, it brings no improvement in the F1 score of the relation prediction task compared with the single model. Similarly, spERT+PATCH+SAN performs only moderately in both graph classification and relation extraction. The proposed model achieves the highest F1 scores in graph classification, entity recognition, and relation classification, showing that introducing specific external knowledge through the graph-based model can improve performance.
Table 3. Comparison of results of different joint extraction methods
To jointly train the span-based model and the graph-based model, the entity recognition loss γ_e and the relation classification loss γ_r of the span-based model must be aggregated with the graph classification loss γ_g of the graph-based model. Table 3 gives the extraction results of three different joint methods. The results show that, besides multiplication, addition and a linear function can also be trained accurately. With the linear function f(x) = x/N, the model obtains F1 scores of 76.60 and 58.57 in entity recognition and relation classification respectively, higher than the other two joint methods.
The invention provides a span- and knowledge-enhancement-based entity relation joint extraction method to solve the problem of entity relation joint extraction in a specific field. The method comprises a span-based model and a graph-based model: the span-based model uses contextual representations of the text for entity recognition and relation classification, while the graph-based model uses the syntax tree obtained by syntactic dependency analysis for a graph classification task, effectively judging relation types. The model can introduce syntactic information such as dependency relations into an end-to-end neural network model, thereby effectively identifying overlapping relations and improving the accuracy of entity relation joint extraction.
The above examples are only preferred embodiments of the present invention. It should be noted that several modifications and equivalents may be made by those skilled in the art without departing from the principles of the invention, and such modifications and equivalents fall within the scope of the invention.

Claims (6)

1. A military field entity relationship joint extraction method based on span and knowledge enhancement is characterized by comprising the following steps:
s1: constructing a dataset
Collecting data of a specific field, cleaning the collected data, and constructing a data set of the field;
s2: labeling data
Randomly selecting a plurality of data in the data set, manually marking the data, and automatically marking the data which are not manually marked in the data set by using a regular template;
s3: entity identification and relationship classification
Mapping words in a high-dimensional discrete space to low-dimensional continuous space vectors by using a pre-training language model for the marked data, and embedding codes;
performing span identification, filtering and relationship classification by using a span-based model;
converting the relationship classification into graph classification by using a graph-based model, and introducing syntactic dependency relationship so as to assist the relationship judgment classification;
performing joint training on the output result of the span-based model and the output result of the graph-based model, and identifying entities and relationships among the entities contained in the data;
the span-based model comprises an entity classifier, a span filter and a relation classifier, wherein the entity classifier is used for judging and classifying the entity, the span filter is used for filtering the span of the non-entity, and the relation classifier is used for judging the entity relation type for classifying;
the method for judging and classifying the entities by using the entity classifier comprises the following steps:
inputting a candidate span {t_i, t_{i+1}, ..., t_{i+k}} embedded and encoded by the pre-trained network model into the entity classifier, performing a max-pooling operation to obtain the vector representation f(t_i, t_{i+1}, ..., t_{i+k}), and concatenating it with the special classification vector t_cls encoded by the BERT model and the vector t_width encoding the span width to obtain the vector representation of the final entity:

e(s_i) = [f(t_i, t_{i+1}, ..., t_{i+k}); t_cls; t_width]
wherein i and k are indices; the concatenated result e(s_i) is then input into a fully connected layer and activated by softmax to obtain the probability distribution of entity types:

P(e_i) = softmax(W_i · e(s_i) + b_i)
wherein e_i denotes the entity type, W_i the weight, b_i the bias, and s_i the i-th span; the entity type is judged and classified through the probability distribution;
the method for converting the relationship classification into the graph classification by using the graph-based model and introducing the syntactic dependency relationship so as to assist the relationship judgment classification comprises the following steps:
obtaining a dependency parse tree of the sentence with the HanLP natural language processing tool and converting it into an adjacency matrix to obtain the input graph G_i of the graph-based model; then inputting the graph G_i into the graph convolutional neural network model GIN implemented with CogDL, and obtaining the vector representation h_{G_i} of the whole graph by learning neighbor-node features over multiple iterations;
inputting the graph vector h_{G_i} into a fully connected layer and activating with softmax to obtain the probability distribution of the graph classification:

P(g_i) = softmax(W_g · h_{G_i} + b_g)
wherein W_g denotes a weight and b_g a bias; the entity relation is judged and classified using the probability distribution of the graph classification;
the method for carrying out joint training on the output result of the span-based model and the output result of the graph-based model and identifying the entities and the relations among the entities contained in the data comprises the following steps:
obtaining the entity recognition loss γ_e of the span-based model with a cross-entropy loss function:

γ_e = -(1/N) Σ_{n=1}^{N} Σ_{c_e=1}^{M_e} y_{c_e} log(p_{c_e})
wherein M_e is the number of entity types; y_{c_e} is an indicator variable taking value 0 or 1, equal to 1 if the predicted class matches the sample class and 0 otherwise; p_{c_e} is the predicted probability that the observed entity span belongs to entity class c_e; N denotes the total number of samples in the dataset; and e is the identity of the entity;
obtaining the relation classification loss γ_r of the span-based model with a BCEWithLogits loss function:

γ_r = -(1/N) Σ_{n=1}^{N} [y_r log(p_r) + (1 - y_r) log(1 - p_r)]
wherein y_r is an indicator variable denoting whether the predicted relation class is the same as the sample class, p_r is the predicted probability of the relation, r is the identity of the relation, r_{i,j} denotes the relation between the i-th entity and the j-th entity, and i and j are indices;
obtaining the graph classification loss γ_g of the graph-based model with a cross-entropy loss function:

γ_g = -(1/N) Σ_{n=1}^{N} Σ_{c_g=1}^{M_g} y_{c_g} log(p_{c_g})
wherein M_g is the number of relation types; y_{c_g} is an indicator variable; p_{c_g} is the predicted probability that the observed graph belongs to class c_g; and g is the relation identifier in the graph classification;
the joint training is performed by using the following formula to obtain the joint loss gamma:
γ = γ_e + γ_r + f(·)·γ_g
wherein f(·) is a linear function, taken as f(x) = x/N, where x denotes the number of input samples and N denotes the total number of samples in the dataset.
2. The method for jointly extracting entity relations in the military field based on span and knowledge enhancement according to claim 1, wherein in step S2, when the data is manually marked, the entity position information, the entity type and the relation among the entities of the data are marked.
3. The method for jointly extracting entity relations in the military field based on span and knowledge enhancement according to claim 1, wherein in step S2, when the data is automatically labeled with regular templates, the entity types and the relations between entities are preset; according to the field to which the dataset belongs, the regular templates are written using knowledge provided by domain experts, and the preset entity types and inter-entity relations in the data are labeled by template matching.
4. The method for jointly extracting entity relationships in the military field based on span and knowledge enhancement according to claim 1, wherein in step S3, a pre-training language model adopts a BERT model, and a vector representation of words is obtained through efficient coding of context information.
5. The method for jointly extracting entity relations in the military field based on span and knowledge enhancement according to claim 1, wherein the method for filtering spans with the span filter is as follows: if the "none" type has the highest probability value in the entity-type probability distribution obtained by the entity classifier, the span is identified as the "none" type, judged not to be an entity, and filtered out.
6. The method for jointly extracting the entity relationships in the military field based on span and knowledge enhancement according to claim 5, wherein the method for judging the relationship types by using the relationship classifier is as follows:
representing span-coded vectors obtained by entity classifier and />Coding vector representation +_obtained by embedded coding with context between two spans>The relation expression is obtained by splicing, and as the relation among the entity pairs is opposite, two opposite relation expressions exist among all the entity pairs, namely:
wherein r_{i,j} and r_{j,i} respectively denote the relation representations between the i-th entity and the j-th entity, and i and j are indices;
inputting the relation representation into a fully connected layer and activating with a sigmoid function to obtain the probability distribution of relation types:

P(r_{i,j}) = σ(W_{i,j} · r_{i,j} + b_{i,j}),  P(r_{j,i}) = σ(W_{j,i} · r_{j,i} + b_{j,i})
wherein W_{i,j} and W_{j,i} denote weights, b_{i,j} and b_{j,i} denote biases, and σ(·) denotes the sigmoid function; the relation type between the entity pair is judged and classified through the obtained probability distribution of relation types.
CN202011021524.0A 2020-09-25 2020-09-25 Entity relationship joint extraction method based on span and knowledge enhancement Active CN112214610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011021524.0A CN112214610B (en) 2020-09-25 2020-09-25 Entity relationship joint extraction method based on span and knowledge enhancement


Publications (2)

Publication Number Publication Date
CN112214610A CN112214610A (en) 2021-01-12
CN112214610B true CN112214610B (en) 2023-09-08

Family

ID=74052289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011021524.0A Active CN112214610B (en) 2020-09-25 2020-09-25 Entity relationship joint extraction method based on span and knowledge enhancement

Country Status (1)

Country Link
CN (1) CN112214610B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094513B (en) * 2021-04-08 2023-08-15 北京工商大学 Span representation-based end-to-end menu information extraction method and system
CN113051356B (en) * 2021-04-21 2023-05-30 深圳壹账通智能科技有限公司 Open relation extraction method and device, electronic equipment and storage medium
CN112989835B (en) * 2021-04-21 2021-10-08 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Extraction method of complex medical entities
CN113204615B (en) * 2021-04-29 2023-11-24 北京百度网讯科技有限公司 Entity extraction method, device, equipment and storage medium
CN113240443B (en) * 2021-05-28 2024-02-06 国网江苏省电力有限公司营销服务中心 Entity attribute pair extraction method and system for power customer service question and answer
CN113411549B (en) * 2021-06-11 2022-09-06 上海兴容信息技术有限公司 Method for judging whether business of target store is normal or not
CN113536795B (en) * 2021-07-05 2022-02-15 杭州远传新业科技有限公司 Method, system, electronic device and storage medium for entity relation extraction
CN113627185A (en) * 2021-07-29 2021-11-09 重庆邮电大学 Entity identification method for liver cancer pathological text naming
CN113779260B (en) * 2021-08-12 2023-07-18 华东师范大学 Pre-training model-based domain map entity and relationship joint extraction method and system
CN113791791B (en) * 2021-09-01 2023-07-25 中国船舶重工集团公司第七一六研究所 Business logic code-free development method based on natural language understanding and conversion
CN114611497B (en) * 2022-05-10 2022-08-16 北京世纪好未来教育科技有限公司 Training method of language diagnosis model, language diagnosis method, device and equipment
CN114881038B (en) * 2022-07-12 2022-11-11 之江实验室 Chinese entity and relation extraction method and device based on span and attention mechanism
CN115599902B (en) * 2022-12-15 2023-03-31 西南石油大学 Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN117131198B (en) * 2023-10-27 2024-01-16 中南大学 Knowledge enhancement entity relationship joint extraction method and device for medical teaching library

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019839A (en) * 2018-01-03 2019-07-16 中国科学院计算技术研究所 Medical knowledge map construction method and system based on neural network and remote supervisory
CN110597998A (en) * 2019-07-19 2019-12-20 中国人民解放军国防科技大学 Military scenario entity relationship extraction method and device combined with syntactic analysis
CN111339774A (en) * 2020-02-07 2020-06-26 腾讯科技(深圳)有限公司 Text entity relation extraction method and model training method
US10706045B1 (en) * 2019-02-11 2020-07-07 Innovaccer Inc. Natural language querying of a data lake using contextualized knowledge bases




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant