CN112214610A - Entity relation joint extraction method based on span and knowledge enhancement - Google Patents

Entity relation joint extraction method based on span and knowledge enhancement Download PDF

Info

Publication number
CN112214610A
CN112214610A (application CN202011021524.0A)
Authority
CN
China
Prior art keywords
entity
span
relationship
graph
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011021524.0A
Other languages
Chinese (zh)
Other versions
CN112214610B (en)
Inventor
张骁雄
刘姗姗
丁鲲
张雨豪
张慧
刘茗
蒋国权
漆桂林
周晓磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202011021524.0A priority Critical patent/CN112214610B/en
Publication of CN112214610A publication Critical patent/CN112214610A/en
Application granted granted Critical
Publication of CN112214610B publication Critical patent/CN112214610B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS — G06 COMPUTING; CALCULATING OR COUNTING — G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/367 Ontology (creation of semantic tools for unstructured textual data)
    • G06F16/35 Clustering; classification (information retrieval of unstructured textual data)
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06F40/205 Parsing (natural language analysis)
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N3/045 Combinations of networks (neural network architectures)
    • G06N3/08 Learning methods (neural networks)
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Operations Research (AREA)
  • Algebra (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an entity relationship joint extraction method based on span and knowledge enhancement, belonging to the technical field of information extraction and natural language processing. First, a sample data set is constructed and labeled. Then entity recognition and relation classification are carried out: for the labeled data, a pre-trained language model maps words from a high-dimensional discrete space to low-dimensional continuous space vectors; a span-based model performs span identification, filtering, and relation classification; a graph-based model converts the relation classification into a graph classification and introduces syntactic dependency relations to assist relation judgment and classification; finally, the outputs of the span-based model and the graph-based model are jointly trained to identify the entities contained in the data and the relations among them. By introducing syntactic information such as dependency relations into an end-to-end neural network model, the invention effectively identifies overlapping relations and improves the accuracy of entity relationship joint extraction.

Description

Entity relation joint extraction method based on span and knowledge enhancement
Technical Field
The invention belongs to the technical field of information extraction and natural language processing, and particularly relates to an entity relationship joint extraction method based on span and knowledge enhancement.
Background
Extracting entities and their interrelationships plays a crucial role in understanding text. Specifically, named entity recognition identifies entities with specific meanings in a text and determines their types (person names, place names, organization names, proper nouns, and the like), while relation classification determines the type of relationship existing between a given entity pair. Both are particularly critical in determining the structure of a text for downstream tasks such as knowledge graph construction and knowledge-based question answering.
The traditional entity relationship extraction method is a pipeline process: named entity recognition and relation classification are treated as two independent subtasks. Given a piece of text, the entities in it are identified first, and the relationship types between the identified entities are then judged. Although the pipeline method is easy to implement, error propagation easily occurs: if an error is made during named entity recognition, the subsequent relation classification is affected. To address this, some recent research proposes joint entity-relationship extraction methods that fully mine the latent dependencies between entities and their relations, so that named entity recognition and relation classification can reinforce each other. Although joint extraction effectively alleviates the error-propagation problem of the pipeline method, it places high demands on data set annotation, requiring a large amount of high-quality labeled data to train the model; yet annotating data in a particular domain is time-consuming and difficult. Meanwhile, existing entity relationship extraction methods based on end-to-end neural networks cannot fully mine syntactic and semantic information within sentences, and data labeled with tagging schemes such as BIO/BILOU neglects phenomena such as overlapping relations and multiple labels, which degrades the effect of entity relationship extraction.
Disclosure of Invention
The technical problem is as follows: aiming at the poor extraction performance of existing entity relationship extraction methods, the invention provides an entity relationship joint extraction method based on span and knowledge enhancement, which introduces syntactic information such as dependency relations into an end-to-end neural network model, identifies overlapping relations, and further improves entity relationship extraction accuracy.
The technical scheme is as follows: the invention discloses a span and knowledge enhancement based entity relationship joint extraction method, which comprises the following steps:
s1: building a data set
Collecting data of a specific field, cleaning the collected data and constructing a data set of the field;
s2: annotating data
Randomly selecting a plurality of data in the data set, manually marking, and automatically marking the data which are not manually marked in the data set by using a regular template;
s3: entity identification and relationship classification
For the labeled data, mapping words in a high-dimensional discrete space to low-dimensional continuous space vectors by using a pre-trained language model to obtain embedded encodings;
performing span identification, filtering and relationship classification by using a span-based model;
converting the relationship classification into a graph classification by using a graph-based model, and introducing a syntactic dependency relationship so as to assist the relationship judgment classification;
and performing joint training on the output result of the span-based model and the output result of the graph-based model, and identifying entities contained in the data and relationships among the entities.
Further, in step S2, when the data is manually labeled, the entity location information, the entity type, and the relationship between the entities of the data are labeled.
Further, in step S2, when the regular templates are used to label the data automatically, the entity types and the relationships between entities are preset; the regular templates are written for the domain of the data set using the knowledge of domain experts, and the preset entity types and inter-entity relationships are labeled in the data by means of template matching.
Further, in step S3, the pre-trained language model adopts a BERT model, which obtains vector representations of words by effectively encoding context information.
Further, in step S3, the span-based model includes an entity classifier, a span filter, and a relationship classifier; the entity classifier judges and classifies entities, the span filter filters out spans that are not entities, and the relationship classifier then judges the entity relationship type for classification.
Further, the method for classifying entities with the entity classifier comprises the following steps:
inputting a candidate span (t_i, t_{i+1}, ..., t_{i+k}), embedded and encoded by the pre-trained network model, into the entity classifier, and obtaining the vector representation f(t_i, t_{i+1}, ..., t_{i+k}) of the entity through one max-pooling operation; splicing it with the special classification vector t_cls obtained by encoding with the BERT model and the vector t_width encoding the span width, to obtain the vector representation of the final entity:

e(s_i) = [f(t_i, t_{i+1}, ..., t_{i+k}); t_cls; t_width]

where i and k both denote indices; the spliced result e(s_i) is then input into a fully connected layer and activated with softmax to obtain the probability distribution of the entity type:

P(e_i | s_i) = softmax(W_i · e(s_i) + b_i)

where e_i denotes an entity type, W_i is a weight, b_i is a bias, and s_i denotes the i-th span; the entity type is judged and classified through this probability distribution.
Further, the method for filtering spans with the span filter is as follows: in the entity-type probability distribution produced by the entity classifier, if the "none" type has the highest probability, the span is identified as "none", judged not to be an entity, and filtered out.
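The entity-classifier and span-filter steps above can be sketched numerically in plain Python; the toy embeddings, weight matrix, and type inventory below are invented for illustration and stand in for the trained BERT vectors and learned parameters.

```python
# Minimal sketch of span classification: max-pool the span's token vectors,
# splice with t_cls and t_width, apply a softmax-activated linear layer,
# and let the span filter discard spans classified as "none".
import math

def max_pool(vectors):
    """Element-wise max over the token vectors of a span: f(t_i..t_{i+k})."""
    return [max(col) for col in zip(*vectors)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_span(span_tokens, t_cls, t_width, W, b, types):
    e_s = max_pool(span_tokens) + t_cls + t_width          # [f; t_cls; t_width]
    logits = [sum(w * x for w, x in zip(row, e_s)) + bi    # fully connected layer
              for row, bi in zip(W, b)]
    probs = softmax(logits)
    return types[probs.index(max(probs))], probs

# Toy setup: 2-d token vectors, 2-d CLS, 1-d width embedding -> 5-d input;
# three types, where "none" marks spans the span filter discards.
types = ["none", "equipment", "person"]
W = [[0.0] * 5,
     [1.0, 1.0, 0.2, 0.2, 0.1],
     [-1.0, -1.0, 0.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]
span = [[0.9, 0.1], [0.8, 0.3]]        # embeddings of the span's tokens
label, probs = classify_span(span, t_cls=[0.5, 0.5], t_width=[0.2],
                             W=W, b=b, types=types)
# The span filter keeps this span only if label != "none".
```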
Further, the method for judging the relationship type with the relationship classifier comprises the following steps:
splicing the span-encoded vector representations e(s_i) and e(s_j) obtained by the entity classifier with the encoded vector representation c(s_i, s_j), obtained by embedding and encoding the context between the two spans, to obtain the relation representation; since the relation between an entity pair is directional, every entity pair has two opposite relation representations:

r_{i,j} = [e(s_i); c(s_i, s_j); e(s_j)]
r_{j,i} = [e(s_j); c(s_j, s_i); e(s_i)]

where r_{i,j} and r_{j,i} respectively denote the relations between the i-th entity and the j-th entity, and i and j denote indices;

inputting the relation representation into a fully connected layer and activating it with a sigmoid function to obtain the probability distribution of the relation type:

P(r_{i,j}) = σ(W_{i,j} · r_{i,j} + b_{i,j})
P(r_{j,i}) = σ(W_{j,i} · r_{j,i} + b_{j,i})

where W_{i,j} and W_{j,i} denote weights, b_{i,j} and b_{j,i} denote biases, and σ(·) denotes the sigmoid function; the relation type between the entity pair is judged and classified through the obtained probability distribution of the relation type.
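The directed relation scoring above admits a small numeric sketch; the vectors, weights, and relation names are illustrative placeholders, not trained values.

```python
# Sketch of directed relation classification: concatenate
# [e(s_i); c(s_i,s_j); e(s_j)] and apply a sigmoid-activated linear layer,
# once per direction of the entity pair.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relation_probs(e_i, ctx, e_j, W, b):
    r = e_i + ctx + e_j                                   # relation representation
    return [sigmoid(sum(w * x for w, x in zip(row, r)) + bi)
            for row, bi in zip(W, b)]                     # one probability per type

e_head, e_tail, ctx = [1.0, 0.0], [0.0, 1.0], [0.5]       # e(s_i), e(s_j), c(s_i,s_j)
W = [[2.0, 0.0, 1.0, 0.0, 2.0],                           # toy "deployed" row
     [-2.0, 0.0, 0.0, 0.0, -2.0]]                         # toy "located" row
b = [-1.0, 0.0]
p_ij = relation_probs(e_head, ctx, e_tail, W, b)          # r_{i,j}: head -> tail
p_ji = relation_probs(e_tail, ctx, e_head, W, b)          # r_{j,i}: tail -> head
# Because head and tail are swapped, the two directed scores differ.
```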
Further, the method for judging and classifying the entity relationship with the graph-based model comprises the following steps:
obtaining the dependency parse tree of the sentence with the HanLP natural language processing tool, converting it into an adjacency matrix, and obtaining the input graph G_i of the graph-based model; then inputting the graph G_i into the graph isomorphism network (GIN) implemented with CogDL, which obtains the vector representation of the whole graph by repeatedly and iteratively learning the features of neighboring nodes:

h_v^{(k)} = MLP^{(k)}((1 + ε^{(k)}) · h_v^{(k-1)} + Σ_{u∈N(v)} h_u^{(k-1)})
h_{G_i} = READOUT({h_v^{(K)} : v ∈ G_i})

the vector representation h_{G_i} of the graph is input into a fully connected layer and activated with softmax to obtain the probability distribution of the graph classification:

P(g | G_i) = softmax(W_g · h_{G_i} + b_g)

where W_g denotes the weight and b_g denotes the bias; the entity relationship is judged and classified with the probability distribution of the graph classification.
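A minimal sketch of the graph branch follows, assuming a toy dependency parse and hand-picked node features in place of the HanLP + CogDL pipeline named in the text; the MLP inside each GIN layer is omitted so the neighbor aggregation stays visible.

```python
# A dependency parse is represented as an adjacency list, node features are
# aggregated GIN-style ((1+eps)*self + sum of neighbors), and a sum readout
# produces the whole-graph vector h_G used for classification.
def gin_layer(features, adj, eps=0.0):
    out = {}
    for v, h_v in features.items():
        agg = [x * (1.0 + eps) for x in h_v]              # (1 + eps) * h_v
        for u in adj.get(v, []):                          # + sum over neighbors
            agg = [a + x for a, x in zip(agg, features[u])]
        out[v] = agg                                      # MLP omitted in this sketch
    return out

def readout(features):
    """Sum-pool the node vectors into the graph vector h_G."""
    return [sum(col) for col in zip(*features.values())]

# Toy undirected dependency edges for a three-word sentence.
adj = {"deployed": ["troops", "missiles"],
       "troops": ["deployed"],
       "missiles": ["deployed"]}
feats = {"troops": [1.0, 0.0], "deployed": [0.0, 1.0], "missiles": [1.0, 1.0]}
h_graph = readout(gin_layer(feats, adj))
```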
Further, the method for jointly training the output of the span-based model and the output of the graph-classification model, and identifying the entities contained in the data and the relationship types between them, comprises the following steps:

obtaining the entity-recognition loss γ_e of the span-based model with the cross-entropy loss function:

γ_e = -(1/N) Σ_{n=1}^{N} Σ_{c_e=1}^{M_e} y_{c_e} log(p_{c_e})

where M_e is the number of entity types; y_{c_e} is an indicator variable taking the value 0 or 1, equal to 1 if the class is the same as the sample class and 0 otherwise; p_{c_e} is the predicted probability that the observed entity span belongs to entity class c_e; N denotes the total number of samples in the data set; and e is the identifier of the entity task;

obtaining the relation-classification loss γ_r of the span-based model with the BCEWithLogits loss function:

γ_r = -(1/N) Σ_{n=1}^{N} [y_r log(σ(p_r)) + (1 - y_r) log(1 - σ(p_r))]

where y_r is an indicator variable denoting whether the predicted relation type is the same as the sample type, and r is the identifier of the relation task;

obtaining the graph-classification loss γ_g of the graph-based model with the cross-entropy loss function:

γ_g = -(1/N) Σ_{n=1}^{N} Σ_{c_g=1}^{M_g} y_{c_g} log(p_{c_g})

where M_g is the number of relation types; y_{c_g} is an indicator variable; p_{c_g} is the predicted probability that the observed graph belongs to class c_g; and g is the relation identifier in the graph classification;

joint training is performed with the following formula to obtain the joint loss γ:

γ = γ_e + γ_r + f(·) · γ_g

where f(·) is a linear function, taken as f(x) = x / N, with x denoting the number of input samples and N denoting the total number of samples in the data set.
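Once the three losses are computed, the joint objective reduces to simple arithmetic; a sketch with placeholder loss values:

```python
# Joint loss gamma = gamma_e + gamma_r + f(.) * gamma_g with f(x) = x / N,
# so the graph-classification loss is weighted by the fraction of samples seen.
def joint_loss(gamma_e, gamma_r, gamma_g, x, N):
    """Combine the three losses; x is the number of input samples, N the total."""
    return gamma_e + gamma_r + (x / N) * gamma_g

loss = joint_loss(gamma_e=0.8, gamma_r=0.4, gamma_g=0.6, x=50, N=100)
# 0.8 + 0.4 + 0.5 * 0.6 = 1.5
```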
Advantageous effects: compared with the prior art, the invention has the following advantages:
The invention discloses an entity relationship joint extraction method based on span and knowledge enhancement for solving joint entity-relationship extraction in a specific field. The method is composed of a span-based model and a graph-based model: the span-based model performs entity recognition and relation classification using contextual representations of the text, while the graph-based model performs a graph-classification task using the syntax tree obtained by dependency parsing, so as to effectively judge the relation type. The model of the invention introduces syntactic information such as dependency relations into the end-to-end neural network model, thereby effectively identifying overlapping relations and improving the accuracy of entity relationship joint extraction.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is an exemplary diagram of a manual annotation in an embodiment of the present invention;
FIG. 3 is a flowchart of a process for automatic tagging in an embodiment of the present invention;
FIG. 4 is an exemplary diagram of a canonical template in an embodiment of the invention;
FIG. 5 is a model diagram of entity relationship joint extraction according to the present invention.
Detailed Description
The invention is further described with reference to the following examples and the accompanying drawings.
Referring to fig. 1, the entity relationship joint extraction method based on span and knowledge enhancement of the present invention includes:
s1: building a data set
In the embodiment of the present invention, crawler software or a crawling program is used to crawl news texts from a portal website; in other embodiments, the data set may also be data accumulated by an enterprise or data collected in other ways. After enough data are collected, the collected data are cleaned and data that do not meet the requirements are removed, completing the construction of the data set.
For example, in one embodiment of the present invention, for news in the military field, the military news pages of a certain portal website were crawled and 840,000 military news articles were collected; articles irrelevant to the military field were filtered out using military-domain keywords, finally yielding 85,000 articles, from which a data set of 85,000 articles was constructed.
S2: annotating data
Data annotation includes manual annotation and automatic annotation. Manual annotation can make full use of expert experience, so its accuracy is relatively high; however, because the data set is large, annotation cannot be completed entirely by hand, so automatic annotation is needed to improve annotation efficiency.
In the embodiment of the invention, a number of data items in the data set are randomly selected for manual annotation, and the remaining data are annotated automatically. When data are manually annotated, the position information, entity types, and entity relationship types must be marked. Entity types and relationship types are preset before annotation; for example, for the military-field data set, the preset entity types include: equipment, person, organization, place name, military activity, job title, and combat-readiness engineering; the preset relationship types include: deployed, held, owned, and located. 338 articles were randomly extracted from the military-field data set, and experts in the military field were invited to annotate them. The experts manually annotated the extracted data according to the preset entity and relationship types, assigned a specific number to each entity appearing in an article, and marked entity positions according to where each entity starts and ends in the article. FIG. 2 shows an example of data annotated manually.
For data that is not manually labeled, in the embodiment of the present invention, labeling is performed by using a regular template, and a flow of labeling by using a regular template is shown in fig. 3:
(1) Defining entity and relationship types. In one embodiment of the invention, because entities and relations in the military field are highly complex, the type taxonomy was discussed and formulated with domain experts closely engaged with the specialty, based on the common content of the data set, so that current mainstream military entity and relation types can be summarized accurately and the extracted relation triples can be added to the construction of a military knowledge graph.
(2) Randomly extracting 100 military news texts from the data set, manually writing corresponding regular expressions for the relations and entities in each text, then testing the effect of the regular expressions on the 338 manually annotated military news texts, and supplementing missing regular expressions according to the recall value. Note that in other embodiments, a different amount of data may be extracted for writing regular expressions.
(3) Iterating: returning to step (2) and repeating it until the precision and recall of the regular extraction reach the thresholds. The whole process then ends, and the finalized regular expressions are used to extract the corresponding entities and relations from the data set and to label the data.
In the implementation of the invention, 119 relational regular expressions were designed in total; an example of a written regular template is shown in FIG. 4. The matching results of the regular templates on the annotated data set are analyzed, and a regex-extracted relation is matched to a manually annotated relation statement according to two criteria: the type predefined by the relational regular expression is the same as the manually annotated type, or the head and tail entities of the manually annotated relation sentence appear in the sentence extracted by the regular expression.
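A regular-template match of the kind described can be sketched with Python's re module; the pattern, entity shapes, and relation type below are invented examples, not one of the 119 actual templates.

```python
# Toy template: "<head> deployed <tail>" -> relation "deployed", recording
# the matched head/tail strings and their character spans for labeling.
import re

PATTERN = re.compile(r"(?P<head>\w+(?: \w+)?) deployed (?P<tail>[\w-]+(?: \w+)?)")

def auto_label(sentence, relation="deployed"):
    m = PATTERN.search(sentence)
    if not m:
        return None                                   # template does not match
    return {"head": m.group("head"), "relation": relation,
            "tail": m.group("tail"),
            "head_span": m.span("head"), "tail_span": m.span("tail")}

triple = auto_label("The unit deployed F-15 fighters near the base")
```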
After data labeling is finished, the manually annotated data and the automatically annotated data are mixed and shuffled, and then entity recognition and relationship classification are carried out as described below.
S3: entity identification and relationship classification
For the labeled data, a pre-trained language model maps words from a high-dimensional discrete space to low-dimensional continuous space vectors to obtain embedded encodings; a span-based model performs span identification, filtering, and relation classification; a graph-based model converts the relation classification into a graph classification and introduces syntactic dependency relations to assist relation classification; the outputs of the span-based model and the graph-based model are jointly trained to identify the entities contained in the data and the relations among them.
In the embodiment of the invention, the pre-trained language model adopts the BERT model released by Google and trained for Chinese, which can map words from a high-dimensional discrete space to low-dimensional continuous space vectors and obtain embedded encodings. The BERT model is a multi-layer bidirectional Transformer structure, and vector representations of words can be obtained by effectively encoding context information. For example, given a sentence containing n words, inputting it into the BERT-based embedded encoding module yields a word vector sequence {t_cls, t_1, t_2, ..., t_n} of length n+1; the BERT model adds a special classification vector t_cls, covering the information of the whole sentence, at the head of the sequence.
The span-based model comprises an entity classifier, a span filter and a relation classifier, the entity classifier is used for carrying out entity classification on the output of the BERT model, the span filter is used for filtering out non-entity spans, and then the relation classifier is used for judging and classifying entity relations.
After the span-based model obtains the BERT-based text vector representation, spans are obtained with an optimized negative-sampling scheme, and spans not in the labeled entity list are defined as negative samples. For example, for a sentence whose characters gloss as (U.S., nation, F, -, 1, 5, war, fight, machine), entities that may be detected include (U.S.), (U.S. F), and (F-15 fighter), among others. Unlike the prior art, the span-based model of the invention does not perform beam search over entity and relation hypotheses; instead, a maximum N_e is set, i.e., at most N_e entities are chosen among all possible entities, and samples not labeled as positive examples in the training set are marked as negative examples. Unlike existing span-based models, the invention proposes a new way to select negative examples: first, a set S of military entities is created, containing as many of the entities in the data set as possible (the labeled data plus the results of entity regular extraction); sentences are segmented with jieba (word-segmentation software), all possible entities are obtained from the segments, and the part of speech corresponding to each segmentation result is obtained. For example, from "I am at Beijing Tiananmen", the candidates "I", "Beijing", "Tianan", and "Tiananmen" can be obtained according to part of speech. Filtering is first done by part of speech, keeping only nouns; similarity is then computed between each noun and the entities in the entity set S, and the highest similarity value is taken as the score of that segmentation result; finally, candidates are sorted so that higher similarity means higher priority as a negative example. If N_e is not reached after filling with the segmentation results, random spans are selected, and the random spans may select entities with lengths of 2 to 10.
For example, in the military corpus studied in the embodiment of the present invention, entity lengths fall substantially in this range, and entities that better match the characteristics of military entities but are not labeled can be selected as negative examples, making the training effect of the model better.
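The similarity-ranked negative sampling can be sketched as follows; the character-overlap similarity used here is a stand-in assumption, since the text does not specify which similarity measure the implementation uses, and the entity set and candidates are invented examples.

```python
# Candidate noun segments are scored by their highest similarity to a
# known-entity set S, then sorted so the most entity-like unlabeled spans
# become negative examples first.
def char_overlap(a, b):
    """Jaccard overlap of character sets, standing in for the real similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def rank_negatives(candidates, entity_set, n_e):
    scored = [(max(char_overlap(c, e) for e in entity_set), c) for c in candidates]
    scored.sort(reverse=True)              # higher similarity -> higher priority
    return [c for _, c in scored[:n_e]]    # keep at most N_e negatives

S = {"F-15 fighter", "Tiananmen", "defense ministry"}
candidates = ["F-16 fighter", "weather", "ministry building"]
negatives = rank_negatives(candidates, S, n_e=2)
```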
After the span-based model has selected the possible entities, the vector representation of each entity is computed. The vector representation of an entity consists of three parts: the vector representations of the tokens contained in the entity (see FIG. 5, i.e., the entity's words mapped to their corresponding ids in the pre-trained model's dictionary), the width embedding (see FIG. 5), and the special mark CLS (see FIG. 5).
Therefore, the method for classifying entities with the entity classifier is as follows:
a candidate span (t_i, t_{i+1}, ..., t_{i+k}), embedded and encoded by the pre-trained network model, is input into the entity classifier, and the vector representation f(t_i, t_{i+1}, ..., t_{i+k}) of the entity is obtained through one max-pooling operation. The width embedding is an embedding matrix whose features are learned in training: an entity of width k+1 contains k+1 tokens, so the width embedding of the entity is indexed by k+1, and the vector of width k+1 obtained by indexing into the width matrix is t_width, i.e., the vector encoding the span width. The special mark symbol CLS is generated by the BERT model and covers the global information of the input sentence; the BERT model encodes it to obtain the special classification vector t_cls. The vector representation f(t_i, t_{i+1}, ..., t_{i+k}) of the entity is spliced with the special classification vector t_cls and the width-encoding vector t_width to obtain the vector representation of the final entity:

e(s_i) = [f(t_i, t_{i+1}, ..., t_{i+k}); t_cls; t_width]

where i and k both denote indices. The spliced result e(s_i) is input into a fully connected layer and activated with softmax; the entity types include a "none" type, and the probability distribution of the entity type is obtained:

P(e_i | s_i) = softmax(W_i · e(s_i) + b_i)

where e_i denotes an entity type, W_i is a weight, b_i is a bias, and s_i denotes the i-th span; the entity type is judged through this probability distribution.
The span filter filters spans according to the entity-type probability distribution produced by the entity classifier, removing non-entity spans: during filtering, if the "none" type has the highest probability in the distribution, the span is identified as "none", i.e., judged not to be an entity, and is therefore filtered out.
The relationship classifier performs entity relation classification, constructing and classifying relations for all possible entity pairs. First, at most N_r entities are randomly selected from the candidate entities to form the relation set. For an entity pair (s_1, s_2), the relation vector representation consists of two parts. One part is the head- and tail-entity vector representations obtained in the span identification stage; these span encodings x_{s_1} and x_{s_2} are produced by the entity classifier. The other part is textual features: besides entity features, relation extraction can also draw on the text itself. The invention does not use CLS as the text feature; instead, the text between the two entities is max-pooled, preserving the context information between the entity pair, and embedded encoding yields the text-feature vector c_{1,2}. If there is no text between the two entities, c_{1,2} is set to 0. Since the relation of an entity pair is often asymmetric and the head and tail entities of a relation cannot be swapped, each entity pair yields two opposite relation representations:

r_{i,j} = x_{s_i} ∘ c_{i,j} ∘ x_{s_j}
r_{j,i} = x_{s_j} ∘ c_{j,i} ∘ x_{s_i}

where r_{i,j} and r_{j,i} denote the relation between the i-th and j-th entities in each direction, and i and j are indices.
The relation representation is input into a fully connected layer and activated with the sigmoid function σ(·) to obtain the probability distribution over relation types:

ŷ_{i,j} = σ(W_{i,j} · r_{i,j} + b_{i,j})
ŷ_{j,i} = σ(W_{j,i} · r_{j,i} + b_{j,i})

where W_{i,j} and W_{j,i} are weights and b_{i,j} and b_{j,i} are biases; the relation type between the entity pair is determined from the resulting probability distribution.
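The asymmetric relation representation and the per-type sigmoid classifier can be sketched as below; the vectors, weights, and helper names are illustrative assumptions, not values from the invention:

```python
# Sketch of building r_ij / r_ji and scoring relation types with sigmoid.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relation_repr(head, tail, context):
    """r: head-entity vector ∘ max-pooled context ∘ tail-entity vector."""
    return head + context + tail  # list concatenation = vector concatenation

def relation_probs(r, W, b):
    """One sigmoid per relation type (multi-label scoring, not softmax)."""
    return [sigmoid(sum(w * x for w, x in zip(row, r)) + bi)
            for row, bi in zip(W, b)]

# The two directions reuse the same entity vectors with swapped positions,
# so r_ij != r_ji in general, matching the asymmetry of relations.
e1, e2, ctx = [1.0], [2.0], [0.5]
r_ij = relation_repr(e1, e2, ctx)  # head = entity 1
r_ji = relation_repr(e2, e1, ctx)  # head = entity 2
```

Because the classifier is a bank of independent sigmoids, an entity pair can in principle score highly on several relation types at once.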
The graph-based model converts relation classification into a graph classification problem and introduces syntactic dependency analysis to assist relation classification, effectively mitigating the inability of end-to-end neural network models to exploit syntactic information.
The graph-based model comprises a dependency parse tree, a graph neural network, and a graph classifier; it assists relation classification as follows. For any input sentence, the HanLP natural language processing tool is used to obtain the sentence's dependency parse tree, which is converted into an adjacency matrix to produce the input graph G_i of the graph-based model. More specifically, for the words at each node of the tree, the word vectors obtained from the BERT model are summed to form the node label; the dependency relation types between words serve as edge labels; and the relation type of the whole sentence serves as the graph label. The input graph G_i is then fed into a Graph Isomorphism Network (GIN) model implemented with the CogDL toolkit, and the features of neighboring nodes are learned over multiple iterations to obtain the representation vector h_{G_i} of the whole graph. The graph representation h_{G_i} is input into a fully connected layer and activated with softmax to obtain the probability distribution of the graph classification:

ŷ_g = softmax(W_g · h_{G_i} + b_g)

where W_g is the weight and b_g is the bias; the relation is determined from the probability distribution of the graph classification.
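The conversion of a dependency parse into the adjacency matrix of the input graph G_i can be sketched as below. The real pipeline obtains the parse from HanLP and feeds the graph to a CogDL GIN model; here the parse is assumed to be given as a list of 1-based head indices (0 marking the root), a common dependency-parse convention:

```python
# Sketch: dependency parse -> adjacency matrix for the input graph G_i.
def dep_to_adjacency(heads):
    """heads[i] is the 1-based head of token i+1; 0 marks the root.
    Returns a symmetric adjacency matrix (edges treated as undirected)."""
    n = len(heads)
    adj = [[0] * n for _ in range(n)]
    for i, h in enumerate(heads):
        if h > 0:  # the root token has no incoming dependency edge
            adj[i][h - 1] = 1
            adj[h - 1][i] = 1
    return adj
```

Node labels (summed BERT word vectors) and edge labels (dependency types) would be attached alongside this matrix before the graph is handed to the GIN model.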
The output of the span-based model and the output of the graph-based model are jointly trained, and the entities contained in the data and the relation types between them are identified, as follows:
The entity identification loss γ_e of the span-based model is obtained with the cross-entropy loss function:

γ_e = -(1/N) Σ_{c_e=1}^{M_e} y_{c_e} log(p_{c_e})

where M_e is the number of entity types; y_{c_e} is an indicator variable taking the value 0 or 1, equal to 1 if the class matches the sample class and 0 otherwise; p_{c_e} is the predicted probability that the observed entity span belongs to entity class c_e; N is the total number of samples in the dataset; and e is the identifier of the entity.
The relation classification loss γ_r of the span-based model is obtained with the BCEWithLogits loss function:

γ_r = -(1/N) Σ_{n=1}^{N} [ y_r log(ŷ_r) + (1 - y_r) log(1 - ŷ_r) ]

where y_r is an indicator variable denoting whether the predicted relation class matches the sample class; N is the total number of samples in the dataset; and r is the identifier of the relation.
The graph classification loss γ_g of the graph-based model is obtained with the cross-entropy loss function:

γ_g = -(1/N) Σ_{c_g=1}^{M_g} y_{c_g} log(p_{c_g})

where M_g is the number of relation types; y_{c_g} is an indicator variable; p_{c_g} is the predicted probability that the observed graph belongs to class c_g; and g is the relation identifier in graph classification.
Joint training is performed with the following formula to obtain the joint loss γ:

γ = γ_e + γ_r + f(·)·γ_g

where f(·) is a linear function. In a preferred embodiment of the invention, the linear function takes f(x) = x/N, where x denotes the number of input samples and N the total number of samples in the dataset.
Through joint training, entities contained in the sentences and relationship types among the entities are identified.
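The loss aggregation described above can be sketched with minimal stand-ins; the single-sample `cross_entropy` and `bce_with_logits` below are simplified versions of the framework losses (e.g. PyTorch's CrossEntropyLoss and BCEWithLogitsLoss), not the invention's exact code:

```python
# Sketch of the joint objective γ = γ_e + γ_r + f(·)·γ_g with f(x) = x/N.
import math

def cross_entropy(probs, target):
    """-log p_target; used for entity (γ_e) and graph (γ_g) classification."""
    return -math.log(probs[target])

def bce_with_logits(logit, y):
    """Numerically stable binary cross entropy on a raw logit,
    used for the relation loss γ_r."""
    return max(logit, 0) - logit * y + math.log(1 + math.exp(-abs(logit)))

def joint_loss(gamma_e, gamma_r, gamma_g, x, n):
    """γ = γ_e + γ_r + (x / N) · γ_g, the linear-function combination."""
    return gamma_e + gamma_r + (x / n) * gamma_g
```

The linear weight x/N grows with the number of samples seen, so the graph classification term contributes more as training progresses over the dataset.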
Based on the method of the present invention, a specific application example is given.
First, military news webpages of a representative website are crawled, obtaining 840,000 military news articles. Articles irrelevant to the military field or containing no military relations are filtered out using military-domain keywords, finally yielding 85,000 articles and completing the construction of the dataset. Then 338 articles are randomly extracted and domain experts are invited to label them manually. Articles without manual labels are labeled automatically using regular-expression templates; 119 relation regular expressions were designed in total to annotate the dataset. Finally, the dataset is randomly divided into a training set and a test set at a ratio of 10:1. The model parameters used in the invention are set as in Table 1.
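The regular-template auto-labeling step might look like the following sketch; the single pattern, the relation name, and the English example sentence are invented for illustration (the invention uses 119 hand-crafted relation regexes over Chinese text):

```python
# Hypothetical illustration of regex-template relation labeling:
# a pattern captures a head entity, a relation cue, and a tail entity.
import re

TEMPLATE = re.compile(r"(?P<head>\w+) is equipped with (?P<tail>\w+)")

def auto_label(sentence, relation="equipped_with"):
    """Return (head, relation, tail) triples matched by the template."""
    return [(m.group("head"), relation, m.group("tail"))
            for m in TEMPLATE.finditer(sentence)]
```

Each of the 119 templates would pair one such pattern with its target relation type, and sentences matching no template remain unlabeled.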
TABLE 1 parameter settings in the model
To demonstrate the superiority of the proposed model, its results are compared with existing models. Table 2 lists the evaluation results of the different models, compared on three metrics: precision, recall, and F1 score.
TABLE 2 evaluation results of different models
In Table 2, the result in row 1 is obtained without the graph-based model, and rows 2 to 4 show the results of hybrid models. Comparing the model of the invention with different GNN (Graph Neural Network) variants shows that the models perform differently. Although the SortPool model performs well on the graph classification task, it yields no improvement in the F1 score of the relation prediction task compared with the single model. Likewise, SpERT+PATCHY-SAN performs only moderately in both graph classification and relation extraction. The observation that the proposed model achieves the highest F1 scores in graph classification, entity identification, and relation classification shows that introducing specific external knowledge through the graph-based model can improve performance.
TABLE 3 comparison of results of different joint extraction methods
To jointly train the span-based model and the graph-based model, the entity identification loss γ_e and the relation classification loss γ_r obtained from the span-based model must be aggregated with the graph classification loss γ_g obtained from the graph-based model. Table 3 shows the extraction results for three different combination methods. The results show that, besides multiplication, addition and a linear function can also be jointly trained accurately. Meanwhile, with the linear function f(x) = x/N, the model obtains F1 scores of 76.60 and 58.57 in entity identification and relation classification respectively, higher than the other two combination methods.
The span- and knowledge-enhancement-based entity relation joint extraction method provided by the invention solves the problem of joint entity relation extraction in a specific domain. The method consists of a span-based model and a graph-based model: the span-based model performs entity identification and relation classification using contextual representations of the text, while the graph-based model performs a graph classification task using the syntax tree obtained from syntactic dependency analysis, so as to effectively determine the relation type. The model of the invention introduces syntactic information such as dependency relations into the end-to-end neural network model, thereby effectively identifying overlapping relations and improving the accuracy of joint entity relation extraction.
The above examples are only preferred embodiments of the present invention, it should be noted that: it will be apparent to those skilled in the art that various modifications and equivalents can be made without departing from the spirit of the invention, and it is intended that all such modifications and equivalents fall within the scope of the invention as defined in the claims.

Claims (10)

1. A span and knowledge enhancement based entity relation joint extraction method is characterized by comprising the following steps:
s1: building a data set
Collecting data of a specific field, cleaning the collected data and constructing a data set of the field;
s2: annotating data
Randomly selecting a plurality of data in the data set, manually marking, and automatically marking the data which are not manually marked in the data set by using a regular template;
s3: entity identification and relationship classification
For the labeled data, a pre-trained language model is used to map words from a high-dimensional discrete space to vectors in a low-dimensional continuous space, performing embedding encoding;
performing span identification, filtering and relationship classification by using a span-based model;
converting the relationship classification into a graph classification by using a graph-based model, and introducing a syntactic dependency relationship so as to assist the relationship judgment classification;
and performing joint training on the output result of the span-based model and the output result of the graph-based model, and identifying entities contained in the data and relationships among the entities.
2. The method for extracting entity relationships based on span and knowledge enhancement as claimed in claim 1, wherein in step S2, when the data is labeled manually, the entity location information, the entity type and the relationships between the entities of the data are labeled.
3. The entity relationship joint extraction method based on span and knowledge enhancement as claimed in claim 1, wherein in step S2, when the regular template is used to label the data automatically, the entity types and the relationships between entities are preset; according to the domain to which the dataset belongs, regular templates are compiled using knowledge written by domain experts, and the preset entity types and inter-entity relationships are marked in the data by means of template matching.
4. The method for entity relationship joint extraction based on span and knowledge enhancement as claimed in claim 1, wherein in step S3, the pre-trained language model employs a BERT model to obtain the vector representation of the word by effectively encoding context information.
5. The method for extracting entity relationship based on span and knowledge enhancement as claimed in claim 1, wherein in step S3, the span-based model includes an entity classifier, a span filter and a relationship classifier, the entity classifier is used to classify the entity by judgment, the span filter is used to filter out non-entity spans, and then the relationship classifier is used to classify the entity relationship type by judgment.
6. The method of claim 5, wherein the entity classifier is used to classify the entities according to the following steps:
the embeddings t_i, t_{i+1}, ..., t_{i+k} of a candidate span encoded by the pre-trained network model are input into the entity classifier; a vector representation f(t_i, t_{i+1}, ..., t_{i+k}) of the entity is obtained through one max-pooling operation and concatenated with the special classification vector t_cls obtained by BERT encoding and the span-width encoding vector t_width to obtain the final entity representation:

x_i = f(t_i, t_{i+1}, ..., t_{i+k}) ∘ t_cls ∘ t_width

where i and k are indices; the concatenated result x_i is input into a fully connected layer and the probability distribution of the entity type is obtained through softmax activation:

ŷ_e = softmax(W_i · x_i + b_i)

where e_i denotes an entity type, W_i denotes a weight, b_i denotes a bias, and s_i denotes the i-th span; the entity type is judged and classified through the probability distribution.
7. The entity relationship joint extraction method based on span and knowledge enhancement as claimed in claim 5 or 6, wherein the method for filtering the span by using the span filter is as follows: and in the probability distribution of the entity type obtained based on the entity classifier, if the probability value of the 'none' type is the highest, identifying the span as the 'none' type, judging that the span is not the entity, and filtering the span.
8. The entity relationship joint extraction method based on span and knowledge enhancement as claimed in claim 7, wherein the method for judging the relationship type by using the relationship classifier comprises:
the span encoding vectors x_{s_i} and x_{s_j} obtained from the entity classifier are concatenated with the encoded vector c_{i,j} of the context between the two spans, obtained by embedded encoding, to form the relation representation; since the relation between an entity pair is directional, each entity pair has two opposite relation representations:

r_{i,j} = x_{s_i} ∘ c_{i,j} ∘ x_{s_j}
r_{j,i} = x_{s_j} ∘ c_{j,i} ∘ x_{s_i}

where r_{i,j} and r_{j,i} denote the relation between the i-th and j-th entities in each direction, and i and j are indices;

the relation representation is input into a fully connected layer and activated with the sigmoid function to obtain the probability distribution of the relation type:

ŷ_{i,j} = σ(W_{i,j} · r_{i,j} + b_{i,j})
ŷ_{j,i} = σ(W_{j,i} · r_{j,i} + b_{j,i})

where W_{i,j} and W_{j,i} denote weights, b_{i,j} and b_{j,i} denote biases, and σ(·) denotes the sigmoid function; the relation type between the entity pair is judged and classified through the obtained probability distribution.
9. The method for extracting entity relationship based on span and knowledge enhancement as claimed in claim 1, wherein the method for judging and classifying entity relationship with the aid of graph-based model comprises:
obtaining the dependency parse tree of the sentence with the HanLP natural language processing tool, converting the parse tree into an adjacency matrix, and obtaining the input graph G_i of the graph-based model; then inputting the graph G_i into the GIN graph neural network model implemented with CogDL, and obtaining the representation vector h_{G_i} of the whole graph by repeatedly and iteratively learning the features of neighboring nodes; inputting h_{G_i} into a fully connected layer and activating with softmax to obtain the probability distribution of the graph classification:

ŷ_g = softmax(W_g · h_{G_i} + b_g)

where W_g denotes the weight and b_g denotes the bias; the entity relation is judged and classified using the probability distribution of the graph classification.
10. The method for extracting entity relationship based on span and knowledge enhancement as claimed in claim 1, wherein the method for performing joint training on the output result based on the span model and the output result based on the graph classification model and identifying the entity included in the data and the relationship type between the entities comprises:
the entity identification loss γ_e of the span-based model is obtained with the cross-entropy loss function:

γ_e = -(1/N) Σ_{c_e=1}^{M_e} y_{c_e} log(p_{c_e})

where M_e is the number of entity types; y_{c_e} is an indicator variable taking 0 or 1, equal to 1 if the class matches the sample class and 0 otherwise; p_{c_e} is the predicted probability that the observed entity span belongs to entity class c_e; N is the total number of samples in the dataset; and e is the identifier of the entity;

the relation classification loss γ_r of the span-based model is obtained with the BCEWithLogits loss function:

γ_r = -(1/N) Σ_{n=1}^{N} [ y_r log(ŷ_r) + (1 - y_r) log(1 - ŷ_r) ]

where y_r is an indicator variable denoting whether the predicted relation class matches the sample class, and r is the identifier of the relation;

the graph classification loss γ_g of the graph-based model is obtained with the cross-entropy loss function:

γ_g = -(1/N) Σ_{c_g=1}^{M_g} y_{c_g} log(p_{c_g})

where M_g is the number of relation types; y_{c_g} is an indicator variable; p_{c_g} is the predicted probability that the observed graph belongs to class c_g; and g is the relation identifier in graph classification;

joint training is performed with the following formula to obtain the joint loss γ:

γ = γ_e + γ_r + f(·)·γ_g

where f(·) is a linear function, the linear function taking f(x) = x/N, where x denotes the number of input samples and N denotes the total number of samples in the dataset.
CN202011021524.0A 2020-09-25 2020-09-25 Entity relationship joint extraction method based on span and knowledge enhancement Active CN112214610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011021524.0A CN112214610B (en) 2020-09-25 2020-09-25 Entity relationship joint extraction method based on span and knowledge enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011021524.0A CN112214610B (en) 2020-09-25 2020-09-25 Entity relationship joint extraction method based on span and knowledge enhancement

Publications (2)

Publication Number Publication Date
CN112214610A true CN112214610A (en) 2021-01-12
CN112214610B CN112214610B (en) 2023-09-08

Family

ID=74052289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011021524.0A Active CN112214610B (en) 2020-09-25 2020-09-25 Entity relationship joint extraction method based on span and knowledge enhancement

Country Status (1)

Country Link
CN (1) CN112214610B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989835A (en) * 2021-04-21 2021-06-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Extraction method of complex medical entities
CN113051356A (en) * 2021-04-21 2021-06-29 深圳壹账通智能科技有限公司 Open relationship extraction method and device, electronic equipment and storage medium
CN113094513A (en) * 2021-04-08 2021-07-09 北京工商大学 Span representation-based end-to-end menu information extraction method and system
CN113204615A (en) * 2021-04-29 2021-08-03 北京百度网讯科技有限公司 Entity extraction method, device, equipment and storage medium
CN113240443A (en) * 2021-05-28 2021-08-10 国网江苏省电力有限公司营销服务中心 Entity attribute pair extraction method and system for power customer service question answering
CN113411549A (en) * 2021-06-11 2021-09-17 上海兴容信息技术有限公司 Method for judging whether business of target store is normal or not
CN113536795A (en) * 2021-07-05 2021-10-22 杭州远传新业科技有限公司 Method, system, electronic device and storage medium for entity relation extraction
CN113627185A (en) * 2021-07-29 2021-11-09 重庆邮电大学 Entity identification method for liver cancer pathological text naming
CN113779260A (en) * 2021-08-12 2021-12-10 华东师范大学 Domain map entity and relationship combined extraction method and system based on pre-training model
CN113791791A (en) * 2021-09-01 2021-12-14 中国船舶重工集团公司第七一六研究所 Business logic code-free development method based on natural language understanding and conversion
CN114611497A (en) * 2022-05-10 2022-06-10 北京世纪好未来教育科技有限公司 Training method of language diagnosis model, language diagnosis method, device and equipment
CN114881038A (en) * 2022-07-12 2022-08-09 之江实验室 Chinese entity and relation extraction method and device based on span and attention mechanism
CN115599902A (en) * 2022-12-15 2023-01-13 西南石油大学(Cn) Oil-gas encyclopedia question-answering method and system based on knowledge graph
US20230153533A1 (en) * 2021-11-12 2023-05-18 Adobe Inc. Pre-training techniques for entity extraction in low resource domains
CN117131198A (en) * 2023-10-27 2023-11-28 中南大学 Knowledge enhancement entity relationship joint extraction method and device for medical teaching library
CN117744657A (en) * 2023-12-26 2024-03-22 广东外语外贸大学 Medicine adverse event detection method and system based on neural network model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019839B (en) * 2018-01-03 2021-11-05 中国科学院计算技术研究所 Medical knowledge graph construction method and system based on neural network and remote supervision
US10706045B1 (en) * 2019-02-11 2020-07-07 Innovaccer Inc. Natural language querying of a data lake using contextualized knowledge bases
CN110597998A (en) * 2019-07-19 2019-12-20 中国人民解放军国防科技大学 Military scenario entity relationship extraction method and device combined with syntactic analysis
CN111339774B (en) * 2020-02-07 2022-11-29 腾讯科技(深圳)有限公司 Text entity relation extraction method and model training method

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094513A (en) * 2021-04-08 2021-07-09 北京工商大学 Span representation-based end-to-end menu information extraction method and system
CN113094513B (en) * 2021-04-08 2023-08-15 北京工商大学 Span representation-based end-to-end menu information extraction method and system
CN112989835B (en) * 2021-04-21 2021-10-08 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Extraction method of complex medical entities
CN113051356A (en) * 2021-04-21 2021-06-29 深圳壹账通智能科技有限公司 Open relationship extraction method and device, electronic equipment and storage medium
CN112989835A (en) * 2021-04-21 2021-06-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Extraction method of complex medical entities
CN113204615A (en) * 2021-04-29 2021-08-03 北京百度网讯科技有限公司 Entity extraction method, device, equipment and storage medium
CN113204615B (en) * 2021-04-29 2023-11-24 北京百度网讯科技有限公司 Entity extraction method, device, equipment and storage medium
CN113240443B (en) * 2021-05-28 2024-02-06 国网江苏省电力有限公司营销服务中心 Entity attribute pair extraction method and system for power customer service question and answer
CN113240443A (en) * 2021-05-28 2021-08-10 国网江苏省电力有限公司营销服务中心 Entity attribute pair extraction method and system for power customer service question answering
CN113411549A (en) * 2021-06-11 2021-09-17 上海兴容信息技术有限公司 Method for judging whether business of target store is normal or not
CN113411549B (en) * 2021-06-11 2022-09-06 上海兴容信息技术有限公司 Method for judging whether business of target store is normal or not
CN113536795A (en) * 2021-07-05 2021-10-22 杭州远传新业科技有限公司 Method, system, electronic device and storage medium for entity relation extraction
CN113536795B (en) * 2021-07-05 2022-02-15 杭州远传新业科技有限公司 Method, system, electronic device and storage medium for entity relation extraction
CN113627185A (en) * 2021-07-29 2021-11-09 重庆邮电大学 Entity identification method for liver cancer pathological text naming
CN113779260A (en) * 2021-08-12 2021-12-10 华东师范大学 Domain map entity and relationship combined extraction method and system based on pre-training model
CN113791791B (en) * 2021-09-01 2023-07-25 中国船舶重工集团公司第七一六研究所 Business logic code-free development method based on natural language understanding and conversion
CN113791791A (en) * 2021-09-01 2021-12-14 中国船舶重工集团公司第七一六研究所 Business logic code-free development method based on natural language understanding and conversion
US20230153533A1 (en) * 2021-11-12 2023-05-18 Adobe Inc. Pre-training techniques for entity extraction in low resource domains
CN114611497B (en) * 2022-05-10 2022-08-16 北京世纪好未来教育科技有限公司 Training method of language diagnosis model, language diagnosis method, device and equipment
CN114611497A (en) * 2022-05-10 2022-06-10 北京世纪好未来教育科技有限公司 Training method of language diagnosis model, language diagnosis method, device and equipment
CN114881038A (en) * 2022-07-12 2022-08-09 之江实验室 Chinese entity and relation extraction method and device based on span and attention mechanism
CN115599902A (en) * 2022-12-15 2023-01-13 西南石油大学(Cn) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN117131198A (en) * 2023-10-27 2023-11-28 中南大学 Knowledge enhancement entity relationship joint extraction method and device for medical teaching library
CN117131198B (en) * 2023-10-27 2024-01-16 中南大学 Knowledge enhancement entity relationship joint extraction method and device for medical teaching library
CN117744657A (en) * 2023-12-26 2024-03-22 广东外语外贸大学 Medicine adverse event detection method and system based on neural network model

Also Published As

Publication number Publication date
CN112214610B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN112214610B (en) Entity relationship joint extraction method based on span and knowledge enhancement
CN110597735B (en) Software defect prediction method for open-source software defect feature deep learning
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN111639171B (en) Knowledge graph question-answering method and device
CN107133220B (en) Geographic science field named entity identification method
CN107766324B (en) Text consistency analysis method based on deep neural network
CN107729468B (en) answer extraction method and system based on deep learning
CN107463658B (en) Text classification method and device
CN107315738B (en) A kind of innovation degree appraisal procedure of text information
CN109189767B (en) Data processing method and device, electronic equipment and storage medium
CN103955451A (en) Method for judging emotional tendentiousness of short text
CN107679110A (en) The method and device of knowledge mapping is improved with reference to text classification and picture attribute extraction
CN112256939A (en) Text entity relation extraction method for chemical field
CN110377690B (en) Information acquisition method and system based on remote relationship extraction
CN111159342A (en) Park text comment emotion scoring method based on machine learning
CN111984790B (en) Entity relation extraction method
CN108304382A (en) Mass analysis method based on manufacturing process text data digging and system
CN112257441A (en) Named entity identification enhancement method based on counterfactual generation
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111274494B (en) Composite label recommendation method combining deep learning and collaborative filtering technology
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN115659947A (en) Multi-item selection answering method and system based on machine reading understanding and text summarization
CN106815209B (en) Uygur agricultural technical term identification method
CN110245234A (en) A kind of multi-source data sample correlating method based on ontology and semantic similarity
CN114547232A (en) Nested entity identification method and system with low labeling cost

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant