CN115269857A - Knowledge graph construction method and device based on document relation extraction - Google Patents

Knowledge graph construction method and device based on document relation extraction

Info

Publication number
CN115269857A
Authority
CN
China
Prior art keywords
entity
document
model
entities
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210470762.2A
Other languages
Chinese (zh)
Inventor
李实
于鸣
李强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Forestry University
Original Assignee
Northeast Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Forestry University filed Critical Northeast Forestry University
Priority to CN202210470762.2A priority Critical patent/CN115269857A/en
Publication of CN115269857A publication Critical patent/CN115269857A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F16/288: Entity relationship models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Abstract

The invention provides a knowledge graph construction method and device based on document relation extraction, belonging to the technical field of knowledge graph construction. It solves the problem that conventional document-level relation extraction methods need a dependency parse tree to construct the document graph, so that errors produced by natural language processing tools propagate downstream. The technology extracts knowledge triples from document paragraphs, converting unstructured data into structured data that is easy to store and understand, and provides technical support for knowledge graph construction. The model constructs the entity document graph directly and uses the self-attention matrix between entities as the adjacency matrix of the document graph, thereby reducing the influence of natural language processing tools. Because the entity self-attention matrix comes from BERT, no complex extra computation is needed, which greatly reduces model parameters and training time. On this basis, a high-quality medical knowledge graph with wide knowledge coverage is constructed.

Description

Knowledge graph construction method and device based on document relation extraction
Technical Field
The invention relates to the technical field of knowledge graph construction, in particular to a knowledge graph construction method and device based on entity self-attention document relation extraction.
Background
Knowledge graphs can summarize knowledge and build associations between pieces of knowledge, so that massive data can be better organized as structured, linked information and stored in graph form, allowing knowledge to be understood and applied by machines. A knowledge graph consists of nodes and edges, i.e., node-edge-node triples: nodes represent real-world entities and edges represent the relations between them, giving machines the ability to analyze problems from a relational perspective and marking an important step from perceptual intelligence toward cognitive intelligence. Knowledge graph-based information retrieval, intelligent question answering and recommendation systems have shown rich value in fields such as e-commerce, finance and healthcare.
The key technologies for constructing a knowledge graph from unstructured text are named entity recognition and relation extraction. However, existing relation extraction mainly focuses on entity relations within a single sentence, and the data sources for medical knowledge graphs are mainly semi-structured knowledge or simple unstructured health sentences from medical websites, so relations between entities in different sentences are ignored. In practical application scenarios this misses a large number of knowledge triples and limits the scale of the final knowledge graph. Moreover, current document-level relation extraction models rely on a dependency parse tree when constructing the document graph, which introduces errors produced by natural language processing tools and degrades the quality of the knowledge graph.
Disclosure of Invention
The invention aims to provide a knowledge graph construction method and device based on entity self-attention document relation extraction that can extract entity triples from documents, reduce the influence of errors produced by natural language processing tools, and finally construct a high-quality medical knowledge graph with wide knowledge coverage, so as to solve the problems described in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
a knowledge graph construction method based on entity self-attention document relation extraction comprises the following steps:
S1, dividing the entity self-attention-based document-level relation extraction model into a coding module, an entity reasoning module, an adaptive threshold module and a classification module;
S2, the coding module encodes the text data into vectors using BERT and extracts entity vectors;
S3, the entity reasoning module realizes information propagation between entities;
S4, the adaptive threshold module is responsible for learning the classification threshold of each category;
S5, the classification module outputs the semantic relation between two entities using the adaptive threshold module;
and S6, selecting a document from the test set, extracting an entity pair to be predicted from the document, and predicting the relationship of the entity pair through the trained model to obtain an entity relationship triple.
The entity self-attention-based document relation extraction model constructs the entity document graph directly from the inter-entity self-attention mechanism, reducing the influence of natural language processing tools. The model directly reuses the self-attention matrix in BERT as the adjacency matrix of the document graph, greatly reducing model parameters and training time. An adaptive threshold strategy yields a learnable threshold, reducing the wrong decisions a model makes under a global threshold strategy.
Preferably, in step S2, D = { D ] for a given documentiH (i =1,2,3 … n), to let the model focus more on entity information, special symbols "+" are inserted at the beginning and end of all mentioned entities to highlight the entities and document D is converted to W = { W according to BERT vocabularyi}. And then, coding the document D by adopting a pre-training language model BERT to obtain a word vector representation H = { H } fused with context informationi} (i =1,2,3 … n), the formula is as follows:
H = BERT(W)
The word vector of the special symbol "+" preceding an entity mention is taken as the vector m of that mention. Since an entity may appear at multiple positions in a document, the invention uses smooth max pooling (logsumexp) to integrate the vectors of all of an entity's mentions into a single entity representation e:
Figure RE-GDA0003853892720000031
Preferably, in step S3, when the document is encoded by the pre-trained language model BERT, the multi-head self-attention modules have already computed self-attention; the invention takes the entries of the final-layer self-attention weight matrix at the positions of the special symbol "+" preceding each mention as the weights between mentions. The weight of an entity with respect to the other entities is the average of the weights of all its mentions:
Figure RE-GDA0003853892720000032
where N is the number of mentions of the current entity. All entity weight vectors a_i are then concatenated to obtain the adjacency matrix A:
A = [a_1; a_2; …; a_N]
The invention aggregates neighbor information with a graph convolutional neural network to obtain multi-hop inference information between entities. Given the node representations h_i^(l-1) at layer l-1, the graph convolution operation is defined as:

h_i^(l) = σ( Σ_j A_ij W^(l) h_j^(l-1) + b^(l) )

where h_i^(l) is the entity representation after aggregating neighbor information at layer l, W^(l) is a linear transformation, b^(l) is a bias term, and σ is the activation function.
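A minimal sketch of this graph construction and of one graph-convolution layer follows, assuming PyTorch and an attention tensor obtained from a HuggingFace BERT called with output_attentions=True; averaging over attention heads and the mention indices are assumptions of this sketch, not details fixed by the patent.

```python
import torch
import torch.nn as nn

def entity_adjacency(last_attn, mention_idx):
    """Build the document-graph adjacency matrix from BERT's final-layer
    self-attention. last_attn: (heads, seq, seq); mention_idx: one list of
    marker-token positions per entity."""
    attn = last_attn.mean(dim=0)            # average over attention heads
    n = len(mention_idx)
    A = torch.zeros(n, n)
    for i, rows in enumerate(mention_idx):
        for j, cols in enumerate(mention_idx):
            # weight of entity i on entity j: mean attention between mentions
            A[i, j] = attn[rows][:, cols].mean()
    return A

class GCNLayer(nn.Module):
    """One graph-convolution layer: h^(l) = sigma(A h^(l-1) W + b)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # W^(l) and b^(l)

    def forward(self, h, A):                # h: (n_entities, dim)
        return torch.relu(self.linear(A @ h))

# Illustrative usage with two entities, one mention marker each:
# last_attn = outputs.attentions[-1].squeeze(0)   # from BERT
# A = entity_adjacency(last_attn, [[1], [5]])
# h_next = GCNLayer(768)(entity_vectors, A)
```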
Preferably, in step S4, the entity type plays an important role in the classification of the entity pair, and the entity type can help to define the range of the relationship class between the entities, so as to reduce the risk of being misclassified. For example, a relationship between "person" and "business/organization" might be "originator" or "affiliated organization", but not likely "son" or "time of arrival", etc. The invention randomly initializes a weight matrix T to represent the embedded representation of the entity type, then uses the one-hot code p to represent the entity type, and the vector of a certain entity type represents T, and the formula is as follows:
t=pT
Given the entity-pair inference vectors (h_s, h_o) computed by the graph convolutional neural network and the original entity vectors (e_s, e_o), where s denotes the head entity and o the tail entity, and unlike approaches that exploit entity type information only implicitly, the method concatenates the original vector, the inference vector and the entity type vector into a joint vector z:
z_s = concatenate(e_s ; h_s ; t_s)
z_o = concatenate(e_o ; h_o ; t_o)
Finally, the joint vectors are passed through a bilinear function and a sigmoid activation to compute the probability of relation r between the entities:
Figure RE-GDA0003853892720000041
where W_r is a weight matrix, b_r is a bias term, and σ is the sigmoid activation.
The invention is trained with binary cross-entropy loss. During inference, a global threshold θ is tuned to maximize the evaluation metric (F1 score) on the test set: if P(r | z_s, z_o) > θ, r is returned as a predicted relation; if no relation exists, NA is returned.
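The classification computation of step S4 (the randomly initialized type embedding t = pT, the joint vectors, and the bilinear scoring with sigmoid) can be sketched as follows; the hidden sizes and the counts of entity types and relation classes are illustrative assumptions of this sketch.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Joint-vector bilinear scorer: P(r|z_s,z_o) = sigmoid(z_s W_r z_o + b_r)."""
    def __init__(self, hidden=768, type_dim=32, n_types=9, n_rels=26):
        super().__init__()
        self.T = nn.Embedding(n_types, type_dim)           # type matrix T, t = pT
        joint = 2 * hidden + type_dim                      # e : h : t concatenated
        self.bilinear = nn.Bilinear(joint, joint, n_rels)  # W_r and b_r

    def forward(self, e_s, h_s, type_s, e_o, h_o, type_o):
        # e_*: (batch, hidden) original vectors; h_*: (batch, hidden) inference
        # vectors; type_*: (batch,) integer entity-type ids
        z_s = torch.cat([e_s, h_s, self.T(type_s)], dim=-1)
        z_o = torch.cat([e_o, h_o, self.T(type_o)], dim=-1)
        return torch.sigmoid(self.bilinear(z_s, z_o))      # (batch, n_rels)
```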
Preferably, in step S5, the relation extraction classifier outputs a predicted label distribution P(r | z_s, z_o) for each entity pair, with values in (0, 1). For the multi-label classification task of relation extraction, the label score corresponding to the best F1 on the precision-recall curve can be used as a global threshold: during actual inference, if the maximum probability in the label distribution exceeds the global threshold, the entity pair is predicted to have relation r; otherwise the model predicts that no relation exists between the pair and outputs the NA class.
However, because different relation classes have different numbers of instances in the training data and entity pairs differ from one another, a global threshold cannot maximize classifier performance. An adaptive threshold method is therefore used to learn a threshold that replaces the global one, reducing decision errors during inference.
The labels of an entity pair T = (e_s, e_o) are divided into two subsets, the positive classes P_T and the negative classes N_T, defined as follows:

Positive classes P_T ⊆ R: the relation classes expressed between the entities of T. If the entity pair expresses no semantic relation, P_T is empty.

Negative classes N_T ⊆ R: the relation classes not expressed between the entities. If T expresses no relation, N_T = R.

Here R = P_T ∪ N_T is the set of relation classes.
The present invention defines a threshold class TH. If the entity pair is correctly classified, the score for the positive category label should be above the threshold, and the score for the negative category label should be below the threshold.
To train the model, a special loss function is needed that takes the TH class into account. The adaptive threshold loss is designed on the basis of cross-entropy loss and consists of two parts:
Figure RE-GDA0003853892720000052
Figure RE-GDA0003853892720000053
L = L_1 + L_2
The first part casts relation classification as a comparison between the positive classes and the TH class: if the model wrongly scores a positive class below the TH class, the L_1 loss grows; otherwise it shrinks. If the entity pair has no positive label, L_1 is 0. The second part casts it as a comparison between the negative classes and the TH class: if the model wrongly scores the TH class below a negative class, the L_2 loss grows; otherwise it shrinks. The training objective is to minimize L so that positive-class scores are above the TH class and negative-class scores below it.
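The two-part loss can be sketched as follows, in the spirit of the positive-vs-TH and negative-vs-TH comparison described above; placing the TH class at column 0 and the multi-hot label layout are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def adaptive_threshold_loss(logits, labels, th_index=0):
    """logits: (batch, n_classes) including the TH class at column th_index.
    labels: (batch, n_classes) multi-hot positive labels; TH column is 0."""
    th_mask = torch.zeros_like(labels)
    th_mask[:, th_index] = 1.0

    # L1: positive classes vs TH -- push every positive score above TH.
    # Mask with a large negative constant (not -inf) so masked log-probs
    # multiply cleanly by zero labels.
    mask1 = labels + th_mask
    logits1 = logits.masked_fill(mask1 == 0, -1e30)
    L1 = -(F.log_softmax(logits1, dim=-1) * labels).sum(dim=-1)  # 0 if no positives

    # L2: TH vs negative classes -- push TH above every negative score.
    mask2 = 1.0 - labels                     # negatives plus the TH column
    logits2 = logits.masked_fill(mask2 == 0, -1e30)
    L2 = -(F.log_softmax(logits2, dim=-1) * th_mask).sum(dim=-1)

    return (L1 + L2).mean()

def predict(logits, th_index=0):
    """Return classes scoring above the TH class; an all-zero row means NA."""
    th = logits[:, th_index:th_index + 1]
    preds = (logits > th).float()
    preds[:, th_index] = 0.0
    return preds
```

The predict helper mirrors the test-time rule of step S6: classes scoring above the TH class are returned as positive classes, and an all-zero row corresponds to NA.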
Preferably, in step S6, at test time any class scoring above the TH class is returned as a positive class; if no such class exists, NA is returned. Because the threshold class learns an entity-pair-dependent threshold, no threshold tuning on the development set is needed.
The invention also provides a knowledge graph construction device based on entity self-attention document relation extraction, which comprises:
The data acquisition module acquires data from the Internet using crawler technology, including medical literature, popular-science articles and doctors' answers. Knowledge cannot be extracted from such data with simple rules, so medical entity triples must be extracted automatically by deep learning. Because these data tend to be paragraphs rather than single sentences, the entity triples are extracted with a document-level relation extraction model.
The data labeling module manually labels the acquired unstructured data; the labeled content comprises entities and their types as well as the semantic relations between entity pairs. The labeled data are used to train the named entity recognition model and the document-level relation extraction model.
The named entity recognition module trains the named entity recognition model on the training set.
The document-level relation extraction module trains the relation extraction model on the training set.
The triple extraction module extracts entity triples from unlabeled unstructured text using the trained named entity recognition model and relation extraction model.
The knowledge graph construction module stores the extracted triples in the Neo4j graph database to build the knowledge graph and displays it visually, as sketched below.
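A minimal sketch of the storage step follows, assuming the official neo4j Python driver, a locally running database and placeholder credentials; the example triple is illustrative.

```python
from neo4j import GraphDatabase

# (head entity, relation, tail entity) as produced by the extraction pipeline.
triples = [("糖尿病", "临床表现", "多饮多尿")]

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    for head, rel, tail in triples:
        # MERGE is idempotent: repeated extractions do not duplicate nodes.
        session.run(
            "MERGE (h:Entity {name: $h}) "
            "MERGE (t:Entity {name: $t}) "
            "MERGE (h)-[:REL {type: $r}]->(t)",
            h=head, r=rel, t=tail,
        )
driver.close()
```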
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a knowledge graph construction method and device based on entity self-attention document relation extraction in which the coding module encodes text data into vectors using BERT, the entity reasoning module realizes information propagation between entities, the adaptive threshold module is responsible for learning per-class classification thresholds, and the classification module outputs the semantic relation between two entities using the adaptive threshold module. To address the problem that existing models depend on a syntactic dependency parse tree to construct the document graph, so that errors produced by NLP tools propagate downstream, the entity document graph is constructed from the self-attention mechanism inside the BERT model; the model can thus dynamically adjust the weight dependencies between entities in the document graph, increasing the propagation of relevant information and reducing interference from irrelevant information. To reduce the decision errors caused by a global threshold, an adaptive threshold strategy learns a variable threshold, ultimately improving the performance of the document-level relation extraction model; on this basis, a knowledge graph with wider knowledge coverage and higher quality is constructed.
Drawings
FIG. 1 is a schematic diagram of a model structure of a knowledge graph construction method based on entity self-attention document relation extraction in embodiment 1 of the present invention;
FIG. 2 is a flow chart of the knowledge graph construction device based on entity self-attention document relation extraction in embodiment 2 of the present invention;
FIG. 3 is a schematic diagram of a knowledge graph built by the device based on entity self-attention document relation extraction in embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1: a verification of a knowledge graph construction method based on entity self-attention Document relation Extraction (ESAN).
1. Experimental method
The overall structure of the ESAN model is shown in FIG. 1. Steps S2 to S6 are implemented exactly as described above: the document is encoded with BERT using entity-mention markers, entity vectors are obtained by smooth max pooling, the entity document graph is built from the final-layer self-attention weights and processed with the graph convolutional network, entity type vectors are concatenated into the joint representation, and relations are classified with the adaptive threshold strategy.
2. Results of the experiment
The experiments evaluate the model on the HacRED dataset. HacRED is derived from CN-DBpedia and is the largest Chinese document-level relation extraction dataset; it is divided into training, development and test sets of 6231, 1500 and 1500 documents respectively, 9231 documents in total, with 65255 entity-relation instances, 26 predefined relations and 9 entity types, and its data-quality F1 score reaches 96%. HacRED has a relatively balanced relation-class distribution, with the top 20% of relation classes accounting for 49.96% of all examples. The dataset also contains many hard cases, which narrows the gap between experimental and real-world performance and makes models more robust in practically difficult scenarios.
To evaluate the performance of the proposed ESAN model, four published methods were selected for comparison. The main results are shown in Table 1.
TABLE 1 Performance of the models on HacRED
Figure RE-GDA0003853892720000111
To further verify that constructing the entity document graph from self-attention is better than constructing it from a dependency parse tree, the following comparison experiment was performed; the results are shown in Table 2.
Table 2 performance of comparative models on HacRED datasets
Figure RE-GDA0003853892720000112
For comparison, words on the shortest dependency paths between all entities in the syntactic dependency parse tree are extracted; these words and all entity mentions serve as the nodes of the document graph, with identical weights between nodes, while the rest of the model is kept consistent with ESAN. This model is denoted SDN (Syntactic Dependency Network). Replacing the uniform weights in the SDN model with BERT's entity self-attention matrix as the adjacency weights of the document graph yields the model SDAN (Syntactic Dependency Attention Network); the shortest-path construction is sketched below.
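For reference, the shortest-dependency-path node selection used by the SDN baseline can be sketched as follows; the dependency parse is abstracted into a plain edge list, and all token indices are illustrative assumptions rather than real parser output.

```python
# Given dependency edges (head index, child index) from any parser, collect
# the token indices on the shortest dependency path between two mentions.
import networkx as nx

def shortest_dependency_path(edges, source, target):
    """Return token indices on the shortest path between two tokens."""
    g = nx.Graph()          # dependencies treated as undirected for path search
    g.add_edges_from(edges)
    return nx.shortest_path(g, source=source, target=target)

# Toy parse: entity mentions at tokens 0 and 5.
edges = [(1, 0), (1, 2), (1, 4), (4, 3), (4, 5)]
print(shortest_dependency_path(edges, 0, 5))   # [0, 1, 4, 5]
```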
To verify the effectiveness of the proposed entity self-attention document graph construction and of the explicit use of entity type information, an ablation analysis was performed. The "ESAN − entity attention" variant removes the entity self-attention module, so that the weight dependencies between the nodes of the document graph are kept uniform; the "ESAN implicit entity type" variant encodes and uses entity type information only implicitly. The ablation results are shown in Table 3.
TABLE 3 ESAN ablation study on HacRED
Figure RE-GDA0003853892720000121
3. Conclusion
Comparison experiments with other models on the public HacRED dataset, the comparison of document-graph construction methods, and the ablation study all demonstrate the effectiveness of constructing the document graph with self-attention; the model achieves the highest F1 score on HacRED, reaching 79.46%.
Example 2: a knowledge graph construction device based on entity self-attention document relation extraction.
1. Experimental methods
Semi-structured data on medical websites can be extracted into structured triple knowledge by fixed rules, but much data on the Internet is unstructured, such as medical literature, popular-science articles and doctors' answers; knowledge cannot be extracted from such data with simple rules, and medical entity triples must be extracted automatically with deep learning. The knowledge graph construction process is shown in FIG. 2.
The medical entities in the unstructured text collected by the crawler are unannotated, so the text must be labeled manually; the labeled dataset is then used to train the deep learning models so that medical entities can be recognized automatically. Unstructured medical data on the Internet come as paragraphs rather than single sentences and involve many entities, so single-sentence relation extraction is not applicable.
Data annotation: following the knowledge framework of professional medical websites, medical entities are divided into seven categories: disease, department, treatment method, examination method, symptom, body part and drug.
The relations between medical entities are divided into eight categories: disease site, complication, synonym, clinical manifestation, treatment method, examination method, treating department and therapeutic drug.
The data are labeled using the BIOSE sequence labeling scheme.
A BiLSTM-CNN-CRF network is used as the named entity recognition model to identify medical entities. The model combines the strengths of the CNN in capturing local semantics and the BiLSTM in capturing bidirectional context, and uses a conditional random field to constrain the entity labels.
Before text is fed into the model, the data are preprocessed for the named entity recognition task to obtain the entity labels it requires. For an example sentence describing visual fatigue and dry eye treated by eye drops to relieve fatigue, hot compresses, and an orally taken vision-restoring tablet to improve eyesight, preprocessing yields the label sequence O O O O O O O B-SYM I-SYM E-SYM O B-DIS E-DIS O O O B-DRU I-DRU I-DRU I-DRU I-DRU I-DRU I-DRU E-DRU O O O O O O O B-DRU I-DRU E-DRU O O O O O O, which serves as the training labels for the medical entity recognition task.
The input layer of the model uses only pre-trained word vectors, without any handcrafted features. The encoding layer first uses a CNN to obtain the local context semantics of each word, then uses a BiLSTM to model the word's context from both directions. During decoding, the conditional random field layer searches for the optimal label sequence through the dependencies between labels; for example, the label I-DRU (inside a drug name) cannot appear immediately before B-DIS (beginning of a disease name), since in BIOSE an I- tag must be followed by an I- or E- tag of the same type.
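How entity spans are recovered from such a BIOSE tag sequence can be sketched as follows; the tag list is an abbreviated, illustrative fragment rather than the full example above.

```python
def decode_biose(tags):
    """Return (start, end, type) spans from BIOSE tags (end inclusive)."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            start = i                              # entity begins
        elif tag.startswith("E-") and start is not None:
            spans.append((start, i, tag[2:]))      # entity ends
            start = None
        elif tag.startswith("S-"):
            spans.append((i, i, tag[2:]))          # single-token entity
        elif tag == "O":
            start = None                           # reset on outside tag
    return spans

tags = ["O", "B-SYM", "I-SYM", "E-SYM", "O", "B-DIS", "E-DIS"]
print(decode_biose(tags))   # [(1, 3, 'SYM'), (5, 6, 'DIS')]
```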
Document-level relation extraction uses the ESAN model. The input layer of the model maps each word of a passage to its vocabulary id, together with the start position of each entity in the text, the corresponding entity type, and the semantic relations between the entities.
The encoding layer of the model obtains vectorized word representations with BERT, derives the original vector of each entity from its positional information by averaging, and extracts the entity self-attention matrix from the last self-attention layer of BERT.
The inference layer of the model constructs the entity document graph from the entity vector representations and the entity self-attention matrix, then uses a graph convolutional neural network to propagate information between entities, obtaining the entity inference vector representations.
The classification output layer of the model concatenates the entity original vectors, inference vectors and entity type vectors and feeds them into a bilinear function and sigmoid to obtain the final relation class between any two entities.
2. Results of the experiment
Training yields a medical entity recognition model and a medical relation extraction model; the two models are applied in series to extract medical entity triples from unlabeled text. The medical entity recognition model can also serve as the question-understanding module of an automatic question-answering system, extracting key medical entity information from natural-language questions so that the relevant answers can be queried from the knowledge graph.
3. Conclusion
In summary, the knowledge graph construction method based on entity self-attention document relation extraction can produce a knowledge graph of better quality; an example is shown in FIG. 3.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of the invention; all equivalent changes and modifications made according to the shape, structure, characteristics and spirit described in the claims of the present invention shall be included in the scope of the present invention.

Claims (7)

1. A knowledge graph construction method based on document relation extraction, characterized by comprising the following steps:
S1, dividing a document-level relation extraction model based on entity self-attention into a coding module, an entity reasoning module, an adaptive threshold module and a classification module;
S2, the coding module encodes the text data into vectors using BERT and extracts entity vectors;
S3, the entity reasoning module realizes information propagation between entities;
S4, the adaptive threshold module is responsible for learning the classification threshold of each category;
S5, the classification module outputs the semantic relation between the two entities using the adaptive threshold module;
and S6, selecting a document from the test set, extracting an entity pair to be predicted from the document, and predicting the relationship of the entity pair through the trained model to obtain an entity relationship triple.
2. The method for constructing a knowledge graph based on document relation extraction according to claim 1, wherein: special symbols are inserted at the beginning and the end of all the mentioned entities in the document to highlight the entities, the document is converted according to a BERT word list, then the document is coded by the BERT, word vector representations of the special symbols before the mentioned entities are used as word vector representations of the mentioned entities, and the vector representations of a plurality of the mentioned entities of one entity are integrated by smooth maximum pooling.
3. The method for constructing a knowledge graph based on document relation extraction according to claim 2, wherein: the adjacency weight matrix of the document graph is constructed from the BERT final-layer self-attention weight matrix, and a graph convolutional neural network aggregates neighbor information to obtain multi-hop inference information between entities.
4. The method for constructing a knowledge graph based on document relation extraction according to claim 3, wherein: the original vector, the inference vector and the entity type vector are concatenated into a joint vector, which is then passed through a bilinear function and sigmoid to compute the probability of the entity relation.
5. The method for constructing a knowledge graph based on document relation extraction according to claim 4, wherein: to train the model, a special loss function is used to learn the adaptive threshold, reducing the decision errors the model makes when using a global threshold.
6. The method for constructing a knowledge graph based on document relation extraction according to claim 5, wherein: in order to verify the effectiveness of the model, a comparison experiment with other models is carried out on a public data set HacRED, the model is trained by utilizing a training set, in a testing stage, documents are selected from the testing set, entities in the documents are extracted, the semantic relation between the entities is predicted by using the model, if the score of a certain category of an entity pair to be predicted is higher than a threshold value, the semantic relation of the current entity pair is judged to be the category, if the category which is higher than the threshold value does not exist, the semantic relation between the two entities is judged not to exist, and the output is None.
7. A knowledge graph construction device based on document relation extraction, characterized in that the device comprises:
the data acquisition module, used for acquiring data from the Internet with crawler technology, including medical literature, popular-science articles and doctors' answers; knowledge cannot be extracted from such data with simple rules, and medical entity triples must be extracted automatically by deep learning; since these data are often paragraphs rather than single sentences, the entity triples are extracted with the document-level relation extraction model;
the data labeling module is used for manually labeling the acquired unstructured data, the labeled content comprises entities and types thereof and semantic relations among entity pairs, and the labeled data is used for training a named entity recognition model and a document level relation extraction model;
the named entity recognition module is used for training a named entity recognition model according to a training set;
the document-level relation extraction module, used for training the relation extraction model according to the training set;
the triple extraction module is used for extracting entity triples from the unlabeled unstructured text by using the trained named entity recognition model and the trained relation extraction model;
and the knowledge graph construction module is used for storing the extracted triples in the Neo4j graph database to realize the construction of the knowledge graph and visually displaying the knowledge graph.
CN202210470762.2A 2022-04-28 2022-04-28 Knowledge graph construction method and device based on document relation extraction Pending CN115269857A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210470762.2A CN115269857A (en) 2022-04-28 2022-04-28 Knowledge graph construction method and device based on document relation extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210470762.2A CN115269857A (en) 2022-04-28 2022-04-28 Knowledge graph construction method and device based on document relation extraction

Publications (1)

Publication Number Publication Date
CN115269857A true CN115269857A (en) 2022-11-01

Family

ID=83759711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210470762.2A Pending CN115269857A (en) 2022-04-28 2022-04-28 Knowledge graph construction method and device based on document relation extraction

Country Status (1)

Country Link
CN (1) CN115269857A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116340530A (en) * 2023-02-17 2023-06-27 江苏科技大学 Intelligent design method based on mechanical knowledge graph
CN116484010A (en) * 2023-03-15 2023-07-25 北京擎盾信息科技有限公司 Knowledge graph construction method and device, storage medium and electronic device
CN116484010B (en) * 2023-03-15 2024-01-16 北京擎盾信息科技有限公司 Knowledge graph construction method and device, storage medium and electronic device
CN116521888A (en) * 2023-03-20 2023-08-01 麦博(上海)健康科技有限公司 Method for extracting medical long document cross-sentence relation based on DocRE model
CN116226408A (en) * 2023-03-27 2023-06-06 中国科学院空天信息创新研究院 Agricultural product growth environment knowledge graph construction method and device and storage medium
CN116226408B (en) * 2023-03-27 2023-12-19 中国科学院空天信息创新研究院 Agricultural product growth environment knowledge graph construction method and device and storage medium
CN116611813A (en) * 2023-05-08 2023-08-18 武汉人云智物科技有限公司 Intelligent operation and maintenance management method and system based on knowledge graph
CN116611813B (en) * 2023-05-08 2024-03-29 武汉人云智物科技有限公司 Intelligent operation and maintenance management method and system based on knowledge graph
CN116431757B (en) * 2023-06-13 2023-08-25 中国人民公安大学 Text relation extraction method based on active learning, electronic equipment and storage medium
CN116431757A (en) * 2023-06-13 2023-07-14 中国人民公安大学 Text relation extraction method based on active learning, electronic equipment and storage medium
CN116975256A (en) * 2023-07-28 2023-10-31 三峡大学 Method and system for processing multisource information in construction process of underground factory building of pumped storage power station
CN116975256B (en) * 2023-07-28 2024-01-16 三峡大学 Method and system for processing multisource information in construction process of underground factory building of pumped storage power station
CN117151117A (en) * 2023-10-30 2023-12-01 国网浙江省电力有限公司营销服务中心 Automatic identification method, device and medium for power grid lightweight unstructured document content
CN117151117B (en) * 2023-10-30 2024-03-01 国网浙江省电力有限公司营销服务中心 Automatic identification method, device and medium for power grid lightweight unstructured document content
CN117194682A (en) * 2023-11-07 2023-12-08 国网浙江省电力有限公司营销服务中心 Method, device and medium for constructing knowledge graph based on power grid related file
CN117194682B (en) * 2023-11-07 2024-03-01 国网浙江省电力有限公司营销服务中心 Method, device and medium for constructing knowledge graph based on power grid related file

Similar Documents

Publication Publication Date Title
CN115269857A (en) Knowledge graph construction method and device based on document relation extraction
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111415740B (en) Method and device for processing inquiry information, storage medium and computer equipment
CN111291185B (en) Information extraction method, device, electronic equipment and storage medium
CN111950285B (en) Medical knowledge graph intelligent automatic construction system and method with multi-mode data fusion
CN110737763A (en) Chinese intelligent question-answering system and method integrating knowledge map and deep learning
CN111078875B (en) Method for extracting question-answer pairs from semi-structured document based on machine learning
CN111708873A (en) Intelligent question answering method and device, computer equipment and storage medium
CN114064918B (en) Multi-modal event knowledge graph construction method
CN112650840A (en) Intelligent medical question-answering processing method and system based on knowledge graph reasoning
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
CN110738057A (en) text style migration method based on grammatical constraint and language model
US11170169B2 (en) System and method for language-independent contextual embedding
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
CN112364132A (en) Similarity calculation model and system based on dependency syntax and method for building system
CN115438674B (en) Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN112632250A (en) Question and answer method and system under multi-document scene
CN112925918A (en) Question-answer matching system based on disease field knowledge graph
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN115659947A (en) Multi-item selection answering method and system based on machine reading understanding and text summarization
CN117648429B (en) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model
CN111507093A (en) Text attack method and device based on similar dictionary and storage medium
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
CN111859910B (en) Word feature representation method for semantic role recognition and fusing position information
CN111813924B (en) Category detection algorithm and system based on extensible dynamic selection and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination