CN114021584B - Knowledge representation learning method based on graph convolution network and translation model - Google Patents

Knowledge representation learning method based on graph convolution network and translation model

Info

Publication number
CN114021584B
CN114021584B (application CN202111240396.3A)
Authority
CN
China
Prior art keywords
entity
representation
entities
knowledge
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111240396.3A
Other languages
Chinese (zh)
Other versions
CN114021584A (en)
Inventor
周惠巍
李雪菲
徐奕斌
姜海斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202111240396.3A priority Critical patent/CN114021584B/en
Publication of CN114021584A publication Critical patent/CN114021584A/en
Application granted granted Critical
Publication of CN114021584B publication Critical patent/CN114021584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

A knowledge representation learning method based on a graph convolution network and a translation model first learns entity and relation representations in a knowledge base using a translation model. Then, with the knowledge base as a guide, distant supervision is used to obtain the entities of biomedical text and their relation labels. Entity representations in the text are then learned using GCGCN. Finally, the entity representations in the knowledge base and the text are aligned, so that the representations learned from the knowledge base and from the distantly supervised text coexist in the same vector space. Based on the translation model and the graph convolution network, the invention effectively fuses the knowledge base with large-scale distantly supervised text information, realizes multi-source information fusion, obtains high-quality knowledge representations, and improves the performance of biomedical relation extraction models. Structured knowledge in the knowledge base is learned with the translation model, contextual knowledge in the large-scale distantly supervised text is learned with the graph convolution network, and high-quality knowledge representations are finally obtained by fusing the multi-source knowledge through entity alignment.

Description

Knowledge representation learning method based on graph convolution network and translation model
Technical Field
Based on a graph convolution network (Graph Convolutional Networks, GCN) and a translation model, the invention fuses the triples in the knowledge graph with the context in large-scale distantly supervised text to perform knowledge representation learning. First, a knowledge representation is learned from knowledge base triples using a translation model. Then, a graph convolution network is used to learn the entities in large-scale biomedical text obtained by distant supervision. Finally, the entities in the knowledge base and in the biomedical text are aligned, realizing entity fusion based on the knowledge base and large-scale distantly supervised text information. The invention is mainly intended for biomedical relation extraction tasks in the field of natural language processing.
Background
With the rapid development of computer technology and biotechnology, the literature in the biomedical field is growing exponentially. Researchers are eager to reveal the biomedical knowledge contained in massive biomedical documents, promote biomedical development, and improve people's quality of life. This demand has driven the creation and development of biomedical information extraction technologies.
The vast biomedical literature contains abundant and valuable knowledge. Meanwhile, researchers in the biomedical field have spent a great deal of effort studying and constructing large-scale, high-quality biomedical knowledge bases. Biomedical knowledge bases provide powerful entity semantics and entity relation knowledge resources for biomedical information extraction, and are important knowledge resources for promoting intelligent medical development. In recent years, techniques for learning representations of the entities and relations in a knowledge base have received a great deal of attention.
Existing knowledge-base-based knowledge representation learning methods simply learn entity and relation representations from the knowledge base using a translation model or the like. Knowledge representation learning based solely on a knowledge base lacks the entity and relation information contained in large-scale biomedical text.
Thus, researchers merge knowledge base and text information to improve knowledge representation capability: entity and relation representations in the knowledge base are learned with a translation model or the like, while sentence representations describing the relation between two entities are learned with a convolutional neural network. Finally, the knowledge base and text entities and their relation representations are aligned, realizing knowledge representation based on the fusion of knowledge base and text information.
However, the expression of entity relations in biomedical text is complex, including both intra-sentence and inter-sentence entity relations. Therefore, knowledge representation learning that merges knowledge base and text information must consider the entity relations of document-level text.
Moreover, entity relation annotated corpora are scarce in the biomedical field; to obtain large-scale entity relation annotations, distant supervision is generally used to label large-scale unlabeled biomedical corpora. However, the model cannot determine which sentence in the sentence set (bag) corresponding to a relation instance actually expresses the relation: a sentence that does not express a relation may be treated in modeling as one that does, or vice versa. To avoid introducing noisy data, an attention mechanism over an entity pair is employed to learn a weight for each sentence in the document; all sentences are then weighted and summed to obtain a document representation of the entity pair. This approach fails to jointly learn the semantic representations of entities, entity relations, and sentences over all sentences and entities in a document.
In recent years, researchers have applied graph convolution networks to document-level relation extraction tasks and achieved good entity relation extraction performance. It is therefore worth exploring how to use the graph convolution network to mine the semantic information of document-level entities, entity relations, and sentences while simultaneously integrating the entities and relation information of the biomedical knowledge base, so as to realize high-quality knowledge representation based on the fusion of knowledge base and text information.
Disclosure of Invention
In view of the problems of existing methods, the invention provides a method (GCGCN-TransE) that combines a graph convolution network and a translation model to learn knowledge representations, obtaining knowledge representations based on the fusion of knowledge base and text information.
First, based on the knowledge base, entity and relation representations are learned using a translation model. Then, with the knowledge base as a guide, distant supervision is used to obtain the entities of biomedical text and their relation labels. Next, GCGCN (Zhou et al., Global Context-enhanced Graph Convolutional Networks for Document-level Relation Extraction, COLING 2020) is employed to learn the entity representations in the text.
In biomedical knowledge representation learning, the invention can integrate information about entities from both the knowledge base and large-scale text, realizing knowledge representation learning based on multi-source information and improving knowledge representation capability.
The technical scheme of the invention is as follows:
Knowledge representation learning based on a graph convolution network and a translation model comprises the following steps:
Step one: biomedical text entity relation annotation based on distant supervision
An entity recognizer automatically identifies biomedical entities in the large-scale unlabeled corpus; with the biomedical knowledge base as a guide, distant supervision is used to label the entity relations in the large-scale unlabeled corpus.
Step two: feature sequence construction
Word vectors are encoded using the BioBERT pre-trained language model.
Step three: knowledge representation for learning knowledge base based on translation model
And learning entity and relation representations in the biomedical knowledge base triples (h, r, t) by adopting a translation model.
Step four: GCGCN-based learning of knowledge representation of large-scale remote supervision corpus
The multi-layer graph convolution can solve a great number of cross-sentence multi-hop reasoning problems in the document level relation extraction. To collect rich global information, the node and edge representations are learned using a multi-layer graph rolling operation.
Step five: entity fusion based on knowledge base and biomedical text information
And aligning entity representations in the knowledge base and the text, and realizing knowledge representation of multi-source heterogeneous information fusion. The knowledge base and the learned entity representations in the text are made to coexist in the same vector space.
The invention has the beneficial effects that: the invention effectively fuses the knowledge base and the large-scale remote supervision text information based on the translation model and the graph convolution network, realizes multi-source information fusion, acquires high-quality knowledge representation, and improves the performance of the biomedical relation extraction model. And learning the structured knowledge in the knowledge base based on the translation model, learning the context knowledge in the large-scale remote supervision text based on the graph convolution network, and finally obtaining high-quality knowledge representation through entity alignment fusion of the multi-source knowledge.
Drawings
Fig. 1 is a basic flow diagram of a system.
FIG. 2 is an example document level entity interaction graph construction.
FIG. 3 is an example of entity alignment in a knowledge base and text.
Detailed Description
The knowledge base of the invention is the Comparative Toxicogenomics Database (CTD), a knowledge base containing knowledge of drug-gene, drug-disease, and gene-disease relationships, among others. In the experiments, the CTD knowledge base is used to obtain the relationships between disease and drug entities in the large-scale unlabeled corpus, with a focus on drug-induced disease relationships.
The specific steps of the invention are further described below with reference to fig. 1 and the technical scheme:
Step one: labeling all drug entities and disease entities in the PubMed abstract and corresponding MeSH IDs thereof by using a text mining tool PubTator(Wei C H,Kao H Y,Lu Z.PubTator:aweb-based text mining tool for assisting biocuration[J].Nucleic acids research,2013,41(W1):W518-W522.); the entity relationship in the large-scale unlabeled corpus is labeled by adopting remote supervision with a comparative toxicological genomics database (Comparative Toxicogenomics Database, CTD) as a guide. For all entity pairs in a document, if a certain pair of entities has a certain relation in a knowledge base, the pair of entities in the document are considered to have the relation, and the relation of the pair of entities is marked.
Step two: the word vector is encoded by utilizing BioBERT pre-training language model, the input text is required to be processed into BioBERT input form, namely, a special identifier [ CLS ] is added at the head end of the text, a special separator [ SEP ] is added at the tail end of each sentence, and word segmentation processing is carried out on the input sequence. Finally, learning the segmented input sequence through BioBERT pre-training language model, extracting hidden layer representation output by the last layer of network as word vector, wherein the word vector of the ith segmented word is
Constructing an entity type matrix E type and a co-index matrix E corf by a random initialization method, and performing label mapping on each word in a text sequence to obtain a corresponding type feature vectorAnd co-index feature vectorWherein t i and c i are category labels and co-index labels of the ith word.
The obtained word vector, the type feature vector and the common-finger feature vector are spliced to construct the features finally input to the context semantic encoder, and the formula is as follows:
Wherein "; "is a vector concatenation operation, the dimension of the final feature is d=d w+dt+dc.
Step three: knowledge representation for learning knowledge base based on translation model
Knowledge representation of the knowledge base triplet (h, r, t) is learned using a translation model. e h、et、er are representations of head, tail entities and relationships, respectively. The translation model defines an energy function d (·) that can measure how well a set relationship between an entity and a relational representation is satisfied, and a loss function L k is represented as:
Wherein gamma > 0 is the boundary, S is the triplet set of the knowledge base, and S' is the negative example set of the entity relationship. Learning obtains a representation e h、et、er of head, tail entities and relationships based on the knowledge base.
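The margin-based objective above can be sketched as follows; the batch size, embedding dimension, margin value, and the random embeddings standing in for learned parameters are illustrative assumptions.

```python
# Minimal TransE sketch for Step Three (sizes and margin are illustrative).
import torch
import torch.nn.functional as F

def transe_loss(e_h, e_r, e_t, e_h_neg, e_t_neg, gamma=1.0):
    """Margin-based ranking loss: positive triples (h, r, t) should satisfy
    e_h + e_r ≈ e_t better than corrupted triples by at least margin gamma."""
    d_pos = torch.norm(e_h + e_r - e_t, p=2, dim=-1)
    d_neg = torch.norm(e_h_neg + e_r - e_t_neg, p=2, dim=-1)
    return F.relu(gamma + d_pos - d_neg).mean()

dim = 100
e_h, e_r, e_t = (torch.randn(32, dim, requires_grad=True) for _ in range(3))
e_h_neg, e_t_neg = (torch.randn(32, dim) for _ in range(2))   # corrupted head/tail
loss = transe_loss(e_h, e_r, e_t, e_h_neg, e_t_neg)
loss.backward()
print(float(loss))
```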
Step four: learning knowledge representations of the large-scale distant supervision corpus based on GCGCN
For each document-level input sample, the input data of the graph structure is constructed. Each input sample labels an entity set {e_1, e_2, …, e_N}, where N is the number of entities. An entity interaction graph is constructed by the following two rules: each entity in the entity set is a node in the graph, i.e., the graph has N nodes; if mentions of two entities occur in the same sentence, the nodes representing those two entities are connected by an undirected edge.
The constructed entity interaction graph is denoted G(A, E), where A is the adjacency matrix: A_ij = 1 if there is an edge between node i and node j, otherwise A_ij = 0.
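A small sketch of this graph construction, assuming the sentence indices of each entity's mentions are already available from the tagger (the input format is hypothetical):

```python
# Sketch of the entity interaction graph of Step Four: nodes are entities,
# and an undirected edge links two entities whose mentions co-occur in a sentence.
import numpy as np

def build_adjacency(mentions, n_entities):
    """mentions: dict mapping entity index -> set of sentence indices."""
    A = np.zeros((n_entities, n_entities), dtype=int)
    for u in range(n_entities):
        for v in range(u + 1, n_entities):
            if mentions[u] & mentions[v]:      # co-occur in at least one sentence
                A[u, v] = A[v, u] = 1
    return A

mentions = {0: {0, 2}, 1: {2}, 2: {5}}         # entity -> sentences containing a mention
print(build_adjacency(mentions, 3))
# [[0 1 0]
#  [1 0 0]
#  [0 0 0]]
```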
GCGCN comprises four layers: an embedding layer, a context-aware attention-guided graph convolution (Context-aware Attention Guided Graph Convolution, CAGGC) module, a multi-head attention-guided graph convolution (Multi-head Attention Guided Graph Convolution, MAGGC) module, and a relation classification layer.
Embedding layer
Word vectors are encoded using the BioBERT pre-trained language model. Given a document D = {s_1, s_2, …, s_S}, the encoded word vector sequence is {q_{i,j}}, where q_{i,j} ∈ R^{d_w} is the word vector of the j-th word in the i-th sentence and d_w is the vector dimension.
The word vector q_{i,j}, the entity type vector t_{i,j}, and the coreference vector c_{i,j} are concatenated to obtain the final word vector sequence:
x_{i,j} = [q_{i,j}; t_{i,j}; c_{i,j}]
Since an entity may have multiple mentions, and a mention may contain multiple words, an averaging operation is used to compute the entity representation, denoted P^(0):
P_v^(0) = (1/J) Σ_{q=1}^{J} (1/(t - s + 1)) Σ_{j=s}^{t} x_j
where P_v^(0) is the representation of entity e_v, J is the number of its mentions, m_q is the q-th mention of e_v, and s and t are its start and end positions.
Context-aware attention-guided graph convolution (Context-aware Attention Guided Graph Convolution, CAGGC) module
An attention mechanism and a gating mechanism are used to compute entity-aware edge representations containing rich context information. The computed edge representations then guide the generation of a weighted adjacency matrix, and finally the node representations are updated over multiple densely connected graph convolution sublayers.
Because an edge may be associated with multiple context sentences, to compute the representation of the edge between node u and node v, a word-level attention mechanism is first used to obtain the representation of each sentence, and a gating mechanism then fuses the information of the multiple sentences into an entity-aware edge representation.
The representation h_i of the i-th sentence on edge uv is first computed from the word vector of each word and its relative distance to the given entity:
α_{i,j}^c = softmax_j(z⊤ tanh(W_1 x_{i,j} + W_2 d_{j,c} + b_1))
h_i^c = Σ_{j=1}^{m} α_{i,j}^c x_{i,j}
where c ∈ {u, v} denotes either of the two entities, d_{j,c} is the relative distance vector between the current word and entity c, α_{i,j}^c is the attention weight of the j-th word in the i-th sentence as perceived by entity c, m is the number of words in the i-th sentence, and W_1, W_2, z, and b_1 are trainable parameters.
Word-level attention is computed over the i-th sentence with entities u and v respectively, yielding two sentence representations h_i^u and h_i^v. These are concatenated and fed into a fully connected layer to obtain a sentence representation h_i that perceives entities u and v simultaneously:
h_i = tanh(W_s [h_i^u; h_i^v] + b_s)
where W_s and b_s are trainable parameters.
To make the model consider the information of all sentences on edge uv, an entity-aware gating mechanism is adopted. For entity c ∈ {u, v}, its initial representation P_c^(0) is used to compute the weight of each sentence, and the weighted sum of all sentences serves as the edge representation:
β_i^c = σ(W_4 h_i + W_5 P_c^(0) + b_2)
E_{uv}^c = Σ_{i=1}^{S} β_i^c ⊙ (W_3 h_i)
where σ(·) is the sigmoid or ReLU activation function, W_3, W_4, W_5, and b_2 are trainable parameters, and S is the total number of sentences.
The edge representations E_{uv}^u and E_{uv}^v perceived by entities u and v are concatenated and fed into a fully connected layer to obtain an edge representation that perceives both entities simultaneously:
E_{uv} = tanh(W_sg [E_{uv}^u; E_{uv}^v] + b_sg)
where ";" denotes the concatenation operation and W_sg and b_sg are trainable parameters.
Through the above calculation, the initial edge representation matrix E^(1) of the CAGGC network is obtained. The proposed entity-aware gating mechanism has two characteristics. First, the representations of the two entities are introduced when computing the gating value, giving greater weight to sentences related to both entities; second, the weight of each sentence is computed with an activation function, so the model can effectively control the information flow even when the edge being computed has only one sentence.
The adjacency matrix used in a traditional graph convolution network consists of 0s and 1s indicating whether an edge exists between nodes; it cannot distinguish at a finer granularity how relevant a neighboring node is to the current node, and thus cannot effectively control information propagation between entities. A weighted adjacency matrix is therefore computed that comprehensively considers node information and edge information. The weight between nodes u and v, denoted Ã_uv, is calculated as:
Ã_uv = exp(W σ(W_u P_u + W_v P_v + W_e E_{uv})) / Σ_{w=1}^{N} exp(W σ(W_u P_u + W_v P_w + W_e E_{uw}))
where W, W_u, W_v, and W_e are trainable parameters, and exp denotes the exponential function with base e.
The GCGCN model also blends the edge representations into the graph convolution operation, updating the node representations with rich context information. The two hierarchical graph convolution inference modules of GCGCN (CAGGC and MAGGC) each contain K densely connected sublayers; the result of node v passing through the k-th sublayer is:
P_v^(k) = σ(Σ_{u=1}^{N} Ã_uv W^(k) g_u^(k) + b^(k))
where W^(k) and b^(k) are the trainable parameters of the k-th sublayer.
Dense connections fuse the initial node representation with the outputs of the previous k-1 sublayers as the input of the current sublayer:
g_u^(k) = [P_u^(0); P_u^(1); …; P_u^(k-1)]
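The densely connected sublayer can be sketched as below; the weighted adjacency is taken as given, and the dimensions, activation, and class structure are illustrative assumptions rather than the patent's exact configuration.

```python
# Sketch of one densely connected graph-convolution sublayer (Step Four).
# A_tilde is the weighted adjacency; g is the dense-connection input
# [P^(0); outputs of sublayers 1..k-1]. Dimensions are illustrative.
import torch
import torch.nn as nn

class DenseGCNSublayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)

    def forward(self, A_tilde, g):
        # P_v^(k) = sigma( sum_u A_tilde[u, v] * (W g_u + b) )
        return torch.relu(A_tilde.T @ self.W(g))

N, d = 4, 32
A_tilde = torch.softmax(torch.randn(N, N), dim=0)   # stand-in weighted adjacency
layer1 = DenseGCNSublayer(d, d)
layer2 = DenseGCNSublayer(2 * d, d)
p0 = torch.randn(N, d)
p1 = layer1(A_tilde, p0)
p2 = layer2(A_tilde, torch.cat([p0, p1], dim=-1))   # dense connection
print(p2.shape)  # torch.Size([4, 32])
```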
Multi-head attention-guided graph convolution (MAGGC) module
Multi-head attention is used to collect the interactions between all nodes, in particular nodes connected by multi-hop paths.
Owing to the multi-head attention mechanism, MAGGC expands the partially connected graph used in the previous module into a weighted fully connected graph. To compute the edge representations first, the MAGGC module replaces P^(0) in the CAGGC module with P^(1) and computes the entity-aware edge representation matrix E^(2) in the same manner; if entities u and v do not appear together in any sentence, the edge E_{uv}^(2) is a zero vector.
Unlike the CAGGC module, which considers the influence of context information, MAGGC computes the adjacency matrix directly with the self-attention mechanism:
Ã = softmax((P W_Q)(P W_K)⊤ / √d)
where W_Q and W_K are trainable parameters and d is the vector dimension.
Since multi-head attention comprises multiple self-attention heads, t different adjacency matrices {Ã_1, …, Ã_t} are computed with the above formula. The resulting t output representations {P_1^(2); P_2^(2); …; P_t^(2)} are first reduced in dimension and then concatenated to obtain the output P^(2) of the MAGGC module.
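A sketch of the per-head adjacency computation follows; the head count, dimensions, and module structure are illustrative assumptions.

```python
# Sketch of the MAGGC adjacency computation: each head derives a soft adjacency
# matrix from node representations via scaled dot-product self-attention.
import math
import torch
import torch.nn as nn

class AttentionAdjacency(nn.Module):
    def __init__(self, d, n_heads):
        super().__init__()
        self.W_Q = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(n_heads))
        self.W_K = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(n_heads))
        self.d = d

    def forward(self, P):
        # One (N, N) weighted adjacency per attention head.
        return [torch.softmax(wq(P) @ wk(P).T / math.sqrt(self.d), dim=-1)
                for wq, wk in zip(self.W_Q, self.W_K)]

P = torch.randn(4, 32)                 # node representations P^(1), illustrative size
adjs = AttentionAdjacency(32, n_heads=3)(P)
print(len(adjs), adjs[0].shape)        # 3 torch.Size([4, 4])
```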
Relationship classification layer
The initial node representation obtained from the encoding layer and the node representations computed by the two graph convolution inference modules are concatenated, fed into a fully connected layer, and passed through an activation function to obtain the final node representation:
P = tanh(W_p [P^(0); P^(1); P^(2)] + b_p)
where P^(0) is the initial node representation, P^(1) and P^(2) are the node representations output by the CAGGC and MAGGC modules respectively, and W_p and b_p are trainable parameters.
The entity representations and relative distance vectors are concatenated, and the entity-pair relation features for relation classification are obtained with a bilinear function and a fully connected layer:
P_u′ = [P_u; E(d_{u,v})]
P_v′ = [P_v; E(d_{v,u})]
P(r|u,v) = sigmoid(P_u′⊤ W_r P_v′ + W_t [P_u′; P_v′] + b_r)
where ";" denotes the concatenation operation, d_{u,v} and d_{v,u} are the relative distances of the first mentions of the two entities, and E is the mapping matrix of the relative distance vectors.
Because the distant supervision corpus contains multiple relations, a binary cross-entropy loss for multi-label classification is used to compute the loss value during training:
L_T = - Σ_{(u,v)∈S} Σ_{r∈R} [II(r ∈ y_{u,v}) log P(r|u,v) + (1 - II(r ∈ y_{u,v})) log(1 - P(r|u,v))]
where S denotes the entire training set, II(·) is the indicator function, y_{u,v} is the set of labeled relations of entity pair (u, v), and R is the set of predefined relation types.
Step five: entity fusion based on knowledge base and biomedical text information
Aligning the text-based entity representations with the translation-model-based entity representations yields an entity alignment loss L_A, i.e., minimizing:
L_A = Σ_i D(P_i, e_i)
where D(P_i, e_i) is the distance between the textual entity representation P_i and the translation-model-based entity representation e_i. A matrix M maps the textual entity representation P_i into the space of the translation-model entity representation e_i:
D(P_i, e_i) = ||M P_i - e_i||
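A minimal sketch of the alignment objective, with illustrative dimensions assumed for the text-side and KB-side embedding spaces:

```python
# Sketch of the Step Five alignment objective: a linear map M projects the
# text-side entity representation into the TransE embedding space, and the
# loss is the summed distance to the corresponding KB entity embedding.
import torch
import torch.nn as nn

class AlignmentLoss(nn.Module):
    def __init__(self, d_text, d_kb):
        super().__init__()
        self.M = nn.Linear(d_text, d_kb, bias=False)   # mapping matrix M

    def forward(self, P, e):
        # L_A = sum_i || M P_i - e_i ||
        return torch.norm(self.M(P) - e, p=2, dim=-1).sum()

align = AlignmentLoss(d_text=96, d_kb=100)
loss = align(torch.randn(8, 96), torch.randn(8, 100))
loss.backward()
print(float(loss))
```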
According to the credibility and consistency of the knowledge base and the text information, the interrelationships among the knowledge base loss L_K, the text loss L_T, and the alignment loss L_A are studied to obtain the optimal knowledge representation fusing the biomedical knowledge base and text information.
The knowledge representations obtained with this patent are used for entity relation extraction in the biomedical field. Testing was performed directly on the BioCreative V CDR test data without using the BioCreative V CDR training data. For a pair of candidate entities in a document of the test data, the cosine similarity between the difference of the head and tail entity representations and each relation representation is computed to determine the relation of the entity pair. Following the description of the CDR corpus, the chemical-induced disease (CID) relation in the CDR corpus corresponds to the "marker/mechanism" relation in CTD. The entity pair with the maximum similarity to the "marker/mechanism" relation is considered to have a CID relation. The experimental results are shown in the following table:
Knowledge representation               P(%)    R(%)    F(%)
TransE (cosine similarity)             47.51   11.63   18.69
GCGCN-TransE (cosine similarity)       51.02   67.82   58.24
Experimental results show that the final F value of the proposed GCGCN-TransE (cosine similarity) method, which uses only entity representations, improves by 38.41% over the traditional TransE (cosine similarity) method, indicating that the proposed knowledge representation learning method GCGCN-TransE based on a graph convolution network and a translation model can effectively capture and fuse knowledge from the biomedical knowledge base and distantly supervised text information, thereby obtaining high-quality knowledge representations.
The knowledge representations are further applied to the deep neural network model GCGCN to extract biomedical entity relations. First, based on GCGCN, an entity relation extraction model is trained with the BioCreative V CDR training data and tested directly on the BioCreative V CDR test data. Then, the classification layer of GCGCN concatenates the entity representations learned with TransE and GCGCN-TransE respectively, training two models: TransE (neural network) and GCGCN-TransE (neural network). The results on the BioCreative V CDR test data are shown in the following table:
System name                       P(%)    R(%)    F(%)
TransE (neural network)           54.79   15.57   24.25
GCGCN (Zhou et al.)               54.95   67.73   60.67
GCGCN-TransE (neural network)     59.83   64.26   61.96
Experimental results show that the final F value of the proposed GCGCN-TransE method improves by 1.29% over the traditional GCGCN method of Zhou et al., indicating that the proposed biomedical relation extraction system GCGCN-TransE, based on knowledge representation learning with a graph convolution network and a translation model, can effectively capture the optimal knowledge representation fusing the biomedical knowledge base and text information, thereby obtaining better results in biomedical relation extraction.

Claims (1)

1. A knowledge representation learning method based on a graph convolution network and a translation model, characterized by comprising the following steps:
Step one: labeling all drug entities and disease entities in PubMed abstracts together with their corresponding MeSH IDs using the text mining tool PubTator; labeling the entity relations in the unlabeled corpus of documents by distant supervision with the Comparative Toxicogenomics Database as a guide; for all entity pairs in a document, if a pair of entities has a certain relation in the knowledge base, the pair of entities in the document is considered to have that relation, and the relation of the pair is labeled;
Step two: encoding word vectors by utilizing BioBERT pre-training language models, processing an input text into an input form of BioBERT, namely adding a special identifier [ CLS ] at the head end of the text, adding a special separator [ SEP ] at the tail end of each sentence, and performing word segmentation on an input sequence; finally, learning the segmented input sequence through BioBERT pre-training language model, extracting hidden layer representation output by the last layer of network as word vector, wherein the word vector of the ith segmented word is
Constructing an entity type matrix E type and a co-index matrix E corf by a random initialization method, and performing label mapping on each word in a text sequence to obtain a corresponding type feature vectorAnd co-index feature vectorWherein t i and c i are class tags and co-index tags of the ith word;
The obtained word vector, the type feature vector and the common-finger feature vector are spliced to construct the features finally input to the context semantic encoder, and the formula is as follows:
wherein "; "is a vector concatenation operation, the dimension of the final feature is d=d w+dt+dc;
Step three: knowledge representation for learning knowledge base based on translation model
Learning knowledge representations of the knowledge base triples (h, r, t) using the translation model; e h、et、er is a representation of the head, tail entities and relationships, respectively; the translation model defines an energy function d (·) that can measure how well a set relationship between an entity and a relational representation is satisfied, and a loss function L k is represented as:
Wherein, gamma > 0 is the boundary, S is the triplet set of the knowledge base, S' is the negative example set of the entity relation; learning to obtain a representation e h、et、er of head, tail entities and relationships based on the knowledge base;
Step four: learning knowledge representations of the large-scale distant supervision corpus based on GCGCN
constructing the input data of the graph structure for each document-level input sample; each input sample labels an entity set {e_1, e_2, …, e_N}, where N is the number of entities; an entity interaction graph is constructed by the following two rules: each entity in the entity set is a node in the graph, i.e., the graph has N nodes; if mentions of two entities occur in the same sentence, the nodes representing those two entities are connected by an undirected edge;
the constructed entity interaction graph is denoted G(A, E), where A is the adjacency matrix: A_ij = 1 if there is an edge between node i and node j, otherwise A_ij = 0;
GCGCN comprises four layers: an embedding layer, a context-aware attention-guided graph convolution module, a multi-head attention-guided graph convolution module, and a relation classification layer;
(1) Embedding layer
encoding word vectors using the BioBERT pre-trained language model; given a document D = {s_1, s_2, …, s_S}, the encoded word vector sequence is {q_{i,j}}, where q_{i,j} ∈ R^{d_w} is the word vector of the j-th word in the i-th sentence and d_w is the vector dimension;
concatenating the word vector q_{i,j}, the entity type vector t_{i,j}, and the coreference vector c_{i,j} to obtain the final word vector sequence:
x_{i,j} = [q_{i,j}; t_{i,j}; c_{i,j}]
since an entity may have multiple mentions and a mention may contain multiple words, an averaging operation is used to compute the entity representation, denoted P^(0):
P_v^(0) = (1/J) Σ_{q=1}^{J} (1/(t - s + 1)) Σ_{j=s}^{t} x_j
where P_v^(0) is the representation of entity e_v, J is the number of its mentions, m_q is the q-th mention of e_v, and s and t are its start and end positions;
(2) Context-aware attention-guided graph convolution module
computing entity-aware edge representations containing rich context information using an attention mechanism and a gating mechanism; then guiding the generation of a weighted adjacency matrix with the computed edge representations, and finally updating the node representations over multiple densely connected graph convolution sublayers;
because an edge may be associated with multiple context sentences, to compute the representation of the edge between node u and node v, a word-level attention mechanism is first used to obtain the representation of each sentence, and a gating mechanism then fuses the information of the multiple sentences into an entity-aware edge representation;
the representation h_i of the i-th sentence on edge uv is first computed from the word vector of each word and its relative distance to the given entity:
α_{i,j}^c = softmax_j(z⊤ tanh(W_1 x_{i,j} + W_2 d_{j,c} + b_1))
h_i^c = Σ_{j=1}^{m} α_{i,j}^c x_{i,j}
where c ∈ {u, v} denotes either of the two entities, d_{j,c} is the relative distance vector between the current word and entity c, α_{i,j}^c is the attention weight of the j-th word in the i-th sentence as perceived by entity c, m is the number of words in the i-th sentence, and W_1, W_2, z, and b_1 are trainable parameters;
computing word-level attention over the i-th sentence with entities u and v respectively to obtain two sentence representations h_i^u and h_i^v, which are concatenated and fed into a fully connected layer to obtain a sentence representation h_i that perceives entities u and v simultaneously:
h_i = tanh(W_s [h_i^u; h_i^v] + b_s)
where W_s and b_s are trainable parameters;
to make the model consider the information of all sentences on edge uv, an entity-aware gating mechanism is adopted; for entity c ∈ {u, v}, its initial representation P_c^(0) is used to compute the weight of each sentence, and the weighted sum of all sentences serves as the edge representation:
β_i^c = σ(W_4 h_i + W_5 P_c^(0) + b_2)
E_{uv}^c = Σ_{i=1}^{S} β_i^c ⊙ (W_3 h_i)
where σ(·) is the sigmoid or ReLU activation function, W_3, W_4, W_5, and b_2 are trainable parameters, and S is the total number of sentences;
the edge representations E_{uv}^u and E_{uv}^v perceived by entities u and v are concatenated and fed into a fully connected layer to obtain an edge representation that perceives both entities simultaneously:
E_{uv} = tanh(W_sg [E_{uv}^u; E_{uv}^v] + b_sg)
where ";" denotes the concatenation operation and W_sg and b_sg are trainable parameters;
through the above calculation, the initial edge representation matrix E^(1) of the CAGGC network is obtained;
a weighted adjacency matrix is computed that comprehensively considers node information and edge information; the weight between nodes u and v, denoted Ã_uv, is calculated as:
Ã_uv = exp(W σ(W_u P_u + W_v P_v + W_e E_{uv})) / Σ_{w=1}^{N} exp(W σ(W_u P_u + W_v P_w + W_e E_{uw}))
where W, W_u, W_v, and W_e are trainable parameters, and exp denotes the exponential function with base e;
the GCGCN model also blends the edge representations into the graph convolution operation, updating the node representations with rich context information; the two hierarchical graph convolution inference modules of GCGCN each contain K densely connected sublayers, and the result of node v passing through the k-th sublayer is:
P_v^(k) = σ(Σ_{u=1}^{N} Ã_uv W^(k) g_u^(k) + b^(k))
where W^(k) and b^(k) are the trainable parameters of the k-th sublayer;
dense connections fuse the initial node representation with the outputs of the previous k-1 sublayers as the input of the current sublayer:
g_u^(k) = [P_u^(0); P_u^(1); …; P_u^(k-1)]
(3) Multi-head attention-guided graph convolution module
collecting the interactions between all nodes, in particular nodes connected by multi-hop paths, using multi-head attention;
owing to the multi-head attention mechanism, MAGGC expands the partially connected graph used in the previous module into a weighted fully connected graph; to compute the edge representations first, the MAGGC module replaces P^(0) in the CAGGC module with P^(1) and computes the entity-aware edge representation matrix E^(2) in the same manner; if entities u and v do not appear together in any sentence, the edge E_{uv}^(2) is a zero vector;
MAGGC computes the adjacency matrix directly with the self-attention mechanism:
Ã = softmax((P W_Q)(P W_K)⊤ / √d)
where W_Q and W_K are trainable parameters and d is the vector dimension;
since multi-head attention comprises multiple self-attention heads, t different adjacency matrices {Ã_1, …, Ã_t} are computed with the above formula; the resulting t output representations {P_1^(2); P_2^(2); …; P_t^(2)} are first reduced in dimension and then concatenated to obtain the output P^(2) of the MAGGC module;
(4) Relationship classification layer
concatenating the initial node representation obtained from the encoding layer and the node representations computed by the two graph convolution inference modules, feeding them into a fully connected layer, and applying an activation function to obtain the final node representation:
P = tanh(W_p [P^(0); P^(1); P^(2)] + b_p)
where P^(0) is the initial node representation, P^(1) and P^(2) are the node representations output by the CAGGC and MAGGC modules respectively, and W_p and b_p are trainable parameters;
concatenating the entity representations and relative distance vectors, and obtaining the entity-pair relation features for relation classification with a bilinear function and a fully connected layer:
P_u′ = [P_u; E(d_{u,v})]
P_v′ = [P_v; E(d_{v,u})]
P(r|u,v) = sigmoid(P_u′⊤ W_r P_v′ + W_t [P_u′; P_v′] + b_r)
where ";" denotes the concatenation operation, d_{u,v} and d_{v,u} are the relative distances of the first mentions of the two entities, and E is the mapping matrix of the relative distance vectors;
because the distant supervision corpus contains multiple relations, a binary cross-entropy loss for multi-label classification is used to compute the loss value during training:
L_T = - Σ_{(u,v)∈S} Σ_{r∈R} [II(r ∈ y_{u,v}) log P(r|u,v) + (1 - II(r ∈ y_{u,v})) log(1 - P(r|u,v))]
where S denotes the entire training set, II(·) is the indicator function, y_{u,v} is the set of labeled relations of entity pair (u, v), and R is the set of predefined relation types;
step five: entity fusion based on knowledge base and biomedical text information
Aligning the text-based entity representation with the translation model-based entity representation results in an entity alignment penalty L A, i.e., minimizing:
Wherein D (P i,ej) is the distance of the textual entity representation P i from the translation model-based entity representation e i; the matrix M is used to map the entity representation P i of the text to the space of the entity representation e i of the translation model:
D(Pi,ei)=||MPi-ei||
and researching the interrelationships among the knowledge base loss L K, the text loss L T and the alignment loss L A according to the credibility and consistency of the knowledge base and the text information, and obtaining the optimal knowledge representation fusing the biomedical knowledge base and the text information.
CN202111240396.3A 2021-10-25 2021-10-25 Knowledge representation learning method based on graph convolution network and translation model Active CN114021584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111240396.3A CN114021584B (en) 2021-10-25 2021-10-25 Knowledge representation learning method based on graph convolution network and translation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111240396.3A CN114021584B (en) 2021-10-25 2021-10-25 Knowledge representation learning method based on graph convolution network and translation model

Publications (2)

Publication Number Publication Date
CN114021584A (en) 2022-02-08
CN114021584B (en) 2024-05-10

Family

ID=80057414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111240396.3A Active CN114021584B (en) 2021-10-25 2021-10-25 Knowledge representation learning method based on graph convolution network and translation model

Country Status (1)

Country Link
CN (1) CN114021584B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114254655B (en) * 2022-02-28 2022-05-10 南京众智维信息科技有限公司 Network security tracing semantic identification method based on prompt self-supervision learning
CN116756596B (en) * 2023-08-17 2023-11-14 智慧眼科技股份有限公司 Text clustering model training method, text clustering device and related equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538848A (en) * 2020-04-29 2020-08-14 华中科技大学 Knowledge representation learning method fusing multi-source information
CN112507699A (en) * 2020-09-16 2021-03-16 东南大学 Remote supervision relation extraction method based on graph convolution network
CN113254663A (en) * 2021-04-21 2021-08-13 浙江工业大学 Knowledge graph joint representation learning method integrating graph convolution and translation model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074301A1 (en) * 2018-09-04 2020-03-05 Beijing Jingdong Shangke Information Technology Co., Ltd. End-to-end structure-aware convolutional networks for knowledge base completion
KR102524766B1 (en) * 2019-12-17 2023-04-24 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Natural language and knowledge graph-based expression learning method and apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538848A (en) * 2020-04-29 2020-08-14 华中科技大学 Knowledge representation learning method fusing multi-source information
CN112507699A (en) * 2020-09-16 2021-03-16 东南大学 Remote supervision relation extraction method based on graph convolution network
CN113254663A (en) * 2021-04-21 2021-08-13 浙江工业大学 Knowledge graph joint representation learning method integrating graph convolution and translation model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Global Context-enhanced Graph Convolutional Networks for Document-level Relation Extraction; Huiwei Zhou et al.; Proceedings of the 28th International Conference on Computational Linguistics; 2020-12-13; 5259-5270 *

Also Published As

Publication number Publication date
CN114021584A (en) 2022-02-08

Similar Documents

Publication Publication Date Title
CN108959252B (en) Semi-supervised Chinese named entity recognition method based on deep learning
CN107992597B (en) Text structuring method for power grid fault case
CN111382272B (en) Electronic medical record ICD automatic coding method based on knowledge graph
CN111159407B (en) Method, apparatus, device and medium for training entity recognition and relation classification model
CN112487143A (en) Public opinion big data analysis-based multi-label text classification method
CN110083700A (en) A kind of enterprise's public sentiment sensibility classification method and system based on convolutional neural networks
CN106980608A (en) A kind of Chinese electronic health record participle and name entity recognition method and system
CN109934261A (en) A kind of Knowledge driving parameter transformation model and its few sample learning method
CN114021584B (en) Knowledge representation learning method based on graph convolution network and translation model
CN113553440B (en) Medical entity relationship extraction method based on hierarchical reasoning
CN111554360A (en) Drug relocation prediction method based on biomedical literature and domain knowledge data
CN110457479A (en) A kind of judgement document's analysis method based on criminal offence chain
CN105404632A (en) Deep neural network based biomedical text serialization labeling system and method
CN109960728A (en) A kind of open field conferencing information name entity recognition method and system
CN112989841A (en) Semi-supervised learning method for emergency news identification and classification
CN112308326A (en) Biological network link prediction method based on meta-path and bidirectional encoder
CN114239585A (en) Biomedical nested named entity recognition method
CN114548099B (en) Method for extracting and detecting aspect words and aspect categories jointly based on multitasking framework
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
CN114911945A (en) Knowledge graph-based multi-value chain data management auxiliary decision model construction method
CN111582506A (en) Multi-label learning method based on global and local label relation
CN114077673A (en) Knowledge graph construction method based on BTBC model
CN114781382A (en) Medical named entity recognition system and method based on RWLSTM model fusion
CN116932661A (en) Event knowledge graph construction method oriented to network security
CN112069825B (en) Entity relation joint extraction method for alert condition record data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant