CN116227597A - Biomedical knowledge extraction method, device, computer equipment and storage medium - Google Patents

Info

Publication number
CN116227597A
CN116227597A (application CN202310495163.0A)
Authority
CN
China
Prior art keywords
representation
sequence
word
token
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310495163.0A
Other languages
Chinese (zh)
Inventor
邱炎龙
吴诚堃
杨灿群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202310495163.0A priority Critical patent/CN116227597A/en
Publication of CN116227597A publication Critical patent/CN116227597A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/022 Knowledge engineering; Knowledge acquisition
    • G06N 5/025 Extracting rules from data
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G06F 16/35 Clustering; Classification
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a biomedical knowledge extraction method, device, computer equipment, and storage medium. The method comprises the following steps: setting an interaction tuple representation; acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain a feature representation of the source text sequence; inputting the feature representation of the source text sequence into an encoder to obtain the encoder's hidden representation; applying an attention mechanism to the encoder's hidden representation to obtain a sequence encoding vector; and feeding the sequence encoding vector together with the target-word embedding of the previous time step into a word-level decoding module, which predicts triples word by word to obtain a set of interaction tuples. The method proposes a new interaction tuple representation scheme so that an encoder-decoder model that extracts one word at each time step can still find, in a sentence, multiple tuples with overlapping entities as well as tuples with multi-word entities.

Description

Biomedical knowledge extraction method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a method and apparatus for extracting biomedical knowledge, a computer device, and a storage medium.
Background
As basic research, the targeted extraction of knowledge such as drug-protein entities and their interactions from biomedical literature provides powerful support for drug discovery, drug repurposing, drug design, and bioinformatics knowledge bases built in the form of knowledge graphs. However, as researchers have studied this task more deeply, problems at both the macroscopic and microscopic levels continue to emerge.
Extracting drug-protein interaction tuples from the abstracts of biomedical literature is a challenging task. Macroscopically, previous work mainly divides the task into two parts, entity recognition and relation extraction. The resulting models not only ignore context information but also raise the questions of how to order the two subtasks and how to share feature information between them. Microscopically, differences in entity length, the presence of multiple tuples, and entity overlap between tuples can hurt the accuracy of triple extraction, and these situations fall into three categories: (1) No Entity Overlap (NEO): a sequence contains one or more tuples, but no entity is shared between them. (2) Entity Pair Overlap (EPO): a given sequence contains multiple tuples, and at least two tuples share both entities, in the same or opposite order. (3) Single Entity Overlap (SEO): a given sequence contains multiple tuples, and at least two tuples share exactly one entity. Note that a single sequence may exhibit both entity pair overlap and single entity overlap.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a biomedical knowledge extraction method, apparatus, computer device, and storage medium.
A biomedical knowledge extraction method, the method comprising:
setting an interaction tuple representation; the form of the interaction tuple representation is as follows: tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the components of an interaction tuple comprise: entity 1, entity 2, and the interaction.
Acquiring abstract text data of a biomedical document, and mapping the words and characters in the abstract text data to embedding vectors to obtain a feature representation of the source text sequence; the source text sequence includes all words and symbols in the abstract text data, including the clue words for each interaction.
Inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; the encoder is configured to encode the characteristic information of the source text sequence using Bi-LSTM.
The hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector.
Feeding the sequence encoding vector and the target-word embedding of the previous time step into a word-level decoding module, and predicting triples word by word to obtain a set of interaction tuples.
A biomedical knowledge extraction device, the device comprising:
an interaction tuple representation setting module for setting an interaction tuple representation; the form of the interaction tuple representation is as follows: tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the components of an interaction tuple comprise: entity 1, entity 2, and the interaction.
A source text sequence feature representation acquisition module for acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain a feature representation of the source text sequence; the source text sequence includes all words and symbols in the abstract text data, including the clue words for each interaction.
The encoding module is used for inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; the encoder is used for encoding the characteristic information of the source text sequence by adopting Bi-LSTM; the hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector.
A decoding and prediction module for feeding the sequence encoding vector and the target-word embedding of the previous time step into the word-level decoding module and predicting triples word by word to obtain a set of interaction tuples.
In one embodiment, the source text sequence feature representation acquisition module is further configured to acquire source abstract text of a biomedical document; constructing a vocabulary according to the source abstract text, wherein the vocabulary comprises a source abstract text token, an interaction name, a component separator, a target sequence start token and a target sequence end token in an interaction set R; initializing Word embedding is carried out on the source abstract text by using a Word2Vec tool, so as to obtain a pre-training Word vector; extracting word feature vectors based on characters from the source abstract text by adopting a convolution network with maximum pooling; and connecting the pre-training word vector with the word feature vector based on the characters to obtain the feature representation of the source text sequence.
A computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of any of the methods described above.
With the above biomedical knowledge extraction method, device, computer equipment, and storage medium, the method comprises: setting an interaction tuple representation; acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain a feature representation of the source text sequence; inputting the feature representation of the source text sequence into an encoder to obtain the encoder's hidden representation, the encoder encoding the feature information of the source text sequence with a Bi-LSTM; applying an attention mechanism to the encoder's hidden representation to obtain a sequence encoding vector; and feeding the sequence encoding vector together with the target-word embedding of the previous time step into a word-level decoding module, which predicts triples word by word to obtain a set of interaction tuples. The method proposes a new interaction tuple representation scheme so that an encoder-decoder model that extracts one word at each time step can still find, in a sentence, multiple tuples with overlapping entities as well as tuples with multi-word entities.
Drawings
FIG. 1 is a flow chart of a method for extracting knowledge of biological medicine according to an embodiment;
FIG. 2 is a schematic diagram of a word-level decoding-based model in one embodiment;
FIG. 3 is a block diagram of a biomedical knowledge extraction device in one embodiment;
fig. 4 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a biomedical knowledge extraction method is provided, the method comprising the steps of:
step S100: setting an interaction tuple representation; the interaction tuple is expressed in the form of: the tuple components are separated by a component separator, the plurality of tuples are separated by a tuple component separator, and the interacting tuple components comprise: entity 1, entity 2 and interactions.
Specifically, the interaction tuple is a triplet, and the representation of the interaction tuple is: entity 1; an entity 2; interaction. Use "; "as a separator separating tuple components (i.e., component separator is"; ") and" | "separates symbols of multiple tuples (tuple component separator is" | "). Table 1 is an example of an interaction tuple representation, one example of which is included in table 1. These special tokens (":", and "|") can be used to represent multiple interacting tuples with overlapping entities and entities of different lengths in a simple manner. These special tokens can be used to easily extract the interaction tuples after the end of the sequence generation during the reasoning process. Because of this unified interaction tuple representation scheme, entity tokens, interaction tokens, and special tokens are treated similarly, a shared vocabulary containing all of these tokens is used between the encoder and decoder. The digest text (text is a string of words, or a sequence of characters, which may be called a sequence of text) of the input biomedical document contains each interactive clue word, which helps to generate an interaction token. Two special tokens are used so that the encoder-decoder model can distinguish between the beginning of an interacting tuple and the beginning of a tuple component. In order to extract interaction tuples from a text sequence using an encoder-decoder model, the model must generate entity tokens, find clue words for interactions and map them to interaction tokens, and generate special tokens when appropriate.
Table 1 Example of the interaction tuple representation
(The example table is rendered as an image in the original and is not reproduced here; tuples take the form "entity 1 ; entity 2 ; interaction", with "|" separating tuples.)
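As an illustrative sketch only (not part of the claimed method; the function name and example strings are invented), the extraction of tuples from a generated target sequence after decoding, using the component separator ";" and the tuple separator "|" together with the post-filtering described later in the description, might look like:

```python
def parse_interaction_tuples(target_sequence, interaction_set):
    """Split a generated target sequence into (entity1, entity2, interaction)
    triples using the tuple separator "|" and component separator ";"."""
    tuples = []
    for chunk in target_sequence.split("|"):
        parts = [p.strip() for p in chunk.split(";")]
        if len(parts) != 3:
            continue  # malformed tuple, skip it
        e1, e2, rel = parts
        # drop tuples whose two entities are identical or whose
        # interaction token is not in the interaction set
        if e1 == e2 or rel not in interaction_set:
            continue
        tuples.append((e1, e2, rel))
    # remove exact duplicates while preserving order
    return list(dict.fromkeys(tuples))

seq = "aspirin ; COX-1 ; inhibitor | aspirin ; COX-1 ; inhibitor | heparin ; heparin ; agonist"
print(parse_interaction_tuples(seq, {"inhibitor", "agonist"}))
# [('aspirin', 'COX-1', 'inhibitor')]
```

The dictionary-based deduplication keeps the first occurrence of each tuple, matching the rule that repeated tuples are deleted after decoding.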
Step S102: acquiring abstract text data of a biomedical document, and mapping words and characters in the abstract text data into embedded vectors to obtain characteristic representation of a source text sequence; the source text sequence includes all words, symbols, and each interacted cue word in the summary text data.
Specifically, an embedding vector is the single vector to which a word, symbol, or sequence is mapped.
Step S104: inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; the encoder is configured to encode the characteristic information of the source text sequence using Bi-LSTM.
Step S106: the hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector.
Specifically, in order to obtain the source context feature information, a hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector. Preferably, the attention mechanism may be an Avg attention mechanism, an N-gram attention mechanism or a Single attention mechanism.
Step S108: and embedding and inputting the sequence coding vector and the target word of the previous time step into a word level decoding module, and predicting the triples in the form of generated words to obtain a group of interaction tuples.
Specifically, a word level decoding-based model (WDec) is formed by the encoder, the attention mechanism and the word level decoding module, and the model is structured as shown in FIG. 2.
The word level decoding module is used for decoding the input sequence coding vector and the target word embedding of the previous time step by adopting an LSTM network unit, then mapping the decoding result to a vocabulary by adopting a projection layer, and then performing prediction processing by adopting a softmax operation fused with a mask technology to obtain a group of interaction tuples.
The word-level decoding module generates one token at a time and stops after generating the target sequence end token (EOS).
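For illustration, the generate-until-EOS behavior of the word-level decoding module can be sketched as follows; `step_fn`, standing in for one LSTM decoding step plus projection, is hypothetical:

```python
def greedy_decode(step_fn, max_steps=50):
    """Generate one token per time step; stop when the end-of-target-sequence
    token (EOS) is produced or the step budget is exhausted."""
    tokens, prev = [], "SOS"
    for _ in range(max_steps):
        tok = step_fn(prev)  # one decoder step: previous target token -> next token
        if tok == "EOS":
            break
        tokens.append(tok)
        prev = tok
    return tokens

# toy step function that emits a fixed target sequence
script = iter(["aspirin", ";", "COX-1", ";", "inhibitor", "EOS"])
out = greedy_decode(lambda prev: next(script))
print(" ".join(out))  # aspirin ; COX-1 ; inhibitor
```

The `max_steps` budget is a safety net for the sketch; the patented module stops on EOS alone.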
With the above biomedical knowledge extraction method, the method comprises: setting an interaction tuple representation; acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain a feature representation of the source text sequence; inputting the feature representation of the source text sequence into an encoder to obtain the encoder's hidden representation, the encoder encoding the feature information of the source text sequence with a Bi-LSTM; applying an attention mechanism to the encoder's hidden representation to obtain a sequence encoding vector; and feeding the sequence encoding vector together with the target-word embedding of the previous time step into a word-level decoding module, which predicts triples word by word to obtain a set of interaction tuples. The method proposes a new interaction tuple representation scheme so that an encoder-decoder model that extracts one word at each time step can still find, in a sentence, multiple tuples with overlapping entities as well as tuples with multi-word entities.
In one embodiment, step S102 includes: acquiring the source abstract text of a biomedical document; building a vocabulary from the source abstract text, the vocabulary comprising the source abstract text tokens, the interaction names in the interaction set R, the component separator, the tuple separator, the target sequence start token, and the target sequence end token; extracting word embedding and character-based feature vectors for all words and characters in the source abstract text using a convolutional network with max pooling; and concatenating the character-based feature vectors with the word embedding vectors to obtain the feature representation of the source text sequence.
Specifically, a vocabulary V is created from the source abstract text tokens, the interaction names in the interaction set R, the special separator tokens (";" between tuple components and "|" between tuples), the target sequence start token (SOS), and the target sequence end token (EOS).
Word-level embedding comprises two parts: (1) a pre-trained word vector; (2) a character-based feature vector. A word embedding layer $E_w \in \mathbb{R}^{|V| \times d_w}$ and a character embedding layer $E_c \in \mathbb{R}^{|A| \times d_c}$ are used, where $d_w$ is the dimension of the word vectors, $A$ is the character alphabet of the input text sequence tokens, and $d_c$ is the dimension of the character embedding vectors. A convolutional neural network with max pooling extracts a character-based feature vector of size $d_f$ for each word. The word embedding and the character-based feature vector are concatenated to obtain the representation of the input token.
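A minimal, illustrative sketch of the character-based word feature extraction (a width-3 convolution over character embeddings followed by max pooling over positions); the toy dimensions and random weights below are stand-ins for the trained parameters:

```python
import random

def char_word_feature(word, char_emb, filters, width=3):
    """Character-based word feature: slide each filter over the character
    embeddings with the given window width, then max-pool over positions."""
    chars = [char_emb[c] for c in word]
    feats = []
    for f in filters:  # one output dimension per convolutional filter
        scores = []
        for i in range(len(chars) - width + 1):
            window = [v for emb in chars[i:i + width] for v in emb]
            scores.append(sum(w * x for w, x in zip(f, window)))
        feats.append(max(scores) if scores else 0.0)  # max pooling
    return feats  # length equals the number of filters (d_f)

random.seed(0)
d_c, n_filters = 4, 5  # toy character-embedding dim and feature dim
char_emb = {c: [random.uniform(-1, 1) for _ in range(d_c)] for c in "abcdefghij"}
filters = [[random.uniform(-1, 1) for _ in range(3 * d_c)] for _ in range(n_filters)]
vec = char_word_feature("abcde", char_emb, filters)
print(len(vec))  # 5
```

In the embodiment below, the filter width is 3 and the maximum word length is 10; here the feature simply has one dimension per filter.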
In one embodiment, the encoder comprises a number of Bi-LSTM network elements, the number of Bi-LSTM network elements being the same as the length of the source text sequence; step S104 includes: and respectively inputting each token vector representation in the source text sequence into a corresponding Bi-LSTM network element of the encoder to obtain a hidden representation of each token vector representation in the source text sequence.
Specifically, the source text sequence $S$ is represented by its token vectors $\{x_1, x_2, \dots, x_n\}$, where $x_i$ is the vector representation of the $i$-th word and $n$ is the length of $S$. The vectors $x_i$ are passed to a bidirectional long short-term memory network (Bi-LSTM) to obtain the hidden representations $h_i^E$. The hidden dimension of the forward and backward LSTMs of the Bi-LSTM is set to $d_h/2$, so that $h_i^E \in \mathbb{R}^{d_h}$, where $d_h$ is the hidden dimension of the decoder LSTM in the word-level decoding module.
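As a hedged illustration of the dimensionality bookkeeping (using the embodiment's values of 300 for the decoder hidden dimension and 150 per encoder direction; the placeholder vectors are not real LSTM states), concatenating the forward and backward hidden states recovers a vector of size $d_h$ per token:

```python
def concat_bidirectional(forward, backward):
    """Concatenate per-token forward and backward hidden states."""
    return [f + b for f, b in zip(forward, backward)]

d_h = 300                               # decoder hidden dimension (from the embodiment)
half = d_h // 2                         # each encoder direction uses d_h / 2
n = 4                                   # toy sequence length
fwd = [[0.0] * half for _ in range(n)]  # placeholder forward LSTM states
bwd = [[1.0] * half for _ in range(n)]  # placeholder backward LSTM states
H = concat_bidirectional(fwd, bwd)
print(len(H), len(H[0]))  # 4 300
```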
In one embodiment, an attention mechanism is applied to the hidden representation of the encoder to obtain the sequence encoding vector; the attention mechanism in this step is the Avg attention mechanism, the N-gram attention mechanism, or the Single attention mechanism. The expression of the Avg attention mechanism is:

$$e_t = \frac{1}{n} \sum_{i=1}^{n} h_i^E \quad (1)$$

The expression of the N-gram attention mechanism for $N = 3$ is:

$$a_i^g = (h_n^E)^{\top} V^g\, w_i^g \quad (2)$$

$$\alpha^g = \mathrm{softmax}(a^g) \quad (3)$$

$$e_t = \Big[\, h_n^E \,\Big\Vert\, \sum_{g=1}^{N} W^g \sum_i \alpha_i^g\, w_i^g \,\Big] \quad (4)$$

where $h_n^E$ is the last hidden state of the encoder, $g$ indicates the word-gram combination, $w^g$ is the $g$-gram representation sequence of the input sequence, $w_i^g$ is the $i$-th $g$-gram vector (the 2-gram and 3-gram representations are obtained by mean pooling), $\alpha_i^g$ is the normalized attention score of the $i$-th $g$-gram vector, and $W^g$ and $V^g$ are trainable parameters.

The expression of the Single attention mechanism is:

$$u_i = W_u h_i^E \quad (5)$$

$$q_t = W_q h_{t-1}^D + b_q \quad (6)$$

$$s_t^i = v^{\top} \tanh(q_t + u_i) \quad (7)$$

$$\alpha_t = \mathrm{softmax}(s_t) \quad (8)$$

$$e_t = \sum_{i=1}^{n} \alpha_t^i\, h_i^E \quad (9)$$

where $W_u$, $W_q$, and $v$ are trainable attention parameters, $b_q$ is a bias vector, and $\alpha_t^i$ is the normalized attention score of the $i$-th word at decoding time step $t$.
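For illustration only, the Avg and Single (additive) attention variants can be sketched in plain Python; the weight matrices below are random stand-ins for the trainable parameters, and the decoder state is a placeholder:

```python
import math
import random

def avg_attention(enc_hidden):
    """Avg attention: the context vector is the mean of the encoder hidden states."""
    n, d = len(enc_hidden), len(enc_hidden[0])
    return [sum(h[j] for h in enc_hidden) / n for j in range(d)]

def single_attention(enc_hidden, dec_state, v, W_u, W_q, b_q):
    """Additive single attention: score each encoder state against the previous
    decoder state, normalize with softmax, then take the weighted sum."""
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]
    q = [a + b for a, b in zip(matvec(W_q, dec_state), b_q)]
    scores = []
    for h in enc_hidden:
        u = matvec(W_u, h)
        scores.append(sum(vi * math.tanh(qi + ui) for vi, qi, ui in zip(v, q, u)))
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]         # normalized attention scores
    d = len(enc_hidden[0])
    return [sum(a * h[j] for a, h in zip(alpha, enc_hidden)) for j in range(d)]

random.seed(1)
d = 3
H = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(4)]  # toy encoder states
s = [0.1] * d                                                      # toy decoder state
W = lambda: [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
ctx = single_attention(H, s, [1.0] * d, W(), W(), [0.0] * d)
print(len(avg_attention(H)), len(ctx))  # 3 3
```

Both variants produce a context vector of the encoder hidden dimension; Avg is fixed across time steps, while Single depends on the previous decoder state.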
In one embodiment, the word-level decoding module comprises a decoder and a projection layer; the decoder comprises a plurality of LSTM network units; the vocabulary includes the source abstract text tokens, the interaction names in the interaction set R, the component separator, the tuple separator, the target sequence start token, and the target sequence end token. Step S108 includes: feeding the sequence encoding vector and the target-word embedding of the previous time step into the decoder to obtain the hidden representation of the current token, whose expression is:

$$h_t^D = \mathrm{LSTM}\big(e_t \,\Vert\, y_{t-1},\; h_{t-1}^D\big) \quad (10)$$

where $h_t^D$ is the hidden representation of the current token, $e_t$ is the source text sequence encoding, $y_{t-1}$ is the target-word embedding of the previous time step, and $h_{t-1}^D$ is the previous (forward) hidden state of the LSTM.
The hidden representation of the current token is input into the projection layer, and triples are predicted word by word from the resulting output using a mask-based copy mechanism, yielding a set of interaction tuples.
Specifically, the target sequence $T$ is represented only by the word embedding vectors of its tokens, $\{y_0, y_1, \dots, y_m\}$, where $y_i$ is the embedding vector of the $i$-th token and $m$ is the length of the target sequence; $y_0$ and $y_m$ denote the embedding vectors of the SOS and EOS tokens, respectively. The decoder generates one token at a time and stops upon generating EOS. An LSTM is used as the decoder: at time step $t$, it takes the source text sequence encoding $e_t$ and the previous target-word embedding $y_{t-1}$ as input and generates the hidden representation $h_t^D$ of the current token. A projection with weight matrix $W_v$ and bias vector $b_v$ then maps $h_t^D$ onto the vocabulary $V$.
In one embodiment, the projection layer is a linear layer. Inputting the hidden representation of the current token into the projection layer and predicting triples word by word with a mask-based copy mechanism to obtain a set of interaction tuples comprises: inputting the hidden representation of the current token into the linear layer, which maps it onto the whole vocabulary to obtain the linear-layer output; the expression of the linear layer is:

$$\hat{o}_t = W_v h_t^D + b_v \quad (11)$$

where $\hat{o}_t$ is the projection of the decoder hidden vector onto the embeddings of all words in the vocabulary, $W_v$ is a weight matrix, and $b_v$ is the bias vector.

The entries of the linear-layer output corresponding to all vocabulary words other than the current source text sequence tokens, the interaction tokens, the separator tokens, the UNK token, and the target sequence end token are set to negative infinity, giving the masked linear-layer output. A softmax activation is then applied to the masked output to obtain the normalized scores over all words in the vocabulary at the current time step:

$$o_t = \mathrm{softmax}(\hat{o}_t) \quad (12)$$

where $o_t$ is the normalized score and $\hat{o}_t$ is the masked linear-layer output.

Triples are then predicted word by word according to the normalized scores to obtain a set of interaction tuples.
In particular, the projection layer in the word-level decoding module maps the decoder output onto the entire vocabulary. During training, the "gold label" target token is used directly, i.e., it receives the highest similarity score. During inference, however, the decoder could predict vocabulary tokens that are not present in the current text sequence, the interaction set, or the special tokens. To prevent this, a masking technique is applied when the projection layer performs the softmax operation: all vocabulary words are masked (excluded) except the current source text sequence tokens, the interaction tokens, the separator tokens (";" and "|"), and the UNK and EOS tokens. To mask a word, its corresponding logit is set to negative infinity, so its softmax score becomes zero. This ensures that entities are only copied from the source text sequence. The UNK token is kept in the softmax so that the model can handle new entities during inference; if the decoder predicts an UNK token, it is replaced with the source word having the highest attention score. After decoding is completed, all tuples are extracted based on the special tokens; duplicate tuples, tuples whose two entities are identical, and tuples whose interaction token is not in the interaction set are removed.
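The masking step above, setting the logits of excluded tokens to negative infinity so that their softmax probability is exactly zero, can be illustrated as follows; the vocabulary and scores are invented for the example:

```python
import math

def masked_softmax(logits, vocab, allowed):
    """Set the logits of tokens outside the allowed set to -inf, then apply
    softmax, so that disallowed tokens receive exactly zero probability."""
    masked = [s if tok in allowed else float("-inf") for s, tok in zip(logits, vocab)]
    m = max(masked)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in masked]  # math.exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

vocab = ["aspirin", "COX-1", "inhibitor", ";", "|", "UNK", "EOS", "unrelated"]
allowed = set(vocab) - {"unrelated"}  # source, interaction, separator, UNK, EOS tokens
probs = masked_softmax([1.0] * len(vocab), vocab, allowed)
print(probs[vocab.index("unrelated")])  # 0.0
```

Because the masked logit contributes `exp(-inf) == 0` to the partition sum, the excluded word can never be generated, which is what confines entity tokens to the source sequence.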
It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in sequence as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, nor do the order in which the sub-steps or stages are performed necessarily performed in sequence, but may be performed alternately or alternately with at least a portion of other steps or sub-steps of other steps.
In one illustrative embodiment, the method is mainly used to extract tuples with overlapping entities from biomedical abstract text, an important task on the international evaluation platform BioCreative. The task provides two annotated corpora. (i) The DrugProt corpus contains a total of 4227 manually annotated PubMed abstracts; the original data are split 6:2:2 into training, validation, and test sets. (ii) The ChemProt corpus consists of 1015 training abstracts, 611 development abstracts, and 799 test abstracts. Table 2 lists the statistics of each split.
Table 2 training/validation/test data distribution statistics for experimental corpus
(Table 2 is rendered as an image in the original and is not reproduced here.)
The present invention runs the Word2Vec tool on the ChemProt and DrugProt corpora shared by BioCreative VI and VII to initialize word embeddings. Character embeddings and interaction embeddings are randomly initialized, and all embeddings are updated during training. The word embedding dimension (Figure SMS_66), interaction embedding dimension (Figure SMS_67), character embedding dimension (Figure SMS_68) and character-based word feature dimension (Figure SMS_69) are set accordingly. To extract character-based word feature vectors, the CNN filter width is set to 3 and the maximum word length to 10. The hidden dimension (Figure SMS_70) of the decoder LSTM unit is set to 300, and the hidden dimension of the encoder's forward and backward LSTMs is set to 150. The model was trained in mini-batches of 6, and Adam was used to optimize the network parameters. Dropout layers with a fixed rate of 0.3 are used in the network to avoid overfitting.
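As a rough sketch, the hyperparameters above can be collected in a configuration object. The embedding dimensions are omitted here because their exact values appear only as figures in the source; the consistency check reflects that the concatenated forward and backward encoder states (150 + 150) match the decoder's hidden size of 300:

```python
from dataclasses import dataclass

# Hyperparameter sketch matching the settings stated above. Field names
# are illustrative; only the numeric values are taken from the text.

@dataclass
class WDecConfig:
    enc_hidden: int = 150      # per-direction Bi-LSTM hidden size
    dec_hidden: int = 300      # decoder LSTM hidden size
    cnn_filter_width: int = 3  # char-CNN filter width
    max_word_len: int = 10     # maximum characters per word
    batch_size: int = 6        # mini-batch size (optimized with Adam)
    dropout: float = 0.3       # fixed dropout rate

cfg = WDecConfig()
# Bi-LSTM output concatenates forward and backward states, so the
# encoder output dimension equals the decoder hidden dimension:
assert 2 * cfg.enc_hidden == cfg.dec_hidden
```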
The present invention compares the model with the top three models for text-mining drug-protein/chemical-protein interactions published by BioCreative VI and VII, respectively. Table 3 presents the Top-3 statistics for the two BioCreative tasks, analyzing the strategies and models proposed or used by each team. The models proposed by teams in BioCreative VI are less complex than those in BioCreative VII and lean more toward model integration and multi-layer designs. In contrast, the best performing models in BioCreative VII all come from fine-tuning, or even ensembles of, pre-trained language models. For the task of extracting knowledge from text sequences such as biomedical literature abstracts, under fixed experimental hardware conditions, the larger the model, the shorter the text sequence from which information can generally be extracted. Furthermore, not only does the large scale of a pre-trained language model make fine-tuning costly, it may also perform poorly when transferred to other highly specialized tasks.
Table 3 Statistics of the top-three works for each of the two BioCreative tasks
Figure SMS_71
The present invention is inspired by Peng et al. and Mehryary et al., and the benchmark also includes the most advanced joint entity and relation extraction models in Natural Language Processing (NLP). Table 4 gives statistics on the latest benchmarks from NLP. As shown in Table 4, these models are based on traditional Machine Learning (ML), Deep Learning (DL) or Reinforcement Learning (RL), and the strategies they employ fall into two categories. (i) Serial Strategy: a conventional approach for sentence-level tasks. First, all entities in a sentence are identified, and then the relationships of all candidate entity pairs are determined. The disadvantage of this approach is cascading errors: recognition errors in an earlier subtask propagate to the next subtask, while the extraction results of later subtasks cannot influence the earlier stage. (ii) Joint Strategy: a currently popular approach in which relations are judged while entities are identified, so that the two subtasks interact and can influence each other. However, whether entities are first identified and the candidate relations between them predicted afterwards, or both are done jointly, all such methods divide the task into entity recognition and relation prediction as two subtasks, adopting a dual-pipeline or layered approach.
In the work of the present invention, drug-protein entities and their interactions are extracted jointly from the biomedical literature in triplet form (triplet form means that the units within the triplet are formed as in Table 1, as whole units or whole words). Furthermore, when performing joint extraction tasks on different corpora, the neural networks employed by the model's encoder and decoder may vary for adaptability, and different networks and attention modeling methods will be selected.
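The triplet-form target representation can be sketched minimally: a set of (entity 1, entity 2, interaction) triples is linearized into the single word sequence the decoder is trained to emit. The concrete separator strings and entity names are assumptions for illustration; note that an entity shared by two tuples (an overlapping entity) simply appears twice in the target sequence:

```python
def triples_to_target(triples, comp_sep=" ; ", tuple_sep=" | "):
    # Linearize (entity1, entity2, interaction) triples into the target
    # word sequence generated one word per decoding time step.
    return tuple_sep.join(comp_sep.join(t) for t in triples)

target = triples_to_target([("aspirin", "COX-1", "inhibitor"),
                            ("aspirin", "COX-2", "inhibitor")])
print(target)
# → aspirin ; COX-1 ; inhibitor | aspirin ; COX-2 ; inhibitor
```

Because the entity "aspirin" overlaps between the two tuples, a word-level decoder can still recover both tuples from this single sequence, which is the motivation for the representation scheme.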
Table 4 statistics from the latest benchmarks for natural language processing
Figure SMS_72
Precision (Precision), Recall (Recall), and F1 score (F1 score) were used as evaluation metrics for this experiment, where TP (true positive) denotes the number of correctly classified samples, FP (false positive) denotes the number of incorrectly classified samples, and FN (false negative) denotes the number of samples that were wrongly left unclassified. Furthermore, let (Figure SMS_73) represent the value corresponding to a category and (Figure SMS_74) represent the number of predicted categories.
Precision considers only the positive samples that were predicted correctly, not all correctly predicted samples. It is computed by dividing the number of correctly predicted positive samples by the total number of samples the model predicted as positive, indicating how many of the predicted positives are indeed positive, as shown in equation (13):

Precision = TP / (TP + FP)    (13)
Recall is computed by dividing the number of correctly predicted positive samples by the actual number of positive samples in the test set; it indicates how many of the truly positive samples the classifier can retrieve, as shown in equation (14):

Recall = TP / (TP + FN)    (14)
The F1 score is the harmonic mean of precision and recall. Both precision and recall are desired to be high; however, the two metrics are in tension and cannot both be maximized simultaneously. The F1 score is therefore introduced as a suitable trade-off point to maximize the classifier's capability, as shown in equation (15):

F1 = 2 × Precision × Recall / (Precision + Recall)    (15)
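Equations (13)-(15) can be computed directly; the helper below is a straightforward sketch with zero-division guards added for degenerate cases:

```python
def precision_recall_f1(tp, fp, fn):
    # Precision = TP / (TP + FP); Recall = TP / (TP + FN);
    # F1 is the harmonic mean of the two, guarded against division by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(p, r, f1)  # ≈ (0.8, 0.8, 0.8)
```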
in the reference test of BioCreative, the comprehensive performance of WDec on chemProt and drug Prot corpora is superior to that of the top model in the original task, and the top model is shown in tables 5 and 6. In table 5, WDec achieves an F1 score 3% higher than the Peng model. In Table 6, the F1 score of a single WDec model on the drug prot corpus is 1.68% higher than that of the NLM-NCBI model. Furthermore, the integration mechanism is applied to the construction of the model for subsequent comparison with the most advanced model in NLP, which improves the F1 score of the model on the corpus by 3.11% and 2.95%, respectively, on average.
TABLE 5 comparison of Performance between the inventive model and the top three models on the ChemProt corpus
Figure SMS_78
TABLE 6 comparison of Performance between the inventive model and the top three models on the drug prot corpus
Figure SMS_79
Performance comparisons were made with the top three models of the two tasks listed in Tables 5 and 6: the well-performing models of BioCreative VI were compared against the two models constructed with the integration mechanism (the single model and the ensemble model), and the well-performing models of BioCreative VII were compared against the present model.
In the NLP benchmark, HRL achieves a significantly higher F1 score on the corpora. The HRL model and the model described in this invention were each run five times, and the performance of all models was compared on both corpora, as shown in Table 7. The F1 scores obtained by WDec on the ChemProt and DrugProt datasets were 1.44% and 0.54% higher than HRL, respectively. The present invention performs a statistical significance test (bootstrap paired t-test) between HRL and the model, showing that the higher F1 score obtained by the model is statistically significant (Figure SMS_80). Next, the outputs of the five runs of the model and the five runs of HRL are combined to build an Ensemble Model: for a given test case, a tuple is kept if it is extracted by a majority of the five runs (Figure SMS_81). This integration mechanism significantly improves both the precision and the recall on the two datasets. In the ensemble setting, the F1 scores of WDec on the ChemProt and DrugProt corpora are 4.59% and 1.3% higher, respectively, than HRL.
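The majority-vote integration mechanism can be sketched as follows: a tuple survives if it is extracted by more than half of the runs. The example tuples and interaction names are illustrative:

```python
from collections import Counter

def ensemble_tuples(runs, threshold=None):
    # Keep a tuple if a majority of the runs (more than half, by default)
    # extracted it. Each run is a list of (entity1, entity2, interaction).
    if threshold is None:
        threshold = len(runs) // 2 + 1
    counts = Counter(t for run in runs for t in set(run))
    return {t for t, c in counts.items() if c >= threshold}

runs = [
    [("a", "b", "inhibitor")],
    [("a", "b", "inhibitor"), ("c", "d", "agonist")],
    [("a", "b", "inhibitor")],
    [("c", "d", "agonist")],
    [("a", "b", "inhibitor")],
]
print(ensemble_tuples(runs))  # → {('a', 'b', 'inhibitor')}
```

Voting over deduplicated per-run outputs raises precision (spurious tuples rarely recur across runs) while stable tuples are retained, which matches the improvement in both precision and recall reported above.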
Table 7 comparison of the performance of all models on two corpora
Figure SMS_82
Table 8 shows the results of ablation experiments on the WDec attention mechanisms. Table 8 reports the performance of WDec with each attention mechanism and the effect of the mask-based replication mechanism. Single-attention word-level decoding achieves the highest F1 score, and the mask-based replication mechanism increases the F1 score by about 2-7% under each attention mechanism.
Table 8 ablation experiments for WDec attention mechanisms
Figure SMS_83
In summary, the present invention proposes a new interaction tuple representation scheme such that an encoder-decoder model that extracts one word at each time step can still find multiple tuples with overlapping entities, as well as tuples with multi-labeled entities, in sentences. This facilitates transferring the model from sentence-level knowledge extraction to document-level knowledge extraction. Experiments on the ChemProt and DrugProt corpora published by BioCreative VI & VII show that the method of the present invention is significantly better than all previous state-of-the-art models and establishes new benchmarks on these datasets.
In one embodiment, as shown in fig. 3, there is provided a biomedical knowledge extraction device comprising: an interaction tuple representation setting module 301, a source text sequence feature representation acquisition module 302, an encoding module 303, and a decoding and prediction module 304, wherein:
an interaction tuple representation setting module 301 for setting an interaction tuple representation; the interaction tuple representation takes the following form: the tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the components of an interaction tuple comprise: entity 1, entity 2 and the interaction.
The source text sequence feature representation obtaining module 302 is configured to obtain feature information of abstract text data of a biomedical document, and map words and characters in the abstract text data into embedded vectors to obtain feature representations of the source text sequence.
An encoding module 303, configured to input a feature representation of the source text sequence into an encoder, to obtain a hidden representation of the encoder; the encoder is used for encoding the characteristic information of the source text sequence by adopting Bi-LSTM; the hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector.
The decoding and predicting module 304 is configured to embed and input the sequence encoding vector and the target word of the previous time step into the word level decoding module, and predict the triplet in the form of a generated word to obtain a set of interaction tuples.
In one embodiment, the source text sequence feature representation acquisition module 302 is further configured to acquire the source abstract text of the biomedical document; construct a vocabulary from the source abstract text, the vocabulary comprising the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token; initialize word embedding for the source abstract text using the Word2Vec tool to obtain pre-trained word vectors; extract character-based word feature vectors from the source abstract text using a convolutional network with max pooling; and concatenate the pre-trained word vectors with the character-based word feature vectors to obtain the feature representation of the source text sequence.
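The token feature construction described above can be sketched in NumPy, assuming illustrative dimensions (8-dimensional character embeddings, 16 CNN filters of width 3, a 300-dimensional Word2Vec vector) and random values in place of trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def char_cnn_features(char_vecs, filters, width=3):
    # char_vecs: (word_len, char_dim); filters: (num_filters, width*char_dim).
    # Slide a window of `width` characters over the word, apply the filters,
    # then max-pool over positions to get one fixed-size vector per word.
    windows = [char_vecs[i:i + width].ravel()
               for i in range(len(char_vecs) - width + 1)]
    conv = np.stack(windows) @ filters.T     # (positions, num_filters)
    return conv.max(axis=0)                  # max pooling over positions

char_dim, num_filters = 8, 16
word = "aspirin"
char_vecs = rng.normal(size=(len(word), char_dim))   # stand-in char embeddings
filters = rng.normal(size=(num_filters, 3 * char_dim))
word_vec = rng.normal(size=300)                      # stand-in Word2Vec vector

# Concatenate the pre-trained word vector with the char-based features:
token_repr = np.concatenate([word_vec, char_cnn_features(char_vecs, filters)])
print(token_repr.shape)  # → (316,)
```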
In one embodiment, the encoder comprises a number of Bi-LSTM network elements, the number of Bi-LSTM network elements being the same as the length of the source text sequence; the encoding module 303 is further configured to input each token vector representation in the source text sequence into a corresponding Bi-LSTM network element of the encoder, respectively, to obtain a hidden representation of each token vector representation in the source text sequence.
In one embodiment, the attention mechanism in the encoding module is: avg attention mechanism, N-gram attention mechanism, or Single attention mechanism. Wherein the expression of Avg attention mechanism is shown as formula (1).
The expression of the N-gram attention mechanism of n=3 is shown in the formulas (2) to (4). The expressions of the Single attention mechanism are shown in the formulas (5) to (9).
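The exact formulas (1)-(9) appear only as figures in the source; as a non-authoritative sketch, a generic additive single-attention step consistent with the trainable parameters later described in claim 4 (two weight matrices, a score vector and a bias) might look like:

```python
import numpy as np

def single_attention(enc_hidden, dec_prev_hidden, W_u, W_q, v, b):
    # Additive attention sketch: score each encoder hidden state against
    # the previous decoder state, normalize with softmax, and return the
    # attention-weighted context (sequence encoding) vector.
    u = enc_hidden @ W_u.T              # (src_len, att_dim)
    q = dec_prev_hidden @ W_q.T + b     # (att_dim,)
    scores = np.tanh(u + q) @ v         # (src_len,)
    a = np.exp(scores - scores.max())   # numerically stable softmax
    a /= a.sum()
    return a @ enc_hidden               # context vector (enc_dim,)

rng = np.random.default_rng(1)
src_len, enc_dim = 5, 300               # illustrative sizes
ctx = single_attention(rng.normal(size=(src_len, enc_dim)),
                       rng.normal(size=enc_dim),
                       rng.normal(size=(enc_dim, enc_dim)) * 0.01,
                       rng.normal(size=(enc_dim, enc_dim)) * 0.01,
                       rng.normal(size=enc_dim) * 0.01,
                       np.zeros(enc_dim))
print(ctx.shape)  # → (300,)
```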
In one embodiment, the word level decoding module comprises: a decoder and a projection layer; the decoder comprises a plurality of LSTM network units; the vocabulary includes the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token. The decoding and predicting module 304 is further configured to embed the sequence encoding vector and the target word of the previous time step and input them into the decoder to obtain the hidden representation of the current token, whose expression is shown in formula (10); the hidden representation of the current token is then input into the projection layer, and the triples are predicted word by word from the resulting output using a mask-based replication mechanism, yielding a set of interaction tuples.
In one embodiment, the projection layer is a linear layer; the decoding and predicting module 304 is further configured to input the hidden representation of the current token into the linear layer, map the hidden representation of the current token to the entire vocabulary, and obtain an output of the linear layer; the expression of the linear layer is shown in formula (11).
Setting the outputs of the linear layer corresponding to all words in the vocabulary other than the tokens of the current source text sequence, the interaction tokens, the separator tokens, the UNK token and the target sequence end token to negative infinity yields the processed output of the linear layer. A softmax activation is then applied to the processed output of the linear layer to obtain the normalized scores over all word embeddings in the vocabulary at the current time step; the expression of the normalized score is shown in formula (12). The triples are then predicted word by word according to the normalized scores, yielding a set of interaction tuples.
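The masked projection and softmax of this embodiment can be sketched as follows: the logits of all disallowed vocabulary entries are set to negative infinity before normalization, so their probabilities are exactly zero. The logit values and index positions are illustrative:

```python
import numpy as np

def masked_vocab_softmax(logits, allowed_ids):
    # Replace the logit of every vocabulary entry that is not an allowed
    # token (source tokens, interaction tokens, separators, UNK,
    # end-of-sequence) with -inf; softmax then assigns it probability 0.
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    exp = np.exp(masked - masked[allowed_ids].max())  # exp(-inf) == 0.0
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 3.0])
probs = masked_vocab_softmax(logits, allowed_ids=[0, 3])
print(probs)  # entries 1 and 2 get exactly zero probability
```

Because disallowed words can never win the argmax, generated entities are restricted to copies from the source sequence plus the special tokens.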
For specific limitations on the biomedical knowledge extraction device, reference may be made to the above limitations on the biomedical knowledge extraction method, and no further description is given here. The respective modules in the above-described biomedical knowledge extraction apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 4. The computer device comprises a processor 401, a memory 402, a network interface 403, a display 404 and an input means 405 connected by a system bus. Wherein the processor 401 of the computer device is used to provide computing and control capabilities. The memory 402 of the computer device includes a non-volatile storage medium 4022, an internal memory 4021. The nonvolatile memory medium 4022 stores an operating system and a computer program. The internal memory 4021 provides an environment for the operation of the operating system and computer programs in the nonvolatile storage medium. The network interface 403 of the computer device is used for communication with an external terminal via a network connection. The computer program is executed by the processor 401 to implement a biomedical knowledge extraction method. The display screen 404 of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device 405 of the computer device may be a touch layer covered on the display screen, or may be a key, a track ball or a touch pad arranged on the casing of the computer device, or may be an external keyboard, a touch pad or a mouse, etc.
Those skilled in the art will appreciate that the structures shown in FIG. 4 are block diagrams only and do not constitute a limitation of the computer device on which the present aspects apply, and that a particular computer device may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples represent only a few embodiments of the present application; while they are described in relative detail, they are not to be construed as limiting the scope of the invention. It should be noted that various modifications and improvements could be made by those skilled in the art without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (10)

1. A method of biomedical knowledge extraction, the method comprising:
setting an interaction tuple representation; the interaction tuple representation takes the following form: the tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the components of an interaction tuple comprise: entity 1, entity 2 and the interaction;
Acquiring abstract text data of a biomedical document, and mapping the words and characters in the abstract text data into embedding vectors to obtain a feature representation of a source text sequence; the source text sequence includes all words and symbols in the abstract text data and the clue words of each interaction;
inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; the encoder is used for encoding the characteristic information of the source text sequence by adopting Bi-LSTM;
adopting an attention mechanism to the hidden representation of the encoder to obtain a sequence coding vector;
and embedding and inputting the sequence coding vector and the target word of the previous time step into a word level decoding module, and predicting the triples in the form of generated words to obtain a group of interaction tuples.
2. The method of claim 1, wherein obtaining summary text data of a biomedical document and mapping words and characters in the summary text data into embedded vectors to obtain a feature representation of a source text sequence, comprises:
acquiring a source abstract text of a biomedical document;
constructing a vocabulary from the source abstract text, the vocabulary comprising the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token;
initializing Word embedding for the source abstract text by using a Word2Vec tool to obtain a pre-training Word vector;
extracting word feature vectors based on characters from the source abstract text by adopting a convolution network with maximum pooling;
and connecting the pre-training word vector with the word feature vector based on the characters to obtain the feature representation of the source text sequence.
3. The method of claim 1, wherein the encoder comprises a number of Bi-LSTM network elements, the number of Bi-LSTM network elements being the same as the length of the source text sequence;
inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder, comprising:
and respectively inputting each token vector representation in the source text sequence into a corresponding Bi-LSTM network unit of an encoder to obtain a hidden representation of each token vector representation in the source text sequence.
4. The method of claim 1, wherein the hidden representation of the encoder is encoded using an attention mechanism to obtain the sequence encoded vector, wherein the attention mechanism is: avg attention mechanism, N-gram attention mechanism or Single attention mechanism;
Wherein the expression of Avg attention mechanism is:
Figure QLYQS_1
the expression of the N-gram attention mechanism for n=3 is:
Figure QLYQS_2
Figure QLYQS_3
Figure QLYQS_4
wherein (Figure QLYQS_7) is the last hidden state of the encoder, (Figure QLYQS_10) denotes the word-gram combination, (Figure QLYQS_12) is the g-gram word representation sequence of the input sequence, (Figure QLYQS_6) is the (Figure QLYQS_8)-th g-gram vector, (Figure QLYQS_11) is the normalized attention score of the (Figure QLYQS_13)-th g-gram vector, and (Figure QLYQS_5) and (Figure QLYQS_9) are trainable parameters;
the expression of the Single attention mechanism is:
Figure QLYQS_14
Figure QLYQS_15
Figure QLYQS_16
Figure QLYQS_17
Figure QLYQS_18
wherein (Figure QLYQS_19), (Figure QLYQS_20) and (Figure QLYQS_21) are all trainable attention parameters, (Figure QLYQS_22) is a bias vector, and (Figure QLYQS_23) is the normalized attention score of the (Figure QLYQS_24)-th word at decoding time step t.
5. The method of claim 1, wherein the word-level decoding module comprises: a decoder and a projection layer; the decoder comprises a plurality of LSTM networks;
the vocabulary includes the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token;
embedding and inputting the sequence coding vector and the target word of the previous time step into a word level decoding module, and predicting triples in the form of generated words to obtain a group of interaction tuples, wherein the method comprises the following steps:
embedding and inputting the sequence coding vector and the target word of the previous time step into the decoder to obtain the hidden representation of the current token; the hidden representation of the current token has the following expression:
Figure QLYQS_25
wherein (Figure QLYQS_26) is the hidden representation of the current token, (Figure QLYQS_27) is the source text sequence encoding, (Figure QLYQS_28) is the target word embedding of the previous time step, and (Figure QLYQS_29) is the forward hidden state of the LSTM;
and inputting the hidden representation of the current token into a projection layer, and predicting the triples in terms by adopting a mask-based replication mechanism according to the obtained output to obtain a group of interaction tuples.
6. The method of claim 5, wherein the projection layer is a linear layer;
inputting the hidden representation of the current token into a projection layer, and predicting triples in terms using a mask-based replication mechanism based on the resulting output to obtain a set of interacting tuples, comprising:
inputting the hidden representation of the current token into a linear layer, and mapping the hidden representation of the current token to the whole vocabulary to obtain the output of the linear layer; the expression of the linear layer is:
Figure QLYQS_30
wherein (Figure QLYQS_31) is the result of linearly projecting the hidden vector onto the embedded representations of all words in the vocabulary, (Figure QLYQS_32) is a weight matrix, and (Figure QLYQS_33) is a bias vector;
setting the outputs of the linear layer corresponding to all words in the vocabulary other than the tokens of the current source text sequence, the interaction tokens, the separator tokens, the UNK token and the target sequence end token to negative infinity, to obtain the processed output of the linear layer;
Activating by adopting a softmax function according to the processed output of the linear layer to obtain the normalized score of all words embedded in the vocabulary at the current time step; the normalized score is expressed as:
Figure QLYQS_34
wherein (Figure QLYQS_35) is the normalized score and (Figure QLYQS_36) is the processed output of the linear layer;
and predicting the triples in terms of words according to the normalized scores to obtain a group of interaction tuples.
7. A biomedical knowledge extraction device, the device comprising:
an interaction tuple representation setting module for setting an interaction tuple representation; the interaction tuple representation takes the following form: the tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the components of an interaction tuple comprise: entity 1, entity 2 and the interaction;
the source text sequence feature representation acquisition module is used for acquiring abstract text data of a biomedical document, and mapping the words and characters in the abstract text data into embedding vectors to obtain a feature representation of a source text sequence; the source text sequence includes all words and symbols in the abstract text data and the clue words of each interaction;
The encoding module is used for inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; adopting an attention mechanism to the hidden representation of the encoder to obtain a sequence coding vector; the encoder is used for encoding the characteristic information of the source text sequence by adopting Bi-LSTM;
and the decoding and predicting module is used for embedding and inputting the sequence coding vector and the target word of the previous time step into the word level decoding module, and predicting the triples in the form of generated words to obtain a group of interaction tuples.
8. The apparatus of claim 7, wherein the source text sequence feature representation acquisition module is further configured to acquire the source abstract text of the biomedical literature; construct a vocabulary from the source abstract text, the vocabulary comprising the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token; initialize word embedding for the source abstract text using the Word2Vec tool to obtain pre-trained word vectors; extract character-based word feature vectors from the source abstract text using a convolutional network with max pooling; and concatenate the pre-trained word vectors with the character-based word feature vectors to obtain the feature representation of the source text sequence.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 6 when executing the computer program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any one of claims 1 to 6.
CN202310495163.0A 2023-05-05 2023-05-05 Biomedical knowledge extraction method, device, computer equipment and storage medium Pending CN116227597A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310495163.0A CN116227597A (en) 2023-05-05 2023-05-05 Biomedical knowledge extraction method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310495163.0A CN116227597A (en) 2023-05-05 2023-05-05 Biomedical knowledge extraction method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116227597A true CN116227597A (en) 2023-06-06

Family

ID=86569724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310495163.0A Pending CN116227597A (en) 2023-05-05 2023-05-05 Biomedical knowledge extraction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116227597A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556277A (en) * 2024-01-12 2024-02-13 暨南大学 Initial alignment seed generation method for knowledge-graph entity alignment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180329883A1 (en) * 2017-05-15 2018-11-15 Thomson Reuters Global Resources Unlimited Company Neural paraphrase generator
CN109408812A (en) * 2018-09-30 2019-03-01 北京工业大学 A method of the sequence labelling joint based on attention mechanism extracts entity relationship
WO2020253669A1 (en) * 2019-06-19 2020-12-24 腾讯科技(深圳)有限公司 Translation method, apparatus and device based on machine translation model, and storage medium
WO2022212008A1 (en) * 2021-03-31 2022-10-06 Microsoft Technology Licensing, Llc Learning molecule graphs embedding using encoder-decoder architecture
CN115270761A (en) * 2022-07-28 2022-11-01 中国人民解放军国防科技大学 Relation extraction method fusing prototype knowledge
US11544943B1 (en) * 2022-05-31 2023-01-03 Intuit Inc. Entity extraction with encoder decoder machine learning model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180329883A1 (en) * 2017-05-15 2018-11-15 Thomson Reuters Global Resources Unlimited Company Neural paraphrase generator
CN109408812A (en) * 2018-09-30 2019-03-01 北京工业大学 A method of the sequence labelling joint based on attention mechanism extracts entity relationship
WO2020253669A1 (en) * 2019-06-19 2020-12-24 腾讯科技(深圳)有限公司 Translation method, apparatus and device based on machine translation model, and storage medium
WO2022212008A1 (en) * 2021-03-31 2022-10-06 Microsoft Technology Licensing, Llc Learning molecule graphs embedding using encoder-decoder architecture
US11544943B1 (en) * 2022-05-31 2023-01-03 Intuit Inc. Entity extraction with encoder decoder machine learning model
CN115270761A (en) * 2022-07-28 2022-11-01 National University of Defense Technology Relation extraction method fusing prototype knowledge

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TAPAS NAYAK et al.: "Effective Modeling of Encoder-Decoder Architecture for Joint Entity and Relation Extraction", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 34, no. 05, pages 8529 - 8532 *
HUANG Peixin; ZHAO Xiang; FANG Yang; ZHU Huiming; XIAO Weidong: "End-to-End Joint Extraction of Knowledge Triples Incorporating Adversarial Training", Journal of Computer Research and Development, no. 12, pages 20 - 32 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556277A (en) * 2024-01-12 2024-02-13 暨南大学 Initial alignment seed generation method for knowledge-graph entity alignment
CN117556277B (en) * 2024-01-12 2024-04-05 暨南大学 Initial alignment seed generation method for knowledge-graph entity alignment

Similar Documents

Publication Publication Date Title
Verga et al. Simultaneously self-attending to all mentions for full-abstract biological relation extraction
CN107506414B (en) Code recommendation method based on long-term and short-term memory network
CN110032739B (en) Method and system for extracting named entities of Chinese electronic medical record
US10949456B2 (en) Method and system for mapping text phrases to a taxonomy
CN113420163B (en) Heterogeneous information network knowledge graph completion method and device based on matrix fusion
CN113268612B (en) Heterogeneous information network knowledge graph completion method and device based on mean value fusion
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN113806493B (en) Entity relationship joint extraction method and device for Internet text data
CN114678061A (en) Protein conformation perception representation learning method based on pre-training language model
Yang et al. Modality-DTA: multimodality fusion strategy for drug–target affinity prediction
CN116227597A (en) Biomedical knowledge extraction method, device, computer equipment and storage medium
CN114881035A (en) Method, device, equipment and storage medium for augmenting training data
CN110008482A (en) Text handling method, device, computer readable storage medium and computer equipment
Dalai et al. Part-of-speech tagging of Odia language using statistical and deep learning based approaches
CN114090769A (en) Entity mining method, entity mining device, computer equipment and storage medium
Che et al. Fast and effective biomedical named entity recognition using temporal convolutional network with conditional random field
Devkota et al. A Gated Recurrent Unit based architecture for recognizing ontology concepts from biological literature
Passban Machine translation of morphologically rich languages using deep neural networks
He et al. Neural unsupervised reconstruction of protolanguage word forms
CN116414988A (en) Graph convolution aspect emotion classification method and system based on dependency relation enhancement
Gao et al. Citation entity recognition method using multi‐feature semantic fusion based on deep learning
Singh et al. Comparing RNNs and log-linear interpolation of improved skip-model on four Babel languages: Cantonese, Pashto, Tagalog, Turkish
Heaps et al. Toward detection of access control models from source code via word embedding
CN113961715A (en) Entity linking method, device, equipment, medium and computer program product
Wei et al. Biomedical named entity recognition via a hybrid neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination