CN116227597A - Biomedical knowledge extraction method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN116227597A (application CN202310495163A)
- Authority
- CN
- China
- Prior art keywords
- representation
- sequence
- word
- token
- source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application relates to a biomedical knowledge extraction method, apparatus, computer device, and storage medium. The method comprises the following steps: setting an interaction tuple representation; acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; inputting the feature representation of the source text sequence into an encoder to obtain the encoder's hidden representation; applying an attention mechanism to the hidden representation of the encoder to obtain a sequence encoding vector; and embedding the sequence encoding vector together with the target word of the previous time step, inputting them into a word-level decoding module, and predicting triples word by word to obtain a set of interaction tuples. The method proposes a new interaction tuple representation scheme so that an encoder-decoder model that emits one word per time step can still find, within a sentence, multiple tuples with overlapping entities as well as tuples whose entities carry multiple labels.
Description
Technical Field
The present application relates to the field of big data technologies, and in particular, to a method and apparatus for extracting biomedical knowledge, a computer device, and a storage medium.
Background
As basic research, the targeted extraction, from biomedical literature, of knowledge such as drug-protein entities and their interactions needed for medical research provides powerful support for drug discovery, drug repurposing, drug design, and bioinformatics knowledge bases built in the form of knowledge graphs. However, as researchers have studied this task in depth, problems at both the macroscopic and microscopic levels have continued to emerge.
Extracting drug-protein interaction tuples from biomedical abstracts is a challenging task. Macroscopically, prior work mainly divides the task into two parts, entity recognition and relation extraction. Models built this way not only ignore context information but also raise questions about the ordering of the two subtasks and the sharing of feature information between them. Microscopically, differences in entity length, the presence of multiple tuples, and entity overlap between tuples hamper the accuracy of triple extraction, which can be divided into three categories: (1) No Entity Overlap (NEO): a sequence contains one or more tuples, but no entity is shared between them. (2) Entity Pair Overlap (EPO): a sequence contains multiple tuples, and at least two tuples share both entities, in the same or opposite order. (3) Single Entity Overlap (SEO): a sequence contains multiple tuples, and at least two tuples share exactly one entity. It should be noted that a sequence may exhibit both entity pair overlap and single entity overlap.
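For illustration only, the three overlap categories can be checked mechanically. The following sketch is hypothetical and not part of the claimed method; tuples are given as (entity 1, interaction, entity 2):

```python
def overlap_categories(tuples):
    """Classify the overlap pattern of a set of (entity1, interaction, entity2)
    tuples following the NEO/EPO/SEO definitions. A sequence may be both
    EPO and SEO at once."""
    epo = seo = False
    for i in range(len(tuples)):
        for j in range(i + 1, len(tuples)):
            e1a, _, e2a = tuples[i]
            e1b, _, e2b = tuples[j]
            shared = {e1a, e2a} & {e1b, e2b}
            if len(shared) == 2:
                epo = True   # same entity pair, same or reversed order
            elif len(shared) == 1:
                seo = True   # exactly one shared entity
    cats = set()
    if epo:
        cats.add("EPO")
    if seo:
        cats.add("SEO")
    return cats or {"NEO"}
```

A sequence with tuples ("a", "r1", "b") and ("b", "r2", "a") is EPO; adding ("a", "r3", "c") makes it EPO and SEO at the same time.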
Disclosure of Invention
In view of the foregoing, it is desirable to provide a biomedical knowledge extraction method, apparatus, computer device, and storage medium.
A biomedical knowledge extraction method, the method comprising:
setting an interaction tuple representation; the interaction tuple representation takes the form: the components of a tuple are separated by a component separator, multiple tuples are separated by a tuple separator, and the components of an interaction tuple comprise: entity 1, entity 2, and the interaction.
Acquiring abstract text data of a biomedical document, and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; the source text sequence includes all words and symbols in the abstract text data, as well as the clue words of each interaction.
Inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; the encoder is configured to encode the characteristic information of the source text sequence using Bi-LSTM.
The hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector.
And embedding the sequence encoding vector together with the target word of the previous time step, inputting them into a word-level decoding module, and predicting the triples word by word to obtain a set of interaction tuples.
A biomedical knowledge extraction device, the device comprising:
an interaction tuple representation setting module for setting an interaction tuple representation; the interaction tuple representation takes the form: the components of a tuple are separated by a component separator, multiple tuples are separated by a tuple separator, and the components of an interaction tuple comprise: entity 1, entity 2, and the interaction.
A source text sequence feature representation acquisition module for acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; the source text sequence includes all words and symbols in the abstract text data, as well as the clue words of each interaction.
An encoding module for inputting the feature representation of the source text sequence into an encoder to obtain the hidden representation of the encoder, the encoder encoding the feature information of the source text sequence using a Bi-LSTM, and for applying an attention mechanism to the hidden representation of the encoder to obtain a sequence encoding vector.
And a decoding and prediction module for embedding the sequence encoding vector together with the target word of the previous time step, inputting them into the word-level decoding module, and predicting triples word by word to obtain a set of interaction tuples.
In one embodiment, the source text sequence feature representation acquisition module is further configured to acquire source abstract text of a biomedical document; constructing a vocabulary according to the source abstract text, wherein the vocabulary comprises a source abstract text token, an interaction name, a component separator, a target sequence start token and a target sequence end token in an interaction set R; initializing Word embedding is carried out on the source abstract text by using a Word2Vec tool, so as to obtain a pre-training Word vector; extracting word feature vectors based on characters from the source abstract text by adopting a convolution network with maximum pooling; and connecting the pre-training word vector with the word feature vector based on the characters to obtain the feature representation of the source text sequence.
A computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of any of the methods described above.
The biomedical knowledge extraction method, apparatus, computer device, and storage medium above comprise the following steps: setting an interaction tuple representation; acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; inputting the feature representation of the source text sequence into an encoder, which encodes the feature information of the source text sequence using a Bi-LSTM, to obtain the encoder's hidden representation; applying an attention mechanism to the hidden representation of the encoder to obtain a sequence encoding vector; and embedding the sequence encoding vector together with the target word of the previous time step, inputting them into the word-level decoding module, and predicting triples word by word to obtain a set of interaction tuples. The method proposes a new interaction tuple representation scheme so that an encoder-decoder model that emits one word per time step can still find, within a sentence, multiple tuples with overlapping entities as well as tuples whose entities carry multiple labels.
Drawings
FIG. 1 is a flow chart of a method for extracting knowledge of biological medicine according to an embodiment;
FIG. 2 is a schematic diagram of a word-level decoding-based model in one embodiment;
FIG. 3 is a block diagram of a biomedical knowledge extraction device in one embodiment;
fig. 4 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a biomedical knowledge extraction method is provided, the method comprising the steps of:
step S100: setting an interaction tuple representation; the interaction tuple is expressed in the form of: the tuple components are separated by a component separator, the plurality of tuples are separated by a tuple component separator, and the interacting tuple components comprise: entity 1, entity 2 and interactions.
Specifically, the interaction tuple is a triple, represented as: entity 1; entity 2; interaction. ";" is used as the separator between tuple components (i.e. the component separator is ";"), and "|" separates multiple tuples (the tuple separator is "|"). Table 1 gives an example of this interaction tuple representation. These special tokens (";" and "|") make it possible to represent, in a simple manner, multiple interaction tuples with overlapping entities and with entities of different lengths, and they allow the interaction tuples to be extracted easily once sequence generation ends during inference. Because of this unified interaction tuple representation scheme, entity tokens, interaction tokens, and special tokens are treated alike, and a shared vocabulary containing all of these tokens is used by the encoder and decoder. The abstract text of the input biomedical document (the text is a string of words, or a sequence of characters, and may be called a text sequence) contains the clue words of each interaction, which helps generate the interaction tokens. The two special tokens let the encoder-decoder model distinguish the beginning of an interaction tuple from the beginning of a tuple component. To extract interaction tuples from a text sequence with an encoder-decoder model, the model must generate the entity tokens, find the clue words of the interactions and map them to interaction tokens, and generate the special tokens at the appropriate positions.
Table 1: Example of the interaction tuple representation
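For illustration only (the function names are hypothetical), the separator scheme described above, with ";" between components and "|" between tuples, can be serialized and parsed as follows:

```python
COMP_SEP, TUPLE_SEP = ";", "|"

def serialize(tuples):
    """Render a list of (entity1, entity2, interaction) triples as one
    target string in the patent's representation scheme."""
    return f" {TUPLE_SEP} ".join(f" {COMP_SEP} ".join(t) for t in tuples)

def parse(target):
    """Recover the triples from a generated sequence, dropping malformed
    and duplicate tuples as described for the inference stage."""
    seen, out = set(), []
    for chunk in target.split(TUPLE_SEP):
        parts = tuple(p.strip() for p in chunk.split(COMP_SEP))
        if len(parts) == 3 and parts not in seen:
            seen.add(parts)
            out.append(parts)
    return out
```

For example, serializing the two tuples (aspirin, COX-1, inhibitor) and (aspirin, COX-2, inhibitor) yields "aspirin ; COX-1 ; inhibitor | aspirin ; COX-2 ; inhibitor", from which parse recovers both triples even though they share an entity.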
Step S102: acquiring abstract text data of a biomedical document, and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; the source text sequence includes all words and symbols in the abstract text data, as well as the clue words of each interaction.
Specifically, an embedding vector is the single corresponding vector to which a word, symbol, or sequence is mapped.
Step S104: inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; the encoder is configured to encode the characteristic information of the source text sequence using Bi-LSTM.
Step S106: the hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector.
Specifically, in order to obtain the source context feature information, a hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector. Preferably, the attention mechanism may be an Avg attention mechanism, an N-gram attention mechanism or a Single attention mechanism.
Step S108: embedding the sequence encoding vector together with the target word of the previous time step, inputting them into the word-level decoding module, and predicting the triples word by word to obtain a set of interaction tuples.
Specifically, a word level decoding-based model (WDec) is formed by the encoder, the attention mechanism and the word level decoding module, and the model is structured as shown in FIG. 2.
The word-level decoding module is configured to decode the input sequence encoding vector and the target word embedding of the previous time step using LSTM network units, map the decoding result onto the vocabulary through a projection layer, and then perform prediction using a softmax operation combined with a masking technique to obtain a set of interaction tuples.
The word-level decoding module generates one token at a time and stops after generating the target sequence end token (EOS).
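A minimal sketch of this one-token-at-a-time generation loop, where `step_fn` is a hypothetical stand-in for the LSTM decoder plus projection layer (it maps the previous token to scores over the vocabulary):

```python
def greedy_decode(step_fn, sos="SOS", eos="EOS", max_len=50):
    """Word-level decoding loop: feed the previous token to step_fn,
    take the argmax over the vocabulary, stop when EOS is produced."""
    prev, out = sos, []
    for _ in range(max_len):
        scores = step_fn(prev)               # token -> score dict
        prev = max(scores, key=scores.get)   # greedy argmax
        if prev == eos:
            break
        out.append(prev)
    return out
```

In the full model `step_fn` would also depend on the attention-weighted sequence encoding vector; here it is collapsed to a lookup purely for illustration.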
In the above-mentioned biomedical knowledge extraction method, the method comprises: setting an interaction tuple representation; acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; inputting the feature representation of the source text sequence into an encoder, which encodes the feature information of the source text sequence using a Bi-LSTM, to obtain the encoder's hidden representation; applying an attention mechanism to the hidden representation of the encoder to obtain a sequence encoding vector; and embedding the sequence encoding vector together with the target word of the previous time step, inputting them into the word-level decoding module, and predicting triples word by word to obtain a set of interaction tuples. The method proposes a new interaction tuple representation scheme so that an encoder-decoder model that emits one word per time step can still find, within a sentence, multiple tuples with overlapping entities as well as tuples whose entities carry multiple labels.
In one embodiment, step S102 includes: acquiring the source abstract text of a biomedical document; building a vocabulary from the source abstract text, the vocabulary comprising the source abstract text tokens, the interaction names in the interaction set R, the component separator, the tuple separator, the target sequence start token, and the target sequence end token; extracting word embedding and character embedding feature vectors for all words and characters in the source abstract text, using a convolutional network with max pooling for the character-based features; and concatenating the character-based feature vectors with the word embedding feature vectors to obtain the feature representation of the source text sequence.
Specifically, a vocabulary V is created from the source abstract text tokens, the interaction names in the interaction set R, the special separators (the component separator ";" and the tuple separator "|"), the target sequence start token (SOS), and the target sequence end token (EOS).
Word-level embedding consists of two parts: (1) the pre-trained word vector; (2) the character-based feature vector. A word embedding layer E_w ∈ R^{|V| × d_w} and a character embedding layer E_c ∈ R^{|A| × d_c} are used, where d_w is the dimension of the word vectors, A is the character alphabet of the input text sequence tokens, and d_c is the dimension of the character embedding vectors. A convolutional neural network with max pooling is used to extract a character-based feature vector of size d_f for each word. The word embedding and the character-based feature vector are concatenated (‖) to obtain the representation of an input token.
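The character-level pipeline above — a convolution over character embeddings followed by max-over-time pooling — can be sketched in miniature. This pure-Python toy is an illustrative assumption, not the patented implementation, and the helper name is hypothetical:

```python
def char_word_feature(char_vecs, filters, width=3):
    """Toy character CNN with max-over-time pooling. char_vecs is a list of
    per-character embedding vectors for one word; each filter holds
    width * char_dim weights spanning `width` consecutive characters.
    Returns one pooled feature per filter (assumes the word has at least
    `width` characters)."""
    feats = []
    for f in filters:
        best = float("-inf")
        for i in range(len(char_vecs) - width + 1):
            # flatten the window of character vectors, then dot with the filter
            window = [x for v in char_vecs[i:i + width] for x in v]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        feats.append(best)
    return feats
```

The returned per-filter features would then be concatenated with the pre-trained word vector to form the input token representation described above.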
In one embodiment, the encoder comprises a number of Bi-LSTM network elements, the number of Bi-LSTM network elements being the same as the length of the source text sequence; step S104 includes: and respectively inputting each token vector representation in the source text sequence into a corresponding Bi-LSTM network element of the encoder to obtain a hidden representation of each token vector representation in the source text sequence.
Specifically, the source text sequence S is represented by its token vectors as S = {x_1, x_2, ..., x_n}, where x_i is the vector representation of the i-th word and n is the length of the source text sequence S. The vectors x_i are passed to a bidirectional long short-term memory network (Bi-LSTM) to obtain the hidden representations h_i^E. The hidden dimension of the forward and backward LSTMs of the Bi-LSTM is set to d_h/2, so that h_i^E ∈ R^{d_h}, where d_h is the hidden dimension of the decoder LSTM of the word-level decoding module.
In one embodiment, the hidden representation of the encoder is passed through an attention mechanism to obtain the sequence encoding vector; the attention mechanism in this step is the Avg attention mechanism, the N-gram attention mechanism, or the Single attention mechanism. The Avg attention mechanism is expressed as:
e_t = (1/n) · Σ_{i=1}^{n} h_i^E
i.e. the sequence encoding vector is the mean of the encoder hidden representations and is identical at every decoding time step.
The expression of the N-gram attention mechanism for N = 3 is:
α_i^g = softmax_i((h_n^E)^T v_i^g),   e_t = W [h_n^E ‖ Σ_{g=1}^{3} W^g (Σ_i α_i^g v_i^g)]
wherein h_n^E is the last hidden state of the encoder, w_g denotes the word g-gram combinations, v^g is the g-gram representation sequence of the input sequence, v_i^g is the i-th g-gram vector (the 2-gram and 3-gram representations are obtained by mean pooling), α_i^g is the normalized attention score of the i-th g-gram vector, and W and W^g are trainable parameters.
The expression of the Single attention mechanism is:
u_t^i = v_a^T · tanh(W_a h_i^E + U_a h_{t-1}^D + b_a),   α_t^i = softmax_i(u_t^i),   e_t = Σ_{i=1}^{n} α_t^i h_i^E
wherein W_a, U_a, and v_a are all trainable attention parameters, b_a is a bias vector, and α_t^i is the normalized attention score of the i-th word at decoding time step t.
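As a hedged illustration of the Single (additive) attention variant, the following pure-Python sketch computes the normalized scores and context vector for one decoding step; W1, W2, and v are toy stand-ins for the trainable parameters, the decoder state plays the role of the previous decoder hidden state, and the bias term is omitted for brevity:

```python
import math

def single_attention(enc_hidden, dec_state, W1, W2, v):
    """Additive attention over encoder hidden vectors h_i at one decoding
    step: u_i = v . tanh(W1 h_i + W2 s), alpha = softmax(u),
    context = sum_i alpha_i h_i."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]
    ws = matvec(W2, dec_state)
    scores = []
    for h in enc_hidden:
        z = [math.tanh(a + b) for a, b in zip(matvec(W1, h), ws)]
        scores.append(sum(vi * zi for vi, zi in zip(v, z)))
    m = max(scores)                                   # stabilized softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    context = [sum(a * h[d] for a, h in zip(alphas, enc_hidden))
               for d in range(len(enc_hidden[0]))]
    return alphas, context
```

With identity-like toy weights, an encoder position with larger activations receives the larger attention weight, and the context vector is pulled toward it.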
In one embodiment, the word-level decoding module comprises a decoder and a projection layer; the decoder comprises a plurality of LSTM network units; the vocabulary includes the source abstract text tokens, the interaction names in the interaction set R, the component separator, the tuple separator, the target sequence start token, and the target sequence end token. Step S108 includes: embedding the sequence encoding vector together with the target word of the previous time step and inputting them into the decoder to obtain the hidden representation of the current token, expressed as:
h_t^D = LSTM(e_t ‖ ŷ_{t-1}, h_{t-1}^D)
wherein h_t^D is the hidden representation of the current token, e_t is the source text sequence encoding, ŷ_{t-1} is the target word embedding of the previous time step, and h_{t-1}^D is the previous hidden state of the decoder LSTM.
The hidden representation of the current token is input into the projection layer and, based on the resulting output, the triples are predicted word by word using a mask-based replication (copy) mechanism, yielding a set of interaction tuples.
Specifically, the target sequence T is represented only by the word embedding vectors of its tokens, T = {y_0, y_1, ..., y_m}, where y_j is the embedding vector of the j-th token and m is the length of the target sequence; y_0 and y_m denote the embedding vectors of the SOS and EOS tokens, respectively. The decoder generates one token at a time and stops after generating EOS. An LSTM is used as the decoder: at time step t, the decoder takes the source text sequence encoding e_t and the previous target word embedding ŷ_{t-1} as input and generates the hidden representation h_t^D of the current token. The sequence encoding vector e_t is obtained by the attention mechanism. A matrix with weights W_V and a bias vector b_V is then used to project h_t^D onto the vocabulary V.
In one embodiment, the projection layer is a linear layer. Inputting the hidden representation of the current token into the projection layer and predicting the triples word by word using the mask-based replication mechanism comprises: inputting the hidden representation of the current token into the linear layer, which maps it onto the whole vocabulary to obtain the output of the linear layer, expressed as:
o_t = W_V h_t^D + b_V
wherein o_t contains one score for each word embedded in the vocabulary, obtained by linearly projecting the hidden vector; W_V is a weight matrix and b_V is the bias vector.
Setting the linear-layer outputs corresponding to all vocabulary words other than the current source text sequence tokens, the interaction tokens, the separator tokens, the UNK token, and the target sequence end token to negative infinity yields the masked linear-layer output. A softmax activation is then applied to the masked output to obtain the normalized scores of all words embedded in the vocabulary at the current time step:
ŷ_t = softmax(ô_t)
wherein ô_t is the masked linear-layer output. The triples are then predicted word by word according to the normalized scores, yielding a set of interaction tuples.
In particular, the projection layer in the word-level decoding module maps the decoder output onto the entire vocabulary. During training, the gold-label target token is used directly, i.e. the token with the highest similarity score. During inference, however, the decoder could predict tokens from the vocabulary that are not present in the current text sequence, in the interaction set, or among the special tokens. To prevent this, a masking technique is applied when the projection layer performs the softmax operation: all vocabulary words are masked (excluded) except the current source text sequence tokens, the interaction tokens, the separator tokens (";" and "|"), the UNK token, and the EOS token. To mask (exclude) a word from the softmax, its corresponding linear-layer value is set to negative infinity, so that its softmax score is zero. This ensures that entities are only copied from the source text sequence. The UNK token is included in the softmax so that the model can handle new entities during inference: if the decoder predicts an UNK token, it is replaced with the source word that has the highest attention score. During inference, after decoding is complete, all tuples are extracted based on the special tokens, and duplicate tuples, tuples whose two entities are identical, and tuples whose interaction token is not in the interaction set are removed.
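The masking technique described above, setting disallowed scores to negative infinity before the softmax, can be illustrated with a small sketch; the function name and the score dictionary are hypothetical:

```python
import math

def masked_softmax(logits, allowed):
    """logits: token -> raw score; tokens outside `allowed` (the source
    tokens, interaction tokens, separators, UNK, and EOS) get -inf, so
    their softmax probability is exactly zero."""
    masked = {t: (s if t in allowed else float("-inf"))
              for t, s in logits.items()}
    m = max(masked.values())                      # stabilize the exponent
    exps = {t: math.exp(s - m) for t, s in masked.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}
```

A word outside the allowed set, such as one not appearing in the source sentence, keeps zero probability even if its raw score was the largest, which is what guarantees that entities are only copied from the source text sequence.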
It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; nor do these sub-steps or stages need to be performed sequentially, as they may be performed in turn or alternately with at least a portion of other steps, or of the sub-steps or stages of other steps.
In one illustrative embodiment, the method is primarily used to extract tuples with overlapping entities from biomedical abstract texts, an important task on the international evaluation platform BioCreative. The task provides two annotated corpora. (i) The DrugProt corpus contains a total of 4227 manually annotated PubMed abstracts; the original data are divided into a training set, a validation set and a test set with a 6:2:2 split. (ii) The ChemProt corpus consists of 1015 training abstracts, 611 development abstracts and 799 test abstracts. Table 2 lists the statistics for each split.
Table 2 training/validation/test data distribution statistics for experimental corpus
The present invention runs the Word2Vec tool on the ChemProt and DrugProt corpora shared by BioCreative VI and VII to initialize word embeddings. Character embeddings and interaction embeddings are randomly initialized, and all embeddings are updated during training. The word embedding dimension, the interaction embedding dimension, the character embedding dimension and the character-based word feature dimension are set accordingly. To extract character-based word feature vectors, the CNN filter width is set to 3 and the maximum word length to 10. The hidden dimension of the decoder LSTM units is set to 300, and the hidden dimensions of the forward and backward LSTMs of the encoder are set to 150. The model is trained in mini-batches of 6, and Adam is used to optimize the network parameters. Dropout layers with a fixed dropout rate of 0.3 are used in the network to avoid overfitting.
The present invention compares the model with the top-three models for text-mining drug-protein/chemical-protein interactions published by BioCreative VI and VII, respectively. Table 3 summarizes the Top-3 entries of the two BioCreative tasks and analyzes the strategies and models proposed or used by each team. The models proposed by teams in BioCreative VI are less complex than those in BioCreative VII and lean more toward model integration and multi-layer designs. By contrast, the best-performing models in BioCreative VII all come from fine-tuning, or even ensembles, of pre-trained language models. For the task of extracting knowledge from text sequences such as biomedical literature abstracts, under fixed experimental hardware conditions, the larger the model, the shorter the text sequences from which information can generally be extracted. Furthermore, the large scale of pre-trained language models makes them ill-suited to fine-tuning, and they may not perform well when transferred to other highly specialized tasks.
Table 3 Statistics of the top-three teams on the two BioCreative tasks, respectively
Inspired by Peng et al. and Mehryary et al., the benchmark also includes the most advanced joint entity and relation extraction models in natural language processing (NLP). Table 4 gives statistics of the latest benchmarks from NLP. As shown in Table 4, these models fall into traditional machine learning (ML), deep learning (DL) and reinforcement learning (RL) based approaches, and the strategies they employ can be classified into two categories: (i) Serial Strategy: a conventional approach targeting sentence-level tasks, in which all entities in a sentence are first identified and the relationships of all candidate entity pairs are then determined. The disadvantage of this approach is cascading errors: recognition errors in an earlier subtask propagate to the next subtask, while the extraction results of later subtasks cannot influence the earlier ones. (ii) Joint Strategy: a currently popular approach in which relations are judged while entities are identified, so that the two subtasks interact and can influence each other. However, whether entities are first identified and the candidate relationships between them then predicted, or both are done jointly, all these methods divide the task into entity recognition and relation prediction and treat them as two subtasks, using a dual-pipeline or hierarchical method.
In the work of the present invention, drug-protein entities and their interactions are extracted jointly from the biomedical literature in triplet form (triplet form means that the units within the triplet are organized as in Table 1, as a whole or as whole words). Furthermore, when performing joint extraction tasks on different corpora, the neural networks employed by the encoder and decoder of the model may vary for adaptability, and different networks and attention modeling methods may be selected.
Table 4 statistics from the latest benchmarks for natural language processing
Precision (Precision), recall (Recall), and F1 score (F1 score) were used as evaluation indicators for this experiment, where TP (true positive) represents the number of correctly classified and partitioned samples; FP (false positive) represents the classification and the number of incorrectly classified samplesAn amount of; FN (false negative) indicates the number of unclassified samples, which is wrong. And, set upTo represent the value corresponding to the category, +.>To represent the number of prediction categories.
The precision rate applies only to the positive samples that are predicted correctly, not to all correctly predicted samples. It is calculated by dividing the number of correctly predicted positive samples by the total number of samples the model predicts as positive, and it indicates how many of the predicted positive samples are indeed positive, as shown in equation (13): Precision = TP / (TP + FP).
Recall is calculated by dividing the number of correctly predicted positive samples by the actual number of positive samples in the test set; it shows how many of the truly positive samples the classifier can recall, as shown in equation (14): Recall = TP / (TP + FN).
The F1 score is the harmonic mean of precision and recall. Ideally both would be high; however, the two indicators are in tension and cannot both be maximized. The F1 score is therefore introduced as a suitable trade-off to maximize the classifier's capability, as shown in equation (15): F1 = 2 × Precision × Recall / (Precision + Recall).
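The three metrics above, equations (13)-(15), follow the standard definitions and can be computed directly from the TP/FP/FN counts; the sketch below is illustrative:

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (13)-(15): precision, recall, and their harmonic mean (F1).
    Guards against division by zero when a denominator is empty."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```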
In the benchmark tests of BioCreative, the comprehensive performance of WDec on the ChemProt and DrugProt corpora is superior to that of the top models in the original tasks, as shown in Tables 5 and 6. In Table 5, WDec achieves an F1 score 3% higher than the Peng model. In Table 6, the F1 score of a single WDec model on the DrugProt corpus is 1.68% higher than that of the NLM-NCBI model. Furthermore, an integration mechanism is applied in constructing the model for subsequent comparison with the most advanced models in NLP; it improves the model's F1 scores on the two corpora by 3.11% and 2.95%, respectively, on average.
TABLE 5 comparison of Performance between the inventive model and the top three models on the ChemProt corpus
TABLE 6 comparison of Performance between the inventive model and the top three models on the DrugProt corpus
Performance comparisons were made against the top-three models of the two tasks listed in Tables 5 and 6. The well-performing models of BioCreative VI are compared with two variants of the model constructed with the integration mechanism, a single model and an ensemble model; the well-performing models of BioCreative VII are likewise compared with the model.
In the NLP benchmark, HRL achieves a significantly higher F1 score on the corpora. Their model and the model described in this invention were each run five times and compared on both corpora. As shown in Table 7, the F1 scores obtained by WDec on the ChemProt and DrugProt datasets were 1.44% and 0.54% higher than HRL, respectively. The present invention performed a statistical significance test (bootstrap paired t-test) between HRL and the model, which shows that the higher F1 score obtained by the model is statistically significant. Next, the outputs of five runs of the model and five runs of HRL were combined to build an ensemble model. For each test case, a tuple is kept if it is extracted in the majority of the five runs. This integration mechanism significantly improves the precision on both datasets, as well as the recall. In the ensemble scenario, the F1 scores of WDec on the ChemProt and DrugProt corpora are 4.59% and 1.3% higher, respectively, than those of HRL.
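The majority-vote ensembling described above can be sketched as follows. This is a hedged sketch: the tuple representation as `(entity1, entity2, interaction)` triples and the majority threshold (at least 3 of 5 runs) are assumptions consistent with the description, not the verbatim implementation.

```python
from collections import Counter

def ensemble_tuples(run_outputs):
    """Keep a tuple only if it is extracted by a strict majority of runs.
    `run_outputs` is a list of sets of (entity1, entity2, interaction) tuples."""
    counts = Counter(t for run in run_outputs for t in set(run))
    majority = len(run_outputs) // 2 + 1  # e.g. 3 out of 5 runs
    return {t for t, c in counts.items() if c >= majority}
```

Majority voting raises precision because spurious tuples rarely recur across runs, while tuples extracted consistently are kept, which matches the precision and recall gains reported for the ensemble.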
Table 7 comparison of the performance of all models on two corpora
Table 8 shows the results of ablation experiments on the WDec attention mechanisms. In Table 8, the present invention reports the performance of WDec with different attention mechanisms and the effect of the mask-based replication mechanism. Single-attention word-level decoding achieves the highest F1 score, and the mask-based replication mechanism increases the F1 score by about 2-7% under each attention mechanism.
Table 8 ablation experiments for WDec attention mechanisms
In summary, the present invention proposes a new interaction tuple representation scheme such that an encoder-decoder model that extracts one word at each time step can still find multiple tuples with overlapping entities and tuples with multi-labeled entities in a sentence. This facilitates transferring the model from sentence-level knowledge extraction to document-level knowledge extraction. Experiments on the ChemProt and DrugProt corpora published by BioCreative VI & VII show that the method of the present invention is significantly better than all previous state-of-the-art models and establishes new benchmarks on these datasets.
In one embodiment, as shown in fig. 3, there is provided a biomedical knowledge extraction device comprising: an interaction tuple representation setting module 301, a source text sequence feature representation acquisition module 302, an encoding module 303, and a decoding and prediction module 304, wherein:
an interaction tuple representation setting module 301 for setting an interaction tuple representation; the interaction tuple representation takes the form: tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the interaction tuple components comprise: entity 1, entity 2 and the interaction.
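As an illustration only, assuming ";" as the component separator and "|" as the tuple separator (the separator tokens named elsewhere in the description), a decoded target sequence can be parsed back into interaction tuples as below; the example entities are hypothetical:

```python
def parse_interaction_tuples(decoded):
    """Split a linearized target sequence into (entity1, entity2, interaction)
    triples: components separated by ';', tuples separated by '|'."""
    tuples = []
    for chunk in decoded.split("|"):
        parts = [p.strip() for p in chunk.split(";")]
        if len(parts) == 3 and all(parts):  # skip malformed or empty chunks
            tuples.append((parts[0], parts[1], parts[2]))
    return tuples
```

Note that this representation is what lets a single generated word sequence carry several tuples sharing an entity, e.g. "aspirin ; COX-1 ; inhibitor | aspirin ; COX-2 ; inhibitor".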
The source text sequence feature representation obtaining module 302 is configured to obtain feature information of abstract text data of a biomedical document, and map words and characters in the abstract text data into embedded vectors to obtain feature representations of the source text sequence.
An encoding module 303, configured to input a feature representation of the source text sequence into an encoder, to obtain a hidden representation of the encoder; the encoder is used for encoding the characteristic information of the source text sequence by adopting Bi-LSTM; the hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector.
The decoding and predicting module 304 is configured to embed and input the sequence encoding vector and the target word of the previous time step into the word level decoding module, and predict the triplet in the form of a generated word to obtain a set of interaction tuples.
In one embodiment, the source text sequence feature representation acquisition module 302 is further configured to acquire feature information of the source abstract text of the biomedical document; constructing a vocabulary according to the source abstract text, wherein the vocabulary comprises a source abstract text token, an interaction name, a component separator, a target sequence start token and a target sequence end token in an interaction set R; initializing Word embedding is carried out on the source abstract text by using a Word2Vec tool, so as to obtain a pre-training Word vector; extracting word feature vectors based on characters from the source abstract text by adopting a convolution network with maximum pooling; and connecting the pre-training word vector with the word feature vector based on the characters to obtain the feature representation of the source text sequence.
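The character-based word feature described above (a convolution of width 3 over character embeddings followed by max pooling, then concatenation with the pre-trained word vector) can be sketched as follows; all array shapes are assumptions for illustration, and words are assumed to be at least as long as the filter width (shorter words would be padded in practice):

```python
import numpy as np

def char_cnn_word_feature(char_vectors, filters, width=3):
    """Slide each convolution filter over the word's character embeddings,
    then take the max over all positions (max pooling).
    char_vectors: (word_len, char_dim); filters: (n_filters, width, char_dim)."""
    length = char_vectors.shape[0]
    out = np.full(filters.shape[0], -np.inf)
    for i in range(length - width + 1):
        window = char_vectors[i:i + width]  # (width, char_dim)
        scores = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
        out = np.maximum(out, scores)       # max pooling over positions
    return out

def word_representation(word_vec, char_vecs, filters):
    """Concatenate the pre-trained word vector with the char-based feature."""
    return np.concatenate([word_vec, char_cnn_word_feature(char_vecs, filters)])
```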
In one embodiment, the encoder comprises a number of Bi-LSTM network elements, the number of Bi-LSTM network elements being the same as the length of the source text sequence; the encoding module 303 is further configured to input each token vector representation in the source text sequence into a corresponding Bi-LSTM network element of the encoder, respectively, to obtain a hidden representation of each token vector representation in the source text sequence.
In one embodiment, the attention mechanism in the encoding module is: avg attention mechanism, N-gram attention mechanism, or Single attention mechanism. Wherein the expression of Avg attention mechanism is shown as formula (1).
The expression of the N-gram attention mechanism of n=3 is shown in the formulas (2) to (4). The expressions of the Single attention mechanism are shown in the formulas (5) to (9).
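Formulas (1) to (9) are not reproduced here, but the general shape of two of the attention variants can be sketched under standard assumptions: Avg attention averages the encoder hidden states, and Single attention computes a softmax-weighted context vector against a decoder query. The parameterization below (a single trainable matrix `W`) is an assumption for illustration and may differ from the exact formulas in the patent.

```python
import numpy as np

def avg_attention(hidden_states):
    """Avg attention sketch: the sequence encoding vector is the
    average of the encoder's hidden states."""
    return np.mean(hidden_states, axis=0)

def single_attention(hidden_states, query, W):
    """Single attention sketch: score each hidden state against the decoder
    query, normalize with softmax, and return the weighted sum (context)."""
    scores = hidden_states @ (W @ query)      # one score per position
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ hidden_states            # context vector
```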
In one embodiment, the word-level decoding module comprises: a decoder and a projection layer; the decoder comprises a plurality of LSTM network units; the vocabulary includes the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token; the decoding and predicting module 304 is further configured to embed and input the sequence encoding vector and the target word of the previous time step into the decoder to obtain a hidden representation of the current token; the expression of the hidden representation of the current token is shown in formula (10); the hidden representation of the current token is input into the projection layer, and the triples are predicted word by word using a mask-based replication mechanism based on the resulting output, yielding a set of interaction tuples.
In one embodiment, the projection layer is a linear layer; the decoding and predicting module 304 is further configured to input the hidden representation of the current token into the linear layer, map the hidden representation of the current token to the entire vocabulary, and obtain an output of the linear layer; the expression of the linear layer is shown in formula (11).
Setting the output of the linear layers corresponding to all words except the current source text sequence token, the interaction token, the separator token, the unlabeled token and the ending target sequence token in the vocabulary to be minus infinity, and obtaining the output of the processed linear layers; activating by adopting a softmax function according to the processed output of the linear layer to obtain the normalized score of all words embedded in the vocabulary at the current time step; the expression of the normalized score is shown in formula (12). And predicting the triples in terms of words according to the normalized scores to obtain a group of interaction tuples.
For specific limitations on the biomedical knowledge extraction device, reference may be made to the above limitations on the biomedical knowledge extraction method, and no further description is given here. The respective modules in the above-described biomedical knowledge extraction apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 4. The computer device comprises a processor 401, a memory 402, a network interface 403, a display 404 and an input means 405 connected by a system bus. Wherein the processor 401 of the computer device is used to provide computing and control capabilities. The memory 402 of the computer device includes a non-volatile storage medium 4022, an internal memory 4021. The nonvolatile memory medium 4022 stores an operating system and a computer program. The internal memory 4021 provides an environment for the operation of the operating system and computer programs in the nonvolatile storage medium. The network interface 403 of the computer device is used for communication with an external terminal via a network connection. The computer program is executed by the processor 401 to implement a biomedical knowledge extraction method. The display screen 404 of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device 405 of the computer device may be a touch layer covered on the display screen, or may be a key, a track ball or a touch pad arranged on the casing of the computer device, or may be an external keyboard, a touch pad or a mouse, etc.
Those skilled in the art will appreciate that the structures shown in FIG. 4 are block diagrams only and do not constitute a limitation of the computer device on which the present aspects apply, and that a particular computer device may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present application; although they are described in detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which would fall within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.
Claims (10)
1. A method of biomedical knowledge extraction, the method comprising:
setting an interaction tuple representation; the interaction tuple representation takes the form: tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the interaction tuple components comprise: entity 1, entity 2 and the interaction;
Acquiring abstract text data of a biomedical document, and mapping words and characters in the abstract text data into embedded vectors to obtain characteristic representation of a source text sequence; the source text sequence includes all words, symbols and each interacted clue word in the abstract text data;
inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; the encoder is used for encoding the characteristic information of the source text sequence by adopting Bi-LSTM;
adopting an attention mechanism to the hidden representation of the encoder to obtain a sequence coding vector;
and embedding and inputting the sequence coding vector and the target word of the previous time step into a word level decoding module, and predicting the triples in the form of generated words to obtain a group of interaction tuples.
2. The method of claim 1, wherein obtaining summary text data of a biomedical document and mapping words and characters in the summary text data into embedded vectors to obtain a feature representation of a source text sequence, comprises:
acquiring a source abstract text of a biomedical document;
constructing a vocabulary from the source abstract text, the vocabulary including the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token;
initializing Word embedding for the source abstract text by using a Word2Vec tool to obtain a pre-training Word vector;
extracting word feature vectors based on characters from the source abstract text by adopting a convolution network with maximum pooling;
and connecting the pre-training word vector with the word feature vector based on the characters to obtain the feature representation of the source text sequence.
3. The method of claim 1, wherein the encoder comprises a number of Bi-LSTM network elements, the number of Bi-LSTM network elements being the same as the length of the source text sequence;
inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder, comprising:
and respectively inputting each token vector representation in the source text sequence into a corresponding Bi-LSTM network unit of an encoder to obtain a hidden representation of each token vector representation in the source text sequence.
4. The method of claim 1, wherein the hidden representation of the encoder is encoded using an attention mechanism to obtain the sequence encoded vector, wherein the attention mechanism is: avg attention mechanism, N-gram attention mechanism or Single attention mechanism;
Wherein the expression of Avg attention mechanism is:
the expression of the N-gram attention mechanism for n=3 is:
wherein the quantities in formulas (2) to (4) denote, respectively: the last hidden state of the encoder; the word-gram combination; the g-gram word representation sequence of the input sequence; the i-th g-gram vector; the normalized attention score of the i-th g-gram vector; and two trainable parameters;
the expression of the Single attention mechanism is:
5. The method of claim 1, wherein the word-level decoding module comprises: a decoder and a projection layer; the decoder comprises a plurality of LSTM networks;
the vocabulary includes the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token;
embedding and inputting the sequence coding vector and the target word of the previous time step into a word level decoding module, and predicting triples in the form of generated words to obtain a group of interaction tuples, wherein the method comprises the following steps:
embedding and inputting the sequence coding vector and the target word of the previous time step into the decoder to obtain the hidden representation of the current token; the hidden representation of the current token has the following expression:
wherein the quantities in formula (10) denote, respectively: the hidden representation of the current token; the source text sequence encoding; the target word embedding of the previous time step; and the forward hidden state of the LSTM;
and inputting the hidden representation of the current token into a projection layer, and predicting the triples in terms by adopting a mask-based replication mechanism according to the obtained output to obtain a group of interaction tuples.
6. The method of claim 5, wherein the projection layer is a linear layer;
inputting the hidden representation of the current token into a projection layer, and predicting triples in terms using a mask-based replication mechanism based on the resulting output to obtain a set of interacting tuples, comprising:
inputting the hidden representation of the current token into a linear layer, and mapping the hidden representation of the current token to the whole vocabulary to obtain the output of the linear layer; the expression of the linear layer is:
wherein the quantities in formula (11) denote, respectively: the representation obtained by linearly reducing the hidden-layer vector to one dimension over the embeddings of all words in the vocabulary; a weight matrix; and a bias vector;
setting the output of the linear layers corresponding to all words except the current source text sequence token, the interaction token, the separator token, the unlabeled token and the ending target sequence token in the vocabulary to be minus infinity, and obtaining the output of the processed linear layers;
Activating by adopting a softmax function according to the processed output of the linear layer to obtain the normalized score of all words embedded in the vocabulary at the current time step; the normalized score is expressed as:
and predicting the triples in terms of words according to the normalized scores to obtain a group of interaction tuples.
7. A biomedical knowledge extraction device, the device comprising:
an interaction tuple representation setting module for setting an interaction tuple representation; the interaction tuple representation takes the form: tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the interaction tuple components comprise: entity 1, entity 2 and the interaction;
the source text sequence feature representation acquisition module is used for acquiring abstract text data of a biomedical document, and mapping words and characters in the abstract text data into embedded vectors to obtain feature representation of a source text sequence; the source text sequence includes all words, symbols and each interacted clue word in the abstract text data;
The encoding module is used for inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; adopting an attention mechanism to the hidden representation of the encoder to obtain a sequence coding vector; the encoder is used for encoding the characteristic information of the source text sequence by adopting Bi-LSTM;
and the decoding and predicting module is used for embedding and inputting the sequence coding vector and the target word of the previous time step into the word level decoding module, and predicting the triples in the form of generated words to obtain a group of interaction tuples.
8. The apparatus of claim 7, wherein the source text sequence feature representation acquisition module is further configured to acquire source digest text of the biomedical literature; constructing a vocabulary according to the source abstract text, wherein the vocabulary comprises a source abstract text token, an interaction name, a component separator, a target sequence start token and a target sequence end token in an interaction set R; initializing Word embedding is carried out on the source abstract text by using a Word2Vec tool, so as to obtain a pre-training Word vector; extracting word feature vectors based on characters from the source abstract text by adopting a convolution network with maximum pooling; and connecting the pre-training word vector with the word feature vector based on the characters to obtain the feature representation of the source text sequence.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 6 when executing the computer program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310495163.0A CN116227597A (en) | 2023-05-05 | 2023-05-05 | Biomedical knowledge extraction method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116227597A true CN116227597A (en) | 2023-06-06 |
Family
ID=86569724
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310495163.0A Pending CN116227597A (en) | 2023-05-05 | 2023-05-05 | Biomedical knowledge extraction method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116227597A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117556277A (en) * | 2024-01-12 | 2024-02-13 | 暨南大学 | Initial alignment seed generation method for knowledge-graph entity alignment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180329883A1 (en) * | 2017-05-15 | 2018-11-15 | Thomson Reuters Global Resources Unlimited Company | Neural paraphrase generator |
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | 北京工业大学 | A method of the sequence labelling joint based on attention mechanism extracts entity relationship |
WO2020253669A1 (en) * | 2019-06-19 | 2020-12-24 | 腾讯科技(深圳)有限公司 | Translation method, apparatus and device based on machine translation model, and storage medium |
WO2022212008A1 (en) * | 2021-03-31 | 2022-10-06 | Microsoft Technology Licensing, Llc | Learning molecule graphs embedding using encoder-decoder architecture |
CN115270761A (en) * | 2022-07-28 | 2022-11-01 | 中国人民解放军国防科技大学 | Relation extraction method fusing prototype knowledge |
US11544943B1 (en) * | 2022-05-31 | 2023-01-03 | Intuit Inc. | Entity extraction with encoder decoder machine learning model |
History
- 2023-05-05: CN application CN202310495163.0A filed; published as CN116227597A (en); legal status: Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180329883A1 (en) * | 2017-05-15 | 2018-11-15 | Thomson Reuters Global Resources Unlimited Company | Neural paraphrase generator |
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | 北京工业大学 | A method of the sequence labelling joint based on attention mechanism extracts entity relationship |
WO2020253669A1 (en) * | 2019-06-19 | 2020-12-24 | Tencent Technology (Shenzhen) Co., Ltd. | Translation method, apparatus and device based on machine translation model, and storage medium |
WO2022212008A1 (en) * | 2021-03-31 | 2022-10-06 | Microsoft Technology Licensing, Llc | Learning molecule graphs embedding using encoder-decoder architecture |
US11544943B1 (en) * | 2022-05-31 | 2023-01-03 | Intuit Inc. | Entity extraction with encoder decoder machine learning model |
CN115270761A (en) * | 2022-07-28 | 2022-11-01 | 中国人民解放军国防科技大学 | Relation extraction method fusing prototype knowledge |
Non-Patent Citations (2)
Title |
---|
TAPAS NAYAK et al.: "Effective Modeling of Encoder-Decoder Architecture for Joint Entity and Relation Extraction", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pages 8529-8532 * |
黄培馨; 赵翔; 方阳; 朱慧明; 肖卫东: "End-to-End Joint Extraction of Knowledge Triples with Adversarial Training", Journal of Computer Research and Development, no. 12, pages 20-32 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117556277A (en) * | 2024-01-12 | 2024-02-13 | 暨南大学 | Initial alignment seed generation method for knowledge-graph entity alignment |
CN117556277B (en) * | 2024-01-12 | 2024-04-05 | 暨南大学 | Initial alignment seed generation method for knowledge-graph entity alignment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Verga et al. | Simultaneously self-attending to all mentions for full-abstract biological relation extraction | |
CN107506414B (en) | Code recommendation method based on long-term and short-term memory network | |
CN110032739B (en) | Method and system for extracting named entities of Chinese electronic medical record | |
US10949456B2 (en) | Method and system for mapping text phrases to a taxonomy | |
CN113420163B (en) | Heterogeneous information network knowledge graph completion method and device based on matrix fusion | |
CN113268612B (en) | Heterogeneous information network knowledge graph completion method and device based on mean value fusion | |
CN117076653B (en) | Knowledge base question-answering method based on thinking chain and visual lifting context learning | |
CN113806493B (en) | Entity relationship joint extraction method and device for Internet text data | |
CN114678061A (en) | Protein conformation perception representation learning method based on pre-training language model | |
Yang et al. | Modality-DTA: multimodality fusion strategy for drug–target affinity prediction | |
CN116227597A (en) | Biomedical knowledge extraction method, device, computer equipment and storage medium | |
CN114881035A (en) | Method, device, equipment and storage medium for augmenting training data | |
CN110008482A (en) | Text handling method, device, computer readable storage medium and computer equipment | |
Dalai et al. | Part-of-speech tagging of Odia language using statistical and deep learning based approaches | |
CN114090769A (en) | Entity mining method, entity mining device, computer equipment and storage medium | |
Che et al. | Fast and effective biomedical named entity recognition using temporal convolutional network with conditional random field | |
Devkota et al. | A Gated Recurrent Unit based architecture for recognizing ontology concepts from biological literature | |
Passban | Machine translation of morphologically rich languages using deep neural networks | |
He et al. | Neural unsupervised reconstruction of protolanguage word forms | |
CN116414988A (en) | Graph convolution aspect emotion classification method and system based on dependency relation enhancement | |
Gao et al. | Citation entity recognition method using multi‐feature semantic fusion based on deep learning | |
Singh et al. | Comparing RNNs and log-linear interpolation of improved skip-model on four Babel languages: Cantonese, Pashto, Tagalog, Turkish | |
Heaps et al. | Toward detection of access control models from source code via word embedding | |
CN113961715A (en) | Entity linking method, device, equipment, medium and computer program product | |
Wei et al. | Biomedical named entity recognition via a hybrid neural network model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||