CN116227597A - Biomedical knowledge extraction method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN116227597A (application CN202310495163A)
- Authority
- CN
- China
- Prior art keywords
- representation
- sequence
- word
- token
- source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application relates to a biomedical knowledge extraction method, apparatus, computer device, and storage medium. The method comprises the following steps: setting an interaction tuple representation; acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; inputting the feature representation of the source text sequence into an encoder to obtain the encoder's hidden representation; applying an attention mechanism to the hidden representation of the encoder to obtain a sequence encoding vector; and embedding the sequence encoding vector together with the target word of the previous time step, inputting them into a word-level decoding module, and predicting triples word by word to obtain a set of interaction tuples. The method proposes a new interaction tuple representation scheme so that an encoder-decoder model that emits one word per time step can still find, within a sentence, multiple tuples with overlapping entities as well as tuples whose entities carry multiple labels.
Description
Technical Field
The present application relates to the field of big data technologies, and in particular, to a method and apparatus for extracting biomedical knowledge, a computer device, and a storage medium.
Background
As basic research, the targeted extraction, from biomedical literature, of knowledge such as drug-protein entities and their interactions needed for medical research provides powerful support for drug discovery, drug repurposing, drug design, and bioinformatics knowledge bases built in the form of knowledge graphs. However, as researchers have studied this task in depth, problems at both the macroscopic and microscopic levels have continued to emerge.
Extracting drug-protein interaction tuples from biomedical abstracts is a challenging task. Macroscopically, prior work mainly divides the task into two parts, entity recognition and relation extraction. Models built this way not only ignore context information but also raise questions about the ordering of the two subtasks and the sharing of feature information between them. Microscopically, differences in entity length, the presence of multiple tuples, and entity overlap between tuples hamper the accuracy of triple extraction, which can be divided into three categories: (1) No Entity Overlap (NEO): a sequence contains one or more tuples, but no entity is shared between them. (2) Entity Pair Overlap (EPO): a sequence contains multiple tuples, and at least two tuples share both entities, in the same or opposite order. (3) Single Entity Overlap (SEO): a sequence contains multiple tuples, and at least two tuples share exactly one entity. It should be noted that a sequence may exhibit both entity pair overlap and single entity overlap.
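For illustration only, the three overlap categories can be checked mechanically. The following sketch is hypothetical and not part of the claimed method; tuples are given as (entity 1, interaction, entity 2):

```python
def overlap_categories(tuples):
    """Classify the overlap pattern of a set of (entity1, interaction, entity2)
    tuples following the NEO/EPO/SEO definitions. A sequence may be both
    EPO and SEO at once."""
    epo = seo = False
    for i in range(len(tuples)):
        for j in range(i + 1, len(tuples)):
            e1a, _, e2a = tuples[i]
            e1b, _, e2b = tuples[j]
            shared = {e1a, e2a} & {e1b, e2b}
            if len(shared) == 2:
                epo = True   # same entity pair, same or reversed order
            elif len(shared) == 1:
                seo = True   # exactly one shared entity
    cats = set()
    if epo:
        cats.add("EPO")
    if seo:
        cats.add("SEO")
    return cats or {"NEO"}
```

A sequence with tuples ("a", "r1", "b") and ("b", "r2", "a") is EPO; adding ("a", "r3", "c") makes it EPO and SEO at the same time.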
Disclosure of Invention
In view of the foregoing, it is desirable to provide a biomedical knowledge extraction method, apparatus, computer device, and storage medium.
A biomedical knowledge extraction method, the method comprising:
setting an interaction tuple representation; the interaction tuple representation takes the form: the components of a tuple are separated by a component separator, multiple tuples are separated by a tuple separator, and the components of an interaction tuple comprise: entity 1, entity 2, and the interaction.
Acquiring abstract text data of a biomedical document, and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; the source text sequence includes all words and symbols in the abstract text data, as well as the clue words of each interaction.
Inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; the encoder is configured to encode the characteristic information of the source text sequence using Bi-LSTM.
The hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector.
And embedding the sequence encoding vector together with the target word of the previous time step, inputting them into a word-level decoding module, and predicting the triples word by word to obtain a set of interaction tuples.
A biomedical knowledge extraction device, the device comprising:
an interaction tuple representation setting module for setting an interaction tuple representation; the interaction tuple representation takes the form: the components of a tuple are separated by a component separator, multiple tuples are separated by a tuple separator, and the components of an interaction tuple comprise: entity 1, entity 2, and the interaction.
A source text sequence feature representation acquisition module for acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; the source text sequence includes all words and symbols in the abstract text data, as well as the clue words of each interaction.
An encoding module for inputting the feature representation of the source text sequence into an encoder to obtain the hidden representation of the encoder, the encoder encoding the feature information of the source text sequence using a Bi-LSTM, and for applying an attention mechanism to the hidden representation of the encoder to obtain a sequence encoding vector.
And a decoding and prediction module for embedding the sequence encoding vector together with the target word of the previous time step, inputting them into the word-level decoding module, and predicting triples word by word to obtain a set of interaction tuples.
In one embodiment, the source text sequence feature representation acquisition module is further configured to acquire source abstract text of a biomedical document; constructing a vocabulary according to the source abstract text, wherein the vocabulary comprises a source abstract text token, an interaction name, a component separator, a target sequence start token and a target sequence end token in an interaction set R; initializing Word embedding is carried out on the source abstract text by using a Word2Vec tool, so as to obtain a pre-training Word vector; extracting word feature vectors based on characters from the source abstract text by adopting a convolution network with maximum pooling; and connecting the pre-training word vector with the word feature vector based on the characters to obtain the feature representation of the source text sequence.
A computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of any of the methods described above.
The biomedical knowledge extraction method, apparatus, computer device, and storage medium above comprise the following steps: setting an interaction tuple representation; acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; inputting the feature representation of the source text sequence into an encoder, which encodes the feature information of the source text sequence using a Bi-LSTM, to obtain the encoder's hidden representation; applying an attention mechanism to the hidden representation of the encoder to obtain a sequence encoding vector; and embedding the sequence encoding vector together with the target word of the previous time step, inputting them into the word-level decoding module, and predicting triples word by word to obtain a set of interaction tuples. The method proposes a new interaction tuple representation scheme so that an encoder-decoder model that emits one word per time step can still find, within a sentence, multiple tuples with overlapping entities as well as tuples whose entities carry multiple labels.
Drawings
FIG. 1 is a flow chart of a method for extracting knowledge of biological medicine according to an embodiment;
FIG. 2 is a schematic diagram of a word-level decoding-based model in one embodiment;
FIG. 3 is a block diagram of a biomedical knowledge extraction device in one embodiment;
fig. 4 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a biomedical knowledge extraction method is provided, the method comprising the steps of:
step S100: setting an interaction tuple representation; the interaction tuple is expressed in the form of: the tuple components are separated by a component separator, the plurality of tuples are separated by a tuple component separator, and the interacting tuple components comprise: entity 1, entity 2 and interactions.
Specifically, the interaction tuple is a triple, represented as: entity 1; entity 2; interaction. ";" is used as the separator between tuple components (i.e. the component separator is ";"), and "|" separates multiple tuples (the tuple separator is "|"). Table 1 gives an example of this interaction tuple representation. These special tokens (";" and "|") make it possible to represent, in a simple manner, multiple interaction tuples with overlapping entities and with entities of different lengths, and they allow the interaction tuples to be extracted easily once sequence generation ends during inference. Because of this unified interaction tuple representation scheme, entity tokens, interaction tokens, and special tokens are treated alike, and a shared vocabulary containing all of these tokens is used by the encoder and decoder. The abstract text of the input biomedical document (the text is a string of words, or a sequence of characters, and may be called a text sequence) contains the clue words of each interaction, which helps generate the interaction tokens. The two special tokens let the encoder-decoder model distinguish the beginning of an interaction tuple from the beginning of a tuple component. To extract interaction tuples from a text sequence with an encoder-decoder model, the model must generate the entity tokens, find the clue words of the interactions and map them to interaction tokens, and generate the special tokens at the appropriate positions.
Table 1: Example of the interaction tuple representation
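For illustration only (the function names are hypothetical), the separator scheme described above, with ";" between components and "|" between tuples, can be serialized and parsed as follows:

```python
COMP_SEP, TUPLE_SEP = ";", "|"

def serialize(tuples):
    """Render a list of (entity1, entity2, interaction) triples as one
    target string in the patent's representation scheme."""
    return f" {TUPLE_SEP} ".join(f" {COMP_SEP} ".join(t) for t in tuples)

def parse(target):
    """Recover the triples from a generated sequence, dropping malformed
    and duplicate tuples as described for the inference stage."""
    seen, out = set(), []
    for chunk in target.split(TUPLE_SEP):
        parts = tuple(p.strip() for p in chunk.split(COMP_SEP))
        if len(parts) == 3 and parts not in seen:
            seen.add(parts)
            out.append(parts)
    return out
```

For example, serializing the two tuples (aspirin, COX-1, inhibitor) and (aspirin, COX-2, inhibitor) yields "aspirin ; COX-1 ; inhibitor | aspirin ; COX-2 ; inhibitor", from which parse recovers both triples even though they share an entity.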
Step S102: acquiring abstract text data of a biomedical document, and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; the source text sequence includes all words and symbols in the abstract text data, as well as the clue words of each interaction.
Specifically, an embedding vector is the single corresponding vector to which a word, symbol, or sequence is mapped.
Step S104: inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; the encoder is configured to encode the characteristic information of the source text sequence using Bi-LSTM.
Step S106: the hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector.
Specifically, in order to obtain the source context feature information, a hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector. Preferably, the attention mechanism may be an Avg attention mechanism, an N-gram attention mechanism or a Single attention mechanism.
Step S108: embedding the sequence encoding vector together with the target word of the previous time step, inputting them into the word-level decoding module, and predicting the triples word by word to obtain a set of interaction tuples.
Specifically, a word level decoding-based model (WDec) is formed by the encoder, the attention mechanism and the word level decoding module, and the model is structured as shown in FIG. 2.
The word-level decoding module is configured to decode the input sequence encoding vector and the target word embedding of the previous time step using LSTM network units, map the decoding result onto the vocabulary through a projection layer, and then perform prediction using a softmax operation combined with a masking technique to obtain a set of interaction tuples.
The word-level decoding module generates one token at a time and stops after generating the target sequence end token (EOS).
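A minimal sketch of this one-token-at-a-time generation loop, where `step_fn` is a hypothetical stand-in for the LSTM decoder plus projection layer (it maps the previous token to scores over the vocabulary):

```python
def greedy_decode(step_fn, sos="SOS", eos="EOS", max_len=50):
    """Word-level decoding loop: feed the previous token to step_fn,
    take the argmax over the vocabulary, stop when EOS is produced."""
    prev, out = sos, []
    for _ in range(max_len):
        scores = step_fn(prev)               # token -> score dict
        prev = max(scores, key=scores.get)   # greedy argmax
        if prev == eos:
            break
        out.append(prev)
    return out
```

In the full model `step_fn` would also depend on the attention-weighted sequence encoding vector; here it is collapsed to a lookup purely for illustration.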
In the above-mentioned biomedical knowledge extraction method, the method comprises: setting an interaction tuple representation; acquiring abstract text data of a biomedical document and mapping the words and characters in the abstract text data to embedding vectors to obtain the feature representation of the source text sequence; inputting the feature representation of the source text sequence into an encoder, which encodes the feature information of the source text sequence using a Bi-LSTM, to obtain the encoder's hidden representation; applying an attention mechanism to the hidden representation of the encoder to obtain a sequence encoding vector; and embedding the sequence encoding vector together with the target word of the previous time step, inputting them into the word-level decoding module, and predicting triples word by word to obtain a set of interaction tuples. The method proposes a new interaction tuple representation scheme so that an encoder-decoder model that emits one word per time step can still find, within a sentence, multiple tuples with overlapping entities as well as tuples whose entities carry multiple labels.
In one embodiment, step S102 includes: acquiring the source abstract text of a biomedical document; building a vocabulary from the source abstract text, the vocabulary comprising the source abstract text tokens, the interaction names in the interaction set R, the component separator, the tuple separator, the target sequence start token, and the target sequence end token; extracting word embedding and character embedding feature vectors for all words and characters in the source abstract text, using a convolutional network with max pooling for the character-based features; and concatenating the character-based feature vectors with the word embedding feature vectors to obtain the feature representation of the source text sequence.
Specifically, a vocabulary V is created from the source abstract text tokens, the interaction names in the interaction set R, the special separators (the component separator ";" and the tuple separator "|"), the target sequence start token (SOS), and the target sequence end token (EOS).
Word-level embedding consists of two parts: (1) the pre-trained word vector; (2) the character-based feature vector. A word embedding layer E_w ∈ R^{|V| × d_w} and a character embedding layer E_c ∈ R^{|A| × d_c} are used, where d_w is the dimension of the word vectors, A is the character alphabet of the input text sequence tokens, and d_c is the dimension of the character embedding vectors. A convolutional neural network with max pooling is used to extract a character-based feature vector of size d_f for each word. The word embedding and the character-based feature vector are concatenated (‖) to obtain the representation of an input token.
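The character-level pipeline above — a convolution over character embeddings followed by max-over-time pooling — can be sketched in miniature. This pure-Python toy is an illustrative assumption, not the patented implementation, and the helper name is hypothetical:

```python
def char_word_feature(char_vecs, filters, width=3):
    """Toy character CNN with max-over-time pooling. char_vecs is a list of
    per-character embedding vectors for one word; each filter holds
    width * char_dim weights spanning `width` consecutive characters.
    Returns one pooled feature per filter (assumes the word has at least
    `width` characters)."""
    feats = []
    for f in filters:
        best = float("-inf")
        for i in range(len(char_vecs) - width + 1):
            # flatten the window of character vectors, then dot with the filter
            window = [x for v in char_vecs[i:i + width] for x in v]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        feats.append(best)
    return feats
```

The returned per-filter features would then be concatenated with the pre-trained word vector to form the input token representation described above.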
In one embodiment, the encoder comprises a number of Bi-LSTM network elements, the number of Bi-LSTM network elements being the same as the length of the source text sequence; step S104 includes: and respectively inputting each token vector representation in the source text sequence into a corresponding Bi-LSTM network element of the encoder to obtain a hidden representation of each token vector representation in the source text sequence.
Specifically, the source text sequence S is represented by its token vectors as S = {x_1, x_2, ..., x_n}, where x_i is the vector representation of the i-th word and n is the length of the source text sequence S. The vectors x_i are passed to a bidirectional long short-term memory network (Bi-LSTM) to obtain the hidden representations h_i^E. The hidden dimension of the forward and backward LSTMs of the Bi-LSTM is set to d_h/2, so that h_i^E ∈ R^{d_h}, where d_h is the hidden dimension of the decoder LSTM of the word-level decoding module.
In one embodiment, the hidden representation of the encoder is passed through an attention mechanism to obtain the sequence encoding vector; the attention mechanism in this step is the Avg attention mechanism, the N-gram attention mechanism, or the Single attention mechanism. The Avg attention mechanism is expressed as:
e_t = (1/n) · Σ_{i=1}^{n} h_i^E
i.e. the sequence encoding vector is the mean of the encoder hidden representations and is identical at every decoding time step.
The expression of the N-gram attention mechanism for N = 3 is:
α_i^g = softmax_i((h_n^E)^T v_i^g),   e_t = W [h_n^E ‖ Σ_{g=1}^{3} W^g (Σ_i α_i^g v_i^g)]
wherein h_n^E is the last hidden state of the encoder, w_g denotes the word g-gram combinations, v^g is the g-gram representation sequence of the input sequence, v_i^g is the i-th g-gram vector (the 2-gram and 3-gram representations are obtained by mean pooling), α_i^g is the normalized attention score of the i-th g-gram vector, and W and W^g are trainable parameters.
The expression of the Single attention mechanism is:
u_t^i = v_a^T · tanh(W_a h_i^E + U_a h_{t-1}^D + b_a),   α_t^i = softmax_i(u_t^i),   e_t = Σ_{i=1}^{n} α_t^i h_i^E
wherein W_a, U_a, and v_a are all trainable attention parameters, b_a is a bias vector, and α_t^i is the normalized attention score of the i-th word at decoding time step t.
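As a hedged illustration of the Single (additive) attention variant, the following pure-Python sketch computes the normalized scores and context vector for one decoding step; W1, W2, and v are toy stand-ins for the trainable parameters, the decoder state plays the role of the previous decoder hidden state, and the bias term is omitted for brevity:

```python
import math

def single_attention(enc_hidden, dec_state, W1, W2, v):
    """Additive attention over encoder hidden vectors h_i at one decoding
    step: u_i = v . tanh(W1 h_i + W2 s), alpha = softmax(u),
    context = sum_i alpha_i h_i."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]
    ws = matvec(W2, dec_state)
    scores = []
    for h in enc_hidden:
        z = [math.tanh(a + b) for a, b in zip(matvec(W1, h), ws)]
        scores.append(sum(vi * zi for vi, zi in zip(v, z)))
    m = max(scores)                                   # stabilized softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    context = [sum(a * h[d] for a, h in zip(alphas, enc_hidden))
               for d in range(len(enc_hidden[0]))]
    return alphas, context
```

With identity-like toy weights, an encoder position with larger activations receives the larger attention weight, and the context vector is pulled toward it.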
In one embodiment, the word-level decoding module comprises a decoder and a projection layer; the decoder comprises a plurality of LSTM network units; the vocabulary includes the source abstract text tokens, the interaction names in the interaction set R, the component separator, the tuple separator, the target sequence start token, and the target sequence end token. Step S108 includes: embedding the sequence encoding vector together with the target word of the previous time step and inputting them into the decoder to obtain the hidden representation of the current token, expressed as:
h_t^D = LSTM(e_t ‖ ŷ_{t-1}, h_{t-1}^D)
wherein h_t^D is the hidden representation of the current token, e_t is the source text sequence encoding, ŷ_{t-1} is the target word embedding of the previous time step, and h_{t-1}^D is the previous hidden state of the decoder LSTM.
The hidden representation of the current token is input into the projection layer and, based on the resulting output, the triples are predicted word by word using a mask-based replication (copy) mechanism, yielding a set of interaction tuples.
Specifically, the target sequence T is represented only by the word embedding vectors of its tokens, T = {y_0, y_1, ..., y_m}, where y_j is the embedding vector of the j-th token and m is the length of the target sequence; y_0 and y_m denote the embedding vectors of the SOS and EOS tokens, respectively. The decoder generates one token at a time and stops after generating EOS. An LSTM is used as the decoder: at time step t, the decoder takes the source text sequence encoding e_t and the previous target word embedding ŷ_{t-1} as input and generates the hidden representation h_t^D of the current token. The sequence encoding vector e_t is obtained by the attention mechanism. A matrix with weights W_V and a bias vector b_V is then used to project h_t^D onto the vocabulary V.
In one embodiment, the projection layer is a linear layer. Inputting the hidden representation of the current token into the projection layer and predicting the triples word by word using the mask-based replication mechanism comprises: inputting the hidden representation of the current token into the linear layer, which maps it onto the whole vocabulary to obtain the output of the linear layer, expressed as:
o_t = W_V h_t^D + b_V
wherein o_t contains one score for each word embedded in the vocabulary, obtained by linearly projecting the hidden vector; W_V is a weight matrix and b_V is the bias vector.
Setting the linear-layer outputs corresponding to all vocabulary words other than the current source text sequence tokens, the interaction tokens, the separator tokens, the UNK token, and the target sequence end token to negative infinity yields the masked linear-layer output. A softmax activation is then applied to the masked output to obtain the normalized scores of all words embedded in the vocabulary at the current time step:
ŷ_t = softmax(ô_t)
wherein ô_t is the masked linear-layer output. The triples are then predicted word by word according to the normalized scores, yielding a set of interaction tuples.
In particular, the projection layer in the word-level decoding module maps the decoder output onto the entire vocabulary. During training, the gold-label target token is used directly, i.e. the token with the highest similarity score. During inference, however, the decoder could predict tokens from the vocabulary that are not present in the current text sequence, in the interaction set, or among the special tokens. To prevent this, a masking technique is applied when the projection layer performs the softmax operation: all vocabulary words are masked (excluded) except the current source text sequence tokens, the interaction tokens, the separator tokens (";" and "|"), the UNK token, and the EOS token. To mask (exclude) a word from the softmax, its corresponding linear-layer value is set to negative infinity, so that its softmax score is zero. This ensures that entities are only copied from the source text sequence. The UNK token is included in the softmax so that the model can handle new entities during inference: if the decoder predicts an UNK token, it is replaced with the source word that has the highest attention score. During inference, after decoding is complete, all tuples are extracted based on the special tokens, and duplicate tuples, tuples whose two entities are identical, and tuples whose interaction token is not in the interaction set are removed.
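The masking technique described above, setting disallowed scores to negative infinity before the softmax, can be illustrated with a small sketch; the function name and the score dictionary are hypothetical:

```python
import math

def masked_softmax(logits, allowed):
    """logits: token -> raw score; tokens outside `allowed` (the source
    tokens, interaction tokens, separators, UNK, and EOS) get -inf, so
    their softmax probability is exactly zero."""
    masked = {t: (s if t in allowed else float("-inf"))
              for t, s in logits.items()}
    m = max(masked.values())                      # stabilize the exponent
    exps = {t: math.exp(s - m) for t, s in masked.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}
```

A word outside the allowed set, such as one not appearing in the source sentence, keeps zero probability even if its raw score was the largest, which is what guarantees that entities are only copied from the source text sequence.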
It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; nor do these sub-steps or stages need to be performed sequentially, as they may be performed in turn or alternately with at least a portion of other steps, or of the sub-steps or stages of other steps.
In one illustrative embodiment, the method is primarily used to extract tuples with overlapping entities from biomedical abstract texts, an important task on the international evaluation platform BioCreative. The task provides two annotated corpora. (i) The DrugProt corpus contains a total of 4227 manually annotated PubMed abstracts; the original data are divided into a training set, a validation set and a test set with a 6:2:2 split. (ii) The ChemProt corpus consists of 1015 training abstracts, 611 development abstracts and 799 test abstracts. Table 2 lists the statistics for each split.
Table 2 training/validation/test data distribution statistics for experimental corpus
The present invention runs the Word2Vec tool on the ChemProt and DrugProt corpora shared by BioCreative VI and VII to initialize word embeddings. Character embeddings and interaction embeddings are randomly initialized, and all embeddings are updated during training. The word embedding dimension, the interaction embedding dimension, the character embedding dimension and the character-based word feature dimension are set accordingly. To extract character-based word feature vectors, the CNN filter width is set to 3 and the maximum word length to 10. The hidden dimension of the decoder LSTM units is set to 300, and the hidden dimensions of the forward and backward LSTMs of the encoder are set to 150. The model is trained in mini-batches of 6, and Adam is used to optimize the network parameters. Dropout layers with a fixed dropout rate of 0.3 are used in the network to avoid overfitting.
The present invention compares the model with the top-three models for text-mining drug-protein/chemical-protein interactions published by BioCreative VI and VII, respectively. Table 3 summarizes the Top-3 entries of the two BioCreative tasks and analyzes the strategies and models proposed or used by each team. The models proposed by teams in BioCreative VI are less complex than those in BioCreative VII and lean more toward model integration and multi-layer designs. By contrast, the best-performing models in BioCreative VII all come from fine-tuning, or even ensembles, of pre-trained language models. For the task of extracting knowledge from text sequences such as biomedical literature abstracts, under fixed experimental hardware conditions, the larger the model, the shorter the text sequences from which information can generally be extracted. Furthermore, the large scale of pre-trained language models makes them ill-suited to fine-tuning, and they may not perform well when transferred to other highly specialized tasks.
Table 3 Statistics of the top-three teams on the two BioCreative tasks, respectively
Inspired by Peng et al. and Mehryary et al., the benchmark also includes the most advanced joint entity and relation extraction models in natural language processing (NLP). Table 4 gives statistics of the latest benchmarks from NLP. As shown in Table 4, these models fall into traditional machine learning (ML), deep learning (DL) and reinforcement learning (RL) based approaches, and the strategies they employ can be classified into two categories: (i) Serial Strategy: a conventional approach targeting sentence-level tasks, in which all entities in a sentence are first identified and the relationships of all candidate entity pairs are then determined. The disadvantage of this approach is cascading errors: recognition errors in an earlier subtask propagate to the next subtask, while the extraction results of later subtasks cannot influence the earlier ones. (ii) Joint Strategy: a currently popular approach in which relations are judged while entities are identified, so that the two subtasks interact and can influence each other. However, whether entities are first identified and the candidate relationships between them then predicted, or both are done jointly, all these methods divide the task into entity recognition and relation prediction and treat them as two subtasks, using a dual-pipeline or hierarchical method.
In the work of the present invention, drug-protein entities and their interactions are extracted jointly from the biomedical literature in triplet form (triplet form means that the units within the triplet are organized as in Table 1, as a whole or as whole words). Furthermore, when performing joint extraction tasks on different corpora, the neural networks employed by the encoder and decoder of the model may vary for adaptability, and different networks and attention modeling methods may be selected.
Table 4 statistics from the latest benchmarks for natural language processing
Precision (Precision), recall (Recall), and F1 score (F1 score) were used as evaluation indicators for this experiment, where TP (true positive) represents the number of correctly classified and partitioned samples; FP (false positive) represents the classification and the number of incorrectly classified samplesAn amount of; FN (false negative) indicates the number of unclassified samples, which is wrong. And, set upTo represent the value corresponding to the category, +.>To represent the number of prediction categories.
The precision rate applies only to the positive samples that are predicted correctly, not to all correctly predicted samples. It is calculated by dividing the number of correctly predicted positive samples by the total number of samples the model predicts as positive, and it indicates how many of the predicted positive samples are indeed positive, as shown in equation (13): Precision = TP / (TP + FP).
Recall is calculated by dividing the number of correctly predicted positive samples by the actual number of positive samples in the test set; it shows how many of the truly positive samples the classifier can recall, as shown in equation (14): Recall = TP / (TP + FN).
The F1 score is the harmonic mean of precision and recall. Ideally both would be high; however, the two indicators are in tension and cannot both be maximized. The F1 score is therefore introduced as a suitable trade-off to maximize the classifier's capability, as shown in equation (15): F1 = 2 × Precision × Recall / (Precision + Recall).
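The three metrics above, equations (13)-(15), follow the standard definitions and can be computed directly from the TP/FP/FN counts; the sketch below is illustrative:

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (13)-(15): precision, recall, and their harmonic mean (F1).
    Guards against division by zero when a denominator is empty."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```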
In the benchmark tests of BioCreative, the comprehensive performance of WDec on the ChemProt and DrugProt corpora is superior to that of the top models in the original tasks, as shown in Tables 5 and 6. In Table 5, WDec achieves an F1 score 3% higher than the Peng model. In Table 6, the F1 score of a single WDec model on the DrugProt corpus is 1.68% higher than that of the NLM-NCBI model. Furthermore, an integration mechanism is applied in constructing the model for subsequent comparison with the most advanced models in NLP; it improves the model's F1 scores on the two corpora by 3.11% and 2.95%, respectively, on average.
TABLE 5 comparison of Performance between the inventive model and the top three models on the ChemProt corpus
TABLE 6 comparison of Performance between the inventive model and the top three models on the DrugProt corpus
Performance comparisons were made against the top-three models of the two tasks listed in Tables 5 and 6. The well-performing models of BioCreative VI are compared with two variants of the model constructed with the integration mechanism, a single model and an ensemble model; the well-performing models of BioCreative VII are likewise compared with the model.
In the NLP benchmark, HRL achieves a significantly higher F1 score on the corpora. Their model and the model described in this invention were each run five times and compared on both corpora. As shown in Table 7, the F1 scores obtained by WDec on the ChemProt and DrugProt datasets were 1.44% and 0.54% higher than HRL, respectively. The present invention performed a statistical significance test (bootstrap paired t-test) between HRL and the model, which shows that the higher F1 score obtained by the model is statistically significant. Next, the outputs of five runs of the model and five runs of HRL were combined to build an ensemble model. For each test case, a tuple is kept if it is extracted in the majority of the five runs. This integration mechanism significantly improves the precision on both datasets, as well as the recall. In the ensemble scenario, the F1 scores of WDec on the ChemProt and DrugProt corpora are 4.59% and 1.3% higher, respectively, than those of HRL.
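The majority-vote ensembling described above can be sketched as follows. This is a hedged sketch: the tuple representation as `(entity1, entity2, interaction)` triples and the majority threshold (at least 3 of 5 runs) are assumptions consistent with the description, not the verbatim implementation.

```python
from collections import Counter

def ensemble_tuples(run_outputs):
    """Keep a tuple only if it is extracted by a strict majority of runs.
    `run_outputs` is a list of sets of (entity1, entity2, interaction) tuples."""
    counts = Counter(t for run in run_outputs for t in set(run))
    majority = len(run_outputs) // 2 + 1  # e.g. 3 out of 5 runs
    return {t for t, c in counts.items() if c >= majority}
```

Majority voting raises precision because spurious tuples rarely recur across runs, while tuples extracted consistently are kept, which matches the precision and recall gains reported for the ensemble.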
Table 7 comparison of the performance of all models on two corpora
Table 8 shows the results of ablation experiments on the WDec attention mechanisms. In Table 8, the present invention reports the performance of WDec with different attention mechanisms and the effect of the mask-based replication mechanism. Single-attention word-level decoding achieves the highest F1 score, and the mask-based replication mechanism increases the F1 score by about 2-7% under each attention mechanism.
Table 8 ablation experiments for WDec attention mechanisms
In summary, the present invention proposes a new interaction tuple representation scheme such that an encoder-decoder model that extracts one word at each time step can still find multiple tuples with overlapping entities and tuples with multi-labeled entities in a sentence. This facilitates transferring the model from sentence-level knowledge extraction to document-level knowledge extraction. Experiments on the ChemProt and DrugProt corpora published by BioCreative VI & VII show that the method of the present invention is significantly better than all previous state-of-the-art models and establishes new benchmarks on these datasets.
In one embodiment, as shown in fig. 3, there is provided a biomedical knowledge extraction device comprising: an interaction tuple representation setting module 301, a source text sequence feature representation acquisition module 302, an encoding module 303, and a decoding and prediction module 304, wherein:
an interaction tuple representation setting module 301 for setting an interaction tuple representation; the interaction tuple representation takes the form: tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the interaction tuple components comprise: entity 1, entity 2 and the interaction.
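As an illustration only, assuming ";" as the component separator and "|" as the tuple separator (the separator tokens named elsewhere in the description), a decoded target sequence can be parsed back into interaction tuples as below; the example entities are hypothetical:

```python
def parse_interaction_tuples(decoded):
    """Split a linearized target sequence into (entity1, entity2, interaction)
    triples: components separated by ';', tuples separated by '|'."""
    tuples = []
    for chunk in decoded.split("|"):
        parts = [p.strip() for p in chunk.split(";")]
        if len(parts) == 3 and all(parts):  # skip malformed or empty chunks
            tuples.append((parts[0], parts[1], parts[2]))
    return tuples
```

Note that this representation is what lets a single generated word sequence carry several tuples sharing an entity, e.g. "aspirin ; COX-1 ; inhibitor | aspirin ; COX-2 ; inhibitor".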
The source text sequence feature representation obtaining module 302 is configured to obtain feature information of abstract text data of a biomedical document, and map words and characters in the abstract text data into embedded vectors to obtain feature representations of the source text sequence.
An encoding module 303, configured to input a feature representation of the source text sequence into an encoder, to obtain a hidden representation of the encoder; the encoder is used for encoding the characteristic information of the source text sequence by adopting Bi-LSTM; the hidden representation of the encoder is subjected to an attention mechanism to obtain a sequence encoding vector.
The decoding and predicting module 304 is configured to embed and input the sequence encoding vector and the target word of the previous time step into the word level decoding module, and predict the triplet in the form of a generated word to obtain a set of interaction tuples.
In one embodiment, the source text sequence feature representation acquisition module 302 is further configured to acquire feature information of the source abstract text of the biomedical document; constructing a vocabulary according to the source abstract text, wherein the vocabulary comprises a source abstract text token, an interaction name, a component separator, a target sequence start token and a target sequence end token in an interaction set R; initializing Word embedding is carried out on the source abstract text by using a Word2Vec tool, so as to obtain a pre-training Word vector; extracting word feature vectors based on characters from the source abstract text by adopting a convolution network with maximum pooling; and connecting the pre-training word vector with the word feature vector based on the characters to obtain the feature representation of the source text sequence.
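The character-based word feature described above (a convolution of width 3 over character embeddings followed by max pooling, then concatenation with the pre-trained word vector) can be sketched as follows; all array shapes are assumptions for illustration, and words are assumed to be at least as long as the filter width (shorter words would be padded in practice):

```python
import numpy as np

def char_cnn_word_feature(char_vectors, filters, width=3):
    """Slide each convolution filter over the word's character embeddings,
    then take the max over all positions (max pooling).
    char_vectors: (word_len, char_dim); filters: (n_filters, width, char_dim)."""
    length = char_vectors.shape[0]
    out = np.full(filters.shape[0], -np.inf)
    for i in range(length - width + 1):
        window = char_vectors[i:i + width]  # (width, char_dim)
        scores = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
        out = np.maximum(out, scores)       # max pooling over positions
    return out

def word_representation(word_vec, char_vecs, filters):
    """Concatenate the pre-trained word vector with the char-based feature."""
    return np.concatenate([word_vec, char_cnn_word_feature(char_vecs, filters)])
```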
In one embodiment, the encoder comprises a number of Bi-LSTM network elements, the number of Bi-LSTM network elements being the same as the length of the source text sequence; the encoding module 303 is further configured to input each token vector representation in the source text sequence into a corresponding Bi-LSTM network element of the encoder, respectively, to obtain a hidden representation of each token vector representation in the source text sequence.
In one embodiment, the attention mechanism in the encoding module is: avg attention mechanism, N-gram attention mechanism, or Single attention mechanism. Wherein the expression of Avg attention mechanism is shown as formula (1).
The expression of the N-gram attention mechanism of n=3 is shown in the formulas (2) to (4). The expressions of the Single attention mechanism are shown in the formulas (5) to (9).
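Formulas (1) to (9) are not reproduced here, but the general shape of two of the attention variants can be sketched under standard assumptions: Avg attention averages the encoder hidden states, and Single attention computes a softmax-weighted context vector against a decoder query. The parameterization below (a single trainable matrix `W`) is an assumption for illustration and may differ from the exact formulas in the patent.

```python
import numpy as np

def avg_attention(hidden_states):
    """Avg attention sketch: the sequence encoding vector is the
    average of the encoder's hidden states."""
    return np.mean(hidden_states, axis=0)

def single_attention(hidden_states, query, W):
    """Single attention sketch: score each hidden state against the decoder
    query, normalize with softmax, and return the weighted sum (context)."""
    scores = hidden_states @ (W @ query)      # one score per position
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ hidden_states            # context vector
```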
In one embodiment, the word-level decoding module comprises: a decoder and a projection layer; the decoder comprises a plurality of LSTM network units; the vocabulary includes the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token; the decoding and predicting module 304 is further configured to embed and input the sequence encoding vector and the target word of the previous time step into the decoder to obtain a hidden representation of the current token; the expression of the hidden representation of the current token is shown in formula (10); the hidden representation of the current token is input into the projection layer, and the triples are predicted word by word using a mask-based replication mechanism based on the resulting output, yielding a set of interaction tuples.
In one embodiment, the projection layer is a linear layer; the decoding and predicting module 304 is further configured to input the hidden representation of the current token into the linear layer, map the hidden representation of the current token to the entire vocabulary, and obtain an output of the linear layer; the expression of the linear layer is shown in formula (11).
Setting the output of the linear layers corresponding to all words except the current source text sequence token, the interaction token, the separator token, the unlabeled token and the ending target sequence token in the vocabulary to be minus infinity, and obtaining the output of the processed linear layers; activating by adopting a softmax function according to the processed output of the linear layer to obtain the normalized score of all words embedded in the vocabulary at the current time step; the expression of the normalized score is shown in formula (12). And predicting the triples in terms of words according to the normalized scores to obtain a group of interaction tuples.
For specific limitations on the biomedical knowledge extraction device, reference may be made to the above limitations on the biomedical knowledge extraction method, and no further description is given here. The respective modules in the above-described biomedical knowledge extraction apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 4. The computer device comprises a processor 401, a memory 402, a network interface 403, a display 404 and an input means 405 connected by a system bus. Wherein the processor 401 of the computer device is used to provide computing and control capabilities. The memory 402 of the computer device includes a non-volatile storage medium 4022, an internal memory 4021. The nonvolatile memory medium 4022 stores an operating system and a computer program. The internal memory 4021 provides an environment for the operation of the operating system and computer programs in the nonvolatile storage medium. The network interface 403 of the computer device is used for communication with an external terminal via a network connection. The computer program is executed by the processor 401 to implement a biomedical knowledge extraction method. The display screen 404 of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device 405 of the computer device may be a touch layer covered on the display screen, or may be a key, a track ball or a touch pad arranged on the casing of the computer device, or may be an external keyboard, a touch pad or a mouse, etc.
Those skilled in the art will appreciate that the structures shown in FIG. 4 are block diagrams only and do not constitute a limitation of the computer device on which the present aspects apply, and that a particular computer device may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present application; although they are described in detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which would fall within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.
Claims (10)
1. A method of biomedical knowledge extraction, the method comprising:
setting an interaction tuple representation; the interaction tuple representation takes the form: tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the interaction tuple components comprise: entity 1, entity 2 and the interaction;
Acquiring abstract text data of a biomedical document, and mapping words and characters in the abstract text data into embedded vectors to obtain characteristic representation of a source text sequence; the source text sequence includes all words, symbols and each interacted clue word in the abstract text data;
inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; the encoder is used for encoding the characteristic information of the source text sequence by adopting Bi-LSTM;
adopting an attention mechanism to the hidden representation of the encoder to obtain a sequence coding vector;
and embedding and inputting the sequence coding vector and the target word of the previous time step into a word level decoding module, and predicting the triples in the form of generated words to obtain a group of interaction tuples.
2. The method of claim 1, wherein obtaining summary text data of a biomedical document and mapping words and characters in the summary text data into embedded vectors to obtain a feature representation of a source text sequence, comprises:
acquiring a source abstract text of a biomedical document;
constructing a vocabulary from the source abstract text, the vocabulary including the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token;
initializing Word embedding for the source abstract text by using a Word2Vec tool to obtain a pre-training Word vector;
extracting word feature vectors based on characters from the source abstract text by adopting a convolution network with maximum pooling;
and connecting the pre-training word vector with the word feature vector based on the characters to obtain the feature representation of the source text sequence.
3. The method of claim 1, wherein the encoder comprises a number of Bi-LSTM network elements, the number of Bi-LSTM network elements being the same as the length of the source text sequence;
inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder, comprising:
and respectively inputting each token vector representation in the source text sequence into a corresponding Bi-LSTM network unit of an encoder to obtain a hidden representation of each token vector representation in the source text sequence.
4. The method of claim 1, wherein the hidden representation of the encoder is encoded using an attention mechanism to obtain the sequence encoded vector, wherein the attention mechanism is: avg attention mechanism, N-gram attention mechanism or Single attention mechanism;
Wherein the expression of Avg attention mechanism is:
the expression of the N-gram attention mechanism for n=3 is:
wherein the quantities in formulas (2) to (4) denote, respectively: the last hidden state of the encoder; the word-gram combination; the g-gram word representation sequence of the input sequence; the i-th g-gram vector; the normalized attention score of the i-th g-gram vector; and two trainable parameters;
the expression of the Single attention mechanism is:
5. The method of claim 1, wherein the word-level decoding module comprises: a decoder and a projection layer; the decoder comprises a plurality of LSTM networks;
the vocabulary includes the source abstract text tokens, the interaction names in the interaction set R, a component separator, a tuple separator, a target sequence start token and a target sequence end token;
embedding and inputting the sequence coding vector and the target word of the previous time step into a word level decoding module, and predicting triples in the form of generated words to obtain a group of interaction tuples, wherein the method comprises the following steps:
embedding and inputting the sequence coding vector and the target word of the previous time step into the decoder to obtain the hidden representation of the current token; the hidden representation of the current token has the following expression:
wherein the quantities in formula (10) denote, respectively: the hidden representation of the current token; the source text sequence encoding; the target word embedding of the previous time step; and the forward hidden state of the LSTM;
and inputting the hidden representation of the current token into a projection layer, and predicting the triples in terms by adopting a mask-based replication mechanism according to the obtained output to obtain a group of interaction tuples.
6. The method of claim 5, wherein the projection layer is a linear layer;
inputting the hidden representation of the current token into a projection layer, and predicting triples in terms using a mask-based replication mechanism based on the resulting output to obtain a set of interacting tuples, comprising:
inputting the hidden representation of the current token into a linear layer, and mapping the hidden representation of the current token to the whole vocabulary to obtain the output of the linear layer; the expression of the linear layer is:
wherein the quantities in formula (11) denote, respectively: the representation obtained by linearly reducing the hidden-layer vector to one dimension over the embeddings of all words in the vocabulary; a weight matrix; and a bias vector;
setting the output of the linear layers corresponding to all words except the current source text sequence token, the interaction token, the separator token, the unlabeled token and the ending target sequence token in the vocabulary to be minus infinity, and obtaining the output of the processed linear layers;
Activating by adopting a softmax function according to the processed output of the linear layer to obtain the normalized score of all words embedded in the vocabulary at the current time step; the normalized score is expressed as:
and predicting the triples in terms of words according to the normalized scores to obtain a group of interaction tuples.
7. A biomedical knowledge extraction device, the device comprising:
an interaction tuple representation setting module for setting an interaction tuple representation; the interaction tuple representation takes the form: tuple components are separated by a component separator, multiple tuples are separated by a tuple separator, and the interaction tuple components comprise: entity 1, entity 2 and the interaction;
the source text sequence feature representation acquisition module is used for acquiring abstract text data of a biomedical document, and mapping words and characters in the abstract text data into embedded vectors to obtain feature representation of a source text sequence; the source text sequence includes all words, symbols and each interacted clue word in the abstract text data;
The encoding module is used for inputting the characteristic representation of the source text sequence into an encoder to obtain a hidden representation of the encoder; adopting an attention mechanism to the hidden representation of the encoder to obtain a sequence coding vector; the encoder is used for encoding the characteristic information of the source text sequence by adopting Bi-LSTM;
and the decoding and predicting module is used for embedding and inputting the sequence coding vector and the target word of the previous time step into the word level decoding module, and predicting the triples in the form of generated words to obtain a group of interaction tuples.
8. The apparatus of claim 7, wherein the source text sequence feature representation acquisition module is further configured to acquire source digest text of the biomedical literature; constructing a vocabulary according to the source abstract text, wherein the vocabulary comprises a source abstract text token, an interaction name, a component separator, a target sequence start token and a target sequence end token in an interaction set R; initializing Word embedding is carried out on the source abstract text by using a Word2Vec tool, so as to obtain a pre-training Word vector; extracting word feature vectors based on characters from the source abstract text by adopting a convolution network with maximum pooling; and connecting the pre-training word vector with the word feature vector based on the characters to obtain the feature representation of the source text sequence.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 6 when executing the computer program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310495163.0A CN116227597A (en) | 2023-05-05 | 2023-05-05 | Biomedical knowledge extraction method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116227597A true CN116227597A (en) | 2023-06-06 |
Family
ID=86569724
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310495163.0A Pending CN116227597A (en) | 2023-05-05 | 2023-05-05 | Biomedical knowledge extraction method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116227597A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117556277A (en) * | 2024-01-12 | 2024-02-13 | 暨南大学 | Initial alignment seed generation method for knowledge-graph entity alignment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180329883A1 (en) * | 2017-05-15 | 2018-11-15 | Thomson Reuters Global Resources Unlimited Company | Neural paraphrase generator |
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | 北京工业大学 | A method of the sequence labelling joint based on attention mechanism extracts entity relationship |
WO2020253669A1 (en) * | 2019-06-19 | 2020-12-24 | 腾讯科技(深圳)有限公司 | Translation method, apparatus and device based on machine translation model, and storage medium |
WO2022212008A1 (en) * | 2021-03-31 | 2022-10-06 | Microsoft Technology Licensing, Llc | Learning molecule graphs embedding using encoder-decoder architecture |
CN115270761A (en) * | 2022-07-28 | 2022-11-01 | 中国人民解放军国防科技大学 | Relation extraction method fusing prototype knowledge |
US11544943B1 (en) * | 2022-05-31 | 2023-01-03 | Intuit Inc. | Entity extraction with encoder decoder machine learning model |
History
- 2023-05-05: CN application CN202310495163.0A filed; published as CN116227597A (en); legal status: Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180329883A1 (en) * | 2017-05-15 | 2018-11-15 | Thomson Reuters Global Resources Unlimited Company | Neural paraphrase generator |
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | 北京工业大学 | A method of the sequence labelling joint based on attention mechanism extracts entity relationship |
WO2020253669A1 (en) * | 2019-06-19 | 2020-12-24 | Tencent Technology (Shenzhen) Co., Ltd. | Translation method, apparatus and device based on machine translation model, and storage medium |
WO2022212008A1 (en) * | 2021-03-31 | 2022-10-06 | Microsoft Technology Licensing, Llc | Learning molecule graphs embedding using encoder-decoder architecture |
US11544943B1 (en) * | 2022-05-31 | 2023-01-03 | Intuit Inc. | Entity extraction with encoder decoder machine learning model |
CN115270761A (en) * | 2022-07-28 | 2022-11-01 | 中国人民解放军国防科技大学 | Relation extraction method fusing prototype knowledge |
Non-Patent Citations (2)
Title |
---|
TAPAS NAYAK et al.: "Effective Modeling of Encoder-Decoder Architecture for Joint Entity and Relation Extraction", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pages 8529-8532 * |
黄培馨; 赵翔; 方阳; 朱慧明; 肖卫东: "End-to-End Joint Extraction of Knowledge Triples with Adversarial Training", Journal of Computer Research and Development, no. 12, pages 20-32 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117556277A (en) * | 2024-01-12 | 2024-02-13 | 暨南大学 | Initial alignment seed generation method for knowledge-graph entity alignment |
CN117556277B (en) * | 2024-01-12 | 2024-04-05 | 暨南大学 | Initial alignment seed generation method for knowledge-graph entity alignment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Verga et al. | Simultaneously self-attending to all mentions for full-abstract biological relation extraction | |
CN107506414B (en) | Code recommendation method based on long-term and short-term memory network | |
CN110032739B (en) | Method and system for extracting named entities of Chinese electronic medical record | |
US10949456B2 (en) | Method and system for mapping text phrases to a taxonomy | |
CN113420163B (en) | Heterogeneous information network knowledge graph completion method and device based on matrix fusion | |
CN113268612B (en) | Heterogeneous information network knowledge graph completion method and device based on mean value fusion | |
CN117076653B (en) | Knowledge base question-answering method based on thinking chain and visual lifting context learning | |
CN113806493B (en) | Entity relationship joint extraction method and device for Internet text data | |
CN114678061A (en) | Protein conformation perception representation learning method based on pre-training language model | |
Yang et al. | Modality-DTA: multimodality fusion strategy for drug–target affinity prediction | |
CN116227597A (en) | Biomedical knowledge extraction method, device, computer equipment and storage medium | |
CN114881035A (en) | Method, device, equipment and storage medium for augmenting training data | |
CN110008482A (en) | Text handling method, device, computer readable storage medium and computer equipment | |
Dalai et al. | Part-of-speech tagging of Odia language using statistical and deep learning based approaches | |
CN114090769A (en) | Entity mining method, entity mining device, computer equipment and storage medium | |
Che et al. | Fast and effective biomedical named entity recognition using temporal convolutional network with conditional random field | |
Devkota et al. | A Gated Recurrent Unit based architecture for recognizing ontology concepts from biological literature | |
Passban | Machine translation of morphologically rich languages using deep neural networks | |
He et al. | Neural unsupervised reconstruction of protolanguage word forms | |
CN116414988A (en) | Graph convolution aspect emotion classification method and system based on dependency relation enhancement | |
Gao et al. | Citation entity recognition method using multi‐feature semantic fusion based on deep learning | |
Singh et al. | Comparing RNNs and log-linear interpolation of improved skip-model on four Babel languages: Cantonese, Pashto, Tagalog, Turkish | |
Heaps et al. | Toward detection of access control models from source code via word embedding | |
CN113961715A (en) | Entity linking method, device, equipment, medium and computer program product | |
Wei et al. | Biomedical named entity recognition via a hybrid neural network model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||