CN114925678A

CN114925678A - Drug entity and relationship combined extraction method based on high-level interaction mechanism

Info

Publication number: CN114925678A
Application number: CN202210425479.8A
Authority: CN
Inventors: 朱嘉静; 邓皓瀚; 刘勇国; 张云; 李巧勤
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-04-21
Filing date: 2022-04-21
Publication date: 2022-08-19
Anticipated expiration: 2042-04-21
Also published as: CN114925678B

Abstract

The invention discloses a drug entity and relationship combined extraction method based on a high-level interaction mechanism, comprising the following steps: S1: acquiring a drug entity corpus and preprocessing its characters; S2: constructing a drug anatomical classification dictionary; S3: recognizing the drug entities in the preprocessed drug entity corpus; S4: obtaining drug sentences with inserted entity markers; S5: extracting the drug relations; S6: performing interactive training of drug entity recognition and drug relation extraction based on the high-level interaction mechanism, and using the resulting drug entity recognition model and drug relation extraction model to perform combined extraction of drug entities and relations on newly acquired drug input text. The invention addresses the error propagation problem of combined extraction and, after the drug entity recognition task is completed, introduces a text marking mechanism based on ATC codes, thereby improving the accuracy of drug relation extraction.

Description

Drug entity and relationship combined extraction method based on high-level interaction mechanism

Technical Field

The invention belongs to the technical field of drug identification, and particularly relates to a drug entity and relationship combined extraction method based on a high-level interaction mechanism.

Background

Drug-drug interactions (DDI) refer to the positive or negative effect that one drug has on another when the two are used simultaneously. When two or more drugs are used together, the action of a given drug may be altered by the others, for example changing its safety and efficacy or even causing serious side effects. Fully acquiring and understanding drug interaction information is therefore important for reducing medical costs and avoiding medical accidents.

An existing drug relation extraction method based on a residual network and an attention mechanism uses a two-layer bidirectional long short-term memory (Bi-LSTM) network to model the input drug relation sentences as sequences, mining the dependencies between distant words in the drug relation description and addressing the vanishing-gradient problem during model training. It introduces residual connections to dynamically construct network models of different depths and structures, integrates an attention mechanism to compute weights for word information, and finally fuses the sequence information with the attention information and feeds the result into a Softmax classifier to extract drug relations. It has the following disadvantages: 1) it requires complicated manual feature extraction and data preprocessing steps, such as negative-instance filtering and entity blinding, and entity blinding inevitably prevents the model from learning information hidden at the word level, for example that cefazolin, cefalexin and cefradine are all cephalosporin antibiotics; 2) the predicted drug relation depends heavily on the drug entities recognized by an external tool, i.e. there is an error propagation problem; 3) the correlation between drug entity recognition and drug relation extraction is ignored.

An existing combined entity and relation extraction method with a feedback mechanism uses a Bi-LSTM at the bottom layer to extract contextual features of the text, and the two tasks of named entity recognition and relation extraction share the feature vectors extracted at this layer. An attention mechanism is introduced to capture key local features for decoding, named entity recognition is completed with a conditional random field, and the final feature vector of each recognized entity is concatenated with the shared-layer feature vector and fed into a convolutional neural network (CNN) to complete relation extraction. Loss influence coefficients are preset for the two tasks; the loss of each task is calculated, and the relation extraction loss and the entity recognition loss are simply summed with the influence coefficients as weights to feed back and correct the parameters of the shared layer and the entity recognition module. It has the following disadvantages: 1) sharing the bottom-layer Bi-LSTM feature extraction between the two tasks does not bring better results and may even hurt performance; 2) the hyperparameters are preset and the two losses are added linearly according to the influence coefficients, so in actual training, because the two loss functions converge at different speeds, back-propagation may be dominated by one task, which degrades the other task and defeats the purpose of loss interaction.

Disclosure of Invention

In order to solve the problems, the invention provides a drug entity and relationship combined extraction method based on a high-level interaction mechanism.

The technical scheme of the invention is as follows: a drug entity and relationship combined extraction method based on a high-level interaction mechanism comprises the following steps:

S1: acquiring a drug entity corpus and preprocessing its characters;

S2: acquiring the ATC classification codes of all drugs and constructing a drug anatomical classification dictionary;

S3: recognizing the drug entities in the preprocessed drug entity corpus;

S4: obtaining drug sentences with inserted entity markers according to the drug entities and the drug anatomical classification dictionary;

S5: extracting the drug relations from the drug sentences with inserted entity markers;

S6: performing interactive training of drug entity recognition and drug relation extraction based on the high-level interaction mechanism, according to the recognized drug entities and the extracted drug relations, to obtain a drug entity recognition model and a drug relation extraction model, and using the two models to perform combined extraction of drug entities and relations on newly acquired drug input text.

Further, in step S1, the characters of the drug entity corpus are preprocessed as follows: the drug relation description text in the corpus is converted to lowercase, punctuation marks and non-English characters are removed, numbers are replaced with "num", the text is labeled using the BIO (begin, inside, outside) tagging scheme, and a maximum sentence length is set.

Further, in step S2, the drug anatomical classification dictionary is constructed as follows: the ATC classification codes of all drugs are crawled from DrugBank, and the dictionary is built with the drug name as the key and the first character of the ATC classification code as the value.

Further, step S3 includes the following sub-steps:

s31: obtaining sentence vectors in the preprocessed drug entity corpus by using a BioBert model;

s32: inputting the sentence vectors into a linear layer for linear conversion to obtain score vectors;

s33: and inputting the score vector into the linear conditional random field to obtain a final labeling sequence, and extracting the drug entity from the final labeling sequence.

Further, in step S31, the BioBert model comprises 12 Transformer layers, and the output of each Transformer layer is the input to the next;

in step S31, the sentence vectors of the preprocessed drug entity corpus are obtained as follows: each character of the drug entity corpus is taken as the input text $X = \{x_1, x_2, \ldots, x_m\}$ and fed sequentially into the BioBert model, and the output vectors of the last four Transformer layers are concatenated to obtain the sentence vector $V = \mathrm{BioBert}(X) = \{v_1, v_2, \ldots, v_m\}$, where $x_i$ denotes the $i$-th character of the input text sequence, $v_i$ denotes the BioBert output vector corresponding to the $i$-th character, $\mathrm{BioBert}(\cdot)$ denotes the BioBert model operation, and $m$ denotes the length of the input text sequence;

in step S32, the score vector $H$ is calculated as:

$$H = WV + b$$

where $W$ denotes the weight of the linear layer and $b$ denotes the bias of the linear layer;

in step S33, the final labeling sequence is determined as follows: a linear-chain conditional random field is used to find the labeling sequence that maximizes the prediction probability $P(Y \mid H)$, where the probability that the final labeling sequence $Y$ takes the value $y$ given that the score vector $H$ takes the value $h$ is:

$$P(y \mid h) = \frac{1}{Z(h)} \exp\left( \sum_{i=1}^{m} \sum_{k=1}^{K} \lambda_k\, t_k(y_{i-1}, y_i, h, i) + \sum_{i=1}^{m} \sum_{l=1}^{L} \mu_l\, s_l(y_i, h, i) \right)$$

where $Z(h)$ denotes the normalization factor, $i$ denotes the position of the current word, $t_k(\cdot)$ denotes a local feature function, $\exp(\cdot)$ denotes the exponential operation, $K$ denotes the total number of local feature functions defined at the node, $m$ denotes the input text length, $s_l(\cdot)$ denotes a node feature function defined at node $y_i$, $L$ denotes the total number of node feature functions defined at the node, $\lambda_k$ denotes the weight coefficient of the local feature function $t_k$, $\mu_l$ denotes the weight coefficient of the node feature function $s_l$, and $h$ denotes the value of the score vector.
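The sub-steps S31 to S33 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names `score_vectors` and `viterbi_decode` are hypothetical, the per-layer hidden states are assumed to be already available as arrays, and a single learned transition matrix stands in for the weighted local feature functions of the conditional random field.

```python
import numpy as np

def score_vectors(hidden_states, W, b):
    """Concatenate the last four layer outputs and apply H = W V + b.

    hidden_states: list of per-layer arrays, each of shape (m, d).
    W: (num_tags, 4 * d) weight, b: (num_tags,) bias.
    """
    V = np.concatenate(hidden_states[-4:], axis=-1)  # (m, 4d)
    return V @ W.T + b                               # (m, num_tags)

def viterbi_decode(H, transitions):
    """Return the tag sequence with the highest total score.

    H: (m, num_tags) node scores; transitions[p, c] is the (assumed)
    score of moving from tag p to tag c.
    """
    m, _ = H.shape
    score = H[0].copy()
    backptr = np.zeros(H.shape, dtype=int)
    for i in range(1, m):
        # cand[p, c]: best path ending in tag p, followed by p -> c
        cand = score[:, None] + transitions
        backptr[i] = cand.argmax(axis=0)
        score = cand.max(axis=0) + H[i]
    best = [int(score.argmax())]
    for i in range(m - 1, 0, -1):
        best.append(int(backptr[i][best[-1]]))
    return best[::-1]

# Tags: 0 = O, 1 = B, 2 = I. The transition O -> I is penalised.
transitions = np.zeros((3, 3))
transitions[0, 2] = -1e4
H = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 1.0]])
path = viterbi_decode(H, transitions)
```

In this toy input, a per-position argmax would pick I directly after O; decoding over the whole sequence selects O, B, I instead, which is the kind of constraint the conditional random field layer is meant to enforce.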

Further, in step S4, the drug sentence with inserted entity markers is obtained as follows: entity markers in the format <d1e> </d1e> or <d2e> </d2e> are inserted on both sides of each drug entity, where d denotes a drug entity, 1 or 2 denotes the ordinal of the drug entity, and e denotes the ATC classification code from the drug anatomical classification dictionary.

Further, step S5 includes the following sub-steps:

S51: inserting a [CLS] token at the head of the drug sentence with inserted entity markers, and inputting the resulting sentence into the BioBert model to obtain the encoded representation of the drug sentence;

s52: splicing the coded representation of the drug sentences and the marked vector representation thereof to obtain the overall characteristic vector of the drug entity;

s53: extracting two drug entities and words between the two drug entities from the coded representation of the drug sentences to form a subsequence, inputting the vector representation of the subsequence into a convolutional neural network, and sequentially performing convolutional operation and maximum pooling operation to obtain a word element relation characteristic vector;

s54: sequentially splicing the coding expression of the medicine sentences, the overall characteristic vector of the medicine entity and the characteristic vector of the word element relation to obtain a final characteristic vector;

s55: and compressing the dimensionality of the final feature vector by using a feedforward neural network, inputting the dimensionality into a softmax layer for classification, obtaining a class label with the maximum probability of the final prediction result in classification results, and taking the class label as a medicine relation.

Further, in step S51, the encoded representation $S$ of the drug sentence is expressed as:

$$S = (s_0, \ldots, s_{\langle d1e \rangle}, s_{d1}, s_{\langle /d1e \rangle}, \ldots, s_{\langle d2e \rangle}, s_{d2}, s_{\langle /d2e \rangle}, \ldots, s_{m+5})$$

where $s_0$ denotes the sentence-level feature vector corresponding to the [CLS] token, $s_{d1}$ denotes the vector representation of the first drug entity, $s_{d2}$ denotes the vector representation of the second drug entity, $s_{\langle d1e \rangle}$ and $s_{\langle /d1e \rangle}$ denote the vector representations, obtained from the BioBert model, of the text markers on the left and right of the first drug entity, $s_{\langle d2e \rangle}$ and $s_{\langle /d2e \rangle}$ denote the vector representations, obtained from the BioBert model, of the text markers on the left and right of the second drug entity, and $m$ denotes the length of the original input text sequence;

in step S52, the global feature vector of the $i_d$-th drug entity is expressed as:

in step S53, the convolution operation is calculated as:

$$z^{(a)} = \sigma\left(w^{(a)} \cdot s_{g:g+k-1} + b^{(a)}\right)$$

where $\sigma(\cdot)$ denotes a nonlinear activation function, $w^{(a)} \in \mathbb{R}^{k \times d}$ denotes the $a$-th convolution kernel, $k$ denotes the size of the context window of the convolution kernel, $d$ is the dimension of the input vector, $b^{(a)}$ denotes the bias unit of the $a$-th convolutional layer, and $s_{g:g+k-1}$ denotes the matrix of vectors formed by the $g$-th to $(g+k-1)$-th words of the input sequence;

after convolution operation, for a feature sequence S' with the length of G, a convolution kernel with the dimension of k × d generates a set of G-k +1 values, and potential features of the input sequence after convolution by an a-th convolution kernel are represented as:

wherein ,

a set of values obtained by convolving the convolution kernels is represented;

the calculation formula for performing the maximum pooling operation is:

wherein ,z^* The characteristic vector of the word element relation is represented,

representing potential feature vectors obtained after each convolution kernel is subjected to maximum pooling;

in step S54, the expression of the final feature vector R is:

in step S55, the category label

The calculation formula of (c) is:

P(y)＝Softmax(W'R+b')

where $\mathrm{argmax}(\cdot)$ denotes the operation that returns the argument with the maximum value, $W'$ denotes the weight of the Softmax layer, $b'$ denotes the bias of the Softmax layer, $P(y)$ denotes the probability that the candidate drug pair has relation $y$, and $C$ denotes the set of all relation categories.
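Steps S53 to S55 can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the patented implementation: the names `conv_maxpool` and `classify` are hypothetical, tanh is assumed for the unspecified activation $\sigma$, a single linear layer stands in for the feed-forward compression plus Softmax layer, and `e1` and `e2` are placeholders for the global feature vectors of the two drug entities, whose exact formula is not reproduced in this text.

```python
import numpy as np

def conv_maxpool(S_sub, kernels, biases):
    """Convolve each kernel over the subsequence, then max-pool.

    S_sub: (G, d) vectors of the two entities and the words between them.
    kernels: (A, k, d) convolution kernels, biases: (A,).
    Returns one pooled value per kernel, shape (A,).
    """
    A, k, d = kernels.shape
    G = S_sub.shape[0]
    # All G - k + 1 windows of k consecutive word vectors.
    windows = np.stack([S_sub[g:g + k] for g in range(G - k + 1)])
    z = np.tanh(np.einsum("gkd,akd->ag", windows, kernels) + biases[:, None])
    return z.max(axis=1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(s0, e1, e2, z_star, W, b):
    """Concatenate the features into R and return (argmax label, P)."""
    R = np.concatenate([s0, e1, e2, z_star])
    p = softmax(W @ R + b)
    return int(p.argmax()), p

rng = np.random.default_rng(0)
S_sub = rng.normal(size=(5, 4))        # G = 5 words, d = 4
kernels = rng.normal(size=(3, 2, 4))   # A = 3 kernels, window k = 2
z_star = conv_maxpool(S_sub, kernels, np.zeros(3))
label, probs = classify(np.ones(4), np.ones(4), np.ones(4), z_star,
                        rng.normal(size=(5, 15)), np.zeros(5))
```

Each kernel yields $G-k+1 = 4$ values here, and max pooling keeps one per kernel, so the relation feature has one component per kernel regardless of how far apart the two entities are.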

Further, step S6 includes the following sub-steps:

S61: calculating the loss value of the drug entity recognition model and the loss value of the drug relation extraction model;

S62: for each of drug entity recognition and drug relation extraction, calculating the ratio of the loss value of the previous epoch to the loss value of the epoch before it;

S63: calculating the loss function weights for drug entity recognition and drug relation extraction from the ratio, for each task, of the loss value of the previous epoch to the loss value of the epoch before it;

S64: calculating the total loss value of the high-level interaction mechanism from the loss function weights of drug entity recognition and drug relation extraction, thereby completing the interactive training of drug entities and relations and obtaining the drug entity recognition model and the drug relation extraction model;

S65: preprocessing the newly acquired drug input text by the method of step S1, inputting the preprocessed text into the drug entity recognition model to obtain the drug entities, inserting the anatomical classification markers for the drug entities, and inputting the marked text into the drug relation extraction model to obtain the drug interaction category.

Further, in step S61, the calculation formula of the loss value of the drug entity identification model and the loss value of the drug relationship extraction model is the same, and both are:

where U represents the number of training samples, h _θ (. cndot.) denotes training sample x _u C represents the number of label categories,

the prediction result of the u-th training sample for the class c is shown, lambda represents a controllable penalty factor, o represents the number of parameters of training, and theta _j Parameters representing training;

in step S62, the ratio of the loss value of the previous epoch to the loss value of the epoch before it, for drug entity recognition and drug relation extraction, is calculated as:

where n denotes the nth task, t denotes the number of iterations,

indicating the value of the loss for the previous epoch,

a loss value representing a time period immediately preceding the previous time period;

in step S63, loss function weights during drug entity identification and drug relationship extraction

The calculation formula of (2) is as follows:

wherein N represents the total number of tasks in the drug entity identification and drug relationship extraction process, exp (-) represents the exponential operation,

denotes the ratio of the loss value of the previous epoch to the loss value of the epoch before it for drug entity recognition and drug relation extraction, and T denotes the degree of relaxation between tasks, i.e. how soft the weighting across tasks is;

in step S64, the total loss value $L_{total}(\theta)$ of the high-level interaction mechanism for drug entity recognition and drug relation extraction is calculated as:

wherein ,

weight of loss function, L, representing drug entity identification ₁ (theta) represents a loss value of drug entity identification,

weight of loss function, L, representing drug relation extraction ₂ (θ) represents the loss value of drug relationship extraction.
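Steps S62 to S64 match the general shape of a dynamic weight average over task losses. The sketch below assumes that form: for each task the ratio is the previous epoch's loss divided by the loss of the epoch before it, the weights are N times a softmax of the ratios divided by T, and the total loss is the weighted sum of the two task losses. The function names and the default T = 2.0 are assumptions for illustration and do not come from the patent text.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, T=2.0):
    """Per-task loss weights from the loss ratios of the last two epochs.

    prev_losses[n]: loss of task n in the previous epoch.
    prev_prev_losses[n]: loss of task n in the epoch before that.
    T: assumed temperature; larger T gives more even weights.
    """
    N = len(prev_losses)
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / T) for r in ratios]
    total = sum(exps)
    # Scaling by N keeps the weights summing to the number of tasks.
    return [N * e / total for e in exps]

def total_loss(losses, weights):
    """Weighted sum of the task losses."""
    return sum(w * l for w, l in zip(weights, losses))

# Task 0: entity recognition, loss fell from 1.0 to 0.5 (fast progress).
# Task 1: relation extraction, loss fell from 1.0 to 0.9 (slow progress).
weights = dwa_weights([0.5, 0.9], [1.0, 1.0])
combined = total_loss([0.5, 0.9], weights)
```

The task whose loss is decreasing more slowly receives the larger weight, so neither task dominates back-propagation simply because it converges faster.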

The beneficial effects of the invention are:

(1) The invention provides a drug entity and relationship combined extraction method based on a high-level interaction mechanism. It requires no hand-crafted features or repeated data preprocessing steps, makes full use of the correlation between the two tasks, and introduces a high-level dynamic interaction mechanism between drug entity recognition and drug relation extraction to complete the combined extraction of drug entities and their relations, addressing the error propagation problem of combined extraction. After the drug entity recognition task is completed, it also introduces a text marking mechanism based on ATC codes, which improves the accuracy of drug relation extraction.

(2) By inserting entity type markers in the drug relation extraction stage, the invention introduces the position and type information of the drugs, so that the model can learn the connection between entity types and relations without complicated preprocessing steps such as entity blinding. The BioBert + CNN model can fully learn the hidden information in the words between the drug entities. The designed dynamic weight mechanism realizes high-level information interaction between the two tasks and makes full use of their mutual information, achieving combined extraction of drug relations and drug entities and greatly alleviating the error propagation problem in the combined extraction process.

Drawings

FIG. 1 is a flow chart of a method of combined drug entity and relationship extraction;

FIG. 2 is a block diagram of a method for extracting drug entities and relationships in a combined manner.

Detailed Description

The embodiments of the present invention will be further described with reference to the accompanying drawings.

Before describing specific embodiments of the present invention, in order to make the solution of the present invention more clear and complete, the definitions of abbreviations and key terms appearing in the present invention will be explained:

ATC classification coding: A (anatomical) refers to anatomy and indicates the organ system on which the drug acts; T (therapeutic) refers to therapeutics and indicates the purpose of the medication; C (chemical) refers to chemistry and indicates the chemical classification of the drug. An ATC code has 7 characters, of which the 1st, 4th and 5th are letters and the 2nd, 3rd, 6th and 7th are digits. The ATC system classifies drugs at 5 levels.
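The code layout described above can be checked with a short Python sketch. The helper name `atc_levels` is hypothetical, and the set of allowed first letters is taken from the 14 anatomical groups listed later in this description; the example code string is used only to show the format.

```python
import re

# Letter, two digits, two letters, two digits (7 characters in total).
ATC_PATTERN = re.compile(r"[ABCDGHJLMNPRSV]\d{2}[A-Z]{2}\d{2}")

def atc_levels(code):
    """Split a full 7-character ATC code into its five hierarchical levels."""
    code = code.strip().upper()
    if not ATC_PATTERN.fullmatch(code):
        raise ValueError(f"not a 7-character ATC code: {code!r}")
    # Level boundaries: 1 letter, +2 digits, +1 letter, +1 letter, +2 digits.
    return [code[:1], code[:3], code[:4], code[:5], code[:7]]

levels = atc_levels("n07bb04")
```

Only the first level, the single anatomical letter, is used as the dictionary value in step S2.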

BioBert model: a pre-trained biomedical language representation model for biomedical text mining.

As shown in FIG. 1, the present invention provides a method for extracting drug entities and relationships jointly based on a high-level interaction mechanism, which comprises the following steps:

s3: identifying the drug entities in the preprocessed drug entity corpus;

s6: and performing interactive training of the drug entities and the relationships based on a high-level interactive mechanism according to the identified drug entities and the extracted drug relationships to obtain a drug entity identification model and a drug relationship extraction model, and performing drug entity and relationship combined extraction on the acquired latest drug input texts by using the drug entity identification model and the drug relationship extraction model.

In the embodiment of the present invention, as shown in fig. 2, the method mainly includes the steps of corpus acquisition and preprocessing, ATC classification dictionary construction, text preliminary coding representation, drug entity recognition, drug relationship extraction, cross-feedback joint training model establishment, and the like.

In the embodiment of the present invention, in step S1, the specific method for preprocessing the characters of the corpus of pharmaceutical entities is as follows: converting the medicine relation description text in the medicine entity corpus into lowercase, removing punctuation marks and non-English characters, replacing numbers with num, labeling by adopting a BIO coding mode, and setting the maximum sentence length.

In the embodiment of the invention, the DDIExtraction2013 challenge data set is used as the training and test corpus. All numbers in the drug relation description text are replaced with the word "num", because the relations between drugs are mostly extracted from the semantic information of the text and are unrelated to numeric quantities and units, which would otherwise act as noise during training. The data set samples are labeled using the "BIO" scheme: each word is assigned a label, each entity has a begin label and inside labels, and non-entity words are labeled O. A maximum sentence length that the model can handle is set, and a sentence instance shorter than the maximum length is padded with the character "0".
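The preprocessing of step S1 can be sketched in Python as follows. The function names, the regular expressions, the whitespace tokenization and the truncation of over-long sentences are assumptions made for illustration; the source only specifies lowercasing, removal of punctuation and non-English characters, replacement of numbers with "num", BIO labeling, and padding with "0".

```python
import re

def preprocess(sentence, max_len=8, pad="0"):
    """Lowercase, replace numbers with 'num', strip non-letters, pad."""
    text = sentence.lower()
    text = re.sub(r"\d+(?:\.\d+)?", " num ", text)   # numbers -> num
    text = re.sub(r"[^a-z\s]", " ", text)            # drop punctuation etc.
    tokens = text.split()[:max_len]                  # truncate if too long
    return tokens + [pad] * (max_len - len(tokens))  # pad if too short

def bio_tags(tokens, entity_spans):
    """Assign B/I/O labels; entity_spans holds (start, end) token
    indices with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = preprocess("Aspirin 25% increases AUC.", max_len=6)
tags = bio_tags(["take", "acetylsalicylic", "acid", "daily"], [(1, 3)])
```

Numbers are replaced before punctuation is stripped so that a value such as "25%" becomes the single token "num" rather than disappearing.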

In the embodiment of the present invention, in step S2, the drug anatomical classification dictionary is constructed as follows: the ATC classification codes of all drugs are crawled from DrugBank, and the dictionary is built with the drug name as the key and the first character of the ATC classification code as the value.

The drug anatomical categories are as follows: 1. alimentary tract and metabolism (A); 2. blood and blood-forming organs (B); 3. cardiovascular system (C); 4. dermatologicals (D); 5. genitourinary system and sex hormones (G); 6. systemic hormonal preparations (H); 7. anti-infectives for systemic use (J); 8. antineoplastic and immunomodulating agents (L); 9. musculoskeletal system (M); 10. nervous system (N); 11. antiparasitic products (P); 12. respiratory system (R); 13. sensory organs (S); 14. various (V). During construction of the drug anatomical classification dictionary, a small number of drugs correspond to two or more ATC codes; in that case the first code listed on the DrugBank website is used.
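A minimal Python sketch of the dictionary construction is given below. It assumes the crawled data is already available as a mapping from drug name to the list of ATC codes in the order they appear on DrugBank; the function name `build_anatomy_dict`, the lowercased keys and the sample entries are illustrative assumptions, and real codes should be taken from DrugBank itself.

```python
def build_anatomy_dict(drug_to_codes):
    """Map each drug name to the first character of its first ATC code."""
    anatomy = {}
    for name, codes in drug_to_codes.items():
        if codes:  # skip drugs that have no ATC code at all
            # Keys are lowercased to match the lowercased corpus text.
            anatomy[name.lower()] = codes[0][0].upper()
    return anatomy

# Illustrative input only; the third entry has several codes,
# so the first listed one decides the category.
sample = {
    "Naltrexone": ["N07BB04"],
    "Acamprosate": ["N07BB03"],
    "Acetylsalicylic acid": ["A01AD05", "B01AC06", "N02BA01"],
}
anatomy = build_anatomy_dict(sample)
```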

In the embodiment of the present invention, step S3 includes the following sub-steps:

In the embodiment of the present invention, in step S31, the BioBert model comprises 12 Transformer layers, and the output of each Transformer layer is the input to the next;

in step S31, the sentence vectors of the preprocessed drug entity corpus are obtained as follows: each character of the drug entity corpus is taken as the input text $X = \{x_1, x_2, \ldots, x_m\}$ and fed sequentially into the BioBert model, and the output vectors of the last four Transformer layers are concatenated to obtain the sentence vector $V = \mathrm{BioBert}(X) = \{v_1, v_2, \ldots, v_m\}$, where $x_i$ denotes the $i$-th character of the input text sequence, $v_i$ denotes the BioBert output vector corresponding to the $i$-th character, $\mathrm{BioBert}(\cdot)$ denotes the BioBert model operation, and $m$ denotes the length of the input text sequence;

in step S32, the score vector $H$ is calculated as:

$$H = WV + b$$

in step S33, the final labeling sequence is determined as follows: a linear-chain conditional random field is used to find the labeling sequence that maximizes the prediction probability $P(Y \mid H)$, where the probability that the final labeling sequence $Y$ takes the value $y$ given that the score vector $H$ takes the value $h$ is:

$$P(y \mid h) = \frac{1}{Z(h)} \exp\left( \sum_{i=1}^{m} \sum_{k=1}^{K} \lambda_k\, t_k(y_{i-1}, y_i, h, i) + \sum_{i=1}^{m} \sum_{l=1}^{L} \mu_l\, s_l(y_i, h, i) \right)$$

where $Z(h)$ denotes the normalization factor, $i$ denotes the position of the current word, $t_k(\cdot)$ denotes a local feature function, $\exp(\cdot)$ denotes the exponential operation, $K$ denotes the total number of local feature functions defined at the node, $m$ denotes the input text length, $s_l(\cdot)$ denotes a node feature function defined at node $y_i$, $L$ denotes the total number of node feature functions defined at the node, $\lambda_k$ denotes the weight coefficient of the local feature function $t_k$, $\mu_l$ denotes the weight coefficient of the node feature function $s_l$, and $h$ denotes the value of the score vector.

In the embodiment of the present invention, in step S4, the drug sentence with inserted entity markers is obtained as follows: entity markers in the format <d1e> </d1e> or <d2e> </d2e> are inserted on both sides of each drug entity, where d denotes a drug entity, 1 or 2 denotes the ordinal of the drug entity, and e denotes the ATC classification code from the drug anatomical classification dictionary. The purpose of drug relation extraction is to extract the relation between two entities; the third character e takes one of 14 values, namely the anatomical classification code of the drug entity.

So that the model focuses on the drug entities in the text during relation extraction, ATC classification markers are inserted around only two entities in each text sample. If a text contains more than two drug entities, it is decomposed into multiple drug entity pairs; ATC classification markers are inserted for each pair, generating one example sample per pair. If the two marked drug entities in an example sample have the same name, that sample is deleted. For example, the original text "Co-administration of naltrexone with Acamprosate produced a 25% increase in AUC and a 33% in the Cmax of acamprosate" contains three drug mentions, so three drug entity pairs and three example samples can be constructed as follows:

1) Co-administration of <d1N>naltrexone</d1N> with <d2N>Acamprosate</d2N> produced a 25% increase in AUC and a 33% in the Cmax of acamprosate

2) Co-administration of <d1N>naltrexone</d1N> with Acamprosate produced a 25% increase in AUC and a 33% in the Cmax of <d2N>acamprosate</d2N>

3) Co-administration of naltrexone with <d1N>Acamprosate</d1N> produced a 25% increase in AUC and a 33% in the Cmax of <d2N>acamprosate</d2N>

In example 3, the two marked mentions, "Acamprosate" and "acamprosate", are in fact the same drug; since a drug does not interact with itself, example 3 is deleted.
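The pair generation and filtering described above can be sketched in Python. The function name `mark_pairs` is hypothetical, entities are assumed to be single whitespace-separated tokens identified by their index, names are compared case-insensitively, and the sentence is shortened for readability.

```python
from itertools import combinations

def mark_pairs(tokens, entity_idx, atc):
    """Build one marked sentence per drug entity pair.

    tokens: list of words; entity_idx: token positions of drug mentions;
    atc: dict mapping lowercase drug name to its anatomical letter.
    Pairs whose two mentions name the same drug are dropped.
    """
    samples = []
    for i, j in combinations(entity_idx, 2):
        a, b = tokens[i], tokens[j]
        if a.lower() == b.lower():
            continue  # a drug does not interact with itself
        out = list(tokens)
        ea, eb = atc[a.lower()], atc[b.lower()]
        out[i] = f"<d1{ea}>{a}</d1{ea}>"
        out[j] = f"<d2{eb}>{b}</d2{eb}>"
        samples.append(" ".join(out))
    return samples

sentence = ("Co-administration of naltrexone with Acamprosate "
            "raised the Cmax of acamprosate")
samples = mark_pairs(sentence.split(), [2, 4, 9],
                     {"naltrexone": "N", "acamprosate": "N"})
```

With three mentions there are three candidate pairs; the pair formed by the two acamprosate mentions is skipped, leaving two samples.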

Inserting the markers preserves the original word features of the entities while injecting their anatomical category information into the context and highlighting the relative positions of the subject and object. This lets the model learn hidden word-level connections between entities and improves its understanding of complex drug names. It also helps the model learn, in the relation extraction stage, the connection between the type features of the subject and object and the specific relation, which improves the results of drug relation extraction.

In the embodiment of the present invention, step S5 includes the following sub-steps:

s52: splicing the coded representation of the medicine sentences and the marked vector representation thereof to obtain the overall characteristic vector of the medicine entity;

s53: extracting two drug entities and words between the two drug entities from the coded representation of the drug sentences to form a subsequence, inputting the vector representation of the subsequence into a convolutional neural network, and sequentially performing convolution operation and maximum pooling operation to obtain a characteristic vector of the word element relationship;

In the embodiment of the present invention, in step S51, the encoded representation $S$ of the drug sentence is expressed as:

$$S = (s_0, \ldots, s_{\langle d1e \rangle}, s_{d1}, s_{\langle /d1e \rangle}, \ldots, s_{\langle d2e \rangle}, s_{d2}, s_{\langle /d2e \rangle}, \ldots, s_{m+5})$$

where $s_0$ denotes the sentence-level feature vector corresponding to the [CLS] token, $s_{d1}$ denotes the vector representation of the first drug entity, $s_{d2}$ denotes the vector representation of the second drug entity, $s_{\langle d1e \rangle}$ and $s_{\langle /d1e \rangle}$ denote the vector representations, obtained from the BioBert model, of the text markers on the left and right of the first drug entity, $s_{\langle d2e \rangle}$ and $s_{\langle /d2e \rangle}$ denote the vector representations, obtained from the BioBert model, of the text markers on the left and right of the second drug entity, and $m$ denotes the length of the original input text sequence;

in step S52, the global feature vector of the $i_d$-th drug entity is expressed as:

after the convolution operation, for a feature sequence $S'$ of length $G$, a convolution kernel of dimension $k \times d$ generates a set of $G-k+1$ values, and the latent features of the input sequence after convolution by the $a$-th convolution kernel are expressed as:

wherein ,

a set of values obtained by convolving the convolution kernel;

the calculation formula for performing the maximum pooling operation is:

wherein ,z^* The feature vector of the word element relation is represented,

in step S54, the expression of the final feature vector R is:

In step S55, the category label is calculated as:

P(y) = Softmax(W'R + b')

In the embodiment of the present invention, step S6 includes the following sub-steps:

S64: calculating the total loss value of the high-level interaction mechanism from the loss function weights of drug entity identification and drug relationship extraction, thereby completing the interactive training of drug entities and relationships and obtaining the drug entity recognition model and the drug relationship extraction model;

In the embodiment of the present invention, in step S61, the drug entity identification task is completed first and assists the drug relationship extraction task; the relationship extraction result is then fed back to the entity identification module, which readjusts its parameters according to both its own loss value and the loss value of the relationship extraction module. The relationship extraction module is adjusted in the same way. Because drug entity identification and drug relationship extraction can both be treated as multi-class classification problems, a multi-class logistic regression loss is used, and the loss values of the drug entity recognition model and the drug relationship extraction model are calculated with the same formula:

where U denotes the number of training samples, h_θ(·) denotes the model prediction for a training sample x_u, and C denotes the number of label categories;
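A loss of this kind can be sketched as below. This is only an illustration under assumptions: the function name is hypothetical, the data term is the usual averaged negative log-likelihood over U samples and C classes, and the penalty is assumed to be an L2 term built from the controllable penalty factor and the trained parameters that claim 10 lists; the exact form used in the invention may differ.

```python
import numpy as np

def multiclass_logistic_loss(probs, labels, params, penalty):
    """Multi-class logistic regression loss with an assumed L2 penalty.

    probs   : (U, C) predicted class probabilities h_theta(x_u)
    labels  : (U,) integer class index of each training sample
    params  : 1-D array of trained parameters theta_j
    penalty : controllable penalty factor (lambda)
    """
    U = probs.shape[0]
    picked = probs[np.arange(U), labels]     # probability of the true class
    data_term = -np.mean(np.log(picked))     # averaged over the U samples
    reg_term = penalty * np.sum(params ** 2) # assumed L2 form
    return data_term + reg_term

# Toy usage with U=2 samples and C=2 classes.
probs = np.array([[0.5, 0.5], [0.25, 0.75]])
labels = np.array([0, 1])
loss = multiclass_logistic_loss(probs, labels, np.zeros(2), penalty=0.1)
```

The same function would be applied to both tasks, once with the entity tagger's outputs and once with the relation classifier's outputs.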

In step S62, after the loss values of the two tasks (drug entity identification and drug relationship extraction) are obtained, a dynamic weight average algorithm is introduced to dynamically calculate the loss proportion of each task, so as to balance the importance of the two tasks during high-level interaction. For each subtask, the ratio of the loss value of the previous epoch to the loss value of the epoch before it is calculated first. This ratio is calculated as:

where n denotes the n-th task, t denotes the iteration number, the numerator is the loss value of the previous epoch, and the denominator is the loss value of the epoch before the previous one;

In step S63, the loss function weight for drug entity identification and drug relationship extraction is calculated as:

where N denotes the total number of tasks in the drug entity identification and drug relationship extraction process (N = 2 in the invention) and exp(·) denotes the exponential operation;
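A form consistent with the definitions in steps S62 and S63, and with the commonly used dynamic weight average scheme, would be the following. The symbols w_n and L_n(·) are assumed names for the loss ratio and the per-epoch loss of task n, T is the temperature-like parameter that the claims describe as the degree of looseness between tasks, and the placement of N as a multiplier is an assumption.

```latex
w_n(t-1) = \frac{L_n(t-1)}{L_n(t-2)},
\qquad
\lambda_n^{(t)} = \frac{N \exp\!\big(w_n(t-1)/T\big)}{\sum_{i=1}^{N} \exp\!\big(w_i(t-1)/T\big)}
```

Under this form the weights always sum to N, and a task whose loss fell more slowly over the last two epochs (larger ratio) receives a larger weight.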

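In step S64, the total loss value L_total(θ) of the high-level interaction mechanism during drug entity identification and drug relationship extraction is calculated as:

L_total(θ) = λ₁^(t) L₁(θ) + λ₂^(t) L₂(θ)

where λ₁^(t) denotes the loss function weight of drug entity identification, L₁(θ) denotes the loss value of drug entity identification, λ₂^(t) denotes the loss function weight of drug relationship extraction, and L₂(θ) denotes the loss value of drug relationship extraction.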
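The dynamic weighting of steps S62 to S64 can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the weights are assumed to be a softmax over the loss ratios scaled by the number of tasks, and the default temperature of 2.0 is an assumed value.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Loss weights for each task from the losses of the last two epochs.

    prev_losses      : per-task loss values of the previous epoch
    prev_prev_losses : per-task loss values of the epoch before that
    temperature      : assumed softening parameter T
    """
    # Step S62: ratio of the previous epoch's loss to the one before it.
    ratios = [p / pp for p, pp in zip(prev_losses, prev_prev_losses)]
    # Step S63: softmax over the ratios, scaled by the number of tasks N.
    exps = [math.exp(r / temperature) for r in ratios]
    n_tasks = len(ratios)
    total = sum(exps)
    return [n_tasks * e / total for e in exps]

def total_loss(weights, losses):
    """Step S64: weighted sum of the per-task losses."""
    return sum(w * l for w, l in zip(weights, losses))

# Toy usage: task 1 is entity identification, task 2 is relationship extraction.
weights = dwa_weights(prev_losses=[1.0, 2.0], prev_prev_losses=[2.0, 2.0])
combined = total_loss(weights, losses=[0.9, 1.8])
```

In this toy run the second task's loss did not decrease between the two epochs, so it receives the larger weight in the next epoch.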

The working principle and process of the invention are as follows. In the drug entity identification module, an effective pre-trained model produces a preliminary text encoding, which is input to a CRF layer to complete drug entity identification. The identified drug entities are then marked with their drug anatomy type information using a custom text marking mechanism, which highlights the position of each drug entity. The marked text is input to the relationship extraction model: BioBert produces the sentence feature vector and the entity feature vectors, and a CNN model produces the word element relationship feature vector; the three are concatenated for drug relationship extraction. Finally, high-level information interaction between the drug entity identification module and the relationship extraction module is realized through a dynamic weighted loss mechanism, achieving joint extraction of drug relationships and drug entities.
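The text marking mechanism can be sketched as follows. This is a hedged illustration only: the function name `mark_entities`, the token-span input format, and the fallback code "X" for drugs missing from the dictionary are assumptions, and the dictionary entries are toy values. The marker layout follows the `<d1e> </d1e>` and `<d2e> </d2e>` format, where e is the first character of the drug's ATC code.

```python
def mark_entities(tokens, entity_spans, atc_dict, unknown_code="X"):
    """Wrap each recognized drug entity in ATC-coded markers.

    tokens       : list of words in the sentence
    entity_spans : (start, end) token index pairs, end exclusive, in sentence order
    atc_dict     : lowercase drug name -> first character of its ATC code
    unknown_code : assumed fallback when a drug is not in the dictionary
    """
    # Map each span start to its ordinal (1 or 2) and its end index.
    starts = {s: (i + 1, e) for i, (s, e) in enumerate(entity_spans)}
    out = []
    pos = 0
    while pos < len(tokens):
        if pos in starts:
            ordinal, end = starts[pos]
            name = " ".join(tokens[pos:end])
            code = atc_dict.get(name.lower(), unknown_code)
            out.append(f"<d{ordinal}{code}>")
            out.extend(tokens[pos:end])
            out.append(f"</d{ordinal}{code}>")
            pos = end
        else:
            out.append(tokens[pos])
            pos += 1
    return " ".join(out)

# Toy usage with an illustrative two-entry dictionary.
atc = {"aspirin": "N", "warfarin": "B"}
sentence = "aspirin may increase the effect of warfarin".split()
print(mark_entities(sentence, [(0, 1), (6, 7)], atc))
# <d1N> aspirin </d1N> may increase the effect of <d2B> warfarin </d2B>
```

Because the markers carry both the entity ordinal and the anatomical group, the relationship extraction model sees the position and the coarse drug type of each entity in the same input sequence.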

The invention has the beneficial effects that:

(1) The invention provides a drug entity and relationship combined extraction method based on a high-level interaction mechanism. The method requires neither hand-crafted features nor repeated data preprocessing steps. It makes full use of the correlation between the two tasks by introducing a high-level dynamic interaction mechanism between drug entity identification and drug relationship extraction, completes the combined extraction of drug entities and their relationships, and addresses the error propagation problem of combined extraction. In addition, a text labeling mechanism based on ATC coding is introduced after the drug entity identification task is completed, which improves the accuracy of drug relationship extraction.

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims

1. A drug entity and relationship combined extraction method based on a high-level interaction mechanism is characterized by comprising the following steps:

S3: identifying the drug entities in the preprocessed drug entity corpus;

2. The drug entity and relationship combined extraction method based on a high-level interaction mechanism according to claim 1, wherein in step S1, the characters of the drug entity corpus are preprocessed as follows: converting the drug relationship description text in the drug entity corpus to lowercase, removing punctuation marks and non-English characters, replacing numbers with num, labeling with the BIO encoding scheme, and setting the maximum sentence length.

3. The drug entity and relationship combined extraction method based on a high-level interaction mechanism according to claim 1, wherein in step S2, the drug anatomy classification dictionary is constructed as follows: crawling the ATC classification codes of all drugs from DrugBank, and constructing the drug anatomy classification dictionary with the drug name as the key and the first character of the ATC classification code as the value.

4. The method for extracting drug entities and relationships according to claim 1, wherein said step S3 comprises the following sub-steps:

S33: inputting the score vector into a linear-chain conditional random field to obtain the final labeling sequence, and extracting the drug entities from the final labeling sequence.

5. The drug entity and relationship combined extraction method based on a high-level interaction mechanism according to claim 4, wherein in step S31, the BioBert model comprises 12 layers of Transformer structures, and the output of each Transformer layer is input to the next Transformer layer;

in step S31, the sentence vector of the preprocessed drug entity corpus is obtained as follows: taking the characters of the drug entity corpus as the input text X = {x₁, ..., x_i, ..., x_m}, inputting the input text into the BioBert model, and concatenating the output vectors of the last four Transformer layers to obtain the sentence vector V = BioBert(X) = {v₁, ..., v_i, ..., v_m}, where x_i denotes the i-th character of the input text sequence, v_i denotes the BioBert output vector corresponding to the i-th character, BioBert(·) denotes the BioBert model operation, and m denotes the length of the input text sequence;

in step S32, the calculation formula of the score vector H is:

H＝WV+b

where Z(h) denotes the normalization factor, i denotes the position of the current word, t_k(·) denotes a local feature function, exp(·) denotes the exponential operation, K denotes the total number of local feature functions defined at the node, m denotes the input text length, s_l(·) denotes a node feature function defined at node y_i, L denotes the total number of node feature functions defined at the node, λ_k denotes the weight coefficient of the local feature function t_k, μ_l denotes the weight coefficient of the node feature function s_l, and h denotes the value of the score vector.

6. The drug entity and relationship combined extraction method based on a high-level interaction mechanism according to claim 1, wherein in step S4, the drug sentence with entity markers inserted is obtained as follows: inserting entity markers on both sides of each drug entity in the format <d1e> </d1e> or <d2e> </d2e> to obtain the drug sentence with entity markers inserted, wherein d denotes a drug entity, 1 or 2 denotes the ordinal number of the drug entity, and e denotes the ATC classification code in the medical anatomy classification dictionary.

7. The method for extracting drug entities and relationships according to claim 1, wherein the step S5 includes the following sub-steps:

S51: inserting a [CLS] marker at the beginning of the drug sentence with entity markers inserted, and inputting the sentence into the BioBert model to obtain the encoded representation of the drug sentence;

8. The method for extracting drug entity and relationship association based on high-level interaction mechanism as claimed in claim 7, wherein in step S51, the expression of the coded representation S of the drug sentence is:

in the step S52, the ith step _d Individual drug entityIs the global feature vector

The expression of (a) is:

z^(a) = σ(w^(a) · s_{g:g+k-1} + b^(a))

where σ(·) denotes a nonlinear activation function, w^(a) ∈ R^(k×d) denotes the a-th convolution kernel, k denotes the size of the context window of the convolution kernel, d denotes the dimension of the input vector, b^(a) denotes the bias unit of the a-th convolution, and s_{g:g+k-1} denotes the vector matrix formed by the g-th to (g+k-1)-th words in the input sequence;

where the values obtained by applying the convolution kernel at each position of the sequence form the set of convolved values;

the calculation formula for performing the maximum pooling operation is:

in step S54, the expression of the final feature vector R is:

in step S55, the category label is calculated as:

P(y) = Softmax(W'R + b')

9. The method for extracting drug entities and relationships according to claim 1, wherein said step S6 comprises the following sub-steps:

10. The drug entity and relationship combined extraction method based on a high-level interaction mechanism according to claim 9, wherein in step S61, the loss value of the drug entity recognition model and the loss value of the drug relationship extraction model are calculated with the same formula:

wherein the formula uses the predicted result of the u-th training sample for class c, λ denotes a controllable penalty factor, o denotes the number of trained parameters, and θ_j denotes a trained parameter;

in step S62, the ratio of the loss value of the previous epoch to the loss value of the epoch before it, in the process of drug entity identification and drug relationship extraction, is calculated as:

where n denotes the n-th task, t denotes the iteration number, the numerator is the loss value of the previous epoch, and the denominator is the loss value of the epoch before the previous one;

in step S63, the loss function weight in the process of drug entity identification and drug relationship extraction is calculated as:

where the ratio term is the ratio, from step S62, of the loss value of the previous epoch to the loss value of the epoch before it, and T denotes the degree of relaxation between tasks;

L_total(θ) = λ₁^(t) L₁(θ) + λ₂^(t) L₂(θ)

where λ₁^(t) denotes the loss function weight of drug entity identification, L₁(θ) denotes the loss value of drug entity identification, λ₂^(t) denotes the loss function weight of drug relationship extraction, and L₂(θ) denotes the loss value of drug relationship extraction.