CN114925678A - Drug entity and relationship combined extraction method based on high-level interaction mechanism
- Publication number
- CN114925678A (application CN202210425479.8A)
- Authority
- CN
- China
- Prior art keywords
- drug
- entity
- relationship
- vector
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention discloses a drug entity and relationship combined extraction method based on a high-level interaction mechanism, which comprises the following steps: S1: acquiring a drug entity corpus and preprocessing its characters; S2: constructing a drug anatomical classification dictionary; S3: identifying the drug entities in the preprocessed drug entity corpus; S4: obtaining drug sentences with inserted entity marks; S5: extracting the drug relationships; S6: performing interactive training of drug entity identification and drug relationship extraction based on a high-level interaction mechanism, and performing combined drug entity and relationship extraction on newly acquired drug input text using the resulting drug entity identification model and drug relationship extraction model. The invention solves the error propagation problem of combined extraction and, after the drug entity identification task is completed, introduces a text labeling mechanism based on ATC codes, thereby improving the accuracy of drug relationship extraction.
Description
Technical Field
The invention belongs to the technical field of drug identification, and particularly relates to a drug entity and relationship combined extraction method based on a high-level interaction mechanism.
Background
Drug-drug interactions (DDIs) refer to the positive or negative effect that one drug has on another when the two are used at the same time. When two or more drugs are used simultaneously, the action of a given drug may be altered by the others, for example changing its safety and efficacy or even causing serious side effects. Fully acquiring and understanding drug interaction information is therefore important for reducing medical costs and avoiding medical accidents.
An existing drug relationship extraction method based on a residual network and an attention mechanism uses a two-layer bidirectional long short-term memory (Bi-LSTM) network to model the input drug relationship sentences as sequences, mines the dependencies between distant words in the drug relationship description, and overcomes the vanishing gradient problem during model training. It introduces residual connections to dynamically construct network models of different depths and structures, integrates an attention mechanism to compute weights for word information, and finally fuses the sequence information with the attention information and feeds the result into a Softmax classifier to extract drug relationships. It has the following disadvantages: 1) it requires complicated manual feature extraction and data preprocessing steps, such as negative instance filtering and entity blinding, and entity blinding inevitably prevents the model from learning information hidden at the word level, for example in cefazolin, cephalexin and cefradine (all three are cephalosporin antibiotics); 2) the predicted drug relationship depends heavily on the drug entities recognized by an external tool, i.e. there is an error propagation problem; 3) the correlation between drug entity identification and drug relationship extraction is ignored.
An existing combined entity and relationship extraction method with a feedback mechanism uses a Bi-LSTM at the bottom layer to extract the contextual features of the text, and the two tasks of named entity recognition and relationship extraction share the feature vectors extracted at this layer. An attention mechanism is introduced to capture key local features and complete decoding; named entity recognition is completed with a conditional random field; the final feature vectors corresponding to the recognized entities are concatenated with the shared-layer feature vectors and fed into a convolutional neural network (CNN) to complete relationship extraction. Loss influence coefficients are preset for the two tasks of entity recognition and relationship extraction, the loss values of the two tasks are calculated, and the relationship extraction loss and the entity recognition loss are simply added using the influence coefficients as weights, in order to feed back and correct the parameters of the shared layer and the entity recognition module. It has the following disadvantages: 1) sharing the bottom-layer Bi-LSTM feature extraction between entity recognition and relationship extraction does not bring a better effect and may even harm performance; 2) the hyperparameters are preset and the losses of the two tasks are added linearly according to the influence coefficients; in actual training, because the loss functions of the two tasks converge at different speeds, back-propagation may be dominated by one task, which degrades the other task and defeats the purpose of loss interaction.
Disclosure of Invention
In order to solve the problems, the invention provides a drug entity and relationship combined extraction method based on a high-level interaction mechanism.
The technical scheme of the invention is as follows: a drug entity and relationship combined extraction method based on a high-level interaction mechanism comprises the following steps:
S1: acquiring a drug entity corpus and preprocessing its characters;
S2: acquiring the ATC classification codes of all drugs and constructing a drug anatomical classification dictionary;
S3: identifying the drug entities in the preprocessed drug entity corpus;
S4: obtaining drug sentences with inserted entity marks according to the drug entities and the drug anatomical classification dictionary;
S5: extracting the drug relationships from the drug sentences with inserted entity marks;
S6: performing interactive training of drug entity identification and drug relationship extraction based on a high-level interaction mechanism, according to the identified drug entities and the extracted drug relationships, to obtain a drug entity identification model and a drug relationship extraction model, and performing combined drug entity and relationship extraction on newly acquired drug input text using the drug entity identification model and the drug relationship extraction model.
Further, in step S1, the characters of the drug entity corpus are preprocessed as follows: the drug relationship description text in the drug entity corpus is converted to lowercase, punctuation marks and non-English characters are removed, numbers are replaced with "num", the text is labeled using the BIO coding scheme, and a maximum sentence length is set.
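The text normalization part of step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess`, the regular expressions, and the sample sentence are assumptions introduced here, and whitespace tokenization is assumed because the source does not name a tokenizer.

```python
import re


def preprocess(text):
    """Lowercase, replace numbers with 'num', drop punctuation and non-English characters."""
    text = text.lower()
    # Replace integers and decimals with the placeholder word first,
    # so that a decimal point is not stripped before the number is matched.
    text = re.sub(r"\d+(?:\.\d+)?", " num ", text)
    # Keep only lowercase English letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Whitespace tokenization (an assumption; the source does not specify one).
    return text.split()


# Hypothetical sample sentence, for illustration only.
tokens = preprocess("Naltrexone 50 mg, taken with Acamprosate!")
print(tokens)
```

One consequence of this choice is that hyphenated names are split into separate tokens, since the hyphen counts as punctuation.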
Further, in step S2, the drug anatomical classification dictionary is constructed as follows: the ATC classification codes of all drugs are crawled from DrugBank, and the dictionary is built with the drug name as the key and the first character of the ATC classification code as the value.
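A minimal sketch of the dictionary construction in step S2, assuming the (drug name, ATC code) pairs have already been collected; the crawling itself is not shown. The function name `build_atc_dictionary` and the lowercase key normalization are assumptions. Keeping the first code seen for a repeated name follows the rule given in the detailed description for drugs with several ATC codes.

```python
def build_atc_dictionary(records):
    """Map drug name -> first character of its ATC code.

    records: iterable of (drug_name, atc_code) pairs, assumed already collected.
    """
    dictionary = {}
    for name, atc_code in records:
        key = name.strip().lower()  # assumption: keys match the lowercased corpus text
        if not atc_code or key in dictionary:
            continue  # skip empty codes; keep the first code listed for a drug
        dictionary[key] = atc_code[0].upper()
    return dictionary


# Small illustrative sample of (name, ATC code) pairs.
records = [("Cefazolin", "J01DB04"), ("Naltrexone", "N07BB04"), ("Acamprosate", "N07BB03")]
atc_dictionary = build_atc_dictionary(records)
print(atc_dictionary)
```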
Further, step S3 includes the following sub-steps:
S31: obtaining the sentence vectors of the preprocessed drug entity corpus using a BioBert model;
S32: inputting the sentence vectors into a linear layer for linear transformation to obtain score vectors;
S33: inputting the score vectors into a linear conditional random field to obtain the final labeling sequence, and extracting the drug entities from the final labeling sequence.
Further, in step S31, the BioBert model comprises 12 Transformer layers, and the output of each Transformer layer is input to the next Transformer layer;
in step S31, the sentence vectors of the preprocessed drug entity corpus are obtained as follows: the characters of the drug entity corpus form the input text $X = \{x_1, x_2, \ldots, x_m\}$, the input text is fed sequentially into the BioBert model, and the output vectors of the last four Transformer layers are concatenated to obtain the sentence vector $V = \mathrm{BioBert}(X) = \{v_1, v_2, \ldots, v_m\}$, where $x_i$ denotes the i-th character of the input text sequence, $v_i$ denotes the BioBert output vector corresponding to the i-th character, $\mathrm{BioBert}(\cdot)$ denotes the BioBert model operation, and $m$ denotes the length of the input text sequence;
in step S32, the score vector $H$ is calculated as:
$$H = WV + b$$
where $W$ denotes the weight of the linear layer and $b$ denotes the bias of the linear layer;
in step S33, the final labeling sequence is determined as follows: a linear conditional random field is used to select, as the final labeling sequence, the label sequence with the maximum conditional probability $P(Y \mid H)$, where the probability that the labeling sequence $Y$ takes the value $y$ given that the score vector $H$ takes the value $h$ is:
$$P(y \mid h) = \frac{1}{Z(h)} \exp\left( \sum_{i=1}^{m} \sum_{k=1}^{K} \lambda_k\, t_k(y_{i-1}, y_i, h, i) + \sum_{i=1}^{m} \sum_{l=1}^{L} \mu_l\, s_l(y_i, h, i) \right)$$
where $Z(h)$ denotes the normalization factor, $i$ denotes the position of the current word, $t_k(\cdot)$ denotes a local feature function, $\exp(\cdot)$ denotes the exponential operation, $K$ denotes the total number of local feature functions defined at the node, $m$ denotes the input text length, $s_l(\cdot)$ denotes a node feature function defined at node $y_i$, $L$ denotes the total number of node feature functions defined at the node, $\lambda_k$ denotes the weight coefficient of the local feature function $t_k$, $\mu_l$ denotes the weight coefficient of the node feature function $s_l$, and $h$ denotes the value of the score vector.
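Selecting the label sequence with the maximum probability under a linear conditional random field is commonly done with Viterbi decoding. The sketch below assumes that approach (the source does not name the decoding algorithm) and collapses the weighted feature functions into two score tables: `emissions`, playing the role of the score vector H, and `transitions`, holding label-to-label scores. Both tables, the label set, and the numeric values are illustrative assumptions.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[i][y]   : score of label y at position i (one row of H per token)
    transitions[p][y] : score of moving from label p to label y
    """
    n_labels = len(labels)
    score = list(emissions[0])
    backpointers = []
    for i in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n_labels):
            # Best previous label for ending in label y at position i.
            best_prev = max(range(n_labels), key=lambda p: score[p] + transitions[p][y])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y] + emissions[i][y])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda y: score[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[y] for y in path]


labels = ["O", "B", "I"]
# Illustrative scores: the O -> I transition is heavily penalized.
transitions = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
emissions = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.2], [0.0, 0.0, 2.0]]
print(viterbi(emissions, transitions, labels))
```

In this toy input the second token scores slightly higher for I than for B on its own, but the transition penalty makes B the better choice once the whole sequence is considered, which is the point of sequence-level decoding.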
Further, in step S4, the drug sentence with inserted entity marks is obtained as follows: entity marks are inserted on both sides of each drug entity in the format <d1e> </d1e> or <d2e> </d2e>, where d denotes a drug entity, 1 or 2 denotes the ordinal of the drug entity, and e denotes the ATC classification code of the entity in the drug anatomical classification dictionary.
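The mark insertion of step S4 might look like the sketch below. The function name `insert_entity_markers`, the span representation (token offsets, end exclusive), the use of an empty code when a drug is missing from the dictionary, and the sample tokens are all assumptions; the source only fixes the mark format.

```python
def insert_entity_markers(tokens, spans, atc_dictionary):
    """Wrap each drug entity in <dNe> ... </dNe> marks.

    tokens: list of words; spans: (start, end) token offsets of the drug
    entities in sentence order, end exclusive.
    """
    tags = []
    for ordinal, (start, end) in enumerate(spans, start=1):
        name = " ".join(tokens[start:end])
        code = atc_dictionary.get(name, "")  # assumption: empty code if the drug is unknown
        tags.append((start, end, f"d{ordinal}{code}"))
    marked = []
    for pos, token in enumerate(tokens):
        for start, _, tag in tags:
            if pos == start:
                marked.append(f"<{tag}>")
        marked.append(token)
        for _, end, tag in tags:
            if pos == end - 1:
                marked.append(f"</{tag}>")
    return marked


# Hypothetical sentence fragment and dictionary entries, for illustration only.
tokens = ["coadministration", "of", "naltrexone", "with", "acamprosate"]
marked = insert_entity_markers(tokens, [(2, 3), (4, 5)], {"naltrexone": "N", "acamprosate": "N"})
print(" ".join(marked))
```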
Further, step S5 includes the following sub-steps:
S51: inserting the mark [CLS] at the head of the drug sentence with inserted entity marks, and inputting the resulting sentence into the BioBert model to obtain the coded representation of the drug sentence;
S52: concatenating the coded representation of each drug entity with the vector representations of its marks to obtain the overall feature vector of the drug entity;
S53: extracting the two drug entities and the words between them from the coded representation of the drug sentence to form a subsequence, inputting the vector representation of the subsequence into a convolutional neural network, and performing a convolution operation followed by a maximum pooling operation to obtain the word element relation feature vector;
S54: sequentially concatenating the coded representation of the drug sentence, the overall feature vectors of the drug entities and the word element relation feature vector to obtain the final feature vector;
S55: compressing the dimensionality of the final feature vector with a feedforward neural network, inputting the result into a softmax layer for classification, and taking the class label with the maximum predicted probability among the classification results as the drug relationship.
Further, in step S51, the coded representation $S$ of the drug sentence is expressed as:
$$S = (s_0, \ldots, s_{\langle d1e \rangle}, s_{d1}, s_{\langle /d1e \rangle}, \ldots, s_{\langle d2e \rangle}, s_{d2}, s_{\langle /d2e \rangle}, \ldots, s_{m+5})$$
where $s_0$ denotes the sentence-level feature vector corresponding to the mark [CLS], $s_{d1}$ denotes the vector representation of the first drug entity, $s_{d2}$ denotes the vector representation of the second drug entity, $s_{\langle d1e \rangle}$ and $s_{\langle /d1e \rangle}$ denote the vector representations, obtained from the BioBert model, of the text marks on the left and right sides of the first drug entity, $s_{\langle d2e \rangle}$ and $s_{\langle /d2e \rangle}$ denote the vector representations, obtained from the BioBert model, of the text marks on the left and right sides of the second drug entity, and $m$ denotes the length of the original input text sequence;
in step S52, the overall feature vector of the $i_d$-th drug entity is given by an expression that is missing from the source text [equation missing in source];
in step S53, the convolution operation is calculated as:
$$z^{(a)} = \sigma\left( w^{(a)} \cdot s_{g:g+k-1} + b^{(a)} \right)$$
where $\sigma(\cdot)$ denotes a non-linear activation function, $w^{(a)} \in \mathbb{R}^{k \times d}$ denotes the a-th convolution kernel, $k$ denotes the size of the context window of the convolution kernel, $d$ is the dimension of the input vector, $b^{(a)}$ denotes the bias unit of the a-th convolution layer, and $s_{g:g+k-1}$ denotes the vector matrix formed by the g-th to the (g+k-1)-th words of the input sequence;
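after the convolution operation, for a feature sequence $S'$ of length $G$, a convolution kernel of dimension $k \times d$ generates a set of $G-k+1$ values, and the latent features of the input sequence after convolution by the a-th convolution kernel are represented as:
$$z^{(a)} = \left[ z^{(a)}_1, z^{(a)}_2, \ldots, z^{(a)}_{G-k+1} \right]$$
where $z^{(a)}_g$ is the value computed by the a-th kernel at window position $g$;
the maximum pooling operation is calculated as:
$$\hat{z}^{(a)} = \max\left\{ z^{(a)}_1, \ldots, z^{(a)}_{G-k+1} \right\}, \qquad z^{*} = \left[ \hat{z}^{(1)}, \hat{z}^{(2)}, \ldots \right]$$
where $z^{*}$ denotes the word element relation feature vector and $\hat{z}^{(a)}$ denotes the latent feature obtained after maximum pooling of the output of the a-th convolution kernel;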
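in step S54, the final feature vector $R$ is the sequential concatenation of the coded representation of the drug sentence, the overall feature vectors of the drug entities and the word element relation feature vector; in step S55, the classification is calculated as:
$$P(y) = \mathrm{Softmax}(W'R + b')$$
$$\hat{y} = \underset{y \in C}{\arg\max}\; P(y)$$
where $\arg\max(\cdot)$ denotes the operation of taking the argument with the maximum value, $\hat{y}$ denotes the predicted class label, $W'$ denotes the weight of the Softmax layer, $b'$ denotes the bias of the Softmax layer, $P(y)$ denotes the probability that the relationship of the candidate drug pair is $y$, and $C$ denotes the set of all relationship categories.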
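The convolution, maximum pooling and Softmax operations described for steps S53 and S55 can be illustrated in plain Python. This is a sketch under stated assumptions: ReLU stands in for the unspecified activation σ, the kernels and inputs are toy values, and the feedforward compression layer of step S55 is omitted.

```python
import math


def conv_feature_map(seq, kernel, bias):
    """Slide a k x d kernel over a sequence of G vectors; returns G - k + 1 values."""
    k = len(kernel)
    values = []
    for g in range(len(seq) - k + 1):
        window = seq[g:g + k]
        total = sum(kernel[r][c] * window[r][c]
                    for r in range(k) for c in range(len(kernel[r])))
        values.append(max(0.0, total + bias))  # ReLU, assumed for sigma
    return values


def max_pool_features(seq, kernels, biases):
    """One max-pooled value per kernel, concatenated into a feature vector."""
    return [max(conv_feature_map(seq, w, b)) for w, b in zip(kernels, biases)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy subsequence: G = 3 token vectors of dimension d = 2, one kernel with k = 2.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]
print(conv_feature_map(seq, kernel, 0.0))
print(max_pool_features(seq, [kernel], [0.0]))
probs = softmax([1.0, 2.0, 3.0])
print(probs.index(max(probs)))  # index of the most probable class
```

Max pooling keeps one value per kernel regardless of how many words lie between the two drug entities, which is why the resulting feature vector has a fixed length.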
Further, step S6 includes the following sub-steps:
S61: calculating the loss value of the drug entity identification model and the loss value of the drug relationship extraction model;
S62: for each of drug entity identification and drug relationship extraction, calculating the ratio of the loss value of the previous epoch to the loss value of the epoch before it;
S63: calculating the loss function weights of drug entity identification and drug relationship extraction from the ratios obtained in step S62;
S64: calculating the total loss value of the high-level interaction mechanism from the loss function weights of drug entity identification and drug relationship extraction, thereby completing the interactive training of drug entities and relationships and obtaining the drug entity identification model and the drug relationship extraction model;
S65: preprocessing the newly acquired drug input text using the method of step S1, inputting the preprocessed text into the drug entity identification model to obtain the drug entities, marking the drug entities with their anatomical classification, and inputting the marked text into the drug relationship extraction model to obtain the drug interaction category.
Further, in step S61, the loss value of the drug entity identification model and the loss value of the drug relationship extraction model are calculated with the same formula [equation missing in source],
where $U$ denotes the number of training samples, $h_\theta(\cdot)$ is applied to the training sample $x_u$, $C$ denotes the number of label categories, a further term denotes the prediction result of the u-th training sample for category $c$, $\lambda$ denotes a controllable penalty factor, $o$ denotes the number of trained parameters, and $\theta_j$ denotes the trained parameters;
in step S62, the ratio $r_n(t-1)$ of the loss value of the previous epoch to the loss value of the epoch before it, for drug entity identification and drug relationship extraction, is calculated as:
$$r_n(t-1) = \frac{L_n(t-1)}{L_n(t-2)}$$
where $n$ denotes the n-th task, $t$ denotes the number of iterations, $L_n(t-1)$ denotes the loss value of the previous epoch, and $L_n(t-2)$ denotes the loss value of the epoch immediately preceding the previous epoch;
in step S63, the loss function weight $w_n(t)$ of drug entity identification and drug relationship extraction is calculated as:
$$w_n(t) = \frac{N \exp\left( r_n(t-1) / T \right)}{\sum_{j=1}^{N} \exp\left( r_j(t-1) / T \right)}$$
where $N$ denotes the total number of tasks in drug entity identification and drug relationship extraction, $\exp(\cdot)$ denotes the exponential operation, $r_n(t-1)$ denotes the loss ratio from step S62, and $T$ denotes the degree of looseness between tasks;
in step S64, the total loss value $L_{total}(\theta)$ of the high-level interaction mechanism for drug entity identification and drug relationship extraction is calculated as:
$$L_{total}(\theta) = w_1(t)\, L_1(\theta) + w_2(t)\, L_2(\theta)$$
where $w_1(t)$ denotes the loss function weight of drug entity identification, $L_1(\theta)$ denotes the loss value of drug entity identification, $w_2(t)$ denotes the loss function weight of drug relationship extraction, and $L_2(\theta)$ denotes the loss value of drug relationship extraction.
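One reading of steps S62 to S64 is sketched below: each task's weight is a softmax over loss ratios, scaled by the number of tasks N, with T acting as a temperature. The function names, the default `temperature=2.0`, and the loss values are assumptions made for illustration; the source does not give a value for T.

```python
import math


def dynamic_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Per-task loss weights from the ratio of the last two epoch losses.

    prev_losses[n]      : loss of task n in the previous epoch
    prev_prev_losses[n] : loss of task n in the epoch before that
    temperature         : looseness between tasks (default is an assumption)
    """
    n_tasks = len(prev_losses)
    ratios = [p / pp for p, pp in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    # Scale by the number of tasks so that the weights sum to N.
    return [n_tasks * e / total for e in exps]


def total_loss(losses, weights):
    """Weighted sum of the task losses."""
    return sum(w * l for w, l in zip(weights, losses))


# Task 1 (entity identification) halved its loss; task 2 (relationship extraction) improved less.
weights = dynamic_weights([0.5, 0.8], [1.0, 1.0])
print(weights)
print(total_loss([0.5, 0.8], weights))
```

Under this reading, the task whose loss is falling more slowly receives the larger weight, so neither task dominates back-propagation for long.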
The beneficial effects of the invention are:
(1) The invention provides a drug entity and relationship combined extraction method based on a high-level interaction mechanism that needs neither hand-crafted features nor repeated data preprocessing steps. It makes full use of the correlation between the two tasks by introducing a high-level dynamic interaction mechanism between drug entity identification and drug relationship extraction, completes the combined extraction of drug entities and their relationships, and solves the error propagation problem of combined extraction. After the drug entity identification task is completed, it also introduces a text labeling mechanism based on ATC codes, which improves the accuracy of drug relationship extraction.
(2) By inserting entity type marks in the drug relationship extraction stage, the invention introduces the position and type information of the drugs, so that the model can learn the association between entity types and relationships without complex data preprocessing steps such as entity blinding. The BioBert + CNN model can fully learn the hidden information of the words between the drug entities. The designed dynamic weight mechanism realizes high-level information interaction between the two tasks and makes full use of their mutual information, achieving combined extraction of drug relationships and drug entities and greatly relieving the error propagation problem in the combined extraction process.
Drawings
FIG. 1 is a flow chart of a method of combined drug entity and relationship extraction;
FIG. 2 is a block diagram of a method for extracting drug entities and relationships in a combined manner.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
Before describing specific embodiments of the present invention, in order to make the solution of the present invention more clear and complete, the definitions of abbreviations and key terms appearing in the present invention will be explained:
ATC classification code: A (anatomical) indicates anatomy, i.e. the organ system on which the drug acts; T (therapeutic) indicates therapeutics, i.e. the purpose of the medication; C (chemical) indicates chemistry, i.e. the chemical classification of the drug. An ATC classification code has 7 characters, of which the 1st, 4th and 5th are letters and the 2nd, 3rd, 6th and 7th are digits. The ATC system classifies drugs at 5 levels.
BioBert model: a pre-trained biomedical language representation model for biomedical text mining.
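The ATC code layout defined above (letters in positions 1, 4 and 5, digits in positions 2, 3, 6 and 7) can be checked with a short pattern before the first character is used as the anatomical group. The helper name `anatomical_group` and the decision to raise `ValueError` on malformed input are assumptions made for this sketch.

```python
import re

# Letter, two digits, two letters, two digits: 7 characters in total.
ATC_PATTERN = re.compile(r"[A-Z]\d{2}[A-Z]{2}\d{2}")


def anatomical_group(atc_code):
    """Return the first character (anatomical main group) of a full 7-character ATC code."""
    code = atc_code.strip().upper()
    if not ATC_PATTERN.fullmatch(code):
        raise ValueError(f"not a 7-character ATC code: {atc_code!r}")
    return code[0]


print(anatomical_group("n07bb03"))
```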
As shown in FIG. 1, the present invention provides a drug entity and relationship combined extraction method based on a high-level interaction mechanism, which comprises the following steps:
S1: acquiring a drug entity corpus and preprocessing its characters;
S2: acquiring the ATC classification codes of all drugs and constructing a drug anatomical classification dictionary;
S3: identifying the drug entities in the preprocessed drug entity corpus;
S4: obtaining drug sentences with inserted entity marks according to the drug entities and the drug anatomical classification dictionary;
S5: extracting the drug relationships from the drug sentences with inserted entity marks;
S6: performing interactive training of drug entity identification and drug relationship extraction based on a high-level interaction mechanism, according to the identified drug entities and the extracted drug relationships, to obtain a drug entity identification model and a drug relationship extraction model, and performing combined drug entity and relationship extraction on newly acquired drug input text using the drug entity identification model and the drug relationship extraction model.
In the embodiment of the present invention, as shown in fig. 2, the method mainly includes the steps of corpus acquisition and preprocessing, ATC classification dictionary construction, text preliminary coding representation, drug entity recognition, drug relationship extraction, cross-feedback joint training model establishment, and the like.
In the embodiment of the present invention, in step S1, the characters of the drug entity corpus are preprocessed as follows: the drug relationship description text in the drug entity corpus is converted to lowercase, punctuation marks and non-English characters are removed, numbers are replaced with "num", the text is labeled using the BIO coding scheme, and a maximum sentence length is set.
In the embodiment of the invention, the DDIExtraction2013 challenge data set is used as the training and testing corpus. All numbers in the drug relationship description text are replaced with the word "num", because the relationships between drugs are mostly extracted from the semantic information of the text and are unrelated to quantities and units, which would otherwise act as noise during training. The data set samples are labeled using the "BIO" coding scheme: each word is assigned a label, each entity has a start label and intermediate labels, and non-entity words are labeled O. A maximum sentence length that the model can handle is set, and a sentence instance shorter than the maximum length is padded with the character "0".
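The BIO labeling and padding described above can be sketched as follows. The bare label names "B", "I" and "O", the span representation, and the truncation of over-long sentences are assumptions; the source specifies only the padding character and the existence of a maximum length.

```python
def bio_labels(tokens, spans):
    """Assign B to the first word of each entity, I to the rest, O elsewhere.

    spans: (start, end) token offsets of the drug entities, end exclusive.
    """
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B"
        for pos in range(start + 1, end):
            labels[pos] = "I"
    return labels


def pad(sequence, max_len, filler="0"):
    """Pad with the filler character up to max_len; longer inputs are truncated (assumption)."""
    clipped = sequence[:max_len]
    return clipped + [filler] * (max_len - len(clipped))


# Hypothetical token list with one two-word entity at positions 1 and 2.
tokens = ["with", "cephalosporin", "antibiotics", "num"]
print(bio_labels(tokens, [(1, 3)]))
print(pad(tokens, 6))
```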
In the embodiment of the present invention, in step S2, the specific method for constructing the drug anatomical classification dictionary is as follows: crawling the ATC classification codes of all drugs from DrugBank, and constructing the drug anatomical classification dictionary with the drug name as the key and the first character of the ATC classification code as the value.
The drug anatomical categories are as follows: 1. alimentary tract and metabolism (A); 2. blood and blood-forming organs (B); 3. cardiovascular system (C); 4. dermatologicals (D); 5. genito-urinary system and sex hormones (G); 6. systemic hormonal preparations (H); 7. anti-infectives for systemic use (J); 8. antineoplastic and immunomodulating agents (L); 9. musculoskeletal system (M); 10. nervous system (N); 11. antiparasitic products (P); 12. respiratory system (R); 13. sensory organs (S); 14. various (V). In the process of constructing the drug anatomical classification dictionary, a small number of drugs correspond to two or more ATC codes; in that case the first code listed on the DrugBank website is used.
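A minimal sketch of the dictionary construction follows. It assumes the crawled data is already available as `(name, list_of_ATC_codes)` records; the function name, the lowercasing of keys and the sample records are illustrative assumptions, not crawled data, and the order of codes in the sample is arbitrary.

```python
def build_atc_dictionary(records):
    """Map lowercased drug name -> first character of its first-listed ATC code."""
    atc_dict = {}
    for name, codes in records:
        if not codes:
            continue  # a drug without any ATC code is skipped in this sketch
        # When several codes exist, only the first listed one is used.
        atc_dict[name.lower()] = codes[0][0].upper()
    return atc_dict


# Hand-written sample records for illustration only.
sample_records = [
    ("Naltrexone", ["N07BB04"]),
    ("Acamprosate", ["N07BB03"]),
    ("Acetylsalicylic acid", ["A01AD05", "B01AC06", "N02BA01"]),
]
atc = build_atc_dictionary(sample_records)
print(atc["naltrexone"], atc["acetylsalicylic acid"])
```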
In the embodiment of the present invention, step S3 includes the following sub-steps:
S31: obtaining sentence vectors in the preprocessed drug entity corpus by using a BioBert model;
S32: inputting the sentence vectors into a linear layer for linear conversion to obtain score vectors;
S33: inputting the score vectors into a linear conditional random field to obtain the final labeling sequence, and extracting the drug entities from the final labeling sequence.
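The last part of S33, reading drug entities out of a final BIO labeling sequence, can be sketched as below. The tag names `B-DRUG` and `I-DRUG` are hypothetical; the text only specifies a beginning label, an inside label and O.

```python
def extract_entities(tokens, tags):
    """Collect token spans labelled B-* / I-* into entity strings."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close the previous entity
                entities.append(" ".join(current))
            current = [token]                # start a new entity
        elif tag.startswith("I-") and current:
            current.append(token)            # continue the open entity
        else:
            if current:
                entities.append(" ".join(current))
            current = []                     # O tag or orphan I- tag
    if current:
        entities.append(" ".join(current))
    return entities


tokens = ["co", "administration", "of", "naltrexone", "with", "acetylsalicylic", "acid"]
tags = ["O", "O", "O", "B-DRUG", "O", "B-DRUG", "I-DRUG"]
print(extract_entities(tokens, tags))
# ['naltrexone', 'acetylsalicylic acid']
```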
In the embodiment of the present invention, in step S31, the BioBert model includes 12 Transformer layers, and the output of each Transformer layer is input to the next Transformer layer;
in step S31, the specific method for obtaining the sentence vectors in the preprocessed drug entity corpus is: taking each character of the drug entity corpus as the input text $X = \{x_1, x_2, \ldots, x_m\}$, inputting the input text into the BioBert model, and concatenating the output vectors of the last four Transformer layers to obtain the sentence vector $V = \mathrm{BioBert}(X) = \{v_1, v_2, \ldots, v_m\}$, where $x_i$ denotes the i-th character of the input text sequence, $v_i$ denotes the BioBert output vector corresponding to the i-th character, $\mathrm{BioBert}(\cdot)$ denotes the BioBert model operation, and m denotes the length of the input text sequence;
in step S32, the calculation formula of the score vector H is:
H=WV+b
wherein W represents the weight of the linear layer and b represents the bias of the linear layer;
in step S33, the specific method for determining the final labeling sequence is: using the linear conditional random field to find the labeling sequence that maximizes the conditional probability P(Y | H), where the probability that the final labeling sequence Y takes the value y, given that the score vector H takes the value h, is:

$$P(y \mid h) = \frac{1}{Z(h)} \exp\left( \sum_{i=1}^{m} \sum_{k=1}^{K} \lambda_k\, t_k(y_{i-1}, y_i, h, i) + \sum_{i=1}^{m} \sum_{l=1}^{L} \mu_l\, s_l(y_i, h, i) \right)$$

where $Z(h)$ denotes the normalization factor, i denotes the position of the current word, $t_k(\cdot)$ denotes a local feature function, $\exp(\cdot)$ denotes the exponential operation, K denotes the total number of local feature functions defined at the node, m denotes the input text length, $s_l(\cdot)$ denotes a node feature function defined at node $y_i$, L denotes the total number of node feature functions defined at the node, $\lambda_k$ denotes the weight coefficient of the local feature function $t_k$, $\mu_l$ denotes the weight coefficient of the node feature function $s_l$, and h denotes the value of the score vector.
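Finding the most probable labeling sequence in a linear-chain conditional random field is commonly done with Viterbi decoding. The sketch below is a simplification under stated assumptions: the node feature scores are taken directly from the rows of the score vector H (`emissions`), and the local feature scores are collapsed into a single tag-to-tag matrix (`transitions`). Both names and the toy numbers are hypothetical and not taken from the text.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[i][y]   : score of tag y at position i (one row of H)
    transitions[p][y] : score of moving from tag p to tag y
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for y in range(n_tags):
            # Best previous tag for reaching tag y at this position.
            best_prev = max(range(n_tags), key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + emit[y])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    best = max(range(n_tags), key=lambda y: score[y])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return path[::-1]


# Tags: 0 = O, 1 = B, 2 = I. The transition O -> I is heavily penalised.
emissions = [[2.0, 0.5, 0.0], [0.2, 1.5, 0.1], [0.3, 0.2, 1.0]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(viterbi(emissions, transitions))  # [0, 1, 2]
```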
In the embodiment of the present invention, in step S4, the specific method for obtaining the drug sentences with inserted entity markers is: inserting entity markers on both sides of each drug entity in the format <d1e> </d1e> or <d2e> </d2e>, where d denotes a drug entity, 1 or 2 denotes the ordinal number of the drug entity, and e denotes its ATC classification code in the drug anatomical classification dictionary. The purpose of drug relationship extraction is to extract the relationship between the two entities; the third character e takes one of the 14 anatomical classification codes of the drug entity.
In order for the model to focus on the drug entities in the text during relationship extraction, ATC classification markers are inserted around only two entities of each text sample. If a text contains more than two drug entities, it is decomposed into multiple drug entity pairs, ATC classification markers are inserted for each pair, and a corresponding example sample is generated; if the two drug entities marked in an example sample have the same name, that sample is deleted. For example, the original text "Co-administration of naltrexone with Acamprosate produced a 25% increase in AUC and a 33% in the Cmax of acamprosate" contains three drug mentions, so three drug entity pairs and three example samples can be constructed as follows:
1) Co-administration of <d1N>naltrexone</d1N> with <d2N>Acamprosate</d2N> produced a 25% increase in AUC and a 33% in the Cmax of acamprosate
2) Co-administration of <d1N>naltrexone</d1N> with Acamprosate produced a 25% increase in AUC and a 33% in the Cmax of <d2N>acamprosate</d2N>
3) Co-administration of naltrexone with <d1N>Acamprosate</d1N> produced a 25% increase in AUC and a 33% in the Cmax of <d2N>acamprosate</d2N>
In example sample 3, the two marked drugs "Acamprosate" and "acamprosate" are actually the same drug; since a drug does not interact with itself, example sample 3 is deleted.
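The pair decomposition and marker insertion can be sketched as follows. The function name, the `(token_index, drug_name)` entity format and the restriction to single-token drug mentions are simplifying assumptions of this sketch.

```python
from itertools import combinations


def build_pair_samples(tokens, entities, atc_dict):
    """Return one marked sentence per pair of distinct drug mentions.

    entities: list of (token_index, drug_name) for single-token mentions.
    """
    samples = []
    for (i, name_a), (j, name_b) in combinations(entities, 2):
        if name_a.lower() == name_b.lower():
            continue  # same drug mentioned twice: the sample is dropped
        marked = list(tokens)
        for idx, name, order in ((i, name_a, 1), (j, name_b, 2)):
            code = atc_dict[name.lower()]
            marked[idx] = f"<d{order}{code}>{tokens[idx]}</d{order}{code}>"
        samples.append(" ".join(marked))
    return samples


sentence = ("Co-administration of naltrexone with Acamprosate produced a 25% "
            "increase in AUC and a 33% in the Cmax of acamprosate")
tokens = sentence.split()
entities = [(2, "naltrexone"), (4, "Acamprosate"), (len(tokens) - 1, "acamprosate")]
for sample in build_pair_samples(tokens, entities, {"naltrexone": "N", "acamprosate": "N"}):
    print(sample)
```

With the sentence from the text, two samples remain because the pair made of the two acamprosate mentions is skipped.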
Inserting the markers preserves the original word features of the entities while injecting their anatomical category information into the context and highlighting the relative positions of the subject and the object. The model can thus learn hidden connections between entities at the word level and gains a better understanding of complex drug names, which helps it learn, in the relationship extraction stage, the connection between the type features of the subject and object and the specific relationship, and improves the effect of drug relationship extraction.
In the embodiment of the present invention, step S5 includes the following sub-steps:
s51: inserting a mark [ CLS ] into the sentence head of the medicine sentence inserted with the entity mark by using a BioBert model, and inputting the inserted mark [ CLS ] into the BioBert model to obtain the code expression of the medicine sentence;
s52: splicing the coded representation of the medicine sentences and the marked vector representation thereof to obtain the overall characteristic vector of the medicine entity;
s53: extracting two drug entities and words between the two drug entities from the coded representation of the drug sentences to form a subsequence, inputting the vector representation of the subsequence into a convolutional neural network, and sequentially performing convolution operation and maximum pooling operation to obtain a characteristic vector of the word element relationship;
s54: sequentially splicing the coding expression of the medicine sentences, the overall characteristic vector of the medicine entity and the characteristic vector of the word element relation to obtain a final characteristic vector;
s55: and compressing the dimensionality of the final feature vector by using a feedforward neural network, inputting the dimensionality into a softmax layer for classification, obtaining a class label with the maximum probability of the final prediction result in classification results, and taking the class label as a medicine relation.
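The convolution and max-pooling of S53 can be sketched in plain Python as below. The choice of tanh as the non-linear activation, the function name and the toy numbers are assumptions; each kernel spans k consecutive token vectors of dimension d and contributes one pooled value to the resulting feature vector.

```python
import math


def conv_max_pool(sub_seq, kernels, biases):
    """Apply each k x d kernel over the sub-sequence, then max-pool.

    sub_seq : G vectors of dimension d (token vectors of the sub-sequence)
    kernels : list of kernels, each a list of k vectors of dimension d
    biases  : one bias per kernel
    """
    pooled = []
    for kernel, bias in zip(kernels, biases):
        k = len(kernel)
        feature_map = []
        for g in range(len(sub_seq) - k + 1):      # G - k + 1 windows
            window = sub_seq[g:g + k]
            total = sum(w * s
                        for w_row, s_row in zip(kernel, window)
                        for w, s in zip(w_row, s_row))
            feature_map.append(math.tanh(total + bias))  # tanh is assumed
        pooled.append(max(feature_map))            # max-pooling per kernel
    return pooled


sub_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # G = 3, d = 2
kernels = [[[1.0, 0.0], [0.0, 1.0]]]               # one kernel with k = 2
print(conv_max_pool(sub_seq, kernels, [0.0]))
```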
In the embodiment of the present invention, in step S51, the expression of the encoded representation S of the drug sentence is:
$$S = (s_0, \ldots, s_{\langle d1e \rangle}, s_{d1}, s_{\langle /d1e \rangle}, \ldots, s_{\langle d2e \rangle}, s_{d2}, s_{\langle /d2e \rangle}, \ldots, s_{m+5})$$

where $s_0$ denotes the sentence-level feature vector corresponding to the mark [CLS], $s_{d1}$ denotes the vector representation of the first drug entity, $s_{d2}$ denotes the vector representation of the second drug entity, $s_{\langle d1e \rangle}$ and $s_{\langle /d1e \rangle}$ denote the vector representations, obtained from the BioBert model, of the text markers on the left and right sides of the first drug entity, $s_{\langle d2e \rangle}$ and $s_{\langle /d2e \rangle}$ denote the corresponding vector representations for the second drug entity, and m denotes the length of the original input text sequence;
in step S53, the calculation formula of the convolution operation is:

$$z_g^{(a)} = \sigma\left(w^{(a)} \cdot s_{g:g+k-1} + b^{(a)}\right)$$

where $\sigma(\cdot)$ denotes a non-linear activation function, $w^{(a)} \in \mathbb{R}^{k \times d}$ denotes the a-th convolution kernel, k denotes the size of the context window of the convolution kernel, d is the dimension of the input vector, $b^{(a)}$ denotes the bias unit of the a-th convolutional layer, and $s_{g:g+k-1}$ denotes the vector matrix formed by the g-th to the (g+k-1)-th words in the input sequence;

after the convolution operation, for a feature sequence $S'$ of length G, a convolution kernel of dimension $k \times d$ generates a set of $G-k+1$ values, and the potential features of the input sequence after convolution by the a-th convolution kernel are represented as:

$$z^{(a)} = \left[z_1^{(a)}, z_2^{(a)}, \ldots, z_{G-k+1}^{(a)}\right]$$

the calculation formula of the max-pooling operation is:

$$\hat{z}^{(a)} = \max\left(z^{(a)}\right), \qquad z^{*} = \left[\hat{z}^{(1)}, \hat{z}^{(2)}, \ldots\right]$$

where $z^{*}$ denotes the word element relationship feature vector, and $\hat{z}^{(a)}$ denotes the potential feature obtained after the a-th convolution kernel is max-pooled;
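in step S54, the final feature vector R is the concatenation, in order, of the encoded representation of the drug sentence, the overall feature vectors of the drug entities and the word element relationship feature vector $z^{*}$; in step S55, the classification is computed as:

$$P(y) = \mathrm{Softmax}(W'R + b'), \qquad \hat{y} = \underset{y \in C}{\arg\max}\; P(y)$$

where $\arg\max(\cdot)$ denotes the operation of taking the maximizing argument, $W'$ denotes the weight of the Softmax layer, $b'$ denotes the bias of the Softmax layer, $P(y)$ denotes the probability that the relationship of the candidate drug pair is y, $\hat{y}$ denotes the predicted relationship category, and C denotes the set of all relationship categories.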
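The softmax classification over the final feature vector can be sketched as follows. The function name, the toy weights and the three-label set are hypothetical; in practice the label set would be the relation categories of the training corpus.

```python
import math


def classify(final_vector, weights, biases, labels):
    """Return (predicted label, probabilities) from Softmax(W'R + b')."""
    logits = [sum(w * r for w, r in zip(row, final_vector)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda c: probs[c])
    return labels[best], probs


labels = ["negative", "effect", "mechanism"]  # hypothetical label set
W = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # toy weights, one row per label
label, probs = classify([1.0, 2.0], W, [0.0, 0.0, 0.0], labels)
print(label)  # mechanism
```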
In the embodiment of the present invention, step S6 includes the following sub-steps:
S61: calculating the loss value of the drug entity identification model and the loss value of the drug relationship extraction model;
S62: calculating, for drug entity identification and for drug relationship extraction, the ratio of the loss value of the previous epoch to the loss value of the epoch before it;
S63: calculating the loss function weights of drug entity identification and drug relationship extraction from the ratios obtained in S62;
S64: calculating the total loss value of the high-level interaction mechanism from the loss function weights of drug entity identification and drug relationship extraction, so as to complete the interactive training of drug entities and relationships and obtain the drug entity identification model and the drug relationship extraction model;
S65: preprocessing the newly acquired drug input text by the method of step S1, inputting the preprocessed text into the drug entity identification model to obtain the drug entities, marking the drug entities with their anatomical classification, and inputting the marked text into the drug relationship extraction model to obtain the drug interaction category.
In the embodiment of the present invention, in step S61, the drug entity identification task is completed first and then assists the drug relationship extraction task; the relationship extraction result is then fed back to the entity identification module, which readjusts its parameters according to both its own loss value and the loss value of the relationship extraction module, and the relationship extraction module is adjusted in the same way. Because both drug entity identification and drug relationship extraction can be treated as multi-class classification problems, their loss values are calculated with the same multi-class logistic regression loss formula:
where U denotes the number of training samples, $h_\theta(\cdot)$ is evaluated on the training sample $x_u$, C denotes the number of label categories, the class-wise term denotes the prediction result of the u-th training sample for class c, $\lambda$ denotes a controllable penalty factor, o denotes the number of trained parameters, and $\theta_j$ denotes a trained parameter;
in step S62, after the loss values of the two tasks, drug entity identification and drug relationship extraction, are obtained, a dynamic weighted average algorithm is introduced to dynamically calculate the loss proportion of the two tasks, in order to balance their importance during the high-level interaction. For each subtask, the ratio of the loss value of the previous epoch to the loss value of the epoch before it is first calculated; the calculation formula of this ratio $w_n^{(t-1)}$ is:

$$w_n^{(t-1)} = \frac{L_n^{(t-1)}}{L_n^{(t-2)}}$$

where n denotes the n-th task, t denotes the number of iterations, $L_n^{(t-1)}$ denotes the loss value of the previous epoch, and $L_n^{(t-2)}$ denotes the loss value of the epoch before the previous epoch;

in step S63, the calculation formula of the loss function weight $\lambda_n^{(t)}$ in the process of drug entity identification and drug relationship extraction is:

$$\lambda_n^{(t)} = \frac{N \exp\left(w_n^{(t-1)} / T\right)}{\sum_{i=1}^{N} \exp\left(w_i^{(t-1)} / T\right)}$$

where N denotes the total number of tasks in the drug entity identification and drug relationship extraction process (N = 2 in the invention), $\exp(\cdot)$ denotes the exponential operation, $w_n^{(t-1)}$ denotes the ratio of the loss value of the previous epoch to that of the epoch before it, and T denotes the degree of relaxation between tasks;

in step S64, the calculation formula of the total loss value $L_{total}(\theta)$ of the high-level interaction mechanism in the process of drug entity identification and drug relationship extraction is:

$$L_{total}(\theta) = \lambda_1^{(t)} L_1(\theta) + \lambda_2^{(t)} L_2(\theta)$$

where $\lambda_1^{(t)}$ denotes the loss function weight of drug entity identification, $L_1(\theta)$ denotes the loss value of drug entity identification, $\lambda_2^{(t)}$ denotes the loss function weight of drug relationship extraction, and $L_2(\theta)$ denotes the loss value of drug relationship extraction.
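The dynamic weighting of the two task losses can be sketched as below. It assumes the commonly used dynamic weight average form, in which the weights are N times a softmax over the per-task loss ratios divided by T, which is consistent with the symbol definitions in the text. The function names and the default `temperature=2.0` are assumptions of this sketch.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Per-task loss weights computed from the losses of the last two epochs."""
    # Ratio of the previous epoch's loss to the loss of the epoch before it.
    ratios = [p / pp for p, pp in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    n_tasks = len(ratios)
    # Weights sum to the number of tasks; slower-improving tasks weigh more.
    return [n_tasks * e / sum(exps) for e in exps]


def total_loss(task_losses, weights):
    """Weighted sum of the entity identification and relation extraction losses."""
    return sum(w * l for w, l in zip(weights, task_losses))


# Task 1 (entity identification) halved its loss; task 2 improved less.
weights = dwa_weights(prev_losses=[0.5, 0.8], prev_prev_losses=[1.0, 1.0])
print(weights, total_loss([0.4, 0.7], weights))
```

In this toy run the second task receives the larger weight because its loss decreased more slowly over the last two epochs.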
The working principle and the process of the invention are as follows: in the invention, an effective pre-training model is adopted in a drug entity recognition module to obtain a preliminary text code which is input to a CRF layer to complete the drug entity recognition work; marking the identified drug entities with the drug anatomy type information by using a self-defined text marking mechanism so as to highlight the position information of the drug entities; inputting the marked text into a relation extraction model, acquiring a sentence characteristic vector and an entity characteristic vector by using BioBert, acquiring a word element relation characteristic vector by using a CNN model, splicing the sentence characteristic vector, the entity characteristic vector and the word element relation characteristic vector for medicine relation extraction, and finally realizing high-level information interaction of a medicine entity identification module and a relation extraction module based on a dynamic weighting loss mechanism to realize combined extraction of medicine relation and medicine entity.
The invention has the beneficial effects that:
(1) the invention provides a drug entity and relationship combined extraction method based on a high-level interaction mechanism, which does not need to add artificial features and repeated data preprocessing steps, fully utilizes mutual correlation information between two tasks, introduces a high-level dynamic interaction mechanism of drug entity identification and drug relationship extraction, completes the combined extraction of the drug entities and the mutual relationships, solves the error propagation problem of the combined extraction, and simultaneously introduces a text labeling mechanism based on ATC coding after completing the drug entity identification task, thereby improving the accuracy of the drug relationship extraction.
(2) The invention can introduce the position and type information of the drug by inserting the entity type mark in the drug relationship extraction stage, so that the model can learn the relationship between the entity type and the relationship, and does not need the complex data preprocessing steps such as entity blinding, etc., the hidden information of the words between the drug entities can be fully learned by the BioBert + CNN model, the high-level information interaction of the two tasks is realized by the designed dynamic weight mechanism, the mutual information between the two tasks is fully utilized, the combined extraction of the drug relationship and the drug entities is realized, and the error propagation problem in the combined extraction process is greatly relieved.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
Claims (10)
1. A drug entity and relationship combined extraction method based on a high-level interaction mechanism is characterized by comprising the following steps:
S1: acquiring a drug entity corpus and preprocessing the characters of the drug entity corpus;
S2: acquiring the ATC classification codes of all drugs and constructing a drug anatomical classification dictionary;
S3: identifying the drug entities in the preprocessed drug entity corpus;
S4: obtaining drug sentences with inserted entity markers according to the drug entities and the drug anatomical classification dictionary;
S5: extracting the drug relationship from the drug sentences with inserted entity markers;
S6: performing interactive training of the drug entity and relationship tasks based on a high-level interaction mechanism, according to the identified drug entities and the extracted drug relationships, to obtain a drug entity identification model and a drug relationship extraction model, and using the two models to jointly extract drug entities and relationships from newly acquired drug input text.
2. The method for extracting drug entity and relationship association based on high-level interaction mechanism as claimed in claim 1, wherein in step S1, the specific method for preprocessing the characters in the drug entity corpus is as follows: converting the medicine relation description text in the medicine entity corpus into lowercase, removing punctuation marks and non-English characters, replacing numbers with num, labeling by adopting a BIO coding mode, and setting the maximum sentence length.
3. The method for extracting drug entities and relationships jointly based on high-level interaction mechanism according to claim 1, wherein in step S2, the specific method for constructing the drug anatomical classification dictionary is as follows: crawling the ATC classification codes of all drugs from DrugBank, and constructing the drug anatomical classification dictionary with the drug name as the key and the first character of the ATC classification code as the value.
4. The method for extracting drug entities and relationships according to claim 1, wherein said step S3 comprises the following sub-steps:
S31: obtaining sentence vectors in the preprocessed drug entity corpus by using a BioBert model;
S32: inputting the sentence vectors into a linear layer for linear conversion to obtain score vectors;
S33: inputting the score vectors into a linear conditional random field to obtain the final labeling sequence, and extracting the drug entities from the final labeling sequence.
5. The method for extracting drug entities and relationships based on high-level interaction mechanism as claimed in claim 4, wherein in step S31, the BioBert model includes 12 Transformer layers, and the output of each Transformer layer is input to the next Transformer layer;
in step S31, the specific method for obtaining the sentence vectors in the preprocessed drug entity corpus is: taking each character of the drug entity corpus as the input text $X = \{x_1, \ldots, x_i, \ldots, x_m\}$, inputting the input text into the BioBert model, and concatenating the output vectors of the last four Transformer layers to obtain the sentence vector $V = \mathrm{BioBert}(X) = \{v_1, \ldots, v_i, \ldots, v_m\}$, where $x_i$ denotes the i-th character of the input text sequence, $v_i$ denotes the BioBert output vector corresponding to the i-th character, $\mathrm{BioBert}(\cdot)$ denotes the BioBert model operation, and m denotes the length of the input text sequence;
in step S32, the calculation formula of the score vector H is:
H=WV+b
wherein W represents the weight of the linear layer and b represents the bias of the linear layer;
in step S33, the specific method for determining the final labeling sequence is: using the linear conditional random field to find the labeling sequence that maximizes the conditional probability P(Y | H), where the probability that the final labeling sequence Y takes the value y, given that the score vector H takes the value h, is:

$$P(y \mid h) = \frac{1}{Z(h)} \exp\left( \sum_{i=1}^{m} \sum_{k=1}^{K} \lambda_k\, t_k(y_{i-1}, y_i, h, i) + \sum_{i=1}^{m} \sum_{l=1}^{L} \mu_l\, s_l(y_i, h, i) \right)$$

where $Z(h)$ denotes the normalization factor, i denotes the position of the current word, $t_k(\cdot)$ denotes a local feature function, $\exp(\cdot)$ denotes the exponential operation, K denotes the total number of local feature functions defined at the node, m denotes the input text length, $s_l(\cdot)$ denotes a node feature function defined at node $y_i$, L denotes the total number of node feature functions defined at the node, $\lambda_k$ denotes the weight coefficient of the local feature function $t_k$, $\mu_l$ denotes the weight coefficient of the node feature function $s_l$, and h denotes the value of the score vector.
6. The method for extracting drug entities and relationships based on high-level interaction mechanism as claimed in claim 1, wherein in step S4, the specific method for obtaining the drug sentences with inserted entity markers is: inserting entity markers on both sides of each drug entity in the format <d1e> </d1e> or <d2e> </d2e>, where d denotes a drug entity, 1 or 2 denotes the ordinal number of the drug entity, and e denotes its ATC classification code in the drug anatomical classification dictionary.
7. The method for extracting drug entities and relationships according to claim 1, wherein the step S5 includes the following sub-steps:
s51: inserting a mark [ CLS ] into the beginning of the drug sentence inserted with the entity mark by using a BioBert model, and inputting the marking [ CLS ] into the BioBert model to obtain the code expression of the drug sentence;
s52: splicing the coded representation of the drug sentences and the marked vector representation thereof to obtain the overall characteristic vector of the drug entity;
s53: extracting two drug entities and words between the two drug entities from the coded representation of the drug sentences to form a subsequence, inputting the vector representation of the subsequence into a convolutional neural network, and sequentially performing convolutional operation and maximum pooling operation to obtain a word element relation characteristic vector;
s54: sequentially splicing the coding expression of the medicine sentences, the overall characteristic vector of the medicine entity and the characteristic vector of the word element relation to obtain a final characteristic vector;
s55: and compressing the dimensionality of the final feature vector by using a feedforward neural network, inputting the dimensionality into a softmax layer for classification, obtaining a class label with the maximum probability of the final prediction result in classification results, and taking the class label as a medicine relation.
8. The method for extracting drug entity and relationship association based on high-level interaction mechanism as claimed in claim 7, wherein in step S51, the expression of the coded representation S of the drug sentence is:
$$S = (s_0, \ldots, s_{\langle d1e \rangle}, s_{d1}, s_{\langle /d1e \rangle}, \ldots, s_{\langle d2e \rangle}, s_{d2}, s_{\langle /d2e \rangle}, \ldots, s_{m+5})$$

where $s_0$ denotes the sentence-level feature vector corresponding to the mark [CLS], $s_{d1}$ denotes the vector representation of the first drug entity, $s_{d2}$ denotes the vector representation of the second drug entity, $s_{\langle d1e \rangle}$ and $s_{\langle /d1e \rangle}$ denote the vector representations, obtained from the BioBert model, of the text markers on the left and right sides of the first drug entity, $s_{\langle d2e \rangle}$ and $s_{\langle /d2e \rangle}$ denote the corresponding vector representations for the second drug entity, and m denotes the length of the original input text sequence;
in step S52, the expression of the overall feature vector of the $i_d$-th drug entity is:
in step S53, the calculation formula of the convolution operation is:

$$z_g^{(a)} = \sigma\left(w^{(a)} \cdot s_{g:g+k-1} + b^{(a)}\right)$$

where $\sigma(\cdot)$ denotes a non-linear activation function, $w^{(a)} \in \mathbb{R}^{k \times d}$ denotes the a-th convolution kernel, k denotes the size of the context window of the convolution kernel, d is the dimension of the input vector, $b^{(a)}$ denotes the bias unit of the a-th convolutional layer, and $s_{g:g+k-1}$ denotes the vector matrix formed by the g-th to the (g+k-1)-th words in the input sequence;

after the convolution operation, for a feature sequence $S'$ of length G, a convolution kernel of dimension $k \times d$ generates a set of $G-k+1$ values, and the potential features of the input sequence after convolution by the a-th convolution kernel are represented as:

$$z^{(a)} = \left[z_1^{(a)}, z_2^{(a)}, \ldots, z_{G-k+1}^{(a)}\right]$$

the calculation formula of the max-pooling operation is:

$$\hat{z}^{(a)} = \max\left(z^{(a)}\right), \qquad z^{*} = \left[\hat{z}^{(1)}, \hat{z}^{(2)}, \ldots\right]$$

where $z^{*}$ denotes the word element relationship feature vector, and $\hat{z}^{(a)}$ denotes the potential feature obtained after the a-th convolution kernel is max-pooled;
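in step S54, the final feature vector R is the concatenation, in order, of the encoded representation of the drug sentence, the overall feature vectors of the drug entities and the word element relationship feature vector $z^{*}$; in step S55, the classification is computed as:

$$P(y) = \mathrm{Softmax}(W'R + b'), \qquad \hat{y} = \underset{y \in C}{\arg\max}\; P(y)$$

where $\arg\max(\cdot)$ denotes the operation of taking the maximizing argument, $W'$ denotes the weight of the Softmax layer, $b'$ denotes the bias of the Softmax layer, $P(y)$ denotes the probability that the relationship of the candidate drug pair is y, $\hat{y}$ denotes the predicted relationship category, and C denotes the set of all relationship categories.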
9. The method for extracting drug entities and relationships according to claim 1, wherein said step S6 comprises the following sub-steps:
S61: calculating the loss value of the drug entity identification model and the loss value of the drug relationship extraction model;
S62: calculating, for drug entity identification and for drug relationship extraction, the ratio of the loss value of the previous epoch to the loss value of the epoch before it;
S63: calculating the loss function weights of drug entity identification and drug relationship extraction from the ratios obtained in S62;
S64: calculating the total loss value of the high-level interaction mechanism from the loss function weights of drug entity identification and drug relationship extraction, so as to complete the interactive training of drug entities and relationships and obtain the drug entity identification model and the drug relationship extraction model;
S65: preprocessing the newly acquired drug input text by the method of step S1, inputting the preprocessed text into the drug entity identification model to obtain the drug entities, marking the drug entities with their anatomical classification, and inputting the marked text into the drug relationship extraction model to obtain the drug interaction category.
10. The method for extracting drug entity and relationship combination based on high-level interaction mechanism as claimed in claim 9, wherein in step S61, the loss value of the drug entity identification model and the loss value of the drug relationship extraction model are calculated with the same formula:
where U denotes the number of training samples, $h_\theta(\cdot)$ is evaluated on the training sample $x_u$, C denotes the number of label categories, the class-wise term denotes the prediction result of the u-th training sample for class c, $\lambda$ denotes a controllable penalty factor, o denotes the number of trained parameters, and $\theta_j$ denotes a trained parameter;
in step S62, the calculation formula of the ratio $w_n^{(t-1)}$ of the loss value of the previous epoch to the loss value of the epoch before it, in the process of drug entity identification and drug relationship extraction, is:

$$w_n^{(t-1)} = \frac{L_n^{(t-1)}}{L_n^{(t-2)}}$$

where n denotes the n-th task, t denotes the number of iterations, $L_n^{(t-1)}$ denotes the loss value of the previous epoch, and $L_n^{(t-2)}$ denotes the loss value of the epoch before the previous epoch;

in step S63, the calculation formula of the loss function weight $\lambda_n^{(t)}$ in the process of drug entity identification and drug relationship extraction is:

$$\lambda_n^{(t)} = \frac{N \exp\left(w_n^{(t-1)} / T\right)}{\sum_{i=1}^{N} \exp\left(w_i^{(t-1)} / T\right)}$$

where N denotes the total number of tasks in the drug entity identification and drug relationship extraction process, $\exp(\cdot)$ denotes the exponential operation, $w_n^{(t-1)}$ denotes the ratio of the loss value of the previous epoch to that of the epoch before it, and T denotes the degree of relaxation between tasks;

in step S64, the calculation formula of the total loss value $L_{total}(\theta)$ of the high-level interaction mechanism in the process of drug entity identification and drug relationship extraction is:

$$L_{total}(\theta) = \lambda_1^{(t)} L_1(\theta) + \lambda_2^{(t)} L_2(\theta)$$

where $\lambda_1^{(t)}$ denotes the loss function weight of drug entity identification, $L_1(\theta)$ denotes the loss value of drug entity identification, $\lambda_2^{(t)}$ denotes the loss function weight of drug relationship extraction, and $L_2(\theta)$ denotes the loss value of drug relationship extraction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210425479.8A CN114925678B (en) | 2022-04-21 | 2022-04-21 | Pharmaceutical entity and relationship joint extraction method based on high-level interaction mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114925678A (en) | 2022-08-19 |
CN114925678B (en) | 2023-05-26 |
Family
ID=82806745
Family Applications (1)
Application Number | Status | Publication Number | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202210425479.8A | Active | CN114925678B (en) | 2022-04-21 | 2022-04-21 | Pharmaceutical entity and relationship joint extraction method based on high-level interaction mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114925678B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106919794A (en) * | 2017-02-24 | 2017-07-04 | 黑龙江特士信息技术有限公司 | Towards the drug class entity recognition method and device of multi-data source |
CN111078889A (en) * | 2019-12-20 | 2020-04-28 | 大连理工大学 | Method for extracting relationships among medicines based on attention of various entities and improved pre-training language model |
CN111125378A (en) * | 2019-12-25 | 2020-05-08 | 同方知网(北京)技术有限公司 | Closed-loop entity extraction method based on automatic sample labeling |
CN111368528A (en) * | 2020-03-09 | 2020-07-03 | 西南交通大学 | Entity relation joint extraction method for medical texts |
WO2020211275A1 (en) * | 2019-04-18 | 2020-10-22 | 五邑大学 | Pre-trained model and fine-tuning technology-based medical text relationship extraction method |
CN112256825A (en) * | 2020-10-19 | 2021-01-22 | 平安科技(深圳)有限公司 | Medical field multi-turn dialogue intelligent question-answering method and device and computer equipment |
CN113553440A (en) * | 2021-06-25 | 2021-10-26 | 武汉理工大学 | Medical entity relationship extraction method based on hierarchical reasoning |
Non-Patent Citations (2)
Title |
---|
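Zhu Jiajing: "Research on Key Issues of Adverse Drug Reactions Based on Machine Learning" * |
Jin Han: "Research on a Medical Named Entity Recognition Method Based on XLNet-CRF" * |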
Also Published As
Publication number | Publication date |
---|---|
CN114925678B (en) | 2023-05-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||