WO2023173823A1 - Method for predicting interaction relationship of drug pair, and device and medium

Method for predicting interaction relationship of drug pair, and device and medium

Info

Publication number
WO2023173823A1
WO2023173823A1 PCT/CN2022/137046 CN2022137046W WO2023173823A1 WO 2023173823 A1 WO2023173823 A1 WO 2023173823A1 CN 2022137046 W CN2022137046 W CN 2022137046W WO 2023173823 A1 WO2023173823 A1 WO 2023173823A1
Authority
WO
WIPO (PCT)
Prior art keywords
drug
text
entity representation
drugs
pair
Prior art date
Application number
PCT/CN2022/137046
Other languages
French (fr)
Chinese (zh)
Inventor
唐继军
窦明亮
郗文辉
郭菲
Original Assignee
中国科学院深圳理工大学(筹)
Priority date
Filing date
Publication date
Application filed by 中国科学院深圳理工大学(筹)
Publication of WO2023173823A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/40 ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/832 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/10 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients

Definitions

  • This application belongs to the field of biomedical technology, and in particular relates to a method, equipment and medium for predicting drug-drug interaction relationships.
  • DDI: Drug-Drug Interaction
  • the above-mentioned DDI prediction methods all process drug pairs at the sentence level, and in practice, the interaction relationship between a drug pair is often determined by multiple sentences. Therefore, in the existing technology, the accuracy of predicting the interaction relationship of each drug pair in the document is low.
  • the embodiments of the present application provide a method, device and storage medium for predicting the interaction relationship of drug pairs, which can solve the problem of low accuracy in predicting the interaction relationship of each drug pair in the document.
  • embodiments of the present application provide a method for predicting drug-drug interaction relationships, which method includes:
  • the target text includes multiple drugs, and each drug appears at least once in the target text;
  • the comprehensive entity representation corresponding to each drug is determined separately.
  • the comprehensive entity representation is used to describe the semantic information of the corresponding drug at various positions in the target text;
  • embodiments of the present application provide a device for predicting drug pair interactions, which device includes:
  • the acquisition module is used to obtain the target text;
  • the target text includes multiple drugs, and each drug appears at least once in the target text;
  • the comprehensive entity representation determination module is used to determine the comprehensive entity representation corresponding to each drug respectively.
  • the comprehensive entity representation is used to describe the semantic information of the corresponding drug at various positions in the target text;
  • the fusion entity representation determination module is used for determining the fusion entity representation of the drug pair based on the comprehensive entity representation of two drugs in the drug pair for any drug pair among multiple drugs;
  • the prediction module is used to predict the interaction relationship of the drug pair based on the fusion entity representation of the drug pair.
  • embodiments of the present application provide a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • when the processor executes the computer program, the method of the first aspect is implemented.
  • embodiments of the present application provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • when the computer program is executed by a processor, the method of the first aspect is implemented.
  • embodiments of the present application provide a computer program product, which when the computer program product is run on a terminal device, causes the terminal device to execute the method of the first aspect.
  • the beneficial effects of the embodiments of the present application are: by obtaining the comprehensive entity representation corresponding to each drug in the target text, the terminal device can use a single comprehensive entity representation to describe the semantic information of the corresponding drug at each position in the target text. Afterwards, for each target drug pair, the terminal device can fuse the comprehensive entity representations of the two drugs to generate a fused entity representation of the drug pair. Furthermore, the terminal device can accurately predict the interaction relationship of the drug pair based on the fused entity representation.
  • Figure 1 is a schematic diagram of an implementation method for generating sentence-level instances provided by an embodiment of the present application
  • Figure 2 is an implementation flow chart of a method for predicting drug-drug interaction relationships provided by an embodiment of the present application
  • Figure 3 is a schematic diagram of an implementation method for generating a comprehensive entity representation in a method for predicting drug-drug interaction relationships provided by an embodiment of the present application;
  • Figure 4 is a schematic structural diagram of a model that generates a comprehensive entity representation of a drug provided by an embodiment of the present application
  • Figure 5 is a schematic flow chart of predicting drug pair interaction relationships using a drug relationship prediction model provided by an embodiment of the present application
  • Figure 6 is a schematic diagram of an implementation method for generating a training set in a method for predicting drug-drug interaction relationships provided by an embodiment of the present application;
  • Figure 7 is a schematic diagram of an implementation method for determining relationship labels in a method for predicting drug-drug interaction relationships provided by another embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a device for predicting drug pair interaction relationships provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the method for predicting drug interaction relationships provided by the embodiments of this application can be applied to terminal devices such as tablet computers, wearable devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, etc.
  • The embodiments of this application do not impose any restrictions on the specific type of terminal device.
  • each XML file consists of one or more sentences.
  • Each sentence contains several drugs and the relationship tags between each two drugs.
  • Figure 1 is a schematic diagram of an implementation method for generating sentence-level instances provided by an embodiment of the present application. Specifically, assuming there is a sentence (Sentence) containing three non-repeating drugs (Drug1, Drug2, Drug3), the terminal device needs to convert it into three instances (Instance_A, Instance_B, Instance_C), and each instance focuses on only one drug pair (e1 and e2).
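  • The conversion of one sentence into per-pair instances can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `sentence_to_instances`, the dictionary layout and the example sentence are all assumptions made here.

```python
from itertools import combinations


def sentence_to_instances(sentence, drugs):
    """Return one instance per unordered drug pair found in the sentence."""
    instances = []
    for e1, e2 in combinations(drugs, 2):
        # Each instance keeps the full sentence but focuses on one pair (e1, e2).
        instances.append({"sentence": sentence, "e1": e1, "e2": e2})
    return instances


# Hypothetical sentence with three non-repeating drugs.
sentence = "Drug1 should not be combined with Drug2 or Drug3."
instances = sentence_to_instances(sentence, ["Drug1", "Drug2", "Drug3"])
for instance in instances:
    print(instance["e1"], instance["e2"])
```

  • With n drugs in a sentence this produces n(n-1)/2 instances, which is why three drugs yield three instances and why sentence-level training data grows quickly.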
  • the same drug in a text is described using one comprehensive entity representation, and a drug pair corresponds to one fused entity representation. This differs from sentence-level training of a drug relationship prediction model, in which the text is divided into multiple sentences and the drugs and drug pairs in each sentence are given separate vector representations. In this way, the terminal device can greatly reduce the memory space required for model training.
  • because the sentence-level drug relationship prediction model focuses on only two particular drugs in a sentence during training, its prediction is based only on the semantic information between those two drugs in that sentence.
  • the relationship between a drug pair is often determined by multiple sentences. Therefore, existing methods for predicting drug-pair interaction relationships are still unable to mine more complete semantic information about interactions between drug pairs from biomedical texts.
  • the terminal device processes the target text through the following steps S201-S204, so as to generate semantic information that describes the interaction between two drugs in the target text and thereby improve the accuracy of predicting drug interaction relationships.
  • Figure 2 shows an implementation flow chart of a method for predicting drug pair interaction relationships provided by an embodiment of the present application. The method includes the following steps:
  • the terminal device acquires the target text; the target text includes multiple drugs, and each drug appears at least once in the target text.
  • the above-mentioned target text is usually a text related to the pharmaceutical field, which includes but is not limited to texts in the form of journals, papers, etc.
  • the target text can be text in Chinese, English or other languages, and there is no limit to this.
  • in order to predict drug pair interactions, the target text must include at least two drugs. Otherwise, no interaction relationship between two drugs can be predicted from the target text.
  • the terminal device may obtain the target text through the following steps, as detailed below:
  • the terminal device obtains the initial text, which includes multiple drugs, and each drug appears at least once in the initial text; if there is a drug name using a drug-sharing suffix in the initial text, the drug name is expanded to obtain the target text.
  • the above-mentioned initial text is an unprocessed text, which can be a text crawled by the terminal device from the network based on the name of the drug, or can be a text specified in advance and stored in the terminal device.
  • the path for the terminal device to obtain the target text is not limited.
  • the above-mentioned entity name is the drug name of the drug.
  • a shared drug suffix arises as follows: when the drug names of multiple drugs have part of their names in common, the initial text may abbreviate them so that the drugs share a single suffix.
  • the shared suffix is the part of the name that the drugs have in common.
  • for example, the entity names of two drugs can be: 1) diagnostic monoclonal antibodies; 2) therapeutic monoclonal antibodies.
  • the initial text containing the two drugs can be "...when treated with other diagnostic or therapeutic monoclonal antibodies." That is to say, the two drug names share the suffix "monoclonal antibodies". In this case, the terminal device needs to expand this sentence in the initial text to obtain the target text, that is, change it to: "...when treated with other diagnostic monoclonal antibodies or therapeutic monoclonal antibodies".
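  • The expansion of a shared suffix can be sketched with a regular expression. This is only an illustration under stated assumptions: the function name `expand_shared_suffix` is hypothetical, the list of drug names is assumed to be known in advance, and the distinguishing part of each name is assumed to be its first word.

```python
import re


def expand_shared_suffix(text, drug_names):
    """Rewrite 'A or B suffix' as 'A suffix or B suffix' for known drug names."""
    for first in drug_names:
        for second in drug_names:
            # Assumption: the first word distinguishes the drug, the rest is the suffix.
            head, _, suffix = first.partition(" ")
            if first == second or not suffix or not second.endswith(" " + suffix):
                continue
            pattern = re.compile(
                r"\b" + re.escape(head) + r"\s+(and|or)\s+" + re.escape(second) + r"\b"
            )
            # Re-insert the full first name; keep the original conjunction.
            text = pattern.sub(lambda m: f"{first} {m.group(1)} {second}", text)
    return text


names = ["diagnostic monoclonal antibodies", "therapeutic monoclonal antibodies"]
initial = "when treated with other diagnostic or therapeutic monoclonal antibodies."
target = expand_shared_suffix(initial, names)
print(target)
```

  • A production version would need to handle lists of three or more names and suffixes that do not start at the second word; those cases are outside this sketch.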
  • the terminal device determines the comprehensive entity representation corresponding to each drug.
  • the comprehensive entity representation is used to describe the semantic information of the corresponding drug at each position in the target text.
  • the above-mentioned comprehensive entity representation is used to comprehensively describe the semantic information of the corresponding drug at each position in the target text. Specifically, when a drug appears at multiple positions in the target text, if only the sentence containing the drug at one position is processed to obtain the semantic information representing the drug in that sentence, then subsequent predictions of drug-drug interaction relationships based on that semantic information may be inaccurate. That is to say, semantic information extracted from a single sentence containing the drug cannot stand in for the semantic information corresponding to the drug at other positions in the target text.
  • the terminal device can therefore use the comprehensive entity representation corresponding to the drug in subsequent processing, thereby improving the prediction accuracy of the drug-drug interaction relationship.
  • the terminal device may determine the comprehensive entity representation through the following sub-steps S301-S303, as detailed below:
  • the terminal device determines multiple positions of the drug in the target text.
  • the terminal device generates a text sequence corresponding to each position according to multiple positions.
  • the terminal device performs vector processing on the text sequence corresponding to each position to obtain a comprehensive entity representation.
  • the above-mentioned position is the position information of the drug in the target text.
  • the text sequence is a sequence generated based on position.
  • the terminal device can determine the location information of the drug in the target text based on the entity name of the drug.
  • the terminal device can use "[" and "]" to identify the starting position and ending position of each appearance of the drug, so as to mark the location of the drug.
  • sequence information corresponding to each location can also be assigned according to the order in which the drug appears in the target text.
  • the terminal device can represent the text sequence of the drug in the following manner:
  • $X = \{x_1, x_2, \ldots, x_n\}$
  • X represents the entire text sequence of the target text
  • $x_n$ represents the n-th character in the target text
  • n also represents the total number of characters in the target text.
  • $P1 = \{x_i, x_{i+1}, \ldots, x_{i+k-1}\}$
  • $P2 = \{x_j, x_{j+1}, \ldots, x_{j+k-1}\}$
  • 1 and 2 in P1 and P2 respectively represent the order in which the drug appears in the target text.
  • $x_i$ represents the i-th character in the target text, where the drug first appears; since the drug name consists of k characters, $x_{i+k-1}$ is the end position of that first occurrence in the text. It can be understood that if the drug appears multiple times, there will also be multiple text sequences.
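  • Locating every occurrence of a drug and marking it with "[" and "]" can be sketched as below. The helper names `find_positions` and `mark_mentions` and the example text are assumptions for illustration; note that the code uses 0-based indices, whereas the notation above counts characters from 1.

```python
def find_positions(text, drug_name):
    """Return 0-based (start, end) character spans of every mention, end inclusive."""
    spans = []
    k = len(drug_name)
    start = text.find(drug_name)
    while start != -1:
        spans.append((start, start + k - 1))  # mirrors x_i ... x_{i+k-1}
        start = text.find(drug_name, start + k)
    return spans


def mark_mentions(text, spans):
    """Wrap each span in '[' and ']' to mark where the drug appears."""
    marked = text
    # Work from right to left so earlier offsets stay valid after insertion.
    for start, end in sorted(spans, reverse=True):
        marked = marked[:start] + "[" + marked[start:end + 1] + "]" + marked[end + 1:]
    return marked


text = "aspirin may increase warfarin; avoid aspirin."
spans = find_positions(text, "aspirin")
sequences = [text[s:e + 1] for s, e in spans]  # P1, P2 in order of appearance
marked = mark_mentions(text, spans)
print(spans, marked)
```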
  • performing vector processing on the text sequence means converting the text sequence into a vector form that the terminal device can process.
  • the terminal device can perform vector representation on the text sequence corresponding to each position to obtain multiple text vectors, where each text vector describes the semantic information of the drug at the corresponding position; the text vectors are then integrated to generate a comprehensive entity representation of the drug.
  • the terminal device performs a vector representation of each text sequence, which can be processed and generated through the model.
  • the text sequence is vector processed through BioBERT (a pre-trained biomedical language representation model) to obtain a comprehensive entity representation.
  • DrugP1 represents the text vector corresponding to the P1 text sequence
  • vp1-k represents the vector corresponding to the k-th character in the P1 text sequence.
  • each text vector can only describe the semantic information of the drug at the corresponding position in the target text. Based on this, in order to obtain a comprehensive entity representation of the drug, the terminal device also needs to integrate multiple text vectors of the drug. Specifically, the terminal device can integrate text vectors through the following formulas 1 and 2:
  • Drug e1 represents the integrated vector obtained by integrating the first text vector; that is, for DrugP1, the vectors that make up DrugP1 are summed and then averaged to give the integrated vector. After that, the integrated vectors of all occurrences of the drug are averaged again to generate the comprehensive entity representation of the drug, Drug a.
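  • Formulas 1 and 2 themselves are not reproduced in this text. A reconstruction consistent with the description above might look like the following, where k is the number of characters in the drug name, and the occurrence index j and the occurrence count m are symbols introduced here for illustration only:

```latex
% Assumed reconstruction of the two averaging steps described in the text.
% v_{pj-i}: vector of the i-th character in the j-th text sequence of the drug.
\begin{align}
\mathrm{Drug}_{e_j} &= \frac{1}{k} \sum_{i=1}^{k} v_{pj\text{-}i} \tag{1} \\
\mathrm{Drug}_{a}   &= \frac{1}{m} \sum_{j=1}^{m} \mathrm{Drug}_{e_j} \tag{2}
\end{align}
```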
  • FIG. 4 is a schematic structural diagram of a model for generating a comprehensive entity representation of a drug according to an embodiment of the present application.
  • Drug-α at the bottom of Figure 4 represents the name of the drug; $\{x_i, \ldots, x_{i+k-1}\}$ represents the text sequence of Drug-α, which the BioBERT model then converts into vector form to generate text vectors (Drug e1 and Drug e2 in the figure). After that, the text vectors are processed by the above-mentioned Formula 1 and Formula 2 to generate the top-level Drug a, that is, the comprehensive entity representation. Note that this process gives the terminal device a final comprehensive entity representation by first integrating the vectors of all characters within a single occurrence of the drug, and then integrating the resulting vectors across all occurrences of the same drug.
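  • The two averaging steps can be sketched in plain Python. The toy vectors below stand in for BioBERT outputs, which are not computed here; the function names are hypothetical and the vector values are invented purely to make the arithmetic easy to follow.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def comprehensive_representation(mention_vectors):
    """mention_vectors[j][i] is the vector of character i in occurrence j of the drug."""
    # Step 1: average the character vectors inside each occurrence.
    integrated = [mean_vector(chars) for chars in mention_vectors]
    # Step 2: average the per-occurrence vectors across the whole text.
    return mean_vector(integrated)


# Toy 2-dimensional character vectors for two occurrences of the same drug.
drug_p1 = [[1.0, 2.0], [3.0, 4.0]]
drug_p2 = [[5.0, 6.0], [7.0, 8.0]]
drug_a = comprehensive_representation([drug_p1, drug_p2])
print(drug_a)
```

  • Because every occurrence of a drug name has the same length k, this two-stage mean equals a single mean over all character vectors; keeping the stages separate simply mirrors the description.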
  • the terminal device determines the fused entity representation of the drug pair based on the comprehensive entity representations of the two drugs in the drug pair.
  • the above-mentioned comprehensive entity representation of a single drug describes the semantic information of the corresponding drug at each position in the target text. Therefore, the fused entity representation can be considered to describe the semantic information of the interaction between the two drugs in the target text.
  • the terminal device can process the comprehensive entity representation through the following formula 3 to obtain the fused entity representation:
  • $H_1 = W_1[\tanh(\mathrm{Drug}_{\alpha})] + b_1$
  • $H_2 = W_2[\tanh(\mathrm{Drug}_{\beta})] + b_2$ (3)
  • $H_1$ and $H_2$ respectively represent the target vectors obtained after processing the comprehensive entity representations
  • $W_1$ and $W_2$ represent the known parameter matrices
  • $b_1$ and $b_2$ represent the known offset terms
  • tanh denotes hyperbolic tangent processing applied to the comprehensive entity representation.
  • the terminal device can splice H 1 and H 2 , and then input them into Formula 4 again to obtain the fused entity representation.
  • $H_0 = W_3[\mathrm{concat}(H_1, H_2)] + b_3$ (4)
  • $H_0$ is the fused entity representation of the drug pair
  • $W_3$ represents the known parameter matrix
  • concat denotes the concatenation operation (that is, splicing $H_1$ and $H_2$ into a single vector).
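  • Formulas 3 and 4 can be sketched in plain Python as below. All parameter values are made up for illustration, and the names `affine`, `fuse`, `drug_alpha` and `drug_beta` are hypothetical; in a trained model the matrices and offsets would come from training rather than being written by hand.

```python
import math


def affine(weights, vector, bias):
    """Compute W·v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def fuse(drug_alpha, drug_beta, params):
    """tanh, per-drug linear layer, concatenation, then a final linear layer."""
    h1 = affine(params["W1"], [math.tanh(x) for x in drug_alpha], params["b1"])
    h2 = affine(params["W2"], [math.tanh(x) for x in drug_beta], params["b2"])
    # List addition concatenates H1 and H2 into one vector.
    return affine(params["W3"], h1 + h2, params["b3"])


# Toy parameters: identity matrices for W1 and W2, a summing row for W3.
params = {
    "W1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "W2": [[1.0, 0.0], [0.0, 1.0]], "b2": [0.0, 0.0],
    "W3": [[1.0, 1.0, 1.0, 1.0]], "b3": [0.5],
}
h0_zero = fuse([0.0, 0.0], [0.0, 0.0], params)
h0_one = fuse([1.0, 0.0], [0.0, 0.0], params)
print(h0_zero, h0_one)
```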
  • the terminal device predicts the interaction relationship of the drug pair based on the fusion entity representation of the drug pair.
  • the above S203 has explained that the fused entity representation can be used to describe the semantic information of the interaction between two drugs in the target text. Based on this, when the terminal device predicts the interaction relationship of the drug pair based on the fused entity representation, its prediction accuracy will be higher.
  • the terminal device can predict the interaction relationship between drug pairs through the following formula 5:
  • softmax represents the classification function, which is used to process $H_0$ and output the probability that the drug pair belongs to each interaction relationship. Afterwards, the interaction relationship with the highest probability is taken as the final predicted interaction relationship of the drug pair.
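  • The classification step can be sketched as a softmax followed by an argmax. Formula 5 is not reproduced in this text, so the sketch assumes that $H_0$ has one component per interaction relationship; the label order follows the five labels named later in this description, and the scores are invented.

```python
import math

# The five relationship labels named in this description.
LABELS = ["False", "Int", "Advise", "Effect", "Mechanism"]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(h0, labels=LABELS):
    """Return the most probable label and the full probability list."""
    probs = softmax(h0)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical fused entity representation with one score per label.
label, probs = predict([0.1, -1.2, 0.3, 2.4, 1.0])
print(label)
```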
  • the terminal device can use a comprehensive entity representation to comprehensively describe the semantic information of the corresponding drug at each position in the target text. Afterwards, for each target drug pair, the terminal device can fuse according to the comprehensive entity representation of each drug to generate a fused entity representation of the drug pair. Furthermore, the terminal device can accurately predict the interaction relationship of the drug pair based on the fused entity representation.
  • the above-mentioned S202-S204 may all be performed on the target text by the drug relationship prediction model in the terminal device. That is, after executing S201, the terminal device can input the obtained target text into the drug relationship prediction model to predict the interaction relationship of each drug pair among the multiple drugs.
  • the drug relationship prediction model may include a first activation layer, a second activation layer, a first fully connected layer and a second fully connected layer.
  • the first activation layer is used to perform tanh function processing in Formula 3 on the comprehensive entity representation.
  • the first fully connected layer is used to perform W 1 [] + b 1 or W 2 [] + b 2 processing in Formula 3 on the vector processed by the tanh function to obtain the target vector.
  • the terminal device can splice the target vectors of the two drugs and input them to the second activation layer for processing.
  • the second activation layer is used to perform the concat function processing in Formula 4 on the spliced target vectors; the vector processed by the concat function is then input to the second fully connected layer, which applies W 3 [] + b 3 to obtain the fused entity representation.
  • the above example only illustrates the model structure in which the drug relationship prediction model processes the comprehensive entity representation of the two drugs in the drug pair and generates the fused entity representation of the drug pair. That is, only the model structure for processing S203 is explained. Among them, the drug relationship prediction model should also include a model structure for executing the processes of S202 and S204, which will not be explained one by one in this embodiment.
  • FIG. 5 is a schematic flowchart of predicting the interaction relationship between drug pairs using a drug relationship prediction model according to an embodiment of the present application.
  • the data processing specifically involves converting the sentence-level DDI2013 data set (sentence-level DDI Extraction 2013) into the document-level DDI2013 data set (document-level DDI Extraction 2013).
  • key information is then loaded from the document-level DDI2013 data set.
  • a text sequence (Article seq) is established for each text in the data set, which includes but is not limited to the text sequence of the entire text and the text sequence of each drug; drug pairs (Pairs) are then determined and drug information (Drug information) is generated.
  • document-entity embedding is performed on each drug in the drug pair.
  • a comprehensive entity representation is generated for each drug (that is, a drug emb is generated).
  • tanh+fully-connected processing is performed on the comprehensive entity representation of the drug respectively. That is, the comprehensive entity representation is sequentially input to the first activation layer and the first fully connected layer and processed to obtain the target vector (H 1 and H 2 ) of each drug.
  • the two target vectors are spliced, and the spliced target vector is input into the second activation layer and the second fully connected layer to obtain the fused entity representation (H 0 ).
  • the fused entity representation is input into the Softmax layer for classification prediction, and the interaction relationship (Type) of the drug pair is obtained.
  • the drug relationship prediction model is a pre-trained model.
  • the above-mentioned drug relationship prediction model can be BERT, SciBERT, BioBERT and other models.
  • the drug relationship prediction model may specifically be BioBERT.
  • the terminal device can process the original data set through the following steps S601-S604 to reduce the data volume of the original data set and improve the efficiency of network training.
  • the details are as follows:
  • the terminal device obtains the original data set, which includes multiple original texts.
  • the terminal device counts the number of drugs contained in each original text respectively.
  • the terminal device filters the original text containing at least two drugs in the original data set to obtain a subset of the original data.
  • the terminal device performs label processing on each original text in the original data subset to obtain a training set.
  • the method of obtaining the original text may be similar to the method of obtaining the initial text, and is not described again here. It should be noted that if the original data set is used directly for model training, it will consume a lot of training time.
  • the terminal device can separately count the number of drugs contained in each original text. Afterwards, each original text that includes only one drug is deleted, and preprocessing is performed on the original texts that remain.
  • the above-mentioned preprocessing at least includes a process of expanding drugs using drug-shared suffixes.
  • the above processing has been explained in S201 and will not be described again. It should be added that the above preprocessing also includes but is not limited to: converting English characters in the original text to lowercase, removing punctuation, and replacing all numbers in the original text with "NUM".
  • the original text after the above preprocessing is the text that can be used to train the drug relationship prediction model, thereby reducing the redundancy of the training data.
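  • The filtering and preprocessing described above can be sketched as follows. The record layout (a dictionary with `text` and `drugs` keys), the function names and the example sentences are assumptions for illustration; the shared-suffix expansion is omitted here and would run before this step.

```python
import re


def preprocess(text):
    """Lowercase, replace numbers with NUM, drop punctuation, collapse spaces."""
    text = text.lower()  # lowercase first so the NUM placeholder stays uppercase
    text = re.sub(r"\d+(?:\.\d+)?", "NUM", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def build_subset(original_texts):
    """Keep only texts that mention at least two distinct drugs, then preprocess."""
    subset = []
    for item in original_texts:
        if len(set(item["drugs"])) >= 2:
            subset.append({"text": preprocess(item["text"]), "drugs": item["drugs"]})
    return subset


# Hypothetical original data set; drug lists are assumed to be pre-annotated.
original_data = [
    {"text": "Take 2.5 mg of Aspirin, twice daily!", "drugs": ["aspirin"]},
    {"text": "Aspirin may increase the effect of Warfarin.",
     "drugs": ["aspirin", "warfarin"]},
]
subset = build_subset(original_data)
print(subset)
```

  • Removing all punctuation also strips hyphens inside drug names, so a real pipeline would likely need to protect annotated drug spans; that refinement is not shown.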
  • labeling the original text specifically includes labeling the drug pairs appearing at each position in the original text to participate in model training.
  • the terminal device can also process the relationship labels of each drug pair in the original text through the following steps S701-S702, so that a drug pair that appears several times is assigned its optimal relationship label, which solves the problem of conflicting relationship labels:
  • the terminal device obtains each drug pair contained in the original text and the relationship labels of each drug pair.
  • when a drug pair has multiple relationship labels, the terminal device determines the relationship label with the highest priority among them as the relationship label of the drug pair, according to the preset label priority.
  • the above-mentioned relationship labels are used to represent the action relationship between drug pairs and to participate in the iterative process in the drug relationship prediction model.
  • the relationship label of each drug pair is usually pre-annotated by the staff. Therefore, the terminal device can directly obtain each drug pair contained in the original text, as well as the relationship labels between each drug pair.
  • the above-mentioned tag priority is a preset priority.
  • the above tags can be: False, Int, Advise, Effect, and Mechanism respectively. Their priority can be: False < Int < Advise < Effect < Mechanism
  • the label Mechanism has the highest priority, that is, it contains more pharmacokinetic information between the two drugs;
  • the label Effect indicates that there is a certain degree of reaction between the two drugs, but not as much as Mechanism;
  • the label Advise indicates that there is an interaction between the two drugs, and the degree is less than that of Effect;
  • the label Int indicates that the degree of interaction between the two drugs is low, and the degree is not as high as Advise;
  • the label False indicates that there is no drug interaction between the two drugs.
  • the multi-relationship labels can be converted into single-relationship labels through the label priority rules shown above, so that the converted single-relationship labels better represent the interaction relationship between drug pairs in the original text.
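  • Converting multi-relationship labels into a single label per drug pair can be sketched as below. The priority order follows the description above (Mechanism highest, False lowest); the function name `resolve_labels`, the drug names and the label lists are hypothetical.

```python
# Labels ordered from lowest to highest priority, as described in the text.
PRIORITY = ["False", "Int", "Advise", "Effect", "Mechanism"]
RANK = {label: rank for rank, label in enumerate(PRIORITY)}


def resolve_labels(pair_labels):
    """Map each drug pair to the highest-priority label it received."""
    return {pair: max(labels, key=RANK.__getitem__)
            for pair, labels in pair_labels.items()}


# Hypothetical labels collected for each pair across the sentences of one text.
pair_labels = {
    ("aspirin", "warfarin"): ["False", "Effect", "Advise"],
    ("aspirin", "ibuprofen"): ["False", "False"],
}
resolved = resolve_labels(pair_labels)
print(resolved)
```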
  • the method for predicting drug-pair interaction relationships in this application is a document-level prediction method. Compared with the sentence-level prediction method for drug-pair interaction relationships, its advantages are as follows:
  • sentence-level prediction of drug-pair interaction relationships must transform a sentence into multiple instances containing only two drug entities.
  • document-level prediction of drug-to-drug interaction relationships can target multiple drug entities simultaneously. Therefore, document-level drug-drug interaction relationship prediction can simplify data preprocessing operations and reduce the text input into the drug-drug relationship prediction model.
  • the number of sentences used by sentence-level drug-pair interaction prediction methods in recent years was collected and compared with the number of sentences used in this work. See Table 1 below for details:
  • the highest amount of text is included in the original DDI Extraction 2013.
  • the number after preprocessing is the smallest: 3784 sentences in the training set and 790 sentences in the test set, for a total of 4574 sentences.
  • the method using BERT has the lowest performance, with macro-P (macro-averaged precision, an evaluation index of the model) reaching 78.78%, macro-R (macro-averaged recall, another evaluation index of the model) reaching 73.27%, and macro-F1 (macro-averaged harmonic mean, another evaluation index of the model) reaching 75.92%.
  • the results of SciBERT method are moderate, macro-R is the highest, reaching 74.80%. This is because SciBERT is trained on a large-scale corpus of scientific literature, resulting in greatly improved performance compared to BERT.
  • the best results were obtained using the BioBERT method, with macro-P reaching 85.89% and macro-F1 reaching 79.19%. This shows that BioBERT trained on the biomedical corpus can more accurately express the text data of drug pair interaction relationships.
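The relation between the three reported indices can be checked with a short calculation. The sketch below assumes macro-F1 is the harmonic mean of macro-P and macro-R, which matches the gloss "macro average harmonic mean" and the BERT figures above to within rounding; the other common definition (the mean of per-class F1 scores) would need per-class results that the text does not give.

```python
def macro_f1(macro_p, macro_r):
    """Harmonic mean of macro-averaged precision and recall (both in percent)."""
    if macro_p + macro_r == 0:
        return 0.0
    return 2 * macro_p * macro_r / (macro_p + macro_r)


# BERT figures reported in the text: macro-P 78.78%, macro-R 73.27%.
bert_f1 = macro_f1(78.78, 73.27)
print(f"BERT macro-F1 from macro-P and macro-R: {bert_f1:.2f}%")
```

The computed value differs from the reported 75.92% only in the last digit, which is consistent with the reported precision and recall having been rounded before publication.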
  • This embodiment performs special preprocessing on the DDI Extraction 2013 data set for the first time, achieving document-level prediction of drug interaction relationships.
  • the comparison uses CNN and BiLSTM, the two most commonly used neural network models.
  • These two methods also adopt document-level drug embedding, but use different neural network model structures after obtaining a comprehensive entity representation of the drug.
  • the terminal device changed it to a method that combines the CNN and BiLSTM neural network models, denoted "CNN+BiLSTM".
  • with the CNN+BiLSTM structure, macro-P reaches 66.98%, the highest among the three methods, but macro-R reaches only 50.19% and macro-F1 only 57.38%, so this network structure has the lowest overall performance.
  • the macro-P of the CNN method reaches 56.75%, macro-R reaches 59.97%, and macro-F1 reaches 58.32%.
  • the overall performance of CNN is slightly higher than CNN+BiLSTM.
  • the third method's macro-P is only 1.38 percentage points lower than that of CNN+BiLSTM, and its macro-R is only 0.26 percentage points lower than that of CNN.
  • its macro-P and macro-R are both close to the highest values, so it has the best overall performance.
  • the reason is that the input is the comprehensive entity representations of two drugs rather than a complete sentence. Neural network structures designed for sentence-level input therefore cannot match the performance of document-level structures (this is especially true of BiLSTM).
  • with the document-level prediction method of drug-pair interaction relationships, the amount of data input into the drug relationship prediction model can be greatly reduced, and the semantics of a drug at multiple different positions can be combined and expressed accurately, so that the drug relationship prediction model can extract the real semantic information of the drug in the document and improve prediction accuracy.
  • FIG. 8 is a structural block diagram of a device for predicting drug pair interaction relationships provided by an embodiment of the present application.
  • the modules included in the device for predicting drug pair interaction relationships in this embodiment are used to execute the steps in the embodiments corresponding to Figures 2, 3, 6 and 7.
  • for details, refer to FIG. 2, FIG. 3, FIG. 6, and FIG. 7 and the relevant descriptions in the corresponding embodiments.
  • the device 800 for predicting drug pair interaction relationships may include: an acquisition module 810, a comprehensive entity representation determination module 820, a fusion entity representation determination module 830, and a prediction module 840, wherein:
  • the acquisition module 810 is used to acquire target text; the target text includes multiple drugs, and each drug appears at least once in the target text.
  • the comprehensive entity representation determination module 820 is used to determine the comprehensive entity representation corresponding to each drug, and the comprehensive entity representation is used to describe the semantic information of the corresponding drug at each position in the target text.
  • the fusion entity representation determination module 830 is used for determining, for any drug pair among a plurality of drugs, the fusion entity representation of the drug pair based on the comprehensive entity representations of two drugs in the drug pair.
  • the prediction module 840 is used to predict the interaction relationship of the drug pair according to the fusion entity representation of the drug pair.
  • the acquisition module 810 is also used to:
  • acquire an initial text, where the initial text includes multiple drugs and each drug appears at least once in the initial text; if the initial text contains drug names that use a drug-shared suffix, expand those drug names to obtain the target text.
  • the comprehensive entity representation determination module 820 is also used to:
  • multiple positions of the drug in the target text are determined respectively; based on the multiple positions, a text sequence corresponding to each position is generated; the text sequence corresponding to each position is vector processed to obtain a comprehensive entity representation.
  • the comprehensive entity representation determination module 820 is also used to:
  • represent the text sequence corresponding to each position as a vector to obtain multiple text vectors, where each text vector describes the semantic information of the drug at the corresponding position; integrate the text vectors to generate the comprehensive entity representation of the drug.
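The sequence of operations described for module 820 (find every position of a drug, build a text sequence per position, vectorize each sequence, integrate the vectors) can be sketched as follows. Everything concrete here is an assumption made for illustration: `embed` is a toy hashed bag-of-words encoder standing in for a real language-model encoder, the text sequence is a fixed character window, and the integration step is a plain average.

```python
import re
import zlib

DIM = 8  # toy vector size (assumption)


def find_positions(text, drug):
    """Character offsets of every occurrence of the drug name in the text."""
    return [m.start() for m in re.finditer(re.escape(drug), text, re.IGNORECASE)]


def text_sequence(text, drug, position, window=40):
    """Text around one occurrence (assumption: a fixed character window)."""
    start = max(0, position - window)
    end = min(len(text), position + len(drug) + window)
    return text[start:end]


def embed(sequence):
    """Toy stand-in for an encoder: hashed bag-of-words counts."""
    vector = [0.0] * DIM
    for token in sequence.lower().split():
        vector[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vector


def comprehensive_representation(text, drug):
    """Average the per-position text vectors into one vector for the drug."""
    positions = find_positions(text, drug)
    if not positions:
        raise ValueError(f"{drug!r} does not appear in the text")
    vectors = [embed(text_sequence(text, drug, p)) for p in positions]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


doc = ("Aspirin may increase the effect of warfarin. "
       "Patients taking warfarin should be monitored closely.")
warfarin_rep = comprehensive_representation(doc, "warfarin")
```

Whatever encoder and integration are actually used, the point of the structure is that a drug mentioned several times still ends up with exactly one vector.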
  • the device 800 for predicting drug pair interaction relationships further includes:
  • the input module is used to input the target text into the pre-trained drug relationship prediction model for processing, and obtain the interaction relationship of each drug pair among multiple drugs.
  • the drug relationship prediction model includes a first activation layer, a second activation layer, a first fully connected layer and a second fully connected layer; the fusion entity representation determination module 830 is also used to:
  • the drug relationship prediction model is trained based on the training set; the drug pair interaction relationship prediction device 800 also includes the following module to obtain the training set:
  • the original data set acquisition module is used to obtain the original data set, which includes multiple original texts.
  • the statistics module is used to separately count the number of drugs contained in each original text.
  • the filtering module is used to filter the original text containing at least two drugs in the original data set to obtain a subset of the original data.
  • the label processing module is used to label each original text in the original data subset to obtain a training set.
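The statistics and filtering modules can be illustrated with a short sketch. It assumes each original text is available as a dictionary that already carries its annotated drug mentions; the field names `id` and `drugs` and the sample records are hypothetical.

```python
def count_drugs(document):
    """Number of distinct drugs annotated in one original text."""
    return len({name.lower() for name in document["drugs"]})


def filter_documents(dataset, min_drugs=2):
    """Keep only the original texts that contain at least `min_drugs` drugs."""
    return [doc for doc in dataset if count_drugs(doc) >= min_drugs]


# Hypothetical original data set: one record per original text.
original_dataset = [
    {"id": "doc1", "drugs": ["aspirin", "warfarin", "Aspirin"]},
    {"id": "doc2", "drugs": ["ibuprofen"]},
    {"id": "doc3", "drugs": []},
    {"id": "doc4", "drugs": ["digoxin", "quinidine", "verapamil"]},
]
subset = filter_documents(original_dataset)
```

Texts with fewer than two drugs are dropped because no drug pair, and therefore no relationship label, can be formed from them.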
  • the label processing module is also used to:
  • determine, for a drug pair with multiple relationship labels, the relationship label with the higher priority as the relationship label of the drug pair.
  • each module is used to execute the steps in the embodiments corresponding to Figure 2, Figure 3, Figure 6 and Figure 7, and each of those steps has been explained in detail in the above embodiments.
  • Figure 9 is a structural block diagram of a terminal device provided by an embodiment of the present application.
  • the terminal device 900 of this embodiment includes: a processor 910, a memory 920, and a computer program 930 stored in the memory 920 and executable on the processor 910, such as a program for predicting drug interaction relationships.
  • when the processor 910 executes the computer program 930, the steps in each embodiment of the method for predicting the interaction relationship of each drug pair are implemented, such as S201 to S204 shown in Figure 2.
  • alternatively, when the processor 910 executes the computer program 930, it implements the functions of each module in the embodiment corresponding to FIG. 8, for example, the functions of modules 810 to 840 shown in FIG. 8.
  • the computer program 930 can be divided into one or more modules, and the one or more modules are stored in the memory 920 and executed by the processor 910 to realize the method for predicting drug-pair interaction relationships provided by the embodiments of the present application. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 930 in the terminal device 900.
  • the computer program 930 can implement the method for predicting drug-drug interaction relationships provided in the embodiments of the present application.
  • the terminal device 900 may include, but is not limited to, a processor 910 and a memory 920. Those skilled in the art can understand that FIG. 9 is only an example of the terminal device 900 and does not limit it: the terminal device may include more or fewer components than shown, combine some components, or use different components; for example, it may also include input and output devices, network access devices, buses, etc.
  • the processor 910 may be a central processing unit, or another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the memory 920 may be an internal storage unit of the terminal device 900, such as a hard disk or memory of the terminal device 900.
  • the memory 920 may also be an external storage device of the terminal device 900, such as a plug-in hard disk, a smart memory card, a flash memory card, etc. equipped on the terminal device 900. Further, the memory 920 may also include both an internal storage unit of the terminal device 900 and an external storage device.
  • Embodiments of the present application provide a computer-readable storage medium, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • when the processor executes the computer program, the method for predicting drug-pair interaction relationships in the above embodiments is implemented.
  • Embodiments of the present application provide a computer program product.
  • when the computer program product runs on a terminal device, the terminal device executes the method for predicting drug-pair interaction relationships in each of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Primary Health Care (AREA)
  • Medical Informatics (AREA)
  • Medicinal Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Toxicology (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

Disclosed are a method for predicting an interaction relationship of a drug pair, and a device and a medium, which are applicable to the technical field of biomedicine. The method comprises: acquiring a target text, the target text comprising a plurality of drugs, and each drug appearing at least once in the target text (S201); respectively determining a comprehensive entity representation corresponding to each drug, the comprehensive entity representation being used for describing semantic information of the corresponding drug at each position in the target text (S202); with regard to any drug pair among the plurality of drugs, determining a fusion entity representation of the drug pair according to the comprehensive entity representation of two drugs in the drug pair (S203); and according to the fusion entity representation of the drug pair, predicting the interaction relationship of the drug pair (S204). By means of the method, the accuracy of predicting the interaction relationship of each drug pair in a document can be improved.

Description

Method, device and medium for predicting drug-pair interaction relationships
Technical field
This application belongs to the field of biomedical technology, and in particular relates to a method, device and medium for predicting drug-pair interaction relationships.
Background
Drug-drug interaction (DDI) prediction is an important research area in pharmacovigilance and plays a vital role in virtual drug screening, patient treatment planning, treatment efficacy, and patient safety research.
At present, all DDI prediction is based on sentence-level relationships. First, the abstract of each scientific article, or the part of its text that describes drugs, is divided into multiple sentences, each of which contains several drugs. Existing work therefore feeds each drug pair into a specific network for processing and predicts the interaction relationship of each drug pair.
However, the above DDI prediction methods all process drug pairs at the sentence level, whereas in practice the interaction relationship of a drug pair is often determined jointly by multiple sentences. As a result, the prior art predicts the interaction relationship of each drug pair in a document with low accuracy.
Summary of the invention
The embodiments of the present application provide a method, device and storage medium for predicting drug-pair interaction relationships, which can solve the problem of low accuracy in predicting the interaction relationship of each drug pair in a document.
In a first aspect, an embodiment of the present application provides a method for predicting drug-pair interaction relationships, the method comprising:
acquiring a target text, where the target text includes multiple drugs and each drug appears at least once in the target text;
determining the comprehensive entity representation corresponding to each drug, where the comprehensive entity representation describes the semantic information of the corresponding drug at each position in the target text;
for any drug pair among the multiple drugs, determining the fusion entity representation of the drug pair from the comprehensive entity representations of the two drugs in the pair;
predicting the interaction relationship of the drug pair from the fusion entity representation of the drug pair.
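The four steps of the first aspect can be sketched as a small pipeline. This is a minimal illustration rather than the trained model of the application: `represent`, `fuse` and `classify` are hypothetical stand-ins passed in as functions, the toy versions below carry no pharmacological meaning, and treating drug pairs as unordered is an assumption.

```python
from itertools import combinations


def predict_document(target_text, drugs, represent, fuse, classify):
    """Run the four steps on one target text and return a label per drug pair."""
    # One comprehensive entity representation per drug, not per mention.
    representations = {drug: represent(target_text, drug) for drug in drugs}
    predictions = {}
    # Every unordered drug pair in the document (assumption: order is irrelevant).
    for drug_a, drug_b in combinations(drugs, 2):
        fused = fuse(representations[drug_a], representations[drug_b])
        predictions[(drug_a, drug_b)] = classify(fused)
    return predictions


def toy_represent(text, drug):
    """Toy representation: the mention count of the drug as a one-element vector."""
    return [float(text.lower().count(drug.lower()))]


def toy_fuse(rep_a, rep_b):
    """Toy fusion: concatenate the two vectors."""
    return rep_a + rep_b


def toy_classify(fused):
    """Toy rule: call it an interaction if both drugs are mentioned at least twice."""
    return "Effect" if min(fused) >= 2 else "False"


text = ("Aspirin increases the effect of warfarin. "
        "Warfarin doses should be reduced when aspirin is given. "
        "Ibuprofen was not studied.")
result = predict_document(text, ["aspirin", "warfarin", "ibuprofen"],
                          toy_represent, toy_fuse, toy_classify)
```

Passing the three stages in as functions keeps the sketch independent of any particular encoder or classifier.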
In a second aspect, an embodiment of the present application provides a device for predicting drug-pair interaction relationships, the device comprising:
an acquisition module, used to acquire a target text, where the target text includes multiple drugs and each drug appears at least once in the target text;
a comprehensive entity representation determination module, used to determine the comprehensive entity representation corresponding to each drug, where the comprehensive entity representation describes the semantic information of the corresponding drug at each position in the target text;
a fusion entity representation determination module, used to determine, for any drug pair among the multiple drugs, the fusion entity representation of the drug pair from the comprehensive entity representations of the two drugs in the pair;
a prediction module, used to predict the interaction relationship of the drug pair from the fusion entity representation of the drug pair.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program, where the computer program implements the method of the first aspect when executed by a processor.
In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the method of the first aspect.
Compared with the prior art, the embodiments of the present application have the following beneficial effects: by obtaining the comprehensive entity representation of each drug in the target text, the terminal device can use a single comprehensive entity representation to describe the semantic information of the corresponding drug at every position in the target text. Then, for each target drug pair, the terminal device can fuse the comprehensive entity representations of the two drugs to generate the fusion entity representation of the drug pair. The terminal device can then accurately predict the interaction relationship of the drug pair based on the fusion entity representation.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of an implementation of generating sentence-level instances provided by an embodiment of the present application;
Figure 2 is an implementation flow chart of a method for predicting drug-pair interaction relationships provided by an embodiment of the present application;
Figure 3 is a schematic diagram of an implementation of generating a comprehensive entity representation in a method for predicting drug-pair interaction relationships provided by an embodiment of the present application;
Figure 4 is a schematic structural diagram of a model for generating the comprehensive entity representation of a drug provided by an embodiment of the present application;
Figure 5 is a schematic flow chart of a drug relationship prediction model predicting drug-pair interaction relationships provided by an embodiment of the present application;
Figure 6 is a schematic diagram of an implementation of generating a training set in a method for predicting drug-pair interaction relationships provided by an embodiment of the present application;
Figure 7 is a schematic diagram of an implementation of determining relationship labels in a method for predicting drug-pair interaction relationships provided by another embodiment of the present application;
Figure 8 is a schematic structural diagram of a device for predicting drug-pair interaction relationships provided by an embodiment of the present application;
Figure 9 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
Detailed description
In the following description, specific details such as particular system structures and technologies are set forth for the purpose of explanation rather than limitation, so that the embodiments of the present application can be thoroughly understood. However, it will be clear to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary detail does not obscure the description of the present application.
It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
In addition, in the description of this specification and the appended claims, the terms "first", "second", "third", etc. are used only to distinguish descriptions and cannot be understood as indicating or implying relative importance.
The method for predicting drug-pair interaction relationships provided by the embodiments of the present application can be applied to terminal devices such as tablet computers, wearable devices, notebook computers, ultra-mobile personal computers (UMPC) and netbooks. The embodiments of the present application place no restriction on the specific type of terminal device.
As described in the background, all current DDI prediction is based on sentence-level relationships. Training a sentence-level drug relationship prediction model usually requires a large data set. Specifically, after the data set is obtained, the abstract of each scientific article in it, or the part of its text that describes drugs, is treated as one document and saved in a separate XML file. Each XML file in turn consists of one or more sentences, and each sentence contains several drugs together with a relationship label for every two drugs.
Existing work therefore converts each sentence into several instances, where each instance focuses on only two drugs. See Figure 1, which is a schematic diagram of an implementation of generating sentence-level instances provided by an embodiment of the present application. Specifically, suppose a sentence (Sentence) contains three distinct drugs (Drug1, Drug2, Drug3); the terminal device then needs to convert it into three instances (Instance_A, Instance_B, Instance_C), each of which focuses on only one drug pair (e1 and e2).
Based on this, the number of instances generated from a sentence is determined by the number of drugs it contains, by combination: for example, a sentence containing 4 drugs must be converted into $C_4^2 = 6$ instances. This means that the number of instances to be processed when training a sentence-level drug relationship prediction model is several times the number of sentences in the original text. A sentence-level drug relationship prediction model therefore needs more memory during training, and training the network takes longer. In this embodiment, by contrast, each drug in a text is described by a single comprehensive entity representation, and each drug pair corresponds to a single fusion entity representation. This differs from sentence-level training, which splits the text into multiple sentences and represents the drugs and drug pairs in each sentence with separate vectors. The terminal device can thus greatly reduce the memory required for model training.
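The growth in instance count described above can be reproduced with the standard library. The sketch below assumes an instance is simply the sentence plus one unordered drug pair marked as e1 and e2; the dictionary layout is hypothetical and is not the format used by any particular corpus.

```python
from itertools import combinations
from math import comb


def sentence_instances(sentence, drugs):
    """One instance per unordered drug pair, with the pair marked as e1 and e2."""
    return [{"sentence": sentence, "e1": a, "e2": b}
            for a, b in combinations(drugs, 2)]


# The three-drug sentence of Figure 1 yields three instances.
instances = sentence_instances("Drug1 ... Drug2 ... Drug3",
                               ["Drug1", "Drug2", "Drug3"])

# Number of instances for sentences containing 2 to 6 drugs.
instances_per_sentence = {n: comb(n, 2) for n in range(2, 7)}
```

Because the count grows quadratically with the number of drugs, sentences that list many drugs dominate the size of a sentence-level training set.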
In addition, because a sentence-level drug relationship prediction model attends to only two drugs in a sentence during training, its predictions are likewise based only on the semantic information between those two drugs. In practice, however, the relationship of a drug pair is often determined jointly by multiple sentences. Existing methods for predicting drug-pair interaction relationships therefore cannot mine more complete semantic information about the interaction of a drug pair from biomedical text.
To address this, in this embodiment the terminal device processes the target text through steps S201-S204 below, generating semantic information that describes the interaction of two drugs across the target text and thereby improving the accuracy of drug interaction relationship prediction.
The method for predicting drug-pair interaction relationships provided by the present application is described below by way of example with reference to specific embodiments.
Refer to Figure 2, which shows an implementation flow chart of a method for predicting drug-pair interaction relationships provided by an embodiment of the present application. The method includes the following steps:
S201. The terminal device acquires a target text; the target text includes multiple drugs, and each drug appears at least once in the target text.
In one embodiment, the target text is usually a text related to the pharmaceutical field, including but not limited to texts in the form of journals and papers. The target text may be in Chinese, English or another language, which is not limited here.
In one embodiment, to predict drug-pair interactions, the target text must include at least two drugs; otherwise, no interaction relationship between two drugs can be predicted from the target text.
It can be understood that any drug in the target text may appear multiple times at different positions in the target text. Therefore, this embodiment places no restriction on the number of times a drug appears.
在一实施例中,终端设备具体可以通过如下步骤获取目标文本,详述如下:In one embodiment, the terminal device may obtain the target text through the following steps, as detailed below:
终端设备获取初始文本,初始文本中包括多种药物,每种药物在初始文本中出现至少一次;若初始文本中存在使用药物共享后缀的药物名称,则对药物名称进行扩充,得到目标文本。The terminal device obtains the initial text, which includes multiple drugs, and each drug appears at least once in the initial text; if there is a drug name using a drug-sharing suffix in the initial text, the drug name is expanded to obtain the target text.
在一实施例中,上述初始文本为未经过处理的文本,其可以为终端设备基于药物名称从网络上爬取的文本,也可以为预先存储在终端设备指定的文本,本实施例中,对终端设备获取目标文本的路径不作限定。In one embodiment, the above-mentioned initial text is an unprocessed text, which can be a text crawled by the terminal device from the network based on the name of the drug, or can be a text specified in advance and stored in the terminal device. In this embodiment, for The path for the terminal device to obtain the target text is not limited.
在一实施例中,上述实体名称为药物的药物名称。其中,药物共享后缀为:多个药物的药物名称具有部分名称相同时,在初始文本中可能出现简写的情况,使多个药物共享一个后缀。该共享的后缀即为相同的部分名称。In one embodiment, the above-mentioned entity name is the drug name of the drug. Among them, the drug sharing suffix is: When the drug names of multiple drugs have part of the same name, abbreviations may appear in the initial text, causing multiple drugs to share a suffix. The shared suffix is the same part name.
For example, the entity names of two drugs may be 1) diagnostic monoclonal antibodies and 2) therapeutic monoclonal antibodies, and the initial text containing both drugs may read "…when treated with other diagnostic or therapeutic monoclonal antibodies." Here the two drug names share "monoclonal antibodies" as a suffix. In this case, the terminal device expands the sentence in the initial text to obtain the target text, changing it to "…when treated with other diagnostic monoclonal antibodies or therapeutic monoclonal antibodies".
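The suffix expansion in this example can be sketched in Python. This is a minimal illustration, not the method of the application: the function name `expand_shared_suffix`, the assumption that the shared suffix is already known, and the restriction to single-word prefixes joined by "or" or "and" are all introduced here for illustration.

```python
import re


def expand_shared_suffix(text, shared_suffix, conjunctions=("or", "and")):
    """Rewrite '<A> or <B> <suffix>' as '<A> <suffix> or <B> <suffix>'.

    Assumes the shared suffix is known in advance and that each drug-specific
    prefix is a single word (both are simplifying assumptions).
    """
    conj = "|".join(re.escape(c) for c in conjunctions)
    pattern = re.compile(
        r"\b(\w+) (" + conj + r") (\w+) " + re.escape(shared_suffix)
    )
    suffix_words = set(shared_suffix.split())

    def repl(match):
        first, conjunction, second = match.group(1), match.group(2), match.group(3)
        # Skip text that is already expanded, e.g. "... antibodies or therapeutic ..."
        if first in suffix_words:
            return match.group(0)
        return f"{first} {shared_suffix} {conjunction} {second} {shared_suffix}"

    return pattern.sub(repl, text)


initial = "when treated with other diagnostic or therapeutic monoclonal antibodies."
target = expand_shared_suffix(initial, "monoclonal antibodies")
print(target)
```

Applying the function to text that has already been expanded leaves it unchanged, so the step can be run safely more than once.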
S202. The terminal device determines the comprehensive entity representation corresponding to each drug. The comprehensive entity representation describes the semantic information of the corresponding drug at each position in the target text.
In one embodiment, the comprehensive entity representation comprehensively describes the semantic information of the corresponding drug at each position in the target text. Specifically, when a drug appears at multiple positions in the target text, processing only the sentence that contains the drug at one of those positions yields semantic information for the drug in that sentence alone. If the subsequent drug pair interaction prediction is based on that information, its accuracy may be low, because semantic information extracted from a single sentence cannot stand in for the semantic information of the drug at the other positions in the target text.
Based on this, in this embodiment the terminal device uses the comprehensive entity representation of each drug in subsequent processing, thereby improving the prediction accuracy of the drug pair interaction relationship.
In a specific embodiment, referring to Figure 3, the terminal device may implement S202 through the following sub-steps S301-S303:
S301. For any one of the drugs, the terminal device determines the multiple positions of the drug in the target text.
S302. The terminal device generates a text sequence for each of the multiple positions.
S303. The terminal device performs vector processing on the text sequence corresponding to each position to obtain the comprehensive entity representation.
In one embodiment, a position is the location information of the drug in the target text, and a text sequence is a sequence generated from a position. Specifically, the terminal device can determine the location information of the drug in the target text from the drug's entity name. The terminal device may also use "[" and "]" to mark the start and end of each occurrence of the drug. When a drug has multiple positions (that is, the drug appears multiple times in the target text), each position can additionally be assigned order information according to the order in which the occurrences appear in the target text.
For example, the terminal device can represent the text sequences of a drug as follows:
$X = \{x_1, x_2, \ldots, x_n\}$, where $X$ denotes the text sequence of the entire target text, $x_n$ denotes the $n$-th character of the target text, and $n$ is also the total number of characters in the target text. Suppose a given drug Drug-α consists of $k$ characters and appears twice; its text sequences may then be $P1 = \{x_i, x_{i+1}, \ldots, x_{i+k-1}\}$ and $P2 = \{x_j, x_{j+1}, \ldots, x_{j+k-1}\}$. The 1 and 2 in P1 and P2 indicate the order in which the occurrences appear in the target text. $x_i$ is the $i$-th character of the target text, at which the first occurrence of the drug begins; because the drug name consists of $k$ characters, $x_{i+k-1}$ is the position at which that first occurrence ends. If the drug appears more times, there are correspondingly more text sequences.
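Steps S301 and S302 and the bracket marking described above can be sketched as follows. This is only an illustration under stated assumptions: exact string matching is used to locate a drug name, indices are 0-based (the notation above is 1-based), and the helper names `find_mentions` and `mark_mentions` as well as the placeholder names `drugA` and `drugB` are hypothetical.

```python
def find_mentions(text, drug_name):
    """Return the (start, end) character indices, end inclusive, of each occurrence.

    Occurrences are listed in order of appearance, so the list index gives
    the order information (P1, P2, ...) of each position.
    """
    if not drug_name:
        return []
    spans = []
    k = len(drug_name)
    i = text.find(drug_name)
    while i != -1:
        spans.append((i, i + k - 1))        # x_i ... x_{i+k-1}
        i = text.find(drug_name, i + k)     # continue after this occurrence
    return spans


def mark_mentions(text, spans):
    """Wrap every span in '[' and ']' to mark where each occurrence starts and ends."""
    pieces = []
    prev = 0
    for start, end in spans:
        pieces.append(text[prev:start])
        pieces.append("[" + text[start:end + 1] + "]")
        prev = end + 1
    pieces.append(text[prev:])
    return "".join(pieces)


text = "drugA may interact with drugB; avoid drugA."
spans = find_mentions(text, "drugA")
marked = mark_mentions(text, spans)
print(spans, marked)
```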
In one embodiment, vector processing of a text sequence means representing the text sequence in a form that the terminal device can recognize and process.
Specifically, the terminal device can produce a vector representation of the text sequence at each position, obtaining one text vector per position; each text vector describes the semantic information of the drug at the corresponding position. The text vectors are then integrated to generate the comprehensive entity representation of the drug.
The vector representation of each text sequence may be generated by a model. For example, the text sequences are vectorized with BioBERT (a named entity recognition model), and the resulting text vectors are later integrated into the comprehensive entity representation. For example, the text vectors that BioBERT generates for the text sequences P1 and P2 above may be DrugP1 = {vp1_1, vp1_2, …, vp1_k} and DrugP2 = {vp2_1, vp2_2, …, vp2_k}, where DrugP1 is the text vector corresponding to the text sequence P1 and vp1_k is the vector corresponding to the k-th character of P1.
It can be understood that at this point each text vector describes the semantic information of the drug only at its corresponding position in the target text. To obtain the comprehensive entity representation of the drug, the terminal device therefore still needs to integrate the drug's multiple text vectors. Specifically, the terminal device can integrate the text vectors using Formulas 1 and 2 below:
Figure PCTCN2022137046-appb-000001
Figure PCTCN2022137046-appb-000002
Here, $Drug_{e1}$ denotes the integrated vector obtained by integrating the first text vector: for DrugP1, the vectors in DrugP1 are summed and then averaged, and this average is the integrated vector. The integrated vectors of the drug are then averaged again to generate the comprehensive entity representation of the drug, $Drug_\alpha$.
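Read literally, this description corresponds to two averaging steps. A possible reconstruction is given below, assuming a drug that appears twice and whose name spans $k$ characters; the summation index $t$ is introduced here for illustration and the exact notation of the original formulas may differ.

```latex
\begin{align}
Drug_{e1} &= \frac{1}{k}\sum_{t=1}^{k} vp1_{t}, \qquad
Drug_{e2} = \frac{1}{k}\sum_{t=1}^{k} vp2_{t} \tag{1} \\
Drug_{\alpha} &= \frac{1}{2}\left(Drug_{e1} + Drug_{e2}\right) \tag{2}
\end{align}
```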
Specifically, refer to Figure 4, a schematic diagram of a model structure for generating the comprehensive entity representation of a drug according to an embodiment of this application. Drug-α at the bottom of Figure 4 is the drug name, and $\{x_i, \ldots, x_{i+k-1}\}$ is a text sequence of Drug-α. The BioBERT model converts the text sequences into vector representations to generate the text vectors ($Drug_{e1}$ and $Drug_{e2}$ in the figure). Formulas 1 and 2 are then applied to the text vectors to generate $Drug_\alpha$ at the top, that is, the comprehensive entity representation. Note that in this process the terminal device obtains the final comprehensive entity representation by integrating all parts of a single occurrence of the drug and then integrating the per-occurrence vectors of the same drug.
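The two-stage averaging can be sketched in plain Python. The sketch assumes that the per-character vectors have already been produced by an encoder such as BioBERT and are supplied as lists of floats; the function names and the toy 2-dimensional vectors are hypothetical. Because it averages over however many occurrences it receives, it also covers a drug that appears more than twice.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / count for d in range(dim)]


def comprehensive_entity_representation(occurrences):
    """occurrences: one list of character vectors per occurrence of the drug.

    Stage 1 averages the character vectors of each occurrence (Drug_e1, Drug_e2, ...).
    Stage 2 averages those per-occurrence vectors into a single representation.
    """
    integrated = [mean_vector(char_vectors) for char_vectors in occurrences]
    return mean_vector(integrated)


# Toy vectors standing in for encoder output (illustrative values only).
drug_p1 = [[1.0, 2.0], [3.0, 4.0]]
drug_p2 = [[5.0, 6.0], [7.0, 8.0]]
drug_alpha = comprehensive_entity_representation([drug_p1, drug_p2])
print(drug_alpha)
```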
S203. For any drug pair among the multiple drugs, the terminal device determines the fused entity representation of the drug pair from the comprehensive entity representations of the two drugs in the pair.
In one embodiment, the comprehensive entity representation of a single drug describes the semantic information of that drug at each position in the target text. The fused entity representation can therefore be regarded as describing the semantic information of the interaction between the two drugs in the target text.
The terminal device can process the comprehensive entity representations using Formula 3 below as the first step toward the fused entity representation:
$$H_1 = W_1[\tanh(Drug_\alpha)] + b_1, \qquad H_2 = W_2[\tanh(Drug_\beta)] + b_2 \tag{3}$$
Here, $H_1$ and $H_2$ denote the target vectors obtained by processing the comprehensive entity representations, $W_1$ and $W_2$ denote known parameter matrices, $b_1$ and $b_2$ denote known bias terms, and tanh denotes applying the hyperbolic tangent to the comprehensive entity representation.
After obtaining $H_1$ and $H_2$, the terminal device can concatenate them and feed the result into Formula 4 to obtain the fused entity representation.
$$H_0 = W_3[\mathrm{concat}(H_1, H_2)] + b_3 \tag{4}$$
Here, $H_0$ is the fused entity representation of the drug pair, $W_3$ denotes a known parameter matrix, $b_3$ denotes a known bias term, and concat denotes the concatenation operation (that is, $H_1$ and $H_2$ are spliced end to end).
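Formulas 3 and 4 can be traced with a small pure-Python sketch. It is illustrative only: the weights and biases below are hand-picked toy values rather than trained parameters, and the helper names `linear`, `tanh_vec` and `fuse` are hypothetical.

```python
import math


def linear(weight, vec, bias):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def tanh_vec(vec):
    """Apply the hyperbolic tangent element-wise."""
    return [math.tanh(x) for x in vec]


def fuse(drug_alpha, drug_beta, w1, b1, w2, b2, w3, b3):
    """Return the fused entity representation H_0 of a drug pair."""
    h1 = linear(w1, tanh_vec(drug_alpha), b1)   # Formula 3, first drug
    h2 = linear(w2, tanh_vec(drug_beta), b2)    # Formula 3, second drug
    return linear(w3, h1 + h2, b3)              # Formula 4; list "+" concatenates


# Toy parameters (assumptions for illustration, not learned values).
identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = fuse(
    [0.0, 0.0], [0.0, 0.0],
    identity, [1.0, 2.0],
    identity, [3.0, 4.0],
    [[1.0, 1.0, 1.0, 1.0]], [0.5],
)
print(h0)
```

With zero inputs the tanh outputs are zero, so $H_1$ and $H_2$ reduce to the bias terms and the single output is their sum plus $b_3$, which makes the data flow easy to check by hand.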
It should be noted that the formulas above are written for a drug that appears twice in the target text; when a drug appears more often, the formulas should be adapted accordingly.
S204. The terminal device predicts the interaction relationship of the drug pair from the fused entity representation of the drug pair.
In one embodiment, as explained in S203, the fused entity representation describes the semantic information of the interaction between the two drugs in the target text. Predicting the interaction relationship of the drug pair from this fused entity representation therefore yields higher prediction accuracy.
Specifically, the terminal device can predict the interaction relationship of the drug pair using Formula 5 below:
$$\mathrm{Type} = \mathrm{softmax}(H_0) \tag{5}$$
Here, softmax denotes the classification function, which processes $H_0$ and outputs the probability that the drug pair belongs to each interaction relationship. The interaction relationship with the highest probability is then taken as the final predicted interaction relationship of the drug pair.
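The classification step can be sketched as follows. The sketch assumes that $H_0$ has one component per relationship type and that the components follow the order of the `LABELS` list; that ordering, the helper names and the toy scores are assumptions made here for illustration.

```python
import math

# The five relationship types named in this document; the index order is an assumption.
LABELS = ["False", "Int", "Advise", "Effect", "Mechanism"]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_type(h0, labels=LABELS):
    """Return the most probable relationship and the full probability list."""
    probs = softmax(h0)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


predicted, probabilities = predict_type([0.1, 0.2, 0.3, 2.0, 0.5])
print(predicted, probabilities)
```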
In this embodiment, by obtaining the comprehensive entity representation of each drug in the target text, the terminal device can use a single representation to describe the semantic information of the drug at every position in the target text. For each target drug pair, the terminal device then fuses the comprehensive entity representations of the two drugs to generate the fused entity representation of the pair. Based on the fused entity representation, the terminal device can accurately predict the interaction relationship of the drug pair.
In one embodiment, S202-S204 may all be performed on the target text by a drug relationship prediction model in the terminal device. That is, after executing S201, the terminal device can input the obtained target text into the drug relationship prediction model to predict the interaction relationship of every drug pair among the multiple drugs.
Specifically, the drug relationship prediction model may include a first activation layer, a second activation layer, a first fully connected layer, and a second fully connected layer. The first activation layer applies the tanh function of Formula 3 to the comprehensive entity representation. The first fully connected layer applies $W_1[\cdot] + b_1$ or $W_2[\cdot] + b_2$ of Formula 3 to the tanh output to obtain the target vector. The terminal device then concatenates the target vectors of the two drugs and inputs them to the second activation layer, which performs the concat processing of Formula 4; the concatenated vector is then passed through $W_3[\cdot] + b_3$ to obtain the fused entity representation.
It should be noted that the example above describes only the part of the drug relationship prediction model that processes the comprehensive entity representations of the two drugs in a pair to generate the fused entity representation, that is, only the model structure for S203. The drug relationship prediction model also includes the model structures that perform S202 and S204, which this embodiment does not explain one by one.
In a specific embodiment, refer to Figure 5, a schematic flowchart of a drug relationship prediction model predicting the interaction relationship of a drug pair according to an embodiment of this application. Data processing converts the sentence-level DDI2013 data set (sentence-level DDI Extraction 2013) into the document-level DDI2013 data set (document-level DDI Extraction 2013). Key information is then loaded from the document-level DDI2013 data set. Specifically, for each text in the data set, text sequences are built (Article seq), including but not limited to the text sequence of the whole text and the text sequence of each drug; drug pairs are determined (Pairs); and drug information is generated (Drug info). For each determined drug pair, document-entity embedding is then performed on each drug in the pair, that is, a comprehensive entity representation is generated for each drug (Drug emb). Next, tanh + fully-connected processing is applied to each comprehensive entity representation: it is input to the first activation layer and then the first fully connected layer, yielding the target vector of each drug ($H_1$ and $H_2$). The two target vectors are concatenated, and the concatenated vector is input to the second activation layer and the second fully connected layer to obtain the fused entity representation ($H_0$). Finally, the fused entity representation is input to the Softmax layer for classification, giving the interaction relationship (Type) of the drug pair.
In one embodiment, the drug relationship prediction model is a pre-trained model. For example, the drug relationship prediction model may be BERT, SciBERT, BioBERT, or a similar model. In this embodiment, the drug relationship prediction model may specifically be BioBERT.
In one embodiment, to address the problem noted above that sentence-level drug relationship prediction models require more memory during training and take longer to train, and referring to Figure 6, the terminal device may process the original data set through steps S601-S604 below, reducing its data volume and improving training efficiency:
S601. The terminal device obtains an original data set, which includes multiple original texts.
S602. The terminal device counts the number of drugs contained in each original text.
S603. The terminal device selects the original texts in the original data set that contain at least two drugs, obtaining an original data subset.
S604. The terminal device labels each original text in the original data subset to obtain a training set.
In one embodiment, the original texts may be obtained in a manner similar to the initial text, which is not repeated here. Note that using the original data set directly for model training would consume a large amount of training time.
It can be understood that some original texts may contain fewer than two drugs. Such original texts cannot be used directly for training. The terminal device can therefore count the number of drugs contained in each original text, delete the original texts that contain only one drug, and preprocess the remaining original texts.
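The counting and filtering of S602 and S603 can be sketched as follows. How drugs are recognized in a text is not specified here, so the sketch assumes a hypothetical lexicon of drug names matched by case-insensitive substring search; the names `count_drugs`, `filter_texts` and the placeholder drugs are illustrative.

```python
def count_drugs(text, drug_lexicon):
    """Count how many distinct lexicon entries occur in the text (assumed matching rule)."""
    lowered = text.lower()
    return sum(1 for name in drug_lexicon if name.lower() in lowered)


def filter_texts(original_texts, drug_lexicon, min_drugs=2):
    """Keep only the texts that mention at least `min_drugs` distinct drugs."""
    return [t for t in original_texts if count_drugs(t, drug_lexicon) >= min_drugs]


lexicon = ["drugA", "drugB", "drugC"]          # hypothetical drug names
dataset = [
    "drugA increases the effect of drugB.",    # two drugs: kept
    "drugC is taken once daily.",              # one drug: removed
    "No medication is mentioned here.",        # no drug: removed
]
subset = filter_texts(dataset, lexicon)
print(subset)
```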
In one embodiment, the preprocessing includes at least expanding the drug names that use a shared drug suffix. That process has been explained in S201 and is not repeated here. In addition, the preprocessing includes but is not limited to converting the English characters in the original text to lowercase, removing punctuation, and replacing every number in the original text with "NUM"; this is not limited.
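The three additional preprocessing steps can be sketched as below. The order of operations (lowercase first, so that the inserted "NUM" token stays uppercase) and the treatment of decimals as a single number are assumptions of this sketch; the function name `normalize` is hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, replace numbers with NUM, strip ASCII punctuation, tidy spaces."""
    text = text.lower()
    # Replace integers and decimals with a single NUM token (assumed number pattern).
    text = re.sub(r"\d+(?:\.\d+)?", " NUM ", text)
    # Remove ASCII punctuation; note this also removes hyphens inside drug names.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


cleaned = normalize("Take 2.5 mg of DrugA, twice daily!")
print(cleaned)
```

Because punctuation removal also strips hyphens and brackets, a real pipeline would need to apply it in a way that preserves drug names and any position markers.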
It can be understood that the original texts that have undergone this preprocessing are the texts used to train the drug relationship prediction model, which reduces redundancy in the training data.
In one embodiment, labeling an original text means labeling the drug pairs that appear at each position in the original text so that they can take part in model training.
However, as the earlier description of sentence-level drug relationship prediction models in the prior art shows, when a document is split into multiple sentences, several of those sentences may contain the same drug pair. The same drug pair may carry different relationship labels at different positions in the document, which means that it corresponds to different semantic information at those positions. If the relationship label of the first occurrence, or of some arbitrary occurrence, is used for model training, the prediction accuracy of the resulting drug relationship prediction model will decrease. If every occurrence of the same drug pair in a document keeps its own relationship label, the relationship labels in the data set become inconsistent.
Based on this, in this embodiment and referring to Figure 7, the terminal device can also process the relationship labels of each drug pair in the original text through S701-S702 below, so that each drug pair is assigned its best relationship label and the label inconsistency is resolved:
S701. The terminal device obtains each drug pair contained in the original text and the relationship labels of each drug pair.
S702. If a drug pair has multiple relationship labels, the terminal device, according to a preset label priority, determines the relationship label with the highest priority among them as the relationship label of the drug pair.
In one embodiment, the relationship label represents the interaction relationship between the two drugs of a pair and is used in the iterative training process of the drug relationship prediction model. The relationship label of each drug pair is usually annotated in advance by staff, so the terminal device can directly obtain each drug pair contained in the original text together with its relationship labels.
It should be noted that when the same drug pair appears at different positions in the original text, its context and semantics may differ, and so may the corresponding relationship labels. In this embodiment, a drug pair has a single fused entity representation, generated from the comprehensive entity representations of its two drugs. Therefore, even when a drug pair has multiple relationship labels, only one relationship label should be associated with the pair for model training.
In one embodiment, the label priority is preset. For example, the labels may be False, Int, Advise, Effect, and Mechanism, with the priority False < Int < Advise < Effect < Mechanism.
As shown above, the label Mechanism has the highest priority, meaning that the pair carries the most pharmacokinetic information. The label Effect indicates that the two drugs react with each other to some degree, but less than Mechanism. The label Advise indicates that the two drugs interact, to a lesser degree than Effect. The label Int indicates a low degree of interaction between the two drugs, less than Advise. The label False indicates that the two drugs have no drug interaction.
In this embodiment, when a drug pair has multiple relationship labels, the label priority rule above converts them into a single relationship label that better represents the interaction relationship of the drug pair in the original text.
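The conversion from several labels to one can be sketched as follows, using the priority order stated above. Treating a drug pair as unordered (so that (A, B) and (B, A) are the same pair) is an assumption of this sketch, and the function name and placeholder drug names are hypothetical.

```python
# Priority order as stated above: False < Int < Advise < Effect < Mechanism.
PRIORITY = {"False": 0, "Int": 1, "Advise": 2, "Effect": 3, "Mechanism": 4}


def resolve_labels(pair_labels):
    """Collapse sentence-level labels into one label per drug pair.

    pair_labels: iterable of ((drug_a, drug_b), label), one entry per
    position at which the pair is annotated in the original text.
    """
    resolved = {}
    for (drug_a, drug_b), label in pair_labels:
        key = tuple(sorted((drug_a, drug_b)))   # assumed: pairs are unordered
        current = resolved.get(key)
        if current is None or PRIORITY[label] > PRIORITY[current]:
            resolved[key] = label
    return resolved


annotations = [
    (("drugA", "drugB"), "False"),
    (("drugB", "drugA"), "Effect"),
    (("drugA", "drugB"), "Advise"),
    (("drugA", "drugC"), "Int"),
]
document_labels = resolve_labels(annotations)
print(document_labels)
```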
In one embodiment, the method for predicting drug pair interaction relationships in this application is a document-level method. Compared with sentence-level methods, its advantages are as follows:
In practice, sentence-level prediction of drug pair interaction relationships must convert each sentence into multiple instances that each contain only two drug entities. Document-level prediction, by contrast, can handle multiple drug entities at once. Document-level prediction therefore simplifies data preprocessing and reduces the amount of text input to the drug relationship prediction model. To show this advantage more directly, the numbers of sentences reported by recent sentence-level methods (the sentences that must be input to the drug relationship prediction model) were collected and compared with the number of sentences used in this application. See Table 1 below:
Table 1. Number of sentences in different methods
Figure PCTCN2022137046-appb-000003
Figure PCTCN2022137046-appb-000004
As Table 1 shows, the original DDI Extraction 2013 contains the largest amount of text: after preprocessing, it has 27792 sentences in the training set and 5716 sentences in the test set, 33508 sentences in total. This embodiment has the smallest number after preprocessing: 3784 sentences in the training set and 790 sentences in the test set, 4574 sentences in total.
(2) Comparison of different BERT models: three BERT pre-trained models are commonly used in text processing, namely BERT, SciBERT, and BioBERT. To observe the effect of the three pre-trained models on document-level prediction of drug pair interaction relationships, BioBERT in the method was replaced with BERT and with SciBERT. In the actual experiments, however, the proposed method did not work properly once BioBERT was replaced by BERT or SciBERT. The reason is that the document-level method does not blind the drugs (replace drug names with placeholder tokens), and most drugs have complex names. Of the three pre-trained models, only BioBERT is trained on a large-scale biomedical corpus, so only BioBERT can accurately produce entity representations of complex drug names. To compare the representation quality of the three pre-trained models further, we used a sentence-level method for extracting drug pair interaction relationships on the DDI corpus. All other experimental settings were identical, so that the analysis shows which pre-trained model better represents the text data of drug pair interaction relationships. See Table 2 below:
Table 2. Results using different BERT models
Pre-trained model    macro-P (%)    macro-R (%)    macro-F1 (%)
BERT                 78.78          73.27          75.92
SciBERT              81.71          74.80          78.10
BioBERT              85.89          73.46          79.19
As shown in Table 2, the method using BERT performs worst, with macro-P (macro-averaged precision, one evaluation metric of the model) of 78.78%, macro-R (macro-averaged recall, another evaluation metric) of 73.27%, and macro-F1 (macro-averaged harmonic mean, a further evaluation metric) of 75.92%. The SciBERT method gives intermediate results and the highest macro-R, 74.80%. SciBERT is trained on a large-scale corpus of scientific literature, so its performance is considerably better than that of BERT. The BioBERT method gives the best results, with macro-P of 85.89% and macro-F1 of 79.19%. This shows that BioBERT, trained on a biomedical corpus, represents the text data of drug pair interaction relationships more accurately.
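For readers unfamiliar with these metrics, the sketch below computes macro-averaged precision and recall from per-class counts and takes macro-F1 as the harmonic mean of the two. Whether the application computes macro-F1 this way, or by averaging per-class F1 scores, is not stated; the harmonic-mean form is an assumption of this sketch, although the macro-F1 values reported in Table 2 agree with it to within 0.01. The function names and the toy labels are hypothetical.

```python
def harmonic_mean(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def macro_scores(y_true, y_pred, labels):
    """Return (macro-P, macro-R, macro-F1) over the given labels."""
    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    macro_p = sum(precisions) / len(labels)
    macro_r = sum(recalls) / len(labels)
    return macro_p, macro_r, harmonic_mean(macro_p, macro_r)


# Toy two-class example (illustrative labels only).
macro_p, macro_r, macro_f1 = macro_scores(
    ["A", "A", "B", "B"], ["A", "B", "B", "B"], ["A", "B"]
)
print(macro_p, macro_r, macro_f1)

# Cross-check against the BioBERT row of Table 2.
print(harmonic_mean(85.89, 73.46))
```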
(3) Performance of the document-level comprehensive entity representation (embedding) of drugs: To verify its actual performance, an experiment was designed to compare the results obtained without and with the document-level drug embedding. Specifically, the former is labeled Without DEE (each drug is embedded only at its first occurrence) and is compared with the method using DEE proposed in this embodiment.
Table 3. Results with and without DEE
Method         macro-P (%)    macro-R (%)    macro-F1 (%)
Without DEE    60.07          56.32          58.43
Use DEE        65.60          59.71          62.51
As shown in Table 3, without DEE, macro-P is 60.07%, macro-R is 56.32%, and macro-F1 is 58.43%. With DEE, macro-P is 65.60%, macro-R is 59.71%, and macro-F1 is 62.51%, which are 5.53, 3.39 and 4.08 percentage points higher, respectively, than without DEE. The reason is that, without DEE, the contextual semantic information of the same drug at different positions in the document is not taken into account. By using the document-level drug embedding, the method obtains a complete comprehensive entity representation of each drug in the document and therefore a more accurate prediction result.
(4) Comparison of different neural network model structures: This embodiment is the first to apply dedicated preprocessing to the DDI Extraction 2013 dataset in order to predict drug-pair interaction relationships at the document level. There is currently no other work on document-level DDI datasets. To verify the effectiveness of the proposed method, it is compared with methods using CNN and BiLSTM, two of the most commonly used neural network models. Both methods also use the document-level drug embedding, but apply a different neural network model structure after the comprehensive entity representations of the drugs are obtained. In practice, however, the method using only the BiLSTM network model did not work. The terminal device therefore replaced it with a method that combines the CNN and BiLSTM neural network models, denoted "CNN+BiLSTM".
Table 4. Results using different neural network model structures
Figure PCTCN2022137046-appb-000005
As can be seen from Table 4, although CNN+BiLSTM reaches a macro-P of 66.98%, the highest of the three methods, its macro-R is only 50.19% and its macro-F1 only 57.38%, so this network structure has the lowest overall performance. The CNN method reaches a macro-P of 56.75%, a macro-R of 59.97%, and a macro-F1 of 58.32%, so its overall performance is slightly higher than that of CNN+BiLSTM. With the structure of the drug relationship prediction model of this application, macro-P is only 1.38 percentage points lower than that of CNN+BiLSTM, and macro-R is only 0.26 percentage points lower than that of CNN. Its macro-P and macro-R are both close to the highest values, so its overall performance is the best. The reason is that the input consists of the comprehensive entity representations of two drugs rather than a complete sentence. Network structures suited to sentence-level input (BiLSTM in particular) therefore cannot achieve the same performance in the document-level setting.
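The stated differences can be checked with a few lines of Python. The sketch below uses only figures quoted in the text; it assumes that the "Use DEE" row of Table 3 describes the same proposed structure, and the variable names are illustrative.

```python
# Figures quoted in the text for Table 4 (percent).
cnn_bilstm = {"macro-P": 66.98, "macro-R": 50.19, "macro-F1": 57.38}
cnn = {"macro-P": 56.75, "macro-R": 59.97, "macro-F1": 58.32}

# The text gives the proposed structure as differences from the best value
# in each column: 1.38 below CNN+BiLSTM on macro-P, 0.26 below CNN on macro-R.
proposed = {
    "macro-P": round(cnn_bilstm["macro-P"] - 1.38, 2),
    "macro-R": round(cnn["macro-R"] - 0.26, 2),
}

# Assumption: the "Use DEE" row of Table 3 reports the same proposed method.
table3_use_dee = {"macro-P": 65.60, "macro-R": 59.71, "macro-F1": 62.51}

for metric, value in proposed.items():
    matches = abs(value - table3_use_dee[metric]) < 1e-9
    print(metric, value, "matches Table 3" if matches else "differs from Table 3")
```

Under that assumption, the macro-F1 of the proposed structure (62.51%) exceeds the 58.32% of CNN and the 57.38% of CNN+BiLSTM, which is consistent with the conclusion that it has the best overall performance.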
In summary, this embodiment predicts drug-pair interaction relationships at the document level. Compared with the sentence-level prediction methods of the prior art, this greatly reduces the amount of data input to the drug relationship prediction model, and it integrates the occurrences of a drug at multiple different positions into one accurate semantic representation. The drug relationship prediction model can thus extract the true semantic information of the drug in the document, which improves the prediction accuracy of the model.
Referring to Figure 8, Figure 8 is a structural block diagram of an apparatus for predicting an interaction relationship of a drug pair provided by an embodiment of the present application. The modules included in the apparatus of this embodiment are used to execute the steps in the embodiments corresponding to Figures 2, 3, 6 and 7. For details, refer to Figures 2, 3, 6 and 7 and the related descriptions in the corresponding embodiments. For ease of explanation, only the parts related to this embodiment are shown. Referring to Figure 8, the apparatus 800 for predicting an interaction relationship of a drug pair may include an acquisition module 810, a comprehensive entity representation determination module 820, a fused entity representation determination module 830 and a prediction module 840, wherein:
The acquisition module 810 is configured to acquire a target text; the target text includes a plurality of drugs, and each drug appears at least once in the target text.
The comprehensive entity representation determination module 820 is configured to determine, for each drug, the corresponding comprehensive entity representation; the comprehensive entity representation describes the semantic information of the corresponding drug at each of its positions in the target text.
The fused entity representation determination module 830 is configured to determine, for any drug pair among the plurality of drugs, a fused entity representation of the drug pair according to the comprehensive entity representations of the two drugs in the drug pair.
The prediction module 840 is configured to predict the interaction relationship of the drug pair according to the fused entity representation of the drug pair.
In one embodiment, the acquisition module 810 is further configured to:
acquire an initial text, where the initial text includes the plurality of drugs and each drug appears at least once in the initial text; and, if the initial text contains drug names that share a common drug suffix, expand those drug names to obtain the target text.
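A minimal Python sketch of one possible expansion step follows. It assumes that a shared suffix takes the coordinated form "A and B suffix" and that a list of candidate suffix words is available; the function name, the pattern and the example sentence are all hypothetical and are not taken from the application.

```python
import re


def expand_shared_suffix(text, suffixes):
    """Rewrite 'A and B <suffix>' as 'A <suffix> and B <suffix>'.

    `suffixes` is a hypothetical list of head words that two coordinated
    drug names may share.
    """
    for suffix in suffixes:
        pattern = re.compile(
            r"\b(\w+) (and|or) (\w+) " + re.escape(suffix) + r"\b"
        )
        text = pattern.sub(
            lambda m: f"{m.group(1)} {suffix} {m.group(2)} {m.group(3)} {suffix}",
            text,
        )
    return text


initial_text = (
    "tricyclic and tetracyclic antidepressants may interact with MAO inhibitors"
)
print(expand_shared_suffix(initial_text, ["antidepressants"]))
```

A purely lexical pattern like this one would also expand phrases in which the first word is already a complete drug name, so a practical implementation would likely need entity annotations to decide when expansion applies.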
In one embodiment, the comprehensive entity representation determination module 820 is further configured to:
for any one of the drugs, determine the multiple positions of the drug in the target text; generate, according to the multiple positions, a text sequence corresponding to each position; and perform vector processing on the text sequence corresponding to each position to obtain the comprehensive entity representation.
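The position lookup and per-position sequence generation can be sketched in Python as follows. This is an illustration only: it assumes that the text sequence for a position is the target text with that single mention wrapped in marker tokens, which this passage does not specify, and the function names and markers are hypothetical.

```python
import re


def mention_spans(text, drug):
    """Return (start, end) character spans of every mention of `drug`."""
    pattern = re.compile(r"\b" + re.escape(drug) + r"\b", flags=re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]


def mention_sequences(text, drug, open_tag="<e>", close_tag="</e>"):
    """Build one text sequence per mention, marking only that mention.

    Assumption: a sequence is the full text with the current mention wrapped
    in marker tokens so that an encoder can locate it.
    """
    sequences = []
    for start, end in mention_spans(text, drug):
        marked = text[:start] + open_tag + text[start:end] + close_tag + text[end:]
        sequences.append(marked)
    return sequences


target_text = (
    "Aspirin may increase the effect of warfarin. "
    "Patients on warfarin should avoid aspirin."
)
for sequence in mention_sequences(target_text, "warfarin"):
    print(sequence)
```

Each resulting sequence would then be passed to the text encoder so that every mention contributes its own context.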
In one embodiment, the comprehensive entity representation determination module 820 is further configured to:
represent the text sequence corresponding to each position as a vector, thereby obtaining multiple text vectors, where each text vector describes the semantic information of the drug at the corresponding position; and integrate the text vectors to generate the comprehensive entity representation of the drug.
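The integration step can be illustrated with a short Python sketch. The integration operator is not specified in this passage, so element-wise mean pooling is used here purely as an assumption; the function name and the toy vectors are hypothetical.

```python
def integrate_vectors(text_vectors):
    """Combine per-position text vectors into one entity representation.

    Assumption: element-wise mean pooling stands in for the unspecified
    integration operator.
    """
    if not text_vectors:
        raise ValueError("a drug must appear at least once in the target text")
    dim = len(text_vectors[0])
    if any(len(vector) != dim for vector in text_vectors):
        raise ValueError("all text vectors must have the same dimension")
    count = len(text_vectors)
    return [sum(vector[i] for vector in text_vectors) / count for i in range(dim)]


# Two toy vectors standing in for the encoder output at two mention positions.
vectors_for_drug = [[1.0, 2.0, 0.0], [3.0, 4.0, 2.0]]
print(integrate_vectors(vectors_for_drug))
```

Whatever operator is used, the result has a fixed dimension regardless of how many times the drug is mentioned, which lets the downstream layers accept any drug pair.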
In one embodiment, the apparatus 800 for predicting an interaction relationship of a drug pair further includes:
an input module, configured to input the target text into a pre-trained drug relationship prediction model for processing, to obtain the interaction relationship of each drug pair among the plurality of drugs.
In one embodiment, the drug relationship prediction model includes a first activation layer, a second activation layer, a first fully connected layer and a second fully connected layer; the fused entity representation determination module 830 is further configured to:
input the comprehensive entity representation of each of the two drugs into the first activation layer and then the first fully connected layer, to obtain the target vector corresponding to each of the two drugs; and concatenate the two target vectors and input the concatenated target vector into the second activation layer and then the second fully connected layer, to obtain the fused entity representation.
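The layer order described above can be sketched in plain Python. The sketch assumes ReLU as the activation function, shares the first fully connected layer between the two drugs, and uses small random weights and toy dimensions; none of these choices is stated in this passage, and all names are hypothetical.

```python
import random


def relu(vector):
    """Assumed activation function; the passage does not name one."""
    return [max(0.0, x) for x in vector]


def linear(vector, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def init_linear(in_dim, out_dim, rng):
    """Random weights for illustration; a trained model would supply these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def fuse_pair(rep_a, rep_b, fc1, fc2):
    """Activation -> FC for each drug, concatenate, then activation -> FC."""
    target_a = linear(relu(rep_a), *fc1)
    target_b = linear(relu(rep_b), *fc1)
    concatenated = target_a + target_b  # list concatenation
    return linear(relu(concatenated), *fc2)


rng = random.Random(0)
fc1 = init_linear(in_dim=4, out_dim=3, rng=rng)   # first fully connected layer
fc2 = init_linear(in_dim=6, out_dim=2, rng=rng)   # second fully connected layer
fused = fuse_pair([0.2, -0.1, 0.5, 0.3], [0.4, 0.0, -0.2, 0.1], fc1, fc2)
print(len(fused))
```

Because the two target vectors are concatenated in order, the fused representation depends on which drug comes first, which may or may not be desirable depending on whether the relation labels are directional.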
In one embodiment, the drug relationship prediction model is trained on a training set; the apparatus 800 for predicting an interaction relationship of a drug pair further includes the following modules for obtaining the training set:
An original dataset acquisition module, configured to acquire an original dataset that includes multiple original texts.
A statistics module, configured to count the number of drugs contained in each original text.
A screening module, configured to select from the original dataset the original texts that contain at least two drugs, to obtain an original data subset.
A label processing module, configured to perform label processing on each original text in the original data subset, to obtain the training set.
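The counting and screening steps can be sketched in Python as follows. The sketch assumes that each original text already carries a list of annotated drug mentions; the dictionary layout and function names are hypothetical.

```python
def count_drugs(document):
    """Number of distinct drugs in a document (case-insensitive)."""
    return len({name.lower() for name in document["drugs"]})


def screen_documents(dataset, min_drugs=2):
    """Keep only documents that contain at least `min_drugs` distinct drugs."""
    return [doc for doc in dataset if count_drugs(doc) >= min_drugs]


# Toy dataset; the "drugs" field stands in for entity annotations.
original_dataset = [
    {"id": "d1", "drugs": []},
    {"id": "d2", "drugs": ["warfarin", "Warfarin"]},
    {"id": "d3", "drugs": ["warfarin", "aspirin", "aspirin"]},
]
subset = screen_documents(original_dataset)
print([doc["id"] for doc in subset])
```

A document with fewer than two distinct drugs cannot form a drug pair, so it contributes no training example and is dropped before labeling.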
In one embodiment, the label processing module is further configured to:
acquire every drug pair contained in the original text and the relationship label of each drug pair; and, if a drug pair has multiple relationship labels, determine the relationship label with the highest priority among them, according to a preset label priority, as the relationship label of the drug pair.
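A minimal Python sketch of the priority rule follows. The label names, their priority order and the default label for pairs without any annotation are all assumptions made for illustration; this passage states only that a preset priority exists.

```python
from itertools import combinations

# Hypothetical priority order, highest priority first.
LABEL_PRIORITY = ["mechanism", "effect", "advise", "int", "negative"]


def resolve_pair_labels(drugs, sentence_labels, default="negative"):
    """Assign one document-level label to every drug pair.

    `sentence_labels` maps an alphabetically sorted drug pair to the list of
    labels the pair received across the document. Pairs with no annotation
    receive `default`, which is an assumption of this sketch.
    """
    rank = {label: index for index, label in enumerate(LABEL_PRIORITY)}
    resolved = {}
    for pair in combinations(sorted(drugs), 2):
        labels = sentence_labels.get(pair, [default])
        resolved[pair] = min(labels, key=rank.__getitem__)
    return resolved


drugs_in_text = ["aspirin", "warfarin", "ibuprofen"]
labels_in_text = {("aspirin", "warfarin"): ["advise", "effect"]}
for pair, label in resolve_pair_labels(drugs_in_text, labels_in_text).items():
    print(pair, label)
```

Resolving to a single label per pair keeps the task a single-label classification problem even when different sentences of the same document describe the pair differently.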
It should be understood that, in the structural block diagram of the apparatus for predicting an interaction relationship of a drug pair shown in Figure 8, each module is used to execute the steps in the embodiments corresponding to Figures 2, 3, 6 and 7. Those steps have been explained in detail in the above embodiments; for details, refer to Figures 2, 3, 6 and 7 and the related descriptions in the corresponding embodiments, which are not repeated here.
Figure 9 is a structural block diagram of a terminal device provided by an embodiment of the present application. As shown in Figure 9, the terminal device 900 of this embodiment includes a processor 910, a memory 920, and a computer program 930 that is stored in the memory 920 and executable on the processor 910, for example a program of the method for predicting an interaction relationship of a drug pair. When the processor 910 executes the computer program 930, the steps in the embodiments of the above method for predicting an interaction relationship of a drug pair are implemented, for example S101 to S104 shown in Figure 1. Alternatively, when the processor 910 executes the computer program 930, the functions of the modules in the embodiment corresponding to Figure 8 are implemented, for example the functions of modules 810 to 840 shown in Figure 8; for details, refer to the related description in the embodiment corresponding to Figure 8.
For example, the computer program 930 may be divided into one or more modules, which are stored in the memory 920 and executed by the processor 910 to implement the method for predicting an interaction relationship of a drug pair provided by the embodiments of the present application. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments describe the execution process of the computer program 930 in the terminal device 900. For example, the computer program 930 may implement the method for predicting an interaction relationship of a drug pair provided by the embodiments of the present application.
The terminal device 900 may include, but is not limited to, the processor 910 and the memory 920. Those skilled in the art will understand that Figure 9 is only an example of the terminal device 900 and does not limit it; the terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device may also include input and output devices, network access devices, buses and the like.
The processor 910 may be a central processing unit, or another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 920 may be an internal storage unit of the terminal device 900, such as a hard disk or memory of the terminal device 900. The memory 920 may also be an external storage device of the terminal device 900, such as a plug-in hard disk, a smart memory card or a flash memory card equipped on the terminal device 900. Further, the memory 920 may include both an internal storage unit and an external storage device of the terminal device 900.
An embodiment of the present application provides a computer-readable storage medium, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the method for predicting an interaction relationship of a drug pair in the above embodiments is implemented.
An embodiment of the present application provides a computer program product; when the computer program product runs on a terminal device, the terminal device executes the method for predicting an interaction relationship of a drug pair in the above embodiments.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (10)

  1. A method for predicting an interaction relationship of a drug pair, characterized in that the method comprises:
    acquiring a target text, wherein the target text includes a plurality of drugs and each drug appears at least once in the target text;
    determining, for each drug, a corresponding comprehensive entity representation, wherein the comprehensive entity representation describes semantic information of the corresponding drug at each of its positions in the target text;
    determining, for any drug pair among the plurality of drugs, a fused entity representation of the drug pair according to the comprehensive entity representations of the two drugs in the drug pair; and
    predicting the interaction relationship of the drug pair according to the fused entity representation of the drug pair.
  2. The method according to claim 1, characterized in that acquiring the target text comprises:
    acquiring an initial text, wherein the initial text includes the plurality of drugs and each drug appears at least once in the initial text; and
    if the initial text contains drug names that share a common drug suffix, expanding the drug names to obtain the target text.
  3. The method according to claim 1, characterized in that determining, for each drug, the corresponding comprehensive entity representation comprises:
    for any one of the drugs, determining multiple positions of the drug in the target text;
    generating, according to the multiple positions, a text sequence corresponding to each position; and
    performing vector processing on the text sequence corresponding to each position to obtain the comprehensive entity representation.
  4. The method according to claim 3, characterized in that performing vector processing on the text sequence corresponding to each position to obtain the comprehensive entity representation comprises:
    representing the text sequence corresponding to each position as a vector to obtain multiple text vectors, wherein each text vector describes the semantic information of the drug at the corresponding position; and
    integrating the text vectors to generate the comprehensive entity representation of the drug.
  5. The method according to any one of claims 1 to 4, characterized in that the steps of determining, for each drug, the corresponding comprehensive entity representation, wherein the comprehensive entity representation describes the semantic information of the corresponding drug at each of its positions in the target text; determining, for any drug pair among the plurality of drugs, the fused entity representation of the drug pair according to the comprehensive entity representations of the two drugs in the drug pair; and predicting the interaction relationship of the drug pair according to the fused entity representation of the drug pair, comprise:
    inputting the target text into a pre-trained drug relationship prediction model for processing, to obtain the interaction relationship of each drug pair among the plurality of drugs.
  6. The method according to claim 5, characterized in that the drug relationship prediction model includes a first activation layer, a second activation layer, a first fully connected layer and a second fully connected layer, and determining the fused entity representation of the drug pair according to the comprehensive entity representations of the two drugs in the drug pair comprises:
    inputting the comprehensive entity representation of each of the two drugs into the first activation layer and then the first fully connected layer, to obtain the target vector corresponding to each of the two drugs; and
    concatenating the two target vectors, and inputting the concatenated target vector into the second activation layer and then the second fully connected layer, to obtain the fused entity representation.
  7. The method according to claim 5, characterized in that the drug relationship prediction model is trained on a training set, and the training set is obtained by:
    acquiring an original dataset, wherein the original dataset includes multiple original texts;
    counting the number of drugs contained in each of the original texts;
    selecting from the original dataset the original texts that contain at least two drugs, to obtain an original data subset; and
    performing label processing on each of the original texts in the original data subset, to obtain the training set.
  8. The method according to claim 7, characterized in that performing label processing on each of the original texts in the original data subset to obtain the training set comprises:
    acquiring every drug pair contained in the original text and the relationship label of each drug pair; and
    if a drug pair has multiple relationship labels, determining, according to a preset label priority, the relationship label with the highest priority among the multiple relationship labels as the relationship label of the drug pair.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1 to 8.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.
PCT/CN2022/137046 2022-03-17 2022-12-06 Method for predicting interaction relationship of drug pair, and device and medium WO2023173823A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210263559.8A CN114678141A (en) 2022-03-17 2022-03-17 Method, apparatus and medium for predicting drug-pair interaction relationship
CN202210263559.8 2022-03-17

Publications (1)

Publication Number Publication Date
WO2023173823A1 true WO2023173823A1 (en) 2023-09-21

Family

ID=82073728

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137046 WO2023173823A1 (en) 2022-03-17 2022-12-06 Method for predicting interaction relationship of drug pair, and device and medium

Country Status (2)

Country Link
CN (1) CN114678141A (en)
WO (1) WO2023173823A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117198426A (en) * 2023-11-06 2023-12-08 武汉纺织大学 Multi-scale medicine-medicine response interpretable prediction method and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114678141A (en) * 2022-03-17 2022-06-28 中国科学院深圳理工大学(筹) Method, apparatus and medium for predicting drug-pair interaction relationship

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020671A (en) * 2019-03-08 2019-07-16 西北大学 The building of drug relationship disaggregated model and classification method based on binary channels CNN-LSTM network
CN111241298A (en) * 2020-01-08 2020-06-05 腾讯科技(深圳)有限公司 Information processing method, apparatus and computer readable storage medium
CN112860816A (en) * 2021-03-01 2021-05-28 三维通信股份有限公司 Construction method and detection method of interaction relation detection model of drug entity pair
EP3859745A1 (en) * 2020-02-03 2021-08-04 National Centre for Scientific Research "Demokritos" System and method for identifying drug-drug interactions
CN113806531A (en) * 2021-08-26 2021-12-17 西北大学 Drug relationship classification model construction method, drug relationship classification method and system
CN114678141A (en) * 2022-03-17 2022-06-28 中国科学院深圳理工大学(筹) Method, apparatus and medium for predicting drug-pair interaction relationship

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111221982B (en) * 2020-01-13 2023-09-01 腾讯科技(深圳)有限公司 Information processing method, information processing apparatus, computer readable storage medium, and computer device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117198426A (en) * 2023-11-06 2023-12-08 武汉纺织大学 Multi-scale medicine-medicine response interpretable prediction method and system
CN117198426B (en) * 2023-11-06 2024-01-30 武汉纺织大学 Multi-scale medicine-medicine response interpretable prediction method and system

Also Published As

Publication number Publication date
CN114678141A (en) 2022-06-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22931834

Country of ref document: EP

Kind code of ref document: A1