WO2023173823A1

WO2023173823A1 - Method for predicting interaction relationship of drug pair, and device and medium

Info

Publication number: WO2023173823A1
Application number: PCT/CN2022/137046
Authority: WO
Inventors: 唐继军; 窦明亮; 郗文辉; 郭菲
Original assignee: 中国科学院深圳理工大学(筹)
Priority date: 2022-03-17
Filing date: 2022-12-06
Publication date: 2023-09-21
Also published as: CN114678141A

Abstract

Disclosed are a method for predicting an interaction relationship of a drug pair, and a device and a medium, which are applicable to the technical field of biomedicine. The method comprises: acquiring a target text, the target text comprising a plurality of drugs, and each drug appearing at least once in the target text (S201); respectively determining a comprehensive entity representation corresponding to each drug, the comprehensive entity representation being used for describing semantic information of the corresponding drug at each position in the target text (S202); with regard to any drug pair among the plurality of drugs, determining a fusion entity representation of the drug pair according to the comprehensive entity representation of two drugs in the drug pair (S203); and according to the fusion entity representation of the drug pair, predicting the interaction relationship of the drug pair (S204). By means of the method, the accuracy of predicting the interaction relationship of each drug pair in a document can be improved.

Description

Prediction methods, equipment and media for drug-drug interactions

Technical field

This application belongs to the field of biomedical technology, and in particular relates to a method, equipment and medium for predicting drug-drug interaction relationships.

Background

Drug-Drug Interaction (DDI) prediction is an important research field in pharmacovigilance and plays a vital role in drug virtual screening, patient treatment plans, treatment effects, and patient safety research.

All current DDI predictions are based on sentence-level relationships. First, the abstract or part of the drug description of each scientific document is divided into multiple sentences, each of which contains several drugs. Existing work then inputs each drug pair into a specific network for processing and predicts the interaction relationship of each drug pair.

However, the above-mentioned DDI prediction methods all process drug pairs at the sentence level, and in practice, the interaction relationship between a drug pair is often determined by multiple sentences. Therefore, in the existing technology, the accuracy of predicting the interaction relationship of each drug pair in the document is low.

Summary of the invention

The embodiments of the present application provide a method, device and storage medium for predicting the interaction relationship of drug pairs, which can solve the problem of low accuracy in predicting the interaction relationship of each drug pair in the document.

In the first aspect, embodiments of the present application provide a method for predicting drug-drug interaction relationships, which method includes:

Obtain the target text; the target text includes multiple drugs, and each drug appears at least once in the target text;

The comprehensive entity representation corresponding to each drug is determined separately. The comprehensive entity representation is used to describe the semantic information of the corresponding drug at various positions in the target text;

For any drug pair among the multiple drugs, determine the fused entity representation of the drug pair based on the comprehensive entity representation of two drugs in the drug pair;

Predict the interaction relationship of drug pairs based on their fused entity representation.

In the second aspect, embodiments of the present application provide a device for predicting drug pair interactions, which device includes:

The acquisition module is used to obtain the target text; the target text includes multiple drugs, and each drug appears at least once in the target text;

The comprehensive entity representation determination module is used to determine the comprehensive entity representation corresponding to each drug respectively. The comprehensive entity representation is used to describe the semantic information of the corresponding drug at various positions in the target text;

The fusion entity representation determination module is used for determining the fusion entity representation of the drug pair based on the comprehensive entity representation of two drugs in the drug pair for any drug pair among multiple drugs;

The prediction module is used to predict the interaction relationship of the drug pair based on the fusion entity representation of the drug pair.

In a third aspect, embodiments of the present application provide a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the method of the first aspect is implemented.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium. The computer-readable storage medium stores a computer program. When the computer program is executed by a processor, the method of the first aspect is implemented.

In a fifth aspect, embodiments of the present application provide a computer program product, which when the computer program product is run on a terminal device, causes the terminal device to execute the method of the first aspect.

Compared with the existing technology, the beneficial effects of the embodiments of the present application are as follows: by obtaining the comprehensive entity representation of each drug in the target text, the terminal device can use a single comprehensive entity representation to comprehensively describe the semantic information of the corresponding drug at each position in the target text. Afterwards, for each target drug pair, the terminal device can fuse the comprehensive entity representations of the two drugs to generate a fused entity representation of the drug pair. The terminal device can then accurately predict the interaction relationship of the drug pair based on the fused entity representation.

Description of the drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic diagram of an implementation method for generating sentence-level instances provided by an embodiment of the present application;

Figure 2 is an implementation flow chart of a method for predicting drug-drug interaction relationships provided by an embodiment of the present application;

Figure 3 is a schematic diagram of an implementation method for generating a comprehensive entity representation in a method for predicting drug-drug interaction relationships provided by an embodiment of the present application;

Figure 4 is a schematic structural diagram of a model that generates a comprehensive entity representation of a drug provided by an embodiment of the present application;

Figure 5 is a schematic flow chart of predicting drug pair interaction relationships using a drug relationship prediction model provided by an embodiment of the present application;

Figure 6 is a schematic diagram of an implementation method for generating a training set in a method for predicting drug-drug interaction relationships provided by an embodiment of the present application;

Figure 7 is a schematic diagram of an implementation method for determining relationship labels in a method for predicting drug-drug interaction relationships provided by another embodiment of the present application;

Figure 8 is a schematic structural diagram of a device for predicting drug pair interaction relationships provided by an embodiment of the present application;

Figure 9 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.

Detailed description

In the following description, for the purpose of explanation rather than limitation, specific details such as specific system structures and technologies are provided to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It will be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.

In addition, in the description of this application and the appended claims, the terms "first", "second", "third", etc. are only used to distinguish the description, and cannot be understood as indicating or implying relative importance.

The method for predicting drug interaction relationships provided by the embodiments of this application can be applied to terminal devices such as tablet computers, wearable devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, etc. The embodiments of this application do not impose any restrictions on the specific type of terminal device.

As described in the background, all current DDI predictions are based on sentence-level relationships. Training a sentence-level drug relationship prediction model usually requires a large data set. Specifically, after the data set is obtained, the abstract of each scientific document, or part of a drug description text, is treated as one document and saved in a separate XML file. Each XML file consists of one or more sentences, and each sentence contains several drugs together with the relationship label for every pair of drugs.

Existing work therefore converts each sentence into several instances, each of which focuses on only two drugs. See Figure 1, a schematic diagram of an implementation method for generating sentence-level instances provided by an embodiment of the present application. Assuming a sentence (Sentence) contains three distinct drugs (Drug1, Drug2, Drug3), the terminal device needs to convert it into three instances (Instance_A, Instance_B, Instance_C), each of which focuses on only one drug pair (e1 and e2).

It follows that the number of instances generated from a sentence is determined by the number of drugs it contains, as a count of combinations; for example, a sentence containing 4 drugs must be converted into $C_4^2 = 6$ instances. The number of instances that must be processed to train a sentence-level drug relationship prediction model is therefore several times the number of sentences in the original text, so such a model requires more memory during training and takes longer to train. In this embodiment, by contrast, each drug in a text is described by a single comprehensive entity representation, and each drug pair corresponds to a single fused entity representation. This differs from sentence-level training, in which the text is divided into multiple sentences and the drugs and drug pairs in each sentence are given separate vector representations. In this way, the terminal device can greatly reduce the memory space required for model training.
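The instance count described above can be illustrated with a minimal sketch. The function name `sentence_instances` is hypothetical and is not part of the application; it simply enumerates the unordered drug pairs of one sentence.

```python
from itertools import combinations

def sentence_instances(drugs):
    """Return every unordered drug pair in a sentence, one pair per instance."""
    unique = list(dict.fromkeys(drugs))  # drop repeated names, keep first-seen order
    return list(combinations(unique, 2))

# A sentence with 4 distinct drugs yields C(4, 2) instances.
instances = sentence_instances(["Drug1", "Drug2", "Drug3", "Drug4"])
print(len(instances))
```

Because the count grows quadratically with the number of drugs in a sentence, the number of sentence-level instances quickly exceeds the number of sentences.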

In addition, since a sentence-level drug relationship prediction model focuses on only two drugs in a sentence during training, its prediction is based only on the semantic information between those two drugs in that sentence. In practice, however, the relationship between the drugs of a pair is often determined by multiple sentences. Existing methods for predicting drug-pair interaction relationships therefore cannot mine more complete semantic information about the interactions between drug pairs from biomedical texts.

To solve this problem, in this embodiment the terminal device processes the target text through the following steps S201-S204, so as to generate semantic information that describes the interaction between two drugs in the target text and thereby improve the accuracy of predicting drug interaction relationships.

The following is an exemplary description of a method for predicting drug-drug interaction relationships provided in this application with reference to specific examples.

Please refer to Figure 2. Figure 2 shows an implementation flow chart of a method for predicting drug pair interaction relationships provided by an embodiment of the present application. The method includes the following steps:

S201. The terminal device acquires the target text; the target text includes multiple drugs, and each drug appears at least once in the target text.

In one embodiment, the above-mentioned target text is usually a text related to the pharmaceutical field, which includes but is not limited to texts in the form of journals, papers, etc. Among them, the target text can be text in Chinese, English or other languages, and there is no limit to this.

In one embodiment, in order to predict drug pair interactions, the target text must include at least two drugs. Otherwise, no interaction relationship between two drugs can be predicted from the target text.

It is understood that any drug may appear multiple times, at different positions in the target text. Therefore, this embodiment places no restriction on where or how often a drug appears.

In one embodiment, the terminal device may obtain the target text through the following steps, as detailed below:

The terminal device obtains the initial text, which includes multiple drugs, each appearing at least once in the initial text; if the initial text contains drug names that use a shared suffix, the drug names are expanded to obtain the target text.

In one embodiment, the above-mentioned initial text is an unprocessed text. It can be a text crawled by the terminal device from the network based on the name of a drug, or a text specified in advance and stored in the terminal device. This embodiment does not limit the way in which the terminal device obtains the initial text.

In one embodiment, the drug name is the entity name of the drug. A shared suffix arises when the names of multiple drugs have a part in common: the initial text may abbreviate the names so that the drugs share that common part as a single suffix. The shared suffix is the common part of the names.

For example, the entity names of two drugs may be: 1) diagnostic monoclonal antibodies; 2) therapeutic monoclonal antibodies. The initial text containing the two drugs may read "...when treated with other diagnostic or therapeutic monoclonal antibodies." Here the two drug names share "monoclonal antibodies" as a suffix. In this case, the terminal device needs to expand this sentence in the initial text to obtain the target text, changing it to: "...when treated with other diagnostic monoclonal antibodies or therapeutic monoclonal antibodies".
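The expansion in this example can be sketched as follows. This is only an illustration under assumptions: the function `expand_shared_suffix` and its arguments are hypothetical, and the sketch presumes that the two name prefixes and the shared suffix are already known (for instance from entity annotations); the application does not specify how they are detected.

```python
import re

def expand_shared_suffix(text, first, second, suffix):
    """Rewrite '<first> or <second> <suffix>' as '<first> <suffix> or <second> <suffix>'."""
    pattern = re.compile(
        rf"\b{re.escape(first)}\s+(or|and)\s+{re.escape(second)}\s+{re.escape(suffix)}"
    )

    def repl(match):
        # Repeat the shared suffix after the first name, keeping the conjunction.
        return f"{first} {suffix} {match.group(1)} {second} {suffix}"

    return pattern.sub(repl, text)

initial = "when treated with other diagnostic or therapeutic monoclonal antibodies."
target = expand_shared_suffix(initial, "diagnostic", "therapeutic", "monoclonal antibodies")
print(target)
```

After expansion, each drug name appears in full, so every occurrence can be located by its complete entity name.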

S202. The terminal device determines the comprehensive entity representation corresponding to each drug. The comprehensive entity representation is used to describe the semantic information of the corresponding drug at each position in the target text.

In one embodiment, the comprehensive entity representation is used to comprehensively describe the semantic information of the corresponding drug at each position in the target text. Specifically, when a drug appears at multiple positions in the target text, suppose only the sentence containing the drug at one position is processed to obtain semantic information representing the drug in that sentence. If the subsequent drug-drug interaction relationship is predicted from that information alone, the prediction accuracy may be low, because semantic information extracted from a single sentence cannot stand in for the semantic information of the drug at its other positions in the target text.

Based on this, in this embodiment, the terminal device uses the comprehensive entity representation of the drug in subsequent processing, thereby improving the prediction accuracy of the drug-drug interaction relationship.

In a specific embodiment, referring to Figure 3, in S202, the terminal device may be implemented through the following sub-steps S301-S303, as detailed below:

S301. For any kind of drug, the terminal device determines multiple positions of the drug in the target text.

S302. The terminal device generates a text sequence corresponding to each position according to multiple positions.

S303. The terminal device performs vector processing on the text sequence corresponding to each position to obtain a comprehensive entity representation.

In one embodiment, the above-mentioned position is the position information of the drug in the target text, and the text sequence is a sequence generated from that position. Specifically, the terminal device can determine the position information of the drug in the target text from the entity name of the drug. The terminal device can also use "[" and "]" to mark the start and end of each occurrence of the drug. When a drug has multiple positions (that is, it appears multiple times in the target text), each position can additionally be assigned a sequence number according to the order in which the occurrences appear in the target text.

For example, the terminal device can represent the text sequence of the drug in the following manner:

$X = \{x_1, x_2, \ldots, x_n\}$, where $X$ represents the entire text sequence of the target text, $x_n$ represents the n-th character in the target text, and $n$ is also the total number of characters in the target text. Assume that a given drug Drug-α consists of $k$ characters and occurs twice; its text sequences are then $P1 = \{x_i, x_{i+1}, \ldots, x_{i+k-1}\}$ and $P2 = \{x_j, x_{j+1}, \ldots, x_{j+k-1}\}$. The 1 and 2 in P1 and P2 represent the order in which the occurrences appear in the target text. $x_i$ represents the i-th character in the target text, at which the drug first appears; since the drug name consists of $k$ characters, $x_{i+k-1}$ is the end position of the first occurrence. It can be understood that if the drug appears more times, there are correspondingly more text sequences.
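A minimal sketch of locating and marking a drug's occurrences follows. The helper names `drug_positions` and `mark_drug` are hypothetical, the sample sentence uses placeholder drug names, and indices are 0-based here, whereas the description above counts characters from 1.

```python
def drug_positions(text, name):
    """Return the (start, end) character span of every occurrence, in order of appearance."""
    spans = []
    start = text.find(name)
    while start != -1:
        spans.append((start, start + len(name) - 1))  # inclusive end, i.e. x_{i+k-1}
        start = text.find(name, start + len(name))
    return spans

def mark_drug(text, name):
    """Wrap each occurrence of the drug name in '[' and ']'."""
    return text.replace(name, f"[{name}]")

text = "DrugA increases DrugB levels. Avoid DrugA with DrugC."
spans = drug_positions(text, "DrugA")   # one span per text sequence P1, P2, ...
marked = mark_drug(text, "DrugA")
```

Each span corresponds to one text sequence, and the index of the span in the list gives the order of appearance.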

In one embodiment, performing vector processing on a text sequence means converting the text sequence into a vector form that the terminal device can process.

Specifically, the terminal device can compute a vector representation of the text sequence corresponding to each position to obtain multiple text vectors, each describing the semantic information of the drug at the corresponding position, and then integrate the text vectors to generate the comprehensive entity representation of the drug.

The vector representation of each text sequence can be generated by a model. For example, the text sequences can be processed by BioBERT (a pre-trained biomedical language model) as a step toward obtaining the comprehensive entity representation. The text vectors that BioBERT generates from the two text sequences P1 and P2 above can be written as $\mathrm{DrugP1} = \{\mathrm{vp1}_1, \mathrm{vp1}_2, \ldots, \mathrm{vp1}_k\}$ and $\mathrm{DrugP2} = \{\mathrm{vp2}_1, \mathrm{vp2}_2, \ldots, \mathrm{vp2}_k\}$, where DrugP1 represents the text vector corresponding to the P1 text sequence and $\mathrm{vp1}_k$ represents the vector corresponding to the k-th character in the P1 text sequence.

It can be understood that, at this point, each text vector describes only the semantic information of the drug at the corresponding position in the target text. To obtain a comprehensive entity representation of the drug, the terminal device therefore also needs to integrate the multiple text vectors of the drug. Specifically, the terminal device can integrate the text vectors through Formula 1 and Formula 2, which work as follows.

Drug_e1 represents the integrated vector obtained from the first text vector: for DrugP1, the vectors that make up DrugP1 are summed and then averaged to give the integrated vector. After that, the integrated vectors corresponding to the drug are averaged again to generate the comprehensive entity representation of the drug, Drug_α.
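Formulas 1 and 2 themselves are not reproduced in this text. Read from the description above, and for a drug that occurs twice, they plausibly take the following form; this is a reconstruction from the surrounding wording (using the symbols vp1 and vp2 defined earlier), not a quotation of the original formulas.

```latex
\mathrm{Drug}_{e1} = \frac{1}{k}\sum_{t=1}^{k} \mathrm{vp1}_t, \qquad
\mathrm{Drug}_{e2} = \frac{1}{k}\sum_{t=1}^{k} \mathrm{vp2}_t \tag{1}

\mathrm{Drug}_{\alpha} = \frac{1}{2}\left(\mathrm{Drug}_{e1} + \mathrm{Drug}_{e2}\right) \tag{2}
```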

Specifically, reference may be made to FIG. 4, which is a schematic structural diagram of a model for generating a comprehensive entity representation of a drug according to an embodiment of the present application. Drug-α at the bottom of Figure 4 represents the name of the drug, and $\{x_i, \ldots, x_{i+k-1}\}$ represents a text sequence of Drug-α. The BioBERT model performs vector representation processing on the text sequences, and the results are integrated into Drug_e1 and Drug_e2 in the figure. These are then processed by Formula 1 and Formula 2 above to generate the top-level Drug_α, that is, the comprehensive entity representation. It should be noted that this process gives the terminal device a final comprehensive entity representation by integrating all parts of a single occurrence of the drug and then integrating the vectors of all occurrences of the same drug.
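The two-stage averaging can be sketched in plain Python. The function names are hypothetical, and the small 2-dimensional lists are toy stand-ins for the per-character vectors that BioBERT would produce; no BioBERT call is made here.

```python
from statistics import fmean

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [fmean(column) for column in zip(*vectors)]

def comprehensive_representation(mentions):
    """mentions: one list of per-character vectors for each occurrence of the drug."""
    integrated = [mean_vector(m) for m in mentions]  # Drug_e1, Drug_e2, ...
    return mean_vector(integrated)                   # Drug_alpha

# Toy stand-ins for DrugP1 and DrugP2 (k = 2 characters, 2 dimensions each).
drug_p1 = [[1.0, 0.0], [3.0, 2.0]]
drug_p2 = [[0.0, 4.0], [2.0, 2.0]]
drug_alpha = comprehensive_representation([drug_p1, drug_p2])
```

Because the second stage averages over however many occurrences are passed in, the same sketch also covers drugs that appear more than twice.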

S203. For any drug pair among multiple drugs, the terminal device determines the fused entity representation of the drug pair based on the comprehensive entity representations of the two drugs in the drug pair.

In one embodiment, since the comprehensive entity representation of a single drug describes the semantic information of that drug at each position in the target text, the fused entity representation can be considered to describe the semantic information of the interaction between the two drugs in the target text.

Among them, the terminal device can process the comprehensive entity representation through the following formula 3 to obtain the fused entity representation:

$H_1 = W_1[\tanh(\mathrm{Drug}_\alpha)] + b_1, \quad H_2 = W_2[\tanh(\mathrm{Drug}_\beta)] + b_2$ (3)

Here, $H_1$ and $H_2$ respectively represent the target vectors obtained after processing the comprehensive entity representations, $W_1$ and $W_2$ represent known parameter matrices, and $b_1$ and $b_2$ represent known offset terms; tanh denotes applying the hyperbolic tangent to the comprehensive entity representation.

After obtaining $H_1$ and $H_2$, the terminal device can splice $H_1$ and $H_2$ and input the result into Formula 4 to obtain the fused entity representation.

$H_0 = W_3[\mathrm{concat}(H_1, H_2)] + b_3$ (4)

Here, $H_0$ is the fused entity representation of the drug pair, $W_3$ represents a known parameter matrix, and $b_3$ represents a known offset term; concat denotes concatenation (that is, splicing $H_1$ and $H_2$).

It should be noted that the above formulas are written for drugs that appear twice in the target text. When a drug appears a different number of times, the formulas should be adapted accordingly.
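Formulas 3 and 4 can be sketched in plain Python as follows. The helper names, the `params` dictionary layout, and all numeric values are assumptions chosen for illustration; in practice the parameters would come from training.

```python
import math

def linear(weights, vector, bias):
    """Compute W·v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def fuse(drug_alpha, drug_beta, params):
    """Formula 3 for each drug, then Formula 4 on the concatenated result."""
    h1 = linear(params["W1"], [math.tanh(x) for x in drug_alpha], params["b1"])
    h2 = linear(params["W2"], [math.tanh(x) for x in drug_beta], params["b2"])
    return linear(params["W3"], h1 + h2, params["b3"])  # h1 + h2 concatenates the lists

# Illustrative parameters: 2-dimensional inputs, 3 output classes.
params = {
    "W1": [[1.0, 0.0], [0.0, 1.0]], "b1": [1.0, 2.0],
    "W2": [[1.0, 0.0], [0.0, 1.0]], "b2": [3.0, 4.0],
    "W3": [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 1.0]],
    "b3": [0.0, 0.0, 0.5],
}
h0 = fuse([0.2, -0.4], [1.5, 0.3], params)
```

The length of `h0` equals the number of rows of `W3`, which in this sketch is taken to be the number of interaction relationship types.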

S204. The terminal device predicts the interaction relationship of the drug pair based on the fusion entity representation of the drug pair.

In one embodiment, the above S203 has explained that the fused entity representation can be used to describe the semantic information of the interaction between two drugs in the target text. Based on this, when the terminal device predicts the interaction relationship of the drug pair based on the fused entity representation, its prediction accuracy will be higher.

Specifically, the terminal device can predict the interaction relationship between drug pairs through the following formula 5:

$\mathrm{Type} = \mathrm{softmax}(H_0)$ (5)

Here, softmax represents the classification function, which processes $H_0$ and outputs the probability that the drug pair belongs to each interaction relationship. The interaction relationship with the highest probability is then taken as the final predicted interaction relationship of the drug pair.
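A minimal sketch of Formula 5 and the final selection step is given below. The label list is an assumption: it uses the relation types of the DDIExtraction 2013 corpus plus a "none" class, which this text does not enumerate, and the function names are hypothetical.

```python
import math

# Assumed label set (not listed in this text).
LABELS = ["none", "advise", "effect", "mechanism", "int"]

def softmax(logits):
    """Convert scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_type(h0, labels=LABELS):
    """Return the most probable interaction relationship and all probabilities."""
    probs = softmax(h0)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

label, probs = predict_type([0.1, 2.0, 0.3, -1.0, 0.0])
```

Since softmax is monotonic, the selected label is the one with the largest entry of the fused entity representation.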

In this embodiment, by obtaining the corresponding comprehensive entity representation of each drug in the target text, the terminal device can use a comprehensive entity representation to comprehensively describe the semantic information of the corresponding drug at each position in the target text. Afterwards, for each target drug pair, the terminal device can fuse according to the comprehensive entity representation of each drug to generate a fused entity representation of the drug pair. Furthermore, the terminal device can accurately predict the interaction relationship of the drug pair based on the fused entity representation.

In one embodiment, the above-mentioned S202-S204 may all be performed on the target text by a drug relationship prediction model in the terminal device. That is, after executing S201, the terminal device can input the obtained target text into the drug relationship prediction model to predict the interaction relationship of each drug pair among the multiple drugs.

Specifically, the drug relationship prediction model may include a first activation layer, a second activation layer, a first fully connected layer and a second fully connected layer. The first activation layer performs the tanh processing of Formula 3 on the comprehensive entity representation. The first fully connected layer performs the $W_1[\cdot] + b_1$ or $W_2[\cdot] + b_2$ processing of Formula 3 on the vector output by the tanh function to obtain the target vector. The terminal device can then splice the target vectors of the two drugs and input them to the second activation layer. The second activation layer performs the concat processing of Formula 4 on the spliced target vectors, and the resulting vector is input to $W_3[\cdot] + b_3$ to obtain the fused entity representation.

It should be noted that the above example only illustrates the model structure in which the drug relationship prediction model processes the comprehensive entity representation of the two drugs in the drug pair and generates the fused entity representation of the drug pair. That is, only the model structure for processing S203 is explained. Among them, the drug relationship prediction model should also include a model structure for executing the processes of S202 and S204, which will not be explained one by one in this embodiment.

In a specific embodiment, please refer to FIG. 5, which is a schematic flowchart of predicting the interaction relationship of drug pairs using a drug relationship prediction model according to an embodiment of the present application. Data processing converts the sentence-level DDI2013 data set (sentence-level DDI Extraction 2013) into a document-level DDI2013 data set (document-level DDI Extraction 2013). Key information is then loaded from the document-level data set: a text sequence (Article seq) is established for each text, which includes but is not limited to the text sequence of the entire text and the text sequence of each drug; the drug pairs (Pairs) are determined; and the drug information is generated. Next, for each determined drug pair, document-entity embedding is performed on each drug in the pair, that is, a comprehensive entity representation (drug emb) is generated for each drug. tanh + fully-connected processing is then applied to each comprehensive entity representation: it is input in turn to the first activation layer and the first fully connected layer to obtain the target vector of each drug ($H_1$ and $H_2$). The two target vectors are spliced, and the spliced vector is input to the second activation layer and the second fully connected layer to obtain the fused entity representation ($H_0$). Finally, the fused entity representation is input to the Softmax layer for classification, yielding the interaction relationship (Type) of the drug pair.

In one embodiment, the drug relationship prediction model is a pre-trained model. For example, the above-mentioned drug relationship prediction model can be BERT, SciBERT, BioBERT and other models. In this embodiment, the drug relationship prediction model may specifically be BioBERT.

In one embodiment, to address the above-mentioned problem that a sentence-level drug relationship prediction model requires more memory during training and takes longer to train, and referring to Figure 6, the terminal device can process the original data set through the following steps S601-S604 to reduce its data volume and improve the efficiency of network training. The details are as follows:

S601. The terminal device obtains the original data set, which includes multiple original texts.

S602. The terminal device counts the number of drugs contained in each original text respectively.

S603. The terminal device filters the original text containing at least two drugs in the original data set to obtain a subset of the original data.

S604. The terminal device performs label processing on each original text in the original data subset to obtain a training set.

In one embodiment, the method of obtaining the original text may be similar to the method of obtaining the initial text and is not described again here. It should be noted that using the original data set directly for model training would consume a lot of training time.

It is understood that some original texts may not contain at least two drugs. Such original texts cannot be used directly for training. Based on this, the terminal device can count the number of drugs contained in each original text. Original texts that include only one drug are then deleted, and preprocessing is performed on the original texts that remain.
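Steps S602 and S603 can be sketched as a simple filter. The function name and the `drug_lexicon` argument are hypothetical; matching drug names by substring against a lexicon is an assumption standing in for whatever drug annotations the data set actually provides.

```python
def filter_texts(original_texts, drug_lexicon):
    """Keep only texts that mention at least two distinct drugs from the lexicon."""
    kept = []
    for text in original_texts:
        lowered = text.lower()
        found = {drug for drug in drug_lexicon if drug.lower() in lowered}  # S602: count drugs
        if len(found) >= 2:                                                  # S603: filter
            kept.append(text)
    return kept

texts = [
    "DrugA increases DrugB levels.",
    "DrugA is taken orally.",
    "No drug is mentioned here.",
]
subset = filter_texts(texts, ["DrugA", "DrugB", "DrugC"])
```

Only the first sample text survives, because it is the only one in which a drug pair can be formed.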

In one embodiment, the above-mentioned preprocessing at least includes a process of expanding drug names that use drug-shared suffixes. This process has been explained in S201 above and will not be described again. It should be added that the preprocessing also includes, but is not limited to: converting English characters in the original text to lowercase, removing punctuation, and replacing all numbers in the original text with "NUM".
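The three normalisation operations named above (lowercasing, punctuation removal, and replacement of numbers with "NUM") can be sketched with the Python standard library as follows. The order of the operations and the function name are assumptions; the expansion of drug-shared suffixes is not shown.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, replace numbers with NUM, strip punctuation, tidy spaces."""
    text = text.lower()
    # Replace integers and decimals before punctuation is removed, so that
    # a value such as 2.5 becomes a single NUM token.
    text = re.sub(r"\d+(?:\.\d+)?", "NUM", text)
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Aspirin 2.5 mg, twice daily!"))
```

Note that this simple version would also alter drug names that contain digits or hyphens, so a practical pipeline would presumably protect annotated drug mentions before applying it.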

It can be understood that the original text after the above preprocessing is the text that can be used to train the drug relationship prediction model, thereby reducing the redundancy of the training data.

In one embodiment, labeling the original text specifically includes labeling the drug pairs appearing at each position in the original text to participate in model training.

However, based on the above description of the sentence-level drug relationship prediction model in the prior art, when a document is divided into multiple sentences, several sentences may contain the same drug pair. When the same drug pair appears at different locations in the document, its relationship labels may differ, which means that the same drug pair carries different semantic information at different locations in the document. If only the relationship label of the first occurrence, or of one arbitrary occurrence, of the drug pair is used in model training, the prediction accuracy of the final drug relationship prediction model will be reduced. If each occurrence of the same drug pair in a document keeps its own relationship label, the relationship labels in the data set will become confused.

Based on this, in this embodiment, referring to Figure 7, the terminal device can also process the relationship labels of each drug pair in the original text through the following steps S701-S702, so that a drug pair that appears multiple times is assigned the optimal relationship label, which solves the problem of confused relationship labels:

S701. The terminal device obtains each drug pair contained in the original text and the relationship labels between various drug pairs.

S702. If there is a drug pair with multiple relationship labels, the terminal device determines the relationship label with a higher priority among the multiple relationship labels as the relationship label of the drug pair according to the preset label priority.

In one embodiment, the above-mentioned relationship labels are used to represent the interaction relationship between drug pairs and to participate in the iterative training of the drug relationship prediction model. The relationship label of each drug pair is usually annotated in advance by staff. Therefore, the terminal device can directly obtain each drug pair contained in the original text, as well as the relationship labels of each drug pair.

It should be noted that if the same drug pair appears at different locations in the original text, its context and semantics may differ, so the corresponding relationship labels may also differ. In this embodiment, however, a single fusion entity representation is generated for each drug pair from the comprehensive entity representations of its two drugs. Therefore, when a drug pair has multiple relationship labels, only one relationship label should be associated with the drug pair for model training.

In one embodiment, the above-mentioned label priority is a preset priority. For example, the labels can be False, Int, Advise, Effect, and Mechanism, and their priority can be: False < Int < Advise < Effect < Mechanism.

As shown above, the label Mechanism has the highest priority, that is, it carries the most pharmacokinetic information about the two drugs; the label Effect indicates a certain degree of reaction between the two drugs, but less than Mechanism; the label Advise indicates that there is an interaction between the two drugs, to a lesser degree than Effect; the label Int indicates a low degree of interaction between the two drugs, lower than Advise; and the label False indicates that there is no interaction between the two drugs.

In this embodiment, when a drug pair has multiple relationship labels, the multiple relationship labels can be converted into a single relationship label through the label priority rule shown above, so that the resulting single relationship label better represents the interaction relationship of the drug pair in the original text.
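The priority rule of S701-S702 can be sketched as follows. The priority order is the one given in this embodiment; the function name, the tuple layout of the annotations, the treatment of a drug pair as unordered, and the example drug names are assumptions for illustration.

```python
# Preset label priority from this embodiment: a larger value wins.
PRIORITY = {"False": 0, "Int": 1, "Advise": 2, "Effect": 3, "Mechanism": 4}


def resolve_labels(annotations):
    """Collapse the labels of each drug pair to the highest-priority one.

    `annotations` is an iterable of (drug_a, drug_b, label) tuples, one per
    occurrence of the pair in the original text (S701).
    """
    resolved = {}
    for drug_a, drug_b, label in annotations:
        pair = frozenset((drug_a, drug_b))  # assumed: pairs are unordered
        current = resolved.get(pair)
        if current is None or PRIORITY[label] > PRIORITY[current]:
            resolved[pair] = label  # S702: keep the higher-priority label
    return resolved


annotations = [
    ("aspirin", "warfarin", "Advise"),
    ("warfarin", "aspirin", "Mechanism"),
    ("aspirin", "warfarin", "False"),
    ("aspirin", "ibuprofen", "Int"),
]
for pair, label in resolve_labels(annotations).items():
    print(sorted(pair), label)
```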

In one embodiment, the method for predicting drug-pair interaction relationships in this application is a document-level prediction method. Compared with the sentence-level prediction method for drug-pair interaction relationships, its advantages are as follows:

(1) Comparison of the amount of input text: In practical applications, sentence-level prediction of drug-pair interaction relationships must transform each sentence into multiple instances that each contain only two drug entities. In contrast, document-level prediction of drug-pair interaction relationships can handle multiple drug entities simultaneously. Therefore, document-level prediction can simplify data preprocessing and reduce the amount of text input into the drug relationship prediction model. To show this advantage more intuitively, the number of sentences that must be input into the prediction model, as reported for sentence-level drug-pair interaction prediction methods in recent years, was collected and compared with the number of sentences used in this embodiment. See Table 1 below for details:

Table 1. Number of sentences in different methods

As can be seen from Table 1, the highest amount of text is included in the original DDI Extraction 2013. After preprocessing, there are 27792 sentences in the training set and 5716 sentences in the test set, for a total of 33508 sentences. In this embodiment, the number after preprocessing is the smallest: 3784 sentences in the training set and 790 sentences in the test set, for a total of 4574 sentences.

(2) Comparison of different BERT models: In text processing, there are three commonly used BERT pre-trained models, namely BERT, SciBERT and BioBERT. To observe the effect of the three pre-trained models in document-level prediction of drug-pair interaction relationships, BioBERT in this method was replaced with BERT and with SciBERT. However, in actual experiments, the proposed method did not work properly after BioBERT was replaced with BERT or SciBERT. The reason is that the document-level prediction method does not blind drug names (that is, it does not replace them with placeholder tokens), and most drugs have complex names. Among the three pre-trained models, only BioBERT is trained on a large-scale biomedical corpus, so only BioBERT can accurately express the entity representation of complex drug names. To further compare the representation ability of the three pre-trained models, sentence-level extraction of drug-pair interaction relationships was performed on the DDI corpus with all other experimental settings kept identical, in order to analyze which pre-trained model better expresses the text data of drug-pair interaction relationships. See Table 2 below for details.

Table 2. Results using different BERT models

Pre-trained model	macro-P (%)	macro-R (%)	macro-F1 (%)
BERT	78.78	73.27	75.92
SciBERT	81.71	74.80	78.10
BioBERT	85.89	73.46	79.19

As shown in Table 2, the method using BERT has the lowest performance, with macro-P (macro-averaged precision, an evaluation index of the model) reaching 78.78%, macro-R (macro-averaged recall, another evaluation index of the model) reaching 73.27%, and macro-F1 (the macro-averaged harmonic mean, another evaluation index of the model) reaching 75.92%. The results of the SciBERT method are intermediate, although its macro-R is the highest, reaching 74.80%. This is because SciBERT is trained on a large-scale corpus of scientific literature, which greatly improves its performance compared with BERT. The best results were obtained with the BioBERT method, with macro-P reaching 85.89% and macro-F1 reaching 79.19%. This shows that BioBERT, trained on a biomedical corpus, can more accurately express the text data of drug-pair interaction relationships.
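As a worked check of the macro-F1 column, the sketch below computes the harmonic mean of macro-P and macro-R for each row of Table 2 and prints it next to the reported macro-F1. That macro-F1 was obtained in exactly this way is an inference from the reported numbers and from the description of macro-F1 as a harmonic mean, not something stated explicitly in this application; the function name is hypothetical.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of a precision value and a recall value."""
    return 2 * precision * recall / (precision + recall)


# (macro-P, macro-R, reported macro-F1) in percent, copied from Table 2.
table2 = {
    "BERT": (78.78, 73.27, 75.92),
    "SciBERT": (81.71, 74.80, 78.10),
    "BioBERT": (85.89, 73.46, 79.19),
}

for model, (p, r, reported) in table2.items():
    print(f"{model}: computed {harmonic_f1(p, r):.3f}, reported {reported:.2f}")
```

For all three rows the computed value agrees with the reported one to within 0.01, which is the size of difference expected when the inputs have themselves been rounded to two decimals.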

(3) Performance of the document-level comprehensive entity representation (embedding) of drugs: To verify its actual performance, an experiment can be designed that compares the method without document-level drug embedding against the method with document-level drug embedding. Specifically, the former is denoted Without DEE (each drug is embedded only at its first occurrence) and is compared with the method using DEE proposed in this embodiment.

Table 3. Effects of using DEE and not using DEE

Method	macro-P (%)	macro-R (%)	macro-F1 (%)
Without DEE	60.07	56.32	58.43
Use DEE	65.60	59.71	62.51

As shown in Table 3, without the DEE method, macro-P reached 60.07%, macro-R reached 56.32%, and macro-F1 reached 58.43%. With the DEE method, macro-P reached 65.60%, macro-R reached 59.71%, and macro-F1 reached 62.51%, which are 5.53, 3.39, and 4.08 percentage points higher, respectively, than without DEE. The reason is that without DEE, the contextual semantic information of the same drug at different locations in the document is not considered. By using document-level drug embedding, this method obtains a complete comprehensive entity representation of each drug in the document and therefore produces more accurate prediction results.
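The difference between the two settings can be illustrated on toy vectors. The sketch below assumes that the per-position text vectors of a drug are integrated by mean pooling; this application only states that the text vectors are "vector integrated", so the pooling operator, the function names and the example vectors are assumptions.

```python
import numpy as np


def first_mention_embedding(mention_vectors):
    """Without DEE: use only the vector of the drug's first occurrence."""
    return np.asarray(mention_vectors[0], dtype=float)


def document_entity_embedding(mention_vectors):
    """With DEE (assumed mean pooling): integrate the vectors of all
    occurrences of the drug into one comprehensive entity representation."""
    return np.mean(np.asarray(mention_vectors, dtype=float), axis=0)


# Toy text vectors for three occurrences of the same drug in one document.
mention_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(first_mention_embedding(mention_vectors))
print(document_entity_embedding(mention_vectors))
```

The first function discards whatever the second and third occurrences contribute, which corresponds to the loss of contextual information described for the Without DEE setting.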

(4) Comparison of different neural network model structures: This embodiment is the first to perform special preprocessing on the DDI Extraction 2013 data set so as to achieve document-level prediction of drug interaction relationships. Currently, there is no other work on document-level DDI data sets. To verify the effectiveness of the proposed method, it is compared with methods using CNN and BiLSTM, the two most commonly used neural network models. These two methods also adopt document-level drug embedding, but use different neural network model structures after the comprehensive entity representation of each drug is obtained. However, in practice, the method using only the BiLSTM network model did not work. It was therefore replaced by a method that combines the CNN and BiLSTM neural network models, denoted "CNN+BiLSTM".

Table 4. Results using different neural network model structures

As can be seen from Table 4, although the CNN+BiLSTM structure reaches a macro-P of 66.98%, the highest among the three methods, its macro-R is only 50.19% and its macro-F1 only 57.38%, so this network structure has the lowest overall performance. The CNN method reaches a macro-P of 56.75%, a macro-R of 59.97%, and a macro-F1 of 58.32%, so its overall performance is slightly higher than that of CNN+BiLSTM. With the structure of the drug relationship prediction model in this application, macro-P is only 1.38 percentage points lower than that of CNN+BiLSTM, and macro-R is only 0.26 percentage points lower than that of CNN. Both macro-P and macro-R are close to the highest values, so the overall performance is the best. The reason is that the input is the comprehensive entity representations of two drugs rather than a complete sentence. Therefore, neural network structures suited to sentence-level input (BiLSTM in particular) cannot match the performance of the structure designed for document-level input.

In summary, this embodiment uses a document-level method for predicting drug-pair interaction relationships. Compared with the sentence-level prediction methods in the prior art, it greatly reduces the amount of data input into the drug relationship prediction model, and it combines and accurately expresses the semantics of a drug at multiple different positions, so that the drug relationship prediction model can extract the real semantic information of the drug in the document and the prediction accuracy of the model is improved.

Please refer to FIG. 8, which is a structural block diagram of a device for predicting drug pair interaction relationships provided by an embodiment of the present application. The modules included in the device in this embodiment are used to execute the steps in the embodiments corresponding to Figures 2, 3, 6 and 7. For details, please refer to Figures 2, 3, 6 and 7 and the relevant descriptions in the corresponding embodiments. For convenience of explanation, only the parts related to this embodiment are shown. Referring to Figure 8, the device 800 for predicting drug pair interaction relationships may include: an acquisition module 810, a comprehensive entity representation determination module 820, a fusion entity representation determination module 830, and a prediction module 840, wherein:

The acquisition module 810 is used to acquire target text; the target text includes multiple drugs, and each drug appears at least once in the target text.

The comprehensive entity representation determination module 820 is used to determine the comprehensive entity representation corresponding to each drug, and the comprehensive entity representation is used to describe the semantic information of the corresponding drug at each position in the target text.

The fusion entity representation determination module 830 is used for determining, for any drug pair among a plurality of drugs, the fusion entity representation of the drug pair based on the comprehensive entity representations of two drugs in the drug pair.

The prediction module 840 is used to predict the interaction relationship of the drug pair according to the fusion entity representation of the drug pair.

In one embodiment, the acquisition module 810 is also used to:

Obtain the initial text. The initial text includes multiple drugs, and each drug appears at least once in the initial text; if there is a drug name using a drug-shared suffix in the initial text, the drug name is expanded to obtain the target text.

In one embodiment, the comprehensive entity representation determination module 820 is also used to:

For any kind of drug, multiple positions of the drug in the target text are determined respectively; based on the multiple positions, a text sequence corresponding to each position is generated; the text sequence corresponding to each position is vector processed to obtain a comprehensive entity representation.

The text sequence corresponding to each position is represented as a vector, so that multiple text vectors are obtained; each text vector is used to describe the semantic information of the drug at the corresponding position; the text vectors are integrated to generate the comprehensive entity representation of the drug.
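The first two operations of module 820 (locating every position of a drug in the target text and generating a text sequence for each position) can be sketched as follows. Taking a fixed-width character window around each occurrence as the "text sequence" is an assumption, as are the function names, the window size and the example text; the subsequent vector representation by the pre-trained model is not shown.

```python
import re


def mention_positions(text, drug):
    """Character offsets of every occurrence of `drug` in `text`."""
    pattern = r"\b" + re.escape(drug) + r"\b"
    return [m.start() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


def text_sequences(text, drug, window=30):
    """One text sequence per position: the occurrence plus `window`
    characters of context on each side (assumed definition)."""
    sequences = []
    for pos in mention_positions(text, drug):
        start = max(0, pos - window)
        end = min(len(text), pos + len(drug) + window)
        sequences.append(text[start:end])
    return sequences


target_text = (
    "Aspirin may increase the effect of warfarin. "
    "Patients on warfarin should avoid aspirin."
)
for sequence in text_sequences(target_text, "warfarin"):
    print(repr(sequence))
```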

In one embodiment, the device 800 for predicting drug pair interaction relationships further includes:

The input module is used to input the target text into the pre-trained drug relationship prediction model for processing, and obtain the interaction relationship of each drug pair among multiple drugs.

In one embodiment, the drug relationship prediction model includes a first activation layer, a second activation layer, a first fully connected layer and a second fully connected layer; the fusion entity representation determination module 830 is also used to:

Input the comprehensive entity representations of the two drugs into the first activation layer and the first fully connected layer in sequence to obtain the target vectors corresponding to the two drugs; splice the two target vectors, and input the spliced target vector into the second activation layer and the second fully connected layer in sequence to obtain the fused entity representation.

In one embodiment, the drug relationship prediction model is trained based on the training set; the drug pair interaction relationship prediction device 800 also includes the following module to obtain the training set:

The original data set acquisition module is used to obtain the original data set, which includes multiple original texts.

The statistics module is used to separately count the number of drugs contained in each original text.

The filtering module is used to select, from the original data set, the original texts containing at least two drugs to obtain an original data subset.

The label processing module is used to label each original text in the original data subset to obtain a training set.

In one embodiment, the label processing module is also used to:

Obtain each drug pair included in the original text and the relationship labels of each drug pair; if there is a drug pair with multiple relationship labels, determine, according to the preset label priority, the relationship label with the higher priority among the multiple relationship labels as the relationship label of the drug pair.

It should be understood that, in the structural block diagram of the device for predicting drug pair interaction relationships shown in Figure 8, each module is used to execute the corresponding steps in the embodiments of Figures 2, 3, 6 and 7. Those steps have been explained in detail in the above embodiments. For details, please refer to Figures 2, 3, 6 and 7 and the relevant descriptions in the corresponding embodiments, which will not be repeated here.

Figure 9 is a structural block diagram of a terminal device provided by an embodiment of the present application. As shown in Figure 9, the terminal device 900 of this embodiment includes: a processor 910, a memory 920, and a computer program 930 stored in the memory 920 and executable on the processor 910, such as a program for predicting drug interaction relationships. When the processor 910 executes the computer program 930, the steps in each embodiment of the method for predicting the interaction relationship of each drug pair are implemented, such as S101 to S104 shown in Figure 1. Alternatively, when the processor 910 executes the computer program 930, it implements the functions of each module in the embodiment corresponding to FIG. 8, for example, the functions of modules 810 to 840 shown in FIG. 8. For details, please refer to the relevant description in the embodiment corresponding to FIG. 8.

Exemplarily, the computer program 930 can be divided into one or more modules, and the one or more modules are stored in the memory 920 and executed by the processor 910 to implement the method for predicting drug pair interaction relationships provided by the embodiments of the present application. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 930 in the terminal device 900. For example, the computer program 930 can implement the method for predicting drug-drug interaction relationships provided in the embodiments of the present application.

The terminal device 900 may include, but is not limited to, a processor 910 and a memory 920. Those skilled in the art can understand that FIG. 9 is only an example of the terminal device 900 and does not constitute a limitation on the terminal device 900. The terminal device may include more or fewer components than shown in the figure, may combine some components, or may use different components; for example, the terminal device may also include input and output devices, network access devices, buses, etc.

The processor 910 may be a central processing unit, or another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc.

The memory 920 may be an internal storage unit of the terminal device 900, such as a hard disk or memory of the terminal device 900. The memory 920 may also be an external storage device of the terminal device 900, such as a plug-in hard disk, a smart memory card, a flash memory card, etc. equipped on the terminal device 900. Further, the memory 920 may also include both an internal storage unit of the terminal device 900 and an external storage device.

Embodiments of the present application provide a computer-readable storage medium, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the method for predicting drug pair interaction relationships in each of the above embodiments is implemented.

Embodiments of the present application provide a computer program product. When the computer program product is run on a terminal device, the terminal device executes the method for predicting drug pair interaction relationships in each of the above embodiments.

The above embodiments are only used to illustrate the technical solutions of the present application and are not intended to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; these modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall be included within the scope of protection of this application.

Claims

A method for predicting drug-drug interaction relationships, characterized in that the method includes:

Obtain target text; the target text includes multiple drugs, and each drug appears at least once in the target text;

Determine the comprehensive entity representation corresponding to each drug respectively, and the comprehensive entity representation is used to describe the semantic information of the corresponding drug at each position in the target text;

For any drug pair in the plurality of drugs, determine the fusion entity representation of the drug pair based on the comprehensive entity representation of two drugs in the drug pair;

According to the fusion entity representation of the drug pair, the interaction relationship of the drug pair is predicted.
The method according to claim 1, characterized in that said obtaining target text includes:

Obtain initial text, the initial text includes the plurality of drugs, and each drug appears at least once in the initial text;

If there is a drug name using a drug-shared suffix in the initial text, the drug name is expanded to obtain the target text.
The method according to claim 1, wherein the separately determining the comprehensive entity representation corresponding to each drug includes:

For any one of the drugs, determine multiple positions of the drug in the target text;

According to the multiple positions, generate a text sequence corresponding to each position;

The text sequence corresponding to each position is vector processed to obtain the comprehensive entity representation.
The method according to claim 3, characterized in that, performing vector processing on the text sequence corresponding to each position to obtain the comprehensive entity representation includes:

The text sequence corresponding to each position is represented by a vector, and multiple text vectors are obtained correspondingly; each text vector is used to describe the semantic information of the drug at the corresponding position;

Each text vector is vector integrated to generate a comprehensive entity representation of the drug.
The method according to any one of claims 1 to 4, characterized in that the steps of: determining the comprehensive entity representation corresponding to each drug respectively, the comprehensive entity representation being used to describe the semantic information of the corresponding drug at each position in the target text; for any drug pair among the plurality of drugs, determining the fusion entity representation of the drug pair based on the comprehensive entity representations of the two drugs in the drug pair; and predicting the interaction relationship of the drug pair according to the fusion entity representation of the drug pair, include:

The target text is input into the pre-trained drug relationship prediction model for processing, and the interaction relationship of each drug pair in the multiple drugs is obtained.
The method according to claim 5, wherein the drug relationship prediction model includes a first activation layer, a second activation layer, a first fully connected layer and a second fully connected layer; and determining the fusion entity representation of the drug pair based on the comprehensive entity representations of the two drugs in the drug pair includes:

Input the comprehensive entity representations of the two drugs into the first activation layer and the first fully connected layer in sequence to obtain target vectors corresponding to the two drugs;

The two target vectors are spliced, and the spliced target vectors are sequentially input into the second activation layer and the second fully connected layer to obtain the fused entity representation.
The method according to claim 5, characterized in that the drug relationship prediction model is obtained by training based on a training set, and the acquisition method of the training set is:

Obtain an original data set, the original data set including a plurality of original texts;

Separately count the number of drugs contained in each of the original texts;

Filter the original text containing at least two drugs in the original data set to obtain a subset of the original data;

Label processing is performed on each original text in the original data subset to obtain the training set.
The method according to claim 7, characterized in that, performing label processing on each original text in the original data subset to obtain the training set includes:

Obtain each drug pair contained in the original text, and relationship labels between various drug pairs;

If there is a drug pair with multiple relationship labels, the relationship label with a higher priority among the multiple relationship labels is determined as the relationship label of the drug pair according to the preset label priority.
A terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, the method according to any one of claims 1 to 8 is implemented.
A computer-readable storage medium stores a computer program, characterized in that when the computer program is executed by a processor, the method according to any one of claims 1 to 8 is implemented.