CN116129989A

CN116129989A - Method, device, terminal equipment and medium for predicting drug relevance

Info

Publication number: CN116129989A
Application number: CN202310157524.0A
Authority: CN
Inventors: 邓磊; 胡小文
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2023-02-23
Filing date: 2023-02-23
Publication date: 2023-05-16

Abstract

The application belongs to the technical field of bioinformatics and provides a method, a device, terminal equipment and a medium for predicting drug relevance. A bipartite graph is constructed from association pairs formed by lncRNAs and drugs, and the vector representations of the lncRNA and the drug are initialized; a neural network model performs neighbor node aggregation on the vector representations to obtain initial feature vectors of the lncRNA and the drug; local structure neighbors and global semantic neighbors are constructed, and a first contrast learning loss and a second contrast learning loss are constructed from the initial feature vectors; the initial feature vectors are updated according to the contrast learning losses combined with a BPR loss function to obtain intermediate feature vectors of the lncRNA and the drug; if the intermediate feature vectors meet the update termination condition, a relevance prediction model is constructed from the intermediate feature vectors and used to predict the relevance of the lncRNA and the drug. The method and the device can improve the accuracy of predicting the relevance of lncRNAs and drugs.

Description

Method, device, terminal equipment and medium for predicting drug relevance

Technical Field

The application belongs to the technical field of biological information, and particularly relates to a method, a device, terminal equipment and a medium for predicting drug relevance.

Background

In recent years, there has been increasing evidence that ncRNAs (non-coding RNAs, a class of functional RNA molecules) can affect drug efficacy by modulating genes associated with drug sensitivity, for example by inducing alternative signaling pathways. A long non-coding RNA (lncRNA, a type of ncRNA) is an RNA molecule more than 200 nucleotides in length. lncRNAs play a key role in many biological processes such as epigenetic regulation, cell cycle regulation, cell differentiation, transcriptional and post-transcriptional regulation, and genomic splicing.

Numerous studies have shown that lncRNAs regulate human disease through interaction with a range of biomolecules in organisms. Their mutations and dysfunctions are closely related to human diseases such as nervous system diseases, hematological diseases, cardiovascular diseases and various cancers. With the development of sequencing technology, more and more lncRNA molecules are being detected and analyzed with greater sensitivity and depth, especially with regard to their role in drug sensitivity. Studies show that lncRNAs can regulate genes related to drug sensitivity and induce alternative signaling pathways, thereby influencing drug efficacy. For example, the lncRNA NORAD (non-coding RNA activated by DNA damage) inhibits the proliferation of osteosarcoma HOS (human osteosarcoma cells)/DDP (human lung adenocarcinoma cells) cells and increases their sensitivity to cisplatin by targeting miR-410-3p. In gallbladder cancer, the sensitivity of cancer cells to chemotherapy is regulated by the key regulatory factor lncRNA1 (GBCDRlnc1). Identifying associations between lncRNAs and drug sensitivity is therefore of great significance for drug development. However, traditional methods based on biological assays tend to consume significant amounts of time and labor and are largely undirected, resulting in inaccurate prediction of associations between lncRNAs and drugs.

Disclosure of Invention

The embodiment of the application provides a method, a device, terminal equipment and a medium for predicting the relevance of a drug, which can solve the problem that the prediction of the relevance of the lncRNA and the drug is inaccurate at present.

In a first aspect, an embodiment of the present application provides a method for predicting drug association, including:

step 1, constructing an association bipartite graph according to an association pair formed by the lncRNA to be detected and a target drug, and randomly initializing the vector representation of the lncRNA to be detected and the vector representation of the target drug respectively;

step 2, running a neural network model on the association bipartite graph, and respectively carrying out neighbor node aggregation on the vector representation of the lncRNA to be detected and the vector representation of the target drug to obtain an initial feature vector of the lncRNA to be detected and an initial feature vector of the target drug;

step 3, constructing a local structure neighbor of the association pair based on the association bipartite graph, and constructing a first contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug;

step 4, constructing a global semantic neighbor of the association pair based on the association bipartite graph, and constructing a second contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug;

step 5, calculating a comprehensive loss according to the first contrast learning loss, the second contrast learning loss and the BPR loss function, and using the comprehensive loss to update, by back propagation, the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug respectively, to obtain an intermediate feature vector of the lncRNA to be detected and an intermediate feature vector of the target drug;

step 6, if the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug meet the preset update termination condition, taking the intermediate feature vector of the lncRNA to be detected as the final feature vector of the lncRNA to be detected and taking the intermediate feature vector of the target drug as the final feature vector of the target drug; otherwise, taking the intermediate feature vector of the lncRNA to be detected as the vector representation of the lncRNA to be detected in step 2, taking the intermediate feature vector of the target drug as the vector representation of the target drug in step 2, and returning to step 2;

step 7, constructing a relevance prediction model according to the final feature vector of the lncRNA to be detected and the final feature vector of the target drug;

and step 8, predicting the relevance of the lncRNA to be detected and the target drug by using a relevance prediction model.

Optionally, the neural network model in step 2 is a graph convolutional neural network model.

Optionally, in step 2, running a neural network model on the association bipartite graph, and respectively performing neighbor node aggregation on the vector representation of the lncRNA to be detected and the vector representation of the target drug to obtain an initial feature vector of the lncRNA to be detected and an initial feature vector of the target drug, includes:

obtaining, by a calculation formula, the initial feature vector $e_n$ of the lncRNA to be detected and the initial feature vector $e_d$ of the target drug; wherein $N_n$ represents the neighbor node set of the lncRNA to be detected, $N_d$ represents the neighbor node set of the target drug, $e_n^{(l)}$ represents the embedding of the node vector of the lncRNA to be detected at layer $l$ of the graph convolutional neural network, $e_d^{(l)}$ represents the embedding of the node vector of the target drug at layer $l$ of the graph convolutional neural network, $L$ represents the total number of layers of the graph convolutional neural network, $e_n^{(l+1)}$ represents the embedding of the node vector of the lncRNA to be detected at layer $l+1$ of the graph convolutional neural network, and $e_d^{(l+1)}$ represents the embedding of the node vector of the target drug at layer $l+1$ of the graph convolutional neural network.

Optionally, in step 3, constructing a first contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug includes:

obtaining, by a calculation formula, the local structure neighbor learning loss of the lncRNA to be detected and the local structure neighbor learning loss of the target drug; wherein $e_n^{(k)}$ represents the output of the initial feature vector $e_n$ of the lncRNA to be detected at the $k$-th layer of the graph convolutional neural network, $e_d^{(k)}$ represents the output of the initial feature vector $e_d$ of the target drug at the $k$-th layer of the graph convolutional neural network, $k$ represents an even number, $\tau$ represents the hyper-parameter of the softmax function, $e_n^{(0)}$ represents the vector representation of the lncRNA to be detected, $e_d^{(0)}$ represents the vector representation of the target drug, n_num represents the total number of lncRNAs obtained in step 1, d_num represents the total number of drugs obtained in step 1, $e_i^{(0)}$ represents the vector representation of the $i$-th lncRNA at layer 0 of the graph convolutional neural network, and $e_j^{(0)}$ represents the vector representation of the $j$-th drug at layer 0 of the graph convolutional neural network, with i = 1, 2, ..., n_num and j = 1, 2, ..., d_num;

obtaining, by a calculation formula, the first contrast learning loss $L_{local}$; wherein $\alpha$ represents a hyper-parameter for balancing weights.

Optionally, in step 4, constructing a second contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug includes:

obtaining, by a calculation formula, the global semantic neighbor contrast learning loss of the lncRNA to be detected and the global semantic neighbor contrast learning loss of the target drug; wherein $c_i$ represents the prototype of the lncRNA to be detected, $c_j$ represents the prototype of the drug, and $C$ represents the set of prototypes;

obtaining, by a calculation formula, the second contrast learning loss $L_{glocal}$; wherein $\beta$ represents a hyper-parameter for balancing weights.

Optionally, step 5 includes:

obtaining the comprehensive loss $L$ by the calculation formula

$$L = L_{BPR} + \lambda_1 L_{local} + \lambda_2 L_{glocal} + \lambda_3 \lVert \theta \rVert_2$$

wherein $\lambda_1$, $\lambda_2$ and $\lambda_3$ all represent hyper-parameters for balancing weights, and $\theta$ represents the parameters of the graph convolutional neural network; in the BPR loss $L_{BPR}$, $\sigma$ represents a nonlinear activation function, the training data are given as pairs, $(n, d^+)$ indicates that an association exists between the lncRNA to be detected $n$ and the target drug $d^+$, and $(n, d^-)$ indicates that no association exists between the lncRNA to be detected $n$ and the sampled drug $d^-$;

using the comprehensive loss $L$ to update, by back propagation, the initial feature vector $e_n$ of the lncRNA to be detected and the initial feature vector $e_d$ of the target drug, to obtain the intermediate feature vector $e_n'$ of the lncRNA to be detected and the intermediate feature vector $e_d'$ of the target drug.
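As a minimal sketch of how the weighted sum above could be evaluated, the following dependency-free Python function combines three already computed loss values with the regularization term exactly as the formula is written (a plain L2 norm, not its square). The function name, the flat parameter list `theta` and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def comprehensive_loss(l_bpr, l_local, l_glocal, theta, lam1, lam2, lam3):
    """L = L_BPR + lam1 * L_local + lam2 * L_glocal + lam3 * ||theta||_2."""
    # theta: flat list of all trainable parameters (hypothetical layout)
    theta_norm = math.sqrt(sum(p * p for p in theta))
    return l_bpr + lam1 * l_local + lam2 * l_glocal + lam3 * theta_norm

# Hypothetical loss values and weights, for illustration only
total = comprehensive_loss(1.0, 2.0, 3.0, theta=[3.0, 4.0],
                           lam1=0.1, lam2=0.01, lam3=0.001)
```

In a real training loop the three loss terms would be differentiable quantities produced by an automatic differentiation framework, so that the back propagation update described above can be applied to the combined value.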

Optionally, the relevance prediction model in step 7 is given by an expression whose output represents the association score of the lncRNA to be detected $n$ and the target drug $d$.

Optionally, before performing step 6, the prediction method further includes:

according to the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug, calculating an AUC value and an AUPR value respectively;

if the AUC value and the AUPR value reach the maximum value, determining that the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug meet the preset update termination condition; otherwise, determining that the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug do not meet the preset update termination condition.

In a second aspect, embodiments of the present application provide a device for predicting drug association, including:

the initialization module is used for constructing an association bipartite graph according to a pre-obtained association pair formed by the lncRNA to be detected and the target drug, and randomly initializing the vector representation of the lncRNA to be detected and the vector representation of the target drug respectively;

the aggregation module is used for running a neural network model on the associated bipartite graph, and respectively carrying out neighbor node aggregation on the vector representation of the lncRNA to be detected and the vector representation of the target drug to obtain an initial feature vector of the lncRNA to be detected and an initial feature vector of the target drug;

the first contrast learning loss module is used for constructing a local structure neighbor of the association pair based on the association bipartite graph, and constructing first contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug;

the second contrast learning loss module is used for constructing a global semantic neighbor of the association pair based on the association bipartite graph, and constructing a second contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug;

the intermediate feature vector module is used for calculating a comprehensive loss according to the first contrast learning loss, the second contrast learning loss and the BPR loss function, and updating the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug by using the comprehensive loss to obtain the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug;

the final feature vector module is used for judging whether the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug meet the preset update termination condition; if they do, the intermediate feature vector of the lncRNA to be detected is used as the final feature vector of the lncRNA to be detected, and the intermediate feature vector of the target drug is used as the final feature vector of the target drug; otherwise, the intermediate feature vector of the lncRNA to be detected is used as the vector representation of the lncRNA to be detected in the aggregation module, the intermediate feature vector of the target drug is used as the vector representation of the target drug in the aggregation module, and execution returns to the aggregation module;

the prediction model module is used for constructing a relevance prediction model according to the final feature vector of the lncRNA to be detected and the final feature vector of the target drug;

and the prediction module is used for predicting the relevance of the lncRNA to be detected and the target drug by using the relevance prediction model.

In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the method for predicting drug relevance described above when executing the computer program.

In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program, which when executed by a processor, implements a method for predicting drug relevance as described above.

The scheme of the application has the following beneficial effects:

in some embodiments of the present application, according to an initial feature vector of a lncRNA to be detected and an initial feature vector of a target drug, a first contrast learning loss and a second contrast learning loss are constructed, and then according to the first contrast learning loss, the second contrast learning loss and the BPR loss, the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug are respectively updated by back propagation, so that a more accurate feature vector can be obtained, thereby improving the accuracy of prediction of relevance between the lncRNA to be detected and the target drug.

Other advantages of the present application will be described in detail in the detailed description section that follows.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of a method for predicting drug relevance according to an embodiment of the present disclosure;

FIG. 2a shows ROC curves comparing the performance of the method for predicting drug relevance provided by an embodiment of the present application with that of other prior-art methods;

FIG. 2b is a visualization comparing the aggregation of drug node feature vectors obtained by the method for predicting drug relevance provided by an embodiment of the present application with that obtained by other prior-art methods;

FIG. 3 is a schematic structural diagram of a device for predicting drug association according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

As used in this specification and the appended claims, the term "if" may be interpreted as "when" or "once" or "in response to a determination" or "in response to detection" depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as "upon determination" or "in response to determination" or "upon detection of a [described condition or event]" or "in response to detection of a [described condition or event]".

In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.

Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.

Aiming at the problem that the prediction of the relevance between the lncRNA and the drug is inaccurate at present, the application provides a method, a device, a terminal device and a medium for predicting the relevance between the drug, wherein a first contrast learning loss and a second contrast learning loss are constructed according to an initial feature vector of the lncRNA to be detected and an initial feature vector of a target drug, and then the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug are respectively subjected to back propagation update according to the first contrast learning loss, the second contrast learning loss and the BPR loss, so that more accurate feature vectors can be obtained, and the accuracy of the prediction of the relevance between the lncRNA to be detected and the target drug is improved.

As shown in fig. 1, the method for predicting drug association provided in the present application mainly includes the following steps:

Step 1, constructing an association bipartite graph according to an association pair formed by the lncRNA to be detected and the target drug, and randomly initializing the vector representation of the lncRNA to be detected and the vector representation of the target drug respectively.

In the embodiments of the present application, the association pairs of lncRNAs and drugs can be obtained from the RNAactDrug database (which contains RNAs related to drug sensitivity derived from multi-omics data). The lncRNA to be detected is any one of the lncRNAs obtained, and the target drug is any one of the drugs obtained.

It should be noted that, in the embodiment of the present application, the bipartite graph may be constructed by a common bipartite graph construction method.

The vector representation of the lncRNA to be detected and the vector representation of the target drug are randomly initialized so that more accurate feature vectors can be obtained later. Without this step, the feature vector of the lncRNA to be detected and the feature vector of the target drug would remain zero, and updating the feature vectors would be meaningless.
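Step 1 can be sketched in a few lines of dependency-free Python. The example builds the two adjacency lists of the bipartite graph from index pairs and draws the layer-0 vectors from a small Gaussian. The function names, the toy pairs, the embedding size and the standard deviation of 0.1 are all assumptions made for illustration; the source does not specify the initialization distribution.

```python
import random

def build_association_graph(pairs, n_num, d_num):
    """Adjacency lists of the bipartite graph built from (lncRNA, drug) index pairs."""
    lnc_nbrs = [[] for _ in range(n_num)]    # drugs associated with each lncRNA
    drug_nbrs = [[] for _ in range(d_num)]   # lncRNAs associated with each drug
    for n, d in pairs:
        lnc_nbrs[n].append(d)
        drug_nbrs[d].append(n)
    return lnc_nbrs, drug_nbrs

def random_init(num_nodes, dim, rng):
    """One random layer-0 vector per node, drawn from N(0, 0.1^2) (assumed)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_nodes)]

# Hypothetical toy association pairs, for illustration only
pairs = [(0, 0), (0, 1), (1, 1), (2, 2)]
lnc_nbrs, drug_nbrs = build_association_graph(pairs, n_num=3, d_num=3)

rng = random.Random(0)
E_n0 = random_init(3, 4, rng)   # layer-0 vectors of the lncRNAs
E_d0 = random_init(3, 4, rng)   # layer-0 vectors of the drugs
```

With an all-zero start instead, every neighbor sum would again be zero, which is the degenerate case the paragraph above warns about.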

Step 2, running a neural network model on the association bipartite graph, and respectively carrying out neighbor node aggregation on the vector representation of the lncRNA to be detected and the vector representation of the target drug to obtain an initial feature vector of the lncRNA to be detected and an initial feature vector of the target drug.

In some embodiments of the present application, the neural network model is a graph convolutional neural network model.

Specifically, the initial feature vector $e_n$ of the lncRNA to be detected and the initial feature vector $e_d$ of the target drug are obtained by a calculation formula in which $N_n$ represents the neighbor node set of the lncRNA to be detected and $N_d$ represents the neighbor node set of the target drug.

It should be noted that, in the layer aggregation stage, the initial feature vector $e_n$ of the lncRNA to be detected and the initial feature vector $e_d$ of the target drug are obtained by aggregating, through a weighted sum, the vector representation $e_n^{(l)}$ of the lncRNA node to be detected and the vector representation $e_d^{(l)}$ of the target drug at each graph convolution layer, which can improve the accuracy of the vector representations.
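The exact propagation formula is not reproduced in this text, so the following is only a sketch under stated assumptions: a weight-free graph convolution in which each node sums its neighbors' vectors with the symmetric normalization 1/sqrt(|N_n| |N_d|), followed by an equal-weight combination of the outputs of layers 0 to L. Both choices are assumptions that fit the symbols defined above; the function names are hypothetical.

```python
import math

def aggregate(nbrs, other_nbrs, other_emb, dim):
    """One layer: each node sums neighbour vectors weighted by 1/sqrt(|N_a| * |N_b|)."""
    out = []
    for node_nbrs in nbrs:
        vec = [0.0] * dim
        for m in node_nbrs:
            w = 1.0 / math.sqrt(len(node_nbrs) * len(other_nbrs[m]))
            for t in range(dim):
                vec[t] += w * other_emb[m][t]
        out.append(vec)
    return out

def layer_mean(layers):
    """Equal-weight combination of the layer-0..L outputs of every node (assumed)."""
    count, num_nodes, dim = len(layers), len(layers[0]), len(layers[0][0])
    return [[sum(layers[l][i][t] for l in range(count)) / count
             for t in range(dim)] for i in range(num_nodes)]

def propagate(lnc_nbrs, drug_nbrs, E_n0, E_d0, L):
    """Return e_n, e_d and the per-layer outputs after L propagation layers."""
    dim = len(E_n0[0])
    layers_n, layers_d = [E_n0], [E_d0]
    for _ in range(L):
        next_n = aggregate(lnc_nbrs, drug_nbrs, layers_d[-1], dim)  # lncRNA <- drugs
        next_d = aggregate(drug_nbrs, lnc_nbrs, layers_n[-1], dim)  # drug <- lncRNAs
        layers_n.append(next_n)
        layers_d.append(next_d)
    return layer_mean(layers_n), layer_mean(layers_d), layers_n, layers_d

# Hypothetical toy graph: lncRNA i is associated only with drug i
e_n, e_d, layers_n, layers_d = propagate(
    [[0], [1]], [[0], [1]],
    [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]], L=2)
```

The per-layer outputs are returned as well because the even-layer outputs are what the local structure neighbor loss in step 3 operates on.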

Step 3, constructing local structure neighbors of the association pair based on the association bipartite graph, and constructing a first contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug.

The local structure neighbors are the lncRNA nodes to be detected or target drug nodes that are adjacent in the spatial structure of the association bipartite graph constructed in step 1.

It should be noted that the local structure neighbors are used to describe the high-order associations in the association bipartite graph.

The feature vector $e^{(l)}$ at layer $l$ of the graph convolutional neural network is a weighted sum over the $l$-hop neighbors of each node (lncRNA node to be detected or target drug node).

Since an even number of information propagation steps on a bipartite graph naturally aggregates information from homogeneous structural neighbors (structural neighbor nodes of the same type), the embeddings of homogeneous neighbors (nodes of the same type) can be obtained from the outputs of the even layers of the graph convolutional network, and the resulting feature vectors are then used to model the higher-order relationships between local neighbor nodes.
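The even-layer property can be checked directly on a small graph. In the sketch below, which uses a hypothetical toy graph and a hypothetical helper name, two hops starting from an lncRNA always land on lncRNAs, and two hops starting from a drug always land on drugs.

```python
def two_hop_neighbours(node, nbrs, other_nbrs):
    """Same-type nodes reachable in exactly two hops (the node itself excluded)."""
    reached = set()
    for m in nbrs[node]:                 # hop 1: cross to the other node type
        reached.update(other_nbrs[m])    # hop 2: come back to the original type
    reached.discard(node)
    return reached

lnc_nbrs = [[0, 1], [1], [2]]    # hypothetical: drugs associated with each lncRNA
drug_nbrs = [[0], [0, 1], [2]]   # the same graph seen from the drug side

# lncRNA 0 and lncRNA 1 share drug 1, so they are homogeneous structural neighbours
homogeneous = two_hop_neighbours(0, lnc_nbrs, drug_nbrs)
```

Calling the same helper with the two adjacency lists swapped gives the homogeneous neighbors of a drug.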

Step 4, constructing global semantic neighbors of the association pair based on the association bipartite graph, and constructing a second contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug.

The global semantic neighbors are lncRNA nodes to be detected or target drug nodes that are not adjacent in the spatial structure of the association bipartite graph constructed in step 1 but may nevertheless be associated (nodes with similar effects that have no direct association on the bipartite graph). Global semantic neighbors are constructed mainly to alleviate the influence of data sparsity on the experimental results and to reduce the influence on the prediction of noise generated while constructing the local structure neighbors.

To construct a suitable global semantic neighbor contrast learning objective, in some embodiments of the present application a prototype contrast learning objective is developed, which identifies global semantic neighbors by learning a potential prototype for each node (lncRNA node to be detected or target drug node). Because similar lncRNA nodes to be detected or target drug nodes are more likely to lie close together in the feature space, a prototype can be defined as the center of a cluster consisting of a set of semantic neighbors; that is, potential prototypes can be learned by a clustering algorithm (e.g., k-nearest neighbor algorithm).
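As an illustration of learning prototypes as cluster centers, the sketch below uses plain Lloyd-style k-means. This is a swapped-in choice made for the example: the text above names a k-nearest neighbor algorithm as its example of a clustering algorithm, and the source does not state the number of prototypes, the number of iterations or the seeding used. All names and values here are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centres):
    """Index of the centre closest to vec."""
    return min(range(len(centres)), key=lambda c: sq_dist(vec, centres[c]))

def kmeans_prototypes(E, num_prototypes, num_iters=20, seed=0):
    """Return (prototypes, assignment) from Lloyd's k-means on the rows of E."""
    rng = random.Random(seed)
    centres = [list(E[i]) for i in rng.sample(range(len(E)), num_prototypes)]
    for _ in range(num_iters):
        assign = [nearest(v, centres) for v in E]
        for c in range(num_prototypes):
            members = [E[i] for i in range(len(E)) if assign[i] == c]
            if members:   # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, [nearest(v, centres) for v in E]

# Hypothetical feature vectors forming two well separated groups
E = [[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]]
prototypes, assign = kmeans_prototypes(E, num_prototypes=2)
```

Nodes that receive the same cluster index are treated as global semantic neighbors of one another, and the cluster center serves as their shared prototype.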

Step 5, calculating a comprehensive loss according to the first contrast learning loss, the second contrast learning loss and the BPR loss function, and using the comprehensive loss to update, by back propagation, the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug respectively, to obtain the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug.

The BPR (Bayesian Personalized Ranking) loss function is expressed in terms of the following quantities: $\sigma$ is a nonlinear activation function; the training data are given as pairs; $(n, d^+)$ indicates an observed association between the lncRNA to be detected $n$ and the target drug $d^+$; and $(n, d^-)$ indicates that the sampled drug $d^-$ (a randomly selected drug node that is not associated with the lncRNA $n$) has no experimentally verified association with the lncRNA to be detected $n$.
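The standard form of the BPR loss sums -log σ(score(n, d⁺) - score(n, d⁻)) over the training triples, with σ the sigmoid function. The sketch below implements that standard form; scoring a pair by the inner product of its two feature vectors is an assumption made for the example, and the function names and toy vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bpr_loss(E_n, E_d, triples):
    """Sum of -log(sigmoid(score(n, d_pos) - score(n, d_neg))) over the triples."""
    loss = 0.0
    for n, d_pos, d_neg in triples:
        margin = dot(E_n[n], E_d[d_pos]) - dot(E_n[n], E_d[d_neg])
        loss += softplus(-margin)   # -log(sigmoid(m)) == log(1 + exp(-m))
    return loss

# Hypothetical feature vectors: lncRNA 0 points towards drug 0, away from drug 1
E_n = [[1.0, 0.0]]
E_d = [[1.0, 0.0], [0.0, 1.0]]
good = bpr_loss(E_n, E_d, [(0, 0, 1)])   # associated drug scores higher
bad = bpr_loss(E_n, E_d, [(0, 1, 0)])    # associated drug scores lower
```

The loss is small when the observed drug outranks the sampled one and grows roughly linearly with the margin when the ranking is reversed.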

In some embodiments of the present application, after step 5 is performed, whether the intermediate feature vectors meet the update termination condition is determined as follows:

Step a, calculating an AUC value and an AUPR value according to the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug.

The AUC value (the area under the ROC (receiver operating characteristic) curve) and the AUPR value (the area under the PR (precision-recall) curve) are calculated here to determine whether the current intermediate feature vector of the lncRNA to be detected and the current intermediate feature vector of the target drug satisfy the update termination condition, i.e. whether they are the optimal intermediate feature vectors. When the update termination condition is not met, the update is repeated until the model has been fitted (that is, until the optimal intermediate feature vector of the lncRNA to be detected and the optimal intermediate feature vector of the target drug have been found), which improves the accuracy of the feature vectors and therefore the accuracy of relevance prediction.

It should be noted that, calculating the AUC value and the AUPR value belongs to common general knowledge, and the calculation process is not described herein.

Step b, if the AUC value and the AUPR value reach the maximum value, determining that the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug meet the preset updating termination condition; otherwise, determining that the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug do not meet the preset updating termination condition.

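Step 6, if the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug meet the preset update termination condition, taking the intermediate feature vector of the lncRNA to be detected as the final feature vector of the lncRNA to be detected and taking the intermediate feature vector of the target drug as the final feature vector of the target drug; otherwise, taking the intermediate feature vector of the lncRNA to be detected as the vector representation of the lncRNA to be detected in step 2, taking the intermediate feature vector of the target drug as the vector representation of the target drug in step 2, and returning to step 2.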
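A minimal sketch of the two metrics and of one possible way to decide that a metric has "reached the maximum value" follows. AUC is computed as the fraction of correctly ordered positive and negative pairs, and AUPR is estimated by average precision. The `has_peaked` rule with a `patience` window is an assumption introduced for the example, since the source does not say how the maximum is detected; the labels and scores are hypothetical.

```python
def auc_score(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def aupr_score(labels, scores):
    """Average precision estimate of AUPR; assumes the scores contain no ties."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

def has_peaked(history, patience):
    """True once the best value is at least `patience` evaluations old (assumed rule)."""
    best = max(range(len(history)), key=lambda i: history[i])
    return len(history) - 1 - best >= patience

# Hypothetical labels (1 = known association) and predicted scores
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = auc_score(labels, scores)
aupr = aupr_score(labels, scores)
```

Under this rule, the loop over steps 2 to 5 would stop once `has_peaked` holds for both the AUC history and the AUPR history.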

Step 7, constructing a relevance prediction model according to the final feature vector of the lncRNA to be detected and the final feature vector of the target drug.

Specifically, the constructed relevance prediction model takes as input the final feature vector of the lncRNA to be detected and the final feature vector of the target drug obtained after steps 1 to 6 have been executed, and outputs the association score of the lncRNA to be detected and the target drug. The higher the association score, the stronger the relevance between the lncRNA to be detected and the target drug.
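The expression of the prediction model is not reproduced in this text. A common choice for embedding-based models of this kind, assumed here purely for illustration, is to score a pair by the inner product of the two final feature vectors and to rank candidate drugs by that score. The function names and toy vectors are hypothetical.

```python
def association_score(e_n, e_d):
    """Inner product of the two final feature vectors (assumed scoring function)."""
    return sum(x * y for x, y in zip(e_n, e_d))

def rank_drugs(e_n, E_d):
    """Drug indices ordered from highest to lowest association score."""
    scores = [association_score(e_n, e_d) for e_d in E_d]
    return sorted(range(len(E_d)), key=lambda d: -scores[d])

# Hypothetical final feature vectors of one lncRNA and three drugs
e_n = [1.0, 0.0]
E_d = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0]]
ranking = rank_drugs(e_n, E_d)
```

Here the scores are 0, 2 and 1, so drug 1 is ranked first, followed by drug 2 and drug 0.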

The specific process of constructing the first contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug in step 3 (constructing the local structural neighbor of the association pair based on the association bipartite graph and constructing the first contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug) is illustrated as follows.

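Step 3.1, obtaining, by a calculation formula, the local structure neighbor learning loss of the lncRNA to be detected and the local structure neighbor learning loss of the target drug; wherein $e_n^{(k)}$ represents the output of the initial feature vector $e_n$ of the lncRNA to be detected at the $k$-th layer of the graph convolutional neural network, $e_d^{(k)}$ represents the output of the initial feature vector $e_d$ of the target drug at the $k$-th layer of the graph convolutional neural network, $e_n^{(0)}$ represents the vector representation of the lncRNA to be detected, $e_d^{(0)}$ represents the vector representation of the target drug, $e_i^{(0)}$ represents the vector representation of the $i$-th lncRNA at layer 0 of the graph convolutional neural network, and $e_j^{(0)}$ represents the vector representation of the $j$-th drug at layer 0 of the graph convolutional neural network, with i = 1, 2, ..., n_num and j = 1, 2, ..., d_num.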

Specifically, the feature vector of a node (an lncRNA node to be detected or a target drug node in the association bipartite graph) and the feature vector of the same node output by an even layer of the graph convolutional network model are taken as a positive pair, the feature vectors of other nodes are taken as negative samples, and an InfoNCE (self-supervised contrastive learning) loss function is used to minimize the distance within positive pairs while pushing negative samples apart.
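The positive and negative pairing described above can be sketched as a standard InfoNCE loss: for node i, the logit of the positive pair is the similarity between its even-layer output and its own layer-0 vector, and the softmax runs over the layer-0 vectors of all nodes of the same type. Using the raw inner product as similarity is an assumption (implementations often L2-normalize the vectors first); the names and toy vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_at(logits, target):
    """log of softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[target] - log_norm

def structural_infonce(E_k, E_0, tau):
    """Sum over nodes of -log softmax of the (even-layer, layer-0) positive pair."""
    loss = 0.0
    for i, e_k in enumerate(E_k):
        logits = [dot(e_k, e_0) / tau for e_0 in E_0]
        loss -= log_softmax_at(logits, i)   # positive pair: same node index i
    return loss

# Hypothetical layer-0 vectors; even-layer outputs either aligned or shuffled
E_0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = structural_infonce(E_0, E_0, tau=0.2)
shuffled = structural_infonce(E_0[1:] + E_0[:1], E_0, tau=0.2)
```

The loss is close to zero when each even-layer output matches its own layer-0 vector and becomes large when the outputs are matched to the wrong nodes. The same function applies to the lncRNA side and to the drug side.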

Step 3.2, through the calculation formula

obtain the first contrast learning loss $L_{local}$,

where α represents a hyperparameter for balancing the weights.

Next, the following illustrates how the second contrast learning loss is constructed in step 4 (constructing the global semantic neighbors of the association pair based on the association bipartite graph, and constructing the second contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug).

Step 4.1, through the calculation formula

obtain the global semantic neighbor contrast learning loss of the lncRNA to be detected and the global semantic neighbor contrast learning loss of the target drug,

where $c_i$ represents the prototype of the lncRNA to be detected, $c_j$ represents the prototype of the drug, and C represents the set of prototypes. A prototype is the center of a cluster in the global semantic neighborhood, within which all lncRNAs or drugs may be associated.

Step 4.2, through the calculation formula

obtain the second contrast learning loss $L_{glocal}$,

where β represents a hyperparameter for balancing the weights.
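Steps 4.1 and 4.2 can be sketched as a contrast between each embedding and the set of prototypes. The code below is an illustration only: the dot-product similarity, the `temperature` parameter, and the weighted-sum form in `global_loss` are assumptions, since the formulas themselves are not reproduced in this text.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def prototype_loss(embeddings, assignments, prototypes, temperature=0.1):
    """InfoNCE against prototypes: a node's own prototype is the positive,
    all other prototypes in the set are negatives."""
    total = 0.0
    for e, cluster in zip(embeddings, assignments):
        logits = [dot(e, c) / temperature for c in prototypes]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[cluster]
    return total


def global_loss(lncrna_term, drug_term, beta):
    """Assumed combination: lncRNA term plus beta times the drug term."""
    return lncrna_term + beta * drug_term


# Toy check: correct cluster assignments give a lower loss than swapped ones.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
matched = prototype_loss(embeddings, [0, 1], prototypes)
mismatched = prototype_loss(embeddings, [1, 0], prototypes)
```

In practice the same `prototype_loss` would be evaluated once for lncRNA embeddings and once for drug embeddings, each with its own prototypes, before combining the two terms.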

The analysis process of step 4 is exemplarily described below.

Global semantic neighbors are constructed primarily to maximize the following likelihood function (a function of the parameters of a statistical model that represents how likely those parameters are):

where Θ represents the model parameters and A represents the association matrix; $c_i$ and $c_j$ represent the latent prototypes of the lncRNA to be detected n and the target drug d, respectively; and p(·) represents the likelihood function to be maximized.

In the embodiment of the present application, based on the EM (Expectation-Maximization) algorithm and the InfoNCE loss function, nodes in the same cluster are defined as positive samples, and nodes in different clusters are regarded as negative samples.

In the embodiment of the present application, in order to optimize the contrast learning of the global semantic neighbors, the lower bound of the likelihood function is obtained through Jensen's inequality, specifically as follows:

where $Q(c_i \mid e_n)$ represents the distribution of $c_i$ and $Q(c_j \mid e_d)$ represents the distribution of $c_j$.
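For orientation, the general shape of such a Jensen lower bound is written out here for the lncRNA side only; the drug side is analogous with $e_d$ and $c_j$. This is a sketch of the standard argument, not a reproduction of the formula in the application: it assumes the likelihood is marginalized over the prototypes in C, and writes the association matrix as A.

```latex
\sum_{n} \log p(e_n \mid \Theta, A)
  = \sum_{n} \log \sum_{c_i \in C} Q(c_i \mid e_n)\,
    \frac{p(e_n, c_i \mid \Theta, A)}{Q(c_i \mid e_n)}
  \;\ge\; \sum_{n} \sum_{c_i \in C} Q(c_i \mid e_n)\,
    \log \frac{p(e_n, c_i \mid \Theta, A)}{Q(c_i \mid e_n)}
```

The inequality holds because the logarithm is concave and $Q(c_i \mid e_n)$ sums to one over the prototypes.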

After $e_n$ ($e_d$) is observed, the above expression is optimized using the EM algorithm:

In the E-step (one of the steps of the EM algorithm), $e_n$ and $e_d$ are fixed, so the K-means algorithm can be applied to the vector representations of the lncRNA to be detected and the target drug to estimate $Q(c_i \mid e_n)$ and $Q(c_j \mid e_d)$. If the lncRNA to be detected n belongs to cluster i and the target drug d belongs to cluster j, then the corresponding cluster centers $c_i$ and $c_j$ are the prototypes of n and d, respectively.
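The E-step can be sketched with a plain K-means pass that returns one prototype per cluster. The code is a minimal illustration: initializing the centers with the first k vectors is a simplification chosen for determinism, and `hard_q` assumes a hard 0/1 estimate of the distribution Q, which the text above does not spell out.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, iterations=20):
    """Plain K-means; returns (assignments, prototypes)."""
    prototypes = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest prototype.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, prototypes[c]))
            for v in vectors
        ]
        # Recompute each prototype as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                prototypes[c] = mean(members)
    return assignments, prototypes


def hard_q(assignments, k):
    """Assumed hard estimate of Q: 1 for the assigned cluster, 0 otherwise."""
    return [[1.0 if a == c else 0.0 for c in range(k)] for a in assignments]


# Toy embeddings forming two well-separated groups.
vectors = [[0, 0], [10, 10], [0, 1], [10, 11]]
assignments, prototypes = k_means(vectors, k=2)
q = hard_q(assignments, 2)
```

Running the same routine separately on lncRNA embeddings and on drug embeddings yields the two prototype sets used by the global semantic neighbor loss.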

For the prototypes $c_i$ and $c_j$, the distributions are

and

The distributions of the other prototypes are

and

It then follows that:

Assuming that, in all clusters, the distributions of the lncRNA to be detected and the target drug are isotropic Gaussian distributions, one obtains:

where $(e_n - c_i)^2 = 2 - 2\, e_n \cdot c_i$ (which holds when $e_n$ and $c_i$ are normalized to unit length). Assuming that each Gaussian has the same standard deviation, represented by the temperature hyperparameter σ, there is:
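The step from a Gaussian to a softmax over prototypes can be made explicit. The worked equation below is a sketch under two assumptions: $e_n$ and every prototype $c$ are normalized to unit length, and all clusters share the same σ. The source may absorb $\sigma^2$ into a single temperature term.

```latex
\|e_n - c_i\|^2 = \|e_n\|^2 + \|c_i\|^2 - 2\, e_n \cdot c_i = 2 - 2\, e_n \cdot c_i
\quad\Longrightarrow\quad
\frac{\exp\!\left(-\|e_n - c_i\|^2 / 2\sigma^2\right)}
     {\sum_{c \in C} \exp\!\left(-\|e_n - c\|^2 / 2\sigma^2\right)}
= \frac{\exp\!\left(e_n \cdot c_i / \sigma^2\right)}
       {\sum_{c \in C} \exp\!\left(e_n \cdot c / \sigma^2\right)}
```

The common factor $\exp(-1/\sigma^2)$ appears in the numerator and in every term of the denominator, so it cancels, leaving an InfoNCE-style ratio.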

The embeddings of the lncRNA to be detected and the target drug are clustered using the K-means algorithm to obtain K clusters for the lncRNA to be detected and for the target drug, respectively. After obtaining

and

the final contrast learning loss (the second contrast learning loss) can be obtained:

The following illustrates the specific process of step 5 (calculating a comprehensive loss according to the first contrast learning loss, the second contrast learning loss and the BPR loss function, and using the comprehensive loss to update, by back propagation, the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug, so as to obtain the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug).

Step 5.1, through the calculation formula

$L = L_{BPR} + \lambda_1 L_{local} + \lambda_2 L_{glocal} + \lambda_3 \|\theta\|_2$

obtain the comprehensive loss L, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ each represent a hyperparameter for balancing the weights, θ represents the parameters of the graph convolutional neural network (a kernel function of the graph convolutional network), σ represents a nonlinear activation function, and τ represents the paired training data, which contains pairs in which the lncRNA to be detected n and the target drug $d^+$ are associated and pairs in which the lncRNA to be detected n and the sampled drug $d^-$ are not associated.

Step 5.2, using the comprehensive loss L, perform a back propagation update on the initial feature vector $e_n$ of the lncRNA to be detected and on the initial feature vector $e_d$ of the target drug, respectively, to obtain the intermediate feature vector $e_n'$ of the lncRNA to be detected and the intermediate feature vector $e_d'$ of the target drug.

It should be noted that back propagation using a loss is common knowledge, and the process is not described here.
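The comprehensive loss of step 5.1 can be sketched as follows. The expression of $L_{BPR}$ is not reproduced in this text, so the code assumes the usual BPR form: σ is taken to be the sigmoid, and the score of a pair is taken to be the inner product of its feature vectors. All function names and toy values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bpr_loss(triples, lncrna_vecs, drug_vecs):
    """Assumed BPR loss over (n, d_pos, d_neg) triples:
    -log sigmoid(score(n, d_pos) - score(n, d_neg))."""
    total = 0.0
    for n, d_pos, d_neg in triples:
        diff = dot(lncrna_vecs[n], drug_vecs[d_pos]) - dot(lncrna_vecs[n], drug_vecs[d_neg])
        # Numerically stable form of log(1 + exp(-diff)) = -log(sigmoid(diff)).
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total


def l2_norm(parameters):
    return math.sqrt(sum(p * p for vec in parameters for p in vec))


def comprehensive_loss(l_bpr, l_local, l_glocal, theta, lam1, lam2, lam3):
    """L = L_BPR + lam1 * L_local + lam2 * L_glocal + lam3 * ||theta||_2."""
    return l_bpr + lam1 * l_local + lam2 * l_glocal + lam3 * l2_norm(theta)


# Toy values (illustrative only).
lncrna_vecs = {"n": [1.0, 0.0]}
drug_vecs = {"d_pos": [1.0, 0.0], "d_neg": [0.0, 1.0]}
l_bpr = bpr_loss([("n", "d_pos", "d_neg")], lncrna_vecs, drug_vecs)
total = comprehensive_loss(1.0, 2.0, 3.0, [[3.0, 4.0]], 0.1, 0.2, 0.5)
```

In an actual implementation the gradient of this scalar with respect to the feature vectors would be obtained by an automatic differentiation framework; that part is omitted here.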

In some embodiments of the present application, there is also provided validity verification of a prediction method of drug association, the result being as follows:

Variant	AUC	AUPR
NDSGCL_N	0.9103	0.9178
NDSGCL_L	0.9285	0.9444
NDSGCL_G	0.9549	0.9267
NDSGCL	0.9734	0.9800

Specifically, to verify the impact of local structural neighbor contrast learning and global semantic neighbor contrast learning on model performance, different variants are constructed in the embodiments of the present application: NDSGCL_N uses no contrast learning, NDSGCL_L uses only local structural neighbors, NDSGCL_G uses only global semantic neighbors, and NDSGCL uses both. The results are shown in the table above. The table indicates that using local structural neighbors alone effectively extracts higher-order relations between nodes, but the AUC gain is modest (0.9103 to 0.9285) because noise is introduced when constructing positive and negative samples. Likewise, using global semantic neighbors alone significantly alleviates data sparsity, but the AUPR gain is small (0.9178 to 0.9267) because higher-order associations between nodes cannot be fully exploited. Combining the two gives the best results (AUC 0.9734, AUPR 0.9800), so the combination is critical to the prediction accuracy of NDSGCL.
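For readers unfamiliar with the AUC metric reported in the table, the following minimal sketch computes it from its pairwise definition. It is illustrative only: the scores and labels are toy values, and the evaluation code used for the table is not described in the source.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy predictions: label 1 marks a known association, label 0 an unobserved one.
toy_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

A value of 1.0 means every associated pair is scored above every non-associated pair, and 0.5 corresponds to random ranking.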

To further evaluate the performance of the method for predicting drug association provided herein, in some embodiments of the present application it is compared with other state-of-the-art methods, as shown in fig. 2a and fig. 2b. The ordinate of fig. 2a represents the true positive rate and the abscissa represents the false positive rate. In fig. 2a, GCN denotes the graph convolutional network method (transferred to the lncRNA-drug association prediction problem); LightGCN denotes an optimization of the graph convolutional network that discards the feature transformation and nonlinear activation of the traditional graph convolutional network and keeps only its node aggregation; GCL-ED denotes a contrast learning method whose data augmentation randomly drops edges; GCL-ND denotes a contrast learning method whose data augmentation randomly drops disease nodes; MLRDFM integrates the similarity of four miRNAs and two diseases and predicts miRNA-disease associations with a deep factorization machine method; and NDSGCL denotes the method for predicting drug association provided herein.

In fig. 2b, Init denotes the feature representation of the original drug nodes; LightGCN denotes an optimization of the graph convolutional network that discards the feature transformation and nonlinear activation of the traditional graph convolutional network and keeps only its node aggregation; GCL-ED denotes a contrast learning method whose data augmentation randomly drops edges; GCL-ND denotes a contrast learning method whose data augmentation randomly drops disease nodes; and NDSGCL denotes the method for predicting drug association provided herein.

As can be seen from fig. 2a and fig. 2b, the method for predicting drug association provided in the present application outperforms the other state-of-the-art methods.

As can be seen from the above steps, the method for predicting the drug relevance provided by the present application constructs a first contrast learning loss and a second contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug, and then performs back propagation update on the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug according to the first contrast learning loss, the second contrast learning loss and the BPR loss, so that a more accurate feature vector can be obtained, thereby improving the accuracy of predicting the relevance of the lncRNA to be detected and the target drug.

The drug association prediction device provided by the application is exemplified in the following in connection with specific embodiments.

As shown in fig. 3, an embodiment of the present application provides a device for predicting drug association, where the device 300 for predicting drug association includes:

the initialization module 301 is configured to construct a correlation bipartite graph according to a pre-acquired correlation pair formed by the lncRNA to be tested and the target drug, and randomly initialize vector representation of the lncRNA to be tested and vector representation of the target drug, respectively;

the aggregation module 302 is configured to run a neural network model on the associated bipartite graph, and perform neighbor node aggregation on the vector representation of the lncRNA to be detected and the vector representation of the target drug to obtain an initial feature vector of the lncRNA to be detected and an initial feature vector of the target drug;

the first contrast learning loss module 303 is configured to construct a local structure neighbor of the association pair based on the association bipartite graph, and construct a first contrast learning loss according to an initial feature vector of the lncRNA to be detected and an initial feature vector of the target drug;

the second contrast learning loss module 304 is configured to construct a global semantic neighbor of the association pair based on the association bipartite graph, and construct a second contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug;

The intermediate feature vector module 305 is configured to calculate a comprehensive loss according to the first contrast learning loss, the second contrast learning loss and the BPR loss function, and update the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug by using the comprehensive loss to obtain an intermediate feature vector of the lncRNA to be detected and an intermediate feature vector of the target drug;

the final feature vector module 306 is configured to determine whether the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug meet a preset update termination condition; if they meet the preset update termination condition, the intermediate feature vector of the lncRNA to be detected is used as the final feature vector of the lncRNA to be detected, and the intermediate feature vector of the target drug is used as the final feature vector of the target drug; otherwise,

the prediction model module 307 is configured to construct a relevance prediction model according to the final feature vector of the lncRNA to be detected and the final feature vector of the target drug;

The prediction module 308 is configured to predict the relevance of the lncRNA to be detected and the target drug by using a relevance prediction model.
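The interplay between the intermediate feature vector module and the final feature vector module amounts to an update loop with a stopping test. The sketch below assumes the termination condition is that AUC and AUPR have stopped improving; `update_step`, `evaluate`, `max_rounds` and `patience` are hypothetical names and parameters introduced for illustration, not part of the device described above.

```python
def train_until_peak(update_step, evaluate, max_rounds=100, patience=3):
    """Repeat update_step() until the (AUC, AUPR) pair from evaluate() stops improving.

    Returns (best_round, best_metrics). `patience` is the number of consecutive
    non-improving rounds tolerated before stopping.
    """
    best_round, best_metrics, stale = 0, None, 0
    for round_index in range(1, max_rounds + 1):
        update_step()          # one back propagation update of the feature vectors
        metrics = evaluate()   # expected to return a tuple (auc, aupr)
        improved = best_metrics is None or (
            metrics[0] > best_metrics[0] and metrics[1] > best_metrics[1]
        )
        if improved:
            best_round, best_metrics, stale = round_index, metrics, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_round, best_metrics


# Toy run: metrics peak in round 2 and then decline.
history = iter([(0.90, 0.91), (0.95, 0.96), (0.94, 0.95), (0.93, 0.94), (0.92, 0.93)])
calls = []
result = train_until_peak(lambda: calls.append(1), lambda: next(history))
```

The feature vectors saved at `best_round` would then play the role of the final feature vectors passed to the prediction model module.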

It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein again.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

As shown in fig. 4, an embodiment of the present application provides a terminal device. The terminal device D10 of this embodiment includes: at least one processor D100 (only one processor is shown in fig. 4), a memory D101, and a computer program D102 stored in the memory D101 and executable on the at least one processor D100. When executing the computer program D102, the processor D100 implements the steps in any of the method embodiments described above.

Specifically, when the processor D100 executes the computer program D102, an association bipartite graph is constructed from the association pairs formed by the lncRNA to be detected and the target drug, and the vector representation of the lncRNA to be detected and the vector representation of the target drug are each randomly initialized. A neural network model is then run on the association bipartite graph, and neighbor node aggregation is performed on the two vector representations to obtain the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug. Next, the local structural neighbors of the association pair are constructed based on the association bipartite graph and a first contrast learning loss is constructed from the two initial feature vectors; the global semantic neighbors of the association pair are constructed in the same way and a second contrast learning loss is constructed from the two initial feature vectors. A comprehensive loss is calculated from the first contrast learning loss, the second contrast learning loss and the BPR loss function, and is used to update the two initial feature vectors by back propagation, giving the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug. If the intermediate feature vectors meet the preset update termination condition, they are taken as the final feature vectors, a relevance prediction model is constructed from the final feature vectors, and the relevance of the lncRNA to be detected and the target drug is predicted using the relevance prediction model. Because the first contrast learning loss and the second contrast learning loss are constructed from the initial feature vectors, and the initial feature vectors are then updated by back propagation according to the first contrast learning loss, the second contrast learning loss and the BPR loss function, more accurate feature vectors can be obtained, which improves the accuracy of predicting the relevance of the lncRNA to be detected and the target drug.

The processor D100 may be a central processing unit (CPU, Central Processing Unit), or it may be another general purpose processor, a digital signal processor (DSP, Digital Signal Processor), an application specific integrated circuit (ASIC, Application Specific Integrated Circuit), a field-programmable gate array (FPGA, Field-Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor.

The memory D101 may in some embodiments be an internal storage unit of the terminal device D10, for example a hard disk or a memory of the terminal device D10. The memory D101 may also be an external storage device of the terminal device D10 in other embodiments, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device D10. Further, the memory D101 may also include both an internal storage unit and an external storage device of the terminal device D10. The memory D101 is used for storing an operating system, an application program, a boot loader (BootLoader), data, other programs, etc., such as program codes of the computer program. The memory D101 may also be used to temporarily store data that has been output or is to be output.

Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.

Embodiments of the present application further provide a computer program product which, when run on a terminal device, causes the terminal device to perform the steps of the method embodiments described above.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, all or part of the flow of the methods of the above embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include at least: any entity or device capable of carrying the computer program code to the drug association prediction device/terminal device, a recording medium, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer readable media may not include electrical carrier signals and telecommunication signals.

In the foregoing embodiments, each embodiment is described with its own emphasis. For parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The application has the following advantages:

1. A general computing framework for predicting the relevance of lncRNAs and drugs is provided. In this framework, local structural neighbors are constructed to extract high-order relevance between association pairs, and global semantic neighbors are constructed to relieve the influence of data sparsity on experimental results.

2. Compared with the prior art, introducing the idea of contrast learning relieves the influence of data sparsity on experimental results and effectively improves prediction accuracy. Moreover, large-scale lncRNA-drug association data can be predicted in a very short time, which reduces the blindness and cost of biological experiments.

While the foregoing is directed to the preferred embodiments of the present application, it should be noted that modifications and adaptations to those embodiments may occur to one skilled in the art and that such modifications and adaptations are intended to be comprehended within the scope of the present application without departing from the principles set forth herein.

Claims

1. A method for predicting drug association, comprising:

step 1, constructing a correlation bipartite graph according to a correlation pair formed by a to-be-detected lncRNA and a target drug, and randomly initializing vector representation of the to-be-detected lncRNA and vector representation of the target drug respectively;

step 3, constructing a local structural neighbor of the association pair based on the association bipartite graph, and constructing a first contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug;

step 5, calculating comprehensive loss according to the first comparison learning loss, the second comparison learning loss and a BPR loss function, and respectively and reversely transmitting and updating an initial feature vector of the lncRNA to be detected and an initial feature vector of the target drug by utilizing the comprehensive loss to obtain an intermediate feature vector of the lncRNA to be detected and an intermediate feature vector of the target drug;

step 6, if the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug meet a preset update termination condition, taking the intermediate feature vector of the lncRNA to be detected as a final feature vector of the lncRNA to be detected, and taking the intermediate feature vector of the target drug as a final feature vector of the target drug; otherwise,

and step 8, predicting the relevance of the lncRNA to be detected and the target drug by using the relevance prediction model.

2. The prediction method according to claim 1, wherein the neural network model in step 2 is a graph convolutional neural network model;

in the step 2, running a neural network model on the association bipartite graph, and performing neighbor node aggregation on the vector representation of the lncRNA to be detected and the vector representation of the target drug respectively to obtain an initial feature vector of the lncRNA to be detected and an initial feature vector of the target drug, includes:

through the calculation formula

obtaining the initial feature vector $e_n$ of the lncRNA to be detected and the initial feature vector $e_d$ of the target drug; where $N_n$ represents the set of neighbor nodes of the lncRNA to be detected, $N_d$ represents the set of neighbor nodes of the target drug, L represents the total number of layers of the graph convolutional neural network, and the remaining terms of the formula represent, respectively, the node vector embedding of the lncRNA to be detected at the l-th layer of the graph convolutional neural network, the node vector embedding of the target drug at the l-th layer, the node vector embedding of the lncRNA to be detected at the (l+1)-th layer, and the node vector embedding of the target drug at the (l+1)-th layer.

3. The prediction method according to claim 2, wherein the constructing a first contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug in the step 3 includes:

through the calculation formula

obtaining the local structural neighbor learning loss of the lncRNA to be detected and the local structural neighbor learning loss of the target drug,

where the terms of the formula represent, respectively, the output of the initial feature vector $e_n$ of the lncRNA to be detected at the k-th layer of the graph convolutional neural network, the vector representation of the lncRNA to be detected, and the vector representation of the target drug; n_num represents the total number of lncRNAs obtained in step 1, and d_num represents the total number of drugs obtained in step 1;

through the calculation formula

obtaining the first contrast learning loss $L_{local}$, where α represents a hyperparameter for balancing the weights.

4. The prediction method according to claim 3, wherein the constructing a second contrast learning loss according to the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug in the step 4 includes:

through the calculation formula

obtaining the global semantic neighbor contrast learning loss of the lncRNA to be detected and the global semantic neighbor contrast learning loss of the target drug,

where $c_i$ represents the prototype of the lncRNA to be detected, $c_j$ represents the prototype of the drug, and C represents the set of prototypes;

through the calculation formula

obtaining the second contrast learning loss $L_{glocal}$, where β represents a hyperparameter for balancing the weights.

5. The prediction method according to claim 4, wherein the step 5 includes:

through the calculation formula

$L = L_{BPR} + \lambda_1 L_{local} + \lambda_2 L_{glocal} + \lambda_3 \|\theta\|_2$

obtaining the comprehensive loss L; where $\lambda_1$, $\lambda_2$ and $\lambda_3$ each represent a hyperparameter for balancing the weights, θ represents the parameters of the graph convolutional neural network, σ represents a nonlinear activation function, and τ represents the paired training data, which contains pairs in which the lncRNA to be detected n and the target drug $d^+$ are associated and pairs in which the lncRNA to be detected n and the sampled drug $d^-$ are not associated;

using the comprehensive loss L, performing a back propagation update on the initial feature vector $e_n$ of the lncRNA to be detected and on the initial feature vector $e_d$ of the target drug, respectively, to obtain the intermediate feature vector $e_n'$ of the lncRNA to be detected and the intermediate feature vector $e_d'$ of the target drug.

6. The prediction method according to claim 5, wherein the expression of the relevance prediction model in step 7 is as follows:

where the model output represents the association score between the lncRNA to be detected n and the target drug d.

7. The prediction method according to claim 1, characterized in that before performing said step 6, said prediction method further comprises:

if the AUC value and the AUPR value reach their maximum values, determining that the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug meet the preset update termination condition; otherwise, determining that the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug do not meet the preset update termination condition.

8. A device for predicting drug association, comprising:

the intermediate feature vector module is used for calculating comprehensive loss according to the first comparison learning loss, the second comparison learning loss and the BPR loss function, and respectively and reversely transmitting and updating the initial feature vector of the lncRNA to be detected and the initial feature vector of the target drug by utilizing the comprehensive loss to obtain the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug;

the final feature vector module is used for judging whether the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug meet a preset update termination condition; if the intermediate feature vector of the lncRNA to be detected and the intermediate feature vector of the target drug meet the preset update termination condition, taking the intermediate feature vector of the lncRNA to be detected as a final feature vector of the lncRNA to be detected and taking the intermediate feature vector of the target drug as a final feature vector of the target drug; otherwise,

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method of predicting drug relevance according to any one of claims 1 to 7 when executing the computer program.

10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method of predicting drug relevance according to any one of claims 1 to 7.