WO2021208993A1 - Information processing method and apparatus for predicting drug target

Info

Publication number: WO2021208993A1
Application number: PCT/CN2021/087362
Authority: WO
Inventors: 蒋华良; 郑明月; 钟飞盛; 吴小龙; 李叙潼
Original assignee: 中国科学院上海药物研究所
Priority date: 2020-04-17
Filing date: 2021-04-15
Publication date: 2021-10-21
Also published as: CN113539366A

Abstract

Disclosed are an information processing method and an apparatus for predicting a drug target, thereby improving the accuracy of predicting the drug target. The method comprises: obtaining a compound perturbation spectrum corresponding to a compound (S11); obtaining a gene perturbation spectrum corresponding to a target gene on which the compound acts (S12); determining the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum (S13); and predicting the probability that the compound can act on the target gene according to the degree of correlation and preset experimental condition data (S14). The correlation between the compound perturbation spectrum and the gene perturbation spectrum is considered in the determination process for determining whether the compound can act on the target gene, so as to improve the accuracy of predicting the drug target.

Description

Information processing method and device for predicting drug target

Technical field

This application relates to the field of artificial intelligence, and in particular to an information processing method and device for predicting drug targets.

Background

Computational models for predicting drug targets help to deepen our understanding of the molecular mechanisms of drug action, metabolic pathways, adverse effects and drug resistance. In recent years, the rapid growth of multi-omics data and the rapid development of artificial intelligence have laid the foundation for computational methods of drug target inference and prediction.

At present, the techniques for predicting drug targets using gene expression profiles or transcriptome data mainly include: comparative analysis methods, network-based analysis methods, and machine learning methods.

Among them, the comparative analysis method makes predictions based on the similarity of characteristic differentially expressed genes; an example is CMap, developed by the Broad Institute. The network-based method takes a systems biology perspective and integrates gene expression profiles with cellular networks to predict drug targets. For example, the ProTINA method developed by Noh et al. builds a cell-type-specific protein-gene regulatory network and uses a dynamic model to infer drug targets from differential gene expression profiles, with good prediction results. In addition, various machine learning algorithms have been used to mine transcriptional profile data for drug target prediction. For example, Pabon et al. used a random forest (RF) model to predict drug targets by analyzing the correlation between drug-induced and gene-knockdown transcriptional profiles.

However, the above prior-art methods still have drawbacks. For example, they do not adequately explore the correlation between the compound perturbation spectrum and the gene perturbation spectrum, and the accuracy of drug target prediction leaves considerable room for improvement. Therefore, providing an information processing method for predicting drug targets that explores the correlation between the compound perturbation spectrum and the gene perturbation spectrum, and thereby improves prediction accuracy, is a technical problem that urgently needs to be solved.

Summary of the invention

The purpose of the embodiments of the present application is to provide an information processing method for predicting drug targets, so as to improve the accuracy of drug target prediction.

In order to solve the above technical problem, the embodiments of the present application adopt the following technical solution: an information processing method for predicting drug targets, including:

Obtaining a compound perturbation spectrum corresponding to a compound;

Obtaining a gene perturbation spectrum corresponding to a target gene on which the compound acts;

Determining the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum;

Predicting, according to the degree of correlation and preset experimental condition data, the probability that the compound acts on the target gene.

The beneficial effect of the present application is that the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum is determined, and the probability that the compound acts on the target gene is then predicted from the degree of correlation and the experimental condition data. Because the correlation between the two spectra is taken into account when judging whether the compound acts on the target gene, the accuracy of drug target prediction is improved.

In an embodiment, the determining the degree of correlation between the perturbation spectrum of the compound and the perturbation spectrum of the gene includes:

The correlation degree between the perturbation spectrum of the compound and the perturbation spectrum of the gene is calculated based on a first preset algorithm.

In one embodiment, when the degree of correlation is the Pearson correlation coefficient of the compound perturbation spectrum and the gene perturbation spectrum, predicting the probability that the compound acts on the target gene according to the degree of correlation and preset experimental condition data includes:

Obtain preset experimental condition data;

The Pearson correlation coefficient and the experimental condition data are substituted into a second preset algorithm to obtain a score of the interaction probability of the compound and the target gene.

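In an embodiment, the determining the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum includes:

Input the compound perturbation spectrum and the gene perturbation spectrum into a feature extraction network to perform feature extraction on the compound perturbation spectrum and the gene perturbation spectrum;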

Acquiring a first vector corresponding to the compound perturbation spectrum and a second vector corresponding to the gene perturbation spectrum output by the feature extraction network;

Inputting the first vector and the second vector into a calculation module;

Obtaining the Pearson correlation coefficient of the first vector and the second vector output by the calculation module.

The beneficial effect of this embodiment is that the feature vectors corresponding to the compound perturbation spectrum and the gene perturbation spectrum, namely the first vector and the second vector, are computed by the neural network, and the Pearson correlation coefficient of the two vectors is then obtained from the calculation module. This coefficient is taken as the Pearson correlation coefficient of the compound perturbation spectrum and the gene perturbation spectrum and characterizes the degree of correlation between them. In this embodiment, the degree of correlation is therefore obtained through the neural network, which simplifies its determination.

In an embodiment, the predicting the probability that the compound can have an effect on the target gene according to the correlation degree and preset experimental condition data includes:

Obtain preset experimental condition data;

Input the Pearson correlation coefficient and the experimental condition data into the classification module;

Obtain the score of the interaction probability between the compound and the target gene output by the classification module.

In an embodiment, the preset experimental condition data includes at least one of the following data:

Compound perturbation duration, compound dosage, gene knockdown duration and cell type.

In one embodiment, when there are multiple types of target genes, the method further includes:

Obtain scores of the interaction probabilities of various target genes and the compound respectively;

Sort the scores corresponding to the various target genes;

It is determined that the target gene corresponding to the highest score value interacts with the compound.

This application also provides an information processing device for predicting drug targets, including:

The first acquisition module is used to acquire the perturbation spectrum of the compound corresponding to the compound;

The second acquisition module is used to acquire the gene perturbation spectrum corresponding to the target gene on which the compound acts;

A determining module for determining the degree of correlation between the perturbation spectrum of the compound and the perturbation spectrum of the gene;

The prediction module is used to predict the probability that the compound can have an effect on the target gene according to the correlation degree and preset experimental condition data.

In an embodiment, the determining module includes:

The first input sub-module is used to input the compound perturbation spectrum and the gene perturbation spectrum into a feature extraction network to perform feature extraction on the compound perturbation spectrum and the gene perturbation spectrum;

The first acquisition sub-module is configured to acquire a first vector corresponding to the compound perturbation spectrum and a second vector corresponding to the gene perturbation spectrum output by the feature extraction network;

A second input sub-module for inputting the first vector and the second vector into a calculation module;

The second acquisition sub-module is configured to acquire the Pearson correlation coefficient of the first vector and the second vector output by the calculation module.

In an embodiment, the prediction module includes:

The third acquisition sub-module is used to acquire preset experimental condition data;

The third input submodule is used to input the Pearson correlation coefficient and the experimental condition data into the classification module;

The fourth obtaining submodule is used to obtain the score of the interaction probability between the compound and the target gene output by the classification module.

Description of the drawings

FIG. 1 is a flowchart of an information processing method for predicting drug targets according to an embodiment of the application;

FIG. 2 is a flowchart of an information processing method for predicting drug targets according to another embodiment of the application;

FIG. 3 is a flowchart of an information processing method for predicting drug targets according to another embodiment of the application;

FIG. 4 is a block diagram of an information processing device for predicting drug targets according to an embodiment of the application;

FIG. 5 is a block diagram of an information processing device for predicting drug targets according to another embodiment of the application, showing the main architecture of the determining module of this embodiment;

FIG. 6 is a block diagram of an information processing device for predicting drug targets according to another embodiment of the application, showing the main architecture of the prediction module of this embodiment.

Detailed description

Various solutions and features of the present application are described here with reference to the drawings.

It should be understood that various modifications can be made to the embodiments disclosed herein. Therefore, the above description should not be regarded as limiting, but merely as examples of embodiments. Those skilled in the art will think of other modifications within the scope and spirit of this application.

The drawings included in the specification and constituting a part of the specification illustrate the embodiments of the application, and together with the general description of the application given above and the detailed description of the embodiments given below, are used to explain the application principle.

These and other characteristics of the present application will become apparent from the following description of preferred forms of embodiments given as non-limiting examples with reference to the accompanying drawings.

It should also be understood that although the application has been described with reference to some specific examples, those skilled in the art can certainly implement many other equivalent forms of the application, which have the features described in the claims and therefore all fall within the scope of protection defined thereby.

When combined with the drawings, the above and other aspects, features and advantages of the present application will become more apparent in view of the following detailed description.

Hereinafter, specific embodiments of the present application will be described with reference to the accompanying drawings; however, it should be understood that the disclosed embodiments are merely examples of the present application, which can be implemented in various ways. Well-known and/or repeated functions and structures are not described in detail, so that unnecessary or redundant details do not obscure the present application. Therefore, the specific structural and functional details disclosed herein are not intended to be limiting, but merely serve as a basis for the claims and as a representative basis for teaching those skilled in the art to use the present application in a variety of ways with substantially any suitable detailed structure.

This specification may use the phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments," each of which can refer to one or more of the same or different embodiments in accordance with the present application.

Fig. 1 is a flowchart of an information processing method for predicting drug targets according to an embodiment of the application. The method includes the following steps S11-S14:

In step S11, obtain the perturbation spectrum of the compound corresponding to the compound;

In step S12, obtain the gene perturbation spectrum corresponding to the target gene that the compound acts on;

In step S13, determine the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum;

In step S14, the probability that the compound can have an effect on the target gene is predicted based on the degree of correlation and preset experimental condition data.

In this embodiment, the compound perturbation spectrum corresponding to the compound is obtained. The compound perturbation spectrum expresses the difference between the gene expression profile of the cells after the drug is added and the gene expression profile of the cells in their normal state. In this embodiment, the compound refers to the compound in the drug whose target is to be predicted.

Further, the perturbation spectrum of the compound is determined in the following way:

After the selected small-molecule compound is co-incubated with specific cells and positive and negative control groups are set up, the differentially expressed genes are analyzed by sequencing to obtain the compound perturbation spectrum. Alternatively, the compound perturbation spectrum can be obtained by searching existing databases. From the compound-perturbed differential gene expression profile, the differential expression values of 978 marker genes are extracted to form a 978-dimensional feature vector, which represents the compound perturbation spectrum.
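As a rough illustration of how such a feature vector could be assembled, the following minimal sketch takes replicate expression values for treated and control cells and computes a per-gene difference for a chosen list of marker genes. The function name `perturbation_spectrum`, the gene names, the toy numbers and the use of a simple difference of means are all assumptions made for illustration; the text does not specify how the differential expression values are normalized.

```python
from statistics import mean
from typing import Dict, List


def perturbation_spectrum(
    treated: Dict[str, List[float]],
    control: Dict[str, List[float]],
    marker_genes: List[str],
) -> List[float]:
    """Per-gene difference (treated mean minus control mean).

    Assumption: a plain difference of replicate means stands in for the
    differential expression value; real pipelines may use other statistics.
    """
    return [mean(treated[g]) - mean(control[g]) for g in marker_genes]


# Hypothetical toy data: replicate expression values per gene.
treated = {"GENE_A": [5.2, 5.4], "GENE_B": [3.0, 3.2], "GENE_C": [7.1, 6.9]}
control = {"GENE_A": [5.0, 5.0], "GENE_B": [3.5, 3.5], "GENE_C": [7.0, 7.0]}

# Two genes stand in for the 978 marker genes of the text.
markers = ["GENE_A", "GENE_B"]
spectrum = perturbation_spectrum(treated, control, markers)
```

The same function could be applied to knockdown versus normal cells to build a gene perturbation spectrum of the same dimensionality, so that the two vectors are directly comparable.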

After the compound perturbation spectrum is obtained, the gene perturbation spectrum corresponding to the target gene on which the compound acts is obtained. The gene perturbation spectrum characterizes the difference between the expression profile of the cells after gene knockdown and the expression profile of the cells in their normal state. The degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum is then determined, and the probability that the compound acts on the target gene is predicted from the degree of correlation and the preset experimental condition data. It should be noted that, in most cases, a compound physically interacts with the protein encoded by a gene rather than with the gene itself. Therefore, the effect of the compound on the target gene includes its effect on the protein encoded by the target gene.

The beneficial effect of this application is that the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum is determined, and the probability that the compound acts on the target gene is then predicted from the degree of correlation and the experimental condition data. Because the correlation between the two spectra is taken into account when determining whether the compound acts on the target gene, the accuracy of drug target prediction is improved.

In an embodiment, the above step S13 may be implemented as the following steps:

The correlation degree between the compound perturbation spectrum and the gene perturbation spectrum is calculated based on the first preset algorithm.

In this embodiment, the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum can be computed by an algorithm, which can be implemented in an application program. The algorithm is as follows:

First, obtain the compound perturbation spectrum and the gene perturbation spectrum. In practice, the compound perturbation spectrum is represented by a 978-dimensional vector, denoted $C = (c_1, c_2, c_3, \dots, c_{978})$. For any $i$ ($i = 1, \dots, 978$), $c_i$ is the differential expression value of gene $i$ after compound perturbation, that is, the difference between the expression of gene $i$ after the drug is added to the cells and its expression in the normal state of the cells.

The gene perturbation spectrum is likewise a 978-dimensional vector, denoted $G = (g_1, g_2, g_3, \dots, g_{978})$. For any $i$ ($i = 1, \dots, 978$), $g_i$ is the differential expression value of gene $i$ after gene knockdown, that is, the difference between the expression of gene $i$ after gene knockdown and its expression in the normal state of the cells.

The experimental condition data is a 4-dimensional vector $E = (t_1, d, t_2, l)$, where $t_1$ is the compound perturbation duration, $d$ is the compound dose, $t_2$ is the gene knockdown duration, and $l$ is the cell line type.

The protein-protein interaction network (PPI network) is represented by a connection matrix, denoted $A$.

For convenience of explanation, and without loss of generality, only the differential expression of two genes is considered below, so that $C = (c_1, c_2)$ and $G = (g_1, g_2)$.

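To make the process easier to follow, let $C = (0.1, 0.3)$, $G = (0.1, 0.3)$ and $E = (24, 10, 96, 1)$, together with a connection matrix $A$ for the two genes. The numerical matrices of this worked example are not reproduced in this text; only the defining relations and the stated results are given.

From the connection matrix $A$, the degree matrix $D$ is obtained.

The Laplacian matrix is

$$L = D - A$$

The normalized Laplacian matrix is

$$L_{sys} = D^{-1/2} L D^{-1/2}$$

Spectral decomposition of this matrix gives

$$L_{sys} = U \lambda U^{T}$$

where the columns of $U$ are the eigenvectors and $\lambda$ is the diagonal matrix of eigenvalues.

Without loss of generality, a parameter matrix $\omega$ is chosen. The graph convolution of a signal $f$ is

$$(f * h)_{graph} = U \omega U^{T} f$$

Setting $f = C$ gives an intermediate result $l_1$.

Define the relu function as $\mathrm{relu}(x) = \max(0, x)$, applied element-wise. Then

$$l_{1,relu} = \mathrm{relu}(l_1) = (0.03, 0.00)$$

For simplicity, the 200-dimensional graph embedding is not generated here; only a 2-dimensional compound perturbation graph embedding is generated, denoted $E_1$.

Without loss of generality, a second parameter matrix is chosen, from which the compound perturbation graph embedding $E_1$ is obtained.

In the same way, the gene knockdown graph embedding is obtained: $E_2 = (0.03, 0.03, 0.03)$.

The Pearson correlation coefficient $r$ of $E_1$ and $E_2$ is then calculated, giving

$$\text{Pearson } R^2 = r \cdot r = 1$$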
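The chain of operations in this worked example (Laplacian, normalization, spectral decomposition, graph filtering, relu, Pearson correlation) can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the two-gene connection matrix `A` and the filter parameters `omega` are hypothetical values chosen here, since the matrices of the example are not available, so the intermediate numbers differ from those quoted in the text. The sketch also assumes that every gene has at least one connection, so that the degree matrix is invertible.

```python
import numpy as np


def normalized_laplacian(A: np.ndarray) -> np.ndarray:
    """D^{-1/2} (D - A) D^{-1/2} for a symmetric connection matrix A."""
    deg = A.sum(axis=1)                      # node degrees (assumed > 0)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ (np.diag(deg) - A) @ d_inv_sqrt


def spectral_filter(A: np.ndarray, f: np.ndarray, omega: np.ndarray) -> np.ndarray:
    """Apply U diag(omega) U^T f, with U the eigenvectors of the Laplacian."""
    _, U = np.linalg.eigh(normalized_laplacian(A))
    return U @ np.diag(omega) @ U.T @ f


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


A = np.array([[0.0, 1.0], [1.0, 0.0]])  # assumed two-gene PPI graph
omega = np.array([0.5, 0.1])            # assumed spectral filter parameters
C = np.array([0.1, 0.3])                # compound perturbation spectrum
G = np.array([0.1, 0.3])                # gene perturbation spectrum

E1 = relu(spectral_filter(A, C, omega))  # compound embedding
E2 = relu(spectral_filter(A, G, omega))  # gene knockdown embedding
r = float(np.corrcoef(E1, E2)[0, 1])     # Pearson correlation coefficient
r_squared = r * r
```

Because `C` and `G` are identical in the example, the two embeddings coincide and the correlation is 1 regardless of the particular `A` and `omega`, which agrees with the result stated in the text.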

In an embodiment, when the degree of correlation is the Pearson correlation coefficient of the compound perturbation spectrum and the gene perturbation spectrum, the above step S14 can be implemented as the following steps A1-A2:

In step A1, obtain preset experimental condition data;

In step A2, the Pearson correlation coefficient and experimental condition data are substituted into the second preset algorithm to obtain the score of the interaction probability between the compound and the target gene.

In this embodiment, the preset experimental condition data $E = (t_1, d, t_2, l)$ is obtained; under the specific experimental conditions of this example, $t_1 = 24$, $d = 10$, $t_2 = 96$ and $l = 1$. Pearson $R^2$ is combined with the four-dimensional experimental condition vector $E$ to obtain a five-dimensional vector, denoted $v_5$:

$$v_5 = (24, 10, 96, 1, 1)$$

A parameter matrix is chosen (its values are not reproduced in this text) that maps $v_5$ to a two-dimensional vector $o$. The exponentials are then computed and normalized:

$$o_{exp} = e^{o} = (e^{132}, e^{132})$$

$$sum = e^{132} + e^{132}$$

$$output = o_{exp} / sum = (0.5, 0.5)$$

The output is a two-dimensional vector, and its first dimension is taken as the compound-protein interaction (CPI) score:

CPI score = output[1] = 0.5

That is, substituting the Pearson correlation coefficient and the experimental condition data into the second preset algorithm gives a score of 0.5 for the interaction probability between the compound and the target gene.
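The arithmetic of this example can be checked with a short sketch. The parameter matrix is not given, so the all-ones weights `W` below are an assumption, chosen only because they reproduce the value 132 in both components (24 + 10 + 96 + 1 + 1 = 132). The names `softmax` and `cpi_score` are illustrative. The maximum is subtracted before exponentiation, a standard step that leaves the result unchanged while keeping the exponentials small.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: exp(z - max) / sum."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cpi_score(
    pearson_r2: float,
    conditions: Sequence[float],
    weights: Sequence[Sequence[float]],
) -> float:
    """Append R^2 to the condition vector, apply a linear map, take softmax."""
    v5 = list(conditions) + [pearson_r2]
    logits = [sum(x * w for x, w in zip(v5, column)) for column in weights]
    return softmax(logits)[0]  # first dimension is the CPI score


E = (24, 10, 96, 1)          # t1, d, t2, l from the example
W = [[1.0] * 5, [1.0] * 5]   # assumed all-ones parameters, one list per output
score = cpi_score(1.0, E, W)
```

With equal logits the two classes are indistinguishable, so the score of 0.5 in the example simply reflects the uninformative illustrative parameters rather than a trained model.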

In an embodiment, the above step S13 can be implemented as the following steps B1-B4:

In step B1, input the compound perturbation spectrum and the gene perturbation spectrum into the feature extraction network to perform feature extraction on the compound perturbation spectrum and the gene perturbation spectrum;

In step B2, obtain the first vector corresponding to the compound perturbation spectrum and the second vector corresponding to the gene perturbation spectrum output by the feature extraction network;

In step B3, input the first vector and the second vector into the calculation module;

In step B4, the Pearson correlation coefficient of the first vector and the second vector output by the calculation module is obtained.

In this embodiment, the compound perturbation spectrum (also called the compound perturbation transcriptional profile feature; a 978-dimensional vector in practice) and the gene perturbation spectrum (also called the gene knockdown transcriptional profile feature; likewise a 978-dimensional vector in practice) are first obtained and then passed through the feature extraction network. In this embodiment, the feature extraction network is a spectral-based graph convolutional network (GCN). Two parallel GCNs extract features from the compound perturbation spectrum and the gene perturbation spectrum respectively, retaining the key features and reducing dimensionality. After feature extraction, the network outputs the first vector, corresponding to the compound perturbation spectrum, and the second vector, corresponding to the gene perturbation spectrum. Because both vectors are obtained by dimensionality reduction of the corresponding 978-dimensional feature vectors, each has fewer than 978 dimensions. The first vector and the second vector are input into the calculation module, and the Pearson correlation coefficient of the two vectors output by the calculation module is obtained.
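The calculation module of steps B3 and B4 amounts to the standard Pearson formula applied to the two embedding vectors. A minimal pure-Python sketch is shown below; the function name `pearson`, the toy vectors and the error handling for constant vectors are illustrative assumptions and are not taken from the application.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined when either vector is constant.
        raise ValueError("Pearson correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical low-dimensional outputs of the two feature extractors.
first_vector = [0.2, 0.5, 0.1, 0.9]
second_vector = [0.4, 1.0, 0.2, 1.8]  # a scaled copy of the first vector
r = pearson(first_vector, second_vector)
```

Since the coefficient is invariant to positive scaling and shifting of either vector, it compares the shape of the two embeddings rather than their magnitude.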

In an embodiment, as shown in FIG. 2, the above-mentioned step S14 can be implemented as the following steps S21-S23:

In step S21, obtain preset experimental condition data;

In step S22, the Pearson correlation coefficient and experimental condition data are input into the classification module;

In step S23, the score of the interaction probability between the compound and the target gene output by the classification module is obtained.

Preset experimental condition data are obtained. Specifically, the preset experimental condition data may include at least one of the following: compound perturbation duration, compound dosage, gene knockdown duration and cell type.

The Pearson correlation coefficient and the experimental condition data are input into the classification module. In this example, the classification module consists of a fully connected hidden layer (used to extract input features) and an output layer (used to decide whether a compound-protein target interaction exists). The score of the interaction probability between the compound and the target gene output by the classification module is then obtained.
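A forward pass through a classifier of this shape (one fully connected hidden layer followed by a two-class output layer) might look like the sketch below. The layer sizes, the relu activation, the softmax output and all parameter values are assumptions made for illustration; they are untrained placeholder numbers, not parameters from the application.

```python
import math
from typing import List, Sequence


def dense(x: Sequence[float], weights: Sequence[Sequence[float]],
          bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]


def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(r: float, conditions: Sequence[float],
             W1, b1, W2, b2) -> List[float]:
    """Hidden layer with relu, then a two-class softmax output layer."""
    features = [*conditions, r]              # (t1, d, t2, l, r)
    hidden = [max(0.0, h) for h in dense(features, W1, b1)]
    return softmax(dense(hidden, W2, b2))


# Hypothetical, untrained parameters: 5 inputs -> 3 hidden units -> 2 classes.
W1 = [[0.01, 0.0, 0.0, 0.0, 0.5],
      [0.0, 0.02, 0.0, 0.0, -0.5],
      [0.0, 0.0, 0.01, 0.1, 0.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, -1.0, 0.5],
      [-1.0, 1.0, -0.5]]
b2 = [0.0, 0.0]

probs = classify(1.0, (24, 10, 96, 1), W1, b1, W2, b2)
score = probs[0]  # probability assigned to the "interaction" class
```

In a real system the parameters would be learned from labeled compound-target pairs, and the experimental conditions would typically be scaled or encoded before being fed to the network.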

In this embodiment, heterogeneous experimental condition information is integrated, so that the dependence of differential gene expression on cell line background, dose and time is taken into account in drug target inference, which further improves prediction accuracy.

In an embodiment, as shown in FIG. 3, when there are multiple types of target genes, the method can also be implemented as the following steps S31-S33:

In step S31, scores of the interaction probabilities of various target genes and compounds are obtained respectively;

In step S32, the scores corresponding to various target genes are sorted;

In step S33, it is determined that the target gene corresponding to the highest score value interacts with the compound.

In this embodiment, when there are multiple target genes, the score of the interaction probability between each target gene and the compound is obtained; that is, the aforementioned steps S11-S14 are performed once for each target gene. The calculated scores are then sorted, and the target gene with the highest score is determined to interact with the compound. In other words, the target gene with the highest score is the target of the drug corresponding to the compound.
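Steps S31-S33 reduce to sorting the per-gene scores and taking the top entry, as in the following minimal sketch. The gene names and score values are hypothetical placeholders; in practice each score would come from running steps S11-S14 for that gene.

```python
from typing import Dict, List, Tuple


def rank_targets(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidate target genes by interaction score, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for three candidate target genes.
scores = {"GENE_A": 0.12, "GENE_B": 0.87, "GENE_C": 0.45}

ranking = rank_targets(scores)
predicted_target = ranking[0][0]  # gene with the highest score
```

Keeping the full ranking, rather than only the top gene, also makes it possible to inspect the next-best candidates when the leading scores are close.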

Fig. 4 is a block diagram of an information processing device for predicting drug targets according to an embodiment of the application. The device includes the following modules:

The first obtaining module 41 is used to obtain the perturbation spectrum of the compound corresponding to the compound;

The second obtaining module 42 is used to obtain the gene perturbation spectrum corresponding to the target gene on which the compound acts;

The determination module 43 is used to determine the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum;

The prediction module 44 is used to predict the probability that the compound can have an effect on the target gene according to the degree of correlation and preset experimental condition data.
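One possible way to picture how the four modules cooperate is as a small pipeline object whose fields correspond to modules 41-44. The sketch below is purely illustrative: the class name `TargetPredictionDevice`, the callable signatures and the stub implementations are assumptions, and the application does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = Sequence[float]


@dataclass
class TargetPredictionDevice:
    first_acquisition: Callable[[str], Vector]      # module 41
    second_acquisition: Callable[[str], Vector]     # module 42
    determining: Callable[[Vector, Vector], float]  # module 43
    prediction: Callable[[float, Vector], float]    # module 44

    def run(self, compound_id: str, gene_id: str, conditions: Vector) -> float:
        compound_spectrum = self.first_acquisition(compound_id)
        gene_spectrum = self.second_acquisition(gene_id)
        correlation = self.determining(compound_spectrum, gene_spectrum)
        return self.prediction(correlation, conditions)


# Stub modules with hypothetical behavior, used only to exercise the pipeline.
device = TargetPredictionDevice(
    first_acquisition=lambda compound_id: [0.1, 0.3],
    second_acquisition=lambda gene_id: [0.1, 0.3],
    determining=lambda c, g: 1.0 if list(c) == list(g) else 0.0,
    prediction=lambda r, e: 0.5 * r,
)
score = device.run("compound-1", "gene-1", (24, 10, 96, 1))
```

Passing the modules in as callables keeps the data flow of the device explicit while leaving each module free to be replaced, for example by a database lookup for the acquisition modules or a trained network for the determining module.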

In one embodiment, as shown in FIG. 5, the determining module 43 includes:

The first input submodule 51 is used to input the compound perturbation spectrum and the gene perturbation spectrum into the feature extraction network to perform feature extraction on the compound perturbation spectrum and the gene perturbation spectrum;

The first acquisition sub-module 52 is configured to acquire the first vector corresponding to the compound perturbation spectrum and the second vector corresponding to the gene perturbation spectrum output by the feature extraction network;

The second input submodule 53 is used to input the first vector and the second vector into the calculation module;

The second acquisition sub-module 54 is used to acquire the Pearson correlation coefficient of the first vector and the second vector output by the calculation module.

In one embodiment, as shown in FIG. 6, the prediction module 44 includes:

The third acquiring sub-module 61 is used to acquire preset experimental condition data;

The third input submodule 62 is used to input Pearson correlation coefficient and experimental condition data into the classification module;

The fourth obtaining submodule 63 is used to obtain the score of the interaction probability between the compound and the target gene output by the classification module.

The above embodiments are only exemplary embodiments of the application, and are not used to limit the application, and the protection scope of the application is defined by the claims. Those skilled in the art can make various modifications or equivalent substitutions to this application within the essence and protection scope of this application, and such modifications or equivalent substitutions shall also be deemed to fall within the protection scope of this application.

Claims

1. An information processing method for predicting drug targets, characterized by comprising:

Obtaining a compound perturbation spectrum corresponding to a compound;

Obtaining a gene perturbation spectrum corresponding to a target gene on which the compound acts;

Determining the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum;

Predicting, according to the degree of correlation and preset experimental condition data, the probability that the compound acts on the target gene.

2. The method of claim 1, wherein the determining the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum comprises:

Calculating the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum based on a first preset algorithm.

3. The method according to claim 2, wherein when the degree of correlation is the Pearson correlation coefficient of the compound perturbation spectrum and the gene perturbation spectrum, the predicting the probability that the compound acts on the target gene according to the degree of correlation and preset experimental condition data comprises:

Obtaining preset experimental condition data;

Substituting the Pearson correlation coefficient and the experimental condition data into a second preset algorithm to obtain a score of the interaction probability of the compound and the target gene.

4. The method of claim 1, wherein the determining the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum comprises:

Inputting the compound perturbation spectrum and the gene perturbation spectrum into a feature extraction network to perform feature extraction on the compound perturbation spectrum and the gene perturbation spectrum;

Acquiring a first vector corresponding to the compound perturbation spectrum and a second vector corresponding to the gene perturbation spectrum output by the feature extraction network;

Inputting the first vector and the second vector into a calculation module;

Obtaining the Pearson correlation coefficient of the first vector and the second vector output by the calculation module.

5. The method of claim 4, wherein the predicting the probability that the compound acts on the target gene according to the degree of correlation and preset experimental condition data comprises:

Obtaining preset experimental condition data;

Inputting the Pearson correlation coefficient and the experimental condition data into a classification module;

Obtaining the score of the interaction probability between the compound and the target gene output by the classification module.

6. The method according to claim 3 or 5, wherein the preset experimental condition data includes at least one of the following data:

Compound perturbation duration, compound dosage, gene knockdown duration and cell type.

7. The method according to any one of claims 1-6, wherein when there are multiple types of target genes, the method further comprises:

Obtaining scores of the interaction probabilities of the various target genes and the compound respectively;

Sorting the scores corresponding to the various target genes;

Determining that the target gene corresponding to the highest score interacts with the compound.

8. An information processing device for predicting drug targets, characterized by comprising:

A first acquisition module, used to acquire a compound perturbation spectrum corresponding to a compound;

A second acquisition module, used to acquire a gene perturbation spectrum corresponding to a target gene on which the compound acts;

A determining module, used to determine the degree of correlation between the compound perturbation spectrum and the gene perturbation spectrum;

A prediction module, used to predict the probability that the compound acts on the target gene according to the degree of correlation and preset experimental condition data.

9. The device according to claim 8, wherein the determining module comprises:

A first input sub-module, used to input the compound perturbation spectrum and the gene perturbation spectrum into a feature extraction network to perform feature extraction on the compound perturbation spectrum and the gene perturbation spectrum;

A first acquisition sub-module, configured to acquire a first vector corresponding to the compound perturbation spectrum and a second vector corresponding to the gene perturbation spectrum output by the feature extraction network;

A second input sub-module, used to input the first vector and the second vector into a calculation module;

A second acquisition sub-module, configured to acquire the Pearson correlation coefficient of the first vector and the second vector output by the calculation module.

10. The device of claim 9, wherein the prediction module comprises:

A third acquisition sub-module, used to acquire preset experimental condition data;

A third input sub-module, used to input the Pearson correlation coefficient and the experimental condition data into a classification module;

A fourth acquisition sub-module, used to obtain the score of the interaction probability between the compound and the target gene output by the classification module.