CN114171114A - Method and device for constructing drug target prediction model, storage medium and electronic equipment - Google Patents

Publication number
CN114171114A
Authority
CN
China
Prior art keywords
sample set
training sample
prediction model
target prediction
drug target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111475687.0A
Other languages
Chinese (zh)
Inventor
杨勇宏
胡延庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202111475687.0A
Publication of CN114171114A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis

Abstract

The application discloses a method and a device for constructing a drug target prediction model, a storage medium and electronic equipment. The method for constructing the drug target prediction model comprises: obtaining a training sample set; preprocessing the training sample set to obtain a percolation phase change point of the training sample set; performing percolation processing on the training sample set based on the percolation phase change point to obtain a connected probability matrix; extracting target feature vectors in the connected probability matrix; sequentially performing dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector; and training a logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model. The drug target prediction model obtained by this scheme can effectively improve the efficiency of drug target prediction.

Description

Method and device for constructing drug target prediction model, storage medium and electronic equipment
Technical Field
The embodiment of the application relates to the field of machine learning, in particular to a method and a device for constructing a drug target prediction model, a storage medium and electronic equipment.
Background
An important step in the drug research process is the screening of new drugs, and the key to establishing a new-drug screening system is finding drug targets at which therapeutic intervention can act. A drug target is a specific site formed by biological molecules: the drug acts on biological macromolecules of the organism at this site and produces a pharmacological effect that prevents or treats disease. Drug targets are therefore the basis of drug action and are of great importance in new-drug screening. Drug target prediction plays an irreplaceable role in the early evaluation of the druggability of drug molecules and is of great significance in fields such as drug repurposing, that is, finding new uses for existing, established drugs.
However, most current drug target prediction models rely on specific chemical structure information of drugs or proteins, which is costly to acquire. Moreover, most existing drug target prediction models have complex structures, which makes prediction inefficient.
Disclosure of Invention
The embodiment of the application provides a method and a device for constructing a drug target prediction model, a storage medium and electronic equipment.
In a first aspect, an embodiment of the present application provides a method for constructing a drug target prediction model, including:
acquiring a training sample set;
preprocessing the training sample set to obtain a percolation phase change point of the training sample set;
performing percolation processing on the training sample set based on the percolation phase change point to obtain a connected probability matrix;
extracting target feature vectors in the connected probability matrix;
sequentially performing dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector;
and training a logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model.
In the method for constructing a drug target prediction model provided in the embodiment of the present application, the preprocessing the training sample set to obtain the percolation phase change point of the training sample set includes:
obtaining the average degree of the network nodes of the training sample set;
determining a search space of the percolation phase change point based on the average degree of the network nodes;
and searching in the search space according to a preset step length to determine the percolation phase change point of the training sample set.
In the method for constructing a drug target prediction model provided in the embodiment of the present application, the searching in the search space according to a preset step size to determine a percolation phase change point of the training sample set includes:
respectively endowing each edge in the training sample set with a first random number;
comparing the first random number with a first probability value in the search space to determine whether to perform edge deletion processing;
if yes, obtaining the size of a target connected component in each connected component in the training sample set after edge deletion processing;
returning to execute the step of respectively endowing each edge in the training sample set with a first random number until the number of the target connected components reaches a first preset number;
obtaining a first average value of the sizes of the first preset number of target connected components;
and determining the percolation phase change point of the training sample set based on the preset step length and the first average value of the target connected component sizes.
In the method for constructing a drug target prediction model provided in the embodiment of the present application, the training sample set is subjected to percolation processing based on the percolation phase transition point to obtain a connected probability matrix, which includes:
respectively endowing each edge in the training sample set with a second random number;
comparing the second random number with the percolation phase change point to determine whether to perform edge deletion processing;
if so, acquiring a node list of each connected component in the training sample set after edge deletion processing, and generating a connected matrix based on the node list;
returning to execute the step of respectively giving a second random number to each edge in the training sample set until the number of the connected matrixes reaches a second preset number;
and acquiring a second average value of a second preset number of the connected matrixes, wherein the second average value is the connected probability matrix.
In the method for constructing a drug target prediction model provided in the embodiment of the present application, the performing dimensionality reduction and embedding on the target feature vector in sequence to obtain a low-dimensional embedded vector includes:
reducing the dimension of the target feature vector to 128 dimensions by using an autoencoder to obtain a node embedding vector;
and performing edge embedding processing on the node embedding vector by using a Hadamard product to obtain a low-dimensional embedding vector.
In the method for constructing a drug target prediction model provided in the embodiment of the present application, the training of the logistic regression model using the low-dimensional embedded vector to obtain the drug target prediction model includes:
and training the logistic regression model based on a gradient descent learning method and the low-dimensional embedded vector to obtain a drug target prediction model.
In the method for constructing a drug target prediction model provided in the embodiment of the present application, after the training of the logistic regression model by using the low-dimensional embedded vector to obtain the drug target prediction model, the method further includes:
obtaining a verification sample set and a test sample set;
and inputting the verification sample set and the test sample set into the drug target prediction model for testing to obtain a test result.
In a second aspect, an embodiment of the present application provides an apparatus for constructing a drug target prediction model, including:
the sample acquisition unit is used for acquiring the training sample set;
the first processing unit is used for preprocessing the training sample set to obtain a percolation phase change point of the training sample set;
the second processing unit is used for performing percolation processing on the training sample set based on the percolation phase change point to obtain a connected probability matrix;
the vector extraction unit is used for extracting a target feature vector in the connected probability matrix;
the third processing unit is used for sequentially carrying out dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector;
and the model generation unit is used for training the logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model.
In a third aspect, an embodiment of the present application provides a storage medium, where a plurality of instructions are stored, and the instructions are suitable for a processor to load and execute any one of the methods described above.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to any one of the above when executing the computer program.
The method for constructing the drug target prediction model provided by the application comprises: obtaining a training sample set; preprocessing the training sample set to obtain a percolation phase change point of the training sample set; performing percolation processing on the training sample set based on the percolation phase change point to obtain a connected probability matrix; extracting target feature vectors in the connected probability matrix; sequentially performing dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector; and training a logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model. The drug target prediction model obtained by this scheme can effectively improve the efficiency of drug target prediction.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a method for constructing a drug target prediction model provided by the present application.
Fig. 2 is a schematic structural diagram of a device for constructing a drug target prediction model provided by the present application.
Fig. 3 is a schematic structural diagram of a server provided in the present application.
Fig. 4 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first" and "second", etc. in this application are used to distinguish between different objects and not to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to the listed steps or modules but may alternatively include other steps or modules not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The embodiment of the application provides a method and a device for constructing a drug target prediction model, a storage medium and electronic equipment. Specifically, the embodiment of the present application provides a device for constructing a drug target prediction model, which is suitable for an electronic device, where the electronic device may be a terminal device such as a mobile phone, a tablet computer, and a notebook computer, or a network-side device such as a server, and the server may be a single server, or a server cluster composed of multiple servers, or a physical server, or a virtual server.
The following detailed description will be made separately, and the description sequence of each embodiment below does not limit the specific implementation sequence.
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for constructing a drug target prediction model provided in the present application. The method for constructing the drug target prediction model of this embodiment can be implemented using the device for constructing the drug target prediction model described above. The specific process of the construction method of the drug target prediction model can be as follows:
101, a training sample set is obtained.
In particular, the drug-protein network can be obtained by collating published data. The network is a bipartite network, and each edge in the network indicates that the corresponding drug has a target on the corresponding protein.
In particular implementations, the drug-protein network can be divided into a training sample set, a verification sample set, and a test sample set. It should be noted that the verification sample set contains 10% of the edges of the original network. The training sample set includes the remaining edges and the same number of node pairs that are not connected by an edge. The test sample set includes all node pairs in the network that have no edges.
For example, the drug-protein network may contain 1519 drugs and 1025 proteins, each edge in the network indicating the presence of the corresponding drug target on the corresponding protein, for a total of 6744 edges. Wherein the verification sample set contains 10% of the edges of the original network for a total of 674. The training sample set includes the remaining 6070 edges and the same number of node pairs with no edges connected. The test sample set includes all node pairs in the network that have no edges. Due to the particularity of the link prediction task, an intersection exists between the test sample set and the training sample set.
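The split described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `split_edges`, the `seed` argument, and the use of uniform random sampling for the unconnected node pairs are all assumptions, since the text does not say how those pairs are chosen.

```python
import random


def split_edges(edges, drugs, proteins, val_fraction=0.1, seed=0):
    """Split a bipartite drug-protein edge list (hypothetical helper).

    Returns (training edges, sampled unconnected pairs, verification edges).
    """
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)

    # 10% of the edges (rounded down) form the verification sample set.
    n_val = int(len(edges) * val_fraction)
    val_pos = edges[:n_val]
    train_pos = edges[n_val:]

    # Assumption: negatives are drawn uniformly at random from the
    # (drug, protein) pairs that have no edge in the original network.
    known = set(edges)
    train_neg = set()
    while len(train_neg) < len(train_pos):
        pair = (rng.choice(drugs), rng.choice(proteins))
        if pair not in known:
            train_neg.add(pair)
    return train_pos, sorted(train_neg), val_pos
```

With the 6744-edge example above, `val_fraction=0.1` yields 674 verification edges and 6070 training edges, matching the figures in the text.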
102, the training sample set is preprocessed to obtain the percolation phase change point of the training sample set.
Specifically, the average degree of the network nodes of the training sample set may be obtained, then a search space of the percolation phase change point is determined based on the average degree of the network nodes, and finally, the search space is searched according to a preset step length to determine the percolation phase change point of the training sample set.
It will be appreciated that the set of training samples is a network. That is, there are several network nodes in the training sample set. The average degree of the network node is the average value of the degrees of a plurality of network nodes.
It should be noted that the average degree of the network nodes is greater than or equal to 1. The size of the search space is the reciprocal of the average degree of the network nodes; that is, the size of the search space is less than or equal to 1.
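As a rough sketch of how the search space might be built, the code below computes the average degree and lists candidate probability values. The text only states the size of the search space, so the choice to start at one step above zero and end at the reciprocal of the average degree is an assumption, as are the helper names `average_degree` and `candidate_probabilities`.

```python
import math


def average_degree(num_edges, num_nodes):
    """Mean degree of an undirected network: every edge touches two nodes."""
    return 2.0 * num_edges / num_nodes


def candidate_probabilities(avg_degree, step=0.01):
    """Candidate values of P, spaced by the preset step length.

    Assumption: the search space is the interval (0, 1 / avg_degree].
    """
    upper = 1.0 / avg_degree
    # The small epsilon guards against floating-point round-off in the division.
    count = math.floor(upper / step + 1e-9)
    return [round(i * step, 10) for i in range(1, count + 1)]
```

For instance, if all 1519 + 1025 = 2544 nodes of the example network are counted, the 6070 training edges give an average degree of roughly 4.77, so the upper end of this assumed interval would be about 0.21.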
In a specific implementation process, each probability value P in the search space can be tested by using 0.01 as a preset step length according to the following method:
(1) Assign a random number P' to each edge in the training sample set, and delete the edge if P' is greater than P.
(2) Calculate the size S1 of the target connected component in the training sample set after edge deletion. If the training sample set network is fully connected after edge deletion, set S1 to 0.
(3) Repeat steps (1) and (2) until a first preset number of target connected component sizes are obtained, and compute their first average value $S_{P1}$.
(4) Assign a random number P' to each edge in the training sample set, and delete the edge if P' is greater than P + 0.01.
(5) Calculate the size S2 of the target connected component in the training sample set after edge deletion. If the training sample set network is fully connected after edge deletion, set S2 to 0.
(6) Repeat steps (4) and (5) until a first preset number of target connected component sizes are obtained, and compute their second average value $S_{P2}$.
(7) Repeat steps (4), (5) and (6) until the average size of the first preset number of target connected components has been obtained for every probability value P in the search space.
Each time steps (4), (5) and (6) are repeated, P is updated to P + 0.01.
It should be noted that, after edges are deleted, the training sample set comprises a plurality of connected components. According to percolation theory, the probability value at which the average size of the second-largest connected component reaches its maximum is the percolation phase change point $P_c$ of the network. That is, the target connected component is the second-largest of the plurality of connected components.
That is, the step "searching in the search space according to the preset step size to determine the percolation phase change point of the training sample set" may include:
respectively endowing each edge in the training sample set with a first random number;
comparing the first random number with a first probability value in the search space to determine whether to perform edge deletion processing;
if yes, obtaining the size of a target connected component in each connected component in the training sample set after edge deletion processing;
returning to execute the step of respectively endowing each edge in the training sample set with a first random number until the number of the target connected components reaches a first preset number;
acquiring a first average value of the sizes of the target connected components of a first preset quantity;
and determining a percolation phase change point of the training sample set based on a preset step length and a first average value of the target connected component sizes.
It should be noted that the preset step includes, but is not limited to, 0.01. The preset step size may also be 0.02, 0.03, 0.035, etc. The preset step length can be set according to actual conditions. The first preset number can be set according to actual conditions. In some embodiments, the first predetermined number may be 300.
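The search procedure above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the applicant's code: the helper names (`component_sizes`, `second_largest_size`, `find_phase_change_point`), the union-find implementation, and the `seed` argument are hypothetical, and node identifiers for drugs and proteins are assumed to be distinct.

```python
import random
from collections import Counter


def component_sizes(nodes, edges):
    """Sizes of all connected components, largest first (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return sorted(Counter(find(n) for n in nodes).values(), reverse=True)


def second_largest_size(nodes, edges, p, rng):
    """One percolation trial: delete each edge whose random number exceeds p."""
    kept = [e for e in edges if rng.random() <= p]
    sizes = component_sizes(nodes, kept)
    # As in the text, a fully connected network contributes 0.
    return sizes[1] if len(sizes) > 1 else 0


def find_phase_change_point(nodes, edges, candidates, trials=300, seed=0):
    """Return the candidate p with the largest mean second-largest component."""
    rng = random.Random(seed)
    best_p, best_mean = None, -1.0
    for p in candidates:
        total = sum(second_largest_size(nodes, edges, p, rng) for _ in range(trials))
        mean = total / trials
        if mean > best_mean:
            best_p, best_mean = p, mean
    return best_p, best_mean
```

Here `trials` plays the role of the first preset number (300 in some embodiments), and `candidates` is the list of probability values generated from the search space with the preset step length.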
In some embodiments, the step of "determining a percolation phase change point of the training sample set based on a preset step size and a first average of the target connected component sizes" may include:
respectively endowing each edge in the training sample set with a third random number;
comparing the third random number with a second probability value in the search space to determine whether to perform edge deletion processing;
if yes, obtaining the size of a target connected component in each connected component in the training sample set after edge deletion processing;
returning to execute the step of respectively giving a third random number to each edge in the training sample set until the number of the target connected components reaches a first preset number;
acquiring a second average value of the sizes of the first preset number of target connected components;
and determining the percolation phase change point of the training sample set based on the second average value and the preset step length.
In some embodiments, the step of "determining the percolation phase change point of the training sample set based on the second average value and the preset step length" may include:
returning to execute the step of respectively giving a third random number to each edge in the training sample set, and terminating once a second average value of the sizes of the first preset number of target connected components has been obtained for every probability value generated in the search space with the preset step length;
and determining the percolation phase change point of the training sample set based on the first average value and the second average value.
It will be appreciated that the second probability value may be incremented or decremented according to a preset step size when the step of assigning a third random number to each edge in the training sample set is performed back.
103, percolation processing is performed on the training sample set based on the percolation phase change point to obtain a connected probability matrix.
Specifically, the procedure is as follows:
1. Assign a random number P' to each edge in the training sample set, and delete the edge if P' is greater than P.
2. Collect the node lists of all connected components in the network after edge deletion, and generate a connected matrix M from them. An element $M_{u,v}$ of the connected matrix equal to 1 indicates that nodes u and v are located in the same connected component after edge deletion; an element $M_{u,v}$ equal to 0 indicates that nodes u and v are not connected after edge deletion.
3. Repeat steps 1 and 2, and average the resulting connected matrices M to obtain the connected probability matrix M'.
Note that P here is the percolation phase change point obtained in step 102.
That is, the step of performing percolation processing on the training sample set based on the percolation phase change point to obtain the connected probability matrix may include:
respectively endowing each edge in the training sample set with a second random number;
comparing the second random number with the percolation phase change point to determine whether to perform edge deletion processing;
if so, acquiring a node list of each connected component in the training sample set after edge deletion processing, and generating a connected matrix based on the node list;
returning to execute the step of respectively giving a second random number to each edge in the training sample set until the number of the connected matrixes reaches a second preset number;
and acquiring a second average value of a second preset number of connected matrixes, wherein the second average value is a connected probability matrix.
It should be noted that the second preset number can be set according to actual situations.
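A minimal sketch of the percolation processing in step 103 is given below. The names `component_labels` and `connected_probability_matrix` are hypothetical, the default of 100 trials for the second preset number is an arbitrary illustrative value, and the diagonal entries come out as 1 because each node is treated as connected to itself, which the text does not specify.

```python
import random
from collections import defaultdict


def component_labels(nodes, edges):
    """Map each node to a representative of its connected component."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return {n: find(n) for n in nodes}


def connected_probability_matrix(nodes, edges, p_c, trials=100, seed=0):
    """Average the 0/1 connected matrices over repeated percolation trials."""
    rng = random.Random(seed)
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    counts = [[0] * size for _ in range(size)]
    for _ in range(trials):
        # Delete every edge whose random number exceeds p_c.
        kept = [e for e in edges if rng.random() <= p_c]
        groups = defaultdict(list)
        for node, label in component_labels(nodes, kept).items():
            groups[label].append(index[node])
        # Nodes in the same component get a 1 in this trial's connected matrix.
        for members in groups.values():
            for i in members:
                for j in members:
                    counts[i][j] += 1
    return [[c / trials for c in row] for row in counts]
```

Row `i` of the returned matrix can then serve as the feature vector of node `nodes[i]`.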
104, target feature vectors in the connected probability matrix are extracted.
It should be noted that each row in the connected probability matrix can be regarded as a feature vector of a certain drug or protein. That is, the target feature vector is each row of data in the connected probability matrix.
105, dimensionality reduction processing and embedding processing are sequentially performed on the target feature vector to obtain a low-dimensional embedded vector.
Specifically, an autoencoder can be used to reduce the dimension of the target feature vector to 128 dimensions, so as to obtain a node embedding vector; edge embedding processing is then performed on the node embedding vector by using a Hadamard product to obtain a low-dimensional embedding vector.
It should be noted that the autoencoder is a neural network comprising two fully connected layers and is trained with the Adam optimizer. The hidden layer of the autoencoder has 128 dimensions, the output layer has the same dimension as the input layer, and the ReLU function is used as the activation function between layers:
$$\mathrm{ReLU}(x) = \max(0, x)$$
Suppose the autoencoder input is x and the output is [Figure BDA0003393312340000082], with the loss function [Figure BDA0003393312340000083].
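The forward pass of such an autoencoder can be sketched without any external library, as below. Several points are assumptions because the formulas are not recoverable from the text: mean squared reconstruction error is used as the loss, biases are omitted, ReLU is applied only to the hidden layer, and the weights are initialised uniformly at random. Training with the Adam optimizer is not shown.

```python
import random


def relu(v):
    """Element-wise ReLU: max(0, x)."""
    return [max(0.0, x) for x in v]


def matvec(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def init_weights(rows, cols, rng):
    """Small random weights (assumed initialisation scheme)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward(x, w_enc, w_dec):
    """Return (hidden embedding, reconstruction) for one input vector."""
    hidden = relu(matvec(w_enc, x))  # the hidden output is the node embedding
    x_hat = matvec(w_dec, hidden)    # output has the same dimension as the input
    return hidden, x_hat


def mse(x, x_hat):
    """Mean squared reconstruction error (assumed loss)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```

With an encoder matrix of 128 rows, `forward` returns a 128-dimensional hidden vector, which corresponds to the node embedding described in the text.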
The autoencoder is trained, and the output of its hidden layer is then extracted to obtain a 128-dimensional node embedding vector. Edge embedding is then performed using the Hadamard product: for any two nodes u and v in the network, with embedding vectors denoted $e_u$ and $e_v$ respectively, the low-dimensional embedding vector of the node pair is:
$$e_{u,v} = e_u \odot e_v$$
where $\odot$ denotes the Hadamard (element-wise) product.
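The edge embedding step amounts to an element-wise product of the two node embedding vectors. A minimal sketch follows; the function name `hadamard` is hypothetical.

```python
def hadamard(e_u, e_v):
    """Element-wise (Hadamard) product of two equal-length embedding vectors."""
    if len(e_u) != len(e_v):
        raise ValueError("embedding vectors must have the same dimension")
    return [a * b for a, b in zip(e_u, e_v)]
```

Because the product is taken element by element, the edge embedding keeps the same dimension as the node embeddings (128 in the embodiment above) and does not depend on the order of u and v.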
106, the logistic regression model is trained by using the low-dimensional embedded vector to obtain a drug target prediction model.
Specifically, the logistic regression model can be trained based on a gradient descent learning method and a low-dimensional embedding vector to obtain a drug target prediction model.
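A logistic regression model trained by gradient descent can be sketched as follows. The text does not give the learning rate, number of iterations, or batching scheme, so full-batch gradient descent with the hypothetical parameters `lr` and `epochs` is an assumption, as are the function names.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, w, b):
    """Predicted probability that the node pair x is connected by an edge."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # derivative of the loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

Here each row of `X` would be the low-dimensional embedding vector of a node pair, with label 1 for pairs connected by an edge and 0 for sampled unconnected pairs.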
In some embodiments, after obtaining the drug target prediction model, the prediction efficiency and accuracy of the drug target prediction model can also be tested. That is, after step 106, the method may further include:
obtaining a verification sample set and a test sample set;
and inputting the verification sample set and the test sample set into a drug target prediction model for testing to obtain a test result.
It can be understood that the prediction efficiency and accuracy of the drug target prediction model can be obtained according to the test result, so that whether the corresponding adjustment needs to be performed on the drug target prediction model can be determined according to the prediction efficiency and accuracy.
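One possible way to summarise the test result is sketched below. The text does not name a specific metric, so the 0.5 decision threshold and the two hypothetical helpers (`accuracy` for labelled pairs, `top_candidates` for ranking the unconnected node pairs of the test sample set) are assumptions.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of pairs whose thresholded prediction matches the label."""
    correct = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)


def top_candidates(pairs, probs, k=10):
    """The k node pairs with the highest predicted probability, best first."""
    ranked = sorted(zip(pairs, probs), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

Ranking the unconnected pairs in this way would list the drug-protein pairs the model considers most likely to be undiscovered targets.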
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.
In summary, the method for constructing the drug target prediction model provided by the embodiment of the present application includes: obtaining a training sample set; preprocessing the training sample set to obtain a percolation phase change point of the training sample set; performing percolation processing on the training sample set based on the percolation phase change point to obtain a connected probability matrix; extracting target feature vectors in the connected probability matrix; sequentially performing dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector; and training the logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model. Because the drug target prediction model is built on a logistic regression model, it has the advantages of a simple model, low time complexity and few hyper-parameters. Therefore, the drug target prediction model obtained by this scheme can effectively improve the efficiency of drug target prediction. Moreover, the drug target prediction model does not depend on specific chemical structure information of drugs or proteins, so the data acquisition cost is low.
In order to better implement the method for constructing the drug target prediction model, correspondingly, the embodiment of the application also provides a device for constructing the drug target prediction model, wherein the device for constructing the drug target prediction model can be integrated in an electronic device. The meanings of the terms are the same as those in the construction method of the drug target prediction model, and specific implementation details can refer to the description in the method embodiment.
As shown in fig. 2, fig. 2 is a schematic structural diagram of a device for constructing a drug target prediction model according to an embodiment of the present application. The device 200 for constructing the drug target prediction model may include a sample acquisition unit 201, a first processing unit 202, a second processing unit 203, a vector extraction unit 204, a third processing unit 205, and a model generation unit 206. Wherein:
the sample acquisition unit 201 is configured to acquire a training sample set;
the first processing unit 202 is configured to preprocess the training sample set to obtain a percolation phase transition point of the training sample set;
the second processing unit 203 is configured to perform percolation processing on the training sample set based on the percolation phase transition point to obtain a connected probability matrix;
the vector extraction unit 204 is configured to extract a target feature vector from the connected probability matrix;
the third processing unit 205 is configured to sequentially perform dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector;
and the model generation unit 206 is configured to train a logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model.
To sum up, in the device 200 for constructing a drug target prediction model provided in the embodiment of the present application, the sample acquisition unit 201 acquires a training sample set; the first processing unit 202 preprocesses the training sample set to obtain a percolation phase transition point of the training sample set; the second processing unit 203 performs percolation processing on the training sample set based on the percolation phase transition point to obtain a connected probability matrix; the vector extraction unit 204 extracts a target feature vector from the connected probability matrix; the third processing unit 205 sequentially performs dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector; and the model generation unit 206 trains a logistic regression model with the low-dimensional embedded vector to obtain the drug target prediction model. Because the drug target prediction model is built on a logistic regression model, it is simple, has low time complexity, and has few hyperparameters, so the model obtained by this scheme can effectively improve the efficiency of drug target prediction. Moreover, the drug target prediction model does not depend on specific chemical structure information of drugs or proteins, so the data acquisition cost is low.
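The way the six units of device 200 hand data to one another can be sketched as a simple composition of callables. The class name, field names, and signatures below are hypothetical; the application describes the units functionally and does not prescribe this software structure.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ModelConstructionDevice:
    """Illustrative stand-in for device 200; each field mirrors one unit."""
    acquire_samples: Callable[[], Any]              # sample acquisition unit 201
    find_transition_point: Callable[[Any], float]   # first processing unit 202
    percolate: Callable[[Any, float], Any]          # second processing unit 203
    extract_features: Callable[[Any], Any]          # vector extraction unit 204
    reduce_and_embed: Callable[[Any], Any]          # third processing unit 205
    train_model: Callable[[Any], Any]               # model generation unit 206

    def build(self):
        """Run the units in the order given in the description."""
        samples = self.acquire_samples()
        p_c = self.find_transition_point(samples)
        prob_matrix = self.percolate(samples, p_c)
        features = self.extract_features(prob_matrix)
        embedding = self.reduce_and_embed(features)
        return self.train_model(embedding)
```

Keeping each unit as an independent callable means any one stage (for example the dimensionality reduction) could be swapped without touching the others.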
The embodiment of the present application further provides a server, as shown in fig. 3, which shows a schematic structural diagram of the server according to the embodiment of the present application, specifically:
the server may include components such as a processor 301 of one or more processing cores, memory 302 of one or more computer-readable storage media, a power supply 303, and an input unit 304. Those skilled in the art will appreciate that the server architecture shown in FIG. 3 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 301 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 302 and calling data stored in the memory 302, thereby performing overall monitoring of the server. Optionally, processor 301 may include one or more processing cores; preferably, the processor 301 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 301.
The memory 302 may be used to store software programs and modules, and the processor 301 executes various functional applications and data processing by running the software programs and modules stored in the memory 302. The memory 302 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the server, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. Accordingly, the memory 302 may also include a memory controller to provide the processor 301 with access to the memory 302.
The server further includes a power supply 303 for supplying power to the various components, and preferably, the power supply 303 may be logically connected to the processor 301 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The power supply 303 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The server may also include an input unit 304, the input unit 304 being operable to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the server may further include a display unit and the like, which will not be described in detail herein. Specifically, in this embodiment, the processor 301 in the server loads the executable file corresponding to the process of one or more application programs into the memory 302 according to the following instructions, and the processor 301 runs the application programs stored in the memory 302, thereby implementing various functions as follows:
acquiring a training sample set;
preprocessing the training sample set to obtain a percolation phase transition point of the training sample set;
performing percolation processing on the training sample set based on the percolation phase transition point to obtain a connected probability matrix;
extracting a target feature vector from the connected probability matrix;
sequentially performing dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector;
and training a logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model.
For details of the above operations, refer to the previous embodiments; they are not repeated here.
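Among the steps listed above, the percolation step that yields the connected probability matrix can be sketched as follows, loosely following claim 4: edges are randomly deleted, the node list of each connected component is collected, a 0/1 connected matrix is formed, and the matrices from repeated runs are averaged. The function names, the keep-if-below rule for the random number, and the default of 100 runs are assumptions for illustration.

```python
import random


def connected_components(n_nodes, edges):
    """Node lists of each connected component (iterative depth-first search)."""
    adjacency = [[] for _ in range(n_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    seen = [False] * n_nodes
    components = []
    for start in range(n_nodes):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if not seen[neighbor]:
                    seen[neighbor] = True
                    stack.append(neighbor)
        components.append(component)
    return components


def connected_probability_matrix(n_nodes, edges, p_c, runs=100, rng=None):
    """Average of `runs` connected matrices; entry (i, j) estimates the
    probability that nodes i and j share a component after edge deletion."""
    rng = rng or random.Random(0)
    counts = [[0] * n_nodes for _ in range(n_nodes)]
    for _ in range(runs):
        # Assumed rule: keep an edge when its random number is below p_c.
        kept = [e for e in edges if rng.random() < p_c]
        for component in connected_components(n_nodes, kept):
            for i in component:
                for j in component:
                    counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]
```

Each row of the resulting matrix can then serve as a feature vector for the corresponding node. Note that the diagonal is always 1, since every node is trivially connected to itself.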
Accordingly, an embodiment of the present application also provides a terminal, as shown in fig. 4, the terminal may include Radio Frequency (RF) circuit 401, memory 402 including one or more computer-readable storage media, input unit 403, display unit 404, sensor 405, audio circuit 406, Wireless Fidelity (WiFi) module 407, processor 408 including one or more processing cores, and power supply 409. Those skilled in the art will appreciate that the terminal configuration shown in fig. 4 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the RF circuit 401 may be used for receiving and transmitting signals during a message transmission or communication process, and in particular, for receiving downlink information of a base station and then sending the received downlink information to the one or more processors 408 for processing; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 401 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 401 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The memory 402 may be used to store software programs and modules, and the processor 408 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 408 and the input unit 403 access to the memory 402.
The input unit 403 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in a particular embodiment, the input unit 403 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts it to touch point coordinates, and sends the touch point coordinates to the processor 408, and can receive and execute commands from the processor 408. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 403 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 404 may be used to display information input by or provided to the user and various graphical user interfaces of the terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 404 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 408 to determine the type of touch event, and then the processor 408 provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 4 the touch-sensitive surface and the display panel are shown as two separate components to implement input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement input and output functions.
The terminal may also include at least one sensor 405, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the terminal is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when the motion sensor is stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of an electronic device, vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal, detailed description is omitted here.
The audio circuit 406, a speaker, and a microphone may provide an audio interface between the user and the terminal. The audio circuit 406 may transmit the electrical signal converted from the received audio data to the speaker, which converts the electrical signal into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 406 and converted into audio data. The audio data is then output to the processor 408 for processing and transmitted to, for example, another terminal via the RF circuit 401, or the audio data is output to the memory 402 for further processing. The audio circuit 406 may also include an earbud jack to provide communication between a peripheral headset and the terminal.
WiFi is a short-range wireless transmission technology. Through the WiFi module 407, the terminal can help the user send and receive e-mail, browse web pages, access streaming media, and so on, providing the user with wireless broadband internet access. Although fig. 4 shows the WiFi module 407, it is understood that the module is not an essential part of the terminal and may be omitted as needed without changing the essence of the invention.
The processor 408 is a control center of the terminal, connects various parts of the entire electronic device using various interfaces and lines, performs various functions of the terminal and processes data by operating or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby integrally monitoring the electronic device. Optionally, processor 408 may include one or more processing cores; preferably, the processor 408 may integrate an application processor, which handles primarily the operating system, user interface, applications, etc., and a modem processor, which handles primarily the wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 408.
The terminal also includes a power source 409 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 408 via a power management system to manage charging, discharging, and power consumption via the power management system. The power supply 409 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown, the terminal may further include a camera, a bluetooth module, and the like, which will not be described herein. Specifically, in this embodiment, the processor 408 in the terminal loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 408 runs the application programs stored in the memory 402, thereby implementing various functions:
acquiring a training sample set;
preprocessing the training sample set to obtain a percolation phase transition point of the training sample set;
performing percolation processing on the training sample set based on the percolation phase transition point to obtain a connected probability matrix;
extracting a target feature vector from the connected probability matrix;
sequentially performing dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector;
and training a logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model.
For details of the above operations, refer to the previous embodiments; they are not repeated here.
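For the embedding step in the list above, the claims recite reducing the feature vector to 128 dimensions with an autoencoder and then forming an edge embedding with a Hadamard (element-wise) product. The sketch below covers only the Hadamard part and assumes the node embedding vectors have already been produced; the function names and the dictionary layout keyed by node identifier are hypothetical.

```python
def hadamard(u, v):
    """Element-wise product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a * b for a, b in zip(u, v)]


def edge_embeddings(node_vectors, pairs):
    """One edge embedding per (drug, target) pair.

    `node_vectors` maps a node identifier to its embedding vector
    (128-dimensional in the claims; any common length works here).
    """
    return [hadamard(node_vectors[d], node_vectors[t]) for d, t in pairs]
```

Because the Hadamard product is symmetric, the embedding of a pair does not depend on the order in which the drug and the target are given, and it keeps the same dimensionality as the node embeddings.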
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the embodiments of the present application provide a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the methods for constructing a drug target prediction model provided in the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring a training sample set;
preprocessing the training sample set to obtain a percolation phase transition point of the training sample set;
performing percolation processing on the training sample set based on the percolation phase transition point to obtain a connected probability matrix;
extracting a target feature vector from the connected probability matrix;
sequentially performing dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector;
and training a logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model.
For details of the above operations, refer to the foregoing embodiments; they are not repeated here.
Wherein the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any of the methods for constructing a drug target prediction model provided in the embodiments of the present application, they can achieve the beneficial effects of any of those methods, which are detailed in the foregoing embodiments and are not described here again.
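The last of the stored steps, training a logistic regression model on the low-dimensional embedded vectors, is stated in the claims to use a gradient descent learning method. A minimal sketch of batch gradient descent on the logistic loss follows; the learning rate, epoch count, zero initialization, and function names are illustrative assumptions rather than values taken from the application.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the mean logistic loss.

    X: list of feature vectors (e.g. edge embeddings); y: list of 0/1 labels.
    Returns the weight vector and the bias.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability that the pair represented by x interacts."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

The only hyperparameters in this sketch are the learning rate and the number of epochs, which is in keeping with the description's point that a logistic regression model has few hyperparameters.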
The method, device, storage medium and electronic device for constructing the drug target prediction model provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A construction method of a drug target prediction model is characterized by comprising the following steps:
acquiring a training sample set;
preprocessing the training sample set to obtain a percolation phase transition point of the training sample set;
performing percolation processing on the training sample set based on the percolation phase transition point to obtain a connected probability matrix;
extracting a target feature vector from the connected probability matrix;
sequentially performing dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector;
and training a logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model.
2. The method for constructing a drug target prediction model according to claim 1, wherein the preprocessing the training sample set to obtain the percolation phase transition point of the training sample set comprises:
obtaining the average degree of the network nodes of the training sample set;
determining a search space of the percolation phase transition point based on the average degree of the network nodes;
and searching in the search space according to a preset step size to determine the percolation phase transition point of the training sample set.
3. The method for constructing a drug target prediction model according to claim 2, wherein the searching in the search space according to a preset step size to determine the percolation phase transition point of the training sample set comprises:
assigning a first random number to each edge in the training sample set;
comparing the first random number with a first probability value in the search space to determine whether to perform edge deletion processing;
if yes, obtaining the size of a target connected component among the connected components in the training sample set after edge deletion processing;
returning to execute the step of assigning a first random number to each edge in the training sample set until the number of the target connected components reaches a first preset number;
obtaining a first average value of the sizes of the first preset number of target connected components;
and determining the percolation phase transition point of the training sample set based on the preset step size and the first average value of the target connected components.
4. The method for constructing a drug target prediction model according to claim 1, wherein the performing percolation processing on the training sample set based on the percolation phase transition point to obtain a connected probability matrix comprises:
assigning a second random number to each edge in the training sample set;
comparing the second random number with the percolation phase transition point to determine whether to perform edge deletion processing;
if so, acquiring a node list of each connected component in the training sample set after edge deletion processing, and generating a connected matrix based on the node list;
returning to execute the step of assigning a second random number to each edge in the training sample set until the number of the connected matrices reaches a second preset number;
and acquiring a second average value of the second preset number of connected matrices, wherein the second average value is the connected probability matrix.
5. The method for constructing a drug target prediction model according to claim 1, wherein the step of sequentially performing dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector comprises:
reducing the dimension of the target feature vector to 128 dimensions by using an autoencoder to obtain a node embedding vector;
and performing edge embedding processing on the node embedding vector by using a Hadamard product to obtain the low-dimensional embedded vector.
6. The method for constructing a drug target prediction model according to claim 1, wherein the training of the logistic regression model using the low-dimensional embedding vector to obtain the drug target prediction model comprises:
and training the logistic regression model based on a gradient descent learning method and the low-dimensional embedded vector to obtain a drug target prediction model.
7. The method for constructing a drug target prediction model according to claim 1, wherein after the training of the logistic regression model using the low-dimensional embedding vector to obtain the drug target prediction model, the method further comprises:
obtaining a verification sample set and a test sample set;
and inputting the verification sample set and the test sample set into the drug target prediction model for testing to obtain a test result.
8. A device for constructing a drug target prediction model is characterized by comprising:
a sample acquisition unit, configured to acquire a training sample set;
a first processing unit, configured to preprocess the training sample set to obtain a percolation phase transition point of the training sample set;
a second processing unit, configured to perform percolation processing on the training sample set based on the percolation phase transition point to obtain a connected probability matrix;
the vector extraction unit is used for extracting a target characteristic vector in the connected probability matrix;
the third processing unit is used for sequentially carrying out dimensionality reduction processing and embedding processing on the target feature vector to obtain a low-dimensional embedded vector;
and the model generation unit is used for training the logistic regression model by using the low-dimensional embedded vector to obtain a drug target prediction model.
9. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the method of any of claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-7 when executing the computer program.
CN202111475687.0A 2021-12-06 2021-12-06 Method and device for constructing drug target prediction model, storage medium and electronic equipment Pending CN114171114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111475687.0A CN114171114A (en) 2021-12-06 2021-12-06 Method and device for constructing drug target prediction model, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111475687.0A CN114171114A (en) 2021-12-06 2021-12-06 Method and device for constructing drug target prediction model, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114171114A true CN114171114A (en) 2022-03-11

Family

ID=80483232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111475687.0A Pending CN114171114A (en) 2021-12-06 2021-12-06 Method and device for constructing drug target prediction model, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114171114A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination