CN113270152B

CN113270152B - Method and system for predicting metabolic site of small molecule CYP metabolic enzyme

Info

Publication number: CN113270152B
Application number: CN202110420115.6A
Authority: CN
Inventors: 李远鹏; 陈萍; 赖力鹏; 温书豪; 马健
Original assignee: Beijing Jingtai Technology Co ltd
Current assignee: Beijing Jingtai Technology Co ltd
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2023-10-20
Anticipated expiration: 2041-04-19
Also published as: CN113270152A

Abstract

A method and system for predicting metabolic sites of small molecule CYP metabolic enzymes, comprising: collecting molecular structure data related to metabolic sites, and sorting the collected data into separate data sets by metabolic enzyme subtype; converting the two-dimensional molecular structure into a three-dimensional molecular structure; vectorizing the atom types in the three-dimensional structure of the molecule; vectorizing the distances in the three-dimensional structure of the molecule; using the atom-type vectors as the vertex information matrix and the distance vectors as the edge information matrix of a graph convolutional neural network, updating the vertex information according to the graph convolutional neural network model, and calculating the probability that each atom is a metabolic site; and predicting metabolic sites for the different CYP metabolic enzyme subtypes based on those probabilities. The method and system predict the sites at which a molecule reacts with CYP metabolic enzymes of different subtypes, and the prediction results can help medicinal chemists design or optimize molecular structures.

Description

Method and system for predicting metabolic site of small molecule CYP metabolic enzyme

Technical Field

The invention relates to the technical field of computers, in particular to a method and a system for predicting metabolic sites of small-molecule CYP metabolic enzymes.

Background

ADMET (absorption, distribution, metabolism, excretion, toxicity) prediction for potential drugs is an important part of predicting pharmaceutical properties. The liver is the main organ of drug clearance, and hepatic clearance takes two forms: hepatic metabolism and biliary excretion. The liver is rich in the enzymes required for phase I and phase II drug metabolism, of which the P450 enzymes are the most important. The P450 enzymes form a large family; based on amino acid sequence identity they are divided into several major classes, each of which is subdivided into several minor classes. The important P450 enzymes in humans include CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP2E1, CYP3A4 and CYP3A5. P450 enzymes differ significantly between species, so the metabolic pathways and metabolites of a drug may differ between animals and humans. Polymorphism is an important feature of P450 enzymes and an important cause of individual differences in drug response. Polymorphism means that the amount of a given P450 enzyme differs greatly between individuals of the same species. Individuals with a high amount metabolize rapidly and are known as extensive metabolizers; individuals with a low amount metabolize slowly and are known as poor metabolizers. Many P450 enzymes are polymorphic in humans, with CYP2D6 and CYP2C19 as typical examples. In addition, P450 enzymes can be induced and inhibited: the amount and activity of a P450 enzyme may be affected by a drug (or other exogenous substance), which may affect the metabolism of the drug itself and may cause metabolic drug interactions. In drug design, a potential drug that is metabolized too quickly may fail to reach an effective concentration in the target organ and therefore fail to exert its effect. A potential drug that is metabolized too slowly may accumulate in the liver, the metabolic organ, and cause hepatotoxicity. Therefore, in the early stages of drug design, the molecular structure is optimized according to the metabolic behavior of the drug in the animal body, so that it reaches an effective concentration without accumulating to hepatotoxic levels.

Disclosure of Invention

Based on this, it is necessary to provide a method for predicting the metabolic site of a small molecule CYP metabolic enzyme with improved prediction performance.

A system for predicting the metabolic site of a small molecule CYP metabolic enzyme with improved prediction performance is also provided.

A method of predicting a metabolic site of a small molecule CYP metabolizing enzyme, comprising:

data preparation: collecting molecular structure data related to metabolic sites, and sorting the collected data into separate data sets by metabolic enzyme subtype;

structural transformation: converting the two-dimensional molecular structure into a three-dimensional molecular structure;

vectorization of atom types: vectorizing the atom types in the three-dimensional structure of the molecule, and stacking the vectors of all atoms in the molecule in atom order to form a matrix M;

vectorization of distance: vectorizing the distances in the three-dimensional structure of the molecule by calculating the distance between every two atoms in the molecule to form a matrix D;

model construction: the atom-type vectors and the distance vectors are input into a graph convolutional neural network model, the atom-type vectors are used as the vertex information matrix of the graph convolutional neural network, the distance vectors are used as the edge information matrix of the graph convolutional neural network, the vertex information is updated according to the graph convolutional neural network model, and the probability that each atom is a metabolic site is calculated;

metabolic site prediction: metabolic site atoms are predicted for each CYP metabolic enzyme subtype; for each subtype, the atoms ranked highest by probability of being a metabolic site are predicted to be metabolic sites.

In a preferred embodiment, the method further comprises label addition: atoms are marked with a label value of 0 or 1 according to the prediction result, where 0 represents a non-metabolic site and 1 represents a metabolic site.

In a preferred embodiment, in the data preparation step, the collected data is sorted into 9 data sets, one per metabolic enzyme subtype: CYP1A2, CYP2A6, CYP2B6, CYP2C19, CYP2C8, CYP2C9, CYP2D6, CYP2E1 and CYP3A4; the molecular structures are desalted and desolvated and arranged into a standard model input format; if a molecule carries labels, the labels are converted into per-atom label values of 0 or 1, where 0 represents a non-metabolic site and 1 represents a metabolic site; the molecular structure data and label values are stored in a file of a set format, which contains: a. the type and coordinates of each atom in the molecule; b. the type of each atom-atom bond; c. the label values, i.e. whether each atom in the molecule is a potential metabolic site.

In a preferred embodiment, in the atom type vectorization step, the proton number of an atom is used to identify its atom type, and each atom in the molecule is converted into a row vector of 0s and 1s; the vector length is set to n, and if the proton number of the atom is x, the x-th position of its vector is 1 and the remaining n-1 positions are 0; the vectors of all atoms in the molecule are stacked in atom order to form a matrix M of size m × n, and if the number of atoms in the molecule is less than m, the matrix is padded with all-zero vectors; each element D{i, j} of the matrix D is the distance between the i-th atom and the j-th atom, and all diagonal elements of the matrix D are 0.

In a preferred embodiment, the vector length of an atom is 78; if the proton number of the atom is x, the x-th position of its vector is 1 and the remaining 77 positions are 0; the matrix M is 100 × 78, comprising 100 row vectors and 78 column vectors, and if the number of atoms in the molecule is less than 100, it is padded with all-zero vectors; the matrix D is a 100 × 100 matrix comprising 100 row vectors and 100 column vectors.

In a preferred embodiment, in the model construction step, the hyperparameters of the graph convolutional neural network model include: a graph convolution radius r of 3-5; the hyperparameters of the training process include: a mini-batch size of 32-128 and a learning rate of 0.0003-0.001; a sigmoid function is used as the activation function in the convolution process, and binary cross entropy on whether an atom is a metabolic site is used as the loss function; the updated vertex information includes: the information of the central atom, the information of the adjacent chemical bonds within the set radius around the central atom, and the information of the adjacent atoms;

After a set number of graph convolution layers, the information of the central atom is as follows:

after the first convolution, the updated vertex information includes: a. the information of the central atom, b. the information of all adjacent chemical bonds within radius 1, c. the information of all adjacent atoms within radius 1;

after the second convolution, the updated vertex information includes: a. the information of the central atom, b. the information of all adjacent chemical bonds within radius 2, c. the information of all adjacent atoms within radius 2;

after the third convolution, the updated vertex information includes: a. the information of the central atom, b. the information of all adjacent chemical bonds within radius 3, c. the information of all adjacent atoms within radius 3; and so on.
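The growth of the neighborhood with each convolution can be illustrated with a breadth-first search over the bond graph. This is only a sketch: it assumes the radius is counted in bonds, and the function name `atoms_within_radius` and the toy bond list are hypothetical.

```python
from collections import deque

def atoms_within_radius(bonds, center, radius):
    """Return the atom indices reachable from `center` in at most `radius` bonds."""
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    depth = {center: 0}          # bond distance from the central atom
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if depth[atom] == radius:
            continue             # do not expand beyond the requested radius
        for nxt in neighbors.get(atom, ()):
            if nxt not in depth:
                depth[nxt] = depth[atom] + 1
                queue.append(nxt)
    return sorted(depth)

# Hypothetical molecule: chain 0-1-2-3-4 with a branch atom 5 attached to atom 2.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5)]
fields = {k: atoms_within_radius(bonds, 0, k) for k in (1, 2, 3)}
```

With atom 0 as the central atom, each additional convolution extends the set of atoms whose information can reach it by one bond.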

In a preferred embodiment, the graph convolutional neural network model is a spatial-domain graph convolutional neural network model, and the spatial-domain graph convolution comprises a message passing process and a state update process. The message passing process comprises: taking a node as the center, gathering the information of the atoms and chemical bonds around the node. The state update process comprises: updating the information of the central node from the central node's own information combined with the information gathered in the message passing process. The inputs of the spatial-domain graph neural network model are the atom-type vectorization matrix M, used as the node information matrix, and the distance vectorization matrix D; the vectorized information of the j-th atom in the matrix M is denoted c_j, and each element D_ij of the distance matrix D records the distance between the i-th atom and the j-th atom.

functional form in the message passing process:

where W and b are parameters to be trained in the graph convolutional neural network, and v_ij is the influence of the i-th atom on the j-th atom, computed with the j-th atom as the center; this influence is the message passed in the graph convolutional neural network;

functional form in the state update process, by which the j-th node is updated:

The formula represents the state update of the central node: the information of the central node itself (c_probe) plus all the surrounding information.

To calculate the probability that an atom is a metabolic site, a single fully connected neural network layer is applied to the node atom:

$\mathrm{Probability} = \mathrm{sigmoid}(c_{probe} W + b) = 1 / (1 + e^{-(c_{probe} W + b)})$, where W and b are parameters to be trained in the graph convolutional neural network.
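The message passing and state update formulas are not reproduced in this text. The trainable parameters named later in the detailed description (W^fc, W^cf, W^df, b^f1, b^f2) match those of the interaction term in deep tensor neural networks (Schütt et al., 2017), so one plausible reading, offered only as an assumption, is the following. It uses this document's index convention (j is the central atom) and treats D_ij as a scalar distance; neither the tanh nonlinearity nor the element-wise product ∘ is confirmed by the text.

```latex
\begin{aligned}
v_{ij} &= \tanh\!\Big( W^{fc} \big[ (W^{cf} c_i + b^{f1}) \circ (W^{df} D_{ij} + b^{f2}) \big] \Big) \\
c_j^{(t+1)} &= c_j^{(t)} + \sum_{i \neq j} v_{ij} \\
\mathrm{Probability}_j &= \frac{1}{1 + e^{-(c_j W + b)}}
\end{aligned}
```

Under this reading, the first line is the message from atom i to the central atom j, the second line is the state update of the central node, and the third line is the fully connected readout stated above.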

In a preferred embodiment, in the metabolic site prediction step, the three atoms ranked highest by probability of being a metabolic site are predicted to be metabolic sites for each CYP metabolic enzyme subtype.

A system for predicting a metabolic site of a small molecule CYP metabolizing enzyme, comprising:

data preparation module: collecting molecular structure data related to metabolic sites, and sorting the collected data into separate data sets by metabolic enzyme subtype;

structural transformation module: converting the two-dimensional molecular structure into a three-dimensional molecular structure;

atom type vectorization module: vectorizing the atom types in the three-dimensional structure of the molecule, and stacking the vectors of all atoms in the molecule in atom order to form a matrix M;

distance vectorization module: vectorizing the distances in the three-dimensional structure of the molecule by calculating the distance between every two atoms in the molecule to form a matrix D;

model construction module: the atom-type vectors and the distance vectors are input into a graph convolutional neural network model, the atom-type vectors are used as the vertex information matrix of the graph convolutional neural network, the distance vectors are used as the edge information matrix of the graph convolutional neural network, the vertex information is updated according to the graph convolutional neural network model, and the probability that each atom is a metabolic site is calculated;

metabolic site prediction module: metabolic site atoms are predicted for each CYP metabolic enzyme subtype; for each subtype, the atoms ranked highest by probability of being a metabolic site are predicted to be metabolic sites.

In a preferred embodiment, in the data preparation module, the molecular structures are desalted and desolvated and arranged into a standard model input format; if a molecule carries labels, the labels are converted into per-atom label values of 0 or 1, where 0 represents a non-metabolic site and 1 represents a metabolic site; the molecular structure data and label values are stored in a file of a set format, which contains: a. the type and coordinates of each atom in the molecule; b. the type of each atom-atom bond; c. the label values, i.e. whether each atom in the molecule is a potential metabolic site;

in the atom type vectorization module, the proton number of an atom is used to identify its atom type, and each atom in the molecule is converted into a row vector of 0s and 1s; the vector length is set to n, and if the proton number of the atom is x, the x-th position of its vector is 1 and the remaining n-1 positions are 0; the vectors of all atoms in the molecule are stacked in atom order to form a matrix M of size m × n, and if the number of atoms in the molecule is less than m, the matrix is padded with all-zero vectors;

in the distance vectorization module, each element D{i, j} of the matrix D is the distance between the i-th atom and the j-th atom, and all diagonal elements of the matrix D are 0;

in the model construction module, the hyperparameters of the graph convolutional neural network model include: a graph convolution radius r of 3-5; the hyperparameters of the training process include: a mini-batch size of 32-128 and a learning rate of 0.0003-0.001; a sigmoid function is used as the activation function in the convolution process, and binary cross entropy on whether an atom is a metabolic site is used as the loss function; the updated vertex information includes: the information of the central atom, the information of the adjacent chemical bonds within the set radius around the central atom, and the information of the adjacent atoms.

The above method and system for predicting the metabolic sites of small-molecule CYP metabolic enzymes focus on predicting the sites at which a molecule reacts with CYP metabolic enzymes of different subtypes during in-vivo metabolism of the drug molecule. The collected data is sorted into separate data sets by metabolic enzyme subtype, and metabolic site atoms are predicted for each CYP metabolic enzyme subtype, which improves the prediction performance. The atom-type vectors and the distance vectors are input into a graph convolutional neural network model, with the atom-type vectors used as the vertex information matrix and the distance vectors used as the edge information matrix; the probability that each atom is a metabolic site is calculated according to the graph convolutional neural network model, which further improves the prediction performance and can help medicinal chemists design or optimize molecular structures.

Drawings

FIG. 1 is a partial flow chart of a method of predicting a metabolic site of a small molecule CYP metabolizing enzyme according to an embodiment of the invention;

FIG. 2 is a graph showing the predicted results of the molecules of one embodiment of the present invention on CYP metabolic enzyme subtype 1A2;

FIG. 3 is a graph showing the predicted results of the molecules of one embodiment of the present invention on CYP metabolic enzyme subtype 2A6;

FIG. 4 is a graph showing the predicted results of the molecules of one embodiment of the present invention on CYP metabolic enzyme subtype 2B6;

FIG. 5 is a graph showing the predicted results of the molecules of one embodiment of the present invention on CYP metabolic enzyme subtype 2C19;

FIG. 6 is a graph showing the predicted results of the molecules of one embodiment of the present invention on CYP metabolic enzyme subtype 2C8;

FIG. 7 is a graph showing the predicted results of the molecules of one embodiment of the present invention on CYP metabolic enzyme subtype 2C9;

FIG. 8 is a graph showing the predicted results of the molecules of one embodiment of the present invention on CYP metabolic enzyme subtype 2D6;

FIG. 9 is a graph showing the predicted results of the molecules of one embodiment of the present invention on CYP metabolic enzyme subtype 2E1;

FIG. 10 is a graph showing the predicted results of the molecules of one embodiment of the present invention on CYP metabolic enzyme subtype 3A4.

Detailed Description

As shown in fig. 1, a method for predicting a metabolic site of a small molecule CYP metabolic enzyme according to an embodiment of the present invention includes:

Step S101, data preparation: collecting molecular structure data related to metabolic sites, and sorting the collected data into separate data sets by metabolic enzyme subtype;

Step S103, structural transformation: converting the two-dimensional molecular structure into a three-dimensional molecular structure;

Step S105, vectorization of atom types: vectorizing the atom types in the three-dimensional structure of the molecule, and stacking the vectors of all atoms in the molecule in atom order to form a matrix M;

Step S107, vectorization of distance: vectorizing the distances in the three-dimensional structure of the molecule by calculating the distance between every two atoms in the molecule to form a matrix D;

Step S109, model construction: the atom-type vectors and the distance vectors are input into a graph convolutional neural network model, the atom-type vectors are used as the vertex information matrix of the graph convolutional neural network, the distance vectors are used as the edge information matrix of the graph convolutional neural network, the vertex information is updated according to the graph convolutional neural network model, and the probability that each atom is a metabolic site is calculated;

Step S111, metabolic site prediction: metabolic site atoms are predicted for each CYP metabolic enzyme subtype; for each subtype, the atoms ranked highest by probability of being a metabolic site are predicted to be metabolic sites.

Further, the method for predicting a metabolic site of a small molecule CYP metabolic enzyme of the present embodiment further comprises label addition: atoms are marked with a label value of 0 or 1 according to the prediction result, where 0 represents a non-metabolic site and 1 represents a metabolic site.

In this embodiment, the molecular structure data related to metabolic sites can be collected from public data sets such as DrugBank and ChEMBL.

Further, in the data preparation step of this embodiment, the collected data is sorted into 9 data sets, one per metabolic enzyme subtype: CYP1A2, CYP2A6, CYP2B6, CYP2C19, CYP2C8, CYP2C9, CYP2D6, CYP2E1 and CYP3A4.

Because data collected from public data sets is inconsistent in format, the molecular structures need to be desalted, desolvated and otherwise cleaned, and the data arranged into a standard model input format.

If a molecule carries labels, the labels are converted into per-atom label values of 0 or 1, where 0 represents a non-metabolic site and 1 represents a metabolic site; the molecular structure data and label values are stored in a file of a set format, which contains: a. the type and coordinates of each atom in the molecule; b. the type of each atom-atom bond; c. the label values, i.e. whether each atom in the molecule is a potential metabolic site.

Preferably, in this embodiment, the data and label values of one sample are stored as an SDF file. The SDF file contains: a. the type and coordinates of each atom in the molecule; b. the type of each atom-atom bond; c. the label values, i.e. which atoms in the molecule are potential metabolic sites. Finally, the SDF files are placed in separate folders by metabolic enzyme subtype. The inputs of the model are a and b, and the output label value of the model is c.
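A minimal sketch of how one sample might be held in memory and filed by subtype. The `Sample` class, the `data/CYP...` folder naming and both helper functions are hypothetical; the text specifies only that SDF files are grouped into folders by subtype, with a and b as model inputs and c as the output label.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

SUBTYPES = ("1A2", "2A6", "2B6", "2C19", "2C8", "2C9", "2D6", "2E1", "3A4")

@dataclass
class Sample:
    atoms: list   # a. (element symbol, (x, y, z)) for each atom
    bonds: list   # b. (atom index, atom index, bond type) for each bond
    labels: list  # c. 1 if the atom is a potential metabolic site, else 0

def sample_path(root, subtype, mol_id):
    """Hypothetical layout: one folder per CYP subtype, one SDF file per molecule."""
    if subtype not in SUBTYPES:
        raise ValueError(f"unknown CYP subtype: {subtype}")
    return PurePosixPath(root) / f"CYP{subtype}" / f"{mol_id}.sdf"

def inputs_and_target(sample):
    """Split a sample into the model inputs (a, b) and the output label (c)."""
    return (sample.atoms, sample.bonds), sample.labels

# Toy three-atom fragment with made-up coordinates and labels.
sample = Sample(
    atoms=[("C", (0.0, 0.0, 0.0)), ("C", (1.5, 0.0, 0.0)), ("O", (2.2, 1.2, 0.0))],
    bonds=[(0, 1, "SINGLE"), (1, 2, "SINGLE")],
    labels=[0, 1, 0],
)
path = sample_path("data", "2D6", "mol_0001")
inputs, target = inputs_and_target(sample)
```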

Step S103, structural transformation: the two-dimensional molecular structure is converted into a three-dimensional molecular structure using the open-source cheminformatics toolkit RDKit, with the MMFF94 force field in RDKit.

Further, preferably, RDKit is used to convert the two-dimensional molecular structure into a three-dimensional molecular structure as follows:

1. The force field in RDKit is defined first: the call prop = AllChem.MMFFGetMoleculeProperties(m, mmffVariant="MMFF94s") defines the "MMFF94s" force field.
2. The three-dimensional coordinates of the molecule are randomly initialized, and the "MMFF94s" force field is applied with ff = rdkit.Chem.AllChem.MMFFGetMoleculeForceField(m, prop, confId=id).
3. Because the random three-dimensional coordinates may have unreasonable bond lengths or geometry, the initialized three-dimensional structure of the molecule must be optimized; the three-dimensional conformation is optimized with the "MMFF94s" force field by calling ff.Minimize() in RDKit.

Further, in the atom type vectorization step of this embodiment, the proton number of an atom is used to identify its atom type, and each atom in the molecule is converted into a row vector of 0s and 1s; the vector length is set to n, and if the proton number of the atom is x, the x-th position of its vector is 1 and the remaining n-1 positions are 0; the vectors of all atoms in the molecule are stacked in atom order to form a matrix M of size m × n, and if the number of atoms in the molecule is less than m, the matrix is padded with all-zero vectors.

Further, the vector length of an atom in this embodiment is 78; if the proton number of the atom is x, the x-th position of its vector is 1 and the remaining 77 positions are 0; the matrix M is 100 × 78, comprising 100 row vectors and 78 column vectors, and if the number of atoms in the molecule is less than 100, it is padded with all-zero vectors.

Specifically, to vectorize each atom in a sample molecule, the proton number of the atom is used as the key indicator of atom type, e.g. 8 for oxygen (O) and 6 for carbon (C). The atom types in all molecules were counted; because the collected molecules are organic molecules, no atom has more than 78 protons. All atom types are then arranged in periodic-table order: [H, He, Li, Be, B, C, N, O, ...], for a total length of 78. The atoms within a molecule are also ordered according to the molecular structure, for example: C, C, O, C, C, N. The first C atom in the molecule is converted into a one-hot vector, [0,0,0,0,0,1,0,0,...], of length 78, in which only the 6th position is 1 and all other positions are 0. The second C atom in the molecule is converted by the same rule into [0,0,0,0,0,1,0,0,...], the O atom in the third position into [0,0,0,0,0,0,0,1,...], and so on. Finally, the vectors of all atoms in a molecule are stacked in atom order to form the matrix M, whose size is 100 × 78. The first dimension of the matrix M is the number of atoms in the molecule and the second dimension is the length of each atom vector. If a molecule has fewer than 100 atoms, the matrix M is padded with all-zero vectors.
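The encoding just described can be sketched in plain Python as follows. The function names `one_hot` and `atom_type_matrix` are illustrative only; the sizes 100 and 78 and the example atom order C, C, O, C, C, N come from the text.

```python
MAX_ATOMS = 100      # rows of M
VECTOR_LENGTH = 78   # columns of M, one per proton number 1..78

def one_hot(proton_number, length=VECTOR_LENGTH):
    """Row vector with a 1 at position `proton_number` (1-based) and 0 elsewhere."""
    if not 1 <= proton_number <= length:
        raise ValueError(f"proton number {proton_number} outside 1..{length}")
    vector = [0] * length
    vector[proton_number - 1] = 1
    return vector

def atom_type_matrix(proton_numbers, max_atoms=MAX_ATOMS, length=VECTOR_LENGTH):
    """Stack one-hot rows in atom order and pad with all-zero rows up to max_atoms."""
    if len(proton_numbers) > max_atoms:
        raise ValueError("molecule has more atoms than the matrix has rows")
    rows = [one_hot(z, length) for z in proton_numbers]
    rows += [[0] * length for _ in range(max_atoms - len(rows))]
    return rows

# Atom order from the example in the text: C, C, O, C, C, N.
M = atom_type_matrix([6, 6, 8, 6, 6, 7])
```

Rows 0 to 5 of `M` each hold a single 1 (at index 5 for C, 7 for O, 6 for N), and rows 6 to 99 are the all-zero padding.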

In step S107, the distance vectorization step, each element D{i, j} of the matrix D is the distance between the i-th atom and the j-th atom, and all diagonal elements of the matrix D are 0.

Further, preferably, the matrix D is a 100×100 matrix, including 100 row vectors and 100 column vectors.
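A sketch of building the matrix D from three-dimensional coordinates. The text does not say how D is filled for molecules with fewer than 100 atoms; zero padding is assumed here by analogy with the matrix M, and the function name and toy coordinates are hypothetical.

```python
import math

def distance_matrix(coords, max_atoms=100):
    """Symmetric matrix of pairwise Euclidean distances, zero on the diagonal.

    Rows and columns beyond the real atoms are left at 0.0 (assumed padding).
    """
    n = len(coords)
    if n > max_atoms:
        raise ValueError("molecule has more atoms than the matrix has rows")
    D = [[0.0] * max_atoms for _ in range(max_atoms)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            D[i][j] = d
            D[j][i] = d   # distance is symmetric
    return D

# Toy coordinates for three atoms.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)]
D = distance_matrix(coords)
```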

Further, in the model construction of this embodiment, the hyperparameters of the graph convolutional neural network model include: a graph convolution radius r of 3-5, a mini-batch size of 32-128, and a learning rate of 0.0003-0.001; a sigmoid function is used as the activation function in the convolution process, and binary cross entropy on whether an atom is a metabolic site is used as the loss function.

Further, in this embodiment, the inputs of the graph convolutional neural network model are the atom-type vectors and the distance vectors. The output of the model is, for each atom, the probability that it is a metabolic site. The model uses the atom-type vectors as the vertex information matrix of the graph convolutional neural network and the distance vectors as the edge information matrix, and updates the vertex information according to the rules of the graph convolutional neural network. Preferably, the hyperparameters of the graph neural network model in this embodiment are: a convolution radius r of 5, a sigmoid activation function in the convolution process, and a binary cross-entropy loss on whether an atom is a metabolic site. The hyperparameters of the training process in this embodiment are: a mini-batch size of 32 and a learning rate of 0.0003.
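The stated hyperparameters and loss can be written down as a short sketch. The dictionary keys, the function names and the optional `mask` argument (which skips padded rows) are assumptions for illustration; the text gives only the values, the ranges and the choice of binary cross entropy.

```python
import math

# Preferred values stated for this embodiment.
HYPERPARAMS = {"radius": 5, "mini_batch_size": 32, "learning_rate": 0.0003}

# Ranges stated for each hyperparameter.
ALLOWED = {
    "radius": (3, 5),
    "mini_batch_size": (32, 128),
    "learning_rate": (0.0003, 0.001),
}

def check_hyperparams(hp):
    """Raise ValueError if any value falls outside its stated range."""
    for name, (low, high) in ALLOWED.items():
        if not low <= hp[name] <= high:
            raise ValueError(f"{name}={hp[name]} outside [{low}, {high}]")
    return True

def binary_cross_entropy(probs, labels, mask=None, eps=1e-12):
    """Mean binary cross entropy over atoms; entries with mask 0 are skipped."""
    if mask is None:
        mask = [1] * len(probs)
    total, count = 0.0, 0
    for p, y, m in zip(probs, labels, mask):
        if not m:
            continue
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        count += 1
    return total / count if count else 0.0

loss = binary_cross_entropy([0.9, 0.1, 0.5], [1, 0, 0], mask=[1, 1, 0])
```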

Further, updating vertex information of the present embodiment includes: information of a central atom, information of adjacent chemical bonds in a radius range set by taking the central atom as a center, and information of adjacent atoms;

Further, the graph convolutional neural network model of this embodiment is a spatial-domain graph convolutional neural network model, and the spatial-domain graph convolution comprises a message passing process and a state update process. The message passing process comprises: taking a node as the center, gathering the information of the atoms and chemical bonds around the node. The state update process comprises: updating the information of the central node from the central node's own information combined with the information gathered in the message passing process. The inputs of the spatial-domain graph neural network model are the atom-type vectorization matrix M, used as the node information matrix, and the distance vectorization matrix D; the vectorized information of the j-th atom in the matrix M is denoted c_j, and each element D_ij of the distance matrix D records the distance between the i-th atom and the j-th atom.

Functional form in the message passing process:

where W and b are the parameters to be trained in the graph convolutional neural network, namely W^fc, W^cf, W^df, b^f1 and b^f2, and v_ij is the influence of the i-th atom on the j-th atom, computed with the j-th atom as the center; this influence is the message passed in the graph convolutional neural network;

Functional form in the state update process, i.e. the formula for updating the j-th node:

The formula represents the state update of the central node: the information of the central node itself (c_probe) plus all the surrounding information. To calculate the probability that an atom is a metabolic site, a single fully connected neural network layer is applied to the node atom:

$\mathrm{Probability} = \mathrm{sigmoid}(c_{probe} W + b) = 1 / (1 + e^{-(c_{probe} W + b)})$, where W and b are parameters to be trained in the graph convolutional neural network.
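The following pure-Python sketch walks through one possible message passing, state update and readout. The message and update formulas are not reproduced in this text, so the form used here is an assumption: it borrows the interaction term of deep tensor neural networks (Schütt et al., 2017), whose parameter names match W^fc, W^cf, W^df, b^f1 and b^f2, and it treats D_ij as a scalar. The tanh nonlinearity, the toy dimensions and the random untrained weights are all illustrative.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def message(c_i, d_ij, p):
    """Influence v_ij of neighboring atom i on central atom j (assumed form)."""
    atom_term = [a + b for a, b in zip(matvec(p["W_cf"], c_i), p["b_f1"])]
    dist_term = [w * d_ij + b for w, b in zip(p["W_df"], p["b_f2"])]
    mixed = [a * d for a, d in zip(atom_term, dist_term)]  # element-wise product
    return [math.tanh(x) for x in matvec(p["W_fc"], mixed)]

def update(C, D, p):
    """State update: c_j <- c_j + sum of v_ij over all atoms i != j."""
    n = len(C)
    updated = []
    for j in range(n):
        c_new = [float(x) for x in C[j]]
        for i in range(n):
            if i != j:
                v_ij = message(C[i], D[i][j], p)
                c_new = [c + v for c, v in zip(c_new, v_ij)]
        updated.append(c_new)
    return updated

def site_probability(c_probe, W, b):
    """Fully connected readout: sigmoid(c_probe * W + b)."""
    z = sum(w * x for w, x in zip(W, c_probe)) + b
    return 1.0 / (1.0 + math.exp(-z))

def random_params(dim, hidden, rng):
    """Untrained toy parameters; training would fit these values."""
    def vec(n):
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]
    def mat(rows, cols):
        return [vec(cols) for _ in range(rows)]
    return {"W_cf": mat(hidden, dim), "b_f1": vec(hidden),
            "W_df": vec(hidden), "b_f2": vec(hidden),
            "W_fc": mat(dim, hidden), "W": vec(dim), "b": 0.0}

rng = random.Random(0)
dim, hidden, radius = 4, 5, 3
params = random_params(dim, hidden, rng)
C = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]            # toy one-hot atom types
D = [[0.0, 1.5, 2.4], [1.5, 0.0, 1.2], [2.4, 1.2, 0.0]]   # toy distances

for _ in range(radius):   # one state update per convolution layer
    C = update(C, D, params)
probabilities = [site_probability(c, params["W"], params["b"]) for c in C]
```

In the embodiment the node vectors would have length 78 and the matrices 100 rows; the small sizes here only keep the example readable.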

Further, in the metabolic site prediction step, the three atoms ranked highest by probability of being a metabolic site are predicted to be metabolic sites for each CYP metabolic enzyme subtype.
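Selecting the top three atoms and deriving the 0/1 labels can be sketched as below. The function names and the probability values are made up for illustration; the `n_atoms` argument is an assumption that keeps padded rows out of the ranking.

```python
def top_k_sites(probabilities, n_atoms, k=3):
    """Indices of the k real atoms with the highest metabolic-site probability."""
    ranked = sorted(range(n_atoms), key=lambda i: probabilities[i], reverse=True)
    return ranked[:k]

def site_labels(probabilities, n_atoms, k=3):
    """Label value per atom: 1 for a predicted metabolic site, 0 otherwise."""
    chosen = set(top_k_sites(probabilities, n_atoms, k))
    return [1 if i in chosen else 0 for i in range(n_atoms)]

# Hypothetical model outputs for one six-atom molecule on two subtypes.
predictions = {
    "CYP1A2": [0.05, 0.81, 0.12, 0.64, 0.33, 0.02],
    "CYP3A4": [0.40, 0.10, 0.75, 0.20, 0.05, 0.55],
}
sites = {name: top_k_sites(p, n_atoms=6) for name, p in predictions.items()}
labels = {name: site_labels(p, n_atoms=6) for name, p in predictions.items()}
```

The same molecule can receive different predicted sites on different subtypes, which is why each subtype is ranked separately.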

Referring to fig. 2-10, the most likely metabolic site atoms are predicted on 9 different CYP metabolic enzyme subtypes.

FIG. 2 shows the prediction results for a molecule on CYP metabolizing enzyme subtype 1A2. The upper left corner of FIG. 2 shows the molecular structure with the atoms numbered in the order they appear in the sdf file, and the lower left corner of FIG. 2 marks the three most likely metabolic site atoms in the prediction result with shaded dots. The right side of FIG. 2 lists the predicted probability of each atom being a metabolic site.

FIG. 3 shows the prediction results for a molecule on CYP metabolizing enzyme subtype 2A6. The upper left corner of FIG. 3 shows the molecular structure with the atoms numbered in the order they appear in the sdf file, and the lower left corner of FIG. 3 marks the three most likely metabolic site atoms in the prediction result with shaded dots. The right side of FIG. 3 lists the predicted probability of each atom being a metabolic site.

FIG. 4 shows the prediction results for a molecule on CYP metabolizing enzyme subtype 2B6. The upper left corner of FIG. 4 shows the molecular structure with the atoms numbered in the order they appear in the sdf file, and the lower left corner of FIG. 4 marks the three most likely metabolic site atoms in the prediction result with shaded dots. The right side of FIG. 4 lists the predicted probability of each atom being a metabolic site.

FIG. 5 shows the prediction results for a molecule on CYP metabolizing enzyme subtype 2C19. The upper left corner of FIG. 5 shows the molecular structure with the atoms numbered in the order they appear in the sdf file, and the lower left corner of FIG. 5 marks the three most likely metabolic site atoms in the prediction result with shaded dots. The right side of FIG. 5 lists the predicted probability of each atom being a metabolic site.

FIG. 6 shows the prediction results for a molecule on CYP metabolizing enzyme subtype 2C8. The upper left corner of FIG. 6 shows the molecular structure with the atoms numbered in the order they appear in the sdf file, and the lower left corner of FIG. 6 marks the three most likely metabolic site atoms in the prediction result with shaded dots. The right side of FIG. 6 lists the predicted probability of each atom being a metabolic site.

FIG. 7 shows the prediction results for a molecule on CYP metabolizing enzyme subtype 2C9. The upper left corner of FIG. 7 shows the molecular structure with the atoms numbered in the order they appear in the sdf file, and the lower left corner of FIG. 7 marks the three most likely metabolic site atoms in the prediction result with shaded dots. The right side of FIG. 7 lists the predicted probability of each atom being a metabolic site.

FIG. 8 shows the prediction results for a molecule on CYP metabolizing enzyme subtype 2D6. The upper left corner of FIG. 8 shows the molecular structure with the atoms numbered in the order they appear in the sdf file, and the lower left corner of FIG. 8 marks the three most likely metabolic site atoms in the prediction result with shaded dots. The right side of FIG. 8 lists the predicted probability of each atom being a metabolic site.

FIG. 9 shows the prediction results for a molecule on CYP metabolizing enzyme subtype 2E1. The upper left corner of FIG. 9 shows the molecular structure with the atoms numbered in the order they appear in the sdf file, and the lower left corner of FIG. 9 marks the three most likely metabolic site atoms in the prediction result with shaded dots. The right side of FIG. 9 lists the predicted probability of each atom being a metabolic site.

FIG. 10 shows the prediction results for a molecule on CYP metabolizing enzyme subtype 3A4. The upper left corner of FIG. 10 shows the molecular structure with the atoms numbered in the order they appear in the sdf file, and the lower left corner of FIG. 10 marks the three most likely metabolic site atoms in the prediction result with shaded dots. The right side of FIG. 10 lists the predicted probability of each atom being a metabolic site.

The system for predicting a metabolic site of a small molecule CYP metabolic enzyme according to an embodiment of the invention comprises:

Further, the system for predicting a metabolic site of a small molecule CYP metabolizing enzyme of the present embodiment further comprises a label-adding module: atoms are marked with a tag value of 0 or 1 according to the predicted result, where 0 represents a non-metabolic site and 1 represents a metabolic site.

The structure conversion module of this embodiment: the two-dimensional molecular structure is converted into a three-dimensional molecular structure using the open-source cheminformatics toolkit RDKit, with the MMFF94 force field in RDKit.
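One plausible way to carry out such a conversion with RDKit is sketched below. The embodiment names only RDKit and MMFF94; the SMILES input, the function name, the fixed random seed, and the embed-then-optimize sequence are assumptions made for this illustration. RDKit is imported inside the function so that the definition can be loaded even where the toolkit is not installed.

```python
from typing import List, Tuple


def smiles_to_3d(smiles: str, seed: int = 42) -> Tuple[List[str], List[Tuple[float, float, float]]]:
    """Embed a molecule in 3D and relax it with MMFF94 (illustrative sketch)."""
    # Lazy import: RDKit is an optional third-party dependency.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)  # explicit hydrogens give a more sensible 3D geometry

    # EmbedMolecule returns the new conformer id, or -1 if embedding failed.
    if AllChem.EmbedMolecule(mol, randomSeed=seed) == -1:
        raise RuntimeError("3D embedding failed")

    # Relax the embedded geometry with the MMFF94 force field.
    AllChem.MMFFOptimizeMolecule(mol, mmffVariant="MMFF94")

    conformer = mol.GetConformer()
    symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
    coords = []
    for i in range(mol.GetNumAtoms()):
        p = conformer.GetAtomPosition(i)
        coords.append((p.x, p.y, p.z))
    return symbols, coords
```

Calling `smiles_to_3d("CCO")` would return the element symbols and Cartesian coordinates of ethanol with hydrogens included; the coordinates can then feed the distance vectorization.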

Further, in the atom type vectorization module of this embodiment, the number of protons of an atom is used to determine its atom type, and each atom in the molecule is converted into a row vector consisting of 0s and 1s. The vector length is set to n; if the number of protons of the atom is x, the x-th position of the corresponding vector is 1 and the remaining n-1 positions are 0. All atoms in the molecule are vectorized into a matrix M according to the atom order, where M is an m×n matrix; if the number of atoms in the molecule is smaller than m, the matrix is padded with all-0 vectors.
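The atom type encoding can be sketched in a few lines. In this illustration the function name is hypothetical, the defaults n = 78 and m = 100 follow the values recited in claim 4, and the x-th position is assumed to be counted from 1 (so it maps to list index x-1).

```python
from typing import List


def one_hot_atom_types(atomic_numbers: List[int], n: int = 78, m: int = 100) -> List[List[int]]:
    """Build the m x n atom type matrix M, padded with all-0 rows."""
    if len(atomic_numbers) > m:
        raise ValueError(f"molecule has more than {m} atoms")
    matrix = []
    for z in atomic_numbers:
        if not 1 <= z <= n:
            raise ValueError(f"proton number {z} does not fit a vector of length {n}")
        row = [0] * n
        row[z - 1] = 1  # x-th position (1-based) set to 1, all others stay 0
        matrix.append(row)
    # Pad with all-0 vectors up to m rows.
    matrix.extend([0] * n for _ in range(m - len(atomic_numbers)))
    return matrix


# Heavy atoms of ethanol: carbon (6), carbon (6), oxygen (8).
M = one_hot_atom_types([6, 6, 8])
```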

In the distance vectorization module, each element D_ij of the matrix D is the distance between the i-th atom and the j-th atom, and all diagonal elements of the matrix D are 0.
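A minimal sketch of the distance matrix follows. It assumes Euclidean distance between 3D atom coordinates and assumes that rows and columns beyond the real atoms are padded with 0 so that D has a fixed m×m size (claim 4 recites 100×100); the function name and the sample coordinates are hypothetical.

```python
import math
from typing import List, Sequence


def distance_matrix(coords: Sequence[Sequence[float]], m: int = 100) -> List[List[float]]:
    """Build the m x m matrix D with D[i][j] = distance between atoms i and j."""
    if len(coords) > m:
        raise ValueError(f"molecule has more than {m} atoms")
    D = [[0.0] * m for _ in range(m)]  # padded entries stay 0 (assumption)
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i != j:  # diagonal elements remain 0
                D[i][j] = math.dist(a, b)
    return D


# Three made-up atom positions; m=4 leaves one padded row and column.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)]
D = distance_matrix(coords, m=4)
```

The matrix is symmetric by construction, since the distance from atom i to atom j equals the distance from atom j to atom i.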

Further, in the present embodiment, the inputs of the graph convolutional neural network model are the vectorization of the atom type and the vectorization of the distance. The output of the model is the probability that each atom is a metabolic site. The model takes the atom type vectors as the vertex information vectorization matrix of the graph convolutional neural network, takes the distance vectorization as the edge information vectorization matrix of the graph convolutional neural network, and updates the vertex information according to the rules of the graph convolutional neural network. Preferably, the hyperparameters of the graph neural network model of the present embodiment include: a convolution radius r of 5; a sigmoid function as the activation function in the convolution process; and, as the loss function, the binary cross entropy of whether an atom is a metabolic site. The hyperparameters of the training process of the graph neural network model of the present embodiment include: a mini_batch size of 32 and a learning rate of 0.0003.
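The loss named above can be illustrated with a small, framework-free sketch. The hyperparameter values are those of the embodiment, but the dictionary keys, the function names, and the mask that excludes padded atoms from the loss are assumptions made for this example.

```python
import math
from typing import Sequence

# Values quoted from the embodiment; the key names are hypothetical.
HYPERPARAMETERS = {
    "convolution_radius": 5,
    "activation": "sigmoid",
    "mini_batch_size": 32,
    "learning_rate": 0.0003,
}


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int],
                         mask: Sequence[bool], eps: float = 1e-12) -> float:
    """Mean binary cross entropy over the atoms selected by the mask."""
    total, count = 0.0, 0
    for p, y, keep in zip(probs, labels, mask):
        if not keep:  # skip padded atoms (assumption, not stated in the source)
            continue
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        count += 1
    if count == 0:
        raise ValueError("mask selects no atoms")
    return total / count


# Two real atoms (labels 1 and 0) and one padded position.
logits = [0.0, 0.0, 2.0]
labels = [1, 0, 0]
mask = [True, True, False]
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(probs, labels, mask)
```

With both real atoms predicted at probability 0.5, the loss equals ln 2, the value for an uninformative prediction.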

Further, the graph convolutional neural network model of this embodiment is a spatial-domain graph convolutional neural network model. The spatial-domain graph convolution comprises a message passing process and a state update process. The message passing process comprises: taking a node as the center and gathering together the information of the atoms around the node and the information of the chemical bonds. The state update process comprises: updating the information of the central node from the central node's own information together with the information gathered in the message passing process. The inputs of the spatial-domain graph neural network model are the atom type vectorization matrix M, which serves as the node information matrix, and the distance vectorization matrix D. The vectorized information of the j-th atom in the matrix M is denoted c_j, and each element D_ij of the distance vectorization matrix D records the distance between the i-th atom and the j-th atom.
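The two processes can be illustrated with a generic sketch. The functional forms used here are assumptions chosen only to show the mechanics: the message from atom i to the central atom j is taken as a sigmoid of a linear map of c_i concatenated with the distance D_ij, messages are summed, and the sum is added to c_j. The function names and the sample values are hypothetical, and this is not a reproduction of the formulas of the present application.

```python
import math
from typing import List

Vector = List[float]


def _sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def message(c_i: Vector, d_ij: float, W: List[Vector], b: Vector) -> Vector:
    """Assumed message form: sigmoid(W @ [c_i, d_ij] + b)."""
    inputs = c_i + [d_ij]
    return [_sigmoid(sum(w * x for w, x in zip(row, inputs)) + b_k)
            for row, b_k in zip(W, b)]


def update_nodes(C: List[Vector], D: List[List[float]],
                 W: List[Vector], b: Vector) -> List[Vector]:
    """One round of message passing followed by a state update for every node."""
    updated = []
    for j, c_j in enumerate(C):
        gathered = [0.0] * len(c_j)
        for i, c_i in enumerate(C):
            if i == j:
                continue
            v_ij = message(c_i, D[i][j], W, b)  # influence of atom i on atom j
            gathered = [g + v for g, v in zip(gathered, v_ij)]
        # State update: the node's own information plus the gathered information.
        updated.append([c + g for c, g in zip(c_j, gathered)])
    return updated


# Two atoms with 2-dimensional features; W is 2 x 3 because the distance is appended.
C = [[1.0, 0.0], [0.0, 1.0]]
D = [[0.0, 1.5], [1.5, 0.0]]
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b = [0.0, 0.0]
C_new = update_nodes(C, D, W, b)
```

With all-zero weights every message is 0.5 in each component, so each node simply gains 0.5 per neighbor; trained values of W and b would make the messages depend on atom type and distance.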

Functional form in the message passing process:


The present application focuses on predicting the metabolic sites at which drug molecules chemically react with different subtypes of CYP metabolizing enzymes during in vivo metabolism. This prediction can help pharmaceutical chemists design or optimize molecular structures.

With the above preferred embodiments of the present application as a teaching, a person skilled in the art can make various changes and modifications without departing from the scope of the technical idea of the present application. The technical scope of the present application is not limited to the contents of the specification and must be determined according to the scope of the claims.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. A method of predicting a metabolic site of a small molecule CYP metabolizing enzyme, comprising:

predicting metabolic sites: predicting metabolic site atoms on different CYP metabolic enzyme subtypes, and taking out the atoms which are ranked at the top according to the probability of being metabolic sites on different CYP metabolic enzyme subtypes to predict the atoms as metabolic sites;

in the atom type vectorization step, the number of protons of an atom is used to determine its atom type, and each atom in the molecule is converted into a row vector consisting of 0s and 1s; the vector length is set to n, and if the number of protons of the atom is x, the x-th position of the corresponding vector is 1 and the remaining n-1 positions are 0; all atoms in the molecule are vectorized into a matrix M according to the atom order, the matrix M being an m×n matrix, and if the number of atoms in the molecule is smaller than m, the matrix is padded with all-0 vectors; each element D(i, j) of the matrix D represents the distance between the i-th atom and the j-th atom, and all diagonal elements of the matrix D are 0;

in the step of predicting metabolic sites, the three top-ranked atoms on each CYP metabolic enzyme subtype, ranked by the probability of being a metabolic site, are taken out and predicted as metabolic sites.

2. The method for predicting the metabolic site of a small molecule CYP metabolizing enzyme of claim 1, further comprising: adding a label: atoms are marked with a 0,1 tag value according to the predicted result, 0 represents a non-metabolic site, and 1 represents a metabolic site.

3. The method for predicting metabolic sites of small molecule CYP metabolizing enzymes of claim 1, wherein in the sorting data step, the collected data is sorted into 9 data sets of different subtypes based on the different metabolic enzyme subtypes: CYP 1A2, CYP 2A6, CYP 2B6, CYP 2C19, CYP 2C8, CYP 2C9, CYP 2D6, CYP 2E1 and CYP 3A4; the molecular structures are subjected to desalting and desolvation treatment to form a standard model input format; if a molecule is labeled, the label values of its atoms are converted into 0,1 label values, where 0 represents a non-metabolic site and 1 represents a metabolic site; the molecular structure data and the label values are stored in a file with a set format, the information in the file comprising: a. the type and coordinates of each atom in the molecule; b. the type of the bond connecting atoms; c. the label value indicating whether an atom in the molecule is a potential metabolic site.

4. The method for predicting the metabolic site of a small molecule CYP metabolizing enzyme according to claim 3, wherein the vector of an atom has a length of 78; if the number of protons of the atom is x, the x-th position of the corresponding vector is 1 and the remaining 77 positions are 0; the matrix M is a 100 x 78 matrix comprising 100 rows and 78 columns, and if the number of atoms in the molecule is less than 100, the matrix is padded with all-0 vectors; the matrix D is a 100 x 100 matrix comprising 100 rows and 100 columns.

5. The method for predicting metabolic sites of small molecule CYP metabolic enzyme according to any one of claims 1 to 4, wherein in the model construction step, the hyperparameters of the graph convolutional neural network model comprise: a graph convolution radius r of 3-5; the hyperparameters of the training process of the graph convolutional neural network model comprise: a mini_batch size of 32-128 and a learning rate of 0.0003-0.001; a sigmoid function is adopted as the activation function in the convolution process, and the binary cross entropy of whether an atom is a metabolic site is adopted as the loss function; updating the vertex information uses: the information of a central atom, the information of the adjacent chemical bonds within a set radius centered on the central atom, and the information of the adjacent atoms.

6. The method of predicting metabolic sites of a small molecule CYP metabolizing enzyme of claim 5, wherein the graph convolutional neural network model is a spatial-domain graph convolutional neural network model, the spatial-domain graph convolution comprising a message passing process and a state update process; the message passing process comprises: taking a node as the center and gathering together the information of the atoms around the node and the information of the chemical bonds; the state update process comprises: updating the information of the central node from the central node's own information together with the information gathered in the message passing process; the inputs of the spatial-domain graph neural network model are the atom type vectorization matrix M, which serves as the node information matrix, and the distance vectorization matrix D; the vectorized information of the j-th atom in the matrix M is denoted c_j, and each element D_ij of the distance vectorization matrix D records the distance between the i-th atom and the j-th atom,

functional form in the message passing process:

wherein W and b are parameters to be trained in the graph convolutional neural network, and v_ij is the influence of the i-th atom on the j-th atom, calculated with the j-th atom as the center, this influence being the information passed in message passing in the graph convolutional neural network;

the functional form in the state update process is expressed by the formula for updating the j-th node

The formula represents the state update of the central node: the information c_probe of the central node itself plus the surrounding information

7. A system for predicting a metabolic site of a small molecule CYP metabolizing enzyme, comprising:

predicted metabolic site module: predicting metabolic site atoms on different CYP metabolic enzyme subtypes, and taking out the atoms which are ranked at the top according to the probability of being metabolic sites on different CYP metabolic enzyme subtypes to predict the atoms as metabolic sites;

the atom type vectorization module is further configured to use the number of protons of an atom to determine its atom type and to convert each atom in the molecule into a row vector consisting of 0s and 1s; the vector length is set to n, and if the number of protons of the atom is x, the x-th position of the corresponding vector is 1 and the remaining n-1 positions are 0; all atoms in the molecule are vectorized into a matrix M according to the atom order, the matrix M being an m×n matrix, and if the number of atoms in the molecule is smaller than m, the matrix is padded with all-0 vectors; each element D(i, j) of the matrix D represents the distance between the i-th atom and the j-th atom, and all diagonal elements of the matrix D are 0;

the predicted metabolic site module is further configured to take out the three top-ranked atoms on each CYP metabolic enzyme subtype, ranked by the probability of being a metabolic site, and predict them as metabolic sites.

8. The system for predicting the metabolic site of a small molecule CYP metabolizing enzyme according to claim 7, wherein the sorting data module subjects the molecular structures to desalting and desolvation treatment to form a standard model input format; if a molecule is labeled, the label values of its atoms are converted into 0,1 tag values, where 0 represents a non-metabolic site and 1 represents a metabolic site; the molecular structure data and the tag values are stored in a file with a set format, the information in the file comprising: a. the type and coordinates of each atom in the molecule; b. the type of the bond connecting atoms; c. the tag value indicating whether an atom in the molecule is a potential metabolic site;

in the distance vectorization module, each element D(i, j) of the matrix D represents the distance between the i-th atom and the j-th atom, and all diagonal elements of the matrix D are 0;