WO2023123023A1 - Method, device and application for screening molecules - Google Patents


Info

Publication number
WO2023123023A1
Authority
WO
WIPO (PCT)
Prior art keywords
molecular
interaction
skeleton
classes
molecules
Prior art date
Application number
PCT/CN2021/142381
Other languages
English (en)
French (fr)
Inventor
胡建星
吴楚楠
徐旻
庞丽雪
Original Assignee
深圳晶泰科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳晶泰科技有限公司
Priority to PCT/CN2021/142381
Publication of WO2023123023A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60: In silico combinatorial chemistry
    • G16C20/64: Screening of libraries

Definitions

  • the present application relates to the technical field of computational simulation, in particular to a method, device and application for screening molecules.
  • molecular screening can be performed based on thresholds for preset indicators. However, this may cause some molecules that are helpful for subsequent development to be filtered out.
  • this application provides a method, device and application for screening molecules, which can reduce the probability of filtering out molecules that are helpful for subsequent development.
  • the first aspect of the present application provides a method for screening molecules.
  • the above method includes: obtaining a first mapping relationship between the simplified molecular linear formulas of M ligand molecules and N molecular structures, where the simplified molecular linear formulas of the M ligand molecules each carry structural information, and M and N are integers greater than or equal to 1; for each of at least some of the simplified molecular linear formulas of the M ligand molecules, performing skeleton extraction on the structural information of the ligand molecule to obtain O molecular skeletons, where O is an integer greater than or equal to 1 and O is less than or equal to M; aggregating the O molecular skeletons to obtain P molecular skeleton classes, where P is an integer greater than or equal to 1 and P is less than or equal to O; and determining a second mapping relationship between the P molecular skeleton classes and the N molecular structures based on the first mapping relationship, so as to screen ligand molecules matching the target receptor molecule based on the second mapping relationship.
  • a second aspect of the present application provides a method for evaluating molecules, the method comprising: obtaining a simplified molecular linear formula of the molecule to be screened; determining the skeleton of the molecule to be screened based on that simplified molecular linear formula; and evaluating the molecule to be screened based on its skeleton and the multiple mapping relationships determined by the above method, where the multiple mapping relationships include at least one of the first to sixth mapping relationships.
  • the third aspect of the present application provides a design method, which includes: displaying molecular screening results, the molecular screening results are the screening results obtained according to the above method; performing drug design or material design based on the molecular screening results.
  • the fourth aspect of the present application provides a device for screening molecules, including: a first mapping relationship obtaining module, configured to obtain the first mapping relationship between the simplified molecular linear formulas of M ligand molecules and N molecular structures, where the simplified molecular linear formulas of the M ligand molecules each carry structural information, and M and N are integers greater than or equal to 1;
  • a molecular skeleton extraction module, configured to, for each of at least some of the simplified molecular linear formulas of the M ligand molecules, perform skeleton extraction on the structural information of the ligand molecule to obtain O molecular skeletons, where O is an integer greater than or equal to 1 and O is less than or equal to M;
  • a molecular skeleton aggregation module, configured to aggregate the O molecular skeletons to obtain P molecular skeleton classes, where P is an integer greater than or equal to 1 and P is less than or equal to O;
  • a second mapping relationship determination module, configured to determine the second mapping relationship between the P molecular skeleton classes and the N molecular structures based on the first mapping relationship.
  • a fifth aspect of the present application provides a device for evaluating molecules.
  • the device includes: a simplified molecular linear formula obtaining module, used to obtain the simplified molecular linear formula of the molecule to be screened; a molecular skeleton acquisition module, used to determine the skeleton of the molecule to be screened based on its simplified molecular linear formula; and a molecular evaluation module, used to evaluate the molecule to be screened based on its skeleton and the multiple mapping relationships determined by the above device, where the multiple mapping relationships include at least one of the first to sixth mapping relationships.
  • the sixth aspect of the present application provides a design device, which includes: a screening result display module and a design module.
  • the screening result display module is used to display the molecular screening results, which are obtained according to the above device; the design module is used for drug design or material design based on the molecular screening results.
  • a seventh aspect of the present application provides an electronic device, including: a processor; and a memory, on which executable code is stored, and when the executable code is executed by the processor, the processor is made to execute the above method.
  • the eighth aspect of the present application also provides a computer-readable storage medium, on which executable codes are stored, and when the executable codes are executed by a processor of an electronic device, the processor is made to execute the above method.
  • the ninth aspect of the present application further provides a computer program product, including executable codes, and when the executable codes are executed by a processor, the foregoing method is implemented.
  • the method, device and application for screening molecules determine the skeleton of each ligand molecule based on its molecular structure and cluster the skeletons of multiple ligand molecules to obtain skeleton classes, so that a mapping relationship between skeleton classes and molecular structures can be constructed. This makes it possible to predict the molecular structure and other characteristics of a molecule to be screened from its skeleton, improves the accuracy and convenience of molecular screening, and assists in recommending reasonable molecules for the synthesis and testing stages.
  • the technical solution provided by the present application can further determine the mapping relationship between molecular structures and structural classes and/or interaction classes, so that users can perform molecular screening based on more dimensional mapping relationships.
  • the technical solution provided by the present application can also verify whether the interactions in the interaction class are stable based on the results of dynamic simulations, which is convenient for users to carry out molecular screening based on whether the interactions are stable.
  • Figure 1 schematically shows a schematic diagram of the process of screening molecules according to an embodiment of the present application
  • Fig. 2 schematically shows an exemplary system architecture in which the method, device and application for screening molecules can be applied according to an embodiment of the present application
  • Fig. 3 schematically shows a flow chart of a method for screening molecules according to an embodiment of the present application
  • Figures 4 to 6 schematically show a schematic diagram of the process of extracting a molecular skeleton according to an embodiment of the present application
  • Fig. 7 schematically shows a schematic diagram of a skeleton diagram according to an embodiment of the present application
  • FIG. 8A schematically shows a schematic structural diagram of a skeleton according to an embodiment of the present application.
  • Fig. 8B schematically shows a schematic structural diagram of another skeleton according to an embodiment of the present application.
  • Figure 9 schematically shows a flow chart of a molecular assessment method according to an embodiment of the present application.
  • Fig. 10 schematically shows a flow chart of a design method according to an embodiment of the present application.
  • Figure 11 schematically shows a block diagram of a device for screening molecules according to an embodiment of the present application
  • Fig. 12 schematically shows a block diagram of a device for evaluating molecules according to an embodiment of the present application
  • Fig. 13 schematically shows a block diagram of a design device according to an embodiment of the present application.
  • Fig. 14 schematically shows a block diagram of an electronic device for implementing a method for screening molecules according to an embodiment of the present application.
  • although the terms first, second, third and so on may be used in this application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another.
  • first information may also be called second information, and similarly, second information may also be called first information.
  • a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features.
  • “plurality” means two or more, unless otherwise specifically defined.
  • Molecular docking is a molecular simulation method that mainly analyzes the properties and interactions of receptors and ligands through electric field force, and then facilitates the prediction of the binding mode of receptors and ligands.
  • Molecular simulation refers to the use of theoretical methods and computer technology to simulate the structure and physical and chemical properties of molecules or molecular systems.
  • molecular structure-based virtual screening can be applied to the early stage of material development, such as in drug development.
  • the role of virtual screening is to screen out potential ligand molecules (such as drug molecules) that can bind to target receptor molecules (such as proteins) from a large-scale virtual molecular library (for example, more than 10^8 molecules).
  • molecular docking software includes, but is not limited to: AutoDock Vina, ICM, LeDock, rDock, UCSF DOCK, etc., and commercial software includes Glide, LigandFit, GOLD, MOE Dock, etc. It should be noted that screening based on molecular structure has become one of the early-stage paradigms of small molecule drug development in the related art, and the compound libraries available for screening include Enamine Real, Labnetworkx, etc.
  • the three-dimensional (3D) structure of the molecule to be screened is established, and energy optimization is performed. Then, hydrogen atoms are added to the 3D structure, and a force field and atomic charges of the molecules to be screened are added, respectively. Next, probes with preset radii generate template molecular surfaces of target receptor molecules. Then, a plausible binding site on the molecular surface of the receptor molecule of interest is determined. Energy scoring and evaluation are performed for each reasonable binding site.
  • semi-flexible docking can be used to generate a specified number of different conformational orientations and to obtain the electrostatic and van der Waals interactions between the molecule to be screened and the binding site, from which the score of the molecule to be screened is calculated.
  • the score is compared with a preset threshold to determine whether the molecule to be screened is retained for the subsequent development process.
  • AUC: area under the curve
  • ROC: receiver operating characteristic curve
  • the docking-based algorithm in the related art can also perform molecular screening on a large-scale compound library by selecting a specific docking score threshold, but the accuracy (such as AUC) of its overall screening of positive molecules often depends on the accuracy of the docking algorithm and on the threshold selected by the user.
  • In early drug development applications, this screening method may filter out seed compounds whose docking score is not high (below the threshold selected by the user, or because the docking algorithm is not accurate) but whose skeleton has potential room for optimization.
  • evaluation algorithms or software such as Quantitative structure-activity relationship (QSAR) can be used to evaluate drug molecules, and some factors related to the binding of molecules to be screened and proteins can be considered.
  • the evaluation results are not directly related to the three-dimensional structure of the protein pocket occupied by the small molecule, and it is difficult to establish a hypothetical relationship based on the small molecule structure and binding mode, so it has a limited role in assisting the recommended synthesis and testing.
  • This application aims to provide a molecular evaluation method oriented to the virtual screening process. Information on successfully docked molecules is aggregated based on at least one of the molecular structure, skeleton and binding mode, so as to assist in recommending reasonable molecules to enter the synthesis and testing stages.
  • the technical solution of the present application can adapt to the increase in the scale of positive compounds, and at the same time provide a molecular screening method more in line with drug development experience.
  • the technical solution of the present application can hierarchically project the complex, high-dimensional spatial structure information of small molecule-protein complexes onto the skeleton, shape and binding mode, and can use these as a basis for selecting representative molecules through a clustering algorithm.
  • medicinal chemists can tune the clustering and screening algorithms of the method to select an appropriate number of representative molecules according to objective factors such as the stage of the specific project, risk and budget, which provides recommendation and decision support for the transition from large-scale virtual screening to pipeline molecules. It should be noted that the technical solution of the present application is applicable to the input (molecular library) regardless of the size of the molecular library.
  • the embodiment of the present application moves beyond screening molecules with the score as the only reference value, returns to structure-based rational design, and selects molecules with advantageous skeletons from the perspective of structural rationality for downstream molecular optimization.
  • Fig. 1 schematically shows a schematic diagram of the process of screening molecules according to an embodiment of the present application.
  • At least one of the skeleton class, structure class and interaction class of the successfully docked ligand molecules is associated with the molecular structure and the simplified molecular linear formula, so that users can use this knowledge to screen the molecules to be screened, which improves the accuracy of molecular screening.
  • Fig. 2 schematically shows an exemplary system architecture in which the method, device and application for screening molecules can be applied according to an embodiment of the present application.
  • Figure 2 is only an example of a system architecture to which the embodiment of the present application can be applied, intended to help those skilled in the art understand the technical content of the present application; it does not mean that the embodiment of the present application cannot be used in other devices, systems, environments or scenarios.
  • a system architecture 200 may include terminal devices 201 , 202 , 203 , a network 204 and a server 205 .
  • the network 204 is used as a medium for providing communication links between the terminal devices 201 , 202 , 203 and the server 205 .
  • Network 204 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
  • Users can use terminal devices 201, 202, 203 to interact with other terminal devices and the server 205 through the network 204 to receive or send information, such as sending requests for three-dimensional molecular structures or requests to screen molecules, and receiving screening results, three-dimensional molecular structures, etc.
  • the terminal devices 201, 202, and 203 can be installed with various communication client applications, such as web browser applications, drug development applications, database applications, search applications, instant messaging tools, email clients, social platform software and so on.
  • Terminal devices 201, 202, and 203 include, but are not limited to, electronic devices such as smart desktop computers, tablet computers, and laptop computers that can support functions such as surfing the Internet and displaying images.
  • the server 205 can receive a request for a three-dimensional molecular structure of a molecule, etc., and send information about the three-dimensional molecular structure of a molecule, etc. to the terminal devices 201 , 202 , 203 .
  • the server 205 may be a background management server, a server cluster, and the like.
  • the numbers of terminal devices, networks and servers shown are only illustrative. According to implementation needs, there can be any number of terminal devices, networks and cloud servers.
  • Fig. 3 schematically shows a flowchart of a method for screening molecules according to an embodiment of the present application.
  • this embodiment provides a method for screening molecules, the method includes operation S310 to operation S340, specifically as follows:
  • the ligand molecule is a molecule corresponding to the receptor molecule.
  • the small molecule can be called a ligand molecule.
  • the structural information may be structural information contained in the simplified linear formula of the molecule.
  • Molecular docking is a method for drug design through the characteristics of receptor molecules (such as protein molecules) and the interaction mode between receptor molecules and ligand molecules (such as drug molecules). Molecular docking studies the interactions between molecules (such as between ligand molecules and receptor molecules) and predicts their binding modes and affinities.
  • the recognition relationship can depend on the spatial matching and energy matching of the two.
  • the RNA of the virus depends on a certain RNA polymerase protein, and it has been confirmed that a specific part of the RNA polymerase protein is the target of small molecule drugs.
  • Molecular docking can be used to infer the binding activity of multiple small molecules to this target, so as to predict whether these small molecules have the potential to become drug candidates.
  • Simplified molecular linear formulas enable the representation of molecular structures in text.
  • the simplified molecular linear formula can conform to Simplified Molecular-Input Line-Entry System (SMILES for short).
  • SMILES represents a molecule by encoding its structure as text (an encoded string), so text-oriented techniques such as natural language processing (NLP) can be applied to it.
  • a simplified molecular linear formula can have multiple corresponding molecular structures.
  • the first mapping relationship may be as follows.
  • the simplified molecular linear formula corresponds to the smiles column, and the molecular structure corresponds to the sdf_index column.
  • the first mapping relationship can be expressed as: sdf_index,smiles.
  • in each record, the sdf_index column is before the "," and the smiles column is after the ",".
  • the molecular structures of a series of compound molecules are obtained from the virtual screening process and stored in a file in the .sdf format (in which only the structures of the ligand small molecules may be stored).
  • a .csv file is also provided, which stores the mapping relationship between the SMILES formula of each molecule and its corresponding molecular structure file (.sdf), that is, the first mapping relationship.
  • the SMILES format of a molecule can correspond to multiple molecular structures.
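  • As a minimal sketch of how such a first mapping relationship could be loaded, the Python snippet below reads records in the sdf_index,smiles layout described above. The header row, the sample records and the function name load_first_mapping are assumptions made for illustration; the application does not show the file contents.

```python
import csv
import io
from collections import defaultdict

# Hypothetical file content (assumption): a header row, then one record per
# docked structure. Two structures share the same SMILES formula.
EXAMPLE_CSV = """sdf_index,smiles
0,CCO
1,CCO
2,c1ccccc1O
"""


def load_first_mapping(text):
    """Return {smiles: [sdf_index, ...]} from CSV text with sdf_index,smiles columns."""
    mapping = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        # One SMILES formula may correspond to several molecular structures.
        mapping[row["smiles"]].append(int(row["sdf_index"]))
    return dict(mapping)


first_mapping = load_first_mapping(EXAMPLE_CSV)
print(first_mapping)  # {'CCO': [0, 1], 'c1ccccc1O': [2]}
```

  • Keying the dictionary by SMILES formula keeps the one-to-many relationship explicit: here M = 2 formulas map to N = 3 structures.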
  • skeleton extraction is performed on the structural information of the simplified molecular linear formulas to obtain O molecular skeletons, where O is an integer greater than or equal to 1, and O is less than or equal to M.
  • various skeleton extraction methods can be used to extract the skeleton from the simplified molecular linear formula.
  • a skeleton extraction algorithm is performed on the SMILES formula of each molecule to obtain the Bemis-Murcko skeleton of each molecule.
  • FIG. 4 to FIG. 6 schematically illustrate the process of extracting a molecular skeleton according to an embodiment of the present application.
  • the SMILES formula 1 can be converted into the molecular structure diagram shown in Fig. 4 .
  • the conversion method between the SMILES formula and the molecular structure diagram can adopt many related technologies, which will not be described in detail here.
  • Referring to FIG. 5, the difference from FIG. 4 is that the nitrogen atoms (N) and oxygen atoms (O) are replaced, for example by carbon atoms (C) or hydrogen atoms (H).
  • In FIG. 4, the element symbols for carbon atoms (C) and hydrogen atoms (H) are omitted.
  • the remaining element symbols in FIG. 4 can also be deleted. Then, the double bonds in FIG. 4 are replaced with single bonds.
  • Referring to FIG. 6, the difference from FIG. 5 is that the branch chains or dangling bonds are removed.
  • the skeleton corresponding to the SMILES formula can be obtained.
  • the skeleton corresponding to the SMILES formula can be obtained through the Bemis-Murcko framework.
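  • The steps of FIG. 4 to FIG. 6 can be sketched on a plain molecular graph as follows. This is an illustrative pure-Python sketch, not the applicant's implementation: it assumes the molecule has already been parsed into atoms and bonds, it turns every heavy atom into carbon (one of the replacements mentioned above), and the function name extract_skeleton and the example molecule are hypothetical.

```python
def extract_skeleton(atoms, bonds):
    """Return (atoms, bonds) of a generic ring-and-linker skeleton.

    atoms: {atom_id: element symbol}; bonds: {(i, j): bond order}.
    """
    # FIG. 5 step (assumed form): every heavy atom becomes carbon and every
    # bond becomes single, so only the connectivity matters from here on.
    heavy = {a for a, element in atoms.items() if element != "H"}
    neighbours = {a: set() for a in heavy}
    for i, j in bonds:
        if i in heavy and j in heavy:
            neighbours[i].add(j)
            neighbours[j].add(i)

    # FIG. 6 step: repeatedly strip atoms with at most one neighbour, which
    # removes branch chains and dangling bonds but keeps rings and linkers.
    leaves = [a for a, nb in neighbours.items() if len(nb) <= 1]
    while leaves:
        atom = leaves.pop()
        if atom not in neighbours:
            continue  # already removed
        for other in neighbours.pop(atom):
            neighbours[other].discard(atom)
            if len(neighbours[other]) <= 1:
                leaves.append(other)

    kept_atoms = {a: "C" for a in neighbours}
    kept_bonds = {tuple(sorted((a, b))): 1 for a in neighbours for b in neighbours[a]}
    return kept_atoms, kept_bonds


# Hypothetical molecule: a six-membered aromatic ring (atoms 0-5) carrying a
# hydroxyl oxygen (6) and an aminomethyl branch (7, 8).
atoms = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "O", 7: "C", 8: "N"}
bonds = {(0, 1): 2, (1, 2): 1, (2, 3): 2, (3, 4): 1, (4, 5): 2, (0, 5): 1,
         (0, 6): 1, (3, 7): 1, (7, 8): 1}
kept_atoms, kept_bonds = extract_skeleton(atoms, bonds)
print(sorted(kept_atoms))  # [0, 1, 2, 3, 4, 5]
```

  • A molecule without any ring is pruned to an empty graph by this procedure, which matches the usual convention that acyclic molecules have no ring-based skeleton.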
  • the O molecular skeletons are aggregated to obtain P molecular skeleton classes, where P is an integer greater than or equal to 1, and P is less than or equal to O.
  • multiple skeleton classes can be obtained through the aggregation operation on multiple molecular skeletons. For example, group identical skeletons into one class.
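  • Grouping identical skeletons into one class can be sketched with a dictionary keyed by the skeleton. The snippet below is only an illustration: it assumes each skeleton is available as a canonical string, and the function name aggregate_skeletons, the class numbering and the sample molecules are hypothetical.

```python
from collections import defaultdict


def aggregate_skeletons(skeleton_by_smiles):
    """Group ligand SMILES formulas whose skeleton strings are identical.

    Returns {class_id: {"skeleton": str, "members": [smiles, ...]}}.
    """
    members = defaultdict(list)
    for smiles, skeleton in skeleton_by_smiles.items():
        members[skeleton].append(smiles)
    # Class ids are assigned in sorted skeleton order purely for repeatability.
    return {
        class_id: {"skeleton": skeleton, "members": sorted(group)}
        for class_id, (skeleton, group) in enumerate(sorted(members.items()), start=1)
    }


# Hypothetical input: three ligand molecules (O = 3 skeletons) written with
# generic, all-carbon, single-bond skeleton strings.
skeleton_by_smiles = {
    "Oc1ccccc1": "C1CCCCC1",
    "Cc1ccccc1": "C1CCCCC1",
    "c1ccc2ccccc2c1": "C1CCC2CCCCC2C1",
}
skeleton_classes = aggregate_skeletons(skeleton_by_smiles)
print(len(skeleton_classes))  # 2, so P = 2 <= O = 3
```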
  • the above method may also include the following operation: performing homogeneous merging on the P molecular skeleton classes to obtain a multi-level molecular skeleton set, wherein, in the multi-level molecular skeleton set, a parent molecular skeleton corresponds to at least one child molecular skeleton, a bottom-level molecular skeleton corresponds to at least one ligand molecule, and the skeleton structure of a child molecular skeleton is more complex than that of its parent molecular skeleton.
  • At least some of the ligand molecules can be integrated into one table through homogeneous merging, making the parent-child relationship between various skeletons clearer, so that users can select potentially useful skeletons or molecules based on the parent-child relationship.
  • the above method may also include the following operation: generating a skeleton graph, wherein the skeleton graph includes a plurality of nodes, the non-terminal nodes among the plurality of nodes represent at least some of the molecular skeletons in the multi-level molecular skeleton set, each terminal node represents the cluster of molecules, among the M ligand molecules, that contain the skeleton corresponding to that terminal node, and each parent node corresponds to at least one child node.
  • Fig. 7 schematically shows a schematic diagram of a skeleton diagram according to an embodiment of the present application.
  • each node in FIG. 7 can represent a skeleton class
  • the parent node corresponds to at least one child node
  • the bottom node can correspond to a specific molecular formula of a ligand molecule.
  • the two nodes framed by the two dotted circles on the left in Fig. 7 are the molecular formulas of two ligand molecules. These two ligand molecules have the same skeleton class. However, the skeleton class of the node framed by the upper dotted circle and the skeleton class of the node framed by the lower dotted circle are in a parent-child relationship.
  • each terminal leaf node represents a class of molecular clusters with the same terminal backbone.
  • Skeleton numbering based on skeleton clustering is obtained by numbering each node.
  • the root node is 1, and each node is coded according to Arabic numerals in order of graph traversal, such as 1, 2, 3, 4, etc.
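  • The numbering scheme can be sketched as a graph traversal that hands out consecutive integers starting from 1 at the root. The text does not say which traversal order is used, so breadth-first order is an assumption here, as are the function name number_nodes and the placeholder node labels; the sketch also assumes the skeleton graph is a tree.

```python
from collections import deque


def number_nodes(children, root):
    """Number the nodes of a skeleton tree 1, 2, 3, ... in breadth-first order."""
    numbering = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        numbering[node] = len(numbering) + 1
        queue.extend(children.get(node, []))
    return numbering


# Hypothetical skeleton tree: a root skeleton with two child skeletons, one of
# which has a more complex child skeleton of its own.
children = {
    "root": ["child_a", "child_b"],
    "child_a": ["grandchild_a1"],
}
numbering = number_nodes(children, "root")
print(numbering)  # {'root': 1, 'child_a': 2, 'child_b': 3, 'grandchild_a1': 4}
```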
  • each node in FIG. 7 may have a fill color or a fill pattern.
  • the shade of the fill color can indicate the average score of all molecules contained in the current skeleton node. For example, if the score represents activity, the node with darker color means that the molecule contained in its backbone has a higher activity value, that is, the matching degree between the ligand molecule corresponding to the node and the target receptor molecule is higher. In this way, it is convenient for users to intuitively see from the skeleton diagram: the molecule containing which skeleton has higher activity and has a higher probability of becoming a ligand molecule corresponding to the target receptor molecule.
  • FIG. 8A schematically shows a schematic structural diagram of a skeleton according to an embodiment of the present application.
  • FIG. 8B schematically shows a schematic structural diagram of another skeleton according to an embodiment of the present application.
  • FIG. 8A is a skeleton for a ligand molecule identified as MOL0436.
  • Fig. 8B is the skeleton for the ligand molecule identified as MOL0049, the two skeletons have the same middle part framed by the dotted circle, and the two skeletons have the same parent skeleton.
  • a second mapping relationship between the P molecular skeleton classes and the N molecular structures is determined based on the first mapping relationship, so that, based on the second mapping relationship, molecules that match the target receptor molecule can be screened from the set of simplified molecular linear formulas of the M ligand molecules.
  • constructing the second mapping relationship makes it convenient for the user to perform molecular screening based at least on the second mapping relationship.
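  • One way to read the derivation of the second mapping relationship is as a join: each SMILES formula is replaced by its skeleton class, and the structures it maps to are carried over. The sketch below assumes dictionary-based mappings; the function name build_second_mapping and the sample data are hypothetical.

```python
def build_second_mapping(first_mapping, class_by_smiles):
    """Derive {skeleton_class_id: [sdf_index, ...]} from the first mapping.

    first_mapping: {smiles: [sdf_index, ...]}
    class_by_smiles: {smiles: skeleton_class_id}
    """
    second = {}
    for smiles, sdf_indices in first_mapping.items():
        class_id = class_by_smiles.get(smiles)
        if class_id is None:
            continue  # skeleton extraction may cover only some of the molecules
        second.setdefault(class_id, []).extend(sdf_indices)
    return {class_id: sorted(indices) for class_id, indices in second.items()}


# Hypothetical data: three formulas, four docked structures, two skeleton classes.
first_mapping = {"Oc1ccccc1": [0, 1], "Cc1ccccc1": [2], "c1ccc2ccccc2c1": [3]}
class_by_smiles = {"Oc1ccccc1": 1, "Cc1ccccc1": 1, "c1ccc2ccccc2c1": 2}
second_mapping = build_second_mapping(first_mapping, class_by_smiles)
print(second_mapping)  # {1: [0, 1, 2], 2: [3]}
```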
  • the above method may further include the following operations.
  • the respective molecular structures corresponding to at least part of the ligand molecules corresponding to the skeleton classes are acquired. Referring to Fig. 7, all molecular structures corresponding to a certain node can be obtained.
  • the volume difference between at least some of the ligand molecules is determined based on the corresponding molecular structures of at least some of the ligand molecules.
  • ligand molecules corresponding to the skeleton class are clustered based on the volume difference to obtain multiple structural classes. For example, molecules with small volume differences are grouped into the same class.
  • a third mapping relationship between the plurality of structural classes and molecular structures is determined based on the first mapping relationship.
  • determining the volume difference between at least some of the ligand molecules based on the corresponding molecular structures of at least some of the ligand molecules may include the following operations.
  • the pocket region of the target receptor molecule is meshed.
  • the proportion of the molecular structures corresponding to at least part of the ligand molecules occupying the grid is determined.
  • grid space occupancy vectors of molecular structures corresponding to at least part of the ligand molecules are constructed based on the occupancy ratios.
  • the volume difference between at least some of the ligand molecules is determined based on the grid space occupancy vectors of the corresponding molecular structures of at least some of the ligand molecules.
  • determining the volume difference between at least some of the ligand molecules based on the grid space occupancy vectors of the corresponding molecular structures of at least some of the ligand molecules may include the following operations.
  • the distance between the grid space occupation vectors of the respective molecular structures corresponding to the two ligand molecules is determined.
  • the volume difference of the corresponding molecular structure occupancy spaces of the two ligand molecules is determined.
  • the three-dimensional molecular structures of all molecules in a skeleton class are extracted, for example from a .sdf file. Then, the difference in space-occupied volume between the three-dimensional molecular structures of each pair of molecules is calculated. Since the molecular structures come from docked three-dimensional structures, there is no need to perform structure-based alignment and translation operations here. After the space-occupied volume difference between each pair of molecules is obtained and used as the distance between them, cluster analysis based on three-dimensional shape differences can be carried out. Each class represents molecules that occupy a similar volume of space.
  • the space-occupied volume difference between the three-dimensional molecular structures of two molecules can be determined as follows: the pocket region of the molecular structure of the receptor molecule (such as the protein structure) is divided into an equidistant grid, the grid space occupancy vector of each molecular structure is constructed, and the space-occupied volume difference between the two molecules is then obtained by calculating the distance between the two grid space occupancy vectors (such as the Tanimoto distance or the Euclidean distance).
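  • The grid-based comparison can be sketched as follows. This is a simplified illustration under stated assumptions: occupancy is binary (a cell counts as occupied when its centre lies within a fixed radius of any atom) rather than a fractional occupancy ratio, the pocket is a cube of n_cells per side, and the function names, radius and coordinates are hypothetical.

```python
import itertools
import math


def occupancy_vector(atom_coords, origin, n_cells, spacing, radius=1.7):
    """Binary grid space occupancy vector over an n_cells^3 equidistant grid."""
    vector = []
    for index in itertools.product(range(n_cells), repeat=3):
        centre = tuple(o + (i + 0.5) * spacing for o, i in zip(origin, index))
        occupied = any(math.dist(centre, atom) <= radius for atom in atom_coords)
        vector.append(1 if occupied else 0)
    return vector


def tanimoto_distance(u, v):
    """1 - |u AND v| / |u OR v| for two binary occupancy vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


# Hypothetical docked poses inside a 4 x 4 x 4 grid with 1.0 spacing.
grid = dict(origin=(0.0, 0.0, 0.0), n_cells=4, spacing=1.0, radius=0.6)
pose_a = occupancy_vector([(0.5, 0.5, 0.5)], **grid)
pose_b = occupancy_vector([(3.5, 3.5, 3.5)], **grid)
pose_c = occupancy_vector([(0.5, 0.5, 0.5), (3.5, 3.5, 3.5)], **grid)
print(tanimoto_distance(pose_a, pose_b))  # 1.0 (no shared cells)
print(tanimoto_distance(pose_a, pose_c))  # 0.5 (one shared cell out of two)
```

  • Because the poses come from docking into the same pocket, all vectors share one grid and can be compared directly without any alignment step.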
  • the ligand molecule can be correctly combined in the protein pocket.
  • the ligand molecule should be complementary to the pocket in shape and electrostatic distribution.
  • the degree of structural matching between the molecule and the target receptor molecule can be determined in the above manner.
  • the above method may further include the following operations to further analyze the mapping relationship between structural classes and/or skeleton classes and molecular structures.
  • for each of at least some of the plurality of structural classes, first interaction features between the target receptor molecule and each of at least part of the molecular structures corresponding to the structural class are obtained, and/or, for each of at least some skeleton classes in the P molecular skeleton classes, second interaction features between the target receptor molecule and each of at least part of the molecular structures corresponding to the skeleton class are obtained.
  • cluster the first interaction differences to obtain multiple first interaction classes and/or perform clustering on the second interaction differences to obtain multiple second interaction classes.
  • a fourth mapping relationship between the plurality of first interaction classes and the molecular structure is determined based at least on the first mapping relationship, and/or a fifth mapping relationship between the plurality of second interaction classes and the molecular structure is determined based on the first mapping relationship.
  • the molecule-protein binding mode fingerprint is a structure-based code used to characterize the types of interaction between a small molecule and a protein molecule.
  • the interaction fingerprint includes, but is not limited to, at least one of: the interaction type, the atom serial number of the interaction site on the molecule, and the interacting amino acid residue site of the protein.
  • the above method may also include the following operations.
  • the second interaction characteristics between at least part of the ligand molecules corresponding to the skeleton classes and target receptor molecules are acquired.
  • Second interaction differences between pairs of at least some of the respective second interaction features of the ligand molecules are then determined.
  • clustering is performed on the second interaction differences to obtain a plurality of second interaction classes.
  • a fifth mapping relationship between the plurality of second interaction classes and the molecular structure is determined based at least on the first mapping relationship.
  • determining the second interaction difference between at least some of the respective second interaction features of the ligand molecules may include the following operations.
  • an interaction feature vector corresponding to the second interaction feature is determined.
  • an interaction difference between any two of at least some of the respective second interaction features of the ligand molecules is determined. For example, the distance between the interaction feature vectors corresponding to the second interaction features is determined first, and the second interaction difference of the two ligand molecules is then determined based on that distance.
  • the encoded information of the molecule-protein binding mode fingerprint includes the type of interaction (such as hydrogen bond donor and acceptor, π-π interaction, etc.), the atom serial number of the interaction site on the molecule, and the interacting amino acid residue site of the protein.
  • This information enables rapid identification of structure-based interactions between molecules and protein molecules.
  • Each molecule can form multiple such interactions with protein molecules, and multiple interaction fingerprints can be extracted for each molecule from its docked molecular structure (an interaction fingerprint is the vectorized interaction feature of a molecule, giving a 1×n-dimensional vector, and the distance between interaction fingerprints can be calculated by methods such as Tanimoto distance).
  • the fingerprint feature vector of the molecular structure can be constructed. For example, fingerprints can be extracted from the three-dimensional molecular structure information of all molecules under a given skeleton class and/or structural class, the distance between every two fingerprints can be calculated, and cluster analysis based on the fingerprints can then be performed. Molecules in the same class should have similar skeletons, shapes and/or binding modes. Fingerprint clustering can be viewed as a type of unsupervised clustering.
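  • The text does not name a particular clustering algorithm. The sketch below assumes, purely for illustration, a single-linkage grouping with a distance cutoff applied to a precomputed symmetric fingerprint distance matrix. The function name `threshold_clusters` and the `cutoff` value are hypothetical, and any other unsupervised clustering method could be substituted.

```python
def threshold_clusters(dist, cutoff):
    """Group item indices so that any pair at distance <= cutoff, directly or
    through a chain of such pairs, lands in the same cluster."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= cutoff:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy distance matrix (made-up values): items 0 and 1 are close, 2 is far.
toy_dist = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
toy_clusters = threshold_clusters(toy_dist, cutoff=0.3)
```

  • The same routine applies unchanged to shape distances or to interaction fingerprint distances, since it only consumes the distance matrix.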
  • the stability characteristics of the interactions can be further analyzed.
  • the above method may further include the following operations.
  • the representative molecule of the current class is determined.
  • the representative molecule may be a molecule corresponding to a class center of a certain class, or the like.
  • Stability characteristics may include: stable and unstable.
  • a sixth mapping relationship between the stability feature and the first interaction class or the second interaction class is determined based on the stability feature of the representative molecule.
  • representative molecules in each class can be determined for molecular dynamics simulation. The purpose is to verify whether the interaction remains stable in the results of the dynamics simulation. If an interaction is unstable in the dynamics simulation, it should be marked in the final result.
  • the trajectory files obtained by sampling can be collected by performing a 50 ns dynamics simulation on the complex structure of the successfully docked ligand molecule and the receptor molecule. Based on analysis of the trajectory file, it is checked whether the interaction between the ligand molecule and the protein molecule can be continuously observed in the sampled steady state. If it can be continuously observed, the interaction formed by the representative molecule is still stable and observable under simulation.
  • the representative molecule can be the cluster center molecule, and the cluster center is the point with the most balanced distance between an object and other objects in the class. There can be only one cluster center, and it can be directly obtained by a clustering algorithm.
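  • One common reading of "the point with the most balanced distance to the other objects in the class" is the medoid, that is, the member whose total distance to all other members is smallest. The sketch below makes that assumption explicit; `cluster_center` is a hypothetical name and the distance values are made up.

```python
def cluster_center(members, dist):
    """Return the member index whose summed distance to all members is smallest.

    `members` lists the indices belonging to one cluster and `dist` is the
    full symmetric distance matrix. Ties resolve to the first such member.
    """
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


# Toy distance matrix: item 1 sits between items 0 and 2.
toy_dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
representative = cluster_center([0, 1, 2], toy_dist)
```

  • Here the summed distances are 5, 3 and 6, so item 1 is selected as the representative molecule.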
  • the above-mentioned method may further include the following operations.
  • the simplified molecular linear formula of the ligand molecule, the molecular structure and at least one of the following: molecular skeleton class, structure class, first interaction class or second interaction class are stored in association to obtain a mapping table.
  • the data used has 32k mapping relationships, and a total of 4k different simplified molecular linear formulas are recorded.
  • the .csv relationship file can be loaded through the Python Pandas library, and all data in the "smiles" column of the table can be obtained, that is, all simplified molecular linear formulas (SMILES). The Bemis-Murcko skeleton of each simplified molecular linear formula is then extracted as described above, and identical skeletons are merged.
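  • The loading and merging step can be sketched as follows. To keep the example self-contained it uses the standard-library `csv` module instead of Pandas, and it takes the skeleton extractor as a callable argument, since Bemis-Murcko extraction itself would come from a cheminformatics toolkit. The function name `group_by_skeleton`, the "structure" column and the stand-in extractor are hypothetical; only the "smiles" column name comes from the text.

```python
import csv
import io
from collections import defaultdict


def group_by_skeleton(csv_text, skeleton_of):
    """Map each skeleton to the sorted unique SMILES strings that share it."""
    rows = csv.DictReader(io.StringIO(csv_text))
    unique_smiles = {row["smiles"] for row in rows}  # drop repeated SMILES
    groups = defaultdict(list)
    for smi in sorted(unique_smiles):
        groups[skeleton_of(smi)].append(smi)  # identical skeletons merge here
    return dict(groups)


def stand_in_skeleton(smiles):
    """Placeholder only, NOT a real Bemis-Murcko extraction."""
    return "c1ccccc1" if "c1ccccc1" in smiles else ""


# Made-up relationship table: one SMILES may map to several structures.
toy_csv = (
    "smiles,structure\n"
    "Cc1ccccc1,a.sdf\n"
    "Cc1ccccc1,b.sdf\n"
    "Oc1ccccc1,c.sdf\n"
    "CCO,d.sdf\n"
)
toy_groups = group_by_skeleton(toy_csv, stand_in_skeleton)
```

  • With a real extractor passed as `skeleton_of`, the keys of the returned dictionary are the merged skeletons and the values are the ligand molecules under each skeleton.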
  • relevant node information can be represented and stored in the form of a skeleton diagram for visualization.
  • the Python Networkx library can be used to store the skeleton graph and draw the node graph as shown in Figure 7.
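  • The structure of such a skeleton graph can be illustrated without any graph library. The sketch below stores the graph as a plain parent-to-children dictionary (a stand-in for the graph object, chosen only to keep the example dependency-free) and collects the end nodes under a given skeleton. The node labels and the function name `terminal_nodes` are hypothetical, and the graph is assumed to be acyclic.

```python
def terminal_nodes(children, root):
    """Collect the end nodes reachable from `root` in a parent -> children map."""
    stack, leaves = [root], []
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            stack.extend(kids)   # non-terminal node: descend further
        else:
            leaves.append(node)  # end node: a molecular cluster
    return sorted(leaves)


# Made-up hierarchy: simpler parent skeletons point to more complex children,
# and end nodes stand for clusters of ligand molecules.
toy_graph = {
    "skeleton_A": ["skeleton_A1", "skeleton_A2"],
    "skeleton_A1": ["cluster_1"],
    "skeleton_A2": ["cluster_2", "cluster_3"],
}
toy_leaves = terminal_nodes(toy_graph, "skeleton_A")
```

  • The same dictionary can be converted into a directed graph object for drawing, with one edge per parent-child pair.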
  • Clustering analysis based on interaction fingerprint is similar to structural clustering analysis.
  • the .sdf file of all molecular structures under each skeleton class needs to be obtained.
  • the protein structure file .pdb used for docking needs to be obtained.
  • each number in square brackets represents a one-dimensional feature.
  • the value of the feature can be obtained by one-hot encoding. For example,
  • the encoded information includes the type of interaction (such as hydrogen bond donor/acceptor, π-π interaction, etc.), the atom serial number of the interaction site of the small molecule, and the interaction site of the amino acid residue of the protein.
  • the corresponding interaction code is [CYS260_HB_Acceptor, ...].
  • each small molecule will form multiple interactions with the protein, that is, each small molecule has an interaction list as above, such as [CYS260_HB_Acceptor, ...].
  • the interaction features are encoded, and after the feature vectors are obtained, clustering based on the distance between fingerprints can be performed through a clustering algorithm.
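  • The encoding and distance steps can be sketched as below, assuming each molecule has already been reduced to a list of interaction labels. Only the label `CYS260_HB_Acceptor` appears in the text; the other labels, and the function names `encode_fingerprints`, `tanimoto_distance` and `distance_matrix`, are hypothetical.

```python
def encode_fingerprints(interaction_lists):
    """One-hot encode interaction labels; returns (vocabulary, vectors)."""
    vocab = sorted({label for labels in interaction_lists for label in labels})
    index = {label: i for i, label in enumerate(vocab)}
    vectors = []
    for labels in interaction_lists:
        vec = [0] * len(vocab)      # one dimension per known interaction
        for label in labels:
            vec[index[label]] = 1
        vectors.append(vec)
    return vocab, vectors


def tanimoto_distance(u, v):
    """1 - |shared bits| / |bits set in either vector| (0.0 if both are empty)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


def distance_matrix(vectors):
    """Pairwise Tanimoto distances between all fingerprint vectors."""
    return [[tanimoto_distance(u, v) for v in vectors] for u in vectors]


# Made-up interaction lists for three docked molecules.
toy_lists = [
    ["CYS260_HB_Acceptor", "HIS41_PiPi"],
    ["CYS260_HB_Acceptor"],
    ["GLU166_HB_Donor"],
]
toy_vocab, toy_vectors = encode_fingerprints(toy_lists)
toy_distances = distance_matrix(toy_vectors)
```

  • The resulting matrix is the input a clustering algorithm needs: molecules 0 and 1 share one of two interactions (distance 0.5), while molecule 2 shares none with either (distance 1.0).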
  • for each representative molecule (such as the cluster center) in each fingerprint cluster, its small molecule structure is extracted. Several frames are sampled from the simulated trajectory to identify the interaction binding modes, which are used to verify the validity of the interaction fingerprint. If the interaction shown by the fingerprint is still stable in the simulation, it is marked as 1; if it is unstable, it is marked as 0. The mark is recorded in the input .csv relationship table, in a column that can be named "ifp_valid".
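  • The 1/0 marking can be sketched as follows. The text only requires that the interaction be "continuously observed"; the sketch assumes a concrete criterion, namely that every interaction in the fingerprint appears in at least a fraction `min_fraction` of the sampled frames. That criterion, its default of 0.8 and the function name are illustrative assumptions, and each frame is assumed to have been reduced to a set of interaction labels beforehand.

```python
def ifp_valid(expected, frames, min_fraction=0.8):
    """Return 1 if every expected interaction is seen in at least
    `min_fraction` of the sampled frames, otherwise 0."""
    if not frames:
        return 0  # nothing sampled, so stability cannot be confirmed
    for label in expected:
        seen = sum(1 for frame in frames if label in frame)
        if seen / len(frames) < min_fraction:
            return 0
    return 1


# Made-up frame data for one representative molecule.
stable_frames = [{"CYS260_HB_Acceptor"}] * 5
unstable_frames = [{"CYS260_HB_Acceptor"}] * 2 + [set()] * 3
stable_flag = ifp_valid(["CYS260_HB_Acceptor"], stable_frames)
unstable_flag = ifp_valid(["CYS260_HB_Acceptor"], unstable_frames)
```

  • The returned value is what would be written to the "ifp_valid" column for the corresponding interaction class.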
  • Another aspect of the present application also provides a method of evaluating a molecule.
  • Fig. 9 schematically shows a flowchart of a molecular evaluation method according to an embodiment of the present application.
  • the user can use the mapping table in the following manner.
  • the above method may further include operation S910 to operation S930.
  • a simplified molecular linear formula of the molecule to be screened is obtained.
  • the user can input the simplified molecular linear formula on the terminal device, and the terminal device can also send the simplified molecular linear formula to the cloud.
  • the skeleton of the molecule to be screened is determined based on the simplified molecular linear formula of the molecule to be screened.
  • the skeleton corresponding to the simplified molecular linear formula can be generated locally or in the cloud.
  • the molecule to be screened is evaluated based on the skeleton of the molecule to be screened and the various mapping relationships determined by the above method.
  • various mapping relationships may be stored in the mapping table, including but not limited to: at least one of the first mapping relationship to the sixth mapping relationship.
  • the entry corresponding to the molecule to be screened in the mapping table can be determined by means of skeleton matching or the like.
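  • Skeleton matching against the mapping table can be sketched as a simple lookup. The row layout (keys "skeleton", "structure_class", "interaction_class", "ifp_valid"), the function name `evaluate` and the stand-in skeleton extractor are hypothetical; a real extractor would be supplied through the `skeleton_of` argument.

```python
def evaluate(query_smiles, skeleton_of, mapping_table):
    """Return the mapping-table rows whose skeleton matches the query's skeleton."""
    skeleton = skeleton_of(query_smiles)
    return [row for row in mapping_table if row["skeleton"] == skeleton]


def stand_in_skeleton(smiles):
    """Placeholder only, NOT a real Bemis-Murcko extraction."""
    return "c1ccccc1" if "c1ccccc1" in smiles else ""


# Made-up mapping table with one row per stored relationship.
toy_table = [
    {"skeleton": "c1ccccc1", "structure_class": 0, "interaction_class": 2, "ifp_valid": 1},
    {"skeleton": "c1ccccc1", "structure_class": 1, "interaction_class": 0, "ifp_valid": 0},
    {"skeleton": "", "structure_class": 0, "interaction_class": 1, "ifp_valid": 1},
]
toy_hits = evaluate("Nc1ccccc1", stand_in_skeleton, toy_table)
```

  • The matched rows tell the user which structural classes and interaction classes were observed for molecules sharing the query's skeleton, and whether those interactions were marked stable.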
  • users can select the skeletons, shapes and interaction clusters they are interested in, then select representative molecules in those clusters to enter the subsequent synthesis and testing stage, and verify from the synthesis and test results whether such skeletons, shapes and interactions are helpful for drug design against the target protein.
  • a batch of molecules can be combined according to different skeletons, shapes, and interaction clusters, and targeted control experiments can be performed to accelerate the structure-based drug development process.
  • This embodiment aggregates ligand molecules based on molecular structure, skeleton and interaction. Compared with a single docking score and threshold filtering, this embodiment is less affected by the accuracy of the docking scoring algorithm and can comprehensively consider the structural information of the binding between ligand molecules and receptor molecules, which is more in line with the thinking of drug designers and thus promotes drug development.
  • the molecular screening process is divided in sequence into the three dimensions of "skeleton clustering", "structural clustering" and "interaction clustering", which carries more information than single-dimensional docking scoring.
  • skeleton clustering is taken first, and the purpose is to make the whole screening process take molecular skeleton differences as the root category, which is more in line with the development habits of drug designers. It should be noted that the process of classification and aggregation of the two dimensions of "structural clustering” and “interaction clustering” has no sequence requirements.
  • the stability of the interaction is verified by means of dynamic simulation, which ensures the reliability of introducing the interaction fingerprint.
  • Another aspect of the present application also provides a design method.
  • Fig. 10 schematically shows a flow chart of a design method according to an embodiment of the present application.
  • the above design method includes operation S1010 to operation S1020.
  • Another aspect of the present application also provides a device for screening molecules.
  • Fig. 11 schematically shows a block diagram of a device for screening molecules according to an embodiment of the present application.
  • the device 1100 for screening molecules may include: a first mapping relationship obtaining module 1110 , a molecular skeleton extraction module 1120 , a molecular skeleton aggregation module 1130 and a second mapping relationship determination module 1140 .
  • the first mapping relationship obtaining module 1110 is used to obtain the first mapping relationship between the simplified molecular linear formulas of the M ligand molecules and the N molecular structures, where the simplified molecular linear formulas of the M ligand molecules each have structural information, and M and N are integers greater than or equal to 1.
  • the molecular skeleton extraction module 1120 is used to extract the skeletons of the structural information of the simplified molecular linear formulas for at least part of the simplified molecular linear formulas of M ligand molecules, respectively, to obtain O molecular skeletons, where O is an integer greater than or equal to 1 , and O is less than or equal to M.
  • the molecular skeleton aggregation module 1130 is used to aggregate O molecular skeletons to obtain P molecular skeleton classes, where P is an integer greater than or equal to 1, and P is less than or equal to O.
  • the second mapping relationship determination module 1140 is used to determine the second mapping relationship between the P molecular skeleton classes and the N molecular structures based on the first mapping relationship, so as to screen the ligand molecule matching the target receptor molecule based on the second mapping relationship .
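  • How the second mapping relationship follows from the first can be sketched with plain dictionaries. The sketch assumes the first mapping is stored as SMILES to a list of structure identifiers and that each SMILES has already been assigned a skeleton class; the function name `second_mapping`, the file names and the class labels are hypothetical.

```python
from collections import defaultdict


def second_mapping(first_mapping, skeleton_class_of):
    """Derive skeleton class -> molecular structures from SMILES -> structures."""
    result = defaultdict(set)
    for smiles, structures in first_mapping.items():
        # Every structure of a ligand belongs to that ligand's skeleton class.
        result[skeleton_class_of[smiles]].update(structures)
    return {cls: sorted(structs) for cls, structs in result.items()}


# Made-up data: M = 3 ligand molecules, N = 4 molecular structures.
toy_first = {
    "Cc1ccccc1": ["a.sdf", "b.sdf"],
    "Oc1ccccc1": ["c.sdf"],
    "CCO": ["d.sdf"],
}
toy_classes = {"Cc1ccccc1": 0, "Oc1ccccc1": 0, "CCO": 1}  # P = 2 classes
toy_second = second_mapping(toy_first, toy_classes)
```

  • In this toy case P = 2 skeleton classes cover all N = 4 structures, consistent with the constraint that the number of classes cannot exceed the number of skeletons or of ligand molecules.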
  • the above-mentioned apparatus 1100 further includes: a skeleton molecular structure acquisition module, a volume difference determination module, a structure clustering module and a third mapping relationship determination module.
  • the skeleton molecular structure acquisition module is used to obtain, for each of at least part of the skeleton classes in the P molecular skeleton classes, the molecular structures corresponding to at least part of the ligand molecules corresponding to that skeleton class.
  • the volume difference determination module is used to determine the volume difference between at least some of the ligand molecules based on the corresponding molecular structures of at least some of the ligand molecules.
  • the structure clustering module is used for clustering at least part of the ligand molecules corresponding to the skeleton class based on the volume difference to obtain multiple structural classes.
  • the third mapping relationship determination module is used to determine a third mapping relationship between multiple structural classes and molecular structures based on the first mapping relationship.
  • the volume difference determination module includes: a grid division unit, an occupancy ratio determination unit, an occupancy vector construction unit, and a volume difference determination unit.
  • the meshing unit is used to mesh the pocket region of the target receptor molecule.
  • the occupancy ratio determination unit is used to determine the occupancy ratio of the molecular structures corresponding to at least part of the ligand molecules to the grid.
  • the occupancy vector construction unit is used for constructing grid space occupancy vectors of molecular structures corresponding to at least part of the ligand molecules based on occupancy ratios.
  • the volume difference determination unit is used to determine the volume difference between at least some of the ligand molecules based on the grid space occupancy vectors of the corresponding molecular structures of at least some of the ligand molecules.
  • the volume difference determination unit includes: a distance determination subunit, and a volume difference determination subunit.
  • the distance determining subunit is used to determine the distance between the grid space occupation vectors of the corresponding molecular structures of the two ligand molecules.
  • the volume difference determination subunit is used to determine the volume difference of the space occupied by the corresponding molecular structures of the two ligand molecules based on the distance between the grid space occupancy vectors.
  • the above-mentioned apparatus 1100 further includes: an interaction feature acquisition module, an interaction difference determination module, an interaction difference clustering module, and an action mapping relationship determination module.
  • the interaction feature acquisition module is used to acquire, for each of at least some of the structural classes in the plurality of structural classes, first interaction features between at least part of the molecular structures corresponding to the structural class and target receptor molecules, and /or, for each class of at least some skeleton classes in the P molecular skeleton classes, acquire the second interaction characteristics between each of the at least part of the molecular structures corresponding to the skeleton class and the target receptor molecule.
  • the interaction difference determination module is used to determine first interaction differences between the respective first interaction features of at least some of the molecular structures, and/or determine second interaction differences between the respective second interaction features of at least some of the ligand molecules.
  • the interaction difference clustering module is used to cluster the first interaction differences to obtain multiple first interaction classes, and/or to cluster the second interaction differences to obtain multiple second interaction classes.
  • the action mapping relationship determination module is used to determine a fourth mapping relationship between the multiple first interaction classes and the molecular structure based at least on the first mapping relationship, and/or determine a fifth mapping relationship between the multiple second interaction classes and the molecular structure based on the first mapping relationship.
  • the interaction difference determination module includes: an interaction feature vector determination unit, and an interaction difference determination unit.
  • the interaction feature vector determining unit is used to determine an interaction feature vector corresponding to the second interaction feature.
  • the interaction difference determination unit is used to repeat the following operations until the interaction difference between every two of at least some of the respective second interaction features of the ligand molecules has been determined: determining the distance between the interaction feature vectors corresponding to the second interaction features; and determining the second interaction difference of the two ligand molecules based on that distance.
  • the above-mentioned apparatus 1100 further includes: a representative molecule determination module, a stability feature acquisition module and a stability mapping relationship determination module.
  • the representative molecule determining module is used to determine, for any class among the first interaction classes or the second interaction classes, the representative molecule of the current class.
  • the stability feature acquisition module is used to perform molecular dynamics simulation on the representative molecules of the current class to obtain the stability features of the representative molecules.
  • the stability mapping relationship determining module is used to determine a sixth mapping relationship between the stability feature and the first interaction class or the second interaction class based on the stability feature of the representative molecule.
  • the above-mentioned device 1100 further includes: an associative storage module, which is used for storing in association the simplified molecular linear formula of the ligand molecule, the molecular structure, and at least one of the following: molecular skeleton class, structure class, first interaction class or second interaction class, so as to obtain a mapping table.
  • the above-mentioned apparatus 1100 further includes: a simplified molecular linear formula acquisition module, a skeleton determination module and an evaluation module.
  • the simplified molecular linear formula acquisition module is used to obtain the simplified molecular linear formula of the molecule to be screened.
  • the skeleton determination module is used to determine the skeleton of the molecule to be screened based on the simplified molecular linear formula of the molecule to be screened.
  • the evaluation module is used for evaluating the molecule to be screened based on the skeleton of the molecule to be screened and the mapping table.
  • the above-mentioned device 1100 further includes: a homogeneous merging module, configured to perform homogeneous merging of the P molecular skeleton classes to obtain a multi-level molecular skeleton set, wherein a parent molecular skeleton in the multi-level molecular skeleton set corresponds to at least one child molecular skeleton, a bottom-level molecular skeleton corresponds to at least one ligand molecule, and the skeleton structure of a child molecular skeleton is more complex than that of its parent molecular skeleton.
  • the above-mentioned apparatus 1100 further includes: a skeleton graph generation module, configured to generate a skeleton graph, wherein the skeleton graph includes a plurality of nodes, the non-terminal nodes among the plurality of nodes represent at least part of the molecular skeletons, an end node among the plurality of nodes represents a cluster of molecules, among the M ligand molecules, that include the skeleton corresponding to the end node, and a parent node among the plurality of nodes corresponds to at least one child node.
  • Another aspect of the present application provides a device for evaluating molecules.
  • Fig. 12 schematically shows a block diagram of a device for evaluating molecules according to an embodiment of the present application.
  • the above-mentioned device 1200 for evaluating molecules may include a module 1210 for obtaining a simplified molecular linear formula, a module 1220 for obtaining a skeleton of a molecule to be screened, and a module 1230 for evaluating molecules.
  • the simplified molecular linear formula obtaining module 1210 is used to obtain the simplified molecular linear formula of the molecule to be screened.
  • the molecular skeleton obtaining module 1220 is used to determine the skeleton of the molecule to be screened based on the simplified molecular linear formula of the molecule to be screened.
  • the molecular evaluation module 1230 is used to evaluate the molecule to be screened based on the skeleton of the molecule to be screened and the various mapping relationships determined according to the above-mentioned device 1100, where the various mapping relationships include at least one of the first mapping relationship to the sixth mapping relationship.
  • Another aspect of the present application also provides a design device.
  • Fig. 13 schematically shows a block diagram of a design device according to an embodiment of the present application.
  • the design device 1300 may include: a screening result display module 1310 and a design module 1320 .
  • the screening result display module 1310 is used for displaying the molecular screening results, and the molecular screening results are the screening results obtained by the above-mentioned device 1100.
  • the design module 1320 is used for drug design or material design based on molecular screening results.
  • Another aspect of the present application also provides an electronic device.
  • Fig. 14 schematically shows a block diagram of an electronic device for implementing a method for screening molecules according to an embodiment of the present application.
  • an electronic device 1400 includes a memory 1410 and a processor 1420 .
  • the processor 1420 can be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the like.
  • the memory 1410 may include various types of storage units such as system memory, read only memory (ROM), and persistent storage. Wherein, the ROM can store static data or instructions required by the processor 1420 or other modules of the computer.
  • the persistent storage device may be a readable and writable storage device. Persistent storage may be a non-volatile storage device that does not lose stored instructions and data even if the computer is powered off.
  • the permanent storage device may be a mass storage device (such as a magnetic disk, an optical disk or flash memory).
  • the permanent storage device may be a removable storage device (such as a floppy disk, an optical drive).
  • the system memory can be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory.
  • System memory can store some or all of the instructions and data that the processor needs at runtime.
  • memory 1410 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (eg, DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), and magnetic and/or optical disks may also be used.
  • memory 1410 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, a super density disc, a flash memory card (such as an SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc.
  • Computer-readable storage media do not contain carrier waves and transient electronic signals transmitted by wireless or wire.
  • Executable code is stored in the memory 1410, and when the executable code is executed by the processor 1420, the processor 1420 may perform part or all of the methods mentioned above.
  • the method according to the present application can also be implemented as a computer program or computer program product, the computer program or computer program product including computer program code instructions for executing some or all of the steps in the above method of the present application.
  • the present application may also be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium or a machine-readable storage medium), on which executable code (or a computer program or computer instruction code) is stored; when the executable code (or computer program or computer instruction code) is executed by the processor of an electronic device (or server, etc.), the processor is made to perform part or all of the steps of the above-mentioned method according to the present application.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Heterocyclic Carbon Compounds Containing A Hetero Ring Having Oxygen Or Sulfur (AREA)

Abstract

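A method and device for screening molecules, and applications thereof. The method for screening molecules includes: obtaining a first mapping relationship between the simplified molecular linear formulas of M ligand molecules and N molecular structures (S310); for at least part of the simplified molecular linear formulas of the M ligand molecules, performing skeleton extraction on the structural information of each simplified molecular linear formula to obtain O molecular skeletons (S320); aggregating the O molecular skeletons to obtain P molecular skeleton classes (S330); and determining a second mapping relationship between the P molecular skeleton classes and the N molecular structures based on the first mapping relationship (S340), so as to screen, based on the second mapping relationship, molecules matching a target receptor molecule from a set of simplified molecular linear formulas that includes the simplified molecular linear formulas of the M ligand molecules. The method can improve the accuracy and convenience with which users screen molecules.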

Description

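Method and device for screening molecules, and applications thereof
Technical Field
The present application relates to the technical field of computational simulation, and in particular to a method and device for screening molecules and applications thereof.
Background
With the rapid development of computer technology and the theory of basic disciplines, the computational efficiency and accuracy of molecular simulation have improved greatly, so that molecular simulation is widely used across many disciplines. Screening molecules is an important part of molecular simulation.
In the related art, molecular screening can be performed based on thresholds for preset indicators. However, this may cause some molecules that would be helpful for subsequent development to be filtered out.
Summary
To solve, or partially solve, the problems in the related art, the present application provides a method and device for screening molecules and applications thereof, which can reduce the probability that molecules helpful for subsequent development are filtered out.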
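A first aspect of the present application provides a method for screening molecules, the method including: obtaining a first mapping relationship between the simplified molecular linear formulas of M ligand molecules and N molecular structures, where the simplified molecular linear formulas of the M ligand molecules each have structural information, and M and N are integers greater than or equal to 1; for each of at least some of the molecules among the simplified molecular linear formulas of the M ligand molecules, performing skeleton extraction on the structural information of the ligand molecule to obtain O molecular skeletons, where O is an integer greater than or equal to 1 and O is less than or equal to M; aggregating the O molecular skeletons to obtain P molecular skeleton classes, where P is an integer greater than or equal to 1 and P is less than or equal to O; and determining a second mapping relationship between the P molecular skeleton classes and the N molecular structures based on the first mapping relationship, so as to screen, based on the second mapping relationship, ligand molecules matching a target receptor molecule.
A second aspect of the present application provides a method for evaluating a molecule, the method including: obtaining the simplified molecular linear formula of a molecule to be screened; determining the skeleton of the molecule to be screened based on its simplified molecular linear formula; and evaluating the molecule to be screened based on its skeleton and the multiple mapping relationships determined by the above method, where the multiple mapping relationships include at least one of the first mapping relationship to the sixth mapping relationship.
A third aspect of the present application provides a design method, the design method including: displaying molecular screening results, where the molecular screening results are screening results obtained according to the above method; and performing drug design or material design based on the molecular screening results.
A fourth aspect of the present application provides a device for screening molecules, including: a first mapping relationship obtaining module, used to obtain a first mapping relationship between the simplified molecular linear formulas of M ligand molecules and N molecular structures, where the simplified molecular linear formulas of the M ligand molecules each have structural information, and M and N are integers greater than or equal to 1; a molecular skeleton extraction module, used to perform, for each of at least some of the molecules among the simplified molecular linear formulas of the M ligand molecules, skeleton extraction on the structural information of the ligand molecule to obtain O molecular skeletons, where O is an integer greater than or equal to 1 and O is less than or equal to M; a molecular skeleton aggregation module, used to aggregate the O molecular skeletons to obtain P molecular skeleton classes, where P is an integer greater than or equal to 1 and P is less than or equal to O; and a second mapping relationship determination module, used to determine a second mapping relationship between the P molecular skeleton classes and the N molecular structures based on the first mapping relationship, so as to screen, based on the second mapping relationship, ligand molecules matching a target receptor molecule.
A fifth aspect of the present application provides a device for evaluating molecules. The device includes: a simplified molecular linear formula obtaining module, used to obtain the simplified molecular linear formula of a molecule to be screened; a module for obtaining the skeleton of the molecule to be screened, used to determine the skeleton of the molecule to be screened based on its simplified molecular linear formula; and a molecular evaluation module, used to evaluate the molecule to be screened based on its skeleton and the multiple mapping relationships determined according to the above device, where the multiple mapping relationships include at least one of the first mapping relationship to the sixth mapping relationship.
A sixth aspect of the present application provides a design device, the design device including a screening result display module and a design module. The screening result display module is used to display molecular screening results, where the molecular screening results are screening results obtained according to the above device; the design module is used to perform drug design or material design based on the molecular screening results.
A seventh aspect of the present application provides an electronic device, including: a processor; and a memory on which executable code is stored, where the executable code, when executed by the processor, causes the processor to perform the above method.
An eighth aspect of the present application further provides a computer-readable storage medium on which executable code is stored, where the executable code, when executed by a processor of an electronic device, causes the processor to perform the above method.
A ninth aspect of the present application further provides a computer program product including executable code, where the executable code implements the above method when executed by a processor.
The method and device for screening molecules and applications thereof provided by the present application determine the skeleton of a ligand molecule based on its molecular structure and cluster the skeletons of multiple ligand molecules to obtain skeleton classes. In this way a mapping relationship between skeleton classes and molecular structures can be constructed, so that features such as the molecular structure of a molecule to be screened can be predicted from its skeleton, improving the accuracy and convenience of molecular screening and helping to recommend reasonable molecules for the synthesis and testing stage.
In addition, the technical solution provided by the present application can further determine mapping relationships between molecular structures and structural classes and/or interaction classes, making it easier for users to screen molecules based on mapping relationships in more dimensions.
Furthermore, the technical solution provided by the present application can verify, based on the results of dynamics simulation, whether the interactions in an interaction class are stable, making it easier for users to screen molecules based on whether the interactions are stable.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present application.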
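Brief Description of the Drawings
The above and other objects, features and advantages of the present application will become more apparent from the following more detailed description of exemplary embodiments taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 schematically shows the process of screening molecules according to an embodiment of the present application;
Fig. 2 schematically shows an exemplary system architecture to which the method and apparatus for screening molecules and their applications can be applied according to an embodiment of the present application;
Fig. 3 schematically shows a flowchart of a method for screening molecules according to an embodiment of the present application;
Figs. 4 to 6 schematically show the process of extracting a molecular scaffold according to an embodiment of the present application;
Fig. 7 schematically shows a scaffold graph according to an embodiment of the present application;
Fig. 8A schematically shows the structure of one scaffold according to an embodiment of the present application;
Fig. 8B schematically shows the structure of another scaffold according to an embodiment of the present application;
Fig. 9 schematically shows a flowchart of a molecule evaluation method according to an embodiment of the present application;
Fig. 10 schematically shows a flowchart of a design method according to an embodiment of the present application;
Fig. 11 schematically shows a block diagram of an apparatus for screening molecules according to an embodiment of the present application;
Fig. 12 schematically shows a block diagram of an apparatus for evaluating molecules according to an embodiment of the present application;
Fig. 13 schematically shows a block diagram of a design apparatus according to an embodiment of the present application;
Fig. 14 schematically shows a block diagram of an electronic device implementing a method for screening molecules according to an embodiment of the present application.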
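Detailed Description
Embodiments of the present application are described in more detail below with reference to the drawings. Although the drawings show embodiments of the present application, it should be understood that the application can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the application will be thorough and complete and will fully convey its scope to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the application. The terms "include", "comprise" and the like, as used here, indicate the presence of features, steps, operations and/or components but do not exclude the presence or addition of one or more other features, steps, operations or components.
Unless otherwise defined, all terms used here (including technical and scientific terms) have the meaning commonly understood by those skilled in the art. Terms used here should be interpreted as having a meaning consistent with the context of this specification and should not be interpreted in an idealized or overly rigid manner.
It should be understood that although the terms "first", "second", "third" and so on may be used in the present application to describe various pieces of information, the information should not be limited by these terms. The terms are used only to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information and, similarly, second information may also be called first information. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of the present application, "a plurality of" means two or more, unless specifically defined otherwise.
Before the technical solutions of the present application are described, some technical terms of the field are explained.
Molecular docking is a molecular simulation method that analyses the properties of a receptor and a ligand and their interaction, mainly through electrostatic field forces, in order to predict the binding mode of the receptor and the ligand.
Molecular simulation refers to using theoretical methods and computer technology to simulate the structure and the physicochemical properties of a molecule or a molecular system.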
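To solve the problems in the related art, virtual screening based on molecular structure can be applied in the early stage of materials development, for example in drug development. The role of virtual screening is to pick out, from a large virtual molecule library (for example, more than 10^8 molecules), potential ligand molecules (such as drug molecules) that can bind to a target receptor molecule (such as a protein).
The related art can use virtual screening algorithms to score the interaction between a drug molecule and a protein. A molecule with a high score has more potential to become a drug candidate and move to the next stage of development. Software of this kind is usually called molecular docking software. Examples include, but are not limited to, AutoDock Vina, ICM, LeDock, rDock and UCSF DOCK, as well as commercial packages such as Glide, LigandFit, GOLD and MOE Dock. Structure-based screening has become one of the paradigms of early small-molecule drug discovery in the related art, and compound libraries available for screening include, for example, Enamine REAL and Labnetworkx. These libraries contain more than a hundred million compounds, and the rapid growth of cloud computing has made large-scale screening calculations feasible. However, large-scale screening poses a challenge for post-processing, in particular for picking hit compounds that fit the binding pocket of a specific target. In the related art, the screening process docks and scores the molecules of the library with one or more such docking programs and sets a reasonably acceptable threshold. Molecules scoring above the threshold are retained for use in the subsequent development workflow.
Specifically, a three-dimensional (3D) structure of the molecule to be screened is first built and energy-minimized. Hydrogen atoms are then added to the 3D structure, and a force field and the atomic charges of the molecule are assigned. Next, a probe of preset radius is used to generate the molecular surface of the template target receptor molecule, and plausible binding sites on that surface are determined. Each plausible binding site is then scored and evaluated energetically. For instance, semi-flexible docking can be used to generate a specified number of orientations of different conformations, obtain the electrostatic and van der Waals interactions between the molecule and the binding site, and compute a score for the molecule from them. The score is compared with a preset threshold to decide whether the molecule is kept for the subsequent development workflow.
However, the best AUC that such scoring algorithms or software achieve on test sets is below 80%, so defining the filter by a single threshold can cause molecules that might help subsequent development to be filtered out too early. AUC (area under the curve) is defined as the area enclosed between the receiver operating characteristic (ROC) curve and the coordinate axes; the closer the AUC is to 1.0, the more reliable the detection method.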
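To make the AUC figure concrete, the following minimal sketch computes the area under the ROC curve directly from docking scores and known activity labels, using the equivalent pairwise formulation (the probability that a randomly chosen active outscores a randomly chosen inactive). The function name, the labels and the scores are illustrative assumptions, not data from the application.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    labels: 1 for an active molecule, 0 for an inactive one.
    scores: higher means "predicted more likely to bind".
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive molecule")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical example: one active (score 0.4) ranks below one inactive (0.7).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

With these made-up numbers, any threshold between 0.4 and 0.7 discards the third active, which is the failure mode the paragraph above describes.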
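For example, the applicant has found that docking-based algorithms in the related art can also screen molecules from a large compound library by choosing a particular docking-score threshold, but the overall accuracy of selecting positive molecules (e.g. AUC) tends to depend on the accuracy of the docking algorithm and on the threshold the user chooses. In early drug discovery, such screening may miss potential hit compounds whose docking score is low (below the user's threshold, or because the scoring algorithm is inaccurate) but whose scaffold has room for optimization.
In summary, finding an effective way to evaluate molecules that goes beyond considering the docking score alone is a problem that remains to be addressed in virtual screening.
In addition, in early drug screening, and especially in early development of a first-in-class (FIC) drug, developers expect the evaluation of molecules to help with the strategy for recommending molecules for synthesis and testing. Structure-based drug design usually establishes a hypothesized link between the structural information of a small molecule and its protein binding mode, and uses that link to recommend synthesis, testing, hypothesis validation and so on, so that later improvements can be made. Docking software, however, evaluates mostly from the standpoint of the small-molecule and protein structures. It is therefore particularly important to aggregate the docking results of many small molecules, bringing together the important differences in structure, scaffold and binding mode, and to use them to support recommending molecules to the subsequent synthesis and testing workflow.
The related art can use evaluation algorithms or software such as quantitative structure-activity relationship (QSAR) analysis to evaluate drug molecules while taking into account some factors related to the binding of the molecule to the protein. However, the evaluation result is not directly linked to the three-dimensional structure with which the small molecule occupies the protein pocket. This makes it hard to establish a hypothesized link between small-molecule structure and binding mode, so the result is of limited help in recommending molecules for synthesis and testing.
The present application aims to provide a molecule evaluation method for virtual screening that aggregates information on successfully docked molecules based on at least one of molecular structure, scaffold and binding mode, so as to help recommend reasonable molecules for synthesis and testing.
The technical solution of the present application can cope with a growing number of positive compounds while providing a screening method that better matches drug-development experience. It projects the complex, high-dimensional spatial structure information of small-molecule/protein complexes, level by level, onto scaffold, shape and binding mode, and on that basis selects representative molecules with clustering algorithms. In practice, medicinal chemists can choose a suitable number of representative molecules according to objective factors such as the project's stage, risk and budget, and tune the clustering and screening algorithms accordingly. This provides effective decision support from large-scale virtual screening through to the recommendation of pipeline molecules. Note that the solution is flexible about its input: it applies to molecule libraries of any size.
Furthermore, the embodiments of the present application move away from screening with the score as the sole reference and return to structure-based rational design, selecting molecules with privileged scaffolds on the basis of structural plausibility for downstream optimization.
A method and apparatus for screening molecules and their applications according to embodiments of the present application are described in detail below with reference to Figs. 1 to 13.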
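Fig. 1 schematically shows the process of screening molecules according to an embodiment of the present application.
Referring to Fig. 1, this embodiment links at least one of the scaffold classes, structure classes and interaction classes of successfully docked ligand molecules to their molecular structures and SMILES strings, so that users can draw on this knowledge when screening candidate molecules, which improves screening accuracy.
Fig. 2 schematically shows an exemplary system architecture to which the method and apparatus for screening molecules and their applications can be applied according to an embodiment of the present application.
Note that Fig. 2 is only an example of a system architecture to which the embodiments can be applied. It is intended to help those skilled in the art understand the technical content of the present application and does not mean that the embodiments cannot be used with other devices, systems, environments or scenarios.
Referring to Fig. 2, the system architecture 200 of this embodiment may include terminal devices 201, 202 and 203, a network 204 and a server 205. The network 204 is the medium that provides communication links between the terminal devices 201, 202, 203 and the server 205. The network 204 may include various connection types, such as wired or wireless communication links or fibre-optic cables.
A user can use the terminal devices 201, 202, 203 to interact with other terminal devices and the server 205 over the network 204 in order to receive or send information, for example to send requests for the three-dimensional structure of a molecule or requests to screen molecules, and to receive screening results or three-dimensional molecular structures. Various communication client applications may be installed on the terminal devices 201, 202, 203, such as web browsers, drug-development applications, database applications, search applications, instant messaging tools, e-mail clients and social platform software.
The terminal devices 201, 202, 203 include, but are not limited to, electronic devices that support functions such as internet access and image display, for example desktop computers, tablets and laptops.
The server 205 can receive requests such as requests for the three-dimensional structure of a molecule and send information such as three-dimensional molecular structures to the terminal devices 201, 202, 203. For example, the server 205 may be a back-end management server or a server cluster.
Note that the numbers of terminal devices, networks and servers are only illustrative. There may be any number of terminal devices, networks and servers, as the implementation requires.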
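Fig. 3 schematically shows a flowchart of a method for screening molecules according to an embodiment of the present application.
Referring to Fig. 3, this embodiment provides a method for screening molecules that includes operations S310 to S340, as follows.
In operation S310, a first mapping is obtained between the SMILES strings of M ligand molecules and N molecular structures. Each of the M SMILES strings carries structural information, and M and N are integers greater than or equal to 1.
In this embodiment, a ligand molecule is the molecule that corresponds to a receptor molecule. For example, if small molecule A docks successfully with macromolecule B, small molecule A can be called the ligand molecule. The structural information may be the structural information contained in the SMILES string.
Molecular docking is a drug design method based on the features of the receptor molecule (such as a protein) and on the way the receptor molecule and the ligand molecule (such as a drug molecule) interact. Molecular docking studies intermolecular interactions (for example between a ligand molecule and a receptor molecule) and predicts their binding mode and affinity.
A drug molecule and a protein macromolecule in the body recognize each other in a way similar to a key and a lock, and this recognition can depend on both spatial matching and energetic matching between the two. Take a certain virus as an example: its RNA depends on a certain RNA polymerase protein, and a specific site of that RNA polymerase has been confirmed as the target of small-molecule drugs. Molecular docking can then be used to estimate the binding activity of many small molecules at this target and thereby predict whether they have the potential to become drug candidates.
A SMILES string expresses the structure of a molecule as text. Specifically, it may conform to the Simplified Molecular-Input Line-Entry System (SMILES). SMILES represents a molecule by encoding its structure as text. Converting structural information into text allows the text (the encoded string) to be used as input to a machine-learning pipeline, which makes it convenient to apply natural language processing (NLP) algorithms to drug development.
One SMILES string may have several corresponding molecular structures. The first mapping can take the following form, where the SMILES string corresponds to the smiles column and the molecular structure to the sdf_index column. Each record of the first mapping can be written as: sdf_index,smiles.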
For example, first mapping 1:
protein_ligand_02538_energy_0_split_0_pose_8,CCl(C)CC(C[NH2+]CC(=O)
First mapping 2:
protein_ligand_02538_energy_0_split_0_pose_3,CCl(C)CC(C[NH2+]CC(=O)
First mapping 3:
protein_ligand_02538_energy_0_split_0_pose_5,CCl(C)CC(C[NH2+]CC(=O)
In first mappings 1, 2 and 3, the field before the comma is the sdf_index column and the field after the comma is the smiles column.
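In a specific embodiment, the molecular structures (post-docking structures) of a series of compounds are obtained from a virtual screening run and stored in files in .sdf format, in which only the structure of the small-molecule ligand need be stored. In addition, there is a .csv file that stores the mapping between the SMILES string of each molecule and the file (.sdf) holding its molecular structure, i.e. the first mapping. The SMILES string of one molecule may correspond to several molecular structures.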
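As an illustration of the first mapping, the sketch below reads a relation table with the two columns named in the text (sdf_index and smiles) and groups the docked poses by SMILES string. It uses only the Python standard library. The pose identifiers and SMILES strings are invented for the example, and the helper name load_first_mapping is an assumption.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the .csv relation file; IDs and SMILES are made up.
CSV_TEXT = """sdf_index,smiles
lig_0001_pose_0,c1ccccc1CCN
lig_0001_pose_3,c1ccccc1CCN
lig_0002_pose_1,O=C(O)c1ccncc1
"""


def load_first_mapping(text):
    """Return {smiles: [sdf_index, ...]}, keeping the order of the file."""
    mapping = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        mapping[row["smiles"]].append(row["sdf_index"])
    return dict(mapping)


first_mapping = load_first_mapping(CSV_TEXT)
m = len(first_mapping)                                 # distinct SMILES strings (M)
n = sum(len(ids) for ids in first_mapping.values())    # docked structures (N)
```

Here one SMILES string maps to two poses, which mirrors the one-to-many relation described above.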
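In operation S320, for at least some of the SMILES strings of the M ligand molecules, scaffold extraction is performed on the structural information of each SMILES string to obtain O molecular scaffolds, where O is an integer greater than or equal to 1 and O is less than or equal to M.
In some embodiments, various scaffold extraction methods can be used to extract a scaffold from a SMILES string. For example, a scaffold extraction algorithm is run on the SMILES string of each molecule to obtain its Bemis-Murcko scaffold.
Figs. 4 to 6 schematically show the process of extracting a molecular scaffold according to an embodiment of the present application.
Referring to Fig. 4, SMILES string 1 can be converted into the molecular structure diagram shown in Fig. 4. Many related techniques can convert between SMILES strings and structure diagrams, so they are not detailed here.
Referring to Fig. 5, unlike Fig. 4, the nitrogen (N) and oxygen (O) atoms have been replaced, for example by carbon (C) or hydrogen (H) atoms. Note that element symbols for carbon (C) and hydrogen (H) atoms are omitted in Fig. 4. Alternatively, the element symbols in Fig. 4 can simply be deleted. The double bonds in Fig. 4 are then replaced by single bonds. These operations yield the structure shown in Fig. 5. The above is only illustrative and should not be construed as limiting the present application.
Referring to Fig. 6, unlike Fig. 5, the side chains or dangling bonds have been removed. These operations yield the scaffold corresponding to the SMILES string. For example, the scaffold can be obtained with the Bemis-Murcko framework.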
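The two steps of Figs. 5 and 6 can be sketched on a plain molecular graph without a cheminformatics toolkit. The sketch below is a simplification under stated assumptions: the molecule is given as a dictionary of heavy atoms and a list of bonds (a representation chosen here for illustration), every atom is turned into carbon and every bond into a single bond, and terminal atoms are stripped repeatedly until only rings and the linkers between them remain. A production workflow would more likely rely on a toolkit's own Murcko scaffold routine; the function name generic_scaffold is hypothetical.

```python
def generic_scaffold(atoms, bonds):
    """Return (atoms, bonds) of a generic ring-and-linker framework.

    atoms: {atom_index: element_symbol} for heavy atoms.
    bonds: iterable of (i, j, bond_order).
    """
    # Step 1 (cf. Fig. 5): drop element and bond-order information.
    neighbours = {i: set() for i in atoms}
    for i, j, _order in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    # Step 2 (cf. Fig. 6): strip terminal atoms until none are left,
    # which removes side chains but keeps rings and ring-to-ring linkers.
    changed = True
    while changed:
        changed = False
        for i in [a for a, nb in neighbours.items() if len(nb) <= 1]:
            for j in neighbours.pop(i):
                neighbours[j].discard(i)
            changed = True
    kept_atoms = {i: "C" for i in neighbours}
    kept_bonds = sorted((i, j, 1) for i in neighbours for j in neighbours[i] if i < j)
    return kept_atoms, kept_bonds


# Hypothetical molecule: ring A (0-5), an N linker (6), ring B (7-12),
# and a carbonyl O (13) hanging off ring B.
atoms = {i: "C" for i in range(14)}
atoms[6], atoms[13] = "N", "O"
ring_a = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 5, 1), (5, 0, 1)]
ring_b = [(7, 8, 1), (8, 9, 1), (9, 10, 1), (10, 11, 1), (11, 12, 1), (12, 7, 1)]
bonds = ring_a + [(0, 6, 1), (6, 7, 1)] + ring_b + [(9, 13, 2)]

scaffold_atoms, scaffold_bonds = generic_scaffold(atoms, bonds)
```

Note that a molecule with no ring is reduced to an empty scaffold by this procedure, so acyclic ligands would need separate handling.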
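In operation S330, the O molecular scaffolds are aggregated to obtain P molecular scaffold classes, where P is an integer greater than or equal to 1 and P is less than or equal to O.
In this embodiment, aggregating the molecular scaffolds yields a number of scaffold classes. For example, identical scaffolds are placed in the same class.
In some embodiments, after the P scaffold classes are obtained, the method may further include merging like classes among the P scaffold classes to obtain a multi-level set of molecular scaffolds. In this set, a parent scaffold corresponds to at least one child scaffold, a bottom-level scaffold corresponds to at least one ligand molecule, and the structure of a child scaffold is more complex than that of its parent scaffold.
Merging like classes can bring at least some of the ligand molecules into a single table and makes the parent-child relationships among scaffolds clearer, so that users can pick potentially useful scaffolds or molecules based on those relationships.
In some embodiments, after the multi-level set of molecular scaffolds is obtained, the method may further include generating a scaffold graph. The scaffold graph includes a plurality of nodes: the non-terminal nodes represent at least some of the scaffolds in the multi-level set, each terminal node represents the cluster of molecules, among the M ligand molecules, that contain the scaffold corresponding to that terminal node, and each parent node corresponds to at least one child node.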
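Fig. 7 schematically shows a scaffold graph according to an embodiment of the present application.
Referring to Fig. 7, each node can represent a scaffold class, a parent node corresponds to at least one child node, and a bottom-level node can correspond to the molecular formula of a specific ligand molecule. For instance, the two nodes enclosed in dashed circles on the left of Fig. 7 are the formulas of two ligand molecules. These two ligand molecules have the same scaffold class; however, the scaffold class of the node in the upper dashed circle and that of the node in the lower dashed circle are in a child-parent relationship.
The graph shows the membership relationship between molecules and scaffolds at a glance, and each terminal leaf node represents a cluster of molecules that share the same terminal scaffold. Numbering the nodes yields the scaffold numbers of the scaffold-based clustering. For example, the root node is 1, and the remaining nodes are numbered with Arabic numerals in graph-traversal order, e.g. 1, 2, 3, 4 and so on.
In addition, the nodes in Fig. 7 can have a fill colour or a fill pattern. Taking fill colour as an example, the depth of the colour can indicate the mean score of all molecules contained under the current scaffold node. If the score represents activity, a darker node means that the molecules containing its scaffold have higher activity values, i.e. the ligand molecules corresponding to that node match the target receptor molecule better. Users can thus see directly from the scaffold graph which scaffold gives molecules with higher activity and a higher probability of being ligands for the target receptor molecule.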
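The node numbering and the per-node mean score described for Fig. 7 can be sketched as follows. This is a minimal illustration, assuming the scaffold hierarchy is already available as a parent-to-children dictionary. Breadth-first order is used as one possible traversal order, and the node names and scores are placeholders rather than data from the application.

```python
from collections import deque


def number_nodes(children, root):
    """Number nodes 1, 2, 3, ... in breadth-first order starting at the root."""
    numbers, queue = {}, deque([root])
    while queue:
        node = queue.popleft()
        numbers[node] = len(numbers) + 1
        queue.extend(children.get(node, []))
    return numbers


def mean_score(children, scores, node):
    """Mean score of every molecule at or below a node, or None if there are none."""
    collected, stack = [], [node]
    while stack:
        current = stack.pop()
        collected.extend(scores.get(current, []))
        stack.extend(children.get(current, []))
    return sum(collected) / len(collected) if collected else None


# Hypothetical hierarchy: A is the root scaffold, D is a child of B.
children = {"A": ["B", "C"], "B": ["D"]}
# Hypothetical docking scores of the molecules attached to the leaf scaffolds.
scores = {"C": [6.0, 8.0], "D": [9.0]}

node_numbers = number_nodes(children, "A")
root_mean = mean_score(children, scores, "A")
```

The mean value could then be mapped to a colour scale when the graph is drawn, so that darker nodes stand for higher-scoring scaffolds.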
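Fig. 8A schematically shows the structure of one scaffold according to an embodiment of the present application. Fig. 8B schematically shows the structure of another scaffold according to an embodiment of the present application.
Referring to Figs. 8A and 8B, Fig. 8A shows the scaffold of the ligand molecule labelled MOL0436 and Fig. 8B shows the scaffold of the ligand molecule labelled MOL0049. The central parts of the two scaffolds, enclosed by dashed circles, are identical, so the two scaffolds share the same parent scaffold.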
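In operation S340, a second mapping between the P molecular scaffold classes and the N molecular structures is determined based on the first mapping, so that molecules matching the target receptor molecule can be screened, based on the second mapping, from the set of SMILES strings that includes the SMILES strings of the M ligand molecules.
In this embodiment, building the second mapping allows users to screen molecules based at least on that mapping.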
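One way to derive the second mapping is to join the first mapping with the scaffold assigned to each SMILES string. The sketch below assumes that the scaffold of every SMILES string has already been computed and is supplied as a dictionary. The function name, the pose identifiers and the choice of numbering classes in first-seen order are illustrative assumptions.

```python
def build_second_mapping(first_mapping, scaffold_of):
    """Group docked structures by scaffold class.

    first_mapping: {smiles: [sdf_index, ...]}  (the first mapping)
    scaffold_of:   {smiles: scaffold_smiles}   (precomputed scaffolds)
    Returns ({scaffold: class_id}, {class_id: [sdf_index, ...]}).
    """
    class_ids = {}
    second_mapping = {}
    for smiles, sdf_ids in first_mapping.items():
        scaffold = scaffold_of[smiles]
        # Identical scaffolds share one class; new scaffolds get the next number.
        cid = class_ids.setdefault(scaffold, len(class_ids) + 1)
        second_mapping.setdefault(cid, []).extend(sdf_ids)
    return class_ids, second_mapping


# Hypothetical input: toluene and ethylbenzene share the benzene scaffold.
first_mapping = {
    "Cc1ccccc1": ["pose_1", "pose_2"],
    "CCc1ccccc1": ["pose_3"],
    "Cc1ccncc1": ["pose_4"],
}
scaffold_of = {
    "Cc1ccccc1": "c1ccccc1",
    "CCc1ccccc1": "c1ccccc1",
    "Cc1ccncc1": "c1ccncc1",
}
class_ids, second_mapping = build_second_mapping(first_mapping, scaffold_of)
```

In this example M = 3 SMILES strings give O = 3 extracted scaffolds and P = 2 scaffold classes, consistent with P being at most O.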
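In some embodiments, to let users screen molecules along more dimensions, the method may further include the following operations.
First, for each of at least some of the P scaffold classes, the molecular structures corresponding to at least some of the ligand molecules of that scaffold class are obtained. Referring to Fig. 7, all molecular structures corresponding to a given node can be obtained.
Then, the pairwise volume differences among those ligand molecules are determined from their corresponding molecular structures.
Next, the ligand molecules of that scaffold class are clustered based on the volume differences to obtain a plurality of structure classes. For example, molecules with small volume differences are placed in the same class.
Then, a third mapping between the structure classes and the molecular structures is determined based on the first mapping.
Determining the pairwise volume differences among the ligand molecules from their corresponding molecular structures may include the following operations.
First, the pocket region of the target receptor molecule is divided into a grid.
Then, the proportion of the grid occupied by the molecular structure of each ligand molecule is determined.
Next, a grid-space occupancy vector is built for the molecular structure of each ligand molecule based on the occupancy proportion.
Then, the pairwise volume differences among the ligand molecules are determined from the grid-space occupancy vectors of their molecular structures.
For example, determining the pairwise volume differences from the grid-space occupancy vectors may include the following operations.
First, the distance between the grid-space occupancy vectors of the molecular structures of two ligand molecules is determined.
Then, the volume difference between the spaces occupied by the molecular structures of the two ligand molecules is determined from the distance between their grid-space occupancy vectors.
These operations are repeated until the volume difference between the molecular structures of any two of the ligand molecules has been determined.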
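In a specific embodiment, the three-dimensional molecular structures of all molecules in one scaffold class are first extracted, for example as an .sdf file. The difference in occupied volume between the three-dimensional structures of each pair of molecules is then computed. Because the structures come from docking, no structure-based alignment or translation is needed at this point. The pairwise difference in occupied volume is used as the spatial distance between two molecules, on which a clustering analysis based on three-dimensional shape differences is performed. Each resulting class contains molecules whose occupied volumes are similar. For example, molecules whose spatial distance is below a preset distance threshold can be placed in the same class, and the preset distance threshold may be, for example, 2.5 Å to 4 Å, among other values. The difference in occupied volume between two three-dimensional structures can be determined as follows: the pocket region of the receptor structure (such as a protein structure) is divided into an equally spaced grid, the occupancy of the grid by each molecular structure is determined to build a grid-space occupancy vector, and the distance between two occupancy vectors (for example the Tanimoto distance or the Euclidean distance) gives the difference in occupied volume between the two molecules.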
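The grid-occupancy comparison can be sketched as follows. This is a simplified illustration under stated assumptions: a grid cell counts as occupied when an atom centre falls inside it (a binary stand-in for the occupancy proportion mentioned in the text), the grid origin, spacing and shape are arbitrary example values, and the coordinates are invented.

```python
def occupancy_vector(coords, origin, spacing, shape):
    """Binary occupancy of an equally spaced grid laid over the pocket.

    coords: atom centres (x, y, z) of a docked pose, already in the pocket frame.
    """
    nx, ny, nz = shape
    vec = [0] * (nx * ny * nz)
    for x, y, z in coords:
        ix = int((x - origin[0]) // spacing)
        iy = int((y - origin[1]) // spacing)
        iz = int((z - origin[2]) // spacing)
        if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
            vec[(ix * ny + iy) * nz + iz] = 1
    return vec


def tanimoto_distance(a, b):
    """1 - |A and B| / |A or B| for two binary vectors of equal length."""
    both = sum(1 for p, q in zip(a, b) if p and q)
    either = sum(1 for p, q in zip(a, b) if p or q)
    return 0.0 if either == 0 else 1.0 - both / either


# Two hypothetical poses that overlap in two of four occupied cells.
pose_a = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5)]
pose_b = [(1.5, 0.5, 0.5), (2.5, 0.5, 0.5), (3.5, 0.5, 0.5)]
grid = dict(origin=(0.0, 0.0, 0.0), spacing=1.0, shape=(4, 4, 4))

vec_a = occupancy_vector(pose_a, **grid)
vec_b = occupancy_vector(pose_b, **grid)
shape_distance = tanimoto_distance(vec_a, vec_b)
```

Because both poses are expressed in the receptor's coordinate frame after docking, the vectors can be compared directly without alignment, as noted above.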
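In this embodiment, the "lock and key" principle and the "induced fit" theory provide the basis for determining that a ligand molecule can bind correctly in the protein pocket: the ligand molecule should match (complement) the pocket in shape and in electrostatic distribution. The approach above makes it possible to determine how well a molecule matches the target receptor molecule structurally.
In some embodiments, after the structure classes are obtained, the method may further include the following operations to analyse the mappings between structure classes and/or scaffold classes and molecular structures.
First, for each of at least some of the structure classes, first interaction features between the target receptor molecule and each of at least some of the molecular structures of that structure class are obtained; and/or, for each of at least some of the P scaffold classes, second interaction features between the target receptor molecule and each of at least some of the molecular structures of that scaffold class are obtained.
Then, the pairwise first interaction differences among the first interaction features of the molecular structures are determined, and/or the pairwise second interaction differences among the second interaction features of the ligand molecules are determined.
Next, the first interaction differences are clustered to obtain a plurality of first interaction classes, and/or the second interaction differences are clustered to obtain a plurality of second interaction classes.
Then, a fourth mapping between the first interaction classes and the molecular structures is determined based at least on the first mapping, and/or a fifth mapping between the second interaction classes and the molecular structures is determined based on the first mapping.
This embodiment enables clustering analysis based on the binding-mode fingerprint between a molecule and a protein. The binding-mode fingerprint is an encoding that characterizes the structure-based types of interaction between a small molecule and a protein. For example, the interaction fingerprint includes, but is not limited to, at least one of: the interaction type, the index of the ligand atom at the interaction site, and the interaction site on the protein's amino-acid residue.
Note that once the scaffold classes have been obtained, the interaction classes can be determined as described above without first obtaining structure classes.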
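For example, after the P scaffold classes are obtained, the method may further include the following operations.
First, for each of at least some of the P scaffold classes, second interaction features between the target receptor molecule and each of at least some of the ligand molecules of that scaffold class are obtained.
Then, the pairwise second interaction differences among the second interaction features of the ligand molecules are determined.
Next, the second interaction differences are clustered to obtain a plurality of second interaction classes.
Then, a fifth mapping between the second interaction classes and the molecular structures is determined based at least on the first mapping.
In some embodiments, determining the pairwise second interaction differences among the second interaction features of the ligand molecules may include the following operations.
First, the interaction feature vectors corresponding to the second interaction features are determined.
Then, the following is repeated until the interaction difference between any two of the second interaction features has been determined: the distance between the interaction feature vectors corresponding to two second interaction features is determined, and the second interaction difference between the two ligand molecules is determined from that distance.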
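In a specific embodiment, the information encoded in the binding-mode fingerprint between a molecule and a protein includes the interaction type (such as hydrogen-bond donor or acceptor, π-π interaction and so on), the index of the ligand atom at the interaction site, and the interaction site on the protein's amino-acid residue. This information makes it possible to identify quickly the structure-based interactions formed between the molecule and the protein. Each molecule can form several such interactions with the protein, so several interaction fingerprints can be extracted from its post-docking structure. An interaction fingerprint is obtained by vectorizing the interaction features of a molecule into a 1×n vector, and the distance between interaction fingerprints can be computed with methods such as the Tanimoto distance. The fingerprint feature vector of a molecular structure can be built from these interaction fingerprints. For example, fingerprints can be extracted from the three-dimensional structure information of all molecules under a given scaffold class and/or a given structure class, the distance between each pair of fingerprints computed, and a fingerprint-based clustering analysis performed. Molecules in the same class should have similar scaffolds, shapes and/or binding modes. Fingerprint clustering can be regarded as a form of unsupervised clustering.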
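In some embodiments, considering that interactions can be divided into stable and unstable interactions, the stability of the interactions can be analysed further.
Specifically, after the first interaction classes are obtained, or after the second interaction classes are obtained, the method may further include the following operations.
First, for any one of the first interaction classes or the second interaction classes, a representative molecule of the current class is determined. For example, the representative molecule can be the molecule corresponding to the centre of the class.
Then, a molecular dynamics simulation is run on the representative molecule of the current class to obtain its stability feature. The stability feature can take the values stable and unstable.
Next, a sixth mapping between the stability feature and the first interaction class or the second interaction class is determined based on the stability feature of the representative molecule.
In a specific embodiment, a representative molecule of each class can be selected for dynamics simulation. The purpose is to verify whether the interactions remain stable in the simulation results; interactions that are unstable under the dynamics model should be flagged in the final results. Specifically, a 50 ns dynamics simulation can be run on the complex of the successfully docked ligand molecule and the receptor molecule, and the sampled trajectory files collected. The trajectory is analysed to check whether the interactions extracted between the ligand molecule and the protein can still be observed continuously in the sampled stable state. If they can, the interactions formed by the representative molecule remain stable and observable under simulation. The representative molecule can be the cluster-centre molecule. The cluster centre is the object whose distances to the other objects in the class are the most balanced; there can be only one cluster centre, and it can be obtained directly from the clustering algorithm.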
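One common way to pick such a cluster-centre molecule from a precomputed distance matrix is to take the medoid, i.e. the member with the smallest total distance to the other members. The sketch below reads "most balanced distances" as "smallest total distance", which is an interpretation made for this example rather than a definition given in the text; the distances are invented.

```python
def medoid(members, dist):
    """Return the member whose summed distance to all members is smallest.

    members: indices of the molecules in one cluster.
    dist:    square matrix of pairwise distances (e.g. fingerprint distances).
    """
    if not members:
        raise ValueError("cluster is empty")
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


# Hypothetical pairwise distances between three poses of one interaction class.
dist = [
    [0.0, 0.2, 0.6],
    [0.2, 0.0, 0.3],
    [0.6, 0.3, 0.0],
]
representative = medoid([0, 1, 2], dist)
```

The selected pose would then be the one passed on to the dynamics simulation for that class.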
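In some embodiments, to make it easy for users to view the mappings above, the method may further include the following operation.
The SMILES string and the molecular structure of each ligand molecule are stored in association with at least one of the following to obtain a mapping table: scaffold class, structure class, first interaction class or second interaction class.
Specifically, the information produced by all the operations above can be collected into one summary table of the form "molecule (SMILES string) - molecular structure - scaffold class - structure class - interaction class - stability". Drug designers can then choose suitable molecules as hypotheses according to the scaffolds and interaction classes (binding modes) they are interested in, and carry out subsequent synthesis and test validation.
In a specific embodiment, the data comprise 32k mapping records covering 4k distinct SMILES strings. The .csv relation file can be loaded with the Python pandas library, and all values in the "smiles" column of the table are the SMILES strings. The Bemis-Murcko scaffold of each SMILES string is extracted as described above, and identical scaffolds are merged. The node information can also be represented and stored as a scaffold graph for visualization. For example, the Python NetworkX library can be used to store the scaffold graph and draw a node graph such as the one in Fig. 7.
Fig. 7 shows the membership relationship between SMILES strings and scaffolds at a glance, and each bottom-level leaf node represents a cluster of molecules that share the same terminal scaffold. Numbering each node of the scaffold graph yields the scaffold numbers of the scaffold-based clustering. These numbers are also written back to the input .csv relation table for persistence, in a column labelled "scaffold_cluster".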
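Structure-based clustering analysis follows. For the scaffold at any node in Fig. 7, the SMILES strings in its cluster and the .sdf files of the corresponding molecular structures are extracted. The difference in occupied volume between each pair of molecular structures is computed, for example with the Shape Protrude Distance method of RDKit, and used as the distance between the two structures. All molecular structures are then clustered, for example with the DBSCAN algorithm of the scikit-learn machine-learning package, to obtain the structure-based structure classes of all molecules under each scaffold class. The structure classes are numbered sequentially and written back to the input .csv relation table, for example in a column labelled "shape_cluster".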
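For readers unfamiliar with density-based clustering on a precomputed distance matrix, the following is a minimal pure-Python sketch of the DBSCAN idea. It is a hand-written illustration, not the scikit-learn implementation named in the text. The eps and min_samples values and the distance matrix are arbitrary examples, and a point counts as its own neighbour, as is conventional.

```python
def dbscan_precomputed(dist, eps, min_samples):
    """Cluster items from a square distance matrix; label -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbours = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbours) < min_samples:
            labels[i] = -1          # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = [k for k in range(n) if dist[j][k] <= eps]
            if len(reach) >= min_samples:
                queue.extend(reach)  # j is a core point, keep expanding
    return labels


# Hypothetical shape distances: poses 0-2 are alike, 3-4 are alike, 5 is an outlier.
D = [
    [0.0, 0.1, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.2, 0.9],
    [0.9, 0.9, 0.9, 0.2, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.9, 0.0],
]
shape_cluster = dbscan_precomputed(D, eps=0.3, min_samples=2)
```

A density-based method suits this step because the number of shape classes under a scaffold is not known in advance.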
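Clustering analysis based on interaction fingerprints is similar to the structure-based clustering analysis. The .sdf files of all molecular structures under each scaffold class are obtained, together with the protein structure file (.pdb) used for docking, and a binding-mode fingerprint recognition algorithm extracts the fingerprint information between the ligand molecule and the protein, as shown below:
6UYB_500ns_frame_ligand_009433_energy_2_isomer_0_split_0_pose_0[0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0]
Each digit in the brackets represents one feature dimension, and the feature values can be obtained by one-hot encoding.
For example, the encoded information includes the interaction type (such as hydrogen-bond donor or acceptor, π-π interaction and so on), the index of the small-molecule atom at the interaction site, and the interaction site on the protein's amino-acid residue.
For example, if the current small molecule and the protein form a hydrogen-bond interaction at the amino acid CYS260 and the small molecule is the hydrogen-bond acceptor, the corresponding interaction is encoded as [CYS260_HB_Acceptor,…].
In this way, each small molecule forms several interactions with the protein, i.e. each small molecule has an interaction list of the form above, such as [CYS260_HB_Acceptor,…].
All interactions are collected and arranged in any fixed order to define the interaction feature vector. For any small molecule, an interaction that is present is encoded as 1 and one that is absent as 0. This gives feature vectors of the same length for all molecules, such as [0,0,1,0,…].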
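The encoding step described above can be sketched in a few lines. The interaction label CYS260_HB_Acceptor follows the example in the text; the other labels, the pose names and the choice of alphabetical order as the fixed ordering are assumptions made for illustration.

```python
def build_fingerprints(interactions_by_pose):
    """Turn per-pose interaction lists into equal-length 0/1 vectors.

    interactions_by_pose: {pose_id: [interaction_label, ...]}
    Returns (vocabulary, {pose_id: bit_vector}).
    """
    # Collect every interaction seen in any pose and fix an order (alphabetical here).
    vocabulary = sorted({key for keys in interactions_by_pose.values() for key in keys})
    position = {key: i for i, key in enumerate(vocabulary)}
    fingerprints = {}
    for pose, keys in interactions_by_pose.items():
        bits = [0] * len(vocabulary)
        for key in keys:
            bits[position[key]] = 1   # present interactions are encoded as 1
        fingerprints[pose] = bits
    return vocabulary, fingerprints


# Hypothetical interaction lists for three docked poses.
interactions_by_pose = {
    "pose_a": ["CYS260_HB_Acceptor", "PHE310_PiPi"],
    "pose_b": ["CYS260_HB_Acceptor"],
    "pose_c": ["ASP189_HB_Donor"],
}
vocabulary, fingerprints = build_fingerprints(interactions_by_pose)
```

Because every pose is encoded against the same vocabulary, any pair of fingerprints can be compared with a vector distance such as the Tanimoto distance.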
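Next, once the interaction features have been encoded into feature vectors, a clustering algorithm can cluster them based on the distances between fingerprints. The fingerprint-based cluster numbers are stored in the input .csv relation table, for example in a column labelled "ifp_cluster".
For the representative molecule of each fingerprint cluster (for example the cluster centre), its small-molecule structure file (.sdf) and the corresponding protein structure file (.pdb) are extracted, and a dynamics simulation of a specified length (such as 100 ns) is run with GROMACS. A number of frames are sampled from the simulated trajectory and their interaction binding modes are identified, in order to verify the validity of the interaction fingerprint. If the interactions indicated by the fingerprint remain stable in the simulation, the molecule is marked 1; if not, it is marked 0. The result is recorded in the input .csv relation table, for example in a column labelled "ifp_valid".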
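The validity flag can be illustrated with a short sketch that starts from interactions already detected in each sampled frame; running the simulation and detecting interactions are outside its scope. The text does not state how persistent an interaction must be to count as stable, so the min_fraction cut-off of 0.5 below is a purely hypothetical choice, as are the frame contents.

```python
def ifp_valid(docked_interactions, frames, min_fraction=0.5):
    """Return 1 if every docked interaction persists in enough frames, else 0.

    docked_interactions: interaction labels found in the docked pose.
    frames:              one set of detected interaction labels per sampled frame.
    min_fraction:        hypothetical persistence cut-off (not from the source).
    """
    if not frames:
        raise ValueError("no trajectory frames were sampled")
    for key in docked_interactions:
        seen = sum(1 for frame in frames if key in frame)
        if seen / len(frames) < min_fraction:
            return 0
    return 1


# Hypothetical interactions detected in four sampled frames.
frames = [
    {"CYS260_HB_Acceptor", "PHE310_PiPi"},
    {"CYS260_HB_Acceptor"},
    {"CYS260_HB_Acceptor"},
    {"CYS260_HB_Acceptor"},
]
flag_single = ifp_valid({"CYS260_HB_Acceptor"}, frames)
flag_both = ifp_valid({"CYS260_HB_Acceptor", "PHE310_PiPi"}, frames)
```

Here the hydrogen bond is seen in every frame, while the π-π contact appears in only one of four and therefore fails the assumed cut-off.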
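An entry of the .csv table after the results above have been consolidated can be as shown in Table 1.
Table 1
[Table 1 is reproduced as an image in the original publication (Figure PCTCN2021142381-appb-000003).]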
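Another aspect of the present application further provides a method for evaluating molecules.
Fig. 9 schematically shows a flowchart of a molecule evaluation method according to an embodiment of the present application.
Referring to Fig. 9, a user can use the mapping table as follows. The method may include operations S910 to S930.
In operation S910, the SMILES string of a molecule to be screened is obtained. For example, the user can enter the SMILES string on a terminal device, and the terminal device can also send the SMILES string to the cloud.
In operation S920, the scaffold of the molecule to be screened is determined based on its SMILES string. The scaffold corresponding to the SMILES string can be generated locally or in the cloud.
In operation S930, the molecule to be screened is evaluated based on its scaffold and on the mappings determined by the method above. The mappings can be stored in the mapping table and include, but are not limited to, at least one of the first to sixth mappings. For example, the entry in the mapping table that corresponds to the molecule to be screened can be found by scaffold matching.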
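A lookup of this kind can be sketched as follows, assuming the mapping table has been loaded as a list of row dictionaries. The column names scaffold_cluster, shape_cluster, ifp_cluster and ifp_valid follow the text; the scaffold key column, the function name and the row values are assumptions made for the example.

```python
def evaluate_candidate(scaffold, table):
    """Summarize what the mapping table knows about a candidate's scaffold.

    scaffold: scaffold of the molecule to be screened (e.g. as a SMILES string).
    table:    rows of the mapping table, each a dict with the columns used below.
    Returns None when no docked ligand shares the scaffold.
    """
    rows = [row for row in table if row["scaffold"] == scaffold]
    if not rows:
        return None
    return {
        "scaffold_cluster": rows[0]["scaffold_cluster"],
        "shape_clusters": sorted({row["shape_cluster"] for row in rows}),
        "ifp_clusters": sorted({row["ifp_cluster"] for row in rows}),
        "stable_fraction": sum(row["ifp_valid"] for row in rows) / len(rows),
    }


# Hypothetical mapping-table rows.
table = [
    {"scaffold": "c1ccccc1", "scaffold_cluster": 3, "shape_cluster": 0, "ifp_cluster": 1, "ifp_valid": 1},
    {"scaffold": "c1ccccc1", "scaffold_cluster": 3, "shape_cluster": 1, "ifp_cluster": 1, "ifp_valid": 0},
    {"scaffold": "c1ccncc1", "scaffold_cluster": 5, "shape_cluster": 0, "ifp_cluster": 0, "ifp_valid": 1},
]
report = evaluate_candidate("c1ccccc1", table)
```

Exact scaffold matching is the simplest option; a fuller implementation might also walk up the scaffold hierarchy when no exact match exists.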
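As another example, a user can pick scaffold, shape and interaction clusters of interest, select representative molecules from those clusters for the subsequent synthesis and testing stage, and use the synthesis and test results to verify whether such scaffolds, shapes and interactions help drug design for the target protein. A batch of molecules can also be assembled from different scaffold, shape and interaction clusters for targeted controlled experiments, which speeds up structure-based drug development.
This embodiment aggregates information on ligand molecules based on molecular structure, scaffold and interactions. Compared with filtering by a single docking score and threshold, it is less affected by the accuracy of the docking scoring algorithm and takes into account the structural information of the ligand molecule bound to the receptor molecule, which is closer to how drug designers advance drug development.
This embodiment splits the screening process into three successive dimensions, "scaffold clustering", "structure clustering" and "interaction clustering", which carry more information than a docking score along a single dimension.
In the multi-dimensional classification and aggregation, "scaffold clustering" comes first, so that the whole screening process takes differences in molecular scaffold as the root category, which is closer to the working habits of drug designers. Note that the two dimensions "structure clustering" and "interaction clustering" need not be performed in any particular order.
This embodiment verifies the stability of the interactions (binding modes) by dynamics simulation, which safeguards the reliability of the interaction fingerprints that are introduced.
Another aspect of the present application further provides a design method.
Fig. 10 schematically shows a flowchart of a design method according to an embodiment of the present application.
As shown in Fig. 10, the design method includes operations S1010 and S1020.
In operation S1010, a molecule screening result obtained by the method above is displayed.
In operation S1020, drug design or material design is performed based on the molecule screening result.
For the process of screening molecules, refer to the related content above; it is not repeated here.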
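Another aspect of the present application further provides an apparatus for screening molecules.
Fig. 11 schematically shows a block diagram of an apparatus for screening molecules according to an embodiment of the present application.
As shown in Fig. 11, the apparatus 1100 for screening molecules may include a first-mapping obtaining module 1110, a scaffold extraction module 1120, a scaffold aggregation module 1130 and a second-mapping determining module 1140.
The first-mapping obtaining module 1110 obtains a first mapping between the SMILES strings of M ligand molecules and N molecular structures, where each of the M SMILES strings carries structural information and M and N are integers greater than or equal to 1.
The scaffold extraction module 1120 performs, for at least some of the SMILES strings of the M ligand molecules, scaffold extraction on the structural information of each SMILES string to obtain O molecular scaffolds, where O is an integer greater than or equal to 1 and O is less than or equal to M.
The scaffold aggregation module 1130 aggregates the O molecular scaffolds to obtain P molecular scaffold classes, where P is an integer greater than or equal to 1 and P is less than or equal to O.
The second-mapping determining module 1140 determines, based on the first mapping, a second mapping between the P molecular scaffold classes and the N molecular structures, so that ligand molecules matching the target receptor molecule can be screened based on the second mapping.
In some embodiments, the apparatus 1100 further includes a scaffold-class structure obtaining module, a volume-difference determining module, a structure clustering module and a third-mapping determining module.
The scaffold-class structure obtaining module obtains, for each of at least some of the P scaffold classes, the molecular structures corresponding to at least some of the ligand molecules of that scaffold class.
The volume-difference determining module determines the pairwise volume differences among those ligand molecules from their corresponding molecular structures.
The structure clustering module clusters the ligand molecules of that scaffold class based on the volume differences to obtain a plurality of structure classes.
The third-mapping determining module determines, based on the first mapping, a third mapping between the structure classes and the molecular structures.
In some embodiments, the volume-difference determining module includes a grid division unit, an occupancy-proportion determining unit, an occupancy-vector building unit and a volume-difference determining unit.
The grid division unit divides the pocket region of the target receptor molecule into a grid.
The occupancy-proportion determining unit determines the proportion of the grid occupied by the molecular structure of each ligand molecule.
The occupancy-vector building unit builds a grid-space occupancy vector for the molecular structure of each ligand molecule based on the occupancy proportion.
The volume-difference determining unit determines the pairwise volume differences among the ligand molecules from the grid-space occupancy vectors of their molecular structures.
In some embodiments, the volume-difference determining unit includes a distance determining subunit and a volume-difference determining subunit.
The distance determining subunit determines the distance between the grid-space occupancy vectors of the molecular structures of two ligand molecules.
The volume-difference determining subunit determines, from the distance between the grid-space occupancy vectors, the volume difference between the spaces occupied by the molecular structures of the two ligand molecules.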
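In some embodiments, the apparatus 1100 further includes an interaction-feature obtaining module, an interaction-difference determining module, an interaction-difference clustering module and an interaction-mapping determining module.
The interaction-feature obtaining module obtains, for each of at least some of the structure classes, first interaction features between the target receptor molecule and each of at least some of the molecular structures of that structure class, and/or obtains, for each of at least some of the P scaffold classes, second interaction features between the target receptor molecule and each of at least some of the molecular structures of that scaffold class.
The interaction-difference determining module determines the pairwise first interaction differences among the first interaction features of the molecular structures, and/or determines the pairwise second interaction differences among the second interaction features of the ligand molecules.
The interaction-difference clustering module clusters the first interaction differences to obtain a plurality of first interaction classes, and/or clusters the second interaction differences to obtain a plurality of second interaction classes.
The interaction-mapping determining module determines, based at least on the first mapping, a fourth mapping between the first interaction classes and the molecular structures, and/or determines, based on the first mapping, a fifth mapping between the second interaction classes and the molecular structures.
In some embodiments, the interaction-difference determining module includes an interaction-feature-vector determining unit and an interaction-difference determining unit.
The interaction-feature-vector determining unit determines the interaction feature vectors corresponding to the second interaction features.
The interaction-difference determining unit repeats the following until the interaction difference between any two of the second interaction features of the ligand molecules has been determined: determining the distance between the interaction feature vectors corresponding to two second interaction features; and determining the second interaction difference between the two ligand molecules from that distance.
In some embodiments, the apparatus 1100 further includes a representative-molecule determining module, a stability-feature obtaining module and a stability-mapping determining module.
The representative-molecule determining module determines, for any one of the first interaction classes or the second interaction classes, a representative molecule of the current class.
The stability-feature obtaining module runs a molecular dynamics simulation on the representative molecule of the current class to obtain its stability feature.
The stability-mapping determining module determines, based on the stability feature of the representative molecule, a sixth mapping between the stability feature and the first interaction class or the second interaction class.
In some embodiments, the apparatus 1100 further includes an associated-storage module that stores the SMILES string and the molecular structure of each ligand molecule in association with at least one of the following to obtain a mapping table: scaffold class, structure class, first interaction class or second interaction class.
In some embodiments, the apparatus 1100 further includes a SMILES obtaining module, a scaffold determining module and an evaluation module.
The SMILES obtaining module obtains the SMILES string of a molecule to be screened.
The scaffold determining module determines the scaffold of the molecule to be screened based on its SMILES string.
The evaluation module evaluates the molecule to be screened based on its scaffold and the mapping table.
In some embodiments, the apparatus 1100 further includes a like-class merging module that merges like classes among the P scaffold classes to obtain a multi-level set of molecular scaffolds, in which a parent scaffold corresponds to at least one child scaffold, a bottom-level scaffold corresponds to at least one ligand molecule, and the structure of a child scaffold is more complex than that of its parent scaffold.
In some embodiments, the apparatus 1100 further includes a scaffold-graph generating module that generates a scaffold graph. The scaffold graph includes a plurality of nodes: the non-terminal nodes represent at least some of the scaffolds in the multi-level set, each terminal node represents the cluster of molecules, among the M ligand molecules, that contain the scaffold corresponding to that terminal node, and each parent node corresponds to at least one child node.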
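Another aspect of the present application further provides an apparatus for evaluating molecules.
Fig. 12 schematically shows a block diagram of an apparatus for evaluating molecules according to an embodiment of the present application.
Referring to Fig. 12, the apparatus 1200 for evaluating molecules may include a SMILES obtaining module 1210, a candidate scaffold obtaining module 1220 and a molecule evaluation module 1230.
The SMILES obtaining module 1210 obtains the SMILES string of a molecule to be screened.
The candidate scaffold obtaining module 1220 determines the scaffold of the molecule to be screened based on its SMILES string.
The molecule evaluation module 1230 evaluates the molecule to be screened based on its scaffold and on the mappings determined by the apparatus 1100 above, the mappings including at least one of the first to sixth mappings.
Another aspect of the present application further provides a design apparatus.
Fig. 13 schematically shows a block diagram of a design apparatus according to an embodiment of the present application.
Referring to Fig. 13, the design apparatus 1300 may include a screening-result display module 1310 and a design module 1320.
The screening-result display module 1310 displays a molecule screening result obtained with the apparatus 1100 above.
The design module 1320 performs drug design or material design based on the molecule screening result.
For the apparatus 1100 for screening molecules, the apparatus 1200 for evaluating molecules and the design apparatus 1300 in the embodiments above, the specific way in which each module and unit performs its operations has been described in detail in the method embodiments and is not elaborated here.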
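Another aspect of the present application further provides an electronic device.
Fig. 14 schematically shows a block diagram of an electronic device implementing a method for screening molecules according to an embodiment of the present application.
Referring to Fig. 14, the electronic device 1400 includes a memory 1410 and a processor 1420.
The processor 1420 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 1410 may include various types of storage units, such as system memory, read-only memory (ROM) and a permanent storage device. The ROM can store static data or instructions needed by the processor 1420 or other modules of the computer. The permanent storage device can be a readable and writable storage device, and can be a non-volatile storage device that does not lose the stored instructions and data even when the computer is powered off. In some implementations, a mass storage device (such as a magnetic or optical disk, or flash memory) is used as the permanent storage device. In other implementations, the permanent storage device can be a removable storage device (such as a floppy disk or an optical drive). The system memory can be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory, and can store some or all of the instructions and data that the processor needs at run time. In addition, the memory 1410 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (such as DRAM, SRAM, SDRAM, flash memory and programmable read-only memory); magnetic disks and/or optical disks may also be used. In some implementations, the memory 1410 may include readable and/or writable removable storage devices, such as compact discs (CD), read-only digital versatile discs (e.g. DVD-ROM, dual-layer DVD-ROM), read-only Blu-ray discs, ultra-density optical discs, flash memory cards (e.g. SD cards, mini SD cards, Micro-SD cards), magnetic floppy disks and so on. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 1410 stores executable code which, when processed by the processor 1420, can cause the processor 1420 to perform some or all of the methods described above.
In addition, the method according to the present application can also be implemented as a computer program or computer program product that includes computer program code instructions for performing some or all of the steps of the method above.
Alternatively, the present application can also be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium, or a machine-readable storage medium) storing executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or a server, etc.), causes the processor to perform some or all of the steps of the method above.
The embodiments of the present application have been described above. The description is exemplary rather than exhaustive and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used here were chosen to best explain the principles of the embodiments, their practical application or their improvement over technology in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed here.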

Claims (28)

  1. 一种筛选分子的方法,其特征在于,所述方法包括:
    获得M个配体分子的简化分子线性式与N个分子结构之间的第一映射关系,所述M个配体分子的简化分子线性式各自具有结构信息,M、N是大于或者等于1的整数;
    对于所述M个配体分子的简化分子线性式中的至少部分,分别对所述简化分子线性式的结构信息进行骨架提取,得到O个分子骨架,O是大于或者等于1的整数,并且O小于或者等于M;
    聚合所述O个分子骨架,得到P个分子骨架类,P是大于或者等于1的整数,并且P小于或者等于O;
    基于所述第一映射关系确定所述P个分子骨架类与所述N个分子结构之间的第二映射关系,以便基于所述第二映射关系筛选与目标受体分子匹配的配体分子。
  2. 根据权利要求1所述的方法,其特征在于,在所述得到P个分子骨架类之后,所述方法还包括:
    对于所述P个分子骨架类中至少部分骨架类中的每个类,获取与该骨架类对应的至少部分配体分子各自对应的分子结构;
    基于所述至少部分配体分子各自对应的分子结构确定所述至少部分配体分子两两之间的体积差别;
    基于所述体积差别对与该骨架类对应的至少部分配体分子进行聚类,得到多个结构类;
    基于所述第一映射关系确定所述多个结构类与所述分子结构之间的第三映射关系。
  3. 根据权利要求2所述的方法,其特征在于,所述基于所述至少部分配体分子各自对应的分子结构确定所述至少部分配体分子两两之间的体积差别,包括:
    将目标受体分子的口袋区域划分为网格;
    确定所述至少部分配体分子各自对应的分子结构对所述网格的占据比例;
    基于所述占据比例构建所述至少部分配体分子各自对应的分子结构的网格空间占据向量;
    基于所述至少部分配体分子各自对应的分子结构的网格空间占据向量确定所述至少部分配体分子两两之间的体积差别。
  4. 根据权利要求3所述的方法,其特征在于,所述基于所述至少部分配体分子各自对应的分子结构的网格空间占据向量确定所述至少部分配体分子两两之间的体积差别,包括:
    确定两个配体分子各自对应的分子结构的网格空间占据向量之间的距离;
    基于所述网格空间占据向量之间的距离确定所述两个配体分子各自对应的分子结构占据空间的体积差别;
    重复以上操作直至确定所述至少部分配体分子中任意两个配体分子各自对应的分子结构之间的体积差别。
  5. 根据权利要求2所述的方法,其特征在于,在所述得到多个结构类之后,所述方法还包括:
    对于所述多个结构类中至少部分结构类中的每个类,获取与该结构类对应的至少部分分子结构各自与目标受体分子之间的第一相互作用特征,和/或,对于所述P个分子骨架类中至少部分骨架类的每个类,获取与该骨架类对应的至少部分分子结构各自与目标受体分子之间的第二相互作用特征;
    确定至少部分分子结构各自的第一相互作用特征两两之间的第一相互作用差别,和/或,确定至少部分配体分子各自的第二相互作用特征两两之间的第二相互作用差别;
    对所述第一相互作用差别进行聚类,得到多个第一相互作用类,和/或,对所述第二相互作用差别进行聚类,得到多个第二相互作用类;
    至少基于所述第一映射关系确定所述多个第一相互作用类与所述分子结构之间的第四映射关系,和/或,基于所述第一映射关系确定所述多个第二相互作用类与所述分子结构之间的第五映射关系。
  6. 根据权利要求1所述的方法,其特征在于,在所述得到P个分子骨架类之后,所述方法还包括:
    对于所述P个分子骨架类中至少部分骨架类的每个类,获取与该骨架类对应的至少部分配体分子各自与目标受体分子之间的第二相互作用特征;
    确定至少部分配体分子各自的第二相互作用特征两两之间的第二相互作用差别;
    对所述第二相互作用差别进行聚类,得到多个第二相互作用类;
    至少基于所述第一映射关系确定所述多个第二相互作用类与所述分子结构之间的第五映射关系。
  7. 根据权利要求5或6所述的方法,其特征在于,所述确定至少部分配体分子各自的第二相互作用特征两两之间的第二相互作用差别,包括:
    确定与所述第二相互作用特征对应的相互作用特征向量;
    重复以下操作直至确定至少部分配体分子各自的第二相互作用特征中的任意两个之间的相互作用差别:
    确定与所述第二相互作用特征对应的相互作用特征向量之间的距离;
    基于与所述第二相互作用特征对应的相互作用特征向量之间的距离确定两个所述配体分子各自的第二相互作用差别。
  8. 根据权利要求5所述的方法,其特征在于,在所述得到多个第一相互作用类之后,或者在所述得到多个第二相互作用类之后,所述方法还包括:
    对于每个第一相互作用类或者每个第二相互作用类中的任意一类,确定当前类的代表分子;
    对所述当前类的代表分子进行分子动力学模拟,得到所述代表分子的稳定性特征;
    基于所述代表分子的稳定性特征确定所述稳定性特征与所述第一相互作用类或者所述第二相互作用类之间的第六映射关系。
  9. 根据权利要求8所述的方法,其特征在于,还包括:
    相关联地存储所述配体分子的简化分子线性式、所述分子结构以及以下至少一种:分子骨架类、结构类、第一相互作用类或者第二相互作用类,得到映射表。
  10. 根据权利要求1至9中任一项所述的方法,其特征在于,在所述得到P个分子骨架类之后,所述方法还包括:
    对所述P个分子骨架类进行同类合并,得到多级分子骨架集合,其中,所述多级分子骨架集合中的父分子骨架对应至少一个子分子骨架,底层分子骨架对应至少一个配体分子,所述子分子骨架的骨架结构比所述父分子骨架的骨架结构复杂。
  11. 根据权利要求10所述的方法,其特征在于,在所述得到多级分子骨架集合之后,所述方法还包括:
    生成骨架图,其中,所述骨架图包括多个节点,所述多个节点中的非末端节点表示所述多级分子骨架集合中的至少部分分子骨架,所述多个节点中的末端节点表示所述M个配体分子的中的包括与该末端节点对应的骨架的分子簇,所述多个节点中的一个父节点对应至少一个子节点。
  12. 一种评估分子的方法,其特征在于,包括:
    获得待筛选分子的简化分子线性式;
    基于所述待筛选分子的简化分子线性式确定所述待筛选分子的骨架;
    基于所述待筛选分子的骨架和根据权利要求1至11任一项所述的方法确定的多种映射关系对所述待筛选分子进行评估,所述多种映射关系包括:第一映射关系至第六映射关系中至少一种。
  13. 一种设计方法,其特征在于,所述方法包括:
    展示分子筛选结果,所述分子筛选结果是根据权利要求1至12中任一项所述的方法得到的筛选结果;
    基于所述分子筛选结果进行药物设计或者材料设计。
  14. 一种筛选分子的装置,其特征在于,包括:
    第一映射关系获得模块,用于获得M个配体分子的简化分子线性式与N个分子结构之间的第一映射关系,所述M个配体分子的简化分子线性式各自具有结构信息,M、N是大于或者等于1的整数;
    分子骨架提取模块,用于对于所述M个配体分子的简化分子线性式中的至少部分,分别对所述简化分子线性式的结构信息进行骨架提取,得到O个分子骨架,O是大于或者等于1的整数,并且O小于或者等于M;
    分子骨架聚合模块,用于聚合所述O个分子骨架,得到P个分子骨架类,P是大于或者等于1的整数,并且P小于或者等于O;
    第二映射关系确定模块,用于基于所述第一映射关系确定所述P个分子骨架类与所述N个分子结构之间的第二映射关系,以便基于所述第二映射关系筛选与目标受体分子匹配的配体分子。
  15. 根据权利要求14所述的装置,其特征在于,所述装置还包括:
    骨架类分子结构获取模块,用于对于所述P个分子骨架类中至少部分骨架类中的每个类,获取与该骨架类对应的至少部分配体分子各自对应的分子结构;
    体积差别确定模块,用于基于所述至少部分配体分子各自对应的分子结构确定所述至少部分配体分子两两之间的体积差别;
    结构聚类模块,用于基于所述体积差别对与该骨架类对应的至少部分配体分子进行聚类,得到多个结构类;
    第三映射关系确定模块,用于基于所述第一映射关系确定所述多个结构类与所述分子结构之间的第三映射关系。
  16. 根据权利要求15所述的装置,其特征在于,所述体积差别确定模块包括:
    网格划分单元,用于将目标受体分子的口袋区域划分为网格;
    占据比例确定单元,用于确定所述至少部分配体分子各自对应的分子结构对所述网格的占据比例;
    占据向量构建单元,用于基于所述占据比例构建所述至少部分配体分子各自对应的分子结构的网格空间占据向量;
    体积差别确定单元,用于基于所述至少部分配体分子各自对应的分子结构的网格空间占据向量确定所述至少部分配体分子两两之间的体积差别。
  17. 根据权利要求16所述的装置,其特征在于,所述体积差别确定单元,包括:
    距离确定子单元,用于确定两个配体分子各自对应的分子结构的网格空间占据向量之间的距离;
    体积差别确定子单元,用于基于所述网格空间占据向量之间的距离确定所述两个配体分子各自对应的分子结构占据空间的体积差别。
  18. 根据权利要求15所述的装置,其特征在于,所述装置还包括:
    相互作用特征获取模块,用于对于所述多个结构类中至少部分结构类中的每个类,获取与该结构类对应的至少部分分子结构各自与目标受体分子之间的第一相互作用特征,和/或,对于所述P个分子骨架类中至少部分骨架类的每个类,获取与该骨架类对应的至少部分分子结构各自与目标受体分子之间的第二相互作用特征;
    相互作用差别确定模块,用于确定至少部分分子结构各自的第一相互作用特征两两之间的第一相互作用差别,和/或,确定至少部分配体分子各自的第二相互作用特征两两之间的第二相互作用差别;
    相互作用差别聚类模块,用于对所述第一相互作用差别进行聚类,得到多个第一相互作用类,和/或,对所述第二相互作用差别进行聚类,得到多个第二相互作用类;
    作用映射关系确定模块,用于至少基于所述第一映射关系确定所述多个第一相互作用类与所述分子结构之间的第四映射关系,和/或,基于所述第一映射关系确定所述多个第二相互作用类与所述分子结构之间的第五映射关系。
  19. 根据权利要求14所述的装置,其特征在于,所述装置还包括:
    第二相互作用特征获取模块,用于对于所述P个分子骨架类中至少部分骨架类的每个类,获取与该骨架类对应的至少部分配体分子各自与目标受体分子之间的第二相互作用特征;
    相互作用差别确定模块,用于确定至少部分配体分子各自的第二相互作用特征两两之间的第二相互作用差别;
    第二相互作用类获得模块,用于对所述第二相互作用差别进行聚类,得到多个第二相互作用类;
    第五相互作用映射关系确定模块,用于至少基于所述第一映射关系确定所述多个第二相互作用类与所述分子结构之间的第五映射关系。
  20. 根据权利要求18或19所述的装置,其特征在于,所述相互作用差别确定模块,包括:
    相互作用特征向量确定单元,用于确定与所述第二相互作用特征对应的相互作用特征向量;
    相互作用差别确定单元,用于重复以下操作直至确定至少部分配体分子各自的第二相互作用特征中的任意两个之间的相互作用差别:确定与所述第二相互作用特征对应的相互作用特征向量之间的距离;基于与所述第二相互作用特征对应的相互作用特征向量之间的距离确定两个所述配体分子各自的第二相互作用差别。
  21. 根据权利要求18所述的装置,其特征在于,所述装置还包括:
    代表分子确定模块,用于对于每个第一相互作用类或者每个第二相互作用类中的任意一类,确定当前类的代表分子;
    稳定性特征获得模块,用于对所述当前类的代表分子进行分子动力学模拟,得到所述代表分子的稳定性特征;
    稳定性映射关系确定模块,用于基于所述代表分子的稳定性特征确定所述稳定性特征与所述第一相互作用类或者所述第二相互作用类之间的第六映射关系。
  22. 根据权利要求21所述的装置,其特征在于,还包括:
    关联存储模块,用于相关联地存储所述配体分子的简化分子线性式、所述分子结构以及以下至少一种:分子骨架类、结构类、第一相互作用类或者第二相互作用类,得到映射表。
  23. 根据权利要求14至22中任一项所述的装置,其特征在于,所述装置还包括:
    同类合并模块,用于对所述P个分子骨架类进行同类合并,得到多级分子骨架集合,其中,所述多级分子骨架集合中的父分子骨架对应至少一个子分子骨架,底层分子骨架对应至少一个配体分子,所述子分子骨架的骨架结构比所述父分子骨架的骨架结构复杂。
  24. 根据权利要求23所述的装置,其特征在于,在所述得到多级分子骨架集合之后,所述装置还包括:
    骨架图生成模块,用于生成骨架图,其中,所述骨架图包括多个节点,所述多个节点中的非末端节点表示所述多级分子骨架集合中的至少部分分子骨架,所述多个节点中的末端节点表示所述M个配体分子的中的包括与该末端节点对应的骨架的分子簇,所述多个节点中的一个父节点对应至少一个子节点。
  25. 一种评估分子的装置,其特征在于,包括:
    简化分子线性式获得模块,用于获得待筛选分子的简化分子线性式;
    待筛选分子骨架获得模块,用于基于所述待筛选分子的简化分子线性式确定所述待筛选分子的骨架;
    分子评估模块,用于基于所述待筛选分子的骨架和根据权利要求14所述的装置确定的多种映射关系对所述待筛选分子进行评估,所述多种映射关系包括:第一映射关系至第六映射关系中至少一种。
  26. 一种设计装置,其特征在于,包括:
    筛选结果展示模块,用于展示分子筛选结果,所述分子筛选结果是根据权利要求14所述的装置得到的筛选结果;
    设计模块,用于基于所述分子筛选结果进行药物设计或者材料设计。
  27. 一种电子设备,其特征在于,包括:
    处理器;以及
    存储器,其上存储有可执行代码,当所述可执行代码被所述处理器执行时,使所述处理器执行如权利要求1-13中任一项所述的方法。
  28. 一种计算机可读存储介质,其特征在于,其上存储有可执行代码,当所述可执行代码被电子设备的处理器执行时,使所述处理器执行如权利要求1-13中任一项所述的方法。
PCT/CN2021/142381 2021-12-29 2021-12-29 筛选分子的方法、装置及其应用 WO2023123023A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/142381 WO2023123023A1 (zh) 2021-12-29 2021-12-29 筛选分子的方法、装置及其应用

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/142381 WO2023123023A1 (zh) 2021-12-29 2021-12-29 筛选分子的方法、装置及其应用

Publications (1)

Publication Number Publication Date
WO2023123023A1 true WO2023123023A1 (zh) 2023-07-06

Family

ID=86996772

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/142381 WO2023123023A1 (zh) 2021-12-29 2021-12-29 筛选分子的方法、装置及其应用

Country Status (1)

Country Link
WO (1) WO2023123023A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003019140A2 (en) * 2001-08-23 2003-03-06 Deltagen Research Laboratories, L.L.C. Method for molecular subshape similarity matching
US20180101641A1 (en) * 2015-03-23 2018-04-12 New York University Systems and methods of fragment-centric topographical mapping (fctm) to target protein-protein interactions
CN112053742A (zh) * 2020-07-23 2020-12-08 中南大学湘雅医院 分子靶标蛋白的筛选方法、装置、计算机设备和存储介质
CN112201313A (zh) * 2020-09-15 2021-01-08 北京晶派科技有限公司 一种自动化的小分子药物筛选方法和计算设备
CN113096723A (zh) * 2021-03-24 2021-07-09 北京晶派科技有限公司 小分子药物筛选通用分子库构建平台

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21969398

Country of ref document: EP

Kind code of ref document: A1