CN116246696A

CN116246696A - Ligand docking pose virtual screening method based on fast retrieval

Info

Publication number: CN116246696A
Application number: CN202310319885.0A
Authority: CN
Inventors: 陈晓健; 顾彦慧; 刘畅; 张先锋; 夏浩辉; 李杨; 李亚飞; 王金兰
Original assignee: Nanjing Normal University
Current assignee: Nanjing Normal University
Priority date: 2023-03-29
Filing date: 2023-03-29
Publication date: 2023-06-09

Abstract

The invention discloses a ligand docking pose virtual screening method based on fast retrieval, comprising the following steps: preprocessing ligand conformation data and building an index table to obtain a retrieval-tree spatial index structure; inputting a known active ligand conformation as the query, using the retrieval-tree spatial index structure to quickly retrieve and screen candidate conformations similar to the query conformation, and taking the k most similar query results as the top-k conformation result; and evaluating the retrieved top-k conformation result by computing and outputting the actual RMSD values between the top-k candidate conformations and the native conformation, verifying the accuracy of the screening result, and further optimizing the screening strategy. Based on the spatial data of the ligand molecules, the invention uses a three-dimensional spatial retrieval tree to organize the data and build the index, which narrows the search range, so that the optimal docking pose can be retrieved quickly from massive ligand structure data and prediction performance is effectively improved.

Description

Ligand docking pose virtual screening method based on fast retrieval

Technical Field

The invention belongs to the fields of computer-aided drug design and content-based retrieval, and in particular relates to a ligand docking pose virtual screening method based on fast retrieval.

Background

Predicting the docking pose between a protein and a ligand plays an important role in computer-aided drug development, and improving the efficiency of prediction and screening is one of its key problems. With the advent of protein design technology, more and more candidate proteins are being explored, with increasingly diverse properties and functions, so the demand for rapidly screening out optimal ligand docking poses keeps rising. New proteins keep emerging, while the associated protein property data and ligand docking pose data struggle to keep pace, which is a major difficulty in computer-aided drug prediction.

Traditional methods generally generate many candidate docking poses and then screen out the group of poses that best satisfies the given conditions. Pose prediction and screening of this kind relies on existing knowledge to weigh intramolecular forces, intermolecular forces, and the local and global information of the protein, and the search proceeds in a manner close to "blind search", so it suffers from long run times, high cost, and insufficient accuracy. In addition, neural-network-based docking pose prediction generally requires massive, well-labeled docking pose data; existing biomedical data are still limited and imperfectly labeled, and existing methods inevitably lose or ignore some important intermolecular information.

Drug discovery methods also include high-throughput-screening-based drug discovery (HTSBDD), structure-based drug discovery (SBDD), and others. High-throughput-screening-based drug discovery uses phenotypic screening, but it depends on biochemical experiments and is therefore less efficient; in structure-based drug discovery, methods built on molecular docking computation demand high computational performance, and accuracy remains a bottleneck.

Content-based fast retrieval can exploit an existing information base and screen it quickly through an index over feature values, which improves retrieval efficiency while preserving the validity of the results. Current methods for retrieving spatial data fall mainly into spatial indexing methods and dimensionality reduction methods. A spatial index is a data structure for organizing and managing spatial data that effectively narrows the search range and improves query efficiency. Spatial indexing has been applied successfully in geographic information systems, computer graphics, robot navigation, and other fields; in geographic information systems (GIS) in particular, it plays an important role in traffic flow analysis and land-use planning. Indexing is also used in drug discovery: fragment-based drug discovery (FBDD) takes small molecular fragments as starting points and searches for lead compounds that bind the target by indexing or other means. However, that work does not address the spatial data carried by ligands during pose docking, nor does it use a corresponding index to optimize the retrieval strategy.

Disclosure of Invention

Purpose of the invention: to overcome the shortcomings of the prior art, a ligand docking pose virtual screening method based on fast retrieval is provided. The method adopts a spatial retrieval tree, indexes ligand conformations by the spatial positional relationships of their atoms, and optimizes the splitting and screening strategies. It avoids both the complexity of step-by-step trial and error in traditional prediction methods and the loss of important intermolecular information in machine-learning prediction methods, makes full use of existing ligand conformation information, and greatly improves the accuracy and efficiency of ligand docking pose prediction.

Technical solution: to achieve the above purpose, the invention provides a ligand docking pose virtual screening method based on fast retrieval, comprising the following steps:

S1: preprocessing the ligand conformation data and building an index table to obtain a retrieval-tree spatial index structure;

S2: inputting a known active ligand conformation, using the retrieval-tree spatial index structure to quickly retrieve and screen candidate conformations similar to the query conformation, and taking the k most similar query results as the top-k conformation result;

S3: evaluating the retrieved top-k conformation result by computing and outputting the actual RMSD values between the top-k candidate conformations and the native conformation, verifying the accuracy of the screening result, and further optimizing the screening strategy.

Further, in step S1, the biological information data of the ligand conformations are preprocessed on the basis of the spatial positional relationships of the ligand conformations, with the following specific steps:

A1: processing the CASF-2016 ligand docking candidate data set from PDBbind to obtain the biological information of the different conformations of a single ligand, together with the RMSD values between the different candidate conformations and the native ligand conformation;

A2: extracting the spatial positions of the atoms in each conformation and converting the corresponding three-dimensional structure into a set of feature points, each feature point comprising coordinate and type information; all feature points of a conformation are enclosed by one minimum bounding box in space;

A3: building a retrieval tree from the feature point set as a hierarchy, in which each leaf node stores one feature point and each non-leaf node stores the minimum bounding box of its child nodes.
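Steps A2 and A3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `FeaturePoint` and `Node` classes, the atom-type labels, and the rule of splitting at the median along the longest bounding-box axis are all assumptions made here, since the text only requires that leaves hold feature points and that non-leaf nodes hold the minimum bounding box of their children.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class FeaturePoint:
    coord: Point     # atomic position (x, y, z)
    atom_type: str   # atom type label; the label format is an assumption


@dataclass
class Node:
    box_min: Point                        # lower corner of the minimum bounding box
    box_max: Point                        # upper corner of the minimum bounding box
    children: List["Node"] = field(default_factory=list)
    point: Optional[FeaturePoint] = None  # set only on leaf nodes


def bounding_box(points: List[FeaturePoint]) -> Tuple[Point, Point]:
    """Axis-aligned minimum bounding box of a non-empty set of feature points."""
    xs, ys, zs = zip(*(p.coord for p in points))
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def build_tree(points: List[FeaturePoint]) -> Node:
    """Build the hierarchy top-down (points must be non-empty)."""
    box_min, box_max = bounding_box(points)
    if len(points) == 1:
        return Node(box_min, box_max, point=points[0])
    # Assumed split rule: median cut along the longest box axis.
    axis = max(range(3), key=lambda a: box_max[a] - box_min[a])
    ordered = sorted(points, key=lambda p: p.coord[axis])
    mid = len(ordered) // 2
    children = [build_tree(ordered[:mid]), build_tree(ordered[mid:])]
    return Node(box_min, box_max, children=children)


def count_leaves(node: Node) -> int:
    """Number of feature points stored below a node."""
    if node.point is not None:
        return 1
    return sum(count_leaves(child) for child in node.children)


# Hypothetical three-atom conformation used only for illustration.
conformation = [
    FeaturePoint((0.0, 0.0, 0.0), "C"),
    FeaturePoint((1.0, 2.0, 0.5), "N"),
    FeaturePoint((-1.0, 0.5, 3.0), "O"),
]
root = build_tree(conformation)
```

The root's box covers the whole conformation, which is the property the retrieval step relies on to discard entire subtrees without visiting their points.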

Further, the biological information of the different conformations of a single ligand in step A1 includes molecular system information, constituent atom information, bond information, and substructure information.

Further, step S2 specifically includes the following steps:

B1: extracting the spatial positions of the atoms in the known active ligand conformation and converting its three-dimensional structure into a set of feature points, each feature point comprising coordinate and type information;

B2: searching for the feature point set of the known active ligand conformation in the established retrieval tree, matching it level by level against the candidate conformations to be screened, and computing an RMSD value, which serves as the similarity score, from the number of matched feature points and the distances between them;

B3: ranking the candidate conformations to be screened by similarity score and selecting the best-scoring candidates as the candidate ligand conformations, which gives the top-k docking pose result.

Further, in step B2, the feature point set of the known active ligand conformation is searched for in the established retrieval tree, and the search comprises a top-down search and a bottom-up search;

the top-down search is as follows: starting from the root node, first locate the region that contains the instance, then subdivide that region according to the attributes of the target to find the region at the next level, and iterate in this way until the conformation output result is obtained;

the bottom-up search is as follows: first determine the positional relationships between atoms from the preprocessed data, then distinguish the different instances by means of clustering and metric learning.
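The top-down search can be illustrated with a branch-and-bound nearest-point query over a bounding-box tree. This is only a sketch under assumptions: the `Node`, `leaf`, `parent`, `box_distance`, and `nearest` names are hypothetical, the query is a single point rather than a full conformation, and the pruning rule (skip a subtree whose box is no closer than the best point found so far) is the standard one for such trees rather than anything stated in the text. The bottom-up search is not covered.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class Node:
    box_min: Point
    box_max: Point
    children: List["Node"] = field(default_factory=list)
    point: Optional[Point] = None  # set only on leaf nodes


def leaf(p: Point) -> Node:
    """A leaf whose bounding box degenerates to the point itself."""
    return Node(p, p, point=p)


def parent(children: List[Node]) -> Node:
    """A non-leaf node holding the minimum bounding box of its children."""
    box_min = tuple(min(c.box_min[a] for c in children) for a in range(3))
    box_max = tuple(max(c.box_max[a] for c in children) for a in range(3))
    return Node(box_min, box_max, children=list(children))


def box_distance(q: Point, node: Node) -> float:
    """Distance from q to the closest point of the node's box (0 if q is inside)."""
    d2 = 0.0
    for a in range(3):
        delta = max(node.box_min[a] - q[a], 0.0, q[a] - node.box_max[a])
        d2 += delta * delta
    return math.sqrt(d2)


def nearest(q: Point, node: Node, best=(math.inf, None)):
    """Descend from the root, visiting closer regions first and pruning the rest."""
    if box_distance(q, node) >= best[0]:
        return best  # nothing in this region can beat the current best
    if node.point is not None:
        d = math.dist(q, node.point)
        return (d, node.point) if d < best[0] else best
    for child in sorted(node.children, key=lambda c: box_distance(q, c)):
        best = nearest(q, child, best)
    return best


# Hypothetical tree with two well-separated regions.
tree = parent([
    parent([leaf((0.0, 0.0, 0.0)), leaf((1.0, 0.0, 0.0))]),
    parent([leaf((5.0, 5.0, 5.0)), leaf((6.0, 5.0, 5.0))]),
])
distance, hit = nearest((0.9, 0.1, 0.0), tree)
```

In this example the second region is pruned as a whole, because its box lies farther from the query than the point already found in the first region.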

Further, in step B2, a similarity score calculation method based on spatial positional relationships is adopted, specifically as follows:

the similarity score of a candidate conformation based on spatial positional relationships is calculated as:

$$\mathrm{RMSD}_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\lVert x_{ij} - y_i\rVert^{2}} \qquad (1)$$

where $x_{ij}$ denotes the points in the neighborhood of $x_j$, the j-th conformation in the top-k result; $y_i$ denotes the points in the neighborhood of $y$, the native conformation; and $n$ is the total number of atoms contained in a single ligand. The distance between points in formula (1) is defined as the deviation, or deviation error, of the candidate conformation; the smaller the RMSD value, the closer the candidate conformation is to the known active conformation;

to minimize the value of formula (1), that is, to obtain the docking pose closest to the native conformation, the minimum is taken over the set of computed values, and the resulting conformation is the one with the smallest deviation in the candidate conformation library; after the conformation with the smallest deviation has been found in the retrieval tree, the search backtracks upward to find the conformation with the smallest deviation error among the remaining conformations, and this is iterated until the top-k conformation query results are output.
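The scoring and top-k selection described above can be sketched in a few lines. The sketch makes several assumptions that the text does not state: atoms of a candidate and of the reference are listed in the same order, both are in the same coordinate frame (no superposition is performed), and a heap-based selection over all candidates stands in for the tree traversal with backtracking. The names `rmsd`, `top_k`, and the candidate labels are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(candidate: Sequence[Point], reference: Sequence[Point]) -> float:
    """Root-mean-square deviation over n matched atoms (same order, same frame)."""
    if len(candidate) != len(reference) or not reference:
        raise ValueError("conformations must be non-empty and have equal atom counts")
    total = sum(
        (cx - rx) ** 2 + (cy - ry) ** 2 + (cz - rz) ** 2
        for (cx, cy, cz), (rx, ry, rz) in zip(candidate, reference)
    )
    return math.sqrt(total / len(reference))


def top_k(candidates: Dict[str, Sequence[Point]],
          reference: Sequence[Point], k: int) -> List[Tuple[float, str]]:
    """Return the k candidates with the smallest RMSD as (score, name) pairs."""
    scored = ((rmsd(coords, reference), name) for name, coords in candidates.items())
    return heapq.nsmallest(k, scored)


# Hypothetical two-atom reference and three candidate poses.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
candidates = {
    "pose_a": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # identical to the reference
    "pose_b": [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],  # shifted by 1 along z
    "pose_c": [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],  # shifted by 2 along z
}
best_two = top_k(candidates, reference, k=2)
```

Because a smaller RMSD means a closer pose, the selection keeps the smallest scores; an index-backed search would aim to return the same ranking while visiting fewer candidates.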

Further, step S3 is specifically as follows:

C1: evaluating the screening result: computing the RMSD values of all candidate conformations, sorting them to obtain the k conformations with the smallest values, comparing these with the top-k result obtained by retrieval, computing the accuracy, and comparing the time consumed;

C2: optimizing the screening strategy of the retrieval tree: optimizing the splitting strategy, rebuilding the index entries of the retrieval tree, and re-evaluating, so that the strategy with the highest accuracy is found over multiple experiments.
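The evaluation in C1 can be sketched as an overlap measure plus a timing helper. The text does not define how accuracy is computed, so the definition used here, the fraction of the exhaustively computed top-k that also appears in the retrieved top-k, is an assumption, as are the function names `top_k_accuracy` and `timed`.

```python
import time
from typing import Callable, Sequence, Tuple


def top_k_accuracy(retrieved: Sequence[str], exhaustive: Sequence[str]) -> float:
    """Fraction of the exhaustive top-k that the retrieval also returned."""
    if not exhaustive:
        raise ValueError("exhaustive top-k must not be empty")
    return len(set(retrieved) & set(exhaustive)) / len(exhaustive)


def timed(fn: Callable, *args) -> Tuple[object, float]:
    """Run fn(*args) and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical conformation identifiers.
accuracy = top_k_accuracy(["pose_a", "pose_b", "pose_c"],
                          ["pose_a", "pose_c", "pose_d"])
ranking, seconds = timed(sorted, [3, 1, 2])
```

Running `timed` once on the exhaustive ranking and once on the index-backed retrieval gives the time comparison, and repeating the measurement after each change to the splitting strategy supports the re-evaluation loop in C2.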

Further, the splitting strategies optimized in step C2 include linear splitting, binary splitting, quadtree splitting, and the like.

The screening in step C2 of the invention is based mainly on the idea of the Threshold Algorithm (TA). When regions are merged, the two regions judged most closely related need to be merged together, so the weights of the different attribute parameters affect the ranking and, through it, the merging and splitting strategy.
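For readers unfamiliar with it, the Threshold Algorithm finds the top-k items under a monotone aggregate of several per-attribute scores without scoring every item. The sketch below is the generic algorithm with a weighted sum, not the adaptation used by the invention, which the text does not detail. The attribute lists, the weights, and the convention that higher scores are better are assumptions; weights must be non-negative and every item must appear in every list.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def threshold_algorithm(lists: Sequence[Dict[str, float]],
                        weights: Sequence[float],
                        k: int) -> List[Tuple[str, float]]:
    """Top-k items by weighted sum of per-attribute scores (higher is better)."""
    def aggregate(scores):
        return sum(w * s for w, s in zip(weights, scores))

    # Sorted access: one list per attribute, best score first.
    sorted_lists = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    seen: Dict[str, float] = {}
    for depth in range(len(sorted_lists[0])):
        last_scores = []
        for ranked in sorted_lists:
            item, score = ranked[depth]
            last_scores.append(score)
            if item not in seen:
                # Random access: fetch the item's score in every attribute list.
                seen[item] = aggregate([d[item] for d in lists])
        # No unseen item can score above the aggregate of the last scores read.
        threshold = aggregate(last_scores)
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best  # early stop: the top-k cannot change any more
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical attribute scores for three regions and equal weights.
attribute_lists = [
    {"a": 0.9, "b": 0.8, "c": 0.1},
    {"a": 0.2, "b": 0.7, "c": 0.9},
]
winner = threshold_algorithm(attribute_lists, weights=[0.5, 0.5], k=1)
```

Changing the weights changes the aggregate ranking, which is the mechanism by which attribute weights can influence which regions are merged or split.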

Based on the spatial data of the ligand molecules, the invention uses a three-dimensional spatial retrieval tree to organize the data and build the index, which narrows the search range, so that the optimal docking pose can be retrieved quickly from massive ligand structure data and prediction performance is effectively improved.

The invention uses the spatial positional relationships of the different conformations of a ligand to index the ligand database, so that the top-k docking poses most similar to a known conformation can be screened quickly from a large-scale ligand candidate library while the accuracy of the screened conformations is preserved.

Beneficial effects: compared with the prior art, the invention adopts a spatial retrieval tree, indexes ligand conformations by the spatial positional relationships of their atoms, and optimizes the splitting and screening strategies. It avoids both the complexity of step-by-step trial and error in traditional prediction methods and the loss of important intermolecular information in machine-learning prediction methods, makes full use of existing ligand conformation information, and greatly improves the accuracy and efficiency of ligand docking pose prediction. The optimal docking pose can be retrieved quickly from massive ligand structure data, which is of significant value for computer-aided drug design in drug discovery.

Drawings

FIG. 1 is a schematic overall flow diagram of the method of the present invention;

FIG. 2 is a schematic diagram of the conformation preprocessing flow for ligand docking poses according to the present invention;

FIG. 3 is a schematic diagram of the retrieval query flow for ligand docking poses according to the present invention.

Detailed Description

The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments merely illustrate the invention and do not limit its scope; equivalent modifications made by those skilled in the art after reading the invention fall within the scope defined by the appended claims.

The invention provides a ligand docking pose virtual screening method based on fast retrieval which, as shown in FIG. 1, comprises the following steps:

Referring to FIG. 2, in step S1 of this embodiment, the biological information data of the ligand conformations are preprocessed on the basis of the spatial positional relationships of the ligand conformations, with the following specific steps:

A1: processing the CASF-2016 ligand docking candidate data set from PDBbind to obtain the biological information of the different conformations of a single ligand, including molecular system information, constituent atom information, bond information, substructure information, and the like, together with the RMSD values between the different candidate conformations and the native ligand conformation;

Referring to FIG. 3, step S2 of this embodiment specifically includes the following steps:

In this step, the feature point set of the known active ligand conformation is searched for in the established retrieval tree, and the search comprises a top-down search and a bottom-up search;

In this embodiment, a similarity score calculation method based on spatial positional relationships is adopted, specifically as follows:

$$\mathrm{RMSD}_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\lVert x_{ij} - y_i\rVert^{2}} \qquad (1)$$

where $x_{ij}$ denotes the points in the neighborhood of $x_j$, the j-th conformation in the top-k result; $y_i$ denotes the points in the neighborhood of $y$, the native conformation; and $n$ is the total number of atoms contained in a single ligand. The distance between points in formula (1) is defined as the deviation, or deviation error, of the candidate conformation; the smaller the RMSD value, the closer the candidate conformation is to the known active conformation;

The specific process of step S3 in this embodiment is as follows:

C2: optimizing the screening strategy of the retrieval tree: optimizing the splitting strategy, including linear splitting, binary splitting, quadtree splitting, and the like, rebuilding the index entries of the retrieval tree, and re-evaluating, so that the strategy with the highest accuracy is found over multiple experiments.

This embodiment also provides a ligand docking pose virtual screening system based on fast retrieval, comprising a network interface, a memory, and a processor, wherein the network interface is used for receiving and sending signals when exchanging information with other external network elements; the memory stores computer program instructions executable on the processor; and the processor, when executing the computer program instructions, performs the steps of the virtual screening method described above.

The present embodiment also provides a computer storage medium storing a computer program which, when executed by a processor, implements the method described above. The computer-readable medium may be considered tangible and non-transitory. Non-limiting examples of non-transitory tangible computer readable media include non-volatile memory circuits (e.g., flash memory circuits, erasable programmable read-only memory circuits, or masked read-only memory circuits), volatile memory circuits (e.g., static random access memory circuits or dynamic random access memory circuits), magnetic storage media (e.g., analog or digital magnetic tape or hard disk drives), and optical storage media (e.g., CDs, DVDs, or blu-ray discs), among others. The computer program includes processor-executable instructions stored on at least one non-transitory tangible computer-readable medium. The computer program may also include or be dependent on stored data. The computer programs may include a basic input/output system (BIOS) that interacts with the hardware of the special purpose computer, device drivers that interact with particular devices of the special purpose computer, one or more operating systems, user applications, background services, background applications, and so forth.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. A ligand docking pose virtual screening method based on fast retrieval, characterized by comprising the following steps:

2. The ligand docking pose virtual screening method based on fast retrieval according to claim 1, characterized in that in step S1 the biological information data of the ligand conformations are preprocessed on the basis of the spatial positional relationships of the ligand conformations, with the following specific steps:

A1: processing the ligand docking candidate data set to obtain the biological information of the different conformations of a single ligand, together with the RMSD values between the different candidate conformations and the native ligand conformation;

3. The ligand docking pose virtual screening method based on fast retrieval according to claim 2, characterized in that the biological information of the different conformations of a single ligand in step A1 comprises molecular system information, constituent atom information, bond information, and substructure information.

4. The ligand docking pose virtual screening method based on fast retrieval according to claim 1, characterized in that step S2 specifically includes the following steps:

5. The ligand docking pose virtual screening method based on fast retrieval according to claim 4, characterized in that in step B2 the feature point set of the known active ligand conformation is searched for in the established retrieval tree, and the search comprises a top-down search and a bottom-up search;

6. The ligand docking pose virtual screening method based on fast retrieval according to claim 4, characterized in that a similarity score calculation method based on spatial positional relationships is adopted in step B2, specifically as follows:

$$\mathrm{RMSD}_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\lVert x_{ij} - y_i\rVert^{2}} \qquad (1)$$

where $x_{ij}$ denotes the points in the neighborhood of $x_j$, the j-th conformation in the top-k result; $y_i$ denotes the points in the neighborhood of $y$, the native conformation; $n$ is the total number of atoms contained in a single ligand; and the distance between points in formula (1) is defined as the deviation, or deviation error, of the candidate conformation;

7. The ligand docking pose virtual screening method based on fast retrieval according to claim 1, characterized in that step S3 is specifically as follows:

8. The ligand docking pose virtual screening method based on fast retrieval according to claim 7, characterized in that the splitting strategies optimized in step C2 comprise linear splitting, binary splitting, and quadtree splitting.