CN115862768B

CN115862768B - Optimization method for large-scale drug virtual screening

Info

Publication number: CN115862768B
Application number: CN202211586978.1A
Authority: CN
Inventors: 柯浩; 彭延飞; 赵丽敏; 吴霞
Original assignee: Nanchang University
Current assignee: Nanchang University
Priority date: 2022-12-11
Filing date: 2022-12-11
Publication date: 2023-09-01
Anticipated expiration: 2042-12-11
Also published as: CN115862768A

Abstract

The invention discloses an optimization method for large-scale drug virtual screening, comprising the following steps: S1, using two or more molecular docking programs and their scoring functions to score the protein-binding capacity of each small molecule in a small-molecule compound library; S2, computing the final score and ranking of each small-molecule compound from the scores produced by each docking program and scoring function, using an adjusted consensus scoring method; and S3, taking the final ranking as the virtual screening result. By adjusting the consensus scoring method, the invention finds that the adjusted scoring function gives better overall results than other scoring methods. It further finds that combining consensus scoring with multi-stage screening makes better use of computing resources, markedly reducing computation time and improving computational efficiency while maintaining high early enrichment capability.

Description

Optimization method for large-scale drug virtual screening

Technical Field

The invention relates to virtual screening in drug research and development, and in particular to a molecular-docking-based method for consensus scoring and multi-stage virtual drug screening.

Background

Virtual screening (VS), also known as in silico screening, is an increasingly widely used method for identifying compounds with new activity from large compound libraries. Before any biological activity assay, molecular docking software is used to simulate the interaction between the target and candidate drugs on a computer and to estimate their affinity, which reduces the number of compounds that must actually be screened and improves the efficiency of lead compound discovery. In principle, virtual screening falls into two categories: receptor-based and ligand-based. Receptor-based virtual screening studies the characteristic properties of the binding site of the target protein and its mode of interaction with small-molecule compounds, evaluates the binding capacity between protein and compound with an affinity-related scoring function, and finally selects, from a large number of molecules, compounds with a reasonable binding mode and a high predicted score for subsequent biological activity testing. Ligand-based virtual screening typically starts from small-molecule compounds of known activity and searches a compound database for structures that match them by shape similarity or by a pharmacophore model. The selected compounds are then studied by experimental screening.

Molecular docking is a drug screening approach based on the structure of the target protein. Docking a small-molecule compound to the target and jointly analysing the score and the spatial conformation, including electrostatic, hydrogen-bond, hydrophobic, and van der Waals interactions, reveals the specific interaction mode and binding configuration of the ligand and the receptor macromolecule, explains the origin of a compound's activity, and guides rational optimization of its structure; it can also be used to screen potentially active compounds as a reference for experiments. Because the docking process places no restriction on compound structure, docking-based virtual screening can yield lead compounds with novel structures, and it saves time and cost, shortens the development cycle of new drugs, and promotes the development of innovative drugs. Virtual screening based on receptor structure and molecular docking is a widely used computer-aided drug design method, and many publications report its use to discover inhibitors, agonists, and similar ligands of target proteins. As the core technology of virtual screening, molecular docking has been shown to be effective at identifying small molecules that interact with protein binding pockets.

A number of different molecular docking programs are currently in widespread use. Some studies have used consensus scoring to integrate the scores that different docking programs assign to small molecules in the hope of achieving better virtual screening, and some also take the docking conformation into account in the consensus score.

Computational screening, that is, docking and scoring, is the core step of virtual screening: each small molecule is placed at the ligand binding site of the receptor protein, the ligand conformation and position are optimized for the best binding to the receptor, the optimal conformation is scored, all compounds are sorted by score, and the highest-scoring small molecules are picked from the compound library. Scoring functions are key elements of protein-small molecule docking and virtual screening. Traditional scoring functions are typically based on empirical terms, molecular force fields, or statistical models. Comparatively successful examples are the Glide score and the AutoDock Vina score, which have been shown to aid small-molecule drug screening and design. The main advantages of these traditional scoring functions include, but are not limited to, the following: 1) they are fast to compute, which makes screening of ultra-large virtual libraries feasible; 2) they are relatively robust and achieve a certain enrichment of active small molecules across different types of targets. However, molecular docking software has a high false positive rate, and the results obtained with different docking programs differ considerably. Moreover, the small-molecule compound libraries available for virtual screening already contain hundreds of millions of compounds and are growing rapidly, so screening efficiency is particularly important for large-scale virtual screening. Modern drug screening must cope with massive amounts of compound data, and current molecular docking technology struggles to meet this practical demand.

Therefore, there is an urgent need for a method that scores small-molecule compounds by combining different molecular docking programs and their scoring functions, and for a fast virtual screening method with high enrichment capability, so as to improve both the effectiveness and the efficiency of virtual screening.

Disclosure of Invention

To address the above technical problems, the invention provides an optimization method for large-scale drug virtual screening, comprising the following steps:

S1. Prepare the protein structure and the small-molecule compound library, determine the search space, select two or more molecular docking programs and scoring functions, set the docking parameters, and use the selected docking programs and scoring functions to score the protein-binding capacity of each small molecule in the library.

Molecular docking programs and scoring functions that can be used for virtual screening and whose enrichment capability has been verified are selected as the set M. The protein and the small-molecule compound library L are preprocessed according to the official documentation and usage instructions of each program, and structure-based molecular docking is executed with each of them to obtain the score $S_{ij}$ of small-molecule compound j under docking program and scoring function i.

S2. Integrate the scores assigned to the small-molecule compounds by the different molecular docking programs and their scoring functions.

For each molecular docking program and scoring function i, small-molecule compounds whose processing, docking, or scoring failed are excluded, and the mean $\bar{S}_i$ and standard deviation $\sigma_i$ of the scores are calculated over the set N of the n successfully scored small-molecule compounds in library L. The mean is calculated as follows:

$$\bar{S}_i = \frac{1}{n} \sum_{j \in N} S_{ij}$$

The standard deviation is calculated as follows:

$$\sigma_i = \sqrt{\frac{1}{n} \sum_{j \in N} \left(S_{ij} - \bar{S}_i\right)^2}$$

The scores of the molecular docking programs and scoring functions are divided into two classes according to their directionality: scores for which a larger value indicates stronger binding between the small-molecule compound and the protein form the first class; scores for which a larger value indicates weaker binding form the second class. The score $S_{ij}$ of small-molecule compound j under docking program and scoring function i is Z-score normalized according to its class to obtain the normalized score $Z_{ij}$.

For small-molecule compound j, the final score $V_j$ is calculated from its normalized scores $Z_{ij}$ over the whole set M of docking programs and scoring functions according to the following formula:

$$V_j = \sum_{i \in M} e^{Z_{ij}}$$

The small molecules in set N are ranked in descending order of final score $V_j$ to obtain the ranking table R.

S3. Analyze the virtual screening results.

The ranking table R obtained in step S2 is taken as the final virtual screening result; the higher a small-molecule compound ranks, the more likely it is to bind strongly to the protein.
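Steps S2 and S3 can be illustrated with a minimal sketch. It assumes that every molecule has a valid score under every scoring function, that the standard deviation is the population form (division by n) and is non-zero, and that the final score is the sum of exponentials of the normalized scores; the names `z_scores`, `exp_z_score_ranking`, `score_table`, and `directions` are hypothetical and chosen only for this illustration.

```python
import math

def z_scores(scores, higher_is_better):
    """Z-score normalize one scoring function so that larger always means better."""
    n = len(scores)
    mean = sum(scores) / n
    # Population standard deviation (divide by n): an assumption of this sketch.
    sigma = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    sign = 1.0 if higher_is_better else -1.0  # second-class scores are sign-flipped
    return [sign * (s - mean) / sigma for s in scores]

def exp_z_score_ranking(score_table, directions):
    """score_table: {function_name: {molecule_id: raw_score}}
    directions: {function_name: True if a larger raw score means stronger binding}
    Returns (molecule ids from best to worst, final score per molecule)."""
    molecules = sorted(next(iter(score_table.values())))
    final = {m: 0.0 for m in molecules}
    for name, per_molecule in score_table.items():
        raw = [per_molecule[m] for m in molecules]
        for m, z in zip(molecules, z_scores(raw, directions[name])):
            final[m] += math.exp(z)  # sum of exponentials of normalized scores
    ranking = sorted(molecules, key=lambda m: final[m], reverse=True)
    return ranking, final

# Toy data: a Vina-style score (more negative is better) and a pKd-style score.
score_table = {
    "vinardo": {"a": -9.0, "b": -7.0, "c": -5.0},
    "rf_score": {"a": 7.5, "b": 6.0, "c": 6.5},
}
directions = {"vinardo": False, "rf_score": True}
ranking, final = exp_z_score_ranking(score_table, directions)
print(ranking)
```

Because the exponential grows quickly, a molecule that scores very well under one function can outrank a molecule that is merely average under all of them, which is the intended emphasis on early enrichment.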

Further, step S3 may additionally include multi-stage virtual screening, as follows: the top x% of small-molecule compounds in the ranking table R obtained in step S2 are selected as the compound library for step S1 of the next stage of consensus scoring, and one or more new molecular docking programs and scoring functions are added to perform the docking and consensus scoring of that stage, yielding the ranking table $R_2$. The top y% of small-molecule compounds in $R_2$ are then selected as the compound library for step S1 of the following stage, one or more new docking programs and scoring functions are again added, and docking and consensus scoring are performed to obtain the ranking table $R_3$. These steps are repeated as needed over several stages of docking and consensus scoring to obtain the ranking table $R_r$ of the final, r-th stage. The newly added docking programs and scoring functions are likewise ones that can be used for virtual screening and whose enrichment capability has been verified.

Preferably, the small-molecule compound library L in step S1 is taken from the DUD-E database, and the protein structure is taken from the DUD-E database, the AlphaFold Protein Structure Database, or the PDB database.

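Preferably, the normalized score $Z_{ij}$ in step S2 is computed differently for the two classes of scores, where $\bar{S}_i$ and $\sigma_i$ are the mean and standard deviation of the scores of scoring function i. The Z-score normalization formula for first-class scores is as follows:

$$Z_{ij} = \frac{S_{ij} - \bar{S}_i}{\sigma_i}$$

The Z-score normalization formula for second-class scores is as follows:

$$Z_{ij} = -\frac{S_{ij} - \bar{S}_i}{\sigma_i}$$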

Preferably, the scoring function set M in step S1 consists of four scoring functions: the default scoring function and the built-in rf_score scoring function of the idock molecular docking software, and the vinardo and AutoDock4 (ad4) scoring functions of the AutoDock Vina molecular docking software.

Further, the multi-stage screening proceeds as follows. In the first stage, virtual screening is performed by docking with idock, which yields the results of the idock scoring function and of the rf_score scoring function; consensus scoring of these two results gives the ir ranking table, and the top x% of small molecules in the ir ranking table enter the second stage. In the second stage, virtual screening is performed with the vinardo scoring function to obtain its results; consensus scoring is then carried out over the idock and rf_score results of the top x% of molecules in the ir ranking table together with the vinardo results of all molecules screened in this stage, giving the irv ranking table, and the top y% of small molecules in the irv ranking table enter the third stage. In the third stage, virtual screening is performed with the autodock4 scoring function to obtain its results; consensus scoring is then carried out over the idock, rf_score, and vinardo results of the top y% of molecules in the irv ranking table together with the autodock4 results of all molecules screened in this stage, giving the irva ranking table. The parameters x and y take values between 0 and 100 and can be adjusted flexibly as needed.

Preferably, x is 5 to 20 and y is 25 to 75.
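The staged funnel described above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the `scorers` callables are hypothetical stand-ins for real docking runs, the normalization statistics are recomputed over the molecules that survive into each stage, and a zero standard deviation is replaced by 1 so that a single surviving molecule does not cause a division by zero.

```python
import math

def exp_z_consensus(table, higher_is_better):
    """Rank molecules by the sum of exp(Z-score) over all scoring functions in table."""
    mols = sorted(next(iter(table.values())))
    total = dict.fromkeys(mols, 0.0)
    for fn, scores in table.items():
        vals = [scores[m] for m in mols]
        mean = sum(vals) / len(vals)
        # Guard against sigma == 0 (e.g. one survivor); an assumption of this sketch.
        sigma = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
        sign = 1.0 if higher_is_better[fn] else -1.0
        for m in mols:
            total[m] += math.exp(sign * (scores[m] - mean) / sigma)
    return sorted(mols, key=total.get, reverse=True)

def multi_stage_screen(library, stages, keep_percent, scorers, higher_is_better):
    """stages: list of lists of scoring-function names added at each stage.
    keep_percent: percentage of the ranking kept after each stage but the last.
    scorers: {name: callable(molecule) -> score}, hypothetical docking stand-ins."""
    survivors = list(library)
    table = {}
    for idx, new_functions in enumerate(stages):
        for fn in new_functions:
            # Only the surviving molecules are docked with the newly added functions.
            table[fn] = {m: scorers[fn](m) for m in survivors}
        # Earlier results are restricted to the survivors before consensus scoring.
        table = {fn: {m: s[m] for m in survivors} for fn, s in table.items()}
        ranking = exp_z_consensus(table, higher_is_better)
        if idx < len(keep_percent):
            keep = max(1, math.ceil(len(ranking) * keep_percent[idx] / 100))
            survivors = ranking[:keep]
        else:
            survivors = ranking
    return survivors

# Toy run: molecule ids 0..19, where a larger id is "better" under every function.
scorers = {
    "idock": lambda m: -float(m), "rf_score": lambda m: float(m),
    "vinardo": lambda m: -float(m), "ad4": lambda m: -float(m),
}
higher = {"idock": False, "rf_score": True, "vinardo": False, "ad4": False}
result = multi_stage_screen(range(20), [["idock", "rf_score"], ["vinardo"], ["ad4"]],
                            [10, 50], scorers, higher)
print(result)
```

The saving comes from the later, slower scoring functions being run only on the small fraction of the library that survives the earlier stages.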

Compared with the prior art, the invention has the following beneficial effects:

1. The consensus scoring method adopted by the invention gives better virtual screening results than a single docking program and than other consensus scoring methods. The invention tested 13 consensus scoring methods and identified EXP_Z_SCORE as the one with the best virtual screening performance; it outperforms the ECR consensus scoring method [1] proposed by Palacio-Rodríguez, K. et al.

2. With the rapid growth of the small-molecule libraries available for virtual screening and the very large protein structure database provided by AlphaFold, large-scale virtual screening will play an increasingly important role in drug discovery. To reduce the computation time of virtual screening, improve computational efficiency, save computing resources, and make use of several different molecular docking programs and scoring functions, the invention proposes multi-stage screening combined with consensus scoring. This approach allows the number of small molecules passed on at each stage to be adjusted flexibly to trade off time against screening performance, uses only a small amount of computing resources, and retains high early enrichment capability. In the benchmark test, the mean enrichment factor EF1% of multi-stage screening combined with consensus scoring did not differ significantly from that of consensus scoring alone, while the computation was approximately 9 times faster. Compared with traditional multi-stage screening, multi-stage screening combined with consensus scoring offers a clear advantage in early enrichment capability. It is therefore significant for large-scale drug screening, for accelerating drug discovery, and for reducing drug development cost: it greatly speeds up virtual screening while maintaining high early enrichment capability, and meets the practical requirements of virtually screening large compound databases. By improving the consensus scoring method and combining it with multi-stage screening, the invention offers a new approach to further expanding the screening scale and improving the utilization of computing resources.

Drawings

FIG. 1 is a pie chart of the class distribution of the 51 DUD-E targets tested in the present invention;

FIG. 2 is a violin plot of the screening results for the 51 DUD-E targets under 4 individual scoring functions and 13 consensus scoring methods; screening performance is assessed with four metrics, EF1%, AUC, logAUC, and BEDROC (α=80.5); the scoring functions on the x-axis are ordered by mean logAUC, and the upper limit, lower limit, quartiles, and median of the exp_z_score data are marked with dotted lines;

FIG. 3 is a flow chart of multi-stage screening combined with consensus scoring;

FIG. 4 is a violin plot of the screening results for the 51 DUD-E targets under exp_z_score consensus scoring and under multi-stage screening combined with consensus scoring (all with autodock4, idock, rf_score, and vinardo as input), together with the screening results under idock; screening performance is assessed with four metrics, EF1%, AUC, logAUC, and BEDROC (α=80.5), and the upper limit, lower limit, quartiles, and median of the exp_z_score data are marked.

Detailed Description

Example 1: application of Z-Score standardized exponential function consensus scoring method in drug virtual screening

The DUD-E database is widely used for virtual screening benchmarks. Taking into account both the computation time of virtual screening and the diversity of protein classes, 51 diverse protein targets were selected; together they correspond to more than 400,000 small molecules. Several molecular docking programs with their scoring functions and several consensus scoring methods were used for a comprehensive test, and the Z-score-normalized exponential consensus scoring method, which gave the best virtual screening performance, was selected from among the consensus scoring methods. The class distribution of the 51 targets selected from DUD-E is shown in FIG. 1.

The specific method comprises the following steps:

(1) Selection of molecular docking software and scoring functions

The molecular docking programs idock 2.2.3 and AutoDock Vina 1.2.3 were used. Four scoring functions were selected: the default scoring function and the built-in rf_score scoring function of idock, and the vinardo and AutoDock4 (ad4) scoring functions of AutoDock Vina.

(2) Preparation of protein Structure

All protein structures of the targets used in the virtual screening test were processed as follows:

1. PDBFixer 1.8.1 (https://github.com/openmm/pdbfixer) was used to replace non-standard residues, remove heterogens (including water molecules, coenzymes, metal ions, etc.), and add missing atoms, giving receptor_fixed.pdb.

2. The script provided with AutoDock Vina was used to add hydrogens and produce the required protein file format with the following command: prepare_receptor -r receptor_fixed.pdb -o receptor.pdbqt -A hydrogens

The resulting receptor.pdbqt is the prepared protein file and can be used directly as input for docking with AutoDock Vina and idock.

(3) Preparation of small molecule libraries

For each protein target, the active small-molecule file actives_final.sdf.gz and the decoy small-molecule file decoys_final.sdf.gz were obtained from the DUD-E database, decompressed, and then split and converted with Open Babel 3.1.0 (https://github.com/openbabel/openbabel) into one mol2 file per small molecule. For each small molecule, the script provided with AutoDock Vina was used to add hydrogens and produce the required small-molecule file format with the following command: prepare_ligand -l molecule.mol2 -o molecule.pdbqt -A hydrogens. The resulting molecule.pdbqt is the prepared small-molecule file and can be used directly as input for docking with AutoDock Vina and idock. After all small molecules have been processed, each protein target has its own small-molecule compound library, consisting of active small-molecule compounds and decoy small-molecule compounds.

(4) Determination of search space

Binding sites were predicted for all protein structures of the targets used in the virtual screening test with AutoSite 1.0.0, and a binding site was selected by visual inspection with the PyMOL(TM) Molecular Graphics System, Version 2.6.0a0, with reference to the position of the co-crystallized small molecule crystal_ligand.mol2 in the DUD-E database. After the binding site of a protein structure was determined, the spatial coordinate information of the binding-site structure, including its size and position, was calculated with the PandasPdb module of biopandas 0.2.9, and a margin (its value is missing from this text) was added to the size of the binding-site structure in each of the three dimensions x, y, and z to give the search space for AutoDock Vina and idock.
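The box computation can be sketched in plain Python. The source uses the PandasPdb module of biopandas to read the coordinates; the sketch below assumes the coordinates are already available as tuples, and `padding` is a hypothetical parameter because the margin actually used is not preserved in the text.

```python
def search_box(coords, padding):
    """Compute a docking search box from binding-site atom coordinates.

    coords: list of (x, y, z) tuples for the binding-site structure.
    padding: length added to the size in each dimension (hypothetical value;
             the margin used in the source is not preserved).
    Returns (center, size), each an (x, y, z) tuple.
    """
    mins = [min(c[k] for c in coords) for k in range(3)]
    maxs = [max(c[k] for c in coords) for k in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    size = tuple((hi - lo) + padding for lo, hi in zip(mins, maxs))
    return center, size

center, size = search_box([(0.0, 0.0, 0.0), (4.0, 2.0, 6.0)], padding=10.0)
print(center, size)  # (2.0, 1.0, 3.0) (14.0, 12.0, 16.0)
```

The same center and size can be passed to both docking programs, so that every scoring function searches the same region.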

(5) Setting docking parameters and performing docking

For every docking run, the seed was 20011204 and only the best docking conformation was output. The exhaustiveness of AutoDock Vina was set to 1 and the tasks parameter of idock to 8; neither flexible docking nor hydrated docking was used.
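As an illustration, the AutoDock Vina settings above could be written to a configuration file as in the sketch below. The key names follow the command-line option names of AutoDock Vina 1.2; `num_modes = 1` is an assumed way of expressing "only the best docking conformation is output", and the file names and box values are placeholders.

```python
def vina_config(receptor, ligand, out, center, size, scoring="vinardo"):
    """Return the text of an AutoDock Vina configuration file (key = value lines)."""
    cx, cy, cz = center
    sx, sy, sz = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"out = {out}",
        f"scoring = {scoring}",
        f"center_x = {cx}",
        f"center_y = {cy}",
        f"center_z = {cz}",
        f"size_x = {sx}",
        f"size_y = {sy}",
        f"size_z = {sz}",
        "seed = 20011204",      # seed stated in the text
        "exhaustiveness = 1",   # exhaustiveness stated in the text
        "num_modes = 1",        # assumption: keep only the best conformation
    ]
    return "\n".join(lines) + "\n"

print(vina_config("receptor.pdbqt", "molecule.pdbqt", "molecule_out.pdbqt",
                  (2.0, 1.0, 3.0), (14.0, 12.0, 16.0)))
```

Fixing the seed makes repeated runs comparable, and a low exhaustiveness trades some search thoroughness for speed, which matters when hundreds of thousands of molecules are docked.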

The AutoDock4 (ad4) and vinardo scoring functions are provided by AutoDock Vina 1.2.3; a separate docking run is performed with each of them to obtain the AutoDock4 score and the vinardo score of each small molecule. The AutoDock4 scoring function additionally requires affinity maps, which were computed for each docking with autogrid4 (AutoGrid 4.2.7.x.2019-07-11) and the prepare_gpf.py script provided with AutoDock Vina. The idock and rf_score scoring functions are provided by idock 2.2.3; a single docking run yields both the idock score and the rf_score score of each small molecule.

For each of the 51 proteins and its corresponding small-molecule compound library, structure-based molecular docking gives the score of every small-molecule compound under each scoring function. Small-molecule compounds whose processing, docking, or scoring failed were excluded, leaving only the compounds that were scored successfully.

(6) Performing consensus scoring

To integrate different scoring functions, their scores must first be normalized. Common normalization methods include ranking (Rank), max-min scaling (AASS), and Z-score scaling.

The scores of the scoring functions are divided into two classes according to their directionality: scores for which a larger value indicates stronger binding between the small-molecule compound and the protein form the first class; scores for which a larger value indicates weaker binding form the second class. The valid scores of the first class are all scores greater than 0; the valid scores of the second class are all scores less than 0.

Rank: the rank is used as the normalized score. First-class scores are ranked from largest to smallest and second-class scores from smallest to largest, so that for both classes the ranking runs from best to worst. For both classes, a smaller rank-normalized score is better.

Z-score: the score $S_{ij}$ of small-molecule compound j under scoring function i is Z-score normalized according to its class to give the normalized score $Z_{ij}$. Here $\bar{S}_i$ and $\sigma_i$ are the mean and standard deviation of the valid scores that scoring function i assigns to all small-molecule compounds of a given protein. The Z-score normalized score for first-class scores is calculated as follows:

$$Z_{ij} = \frac{S_{ij} - \bar{S}_i}{\sigma_i}$$

The Z-score normalized score for second-class scores is calculated as follows:

$$Z_{ij} = -\frac{S_{ij} - \bar{S}_i}{\sigma_i}$$

For both classes, a larger Z-score-normalized score is better.

AASS: the score $S_{ij}$ of small-molecule compound j under scoring function i is AASS normalized according to its class to give the normalized score $A_{ij}$. For first-class scores, the highest score of scoring function i is denoted $\mathrm{Best}_i$ and the lowest $\mathrm{Worst}_i$. For second-class scores, the lowest score of scoring function i is denoted $\mathrm{Best}_i$ and the highest $\mathrm{Worst}_i$. That is, for both classes the best score is $\mathrm{Best}_i$ and the worst score is $\mathrm{Worst}_i$. For both classes, the AASS normalized score is calculated as follows:

$$A_{ij} = \frac{S_{ij} - \mathrm{Worst}_i}{\mathrm{Best}_i - \mathrm{Worst}_i}$$

For both classes, a larger AASS-normalized score is then better.
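The Rank and AASS normalizations can be sketched as below. The AASS function assumes the max-min form in which the best score maps to 1 and the worst to 0, and the rank function breaks ties by input order; both choices, and the function names, are assumptions made for this illustration.

```python
def rank_normalize(scores, higher_is_better):
    """Return ranks with 1 = best; ties are broken by input order (an assumption)."""
    order = sorted(range(len(scores)), key=lambda k: scores[k],
                   reverse=higher_is_better)
    ranks = [0] * len(scores)
    for position, k in enumerate(order, start=1):
        ranks[k] = position
    return ranks

def aass_normalize(scores, higher_is_better):
    """Max-min scaling so that the best score maps to 1 and the worst to 0."""
    best = max(scores) if higher_is_better else min(scores)
    worst = min(scores) if higher_is_better else max(scores)
    return [(s - worst) / (best - worst) for s in scores]

# Second-class scores (more negative is better), e.g. Vina-style energies.
print(rank_normalize([-9.0, -7.0, -5.0], higher_is_better=False))  # [1, 2, 3]
print(aass_normalize([-9.0, -7.0, -5.0], higher_is_better=False))
# First-class scores (larger is better).
print(rank_normalize([7.5, 6.0, 6.5], higher_is_better=True))      # [1, 3, 2]
```

Note the opposite orientation of the two outputs: smaller is better after Rank, larger is better after AASS, so the step that combines them has to respect that direction.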

How the normalized scores are processed to obtain the final consensus score is also key to consensus scoring. The invention uses four calculation methods: summation (Sum), taking the best value (Best), taking the worst value (Worst), and summation after exponentiation (Exp).

Sum: let $V_{ij}$ be the normalized score of small-molecule compound j under scoring function i. The consensus score $\mathrm{Sum}_j$ of compound j over multiple scoring functions is calculated as follows:

$$\mathrm{Sum}_j = \sum_i V_{ij}$$

Best: the consensus score $\mathrm{Best}_j$ of small-molecule compound j over multiple scoring functions is the best of its normalized scores over all scoring functions.

Worst: the consensus score $\mathrm{Worst}_j$ of small-molecule compound j over multiple scoring functions is the worst of its normalized scores over all scoring functions.

Exp: let $V_{ij}$ be the normalized score of small-molecule compound j under scoring function i. The consensus score $\mathrm{Exp}_j$ of compound j over multiple scoring functions is calculated as follows:

$$\mathrm{Exp}_j = \sum_i e^{V_{ij}}$$

Combining the three normalization methods with the four calculation methods for the final score gives twelve consensus scoring methods, named as follows:

Table 1. The twelve consensus scoring methods

Calculation	Rank	Z-score	AASS
Sum	sum_rank	sum_z_score	sum_aass
Best	best_rank	best_z_score	best_aass
Worst	worst_rank	worst_z_score	worst_aass
Exp	exp_rank	exp_z_score	exp_aass
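The four calculation methods in the rows of Table 1 can be sketched as one small function applied to the normalized scores of a single molecule. The function name and the `larger_is_better` flag are hypothetical; the flag covers the Rank column, where a smaller normalized value is better. The Exp branch assumes a plain sum of exponentials and is only meaningful for the Z-score and AASS columns, since exp_rank is computed differently.

```python
import math

def combine(normalized, method, larger_is_better=True):
    """Combine one molecule's normalized scores (one per scoring function).

    method: "sum", "best", "worst", or "exp".
    larger_is_better: False for rank-normalized scores, where 1 is the best rank.
    """
    if method == "sum":
        return sum(normalized)
    if method == "best":
        return max(normalized) if larger_is_better else min(normalized)
    if method == "worst":
        return min(normalized) if larger_is_better else max(normalized)
    if method == "exp":
        if not larger_is_better:
            raise ValueError("exp_rank uses a different formula and is not covered here")
        return sum(math.exp(v) for v in normalized)
    raise ValueError(f"unknown method: {method}")

z = [1.0, -0.5, 0.25]                 # Z-score-normalized scores of one molecule
print(combine(z, "sum"))              # 0.75
print(combine(z, "best"))             # 1.0
print(combine(z, "worst"))            # -0.5
print(combine([3, 1, 7], "best", larger_is_better=False))  # 1 (best rank)
```

For example, sum_z_score corresponds to `combine(z, "sum")` on Z-score-normalized input, and best_rank to `combine(ranks, "best", larger_is_better=False)`.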

For example, exp_z_score is fully described as follows. Let $S_{ij}$ be the score of small-molecule compound j under scoring function i, let $\bar{S}_i$ and $\sigma_i$ be the mean and standard deviation of the valid scores that scoring function i assigns to all small-molecule compounds of a given protein, and let $Z_{ij}$ be the normalized score of compound j under scoring function i. The Z-score normalized score for first-class scores is calculated as follows:

$$Z_{ij} = \frac{S_{ij} - \bar{S}_i}{\sigma_i}$$

The Z-score normalized score for second-class scores is calculated as follows:

$$Z_{ij} = -\frac{S_{ij} - \bar{S}_i}{\sigma_i}$$

The final score $\mathrm{Exp}_j$ of small-molecule compound j over multiple scoring functions is calculated as follows:

$$\mathrm{Exp}_j = \sum_i e^{Z_{ij}}$$

A higher final score indicates a greater likelihood that the compound is an active small-molecule compound.

exp_rank (ECR), by contrast, differs in how the summation after exponentiation (Exp) is carried out. Let $R_{ij}$ be the rank of small-molecule compound j under scoring function i, and let n be the total number of validly scored small molecules. The consensus score $\mathrm{Exp}_j$ of compound j over multiple scoring functions is calculated according to the following formula: [formula missing from this text]. A higher final score indicates a greater likelihood that the compound is an active small-molecule compound.

In addition, the Rank By Vote (RBV) consensus scoring method was tested. RBV: the threshold is set to x = top 10%; a small-molecule compound j receives one vote from scoring function i if its rank under that function falls within the threshold x. The sum of the votes that compound j receives over all scoring functions is its final score. Small-molecule compounds with the same number of votes are ranked in random order.
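A minimal sketch of the vote counting follows. It assumes that ranks start at 1 for the best molecule and that the top-10% cutoff is rounded up to a whole number of molecules; the rounding rule and the function name are assumptions. The random ordering of ties described above is left to the caller.

```python
import math

def rank_by_vote(ranks_by_function, top_fraction=0.10):
    """Count, for each molecule, how many scoring functions place it in the top fraction.

    ranks_by_function: {function_name: {molecule_id: rank}}, rank 1 = best.
    Returns {molecule_id: number_of_votes}.
    """
    votes = {}
    for ranks in ranks_by_function.values():
        cutoff = math.ceil(len(ranks) * top_fraction)  # rounding up is an assumption
        for molecule, rank in ranks.items():
            votes[molecule] = votes.get(molecule, 0) + (1 if rank <= cutoff else 0)
    return votes

# Ten molecules; the two scoring functions rank them in opposite order.
ranks = {
    "f1": {k: k + 1 for k in range(10)},
    "f2": {k: 10 - k for k in range(10)},
}
print(rank_by_vote(ranks))
```

With only a handful of scoring functions the vote count takes few distinct values, so many molecules tie, which is why the tie-breaking rule matters for this method.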

(7) Calculating metrics for evaluating virtual screening performance

For each protein target and each ranking obtained with a scoring method (a single scoring function or a consensus score), the enrichment factor EF, BEDROC, AUC, and logAUC metrics can be calculated; larger values of these metrics indicate better screening performance.

Calculating the enrichment factor. For a set of small-molecule compounds in which the number of active compounds is actives and the total number of small-molecule compounds is total, the hit rate is calculated as follows:

$$\text{hit rate} = \frac{\text{actives}}{\text{total}}$$

For the set of small-molecule compounds in the top x% of the ranking table sorted by score from best to worst, in which the number of active compounds is actives(x%) and the total number of small-molecule compounds is total(x%), the hit rate is calculated as follows:

$$\text{hit rate}(x\%) = \frac{\text{actives}(x\%)}{\text{total}(x\%)}$$

For the whole ranking table sorted by score from best to worst, the proportion of active compounds is hit rate(100%); for the top x% of the ranking, the hit rate is hit rate(x%). The enrichment factor EFx% is calculated as follows:

$$\mathrm{EF}_{x\%} = \frac{\text{hit rate}(x\%)}{\text{hit rate}(100\%)}$$
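The enrichment factor can be computed directly from a ranked list of activity labels. In this sketch the size of the top x% is rounded up to a whole number of molecules, which is an assumption, and the function name is chosen for illustration.

```python
import math

def enrichment_factor(ranked_is_active, percent):
    """EF at the given percentage.

    ranked_is_active: list of booleans ordered from best to worst score,
                      True for an active compound and False for a decoy.
    """
    total = len(ranked_is_active)
    actives = sum(ranked_is_active)
    top = max(1, math.ceil(total * percent / 100))  # rounding up is an assumption
    hit_rate_top = sum(ranked_is_active[:top]) / top
    hit_rate_all = actives / total
    return hit_rate_top / hit_rate_all

# 100 molecules, 10 actives, 5 of them ranked within the top 10.
ranked = [True] * 5 + [False] * 5 + [True] * 5 + [False] * 85
print(enrichment_factor(ranked, 10))
```

An EF10% of about 5 here means the top tenth of the ranking contains five times as many actives as a random selection of the same size would be expected to contain.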

Calculating the BEDROC metric. BEDROC is defined as follows:

$$\mathrm{BEDROC} = \frac{\sum_{i=1}^{n} e^{-\alpha r_i / N}}{R_a \left( \dfrac{1 - e^{-\alpha}}{e^{\alpha / N} - 1} \right)} \times \frac{R_a \sinh(\alpha / 2)}{\cosh(\alpha / 2) - \cosh(\alpha / 2 - \alpha R_a)} + \frac{1}{1 - e^{\alpha (1 - R_a)}}$$

where n is the number of active compounds in the ranking table sorted by score from best to worst; N is the total number of small-molecule compounds; $R_a = n/N$ is the proportion of active compounds; and $r_i$ is the rank in the ranking table of the i-th active compound.

Calculating the AUC metric. AUC is the area under the ROC curve. The ordinate of the ROC curve is the true positive rate (TPR) at a given threshold, also known as the actives found rate. The abscissa is the false positive rate (FPR) at that threshold, also known as the decoys found rate.

The TPR is calculated as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

The FPR is calculated as follows:

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

Compounds ranked above a given threshold are predicted positive: an active compound among them is a true positive, and an inactive (decoy) compound is a false positive. Compounds ranked below the threshold are predicted negative: an active compound among them is a false negative, and an inactive compound is a true negative. At a given threshold, TP is the number of true positive compounds, FN the number of false negative compounds, FP the number of false positive compounds, and TN the number of true negative compounds. The area under the ROC curve is the AUC; its value lies in [0, 1], and random screening corresponds to an AUC of 0.5.
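AUC and BEDROC can both be computed from a ranked list of activity labels, as in the sketch below. The AUC uses the trapezoidal rule over the ROC points obtained by moving the threshold down the ranking one compound at a time. The BEDROC function assumes the standard Truchon and Bayly form in terms of n, N, $R_a$, $r_i$, and α; function names are chosen for illustration.

```python
import math

def roc_points(ranked_is_active):
    """ROC points (FPR, TPR) as the threshold moves down the ranking."""
    n_act = sum(ranked_is_active)
    n_dec = len(ranked_is_active) - n_act
    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_active in ranked_is_active:
        if is_active:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_dec, tp / n_act))
    return points

def auc(ranked_is_active):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(ranked_is_active)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def bedroc(ranked_is_active, alpha=80.5):
    """BEDROC in the Truchon-Bayly form (assumed to match the definition in the text)."""
    big_n = len(ranked_is_active)
    ranks = [i for i, a in enumerate(ranked_is_active, start=1) if a]
    ra = len(ranks) / big_n
    s = sum(math.exp(-alpha * r / big_n) for r in ranks)
    rie = s / (ra * (1 - math.exp(-alpha)) / (math.exp(alpha / big_n) - 1))
    scale = ra * math.sinh(alpha / 2) / (
        math.cosh(alpha / 2) - math.cosh(alpha / 2 - alpha * ra))
    return rie * scale + 1 / (1 - math.exp(alpha * (1 - ra)))

print(auc([True, False, True, False]))              # 0.75
print(bedroc([True] * 10 + [False] * 90, alpha=20.0))
```

A larger α concentrates the BEDROC weight on the very top of the ranking, which is why the same ranking is reported at several α values.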

Calculating the logAUC index. The logAUC is the area under the semilogarithmic ROC curve and is calculated as follows: after the ROC curve is drawn, let λ = 0.001, ignore points whose abscissa is less than or equal to λ, and take the base-10 logarithm of the abscissa of every point whose abscissa is greater than λ. That is, a point $(x_i, y_i)$ on the ROC curve becomes the point $(\log_{10}(x_i), y_i)$ on the semilogarithmic ROC curve. The abscissa of the semilogarithmic ROC curve ranges over [-3, 0]. The area under the semilogarithmic ROC curve is calculated, and logAUC = area/3. The logAUC ranges over [0, 1].
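The logAUC procedure above can be sketched as follows. The sketch follows the description literally (drop points with abscissa at or below λ, transform, integrate, divide by 3); the trapezoidal integration, the function name and the toy ROC points are assumptions.

```python
import math


def log_auc(points, lam=0.001):
    """logAUC from ROC points (FPR, TPR) sorted by increasing FPR."""
    # Ignore points with abscissa <= lam and log10-transform the remaining abscissas.
    kept = [(math.log10(x), y) for x, y in points if x > lam]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(kept, kept[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # trapezoidal rule (an assumption)
    # The transformed abscissa spans [log10(lam), 0], i.e. a width of 3 for lam = 0.001.
    return area / -math.log10(lam)


# Hypothetical ROC points, for illustration only.
toy_points = [(0.0, 0.0), (0.01, 0.5), (0.1, 0.5), (1.0, 1.0)]
toy_log_auc = log_auc(toy_points)
```

Because the abscissa is logarithmic, the first 1% of decoys found covers as much of the axis as the range from 10% to 100%, which is why logAUC rewards early recognition more than AUC does.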

(8) Analysis results

Among the methods tested, the best was exp_z_score, the exponential function consensus score based on Z-score scaling. Overall it outperformed the four individual scoring functions (autodock4, idock, rf_score, vinardo) as well as the other consensus scoring methods; the results are shown in FIG. 2.

We also tested the significance of the difference between the screening results of exp_rank (ECR, the rank-based exponential function consensus scoring method from an existing study) and those of exp_z_score, the best method in our tests. The results are shown in Table 2 below.

Table 2. Screening performance on the DUD-E database: averages of eight indices over the screening results for 51 targets

Method	AUC	logAUC	BEDROC(α=321.9)	BEDROC(α=80.5)	BEDROC(α=20.0)	EF1%	EF5%	EF10%
exp_z_score	0.686	0.290	0.248	0.204	0.253	7.736	4.226	3.202
best_z_score	0.687	0.285	0.222	0.193	0.246	7.038	4.162	3.101
exp_rank	0.689	0.284	0.247	0.193	0.244	7.423	4.047	3.104
sum_z_score	0.676	0.282	0.250	0.197	0.243	7.697	4.059	3.068
best_rank	0.688	0.280	0.187	0.180	0.243	6.533	4.138	3.118
exp_aass	0.671	0.279	0.248	0.195	0.240	7.617	3.982	3.021
sum_aass	0.669	0.278	0.248	0.194	0.238	7.506	3.955	2.971
best_aass	0.678	0.272	0.192	0.173	0.230	6.356	3.893	2.998
sum_rank	0.673	0.272	0.234	0.179	0.227	6.782	3.722	2.901
ad4	0.645	0.270	0.229	0.194	0.235	7.002	3.915	2.896
rbv	0.659	0.260	0.211	0.171	0.223	6.735	3.636	3.003
idock	0.647	0.252	0.204	0.156	0.199	6.342	3.235	2.574
worst_aass	0.633	0.250	0.204	0.164	0.206	6.187	3.430	2.608
worst_z_score	0.634	0.249	0.223	0.166	0.204	6.366	3.363	2.534
worst_rank	0.635	0.249	0.223	0.164	0.202	6.258	3.297	2.507
vinardo	0.639	0.245	0.172	0.152	0.198	5.436	3.282	2.509
rf_score	0.625	0.228	0.118	0.116	0.172	4.040	2.865	2.336
pvalue	0.054	0.075	0.961	0.102	0.028	0.399	0.068	0.028

In Table 2, the screening effect is assessed with eight indices, AUC, logAUC, BEDROC (α = 321.9, α = 80.5, α = 20.0), EF1%, EF5% and EF10%, each averaged over the protein structures of 51 targets in the DUD-E database. ad4 denotes the autodock4 scoring function, idock the idock scoring function, rf_score the rf_score scoring function in the idock software, and vinardo the vinardo scoring function. The last row, pvalue, is the result of a two-tailed paired t-test between the exp_z_score data and the exp_rank (ECR) data for each index; values with p < 0.05 (BEDROC (α = 20.0) and EF10%) are shown in bold. The remaining rows are the screening results of the consensus scoring methods.

Table 2 shows that, judged by the averages of the individual indices, the exp_z_score consensus scoring method is better overall than the other methods. Compared with exp_rank (ECR), every index except AUC is higher for exp_z_score, and the BEDROC (α = 20.0) and EF10% indices are higher on average with a significant difference (Table 2). This indicates that the exp_z_score consensus scoring method is better than the exp_rank (ECR) consensus scoring method from the existing study.

Example 2: application of multi-stage screening method combined with Z-Score standardized exponential function consensus Score in drug virtual screening

The targets used were the same as in example 1.

The method comprises the following steps:

(1) Same as step (1) of Example 1.

(2) Same as step (2) of Example 1.

(3) Same as step (3) of Example 1.

(4) Same as step (4) of Example 1.

(5) Multi-stage screening in combination with Z-Score normalized exponential function consensus scoring

The docking software parameters used were as mentioned in step (5) of example 1, the exp_z_score consensus scoring method used was as mentioned in step (6) of example 1, and the multi-stage screening in combination with the Z-Score normalized exponential function consensus scoring was performed according to the flow chart shown in fig. 3 and detailed description thereof.

The multi-stage screening procedure combined with the Z-Score normalized exponential function consensus score is as follows. In the first stage, virtual screening is performed by docking with idock, which yields an idock ranking table and an rf_score ranking table; consensus scoring of these two tables gives the ir ranking table, and the small molecules in the top x% of the ir ranking table enter the second stage.

In the second stage, virtual screening is performed with the vinardo scoring function to obtain a vinardo ranking table. Consensus scoring is then carried out on the small molecules in the top x% of the ir ranking table, using their entries in the idock and rf_score ranking tables together with all small molecules in the vinardo ranking table, to obtain the irv ranking table. The small molecules in the top y% of the irv ranking table enter the third stage.

In the third stage, virtual screening is performed with the autodock4 scoring function to obtain an autodock4 ranking table. Consensus scoring is then carried out on the small molecules in the top y% of the irv ranking table, using their entries in the idock, rf_score and vinardo ranking tables together with all small molecules in the autodock4 ranking table, to obtain the irva ranking table.

The final ranking is obtained by concatenating, in this order, the small molecules in the irva ranking table, the small molecules in the irv ranking table that are not in the irva ranking table, and the small molecules in the ir ranking table that are not in the irv ranking table, with each group keeping its original internal order. The x and y parameters can be adjusted flexibly. Three schemes were set up: Plan A (x = 40, y = 50), Plan C (x = 20, y = 50) and Plan E (x = 10, y = 50).
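The staged procedure above can be sketched as a loop that docks only the surviving molecules at each stage, re-runs the consensus over all scores collected so far, and finally concatenates the stage rankings. This is a sketch under stated assumptions, not the tested implementation: all function names are hypothetical, the docking runs are replaced by lookups in toy score tables, the rounding of the top x% is an assumption, and `mean_score_consensus` is a simple stand-in that is not the exp_z_score method.

```python
def top_percent(ranking, percent):
    """Molecules in the top percent% of a ranking (rounding rule is an assumption)."""
    return ranking[:max(1, round(len(ranking) * percent / 100))]


def merge_rankings(stage_rankings):
    """Final ranking: last stage first, then earlier stages, skipping molecules already placed."""
    final, seen = [], set()
    for ranking in reversed(stage_rankings):
        for mol in ranking:
            if mol not in seen:
                seen.add(mol)
                final.append(mol)
    return final


def multi_stage_screen(library, dock_stages, consensus, cutoffs):
    """dock_stages: callables mols -> {function_name: {mol: score}};
    consensus: callable (scores, mols) -> mols ordered best to worst;
    cutoffs: percentages passed to the next stage, e.g. [40, 50] for Plan A."""
    scores, candidates, rankings = {}, list(library), []
    for stage, dock in enumerate(dock_stages):
        scores.update(dock(candidates))                  # dock only the survivors
        rankings.append(consensus(scores, candidates))   # ir, then irv, then irva
        if stage < len(cutoffs):
            candidates = top_percent(rankings[-1], cutoffs[stage])
    return merge_rankings(rankings)


def mean_score_consensus(scores, mols):
    """Stand-in consensus (NOT exp_z_score): mean raw score, lower is better."""
    return sorted(mols, key=lambda m: sum(s[m] for s in scores.values()) / len(scores))


def table_stage(tables):
    """Fake docking run that looks scores up in fixed toy tables."""
    return lambda mols: {name: {m: table[m] for m in mols} for name, table in tables.items()}


library = [f"m{i}" for i in range(10)]
stage1 = table_stage({"idock": {f"m{i}": i for i in range(10)},
                      "rf_score": {f"m{i}": i for i in range(10)}})
stage2 = table_stage({"vinardo": {f"m{i}": -5 * i for i in range(10)}})
stage3 = table_stage({"autodock4": {f"m{i}": 0 for i in range(10)}})

final_ranking = multi_stage_screen(library, [stage1, stage2, stage3],
                                   mean_score_consensus, cutoffs=[40, 50])
```

In this toy run the four molecules that survive stage one are reordered by the stage-two consensus, the two that survive stage two head the final list, and the six molecules eliminated in stage one keep their stage-one order at the tail.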

(6) Multi-stage screening without consensus scoring

A multi-stage screen without consensus scoring was used as control data. In a recent study, Gorgulla, C. et al. screened 1.3 billion compounds using a multi-stage screening method without consensus scoring: the first stage used Qvina2 for rapid screening at minimal precision, and in the second stage 13 residues of the receptor were treated as flexible and the top 3 million compounds were rescored at higher precision using AutoDock Vina and Smina Vinardo [2].

The docking software parameters used were as mentioned in step (5) of Example 1, and the exp_z_score consensus scoring method used was as mentioned in step (6) of Example 1. The test of multi-stage screening without consensus scoring was performed as follows. In the first stage, virtual screening is performed by docking with idock to obtain an idock ranking table, and the small molecules in the top x% of the idock ranking table enter the second stage. In the second stage, virtual screening is performed by docking with the vinardo scoring function to obtain a vinardo ranking table, and the small molecules in the top y% of the vinardo ranking table enter the third stage. In the third stage, virtual screening is performed by docking with the autodock4 scoring function to obtain an autodock4 ranking table. The final ranking is obtained by concatenating, in this order, the small molecules in the autodock4 ranking table, the small molecules in the vinardo ranking table that are not in the autodock4 ranking table, and the small molecules in the idock ranking table that are not in the vinardo ranking table, with each group keeping its original internal order. The x and y parameters can be adjusted flexibly. Three schemes were set up: Plan B (x = 40, y = 50), Plan D (x = 20, y = 50) and Plan F (x = 10, y = 50).

(7) Calculating and evaluating indexes of virtual screening effect

For the ranking table results obtained for the six schemes (Plan A, B, C, D, E, F) mentioned in the above step (5) and step (6), the respective indices were calculated according to the method mentioned in step (7) of example 1.

(8) Analysis results

Fig. 4 shows violin plots of the screening results on four indices, EF1%, AUC, logAUC and BEDROC (α = 80.5), for three groups: exp_z_score (with autodock4, idock, rf_score and vinardo as input) and idock, which serve as control data; multi-stage screening combined with consensus scoring (Plans A, C and E from step (5)); and multi-stage screening without consensus scoring (Plans B, D and F from step (6)). The upper and lower limits, quartiles and median of the exp_z_score data are marked.

Table 3. Comparison of results over 51 targets for exp_z_score, multi-stage screening combined with consensus scoring, and multi-stage screening without consensus scoring

Index	exp_z_score	plan A	plan B	plan C	plan D	plan E	plan F	autodock4	idock	rf_score	vinardo
EF1%	7.74	7.80	6.85	7.68	6.86	7.55	6.62	7.00	6.34	4.04	5.44
avg time	97.34	28.17	28.17	16.65	16.65	10.89	10.89	69.19	5.13	5.13	23.03
total time	63.51	18.38	18.38	10.82	10.82	7.04	7.04	44.93	3.26	3.26	15.31

Table 3 reports, over the 51 targets in the DUD-E database, the mean EF1% for exp_z_score (with autodock4, idock, rf_score and vinardo as input), for multi-stage screening combined with consensus scoring (Plans A, C, E), for multi-stage screening without consensus scoring (Plans B, D, F), and for the individual scoring functions (autodock4, idock, rf_score, vinardo). It also reports the mean over the 51 targets of each target's average time per small molecule (avg time, in s/cpu/mol) and the average time per small molecule over all small molecules of the 51 targets (total time, in s/cpu/mol).

On EF1%, an important index of early enrichment ability in virtual screening, Plans A, C and E of multi-stage screening combined with consensus scoring reach values comparable to exp_z_score (7.80, 7.68 and 7.55 versus 7.74), all higher than those of the individual scoring functions. EF1% decreases slightly as the computation time of the multi-stage scheme combined with consensus scoring decreases, but it remains high. At equal computation time, multi-stage screening combined with consensus scoring has a higher EF1% than multi-stage screening without consensus scoring (Table 3). This shows that multi-stage screening combined with consensus scoring is superior to multi-stage screening without it.

On BEDROC (α = 80.5), another important index of early enrichment ability, Plans A, C and E of multi-stage screening combined with consensus scoring also reach high values comparable to the exp_z_score consensus score, while being worse than the exp_z_score consensus score on the AUC and logAUC indices (Fig. 4). This shows that maintaining a high early enrichment ability is an important advantage of multi-stage screening combined with consensus scoring. In actual drug screening, only the small fraction of small molecules at the top of the ranking is generally considered, so early enrichment ability is of great significance for the ranking used in practice.

There was no significant difference between the mean EF1% of Plan E and that of exp_z_score (p = 0.42, two-tailed paired t-test), whereas the average computation per small molecule under Plan E was approximately 9 times faster than under exp_z_score (10.89 versus 97.34 s/cpu/mol in Table 3). This indicates that, relative to full consensus scoring, multi-stage screening can maintain a high early enrichment ability while substantially reducing computation time. Multi-stage screening combined with consensus scoring exploits the ability of consensus scoring to integrate individual scoring functions and improve virtual screening, while greatly reducing the computing resources consumed by running the complete consensus scoring process. Its high early enrichment ability and low consumption of computing resources are of great importance for large-scale drug screening.
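The time savings can be illustrated with a simple cost model: every molecule is docked in the first stage, and each later stage docks only the fraction of the library that survived the earlier cutoffs. This model is an assumption and is not stated in the text, and the function name is hypothetical. With the single-function times from the avg time row of Table 3 as input, it reproduces the avg time values for exp_z_score and Plans A, C and E to within rounding.

```python
def expected_time(stage_times, cutoffs):
    """Mean docking time per library molecule for a staged screen.

    stage_times: seconds per molecule for each stage's docking run;
    cutoffs: percentage of the previous stage's molecules passed on.
    """
    fraction, total = 1.0, 0.0
    for stage, seconds in enumerate(stage_times):
        total += fraction * seconds          # only the surviving fraction is docked
        if stage < len(cutoffs):
            fraction *= cutoffs[stage] / 100
    return total


# avg time values from Table 3 (s/cpu/mol): idock (which also yields rf_score),
# vinardo, autodock4.
times = [5.13, 23.03, 69.19]
full = expected_time(times, [])          # every function on the whole library
plan_a = expected_time(times, [40, 50])
plan_c = expected_time(times, [20, 50])
plan_e = expected_time(times, [10, 50])
speedup_e = full / plan_e
```

Under this model, lowering x shrinks both later stages at once, which is why Plan E costs little more than twice the idock run alone.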

References

[1] Palacio-Rodríguez, K.; Lans, I.; Cavasotto, C. N.; Cossio, P. Exponential Consensus Ranking Improves the Outcome in Docking and Receptor Ensemble Docking. Sci Rep 2019, 9(1), 5142. https://doi.org/10.1038/s41598-019-41594-3.

[2] Gorgulla, C.; Boeszoermenyi, A.; Wang, Z.-F.; Fischer, P. D.; Coote, P. W.; Padmanabha Das, K. M.; Malets, Y. S.; Radchenko, D. S.; Moroz, Y. S.; Scott, D. A.; Fackeldey, K.; Hoffmann, M.; Iavniuk, I.; Wagner, G.; Arthanari, H. An Open-Source Drug Discovery Platform Enables Ultra-Large Virtual Screens. Nature 2020, 580(7805), 663–668. https://doi.org/10.1038/s41586-020-2117-z.

Claims

1. An optimization method for large-scale drug virtual screening, comprising the steps of:

the standard deviation calculation formula is as follows:

the scores of the molecular docking software and scoring functions thereof are divided into two classes according to the direction of the score: molecular docking software and scoring functions for which a larger score value indicates stronger binding of the small molecule compound to the protein belong to the first class; molecular docking software and scoring functions for which a larger score value indicates weaker binding of the small molecule compound to the protein belong to the second class; the score S_ij given by molecular docking software and scoring function i to small molecule compound j is Z-Score normalized within its score class to obtain the normalized score Z_ij;

S3, analyzing the virtual screening results

the ranking table R obtained in step S2 is taken as the final virtual screening result, wherein the higher a small molecule compound ranks in the ranking table R, the stronger its binding capacity to the protein.

2. An optimization method for large-scale drug virtual screening according to claim 1, characterized in that: in step S3, multi-stage virtual screening is carried out, which comprises the following steps: selecting the top x% of small molecule compounds in the ranking table R obtained in step S2 as the small molecule compound library for step S1 of the next-stage consensus scoring, and adding one or more new molecular docking software and scoring functions thereof to perform the molecular docking and consensus scoring of that stage, obtaining a ranking table R_2; selecting the top y% of small molecule compounds in the ranking table R_2 as the small molecule compound library for step S1 of the next-stage consensus scoring, and adding one or more new molecular docking software and scoring functions thereof to perform the molecular docking and consensus scoring of that stage, obtaining a ranking table R_3; then repeating these steps as needed to perform molecular docking and consensus scoring over multiple stages, obtaining the ranking table R_r of the final, r-th stage; the new molecular docking software and scoring functions thereof are molecular docking software and scoring functions whose enrichment capability in virtual screening has been verified.

3. An optimization method for large-scale drug virtual screening according to claim 1, characterized in that: the small molecule compound library L in step S1 is derived from the DUD-E database, and the protein structure is derived from the DUD-E database, the AlphaFold protein database or the PDB database.

4. An optimization method for large-scale drug virtual screening according to claim 1, characterized in that: the normalized score Z_ij described in step S2 falls into two classes, where the Z-Score normalization formula for scores of the first class is as follows:

5. An optimization method for large-scale drug virtual screening according to claim 1, characterized in that: the scoring function set M in step S1 consists of four scoring functions in total: the default scoring function and the built-in rf_score scoring function of the idock molecular docking software, and the vinardo scoring function and the AutoDock4 (ad4) scoring function of the AutoDock Vina molecular docking software.

6. An optimization method for large-scale drug virtual screening according to claim 2, characterized in that: the steps of the multi-stage screening are as follows: in the first stage, virtual screening is performed by docking with idock to obtain the result of the idock scoring function and the result of the rf_score scoring function, consensus scoring of these two results gives the ir ranking table, and the small molecules in the top x% of the ir ranking table enter the second stage; in the second stage, virtual screening is performed with the vinardo scoring function to obtain the result of the vinardo scoring function, consensus scoring is then carried out on the small molecules in the top x% of the ir ranking table, using their entries in the result of the idock scoring function and the result of the rf_score scoring function together with all small molecules in the result of the vinardo scoring function, to obtain the irv ranking table, and the small molecules in the top y% of the irv ranking table enter the third stage; in the third stage, virtual screening is performed with the autodock4 scoring function to obtain the result of the autodock4 scoring function, and consensus scoring is then carried out on the small molecules in the top y% of the irv ranking table, using their entries in the result of the idock scoring function, the result of the rf_score scoring function and the result of the vinardo scoring function together with all small molecules in the result of the autodock4 scoring function, to obtain the irva ranking table; wherein the values of the x and y parameters are 0-100.

7. An optimization method for large-scale drug virtual screening according to claim 6, characterized in that: x is 5-20 and y is 25-75.