CN111081321A

CN111081321A - CNS drug key feature identification method

Info

Publication number: CN111081321A
Application number: CN201911307432.6A
Authority: CN
Inventors: 丁彦蕊; 张瑞林
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2019-12-18
Filing date: 2019-12-18
Publication date: 2020-04-28
Anticipated expiration: 2039-12-18
Also published as: CN111081321B

Abstract

The invention discloses a method for identifying key features of CNS drugs and belongs to the field of computer-aided drug design. The method combines a support vector machine with a greedy algorithm: following the greedy idea, the feature that contributes least to improving the prediction result is deleted at each step, so that the key features distinguishing CNS drugs from non-CNS drug small molecules are accurately screened out. This is the first time a support vector machine and a greedy algorithm have been combined to identify the key features of CNS drugs. Because the key features are screened by stepwise deletion, the method takes the effect of feature combinations into account and avoids the difficulty of choosing initial features that arises in feature-addition methods. The screened key features effectively distinguish CNS drugs from non-CNS drug small molecules, which provides important guidance for designing CNS drug candidate small molecules.

Description

CNS drug key feature identification method

Technical Field

The invention relates to a CNS drug key feature identification method, and belongs to the field of computer-aided drug design.

Background

Currently, hundreds of millions of people worldwide are affected by diseases of the central nervous system (CNS). Because of the particular environment of the brain, the research and development of CNS drugs suffers from a low success rate, high cost and long development cycles, and new CNS drugs are urgently needed. Designing reliable CNS drug candidates can greatly shorten the cycle and reduce the cost of new drug development and significantly improve the success rate. Understanding how the features of CNS drugs differ from those of non-CNS drugs is a prerequisite for designing effective CNS drug candidates. Discovering the key features of CNS drugs therefore helps to explain what is specific to CNS drugs and to guide CNS drug design.

Regarding how to screen key features out of the large number of features of CNS drugs, Shahid M et al. (SVM base descriptor Selection and Classification of neurological Disease drug for pharmaceutical Modeling, Molecular Informatics, 2013, 32(3): 241-249) built a model with a support vector machine and ranked the features by scores calculated from the coefficient of each feature; this ranking can also be used for feature selection. However, deleting unimportant features on the basis of individual scores ignores the effect of feature combinations: some features perform poorly alone, yet two individually unimportant features may perform well together. Lu J et al. (Analysis of the acquisition target-based classification system using molecular descriptors. Combinatorial Chemistry & High Throughput Screening, 2016, 19(2): 129-135) added features one by one, starting from zero. In that approach, however, a single initial feature carries little information, and it is quite likely that the sensitivity (SEN) is 0% while the specificity (SPE) is 100%, or that SEN is 100% while SPE is 0%. In such cases there is no way to judge which feature should be kept, and the features chosen at the start strongly affect the predictive performance of the later feature combinations. If the incremental feature selection (IFS) algorithm instead starts from several features, those initial features may have to be determined by some other method.

Accurately finding the key features that separate CNS drugs from non-CNS drug small molecules is therefore of great help in designing CNS drug molecules and developing new CNS drugs.

Disclosure of Invention

In order to find out key characteristics between CNS drugs and non-CNS drug small molecules and further achieve the purpose of guiding CNS drug design, the invention provides an identification method for the key characteristics of the CNS drugs.

Optionally, the method includes:

step one, preliminarily screening out, from all the features of the CNS drugs and non-CNS drug small molecules, the features that are effective in distinguishing CNS drugs from non-CNS drug small molecules;

step two, constructing a support vector machine model using the features preliminarily screened in step one, and optimizing the parameters c and g to obtain an optimized support vector machine model;

step three, gradually deleting the features preliminarily screened in step one by using a greedy algorithm, and screening out, during the deletion process, the key features that distinguish CNS drugs from non-CNS drugs.

Optionally, assuming that the number of the features which are preliminarily screened in the first step and have the effect of distinguishing the CNS drug from the non-CNS drug is n; the third step includes:

3.1 delete each feature in turn, obtaining n different feature combinations: $\{a_2, a_3, a_4, \dots, a_n\}$, $\{a_1, a_3, a_4, \dots, a_n\}$, $\{a_1, a_2, a_4, \dots, a_n\}$, …, $\{a_1, a_2, a_3, \dots, a_{n-1}\}$;

3.2 use each of the n feature combinations as the input vectors of the optimized support vector machine model obtained in step two, obtain the prediction performance of each combination, and retain the feature combination with the best prediction performance;

3.3 repeat 3.1 to 3.2 on the n-1 features of the feature combination retained in 3.2, and keep looping until all n features have been deleted;

3.4 from the feature combinations retained during 3.1 to 3.3, select the feature combination with the best prediction performance as the key features for distinguishing CNS drugs from non-CNS drugs.

Optionally, the prediction performance comprises sensitivity SEN and specificity SPE; SEN represents the prediction rate of CNS drugs and SPE represents the prediction rate of non-CNS drugs.

Optionally, the feature combination with the best prediction performance retained in the step 3.2 includes:

respectively comparing the SEN value and the SPE value corresponding to each feature combination, and selecting the highest SEN value and SPE value;

if the highest SEN and SPE belong to the same feature combination, the feature combination is reserved;

and if the SEN and the SPE which are the highest belong to two different feature combinations, comprehensively determining the feature combination to be reserved according to the SEN and the SPE of each of the two different feature combinations.

Optionally, assuming that the highest SEN and the highest SPE belong to two different feature combinations A and B, respectively, comprehensively determining the feature combination to be retained according to the SEN and the SPE of each of the two feature combinations includes:

comparing the SPE of the feature combination A with the SEN of the feature combination B;

if the SPE of the feature combination A is larger than the SEN of the feature combination B, selecting and reserving the feature combination A;

if the SPE of the feature combination A is smaller than the SEN of the feature combination B, selecting and reserving the feature combination B;

and if the SPE of the feature combination A is equal to the SEN of the feature combination B, comparing the SEN of the feature combination A with the SPE of the feature combination B, and selecting and reserving the feature combination with the larger value.

Optionally, if the SEN of the feature combination A is also equal to the SPE of the feature combination B, the feature combination A or the feature combination B is randomly reserved.
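The selection rule above can be sketched in Python as follows. This is only an illustrative sketch: the names `Perf` and `pick_best` are hypothetical, and the handling of several combinations sharing the same highest value (the first one found is used) is an assumption that the text does not address.

```python
import random
from typing import NamedTuple, Optional, Sequence


class Perf(NamedTuple):
    sen: float  # sensitivity: rate of correctly predicted CNS drugs
    spe: float  # specificity: rate of correctly predicted non-CNS drugs


def pick_best(perfs: Sequence[Perf], rng: Optional[random.Random] = None) -> int:
    """Return the index of the feature combination to retain."""
    rng = rng or random.Random()
    max_sen = max(p.sen for p in perfs)
    max_spe = max(p.spe for p in perfs)
    # Case 1: one combination holds both the highest SEN and the highest SPE.
    for i, p in enumerate(perfs):
        if p.sen == max_sen and p.spe == max_spe:
            return i
    # Case 2: combination A has the highest SEN, combination B the highest SPE.
    a = next(i for i, p in enumerate(perfs) if p.sen == max_sen)
    b = next(i for i, p in enumerate(perfs) if p.spe == max_spe)
    # Compare the SPE of A with the SEN of B.
    if perfs[a].spe != perfs[b].sen:
        return a if perfs[a].spe > perfs[b].sen else b
    # If equal, compare the SEN of A with the SPE of B.
    if perfs[a].sen != perfs[b].spe:
        return a if perfs[a].sen > perfs[b].spe else b
    # Everything is tied: retain A or B at random.
    return rng.choice([a, b])
```

For example, with performances (0.90, 0.80) and (0.85, 0.95), combination A is the first and combination B the second; the SPE of A (0.80) is smaller than the SEN of B (0.85), so the second combination is retained.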

Optionally, in step one, the features that are effective in distinguishing CNS drugs from non-CNS drug small molecules are preliminarily screened out of all the features by a random forest algorithm, with the information gain ratio used as the attribute-splitting evaluation function.

Optionally, in step two, the optimized support vector machine model is obtained by exhaustive search.

The invention also provides a CNS drug molecule design method, which adopts the method to identify key characteristics of the CNS drug.

The invention has the beneficial effects that:

The method combines a support vector machine with a greedy algorithm: following the greedy idea, the feature that contributes least to improving the prediction result is deleted at each step, so that the key features distinguishing CNS drugs from non-CNS drug small molecules are accurately screened out. This is the first time a support vector machine and a greedy algorithm have been combined to identify the key features of CNS drugs. Because the key features are screened by stepwise deletion, the method takes the effect of feature combinations into account and avoids the difficulty of choosing initial features that arises in feature-addition methods. The screened key features effectively distinguish CNS drugs from non-CNS drug small molecules, which provides important guidance for designing CNS drug candidate small molecules.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.

The first embodiment is as follows:

This embodiment provides a method for identifying key features of CNS drugs based on a support vector machine and a greedy algorithm. The method combines the two: following the greedy idea, it gradually eliminates the features that contribute least to improving the prediction result, and thereby accurately screens out the key features that distinguish CNS drugs from non-CNS drug small molecules. It comprises the following steps:

Step (1): preliminary feature selection with a random forest algorithm:

A random forest model is constructed, the information gain ratio is used as the attribute-splitting evaluation function, and preliminary feature selection is performed.

Specifically, a random forest model containing 100 decision trees was constructed, and the information gain ratio was used as the attribute-splitting evaluation function, with reference to "Yaoyang, Yangjing, Jangjuan." To improve selection efficiency and performance, 2/3 of the samples and 1/2 of the features are randomly selected each time a decision tree is built; to prevent overfitting, a node stops splitting when it holds fewer than 5 unassigned samples.

All features that appear in the trees are counted; these are the preliminarily selected features.
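As an illustration of the splitting criterion, the information gain ratio of one candidate split might be computed as in the sketch below. It assumes a binary threshold split on a numeric feature, which the text does not specify; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Information gain ratio of splitting the samples at `threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    # Information gain: entropy before the split minus weighted entropy after it.
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    # Split information penalizes very unbalanced partitions.
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info
```

A feature whose values separate the two classes perfectly at a balanced threshold gets a gain ratio of 1, while a split that leaves one side empty is scored 0 here.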

Step (2): the support vector machine model parameters c and g are optimized by exhaustive search, where c is the penalty coefficient and g is the kernel parameter:

A classifier that distinguishes CNS drugs from non-CNS drug small molecules is constructed using the support vector machine algorithm with a radial basis kernel function in the LIBSVM package.

All combinations of c and g within the range $[2^{-4}, 2^{4}]$ are enumerated, and 5-fold cross-validation is performed for each combination to find the optimal combination of c and g.
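The exhaustive search might look like the following sketch. Two things are assumptions: the grid uses integer exponents from -4 to 4 (the text gives only the range, not the step), and `cv_score` is a hypothetical user-supplied function that trains an RBF support vector machine with the given c and g and returns its mean 5-fold cross-validation score (for instance by wrapping LIBSVM's cross-validation mode, option `-v 5`).

```python
import itertools
import random
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle the sample indices and return k (train, validation) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, validation))
    return splits


def grid_search(cv_score: Callable[[float, float], float],
                exponents: Sequence[int] = tuple(range(-4, 5))) -> Tuple[float, float, float]:
    """Try every (c, g) pair on the grid and return (best_score, best_c, best_g)."""
    best = (float("-inf"), 0.0, 0.0)
    for exp_c, exp_g in itertools.product(exponents, repeat=2):
        c, g = 2.0 ** exp_c, 2.0 ** exp_g
        score = cv_score(c, g)  # hypothetical callback, see the note above
        if score > best[0]:
            best = (score, c, g)
    return best
```

With nine exponents per parameter the grid holds 81 combinations, each scored once; `k_fold_indices` shows one way the 5 folds could be formed inside `cv_score`.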

The optimization objective of the support vector machine is to find a hyperplane that separates the CNS samples from the non-CNS samples as well as possible. The hyperplane is given by:

$$f(x) = w^{T}x + b$$

where x is the input feature vector, w is the normal vector of the hyperplane, and b is the offset.

w is obtained by the method of Lagrange multipliers, and a mapping function Φ is used to map the feature vectors $x_i$ and $x$ to a high-dimensional space, as shown in the following equation:

$$f(x) = \sum_{i=1}^{m} \lambda_i y_i \Phi^{T}(x_i)\Phi(x) + b$$

where $\lambda_i$ is the Lagrange multiplier, $y_i$ is the sample label associated with the feature vector $x_i$, m is the number of samples, and $1 \le i \le m$.

Without a kernel function, computation in the high-dimensional space is likely to cause a dimensionality explosion. To avoid this problem, the radial basis kernel function $K(x_i, x)$ is used in place of the explicit mapping $\Phi^{T}(x_i)\Phi(x)$, as follows:

$$K(x_i, x) = \exp\left(-g\,\lVert x_i - x\rVert^{2}\right)$$

Here the parameter g is a very important kernel parameter and has a great influence on the training of the model. The other important parameter is the penalty coefficient c, which affects the smoothness of the classification plane.
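A minimal sketch of the radial basis kernel and of a kernel decision function in the standard support vector machine form is given below. The function names and the toy values are illustrative only; in practice the multipliers and the offset come from training.

```python
import math
from typing import Sequence


def rbf_kernel(x_i: Sequence[float], x: Sequence[float], g: float) -> float:
    """K(x_i, x) = exp(-g * ||x_i - x||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-g * squared_distance)


def decision_function(x: Sequence[float],
                      support_vectors: Sequence[Sequence[float]],
                      labels: Sequence[int],
                      multipliers: Sequence[float],
                      b: float,
                      g: float) -> float:
    """f(x) = sum_i lambda_i * y_i * K(x_i, x) + b, with labels y_i in {-1, +1}."""
    return sum(lam * y * rbf_kernel(sv, x, g)
               for sv, y, lam in zip(support_vectors, labels, multipliers)) + b
```

The kernel equals 1 when the two vectors coincide and decays towards 0 as they move apart; a larger g makes the decay faster, so each training sample influences a smaller neighbourhood.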

The features preliminarily selected in step (1) are used as input vectors to calculate the prediction performance of the support vector machine models corresponding to all combinations of c and g; the c and g with the best prediction performance are selected and used as the parameters of the optimized support vector machine model.

Step (3) identifying key features by using greedy algorithm

And (3) respectively taking different combinations of the features preliminarily selected in the step (1) as input vectors of the support vector machine model optimized in the step (2), and screening out key features according to corresponding prediction performance.

Specifically, the method comprises the following steps:

S1: assume that n features were preliminarily selected in step (1): $\{a_1, a_2, a_3, a_4, \dots, a_n\}$;

S2: delete each feature in turn, obtaining n different feature combinations: $\{a_2, a_3, a_4, \dots, a_n\}$, $\{a_1, a_3, a_4, \dots, a_n\}$, $\{a_1, a_2, a_4, \dots, a_n\}$, …, $\{a_1, a_2, a_3, \dots, a_{n-1}\}$;

S3: use each feature combination from step S2 as the input vectors of the support vector machine model, retain the feature combination with the best prediction performance, and record its prediction performance $p_j$, $1 \le j \le n$;

S4: take the feature combination retained in step S3 and execute S2 and S3 on its n-1 features, looping through S2 to S4 in this way until all features have been deleted;

S5: the feature combination corresponding to the best $p_j$ among all the prediction performances $p_j$ recorded during S2 to S4 is the screened set of key features.
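Steps S1 to S5 can be sketched as the loop below. It is a simplified sketch under stated assumptions: `evaluate` is a hypothetical callback that scores a feature subset with the optimized support vector machine and returns a single number (standing in for the SEN/SPE comparison), and the loop stops when one feature remains, since an empty subset cannot be evaluated.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_backward_elimination(
    features: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[float, List[str]]:
    """Delete one feature per round and return the best (score, subset) seen."""
    if len(features) < 2:
        raise ValueError("need at least two features")
    current = list(features)                        # S1: the n starting features
    history: List[Tuple[float, List[str]]] = []
    while len(current) > 1:
        # S2: build every subset obtained by deleting exactly one feature.
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        # S3: score each subset and retain the best one.
        scored = [(evaluate(subset), subset) for subset in candidates]
        best = max(scored, key=lambda item: item[0])
        history.append(best)
        current = best[1]                           # S4: continue from the retained subset
    # S5: the best retained subset over all rounds is the set of key features.
    return max(history, key=lambda item: item[0])
```

Because every round re-scores whole subsets rather than individual features, a feature that is only useful together with another one is kept as long as removing it lowers the score of the combination.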

The prediction performance $p_j$ in step (2) and step (3) above comprises the sensitivity SEN and the specificity SPE:

sensitivity SEN, i.e. the positive-sample prediction rate (the rate at which CNS drugs are correctly predicted);

specificity SPE, i.e. the negative-sample prediction rate (the rate at which non-CNS drugs are correctly predicted);

The constructed CNS drug recognition model is evaluated by the sensitivity SEN and the specificity SPE together; the larger these values, the better the performance of the model.
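The two measures can be computed from true and predicted labels as in this short sketch, which assumes CNS drugs are labelled 1 and non-CNS drugs 0 (the labelling convention and the function name are illustrative):

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int],
                            positive: int = 1) -> Tuple[float, float]:
    """Return (SEN, SPE) for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1   # CNS drug predicted as CNS
            else:
                fn += 1   # CNS drug missed
        else:
            if pred == positive:
                fp += 1   # non-CNS drug predicted as CNS
            else:
                tn += 1   # non-CNS drug predicted as non-CNS
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)
```

A model that labels every molecule as a CNS drug gets SEN = 100% and SPE = 0%, which is why the two values have to be judged together.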

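Specifically, when determining the best $p_j$ among the prediction performances $p_j$, if the highest SEN and the highest SPE belong to the same feature combination, that feature combination is retained.

If the highest SEN and the highest SPE belong to different feature combinations, the feature combination to retain is selected according to the magnitudes of the corresponding SPE and SEN. For example, if the highest SEN and SPE belong to feature combinations A and B respectively, that is, the SEN of feature combination A is the highest and the SPE of feature combination B is the highest, then the SPE of feature combination A is compared with the SEN of feature combination B:

if the SPE of feature combination A is larger than the SEN of feature combination B, feature combination A is retained;

if the SPE of feature combination A is smaller than the SEN of feature combination B, feature combination B is retained;

if the two are equal, the SEN of feature combination A is compared with the SPE of feature combination B, and the feature combination with the larger value is retained;

if these are also equal, feature combination A or feature combination B is retained at random.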

To verify that the method provided by the application can effectively identify the key features of CNS drugs, existing CNS drugs and non-CNS drug small molecules are taken as experimental objects; the data come from the ZINC15 (http://zinc15.docking.org/) and DrugBank (https://www.drugbank.ca/) databases.

The inventors downloaded drug data in SDF format from the two databases (879 non-CNS drug small molecules and 273 CNS drug small molecules), including the initial coordinates of all atoms in each drug molecule and the bond types between atoms. The drug data were used as input to the PaDEL software to compute all feature values of each drug molecule; in this example, 1875 feature values were calculated for each drug molecule.

Step (1): a random forest model was constructed to select the features that contribute to distinguishing CNS drugs from non-CNS drug small molecules; 941 useful features were selected from the 1875 features.

Step (2): the support vector machine model parameters c and g were optimized by exhaustive search:

355 of the 879 non-CNS drug small molecules were randomly selected as negative samples, and the 273 CNS drug small molecules were taken as positive samples. The support vector machine parameters c and g were each varied within the range $[2^{-4}, 2^{4}]$, the combination of c and g giving the best 5-fold cross-validation result was searched for, and a test set was used to test the generalization performance of the model. This process was repeated 5 times, with the results shown in Table 1 below. In all 5 repetitions the positive samples were the same 273 CNS drugs, while the negative samples were a fresh random selection of 355 of the 879 non-CNS drug small molecules each time.
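The repeated sampling might be organised as in the following sketch. The function name, the seed and the use of integer identifiers for molecules are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def draw_samples(positives: Sequence[int], negatives: Sequence[int],
                 n_negative: int = 355, repeats: int = 5,
                 seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Keep all positives and draw a fresh random subset of negatives per repeat."""
    rng = random.Random(seed)
    return [(list(positives), rng.sample(list(negatives), n_negative))
            for _ in range(repeats)]


# 273 CNS drugs (positives) and 879 non-CNS drugs (negatives), identified by index.
samples = draw_samples(range(273), range(273, 273 + 879))
```

Each of the 5 samples then contains 273 + 355 = 628 molecules, with the same positives and a different random subset of negatives.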

The 941 features of each drug small molecule were used as the input vectors of the support vector machine models corresponding to the different combinations of c and g in order to evaluate their prediction performance.

TABLE 1 support vector machine model parameters and Performance on external test set

As can be seen from Table 1, the support vector machine model with the parameters c and g obtained for sample 3 has the best prediction performance. The model with this c and g is therefore used for key feature screening, and the corresponding 355 randomly selected non-CNS drug small molecules and 273 CNS drug small molecules are used as the samples for screening the key features.

Step (3): key features were identified with the greedy algorithm:

The 941 features of the 355 randomly selected non-CNS drug small molecules and the 273 CNS drug small molecules determined in step (2) were used as the input vectors of the optimized support vector machine model to select key features, specifically:

S1: the 941 features are $\{a_1, a_2, a_3, a_4, \dots, a_n\}$, $n = 941$;

S2: each feature is deleted in turn, giving 941 different feature combinations: $\{a_2, a_3, a_4, \dots, a_n\}$, $\{a_1, a_3, a_4, \dots, a_n\}$, $\{a_1, a_2, a_4, \dots, a_n\}$, …, $\{a_1, a_2, a_3, \dots, a_{n-1}\}$;

S3: each feature combination from step S2 is used as the input vectors of the support vector machine model, the feature combination with the best prediction performance is retained, and the sensitivity SEN and specificity SPE of every feature combination are recorded;

The SEN and SPE values of the 941 feature combinations are compared; if the highest SEN and the highest SPE belong to the same feature combination, that feature combination is retained.

If the highest SEN and the highest SPE belong to different feature combinations, the feature combination to retain is selected according to the magnitudes of the corresponding SPE and SEN. For example, if the highest SPE belongs to the 800th feature combination and the highest SEN belongs to the 900th feature combination, the SEN of the 800th feature combination is compared with the SPE of the 900th feature combination:

if the SEN of the 800th feature combination is larger than the SPE of the 900th feature combination, the 800th feature combination is retained;

if the SEN of the 800th feature combination is smaller than the SPE of the 900th feature combination, the 900th feature combination is retained;

if the SEN of the 800th feature combination is equal to the SPE of the 900th feature combination, the SPE of the 800th feature combination is compared with the SEN of the 900th feature combination:

-if the SPE of the 800th feature combination is larger than the SEN of the 900th feature combination, the 800th feature combination is retained;

-if the SPE of the 800th feature combination is smaller than the SEN of the 900th feature combination, the 900th feature combination is retained;

-if the SPE of the 800th feature combination is equal to the SEN of the 900th feature combination, the 800th or the 900th feature combination is retained at random.

S4: the feature combination retained in step S3 is taken and S2 and S3 are executed on its 940 features, looping through S2 to S4 in this way until all features have been deleted;

S5: the feature combination corresponding to the best $p_j$ among all the prediction performances $p_j$ recorded during S2 to S4 is the screened set of key features.
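To give a sense of the computational cost of this loop, the number of feature combinations that have to be scored can be counted as below. The count assumes that the final round, which would leave an empty combination, is skipped; that detail is not stated in the text.

```python
def count_evaluations(n_features: int) -> int:
    """Number of feature combinations scored by the stepwise deletion loop."""
    # A round that starts with k features scores k combinations of k - 1 features.
    # Rounds run for k = n, n - 1, ..., 2 (k = 1 would leave an empty combination).
    return sum(range(2, n_features + 1))


total = count_evaluations(941)  # 941 * 942 / 2 - 1 = 443210
```

Starting from 941 features, the support vector machine is therefore evaluated on roughly 4.4 × 10^5 feature combinations.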

Through the above loop, 40 key features were finally screened out in this embodiment. With these 40 key features as the input variables of the model, the SEN and SPE of the test results both exceed 94%. The 40 key features are shown in Table 2 below:

TABLE 2 screened 40 key features and their description

Some steps in the embodiments of the present invention may be implemented by software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A CNS drug key feature identification method is characterized in that a support vector machine and a greedy algorithm are combined, the feature with the minimum effect on improving a prediction result is gradually deleted by the greedy algorithm, and then key features for distinguishing CNS drugs from non-CNS drug small molecules are accurately screened out.

2. The method according to claim 1, characterized in that it comprises:

3. The method of claim 2, wherein n is assumed as the number of features selected in step one as having an effect of distinguishing CNS drugs from non-CNS drugs; the third step includes:

4. The method of claim 3, wherein the predictive performance includes sensitivity SEN and specificity SPE; SEN represents the prediction rate of CNS drugs and SPE represents the prediction rate of non-CNS drugs.

5. The method of claim 4, wherein the step of retaining the feature combination with the best prediction performance in 3.2 comprises:

6. The method of claim 5, wherein the step of comprehensively determining the combination of features to be preserved according to the SEN and the SPE of each of the two different feature combinations, assuming that the highest SEN and SPE belong to the two different feature combinations A and B, respectively, comprises:

7. The method of claim 6, wherein feature combination A or feature combination B is randomly retained if the SPE and SEN of the two feature combinations are equal.

8. The method as claimed in any one of claims 2 to 7, wherein the first step is to preliminarily select the features having an effect of distinguishing the CNS drug from non-CNS drug small molecules, and to perform the preliminary feature selection by using a random forest algorithm and using an information gain ratio as an attribute classification evaluation function.

9. The method according to any one of claims 2 to 8, wherein the second step adopts an exhaustive method to obtain the optimized support vector machine model.

10. A method for CNS drug molecule design, wherein said design method identifies key features of CNS drugs using the method of any of claims 1-9.