CN111081321A - CNS drug key feature identification method - Google Patents
- Publication number
- CN111081321A (application CN201911307432.6A)
- Authority
- CN
- China
- Prior art keywords
- cns
- feature
- feature combination
- sen
- spe
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/50—Molecular design, e.g. of drugs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Abstract
The invention discloses a method for identifying key features of CNS drugs and belongs to the field of computer-aided drug design. The method combines a support vector machine with a greedy algorithm: following the greedy idea, the feature that contributes least to improving the prediction result is deleted at each step, so that the key features distinguishing CNS drugs from non-CNS drug small molecules are screened out accurately. This is the first application of a combined support vector machine and greedy algorithm to the identification of key CNS drug features. Because the key features are screened by stepwise deletion, the method takes the effect of feature combinations into account and avoids the difficulty of choosing initial features that arises in feature-addition methods. The screened key features effectively distinguish CNS drugs from non-CNS drug small molecules, and the method provides important guidance for designing CNS drug candidate small molecules.
Description
Technical Field
The invention relates to a CNS drug key feature identification method, and belongs to the field of computer-aided drug design.
Background
Currently, hundreds of millions of people worldwide are affected by diseases of the central nervous system (CNS). Because of the particular nature of the brain environment, the research and development of CNS drugs suffers from a low success rate, high cost and long development cycles, and new CNS drugs are urgently needed. Designing reliable CNS drug candidates can greatly shorten the cycle and reduce the cost of new drug development and significantly improve the success rate. Understanding how the features of CNS drugs differ from those of non-CNS drugs is a prerequisite for designing effective CNS drug candidates. Discovering the key features of CNS drugs therefore helps to understand what is specific to CNS drugs and to guide CNS drug design.
On the question of how to screen key features out of the large number of features of CNS drugs, Shahid M et al. (SVM Based Descriptor Selection and Classification of Neurodegenerative Disease Drugs for Pharmacological Modeling, Molecular Informatics, 2013, 32(3): 241-249) build a model with a support vector machine and rank the features by scores computed from the coefficient of each feature; the ranking can also be used for feature selection. However, deleting unimportant features based on the scores of individual features ignores the effect of feature combinations: some features do not work well alone, yet two individually unimportant features may work well in combination. Lu J et al. (Analysis of the acquisition target-based classification system using molecular descriptors. Combinatorial Chemistry & High Throughput Screening, 2016, 19(2): 129-135) add features one by one starting from zero. In this approach an initial single feature carries little information, so results such as SEN = 0% with SPE = 100%, or SEN = 100% with SPE = 0%, are likely; in such cases there is no way to judge which feature should be kept, and the feature selected at the start strongly influences the prediction performance of the subsequent feature combinations. If the incremental feature selection (IFS) algorithm is instead started from several features, those initial features may have to be determined by other methods.
Therefore, accurately finding the key features that distinguish CNS drugs from non-CNS drug small molecules is of great help in designing CNS drug molecules and developing new CNS drugs.
Disclosure of Invention
In order to find out key characteristics between CNS drugs and non-CNS drug small molecules and further achieve the purpose of guiding CNS drug design, the invention provides an identification method for the key characteristics of the CNS drugs.
Optionally, the method includes:
step one, preliminarily screening out, from all features of the CNS drugs and non-CNS drug small molecules, the features that are effective in distinguishing CNS drugs from non-CNS drug small molecules;
step two, constructing a support vector machine model using the features preliminarily screened in step one, and optimizing the parameters c and g to obtain an optimized support vector machine model;
and step three, gradually deleting the features preliminarily screened in step one by using a greedy algorithm, and screening out, during the deletion process, the key features that distinguish CNS drugs from non-CNS drugs.
Optionally, assuming that the number of features preliminarily screened in step one is n, step three includes:
3.1 deleting each feature in turn, one at a time, to obtain n different feature combinations: $\{a_2, a_3, a_4, \ldots, a_n\}$, $\{a_1, a_3, a_4, \ldots, a_n\}$, $\{a_1, a_2, a_4, \ldots, a_n\}$, …, $\{a_1, a_2, a_3, a_4, \ldots, a_{n-1}\}$;
3.2 taking the n different feature combinations as input vectors of the optimized support vector machine model obtained in step two to obtain the prediction performance corresponding to each of the n feature combinations, and retaining the feature combination with the best prediction performance;
3.3 executing 3.1 to 3.2 on the n-1 features of the feature combination with the best prediction performance obtained in 3.2, and looping until n features are deleted;
3.4 selecting, from the feature combinations retained while executing 3.1 to 3.3, the combination of key features that distinguishes CNS drugs from non-CNS drugs.
Optionally, the prediction performance comprises sensitivity SEN and specificity SPE; SEN represents the prediction rate of CNS drugs and SPE represents the prediction rate of non-CNS drugs.
Optionally, retaining the feature combination with the best prediction performance in 3.2 includes:
comparing the SEN values and the SPE values of all feature combinations, and identifying the highest SEN value and the highest SPE value;
if the highest SEN and the highest SPE belong to the same feature combination, retaining that feature combination;
and if the highest SEN and the highest SPE belong to two different feature combinations, determining the feature combination to be retained from the SEN and the SPE of both feature combinations.
Optionally, assuming that the highest SEN and the highest SPE belong to two different feature combinations A and B, respectively, determining the feature combination to be retained from the SEN and the SPE of both feature combinations includes:
comparing the SPE of feature combination A with the SEN of feature combination B;
if the SPE of feature combination A is larger than the SEN of feature combination B, retaining feature combination A;
if the SPE of feature combination A is smaller than the SEN of feature combination B, retaining feature combination B;
and if the SPE of feature combination A is equal to the SEN of feature combination B, comparing the SEN of feature combination A with the SPE of feature combination B and retaining the feature combination with the larger value.
Optionally, if the SEN of feature combination A and the SPE of feature combination B are also equal, feature combination A or feature combination B is retained at random.
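The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pick_best` and the `(name, sen, spe)` tuple layout are hypothetical, and when several combinations share the highest SEN or SPE the sketch simply takes the first one, a case the text does not address.

```python
import random

def pick_best(combos, rng=random):
    """combos: list of (name, sen, spe) tuples. Returns the name to retain."""
    a = max(combos, key=lambda c: c[1])  # combination A: highest SEN
    b = max(combos, key=lambda c: c[2])  # combination B: highest SPE
    if a[0] == b[0]:
        return a[0]  # the same combination is best on both metrics
    # First comparison: SPE of A against SEN of B.
    if a[2] != b[1]:
        return a[0] if a[2] > b[1] else b[0]
    # Tie: compare SEN of A against SPE of B.
    if a[1] != b[2]:
        return a[0] if a[1] > b[2] else b[0]
    # Still tied: retain either combination at random.
    return rng.choice([a[0], b[0]])
```

For instance, with candidates `("x", 0.90, 0.80)` and `("y", 0.85, 0.95)`, combination x has the highest SEN and y the highest SPE; the SPE of x (0.80) is below the SEN of y (0.85), so y is retained.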
Optionally, in step one, the features that are effective in distinguishing CNS drugs from non-CNS drug small molecules are preliminarily screened out of all features by a random forest algorithm, with the information gain ratio as the attribute-splitting evaluation function.
Optionally, in step two, the optimized support vector machine model is obtained by an exhaustive search.
The invention also provides a CNS drug molecule design method, which adopts the method to identify key characteristics of the CNS drug.
The invention has the beneficial effects that:
By combining a support vector machine with a greedy algorithm and following the greedy idea, the feature that contributes least to improving the prediction result is deleted at each step, so that the key features distinguishing CNS drugs from non-CNS drug small molecules are screened out accurately. This is the first application of a combined support vector machine and greedy algorithm to the identification of key CNS drug features. Because the key features are screened by stepwise deletion, the method takes the effect of feature combinations into account and avoids the difficulty of choosing initial features that arises in feature-addition methods. The screened key features effectively distinguish CNS drugs from non-CNS drug small molecules, and the method provides important guidance for designing CNS drug candidate small molecules.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
The first embodiment is as follows:
This embodiment provides a method for identifying key features of CNS drugs based on a support vector machine and a greedy algorithm. The method combines the two: following the greedy idea, it gradually eliminates the feature that contributes least to improving the prediction result, and thereby accurately screens out the key features that distinguish CNS drugs from non-CNS drug small molecules. The method comprises the following steps:
Step (1) performs preliminary feature selection with a random forest algorithm:
A random forest model is constructed, and the information gain ratio is used as the attribute-splitting evaluation function to perform preliminary feature selection.
Specifically, a random forest model containing 100 decision trees is constructed, and the information gain ratio is used as the attribute-splitting evaluation function with reference to "Yaoyang, Yangjing, Jangjuan. To improve selection efficiency and performance, 2/3 of the samples and 1/2 of the features are randomly selected each time to construct a decision tree; to prevent overfitting, a node stops splitting when the number of unassigned samples is less than 5.
All features that appear in the trees are then collected; these are the preliminarily selected features.
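The text names the information gain ratio as the splitting criterion but does not spell out its formula. A minimal sketch, assuming the common C4.5-style definition (information gain divided by split information), could look as follows; the function names `entropy` and `gain_ratio` are illustrative and not taken from the patent.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def gain_ratio(labels, groups):
    """Gain ratio of splitting `labels` into the label lists in `groups`."""
    n = len(labels)
    groups = [g for g in groups if g]  # ignore empty branches
    # Information gain: parent entropy minus weighted child entropy.
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
    # Split information penalizes splits into many small branches.
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups)
    return gain / split_info if split_info > 0 else 0.0
```

A split that separates the two classes perfectly scores 1.0 under this definition, while a split that leaves each branch with the same class mix as the parent scores 0.0.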
Step (2) optimizes the support vector machine model parameters c and g by an exhaustive search, where c is the penalty coefficient and g is the kernel parameter:
A classifier that identifies CNS drugs and non-CNS drug small molecules is constructed using the support vector machine algorithm with a radial basis kernel function in the LIBSVM package.
All combinations of c and g within the range $[2^{-4}, 2^{4}]$ are enumerated, and 5-fold cross-validation is performed for each combination to find the optimal combination of c and g.
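The exhaustive search can be sketched as a plain grid loop. Two things here are assumptions rather than details from the text: the grid steps through integer exponents of 2 (the step size is not stated), and `cv_score` is a hypothetical callback standing in for a 5-fold cross-validation run of the LIBSVM classifier.

```python
from itertools import product

def exhaustive_search(cv_score, exponents=range(-4, 5)):
    """Try every (c, g) = (2**i, 2**j) and return (best_c, best_g, best_score)."""
    best = None
    for i, j in product(exponents, repeat=2):
        c, g = 2.0 ** i, 2.0 ** j
        score = cv_score(c, g)  # e.g. mean 5-fold cross-validation performance
        if best is None or score > best[0]:
            best = (score, c, g)  # keep the first pair reaching the top score
    return best[1], best[2], best[0]

# Toy scoring function (assumption) whose optimum is at c = 4, g = 0.25.
toy = lambda c, g: -abs(c - 4.0) - abs(g - 0.25)
best_c, best_g, best_score = exhaustive_search(toy)
```

With nine exponents per parameter the loop evaluates 81 combinations, each of which would cost one full cross-validation in a real run.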
The optimization objective of the support vector machine is to find a hyperplane that separates CNS samples from non-CNS samples as well as possible:
$$f(x) = w^{T}x + b$$
where $x$ is the input feature vector, $w$ is the normal vector of the hyperplane, and $b$ is the offset.
$w$ is obtained by the Lagrange multiplier method; a mapping function $\Phi$ maps the feature vectors $x_i$ and $x$ to a high-dimensional space, as shown in the following equation:
$$f(x) = \sum_{i=1}^{m} \lambda_i y_i \Phi^{T}(x_i)\Phi(x) + b$$
where $\lambda_i$ is the Lagrange multiplier, $y_i$ is the sample label associated with the feature vector $x_i$, $m$ is the number of samples, and $1 \le i \le m$.
Without a kernel function, computing in the high-dimensional space can lead to a dimensional explosion. To avoid this problem, the radial basis kernel function $K(x_i, x)$ is used in place of the explicit mapping $\Phi^{T}(x_i)\Phi(x)$:
$$K(x_i, x) = \exp\left(-g\,\lVert x_i - x \rVert^{2}\right)$$
Here $g$ is an important kernel parameter with a strong influence on the training of the model. The other important parameter is the penalty coefficient $c$, which affects the smoothness of the classification surface.
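As a small numerical illustration, the radial basis kernel and a decision value built from it can be written directly in Python. The sketch assumes the usual dual-form decision function $f(x) = \sum_i \lambda_i y_i K(x_i, x) + b$ with the symbols $\lambda_i$, $y_i$, $b$ and $g$ used in the text; the function names and argument layout are hypothetical.

```python
import math

def rbf_kernel(xi, x, g):
    """K(xi, x) = exp(-g * ||xi - x||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-g * sq_dist)

def decision_value(x, support_vectors, lambdas, labels, b, g):
    """Sum of lambda_i * y_i * K(x_i, x) over the given vectors, plus offset b."""
    return sum(lam * y * rbf_kernel(sv, x, g)
               for sv, lam, y in zip(support_vectors, lambdas, labels)) + b
```

A larger g makes the kernel value fall off faster with distance, so each training sample influences a smaller neighbourhood; this is one way to see why g affects training so strongly.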
With the features preliminarily selected in step (1) as input vectors, the prediction performance of the support vector machine model is calculated for every combination of c and g; the pair of c and g with the best prediction performance is selected and used as the parameters of the optimized support vector machine model.
Step (3) identifies key features using the greedy algorithm:
Different combinations of the features preliminarily selected in step (1) are used in turn as input vectors of the support vector machine model optimized in step (2), and the key features are screened out according to the corresponding prediction performance.
Specifically, the method comprises the following steps:
S1: assume that n features are preliminarily selected in step (1): $\{a_1, a_2, a_3, a_4, \ldots, a_n\}$;
S2: delete each feature in turn, one at a time, giving n different feature combinations: $\{a_2, a_3, a_4, \ldots, a_n\}$, $\{a_1, a_3, a_4, \ldots, a_n\}$, $\{a_1, a_2, a_4, \ldots, a_n\}$, …, $\{a_1, a_2, a_3, a_4, \ldots, a_{n-1}\}$;
S3: use each feature combination from step S2 as the input vector of the support vector machine model, retain the feature combination with the best prediction performance, and record its prediction performance $p_j$, $1 \le j \le n$;
S4: take the feature combination retained in step S3 and execute S2 and S3 on its n-1 features, looping through S2 to S4 in this way until all features have been deleted;
S5: among all prediction performances $p_j$ recorded during S2 to S4, the feature combination corresponding to the best $p_j$ gives the screened key features.
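Steps S1 to S5 amount to greedy backward elimination, which can be sketched as below. The sketch makes two simplifying assumptions that are not in the text: `evaluate` is a hypothetical callback that returns a single comparable score per feature subset (the patent compares SEN and SPE instead), and the loop stops when one feature remains, since an empty feature set cannot be fed to a classifier.

```python
def greedy_backward_elimination(features, evaluate):
    """Return (best_score, best_subset) over all rounds of one-at-a-time deletion."""
    current = list(features)
    history = []  # the subset retained in each round, with its score
    while len(current) > 1:
        # S2: every subset obtained by deleting exactly one feature.
        candidates = [current[:k] + current[k + 1:] for k in range(len(current))]
        # S3: score each candidate and keep the best one.
        best_score, best_subset = max(
            ((evaluate(c), c) for c in candidates), key=lambda t: t[0])
        history.append((best_score, best_subset))
        current = best_subset  # S4: continue from the retained subset
    # S5: the best-scoring retained subset is the set of key features.
    return max(history, key=lambda t: t[0])

# Toy score (assumption): reward the informative features "a" and "c",
# and penalize subset size.
toy = lambda subset: 10 * len({"a", "c"} & set(subset)) - len(subset)
score, key_features = greedy_backward_elimination(["a", "b", "c", "d"], toy)
```

Each round evaluates as many subsets as there are remaining features, so the total number of model evaluations grows roughly as n²/2 for n starting features.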
The prediction performance $p_j$ in step (2) and step (3) above includes the sensitivity SEN and the specificity SPE:
sensitivity SEN, i.e. the positive sample prediction rate (CNS drug prediction rate);
specificity SPE, i.e. the negative sample prediction rate (non-CNS drug prediction rate);
the constructed CNS drug recognition model is evaluated by the sensitivity SEN and the specificity SPE together; the larger both values are, the better the model performs.
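Using the standard definitions SEN = TP / (TP + FN) and SPE = TN / (TN + FP), both rates can be computed from true and predicted labels as in the following sketch. The label encoding (1 for CNS drugs, 0 for non-CNS drugs) and the function name `sen_spe` are assumptions made for illustration.

```python
def sen_spe(y_true, y_pred, positive=1):
    """Return (SEN, SPE) for binary labels; `positive` marks CNS drugs."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)
```

A classifier that labels every molecule as a CNS drug gets SEN = 100% and SPE = 0%, which is why the two rates have to be read together.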
In particular, when determining the best $p_j$ among the prediction performances $p_j$, if the highest SEN and the highest SPE belong to the same feature combination, that feature combination is retained.
If the highest SEN and the highest SPE belong to different feature combinations, the feature combination to retain is chosen according to the corresponding SPE and SEN values. For example, if the highest SEN and SPE belong to feature combinations A and B respectively, that is, feature combination A has the highest SEN and feature combination B has the highest SPE, the SPE of feature combination A is compared with the SEN of feature combination B:
if the SPE of feature combination A is larger than the SEN of feature combination B, feature combination A is retained;
if the SPE of feature combination A is smaller than the SEN of feature combination B, feature combination B is retained;
and if the SPE of feature combination A is equal to the SEN of feature combination B, the SEN of feature combination A is compared with the SPE of feature combination B, and the feature combination with the larger value is retained.
If the SEN of feature combination A and the SPE of feature combination B are also equal, feature combination A or feature combination B is retained at random.
To verify that the method provided by the application can effectively identify the key features of CNS drugs, existing CNS drugs and non-CNS drug small molecules are taken as experimental objects; the data are derived from the ZINC15 (http://zinc15.docking.org/) and DrugBank (https://www.drugbank.ca/) databases.
The inventors downloaded drug data in SDF format from the two databases (879 non-CNS drug small molecules and 273 CNS drug small molecules); the data include the initial coordinates of all atoms in each drug molecule and the bond type information between atoms. The drug data were used as input to the PaDEL software to compute all feature values of each drug molecule; in this example, 1875 feature values were calculated for each drug molecule.
Step (1) performs preliminary feature selection with a random forest algorithm:
A random forest model was constructed to select the features that contribute to discriminating CNS drugs from non-CNS drug small molecules; 941 useful features were selected from the 1875 features.
Step (2) optimizes the support vector machine model parameters c and g by exhaustive search:
355 of the 879 non-CNS drug small molecules were randomly selected as negative samples, and the 273 CNS drug small molecules were taken as positive samples. The parameters c and g of the support vector machine were each restricted to the range $[2^{-4}, 2^{4}]$, the best 5-fold cross-validation result was searched over all combinations of c and g, and a test set was used to test the generalization performance of the model. This process was repeated 5 times, and the results are shown in Table 1 below. The positive samples were the same 273 CNS drugs in all 5 samples, whereas the negative samples were 355 non-CNS drug small molecules drawn at random from the 879 each time, so the negative samples differ between samples.
The 941 features of each drug small molecule were used as input vectors of the support vector machine models corresponding to the different combinations of c and g to predict performance.
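The repeated sampling described above (the same 273 positives each time, a fresh random draw of 355 negatives out of 879) can be sketched with the standard library. The molecule identifiers below are placeholders and the fixed seed is an assumption added for reproducibility; neither comes from the text.

```python
import random

def draw_sample_sets(positives, negatives, n_neg=355, repeats=5, seed=0):
    """Return `repeats` (positives, negatives) pairs with fresh negative draws."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    sets = []
    for _ in range(repeats):
        chosen = rng.sample(negatives, n_neg)  # sampling without replacement
        sets.append((list(positives), chosen))
    return sets

cns = [f"cns_{i}" for i in range(273)]      # placeholder IDs for CNS drugs
non_cns = [f"non_{i}" for i in range(879)]  # placeholder IDs for non-CNS drugs
sample_sets = draw_sample_sets(cns, non_cns)
```

Undersampling the negatives brings the class ratio from roughly 3:1 down to about 1.3:1, which presumably keeps SEN and SPE from being dominated by the majority class.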
TABLE 1 support vector machine model parameters and Performance on external test set
As can be seen from Table 1, the support vector machine model with the parameters c and g obtained for sample 3 has the best prediction performance. That model, with that pair of c and g, is therefore used for key feature screening, and the corresponding randomly selected 355 non-CNS drug small molecules and the 273 CNS drug small molecules are used as the samples for screening key features.
Step (3) identifies key features using the greedy algorithm:
The 941 features of the randomly selected 355 non-CNS drug small molecules and the 273 CNS drug small molecules determined in step (2) are used as input vectors of the optimized support vector machine model to screen key features, specifically:
S1: the 941 features are $\{a_1, a_2, a_3, a_4, \ldots, a_n\}$, with n = 941;
S2: delete each feature in turn, one at a time, giving 941 different feature combinations: $\{a_2, a_3, a_4, \ldots, a_n\}$, $\{a_1, a_3, a_4, \ldots, a_n\}$, $\{a_1, a_2, a_4, \ldots, a_n\}$, …, $\{a_1, a_2, a_3, a_4, \ldots, a_{n-1}\}$;
S3: use each feature combination from step S2 as the input vector of the support vector machine model, retain the feature combination with the best prediction performance, and record the sensitivity SEN and specificity SPE of each feature combination;
The SEN and SPE values of the 941 feature combinations are compared; if the highest SEN and the highest SPE belong to the same feature combination, that feature combination is retained.
If the highest SEN and the highest SPE belong to different feature combinations, the feature combination to retain is chosen according to the corresponding SPE and SEN values. For example, if the highest SPE belongs to the 800th feature combination and the highest SEN belongs to the 900th feature combination, the SEN of the 800th feature combination is compared with the SPE of the 900th feature combination:
if the SEN of the 800th feature combination is larger than the SPE of the 900th feature combination, the 800th feature combination is retained;
if the SEN of the 800th feature combination is smaller than the SPE of the 900th feature combination, the 900th feature combination is retained;
if the SEN of the 800th feature combination is equal to the SPE of the 900th feature combination, the SPE of the 800th feature combination is compared with the SEN of the 900th feature combination:
- if the SPE of the 800th feature combination is larger than the SEN of the 900th feature combination, the 800th feature combination is retained;
- if the SPE of the 800th feature combination is smaller than the SEN of the 900th feature combination, the 900th feature combination is retained;
- if the SPE of the 800th feature combination is equal to the SEN of the 900th feature combination, the 800th or the 900th feature combination is retained at random.
S4: take the feature combination retained in step S3 and execute S2 and S3 on its 940 features, looping through S2 to S4 in this way until all features have been deleted;
S5: among all prediction performances $p_j$ recorded during S2 to S4, the feature combination corresponding to the best $p_j$ gives the screened key features.
Through the above loop, 40 key features were finally screened out in this embodiment. With these 40 key features as the input variables of the model, both SEN and SPE of the test result exceed 94%. The 40 key features are listed in Table 2 below:
TABLE 2 screened 40 key features and their description
Some steps in the embodiments of the present invention may be implemented by software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (10)
1. A CNS drug key feature identification method is characterized in that a support vector machine and a greedy algorithm are combined, the feature with the minimum effect on improving a prediction result is gradually deleted by the greedy algorithm, and then key features for distinguishing CNS drugs from non-CNS drug small molecules are accurately screened out.
2. The method according to claim 1, characterized in that it comprises:
step one, preliminarily screening out, from all features of the CNS drugs and non-CNS drug small molecules, the features that are effective in distinguishing CNS drugs from non-CNS drug small molecules;
step two, constructing a support vector machine model using the features preliminarily screened in step one, and optimizing the parameters c and g to obtain an optimized support vector machine model;
and step three, gradually deleting the features preliminarily screened in step one by using a greedy algorithm, and screening out, during the deletion process, the key features that distinguish CNS drugs from non-CNS drugs.
3. The method of claim 2, wherein n is assumed as the number of features selected in step one as having an effect of distinguishing CNS drugs from non-CNS drugs; the third step includes:
3.1 deleting each feature in turn, one at a time, to obtain n different feature combinations: $\{a_2, a_3, a_4, \ldots, a_n\}$, $\{a_1, a_3, a_4, \ldots, a_n\}$, $\{a_1, a_2, a_4, \ldots, a_n\}$, …, $\{a_1, a_2, a_3, a_4, \ldots, a_{n-1}\}$;
3.2 taking the n different feature combinations as input vectors of the optimized support vector machine model obtained in the second step to obtain the prediction performances respectively corresponding to the n different feature combinations, and reserving the feature combination with the best prediction performance;
3.3 execute 3.1 to 3.2 with n-1 features in one feature combination with the best predictive performance obtained at 3.2, and loop until n features are deleted;
3.4 selecting from the above 3.1 to 3.3 implementations a combination of features that is key to distinguishing between CNS drugs and non-CNS drugs.
4. The method of claim 3, wherein the predictive performance includes sensitivity SEN and specificity SPE; SEN represents the prediction rate of CNS drugs and SPE represents the prediction rate of non-CNS drugs.
5. The method of claim 4, wherein the step of retaining the feature combination with the best prediction performance in 3.2 comprises:
respectively comparing the SEN value and the SPE value corresponding to each feature combination, and selecting the highest SEN value and SPE value;
if the highest SEN and SPE belong to the same feature combination, the feature combination is reserved;
and if the SEN and the SPE which are the highest belong to two different feature combinations, comprehensively determining the feature combination to be reserved according to the SEN and the SPE of each of the two different feature combinations.
6. The method of claim 5, wherein the step of comprehensively determining the combination of features to be preserved according to the SEN and the SPE of each of the two different feature combinations, assuming that the highest SEN and SPE belong to the two different feature combinations A and B, respectively, comprises:
comparing the SPE of the feature combination A with the SEN of the feature combination B;
if the SPE of the feature combination A is larger than the SEN of the feature combination B, selecting and reserving the feature combination A;
if the SPE of the feature combination A is smaller than the SEN of the feature combination B, selecting and reserving the feature combination B;
and if the SPE of the feature combination A is equal to the SEN of the feature combination B, comparing the sizes of the SEN of the feature combination A and the SPE of the feature combination B, and selecting the feature combination corresponding to the larger one.
7. The method of claim 6, wherein feature combination A or feature combination B is retained at random if the SEN of feature combination A and the SPE of feature combination B are also equal.
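The retention rule of claims 5 to 7 can be sketched as a small decision function. This is an illustrative reading, not the patented code: the function name and the dictionary layout are assumptions. So is the handling of several combinations sharing the highest SEN (or SPE), which the claims do not address; here such ties are broken by the other metric.

```python
import random


def pick_best(candidates, rng=random):
    """candidates: dict name -> (SEN, SPE). Returns the name to retain."""
    # A holds the highest SEN, B the highest SPE (ties broken by the other metric).
    a = max(candidates, key=lambda k: (candidates[k][0], candidates[k][1]))
    b = max(candidates, key=lambda k: (candidates[k][1], candidates[k][0]))
    if a == b:  # claim 5: same combination is best on both metrics
        return a
    sen_a, spe_a = candidates[a]
    sen_b, spe_b = candidates[b]
    if spe_a != sen_b:  # claim 6: compare SPE of A with SEN of B
        return a if spe_a > sen_b else b
    if sen_a != spe_b:  # claim 6: tie, so compare SEN of A with SPE of B
        return a if sen_a > spe_b else b
    return rng.choice([a, b])  # claim 7: still tied, retain one at random


print(pick_best({"A": (0.9, 0.7), "B": (0.6, 0.95)}))
```

In the printed example, A has the highest SEN and B the highest SPE; the SPE of A (0.7) exceeds the SEN of B (0.6), so A is retained.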
8. The method of any one of claims 2 to 7, wherein the first step preliminarily selects features that help distinguish CNS drugs from non-CNS drug small molecules, the preliminary feature selection being performed with a random forest algorithm that uses the information gain ratio as the attribute classification evaluation function.
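For illustration only, the information gain ratio named in claim 8 can be computed for a single discrete feature as information gain divided by split information. The sketch below shows just this evaluation function, not a random forest. The function names and the toy data are assumptions, and continuous molecular descriptors would first need to be binned.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by split information."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    # Conditional entropy of the labels after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


labels = [1, 1, 0, 0]  # 1 = CNS drug, 0 = non-CNS drug (toy data)
print(gain_ratio(["hi", "hi", "lo", "lo"], labels))  # perfectly separating feature
print(gain_ratio(["hi", "lo", "hi", "lo"], labels))  # uninformative feature
```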
9. The method according to any one of claims 2 to 8, wherein the second step uses an exhaustive search method to obtain the optimized support vector machine model.
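An exhaustive search of this kind can be sketched as trying every combination in a parameter grid and keeping the best-scoring one. The claim does not name the parameters searched, so the grid over `C` and `gamma` below is a hypothetical example, and `toy_score` is only a placeholder for a real cross-validated support vector machine evaluation.

```python
from itertools import product
from math import log10


def exhaustive_search(grid, evaluate):
    """Try every parameter combination in `grid`; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Placeholder objective peaking at C = 10, gamma = 0.01 (not a real SVM run).
    return -(log10(params["C"]) - 1) ** 2 - (log10(params["gamma"]) + 2) ** 2


grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
best_params, best_score = exhaustive_search(grid, toy_score)
print(best_params)
```

The cost grows with the product of the grid sizes, which is acceptable for a small number of parameters.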
10. A method for CNS drug molecule design, wherein said design method identifies key features of CNS drugs using the method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911307432.6A CN111081321B (en) | 2019-12-18 | 2019-12-18 | CNS drug key feature identification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111081321A (en) | 2020-04-28 |
CN111081321B (en) | 2023-10-31 |
Family
ID=70315502
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911307432.6A Active CN111081321B (en) | 2019-12-18 | 2019-12-18 | CNS drug key feature identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111081321B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115238148A (en) * | 2022-09-21 | 2022-10-25 | 杭州衡泰技术股份有限公司 | Characteristic combination screening method for multi-party enterprise joint credit rating and application |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866863A (en) * | 2015-04-27 | 2015-08-26 | 大连理工大学 | Biomarker screening method |
CN105740626A (en) * | 2016-02-01 | 2016-07-06 | 华中农业大学 | Drug activity prediction method based on machine learning |
CN106991296A (en) * | 2017-04-01 | 2017-07-28 | 大连理工大学 | Ensemble classifier method based on the greedy feature selecting of randomization |
CN107731309A (en) * | 2017-08-31 | 2018-02-23 | 武汉百药联科科技有限公司 | A kind of Forecasting Methodology of pharmaceutical activity and its application |
CN110459274A (en) * | 2019-08-01 | 2019-11-15 | 南京邮电大学 | A kind of small-molecule drug virtual screening method and its application based on depth migration study |
Also Published As
Publication number | Publication date |
---|---|
CN111081321B (en) | 2023-10-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200004777A1 (en) | Image Retrieval with Deep Local Feature Descriptors and Attention-Based Keypoint Descriptors | |
Stumpfe et al. | Similarity searching | |
JP6954003B2 (en) | Determining device and method of convolutional neural network model for database | |
JP6839342B2 (en) | Information processing equipment, information processing methods and programs | |
Hanczar et al. | Ensemble methods for biclustering tasks | |
Pes | Learning from high-dimensional biomedical datasets: the issue of class imbalance | |
Lin et al. | Efficient classification of hot spots and hub protein interfaces by recursive feature elimination and gradient boosting | |
US11775610B2 (en) | Flexible imputation of missing data | |
CN107679138B (en) | Spectral feature selection method based on local scale parameters, entropy and cosine similarity | |
US20190251468A1 (en) | Systems and Methods for Distributed Generation of Decision Tree-Based Models | |
CN113344113B (en) | Yolov3 anchor frame determination method based on improved k-means clustering | |
JP4937395B2 (en) | Feature vector generation apparatus, feature vector generation method and program | |
CN109390032B (en) | Method for exploring disease-related SNP (single nucleotide polymorphism) combination in data of whole genome association analysis based on evolutionary algorithm | |
CN111081321A (en) | CNS drug key feature identification method | |
CN112837743A (en) | Medicine repositioning method based on machine learning | |
He et al. | Measuring boundedness for protein complex identification in PPI networks | |
US11886445B2 (en) | Classification engineering using regional locality-sensitive hashing (LSH) searches | |
US11710057B2 (en) | Methods and systems for identifying patterns in data using delimited feature-regions | |
CN111860622B (en) | Clustering method and system applied to programming field big data | |
Yang et al. | Adaptive density peak clustering for determinging cluster center | |
US20120208227A1 (en) | Apparatus and method for processing cell culture data | |
Devi et al. | Similarity measurement in recent biased time series databases using different clustering methods | |
CN111401783A (en) | Power system operation data integration feature selection method | |
CN110766087A (en) | Method for improving data clustering quality of k-means based on dispersion maximization method | |
Böhm et al. | Querying objects modeled by arbitrary probability distributions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||