CN108536999A - Method and device for screening key substructures of small-molecule ligands - Google Patents

Method and device for screening key substructures of small-molecule ligands

Info

Publication number
CN108536999A
Authority
CN
China
Prior art keywords
feature
ligand
substructure
molecular
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810233519.2A
Other languages
Chinese (zh)
Inventor
吴建盛
刘犇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201810233519.2A
Publication of CN108536999A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00 Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like

Landscapes

  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The present invention relates to a method and device for screening key substructures of small-molecule ligands. Working within ligand-based virtual drug screening, it characterizes a massive set of ligand molecules with ECFP molecular fingerprints, in which each binary bit indicates whether the ligand molecule contains a specific substructure. A LASSO feature screening method based on the enhanced dual polytope projection (EDPP) rule retains the key substructures related to ligand activity so that a model can be built in the subsequent learning process, and a corresponding device is built on this method. The invention removes most irrelevant features: it retains the key substructures related to ligand activity, and the subsequent learning process only needs to build a model on a small fraction of the features, which greatly improves the training efficiency of the model.

Description

Method and device for screening key substructures of small-molecule ligands
Technical Field
The present invention relates to a drug screening method and device, in particular to a method and device for screening key substructures of small-molecule ligands based on molecular fingerprints and machine learning, and belongs to the technical field of virtual drug screening.
Background Art
In new drug discovery, the drug first found by any of various routes is not necessarily the most ideal drug within its class of compounds; rather, it provides a new chemical structure, commonly called a lead compound. Various chemical modifications are made to the molecular structure of the lead compound, the relationship between chemical structure and biological effect is studied, and pharmacological experiments and mathematical calculation are used to determine the optimal structure and select a drug. This has become a basic means of modern drug discovery and has achieved remarkable results. Quickly and effectively finding the key substructures that realize a specific function has therefore become a key step in accelerating new drug development.
The number of small-molecule compounds involved in drug design is huge. Because of limits on experimental conditions, manpower, and funding, not all of them can be tested experimentally, and virtual screening (VS) emerged as a common method in computer-aided drug design. Its purpose is to screen a small-molecule database for small-molecule structures that fit a target molecule. The rapid development of computers has made virtual screening more widely used. Virtual screening falls into the following two categories:
(1) Receptor-based virtual screening starts from the three-dimensional conformation of the receptor molecule. Methods such as molecular docking bind the small molecules in a database to the receptor molecule and predict the binding between them; a scoring function then analyzes the degree of matching and the strength of affinity between the small molecule and the amino acids of the receptor's active site; finally the results are ranked to select potential small-molecule drugs.
(2) Ligand-based virtual screening mainly analyzes the conformation, size, pharmacophore, physicochemical properties, and activity relationships of known active small molecules. Common methods include similarity searching, pharmacophore matching, substructure searching, building quantitative structure-activity relationships, and calculating the affinity of new compounds.
The key to ligand-based screening is to describe compounds fully, using various methods or molecular fingerprints, in order to study whether compounds have similar activity or mechanisms of action, or to summarize information about the groups that play a key role in compound activity. In ligand-based virtual drug screening, the molecular fingerprint representation of the ligand molecule is crucial. By detecting whether certain specific substructures are present in a molecular structure, the structure is converted into a binary fingerprint sequence, which suggests that the structural similarity of small molecules can be expressed by the similarity of their molecular fingerprints. On the other hand, when a small-molecule drug binds to a drug target, only local positions are involved in binding, which means that only a few key substructures are decisive for its biological activity. Finding the key substructures that bind a particular target and carry a specific function would greatly advance new drug development.
In the simplest form of molecular fingerprint, each bit indicates whether the corresponding molecule has some specific feature (if so, the bit is set to 1). MACCS fingerprints are a typical representative: they consist of 166 binary bits, each corresponding to specific structural attribute information. Such fingerprints have an unavoidable limitation: only substructures predefined in the dictionary can be identified. Faced with massive numbers of compound molecules, small-scale substructure identification falls far short of what is needed, and the most popular approach at present is to characterize molecules with ECFP (extended-connectivity fingerprints). ECFP variants include ECFP4, ECFP8, and ECFP12; the trailing number gives the size of the atom neighborhood that each substructure covers (in the usual ECFP naming it is the diameter, i.e. twice the radius), and in general the larger it is, the more features are generated. Because the number of compound molecules in virtual drug screening is enormous, the feature dimension generated by ECFP is also huge. When ligand molecules are characterized with ECFP12 fingerprints, the feature dimension of each ligand molecule can reach tens of millions.
Ligand molecules act on target molecules mainly through pharmacophoric groups, and a pharmacophoric group usually involves only a few substructures of the ligand molecule; that is, most substructures are unrelated to the activity of the ligand molecule. If the ligand data set is converted into a matrix, each row represents one sample, corresponding to one ligand molecule, and each column represents one feature, corresponding to one specific substructure. When screening the massive ligand feature set, the "sparsity" of the features must be considered: many columns of the matrix have no relationship with ligand activity. Once feature selection removes these columns, the actual learning task only needs to run on a smaller matrix, so the task may become easier, the computation and storage overhead falls, and the learned model becomes more interpretable. In machine learning, the massive feature set generated for ligand molecules runs into the "curse of dimensionality" in practical tasks: the running time becomes excessive, or the data cannot be fitted at all.
Summary of the Invention
The object of the invention is, in view of the defects of the prior art, to propose a method and device for screening key substructures of small-molecule ligands that remove most irrelevant features, i.e. substructures. On the one hand, the key substructures related to ligand activity are retained; on the other hand, the subsequent learning process only needs to build a model on a small fraction of the features, which greatly improves the training efficiency of the model.
To achieve this object, the invention provides a method for screening key substructures of small-molecule ligands. Working within ligand-based virtual drug screening, it characterizes a massive set of ligand molecules with ECFP molecular fingerprints, in which each binary bit indicates whether the ligand molecule contains a specific substructure. A LASSO feature screening method based on the enhanced dual polytope projection (EDPP) rule retains the key substructures related to ligand activity, so that a model can be built in the subsequent learning process.
The method mainly includes the following steps:
Step 1: Build the required initial data set D_s. The initial data set D_s includes the SMILES representations of the ligand molecules needed to generate ECFP features and the concentration values (Standard Values) at which activity arises between the ligand molecule and the drug target; the negative of the base-10 logarithm of each concentration value is taken as the bioactivity response value;
Step 2: Given the initial data set D_s, process it to obtain the ECFP features of the ligand molecules, i.e. the data set D_t;
Step 3: Based on the data set D_t, screen ligand features with the EDPP LASSO method;
Step 4: Select ligand features with the robustness selection (stability selection) method;
Step 5: Trace the features back and visualize the key substructures.
A further limitation of the technical solution of the invention is that the bioactivity response in step 1 is response = -log10(v), where v is the Standard Values value. The response value reflects the magnitude of the bioactivity of the ligand molecule acting on the GPCR; a smaller value indicates lower activity.
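The response transform of step 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function name is hypothetical, and because the text does not state the concentration unit, the value is assumed to be already expressed in whatever unit the data set uses.

```python
import math


def response_from_standard_value(v: float) -> float:
    """Return the bioactivity response -log10(v) for a concentration value v.

    Assumption: v is positive and already in the unit used by the data set.
    """
    if v <= 0:
        raise ValueError("concentration value must be positive")
    return -math.log10(v)


# A lower concentration value gives a larger response.
weak = response_from_standard_value(1e-3)
strong = response_from_standard_value(1e-9)
print(weak < strong)
```

Because the transform is monotonically decreasing in v, ranking ligands by response is the reverse of ranking them by concentration value.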
Further, the given initial data set D_s consists of n samples, each pairing the SMILES atom-connectivity graph of a molecule with Y_i, the response value of that sample; the ECFP features of the ligand molecules, i.e. the data set, are $D_t = \{(X_i, Y_i) \mid X_i \in \mathbb{R}^{1 \times m},\ 1 \le i \le n\}$.
Further, in step 3, for the data set D_t and for a sequence $\lambda = \{\lambda_i \mid 0 \le i < K,\ \lambda_i > \lambda_{i+1}\}$ satisfying the condition $\lambda \in (0, \lambda_0]$, a feature screening result $T = \{T_i \mid T_i \in \mathbb{R}^{1 \times m},\ 0 \le i < K\}$ is obtained, one T_i for each λ value. An entry of T_i equal to 1 means the feature is retained; an entry equal to 0 means the feature is irrelevant and can be deleted.
In step 3, for ligand feature screening based on the EDPP LASSO method, assume the data $X \in \mathbb{R}^{n \times m}$, where n is the number of samples and m is the feature dimension. The standard LASSO problem is then:
$$\min_{\beta \in \mathbb{R}^m} \ \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \qquad (1)$$
where y is the label vector and $\beta \in \mathbb{R}^m$. By adding an L1-norm penalty to the loss function in formula (1), the coefficients of the variable β are shrunk and some regression coefficients become 0, so that feature selection is achieved while the loss function is optimized;
Formula (1) can be converted into its dual problem:
$$\sup_{\theta} \ \left\{ \frac{1}{2}\|y\|_2^2 - \frac{\lambda^2}{2}\left\|\theta - \frac{y}{\lambda}\right\|_2^2 \ : \ |x_i^T\theta| \le 1,\ i = 1, \dots, m \right\} \qquad (2)$$
where θ is the dual variable and $x_i$ denotes the i-th column of X;
The optimal values of formula (1) and formula (2) are identical, so it suffices to solve the dual problem;
For convenience, the solution of optimization problem (2) is denoted θ*(λ) (and, similarly, the solution of optimization problem (1) is denoted β*(λ));
According to the KKT conditions:
$$y = X\beta^*(\lambda) + \lambda\theta^*(\lambda) \qquad (3)$$
$$x_i^T\theta^*(\lambda) \in \begin{cases} \operatorname{sign}([\beta^*(\lambda)]_i), & [\beta^*(\lambda)]_i \ne 0 \\ [-1, 1], & [\beta^*(\lambda)]_i = 0 \end{cases} \qquad (4)$$
where $[\cdot]_i$ denotes the i-th component, i.e. the i-th feature. From the KKT condition in formula (4):
$$|x_i^T\theta^*(\lambda)| < 1 \ \Rightarrow \ [\beta^*(\lambda)]_i = 0, \ \text{i.e. } x_i \text{ is an irrelevant feature} \qquad (5)$$
A region Θ containing θ*(λ) is first estimated, so that (5) can be written in the following form:
$$\sup_{\theta \in \Theta} |x_i^T\theta| < 1 \ \Rightarrow \ [\beta^*(\lambda)]_i = 0, \ \text{i.e. } x_i \text{ is an irrelevant feature} \qquad (6)$$
Further, in step 4, the K screening results T_i of step 3 are summed, which gives the frequency with which each feature dimension was selected. The more often a feature is selected, the more likely it is to be a relevant feature. The top p features by selection count are chosen to obtain a robust feature selection result, where p is a natural number from 0 to p.
Further, in step 5, the feature selection result obtained in step 4 is traced back, the final features are retained, and their specific substructure diagrams are visualized, which is of great significance for new drug development.
A device for screening key substructures of small-molecule ligands, comprising the following modules:
Data preprocessing module: organizes the relevant files downloaded from databases. Specifically, it covers downloading the relevant files, cleaning the data, averaging the response values of duplicated ligand molecules, and building the initial data set D_s.
ECFP feature generation module: builds the data set D_t from the initial data set D_s produced by the data preprocessing module and generates ECFP molecular fingerprints. In general, the deeper the fingerprint depth, the more feature dimensions are generated and the finer the granularity at which substructures are analyzed.
EDPP LASSO feature screening module: operating on the data set D_t, it removes irrelevant features, with the fineness of the screening controlled by the λ value. In general, the smaller the λ value, the more features are retained.
Robustness selection module: further refines the key features. The number of key features to retain can be set, and the N most important substructures are screened out, where N is a natural number from 0 to n.
Feature backtracking and key substructure visualization module: visualizes the key substructures that affect the binding activity between drug target and ligand molecule, guiding the optimization of lead compounds.
The modules are connected to one another by data lines for data transmission.
Beneficial effects of the invention:
1. The "curse of dimensionality" is resolved: irrelevant features are removed quickly and in large numbers, and robust relevant features are obtained, so that the subsequent learning process only needs to build a model on a small fraction of the features, which greatly improves the learning efficiency of the model.
2. Feature backtracking visualizes the key substructures and can guide the optimization of lead compounds, helping them overcome their shortcomings and finally yielding an excellent drug candidate. This will greatly accelerate the new drug discovery process.
Brief Description of the Drawings
The invention is further described below with reference to the drawings.
Fig. 1 is a flow diagram of the key substructure screening method of the invention.
Fig. 2 is a structural diagram of the key substructure screening device of the invention.
Detailed Description
The method flow of the invention is described in further detail below with reference to Fig. 1.
Step 1: The 7tmrlist file is first downloaded from the UniProt database. This file contains the UniProt IDs of all 3092 G protein-coupled receptors (GPCRs), from which we filter out the human GPCR proteins, 825 in total. All interaction data files are then downloaded from the GLASS database; these contain 519,051 unique GPCR-ligand entries. We rank the 825 human GPCR proteins by their number of interacting ligands and choose the 25 with the most ligands. Of these 25 GPCR proteins, 8 have no three-dimensional structure at all, and each of these 8 GPCRs has more than 3000 ligands, which indirectly indicates their importance for bodily functions. We choose these 8 GPCRs as the experimental subjects here.
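The target selection described in this step (rank receptors by number of distinct ligands, keep the top few, then keep those without a known three-dimensional structure) can be sketched as follows. The function names, the toy receptor IDs, and the representation of the interaction file as (receptor, ligand) pairs are assumptions made for illustration; the patent does not describe the file layout.

```python
from typing import Dict, Iterable, List, Set, Tuple


def rank_receptors_by_ligand_count(
    pairs: Iterable[Tuple[str, str]]
) -> List[Tuple[str, int]]:
    """Count distinct ligands per receptor and sort by count, largest first."""
    ligands: Dict[str, Set[str]] = {}
    for receptor, ligand in pairs:
        ligands.setdefault(receptor, set()).add(ligand)
    counts = [(receptor, len(found)) for receptor, found in ligands.items()]
    # Ties are broken by receptor ID so the ranking is deterministic.
    return sorted(counts, key=lambda rc: (-rc[1], rc[0]))


def select_targets(
    pairs: Iterable[Tuple[str, str]], top_n: int, has_structure: Set[str]
) -> List[str]:
    """Keep the top_n receptors by ligand count that lack a 3D structure."""
    top = rank_receptors_by_ligand_count(pairs)[:top_n]
    return [receptor for receptor, _ in top if receptor not in has_structure]


# Toy data with hypothetical IDs; a repeated entry is counted only once.
pairs = [("R1", "a"), ("R1", "b"), ("R1", "c"), ("R2", "a"),
         ("R2", "b"), ("R3", "a"), ("R1", "a")]
print(select_targets(pairs, top_n=2, has_structure={"R1"}))
```

In the setting described above, `top_n` would be 25 and `has_structure` would hold the receptors for which a three-dimensional structure is available.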
For a more accurate study, we obtain all interacting ligands of each of these 8 GPCR proteins from the ChEMBL database and save the ligand molecules (in SMILES format) together with their response values.
Specifically, Table 1 gives an overview of the processed data sets.
Table 1. Experimental data sets
Taking Table 1 as an example, the initial data set includes:
CANONICAL SMILES: the SMILES atom-connectivity graph of the ligand molecule.
response: the bioactivity value of each ligand molecule.
Step 2: Generating the ECFP features of the ligand molecules:
The given initial data set D_s consists of n samples, each pairing the SMILES atom-connectivity graph of a molecule with Y_i, the response value of that sample. The initial data set is processed further to obtain the ECFP features that describe the samples, i.e. the data set $D_t = \{(X_i, Y_i) \mid X_i \in \mathbb{R}^{1 \times m},\ 1 \le i \le n\}$.
The SMILES atom-connectivity graph of each molecule obtained from the database, together with the required fingerprint radius, is input into ECFP generation software, which yields fixed-length ECFP features for each ligand molecule. Because the features of the data set are generated from all molecules, molecules share some features and also have features unique to themselves. The features of all molecules are therefore combined, the repeated shared features are deleted, and the remaining features serve as the final feature description.
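The combine-and-deduplicate step can be sketched in Python as follows. The sketch assumes that an external ECFP generator has already produced, for each molecule, a set of integer substructure identifiers; the identifiers and the function name below are hypothetical, and the fingerprint generation itself is not shown.

```python
from typing import List, Set, Tuple


def build_feature_matrix(
    fingerprints: List[Set[int]]
) -> Tuple[List[int], List[List[int]]]:
    """Merge per-molecule identifier sets into one binary feature matrix.

    Returns the sorted list of distinct identifiers (one per column) and a
    matrix whose entry [i][j] is 1 if molecule i contains substructure j.
    """
    # The set union removes identifiers shared between molecules.
    columns = sorted(set().union(*fingerprints))
    index = {identifier: j for j, identifier in enumerate(columns)}
    matrix = []
    for fingerprint in fingerprints:
        row = [0] * len(columns)
        for identifier in fingerprint:
            row[index[identifier]] = 1
        matrix.append(row)
    return columns, matrix


# Hypothetical identifiers standing in for ECFP substructure hashes.
fps = [{101, 205, 999}, {205, 310}, {101, 310, 777}]
columns, X = build_feature_matrix(fps)
print(columns)
for row in X:
    print(row)
```

Keeping the `columns` list alongside the matrix is what later makes feature backtracking possible: a selected column index can be mapped back to the substructure identifier it came from.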
Step 3: Ligand feature screening based on the EDPP LASSO method. The EDPP feature screening rule is applied to the data set D_t. For a sequence $\lambda = \{\lambda_i \mid 0 \le i < K,\ \lambda_i > \lambda_{i+1}\}$ satisfying the condition $\lambda \in (0, \lambda_0]$, a feature screening result $T = \{T_i \mid T_i \in \mathbb{R}^{1 \times m},\ 0 \le i < K\}$ is obtained, one T_i for each λ value. An entry of T_i equal to 1 means the feature is retained; an entry equal to 0 means the feature is irrelevant and can be deleted.
The detailed EDPP feature screening procedure is as follows:
Assume the data $X \in \mathbb{R}^{n \times m}$, where n is the number of samples and m is the feature dimension. The standard LASSO problem is then:
$$\min_{\beta \in \mathbb{R}^m} \ \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \qquad (1)$$
where y is the label vector and $\beta \in \mathbb{R}^m$. By adding an L1-norm penalty to the loss function in formula (1), the coefficients of the variable β are shrunk and some regression coefficients become 0, so that feature selection is achieved while the loss function is optimized.
Formula (1) can be converted into its dual problem:
$$\sup_{\theta} \ \left\{ \frac{1}{2}\|y\|_2^2 - \frac{\lambda^2}{2}\left\|\theta - \frac{y}{\lambda}\right\|_2^2 \ : \ |x_i^T\theta| \le 1,\ i = 1, \dots, m \right\} \qquad (2)$$
where θ is the dual variable and $x_i$ denotes the i-th column of X. The optimal values of formula (1) and formula (2) are identical, so it suffices to solve the dual problem. For convenience, the solution of optimization problem (2) is denoted θ*(λ) (and, similarly, the solution of optimization problem (1) is denoted β*(λ)). According to the KKT conditions:
$$y = X\beta^*(\lambda) + \lambda\theta^*(\lambda) \qquad (3)$$
$$x_i^T\theta^*(\lambda) \in \begin{cases} \operatorname{sign}([\beta^*(\lambda)]_i), & [\beta^*(\lambda)]_i \ne 0 \\ [-1, 1], & [\beta^*(\lambda)]_i = 0 \end{cases} \qquad (4)$$
where $[\cdot]_i$ denotes the i-th component, i.e. the i-th feature. From the KKT condition in formula (4):
$$|x_i^T\theta^*(\lambda)| < 1 \ \Rightarrow \ [\beta^*(\lambda)]_i = 0, \ \text{i.e. } x_i \text{ is an irrelevant feature} \qquad (5)$$
In other words, formula (5) can be used to identify the irrelevant features of the LASSO problem.
However, because θ*(λ) is unknown, formula (5) cannot be used directly to find irrelevant features. A region Θ containing θ*(λ) can be estimated first, so that formula (5) can be written in the following form:
$$\sup_{\theta \in \Theta} |x_i^T\theta| < 1 \ \Rightarrow \ [\beta^*(\lambda)]_i = 0, \ \text{i.e. } x_i \text{ is an irrelevant feature} \qquad (6)$$
In summary, as long as a region containing θ*(λ) can be found such that the absolute value of the product of $x_i$ with every θ in the region is less than 1, formula (6) can serve as a rule for finding the irrelevant features of the LASSO problem. The smaller the region Θ that is found, the more accurate the estimate of θ*(λ), and the more irrelevant features the rule can screen out.
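A screening test of the form (6) is easy to evaluate when the region Θ is a ball with center c and radius r, because the supremum of $|x_i^T\theta|$ over the ball equals $|x_i^T c| + r\|x_i\|_2$. The sketch below uses the simpler, non-enhanced dual polytope projection ball, centered at $\theta^*(\lambda_{max}) = y/\lambda_{max}$ with radius $\|y\|_2 (1/\lambda - 1/\lambda_{max})$, where $\lambda_{max} = \max_i |x_i^T y|$. This is an assumption chosen for brevity, not the EDPP region of the patent, which is tighter and discards more features. Function names are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def screen_with_ball(features: List[List[float]], center: List[float],
                     radius: float) -> List[bool]:
    """Apply rule (6) for a ball region; True means the feature is discarded.

    Over a ball, sup |x_i^T theta| = |x_i^T c| + r * ||x_i||.
    """
    return [abs(dot(x, center)) + radius * norm(x) < 1.0 for x in features]


def basic_dpp_mask(features: List[List[float]], y: List[float],
                   lam: float) -> List[bool]:
    """Discard mask from the basic (non-enhanced) DPP ball started at lam_max."""
    lam_max = max(abs(dot(x, y)) for x in features)
    if lam >= lam_max:
        # For lam >= lam_max the LASSO solution is all zeros.
        return [True] * len(features)
    center = [value / lam_max for value in y]  # theta*(lam_max) = y / lam_max
    radius = norm(y) * (1.0 / lam - 1.0 / lam_max)
    return screen_with_ball(features, center, radius)


# Each entry of `features` is one column x_i of X (length n = 2 here).
features = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.1]
print(basic_dpp_mask(features, y, lam=0.9))
```

In this toy case the design is orthonormal, so the LASSO solution is the soft-thresholded correlation: at λ = 0.9 the first coefficient is nonzero and the second is zero, which agrees with the mask keeping the first feature and discarding the second. The rule is safe in the sense that it never discards a feature with a nonzero coefficient, but it may keep some features whose coefficients are zero.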
Step 4: Ligand feature selection based on the robustness selection method. The K screening results T_i of step 3 are summed, which gives the frequency with which each feature dimension was selected. The more often a feature is selected, the more likely it is to be a key feature. The p features with the highest selection counts are chosen, giving a robust feature selection result.
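The frequency count and top-p choice of step 4 can be sketched as follows. The function names are hypothetical, and breaking ties by the lower feature index is an assumption, since the text does not say how ties are resolved.

```python
from typing import List


def selection_frequency(masks: List[List[int]]) -> List[int]:
    """Sum K binary screening results T_i column by column."""
    if not masks:
        return []
    m = len(masks[0])
    return [sum(mask[j] for mask in masks) for j in range(m)]


def top_p_features(masks: List[List[int]], p: int) -> List[int]:
    """Return the indices of the p most frequently retained features."""
    freq = selection_frequency(masks)
    # Assumption: ties are broken in favor of the lower feature index.
    ranked = sorted(range(len(freq)), key=lambda j: (-freq[j], j))
    return ranked[:p]


# Three screening results (K = 3) over four features (m = 4).
masks = [[1, 0, 1, 0],
         [1, 0, 0, 0],
         [1, 1, 1, 0]]
print(selection_frequency(masks))
print(top_p_features(masks, p=2))
```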
Step 5: The robust feature selection result obtained in step 4 is traced back to the original features, which are visualized as key substructures to guide subsequent new drug discovery experiments.
The feature screening of steps 3 and 4 above is thus carried out in two stages. First, a certain number of values $\lambda = \{\lambda_i \mid 0 \le i < K,\ \lambda_i > \lambda_{i+1}\}$ satisfying the condition are chosen; this embodiment uses 100. They yield the same number of feature screening results $T = \{T_i \mid T_i \in \mathbb{R}^{1 \times m},\ 0 \le i < K\}$, where an entry of T_i equal to 1 means the feature is retained and an entry equal to 0 means the feature is irrelevant and can be deleted. These screening results are then considered together: the K results T_i are stacked, which gives the frequency with which each feature was selected. The more often a feature is selected, the more likely it is to be a relevant feature. The p features with the highest selection counts are chosen to obtain a robust feature selection result, where p is a natural number from 0 to p. This avoids the poor modeling performance that relying on a single parameter value might cause.
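A strictly decreasing sequence of K values in (0, λ_0] can be generated as in the sketch below. The text fixes only the count (100 in this embodiment), the upper bound, and the ordering; the geometric spacing and the `ratio` parameter that sets the smallest value are assumptions made for illustration.

```python
from typing import List


def lambda_grid(lam0: float, k: int = 100, ratio: float = 0.01) -> List[float]:
    """Return k strictly decreasing values running from lam0 down to ratio * lam0.

    Assumption: values are geometrically spaced; the patent does not specify
    the spacing or the smallest value.
    """
    if lam0 <= 0 or k < 1 or not 0 < ratio < 1:
        raise ValueError("need lam0 > 0, k >= 1 and 0 < ratio < 1")
    if k == 1:
        return [lam0]
    step = ratio ** (1.0 / (k - 1))
    return [lam0 * step ** i for i in range(k)]


grid = lambda_grid(5.0)
print(len(grid), grid[0], grid[-1])
```

Each value in the grid would produce one screening result T_i, and the K results are then summed as described above.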
This embodiment also discloses a new device for screening key substructures of small-molecule ligands which, as shown in Fig. 2, comprises:
Data preprocessing module: organizes the relevant files downloaded from databases. It covers downloading the relevant files, cleaning the data, and averaging the response values of duplicated ligand molecules.
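The averaging of response values for duplicated ligand molecules can be sketched as follows. The sketch assumes that duplicates are recognized by identical canonical SMILES strings, which matches the CANONICAL SMILES field of the data set but is not stated explicitly for this module; the function name and toy records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def average_duplicates(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (SMILES, response) records by SMILES and average each group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for smiles, response in records:
        grouped[smiles].append(response)
    return {smiles: sum(values) / len(values)
            for smiles, values in grouped.items()}


# Toy records; "CCO" appears twice with different response values.
records = [("CCO", 5.0), ("c1ccccc1", 6.5), ("CCO", 7.0)]
cleaned = average_duplicates(records)
print(cleaned)
```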
ECFP feature generation module: generates ECFP molecular fingerprints. The fingerprint depth can be set in this module; in general, the deeper the fingerprint depth, the more feature dimensions are generated and the finer the granularity at which substructures are analyzed.
EDPP-based feature screening module: removes irrelevant features. The fineness of the screening can be changed by setting the λ value; in general, the smaller the λ value, the more features are retained.
Robustness selection module: further refines the key features. The number of key features to retain can be set, and the N most important substructures are screened out, where N is a natural number from 0 to n.
Feature backtracking and key substructure visualization module: visualizes the key substructures that affect the binding activity between drug target and ligand molecule, which plays an extremely important role in guiding the optimization of lead compounds.
The modules are connected to one another by data lines for data transmission.
The beneficial effects of the invention are summarized as follows:
(1) The invention resolves the "curse of dimensionality" for massive ligand feature sets. On the one hand, generating a massive feature set for ligand molecules overcomes the limited feature dimension of certain molecular fingerprints; on the other hand, the subsequent learning process only needs to build a model on a small fraction of the features, which greatly improves the learning efficiency of the model.
In this age of data explosion, high-dimensional data are everywhere. In many biological problems, for example, the data dimension is very high, and because collecting data requires expensive experiments, the available training data are scarce. The feature dimension then far exceeds the number of samples; without further assumptions or constraints, a model is difficult to build and over-fitting results. The LASSO method rejects irrelevant features by means of a penalty function and so resolves the "curse of dimensionality", so that the subsequent learning process only needs to build a model on a small fraction of the features, which greatly improves the learning efficiency of the model.
(2) The invention uses the LASSO method based on the EDPP rule to remove irrelevant features quickly and in large numbers and to obtain robust relevant features, which helps in understanding the substructures related to ligand activity. The feature backtracking module yields the specific key substructures, which is highly significant for subsequent new drug discovery work.
The foregoing description of specific exemplary embodiments of the invention is given for purposes of illustration. It is not intended to limit the invention to the precise forms disclosed, and many modifications and variations are clearly possible in light of the above teaching. The exemplary embodiments were selected and described in order to explain the specific principles of the invention and their practical application, so that those skilled in the art can implement and use various exemplary embodiments of the invention as well as various alternatives and modifications. The scope of the invention is intended to be defined by the claims and their equivalents.

Claims (10)

1. A method for screening key substructures of small-molecule ligands, characterized in that: within ligand-based virtual drug screening, a massive set of ligand molecules is characterized with ECFP molecular fingerprints, each binary bit indicating whether the ligand molecule contains a specific substructure; a LASSO feature screening method based on the enhanced dual polytope projection (EDPP) rule retains the key substructures related to ligand activity, so that a model can be built in the subsequent learning process,
the method mainly comprising the following steps:
Step 1: Build the required initial data set D_s, the initial data set D_s including the SMILES representations of the ligand molecules needed to generate ECFP features and the concentration values (Standard Values) at which activity arises between the ligand molecule and the drug target, the negative of the base-10 logarithm of each concentration value being taken as the bioactivity response value;
Step 2: Given the initial data set D_s, process it to obtain the ECFP features of the ligand molecules, i.e. the data set D_t;
Step 3: Based on the data set D_t, screen ligand features with the EDPP LASSO method;
Step 4: Select ligand features with the robustness selection (stability selection) method;
Step 5: Trace the features back and visualize the key substructures.
2. The method for screening key substructures of small-molecule ligands according to claim 1, characterized in that:
the bioactivity response in step 1 is response = -log10(v), where v is the Standard Values value; the response value reflects the magnitude of the bioactivity of the ligand molecule acting on the GPCR, a smaller value indicating lower activity.
3. The method for screening key substructures of small-molecule ligands according to claim 1 or 2, characterized in that: the given initial data set D_s consists of n samples, each pairing the SMILES atom-connectivity graph of a molecule with Y_i, the response value of that sample; the ECFP features of the ligand molecules, i.e. the data set, are $D_t = \{(X_i, Y_i) \mid X_i \in \mathbb{R}^{1 \times m},\ 1 \le i \le n\}$.
4. The method for screening key substructures of small-molecule ligands according to claim 3, characterized in that: in step 3, for the data set D_t and for a sequence $\lambda = \{\lambda_i \mid 0 \le i < K,\ \lambda_i > \lambda_{i+1}\}$ satisfying the condition $\lambda \in (0, \lambda_0]$, a feature screening result $T = \{T_i \mid T_i \in \mathbb{R}^{1 \times m},\ 0 \le i < K\}$ is obtained, one T_i for each λ value, where an entry of T_i equal to 1 means the feature is retained and an entry equal to 0 means the feature is irrelevant and can be deleted.
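5. The method for screening key substructures of small-molecule ligands according to claim 1, characterized in that:
in step 3, for ligand feature screening based on the EDPP LASSO method, assume the data $X \in \mathbb{R}^{n \times m}$, where n is the number of samples and m is the feature dimension; the standard LASSO problem is then:
$$\min_{\beta \in \mathbb{R}^m} \ \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \qquad (1)$$
where y is the label vector and $\beta \in \mathbb{R}^m$; by adding an L1-norm penalty to the loss function in formula (1), the coefficients of the variable β are shrunk and some regression coefficients become 0, so that feature selection is achieved while the loss function is optimized;
formula (1) can be converted into its dual problem:
$$\sup_{\theta} \ \left\{ \frac{1}{2}\|y\|_2^2 - \frac{\lambda^2}{2}\left\|\theta - \frac{y}{\lambda}\right\|_2^2 \ : \ |x_i^T\theta| \le 1,\ i = 1, \dots, m \right\} \qquad (2)$$
where θ is the dual variable and $x_i$ denotes the i-th column of X;
the optimal values of formula (1) and formula (2) are identical, so it suffices to solve the dual problem;
for convenience, the solution of optimization problem (2) is denoted θ*(λ), and similarly the solution of optimization problem (1) is denoted β*(λ);
according to the KKT conditions:
$$y = X\beta^*(\lambda) + \lambda\theta^*(\lambda) \qquad (3)$$
$$x_i^T\theta^*(\lambda) \in \begin{cases} \operatorname{sign}([\beta^*(\lambda)]_i), & [\beta^*(\lambda)]_i \ne 0 \\ [-1, 1], & [\beta^*(\lambda)]_i = 0 \end{cases} \qquad (4)$$
where $[\cdot]_i$ denotes the i-th component, i.e. the i-th feature; from the KKT condition in formula (4):
$$|x_i^T\theta^*(\lambda)| < 1 \ \Rightarrow \ [\beta^*(\lambda)]_i = 0, \ \text{i.e. } x_i \text{ is an irrelevant feature} \qquad (5)$$
a region Θ containing θ*(λ) is first estimated, so that (5) can be written in the following form:
$$\sup_{\theta \in \Theta} |x_i^T\theta| < 1 \ \Rightarrow \ [\beta^*(\lambda)]_i = 0, \ \text{i.e. } x_i \text{ is an irrelevant feature} \qquad (6).$$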
6. a kind of ligand small molecule key minor structure screening technique according to claim 4, it is characterised in that:The step In rapid 4, the Feature Selection for step 3 is as a result, K TiSuperposition it is tired and, this just obtains the selected frequency of each dimensional feature, The selected number of feature is more, represents it and is more likely to be relevant feature, chooses by the most preceding p spies of selection number Sign obtains the feature selecting of robust as a result, the p, which is natural number 0, arrives p.
7. The ligand small-molecule key substructure screening method according to claim 1, characterized in that: in step 5, the feature selection result obtained in step 4 is traced back, the final features are retained, and their specific substructure diagrams are visualized.
8. A ligand small-molecule key substructure screening device, characterized by comprising:
a data preprocessing module for organizing the relevant files downloaded from a database;
an ECFP feature generation module for generating ECFP molecular fingerprints;
an EDPP-LASSO-based feature screening module for removing irrelevant features;
a robust feature selection module for further selecting key features;
a feature backtracking and key substructure visualization module;
the modules being connected to one another by data lines for data transmission.
9. The ligand small-molecule key substructure screening device according to claim 8, characterized in that: the data preprocessing module covers downloading the relevant files, cleaning the data, averaging the response values of duplicated ligand molecules, and building the initial data set $D_s$;
the ECFP feature generation module builds the data set $D_t$ from the initial data set $D_s$ built by the data preprocessing module;
the EDPP-LASSO-based feature screening module operates on the data set $D_t$ and adjusts how fine the screening is through the λ value: the smaller the λ value, the more features are retained;
the robust feature selection module sets the number of key features to retain and screens out the N most important substructures, where N is a natural number from 0 to n;
the feature backtracking and key substructure visualization module visualizes the key substructures that influence the binding activity between the drug target and the ligand molecules, guiding the optimization of lead compounds.
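The duplicate handling performed by the data preprocessing module in claim 9 (averaging the response values of repeated ligand molecules) can be sketched as follows. The record layout, a list of (ligand identifier, response) pairs, and the function name `average_duplicates` are assumptions; the sketch also assumes that identical molecules already share an identical identifier string, which in practice would require a canonicalization step not shown here.

```python
from collections import defaultdict


def average_duplicates(records):
    """Collapse repeated ligands to one entry holding the mean response.

    records: iterable of (ligand_id, response) pairs, hypothetical layout.
    Returns a dict mapping each ligand_id to its average response.
    """
    grouped = defaultdict(list)
    for ligand_id, response in records:
        grouped[ligand_id].append(float(response))
    return {lig: sum(vals) / len(vals) for lig, vals in grouped.items()}


# Toy records: the first ligand appears twice with different responses.
records = [("CCO", 6.0), ("c1ccccc1", 5.0), ("CCO", 7.0)]
initial_dataset = average_duplicates(records)
```

The resulting mapping has one entry per distinct ligand and would play the role of the initial data set from which fingerprints are generated.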
10. The ligand small-molecule key substructure screening device according to claim 9, characterized in that: the ECFP feature generation module is used to set the fingerprint depth; in general, the deeper the fingerprint, the more feature dimensions are generated and the finer the granularity of substructure analysis.
CN201810233519.2A 2018-03-21 2018-03-21 A kind of ligand small molecule key minor structure screening technique and device Pending CN108536999A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810233519.2A CN108536999A (en) 2018-03-21 2018-03-21 A kind of ligand small molecule key minor structure screening technique and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810233519.2A CN108536999A (en) 2018-03-21 2018-03-21 A kind of ligand small molecule key minor structure screening technique and device

Publications (1)

Publication Number Publication Date
CN108536999A true CN108536999A (en) 2018-09-14

Family

ID=63484338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810233519.2A Pending CN108536999A (en) 2018-03-21 2018-03-21 A kind of ligand small molecule key minor structure screening technique and device

Country Status (1)

Country Link
CN (1) CN108536999A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110176279A (en) * 2019-05-31 2019-08-27 南京邮电大学 Lead compound virtual screening method and device based on small sample
CN110277173A (en) * 2019-05-21 2019-09-24 湖南大学 BiGRU drug toxicity forecasting system and prediction technique based on Smi2Vec
CN111798939A (en) * 2020-06-02 2020-10-20 中山大学 Crystal structure database construction method and structure search method
CN113282122A (en) * 2021-05-31 2021-08-20 西安建筑科技大学 Commercial building energy consumption prediction optimization method and system
CN114530215A (en) * 2022-02-18 2022-05-24 北京有竹居网络技术有限公司 Method and apparatus for designing ligand molecules

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778861A (en) * 2016-12-12 2017-05-31 齐鲁工业大学 A kind of screening technique of key feature
CN106778032A (en) * 2016-12-14 2017-05-31 南京邮电大学 Ligand molecular magnanimity Feature Selection method in drug design
WO2017139044A1 (en) * 2016-02-09 2017-08-17 Albert Einstein College Of Medicine, Inc. Residue-based pharmacophore method for identifying cognate protein ligands

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017139044A1 (en) * 2016-02-09 2017-08-17 Albert Einstein College Of Medicine, Inc. Residue-based pharmacophore method for identifying cognate protein ligands
CN106778861A (en) * 2016-12-12 2017-05-31 齐鲁工业大学 A kind of screening technique of key feature
CN106778032A (en) * 2016-12-14 2017-05-31 南京邮电大学 Ligand molecular magnanimity Feature Selection method in drug design

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110277173A (en) * 2019-05-21 2019-09-24 湖南大学 BiGRU drug toxicity forecasting system and prediction technique based on Smi2Vec
CN110176279A (en) * 2019-05-31 2019-08-27 南京邮电大学 Lead compound virtual screening method and device based on small sample
CN110176279B (en) * 2019-05-31 2022-08-26 南京邮电大学 Lead compound virtual screening method and device based on small sample
CN111798939A (en) * 2020-06-02 2020-10-20 中山大学 Crystal structure database construction method and structure search method
WO2021243768A1 (en) * 2020-06-02 2021-12-09 中山大学 Crystal structure database construction method and structure search method
CN113282122A (en) * 2021-05-31 2021-08-20 西安建筑科技大学 Commercial building energy consumption prediction optimization method and system
CN114530215A (en) * 2022-02-18 2022-05-24 北京有竹居网络技术有限公司 Method and apparatus for designing ligand molecules

Similar Documents

Publication Publication Date Title
CN108536999A (en) A kind of ligand small molecule key minor structure screening technique and device
Kamal et al. A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset
Lanchantin et al. Deep motif: Visualizing genomic sequence classifications
Zhao et al. Data clustering in life sciences
Xu et al. Survey of clustering algorithms
Pandey et al. Computational approaches for protein function prediction: A survey
David et al. Comparative analysis of data mining tools and classification techniques using weka in medical bioinformatics
US10521441B2 (en) System and method for approximate searching very large data
US11501240B2 (en) Systems and methods for process design including inheritance
CN106778032B (en) Ligand molecular magnanimity Feature Selection method in drug design
CN109376790A (en) A kind of binary classification method based on Analysis of The Seepage
US7047137B1 (en) Computer method and apparatus for uniform representation of genome sequences
Reddy et al. Clustering biological data
Gandhi et al. Overview of feature subset selection algorithm for high dimensional data
Stapor et al. Machine learning methods for the protein fold recognition problem
Arteta Albert et al. Intelligent Indexing—Boosting Performance in Database Applications by Recognizing Index Patterns
Huang et al. Research on hybrid feature selection method based on iterative approximation Markov blanket
Ma et al. GraphsformerCPI: Graph Transformer for Compound–Protein Interaction Prediction
Halder et al. Enhancing K-nearest neighbor algorithm: a comprehensive review and performance analysis of modifications
Lee et al. Deep hierarchical embedding for simultaneous modeling of gpcr proteins in a unified metric space
Zenbout et al. Prediction of cancer clinical endpoints using deep learning and rppa data
Reforgiato et al. Graphclust: A method for clustering database of graphs
Sanchez-Gendriz et al. Gene Sequence to 2D Vector Transformation for Virus Classification
Li et al. Biogc: A novel framework for biological network classification via machine learning
Geylan Training Machine Learning-based QSAR models with Conformal Prediction on Experimental Data from DNA-Encoded Chemical Libraries

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180914