CN110146695B - Method for screening human transthyretin interferent by adopting k nearest neighbor algorithm - Google Patents

Info

Publication number
CN110146695B
Authority
CN
China
Prior art keywords
model
organic chemical
organic chemicals
quantitative
httr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910378233.8A
Other languages
Chinese (zh)
Other versions
CN110146695A (en
Inventor
杨先海
刘会会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN201910378233.8A priority Critical patent/CN110146695B/en
Publication of CN110146695A publication Critical patent/CN110146695A/en
Application granted granted Critical
Publication of CN110146695B publication Critical patent/CN110146695B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/543Immunoassay; Biospecific binding assay; Materials therefor with an insoluble carrier for immobilising immunochemicals
    • G01N33/544Immunoassay; Biospecific binding assay; Materials therefor with an insoluble carrier for immobilising immunochemicals the carrier being organic
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Immunology (AREA)
  • Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Theoretical Computer Science (AREA)
  • Hematology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Urology & Nephrology (AREA)
  • Chemical & Material Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Biotechnology (AREA)
  • Food Science & Technology (AREA)
  • Pathology (AREA)
  • Microbiology (AREA)
  • Artificial Intelligence (AREA)
  • Medicinal Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Cell Biology (AREA)
  • Investigating, Analyzing Materials By Fluorescence Or Luminescence (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a method for screening human transthyretin (hTTR) interferents using a k-nearest neighbor (kNN) algorithm. For ionizable organic chemicals, quantitative descriptors based on morphological correction are first calculated; a binary classification model and a quantitative prediction model are then constructed from these morphology-corrected quantitative descriptors, functional group and molecular fragment descriptors, and the kNN algorithm. When a target organic chemical is screened, it is first classified as active or inactive by the binary classification model; the interference effect of an active chemical is then predicted by the quantitative model; finally, the predicted effect value is used to judge whether the target organic chemical is a potential human transthyretin interferent. The descriptors have a clear mechanistic meaning and are easy to calculate, the prediction method is easy to program, the prediction models show good goodness of fit, robustness and predictive ability, and the screening method is readily extensible and suitable for screening potential human transthyretin interferents within the application domain.

Description

Method for screening human transthyretin interferent by adopting k nearest neighbor algorithm
Technical Field
The invention relates to a method for screening human transthyretin interferents using a k-nearest neighbor algorithm, and belongs to the technical field of endocrine disruptor screening strategies.
Background
Endocrine disrupting effects caused by environmental endocrine disrupting chemicals (EDCs) seriously threaten the health of humans and wildlife and have become a global environmental problem. For regulators, effectively identifying and evaluating potential EDCs among commercial chemicals is the primary problem facing the chemical management authorities of every country. However, years of practice show that screening and evaluating potential EDCs by experiment alone suffers from low throughput (50-100 chemicals per year) and high cost (about 1 million US dollars per chemical), so testing commercial chemicals one by one under the existing test system is impractical (there are more than 140,000 commercial chemicals). Developing prediction models for endocrine disrupting effect endpoints is therefore of great significance for the control of EDCs.
Research has shown that endocrine-related diseases and disorders are often associated with the interference of EDCs with biological macromolecules such as hormone receptors and transport proteins. In the past, activation or inhibition of hormone receptor-mediated signal transduction was considered the primary mechanism of action of EDCs, and much work focused on the interaction between EDCs and hormone receptors. Recent studies, however, have shown that interference of EDCs with non-receptor-mediated processes such as hormone transport is equally important in their pathogenic process. Nevertheless, prediction models for hormone transport protein disruptors remain scarce.
Chinese patent CN106407665B discloses a virtual screening method for human transthyretin (hTTR) interferents, in which chemicals are first divided into 5 classes based on 10 groups, and the interference effect of a target organic chemical on hTTR is then predicted with either a quantitative prediction model for aromatic organic chemicals or one for alkane organic chemicals. However, this method has the following limitations: (1) it classifies the target organic chemical solely on the basis of the 10 groups, so a chemical that contains none of them cannot be classified and its interference effect cannot be predicted; (2) its descriptors are only Dragon descriptors calculated for the molecular state of the organic chemicals. However, Yang et al. (Yang XH, Xie HB, Chen JW, Li XH. Anionic phenolic compounds bind stronger with transthyretin than their neutral forms: nonnegligible mechanisms in virtual screening of endocrine disrupting chemicals. Chem Res Toxicol, 2013, 26(9): 1340-1347; Yang XH, Lyakurwa F, Xie HB, Chen JW, Li XH, Qiao XL, Cai XY. Different binding mechanisms of neutral and anionic poly-/perfluorinated chemicals to human transthyretin revealed by in silico models) found that, for example, aromatic rings in phenolic organic chemicals can form cation-π interactions with residues of hTTR. That is, some ionizable organic chemicals dissociate into ionic states under experimental or physiological pH conditions, and both the ionic and the molecular states play non-negligible roles in the interaction between ionizable organic chemicals and hTTR. The method therefore does not account for the ionic states of ionizable organic chemicals when constructing its hTTR interferent prediction model.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a method for screening human transthyretin interferent by adopting a k-nearest neighbor algorithm, which has a wide application range and comprehensively considers the interaction between the molecular state and the ionic state of an organic chemical and hTTR.
The technical scheme of the invention is as follows:
The method for screening human transthyretin interferents using the k-nearest neighbor algorithm comprises the following specific steps:
(1) Collecting organic chemical interference effect data
Collect interference effect data of organic chemicals, where the interference effect is the ability of an organic chemical to compete with 125I-T4 or a fluorescent probe molecule for the hTTR binding site, expressed as the half competition effect concentration IC50;
(2) Computing descriptors
The impact of ionizable group dissociation is characterized using quantitative descriptors based on morphological correction: the molecular-state and ionic-state structures of the organic chemical are optimized with Gaussian 16 software, the quantitative descriptors of the molecular and ionic states are extracted directly or calculated from the Gaussian 16 output files, and the morphology-corrected quantitative descriptor $X_{\text{Correction}}$ is calculated according to formula (1):
$$X_{\text{Correction}} = \delta_M \cdot X_M + \delta_I \cdot X_I \qquad (1)$$
Figure BDA0002052424930000021
where $X_M$ and $X_I$ are the descriptor values of the molecular and ionic states of the organic chemical, respectively, and $\delta_M$ and $\delta_I$ are the fractions of the molecular and ionic states, respectively. Functional group and molecular fragment descriptors are calculated with Dragon 6.0 software to characterize the influence of the various groups of the organic chemicals on the interference effect;
(3) construction and characterization of binary classification model
Using the collected qualitative data on whether the organic chemicals are active or inactive, a binary classification model is constructed with the Euclidean distance-based kNN algorithm, the model is characterized following the Organisation for Economic Co-operation and Development (OECD) guidance on model construction and validation, and the optimal model is determined. The optimal model contains three descriptors, V_aver-adj (morphology-corrected average molecular electrostatic potential), F-083 (fluorine atoms attached to sp3-hybridized carbon atoms) and H-047 (hydrogen atoms attached to sp3- or sp2-hybridized carbon atoms), with the number of nearest neighbors (k) equal to 3; the application domain of the binary classification model is a Euclidean distance of less than 0.928;
(4) construction and characterization of quantitative prediction model
Quantitative data obtained with the same test method and test conditions are selected, and a quantitative model is constructed with the Euclidean distance-based kNN algorithm. During modeling, the ability of an organic chemical to compete with 125I-T4 for the hTTR binding site is characterized by the logarithm of the relative effect potential RP, which is defined as:
$$RP = \frac{IC_{50}(T_4)}{IC_{50}(\text{organic chemical})}$$
where IC50(T4) and IC50(organic chemical) are the IC50 values of thyroxine (T4) and of the organic chemical, respectively. The optimal model is determined and contains four descriptors: nCb- (number of substituted benzene sp2-hybridized carbon atoms), nAROH (number of phenolic hydroxyl groups), nHBonds (number of intramolecular hydrogen bonds) and V_adj (morphology-corrected average dispersion, Π), with the number of nearest neighbors (k) equal to 3; the application domain of the quantitative prediction model is a Euclidean distance of less than 1.11;
(5) screening for human transthyretin interferents
① Calculate the descriptors required by the classification model, namely V_aver-adj (morphology-corrected average molecular electrostatic potential), F-083 (fluorine atoms attached to sp3-hybridized carbon atoms) and H-047 (hydrogen atoms attached to sp3- or sp2-hybridized carbon atoms), and assess whether the target organic chemical is within the application domain of the binary classification model;
If the target organic chemical is within the application domain of the binary classification model, determine whether it has hTTR interference activity according to the binary classification model; if the target organic chemical is inactive, no further evaluation is required; if the target organic chemical is active, predict its interference effect value with the quantitative prediction model; if the target organic chemical is not within the application domain of the model, it cannot be predicted with the quantitative prediction model;
② For an active target organic chemical, calculate the descriptors required by the quantitative prediction model, namely nCb- (number of substituted benzene sp2-hybridized carbon atoms), nAROH (number of phenolic hydroxyl groups), nHBonds (number of intramolecular hydrogen bonds) and V_adj (morphology-corrected average dispersion, Π), and assess whether the chemical is within the application domain of the quantitative prediction model;
If the target organic chemical is within the application domain of the quantitative prediction model, calculate its logRP value for hTTR with the selected quantitative prediction model; if it is not within the application domain of the quantitative prediction model, it cannot be predicted with the quantitative prediction model;
③ Judge whether the target organic chemical is able to interfere with hTTR transport of thyroxine according to the logRP value predicted by the quantitative prediction model:
If the logRP of the organic chemical is greater than 0, the target organic chemical binds hTTR more strongly than thyroxine;
If the logRP of the organic chemical is 0, the target organic chemical binds hTTR about as strongly as thyroxine;
If the logRP of the organic chemical is less than 0, the target organic chemical binds hTTR more weakly than thyroxine.
The half competition effect concentration IC50 in the invention is specifically the concentration of organic chemical required to displace 50% of the 125I-T4 or fluorescent probe molecules from the hTTR binding sites.
In a specific embodiment of the present invention, in step (1), interference effect data of 355 organic chemicals are collected, wherein the classes of the organic chemicals include uv sunscreens, organotins, organochlorine pesticides, substituted phenols, halogenated benzenes, alkyl carboxylic acids, bisphenol a and derivatives thereof, per/polyfluoro carboxylic acids and per/polyfluoro sulfonic acids, hydroxypolybromobiphenyl ethers, hydroxypolychlorobiphenyls, chlorinated alkenes, phosphate esters, sulfonic acid polychlorinated biphenyls, sulfonamide antibiotics, dioxin-type organic chemicals, polybromobiphenyl ethers, polychlorinated biphenyls, aniline-type organic chemicals, and the like.
In a specific embodiment of the present invention, in step (1), the interference effect data is determined by methods conventional in the art, including a radioligand competition binding method and a fluorescent competition displacement method.
In the embodiment of the present invention, in the step (3), among 355 organic chemicals, 175 and 180 organic chemicals are active and inactive, respectively.
In the specific embodiment of the present invention, in the step (4), quantitative data obtained by using a radioligand competition binding method under a condition of pH 8.0 is selected, and a quantitative model is constructed according to a euclidean distance-based kNN algorithm.
Compared with the prior art, the invention has the following advantages:
(1) in the aspect of data, by looking up the latest literature, collecting interference effect data of more chemicals on the hTTR, expanding the application domain of a model, and being capable of representing the influence of organic chemicals with different forms (molecular state and ionic state) on the action of the organic chemicals and the hTTR;
(2) aiming at the problems of effect existence and effect size prediction, Euclidean distance is adopted to represent the similarity of organic chemicals, a k nearest neighbor algorithm (kNN algorithm) which is easy to program is used for constructing a binary classification model and a quantitative prediction model, the existence of the effect of the target organic chemicals is distinguished by constructing the binary classification model, then the effect value of the target organic chemicals is predicted by the quantitative model, the descriptor mechanism is clear, the calculation is easy, the prediction method is easy to program, and the prediction model has better goodness of fit, robustness and prediction capability;
(3) the screening method has good expandability, and new classification models and quantitative prediction models can be conveniently added into the screening system.
Drawings
Fig. 1 is a graph showing the relationship between the experimental and predicted logRP values of the quantitative prediction model.
Fig. 2 is a graph of the application domain of the binary classification model, characterized by Euclidean distance.
Fig. 3 is a graph of the application domain of the quantitative prediction model, characterized by Euclidean distance.
Fig. 4 is a flow chart of human transthyretin interferent screening.
Detailed Description
The present invention will be described in more detail with reference to the following examples and the accompanying drawings.
The method for screening the human transthyretin interferent by adopting the k-nearest neighbor algorithm is shown in a flow chart of fig. 4, and comprises the following specific steps:
(1) Collecting organic chemical interference effect data
Interference effect data of organic chemicals on hTTR reported in the literature from 1990 to 2018 were collected, giving a total of 546 effect data points for 382 organic chemicals. The classes of organic chemicals include UV sunscreens, organotins, organochlorine pesticides, substituted phenols, halobenzenes, alkyl carboxylic acids, bisphenol A and derivatives thereof, per/polyfluoro carboxylic acids and per/polyfluoro sulfonic acids, hydroxypolybromodiphenyl ethers, hydroxypolychlorobiphenyls, chloroolefins, phosphate esters, sulfonic acid polychlorinated biphenyls, sulfonamide antibiotics, dioxins, polybromodiphenyl ethers, polychlorinated biphenyls, anilines, and the like. Statistics show that 225 of the 382 organic chemicals contain ionizable groups. After data validity analysis and removal of duplicate chemicals, data for 355 organic chemicals were used for modeling. The interference effect data were measured by the radioligand competitive binding method and the fluorescence competitive displacement method. The ability of an organic chemical to compete with 125I-T4 or a fluorescent probe molecule for the hTTR binding site is expressed as IC50, the concentration of organic chemical required to displace 50% of the 125I-T4 or fluorescent probe molecules from the hTTR binding sites.
(2) Computing descriptors
Quantitative descriptors based on morphological correction are used to characterize the impact of ionizable group dissociation. The morphology-corrected quantitative descriptor $X_{\text{Correction}}$ is calculated as follows:
$$X_{\text{Correction}} = \delta_M \cdot X_M + \delta_I \cdot X_I \qquad (1)$$
Figure BDA0002052424930000051
where $X_M$ and $X_I$ are the descriptor values of the molecular and ionic states of the organic chemical, respectively, and $\delta_M$ and $\delta_I$ are the fractions of the molecular and ionic states, respectively. The molecular-state and ionic-state structures of the organic chemical are optimized with Gaussian 16 software, the quantitative descriptors of the molecular and ionic states are extracted directly or calculated from the Gaussian 16 output files, and the morphology-corrected quantitative descriptors are calculated according to formula (1). In addition, functional group and molecular fragment descriptors are selected to characterize the influence of the various groups of the organic chemicals on the interference effect; these descriptors are calculated with Dragon 6.0 software.
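The weighting in formula (1) can be sketched in a few lines of Python, reading the formula as X_Correction = δ_M·X_M + δ_I·X_I. This is only an illustrative sketch: the function names are hypothetical, the descriptor values are made up, and the helper `acid_fractions` assumes a monoprotic acid whose speciation follows the Henderson-Hasselbalch relation, which the text itself does not specify.

```python
def corrected_descriptor(x_mol, x_ion, frac_mol, frac_ion):
    """Formula (1): X_Correction = delta_M * X_M + delta_I * X_I."""
    if abs(frac_mol + frac_ion - 1.0) > 1e-6:
        raise ValueError("fractions of molecular and ionic states should sum to 1")
    return frac_mol * x_mol + frac_ion * x_ion


def acid_fractions(pka, ph):
    """Return (delta_M, delta_I) for a monoprotic acid.

    ASSUMPTION: Henderson-Hasselbalch speciation; the patent does not
    state how the fractions are obtained.
    """
    frac_ion = 1.0 / (1.0 + 10.0 ** (pka - ph))
    return 1.0 - frac_ion, frac_ion


# Hypothetical chemical: pKa 7.0 at assay pH 8.0, made-up descriptor values.
delta_m, delta_i = acid_fractions(pka=7.0, ph=8.0)
x_corr = corrected_descriptor(x_mol=-2.0, x_ion=-90.0,
                              frac_mol=delta_m, frac_ion=delta_i)
```

With these inputs about 91% of the chemical is ionized, so the corrected descriptor lies close to the ionic-state value, which is the behavior the correction is meant to capture.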
(3) Construction and characterization of binary classification model
Qualitative activity data for the 355 organic chemicals, of which 175 are active and 180 are inactive, were used to construct the classification model. The binary classification model was built with the Euclidean distance-based kNN algorithm and characterized following the Organisation for Economic Co-operation and Development (OECD) guidance on model construction and validation. The results show that the optimal model contains three descriptors: V_aver-adj (morphology-corrected average molecular electrostatic potential), F-083 (fluorine atoms attached to sp3-hybridized carbon atoms) and H-047 (hydrogen atoms attached to sp3- or sp2-hybridized carbon atoms). The number of nearest neighbors (k) is 3. The model evaluation shows that the prediction sensitivity Sn is 0.867 and 0.844 for the training and validation sets, respectively, the prediction specificity Sp is 0.844 and 0.897, respectively, and the prediction accuracy Q is 0.856 and 0.873, respectively. A prediction accuracy above 0.85 for both the training set and the validation set means that more than 85% of the organic chemicals are correctly identified as active or inactive, so the constructed model has good predictive ability. The application domain of the model is characterized by the Euclidean distance: the binary classification model applies to chemicals with a Euclidean distance of less than 0.928 (as shown in Fig. 2).
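As an illustration of the classification step, the following Python sketch implements a Euclidean distance-based kNN vote with k = 3 and the three evaluation statistics named above (sensitivity, specificity, accuracy). The descriptor values and labels are invented toy data, the function names are hypothetical, and the sketch assumes the three descriptors have already been scaled to comparable ranges, which the text does not describe.

```python
import math
from collections import Counter


def knn_classify(train_x, train_y, query, k=3):
    """Majority vote among the k training chemicals closest to `query`."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def classification_metrics(y_true, y_pred):
    """Return (Sn, Sp, Q) for labels 1 (active) and 0 (inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(pairs)


# Toy training set: (V_aver-adj, F-083, H-047) per chemical, values made up.
train_x = [(0.10, 0.0, 0.20), (0.20, 0.0, 0.10), (0.15, 0.0, 0.30),
           (0.90, 1.0, 0.80), (0.80, 1.0, 0.90), (0.85, 1.0, 0.70)]
train_y = [1, 1, 1, 0, 0, 0]

label = knn_classify(train_x, train_y, query=(0.12, 0.0, 0.25))
sn, sp, q = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

With an odd k and two classes the vote cannot tie, which is one practical reason to prefer k = 3 over an even neighbor count.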
(4) Construction and characterization of quantitative prediction model
Because the quantitative data in the data set were obtained with many different test methods and test conditions, only quantitative data measured with the same test method and test conditions were selected to construct the quantitative model, in order to reduce data error. Analysis found that the largest number of data points was obtained with the radioligand competitive binding method at pH 8.0, so the 88 quantitative data points measured under these conditions were used to construct the quantitative prediction model with the Euclidean distance-based kNN algorithm. The training and validation sets contain 70 and 18 organic chemicals, respectively. During modeling, the ability of an organic chemical to compete with 125I-T4 for the hTTR binding site is characterized by the logarithm of the relative effect potential (RP), which is defined as:
$$RP = \frac{IC_{50}(T_4)}{IC_{50}(\text{organic chemical})}$$
where IC50(T4) and IC50(organic chemical) are the IC50 values (nM) of thyroxine (T4) and of the organic chemical, respectively.
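A small Python sketch of the logRP calculation follows. It assumes that RP is the ratio IC50(T4) / IC50(organic chemical), which is the reading consistent with T4 having logRP = 0 and stronger binders having logRP > 0; the function name and the IC50 values are hypothetical.

```python
import math


def log_rp(ic50_t4_nm, ic50_chemical_nm):
    """log10 of the relative effect potential.

    ASSUMPTION: RP = IC50(T4) / IC50(chemical), both in nM.
    """
    if ic50_t4_nm <= 0 or ic50_chemical_nm <= 0:
        raise ValueError("IC50 values must be positive")
    return math.log10(ic50_t4_nm / ic50_chemical_nm)


# Hypothetical values: a chemical ten times more potent than T4.
value = log_rp(ic50_t4_nm=50.0, ic50_chemical_nm=5.0)
```

Because both IC50 values must come from the same assay and conditions for the ratio to be meaningful, this matches the choice above to model only data measured by one method at one pH.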
The results show that the optimal model contains four descriptors: nCb- (number of substituted benzene sp2-hybridized carbon atoms), nAROH (number of phenolic hydroxyl groups), nHBonds (number of intramolecular hydrogen bonds) and V_adj (morphology-corrected average dispersion, Π). The number of nearest neighbors (k) is 3. The goodness of fit, robustness and predictive ability of the model were evaluated using the squared correlation coefficient between the experimental and predicted values of the training set (R²_training), the leave-one-out cross-validation coefficient (Q²_training), the external validation coefficient (Q²_validation), the root mean square errors of the training set and the external validation set (RMSE_training and RMSE_validation), and the mean absolute errors of the training set and the external validation set (MAE_training and MAE_validation). The training set results are: R²_training = 0.910, Q²_training = 0.804, RMSE_training = 0.397, MAE_training = 0.298. The validation set results are: Q²_validation = 0.852, RMSE_validation = 0.544, MAE_validation = 0.414. According to the model acceptance criteria, namely R²_training > 0.6, Q²_training > 0.6 and Q²_validation > 0.7, the model has good goodness of fit, robustness and predictive ability (as shown in Fig. 1). The application domain of the model is characterized by the Euclidean distance: the quantitative prediction model applies to chemicals with a Euclidean distance of less than 1.11 (as shown in Fig. 3).
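The quantitative step can be illustrated with a short Python sketch of kNN regression together with the RMSE and MAE statistics used above. It assumes the prediction is the unweighted mean of the logRP values of the k = 3 nearest training chemicals; the text does not say whether neighbors are weighted, and the function names and data are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Unweighted mean response of the k nearest training chemicals.

    ASSUMPTION: equal weights; distance weighting is another common choice.
    """
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    nearest = [y for _, y in ranked[:k]]
    return sum(nearest) / len(nearest)


def rmse(y_true, y_pred):
    """Root mean square error between experimental and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between experimental and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# One-dimensional toy data to keep the arithmetic easy to follow.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]
predicted = knn_predict(train_x, train_y, query=(1.0,))
```

In a leave-one-out evaluation the same `knn_predict` call would be repeated for each training chemical with that chemical removed from `train_x` and `train_y`.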
(5) Human transthyretin interferent screening method
① Calculate the descriptors required by the classification model, namely V_aver-adj (morphology-corrected average molecular electrostatic potential), F-083 (fluorine atoms attached to sp3-hybridized carbon atoms) and H-047 (hydrogen atoms attached to sp3- or sp2-hybridized carbon atoms), and evaluate whether the target organic chemical is within the application domain of the binary classification model.
If the target organic chemical is in the range of the model application domain, calculating whether the target organic chemical has the hTTR interference activity or not according to the classification model; and judging the next processing step according to the classification result. If the target organic chemical is inactive, no further evaluation is required; if the target organic chemical is active, the magnitude of the interference effect is predicted according to the following quantitative prediction model.
If the target organic chemical is not within the application domain of the quantitative prediction model, prediction cannot be performed by the model.
② For an active target organic chemical, calculate the descriptors required by the quantitative prediction model, namely nCb- (number of substituted benzene sp2-hybridized carbon atoms), nAROH (number of phenolic hydroxyl groups), nHBonds (number of intramolecular hydrogen bonds) and V_adj (morphology-corrected average dispersion, Π), and evaluate whether the chemical is within the application domain of the quantitative prediction model.
If the target organic chemical is in the application domain range of the model, calculating the logRP value of the target organic chemical to the hTTR according to the selected model;
if the target organic chemical is not within the application domain of the model, the model cannot be used for prediction.
③ Judge whether the target organic chemical is able to interfere with hTTR transport of thyroxine according to the predicted logRP value. By definition, the logRP of T4 is 0. The ability of the target organic chemical to compete with thyroxine for the hTTR binding site can therefore be judged by comparing its logRP with 0.
If the organic chemical logRP is greater than 0, the binding capacity of the target organic chemical and the hTTR is stronger than that of thyroxine, so that the organic chemical has higher priority;
if the logRP of the organic chemical is 0, the binding capacity of the target organic chemical and the hTTR is similar to that of thyroxine;
if the organic chemical logRP <0, it indicates that the target organic chemical has weaker binding ability to hTTR than thyroxine, and thus has lower priority.
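The decision flow of steps ① to ③ can be sketched in Python as follows. The thresholds 0.928 and 1.11 are taken from the text; everything else is an assumption made for illustration: the function names, the toy training data, the unweighted neighbor mean, and in particular the application-domain measure, which is taken here as the mean Euclidean distance to the k nearest training chemicals because the text does not say how the distance is computed.

```python
import math
from collections import Counter

AD_CLASSIFIER = 0.928    # application domain limit of the classification model
AD_QUANTITATIVE = 1.11   # application domain limit of the quantitative model


def nearest(train, query, k=3):
    """Return the k (distance, response) pairs closest to `query`."""
    pairs = [(math.dist(x, query), y) for x, y in train]
    return sorted(pairs, key=lambda pair: pair[0])[:k]


def in_domain(neighbours, limit):
    # ASSUMPTION: domain distance = mean distance to the k nearest neighbours.
    return sum(d for d, _ in neighbours) / len(neighbours) < limit


def screen(cls_train, quant_train, cls_query, quant_query, k=3):
    """Return (verdict, predicted logRP or None) for one target chemical."""
    nn = nearest(cls_train, cls_query, k)
    if not in_domain(nn, AD_CLASSIFIER):
        return "outside classification domain", None
    active = Counter(label for _, label in nn).most_common(1)[0][0]
    if not active:
        return "inactive", None
    nn = nearest(quant_train, quant_query, k)
    if not in_domain(nn, AD_QUANTITATIVE):
        return "active, outside quantitative domain", None
    log_rp = sum(y for _, y in nn) / len(nn)
    if log_rp > 0:
        return "binds hTTR more strongly than T4", log_rp
    if log_rp < 0:
        return "binds hTTR more weakly than T4", log_rp
    return "binds hTTR about as strongly as T4", log_rp


# Toy data, all values made up. Classification: (V_aver-adj, F-083, H-047).
cls_train = [((0.10, 0.0, 0.20), 1), ((0.20, 0.0, 0.10), 1), ((0.15, 0.0, 0.30), 1),
             ((0.90, 1.0, 0.80), 0), ((0.80, 1.0, 0.90), 0), ((0.85, 1.0, 0.70), 0)]
# Quantitative: (nCb-, nAROH, nHBonds, V_adj) with experimental logRP.
quant_train = [((6, 1, 0, 0.30), 0.9), ((6, 1, 0, 0.40), 0.6),
               ((6, 1, 1, 0.35), 0.3), ((0, 0, 0, 0.90), -1.2)]

verdict, log_rp = screen(cls_train, quant_train,
                         cls_query=(0.12, 0.0, 0.22),
                         quant_query=(6, 1, 0, 0.35))
```

Keeping the two models behind one `screen` function reflects the staged design of the method: the quantitative model is consulted only for chemicals that the classifier marks as active and that fall inside both application domains.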
Example 1
2,3,3',5,5'-Pentachlorobiphenyl has no hTTR interference activity. The steps for predicting its interference activity with the method are as follows:
First, the descriptors required by the classification model, namely V_aver-adj (morphology-corrected average molecular electrostatic potential), F-083 (fluorine atoms attached to sp3-hybridized carbon atoms) and H-047 (hydrogen atoms attached to sp3- or sp2-hybridized carbon atoms), were calculated with Gaussian 16 and Dragon 6.0. The Euclidean distance was then calculated to be 0.191, which is within the application domain of the binary classification model (Euclidean distance less than 0.928). The binary classification model can therefore be used to determine the interference activity of 2,3,3',5,5'-pentachlorobiphenyl on hTTR. Using the descriptors of the organic chemicals in the training set of the binary classification model and the descriptors of 2,3,3',5,5'-pentachlorobiphenyl, the Euclidean distance-based kNN algorithm predicts that 2,3,3',5,5'-pentachlorobiphenyl has no hTTR interference activity, which is consistent with the experimental result. No further evaluation is necessary.
Example 2
4'-HO-3,3',4,5,5'-pentachlorobiphenyl has hTTR interference activity (experimental logRP = 0.933). The steps for predicting its interference activity with this method are as follows:
The descriptors required by the classification model were calculated with Gaussian 16 and Dragon 6.0, namely V_aver-adj (form-corrected average molecular electrostatic potential), F-083 (fluorine atom attached to an sp3-hybridized carbon atom) and H-047 (hydrogen atom attached to an sp3- or sp2-hybridized carbon atom). The Euclidean distance was then calculated to be 0.187, which is within the application domain of the binary classification model (Euclidean distance less than 0.928). The binary classification model can therefore be used to determine whether 4'-HO-3,3',4,5,5'-pentachlorobiphenyl has interference activity toward hTTR. Using the Euclidean-distance-based kNN algorithm with the descriptors of the organic chemicals in the training set of the binary classification model and the descriptors of 4'-HO-3,3',4,5,5'-pentachlorobiphenyl, the compound was predicted to have hTTR interference activity, which is consistent with the experimental result. Further evaluation is required.
The interference effect value was then predicted with the quantitative prediction model. The descriptors required by the quantitative prediction model were calculated with Gaussian 16 and Dragon 6.0, namely nCb- (number of sp2-hybridized substituted benzene carbon atoms), nArOH (number of phenolic hydroxyl groups), nHBonds (number of intramolecular hydrogen bonds) and V_adj (form-corrected average dispersion (II)). The Euclidean distance was then calculated to be 0.265, which is within the application domain of the quantitative prediction model (Euclidean distance less than 1.11). The quantitative prediction model can therefore be used to predict the interference effect value of 4'-HO-3,3',4,5,5'-pentachlorobiphenyl on hTTR. Using the Euclidean-distance-based kNN algorithm with the descriptors of the organic chemicals in the training set of the quantitative prediction model and the descriptors of 4'-HO-3,3',4,5,5'-pentachlorobiphenyl, the interference effect value logRP of the compound on hTTR was predicted to be 0.673; the experimental logRP is 0.933, so the predicted value is consistent with the experimental value. Because the predicted logRP is greater than 0, 4'-HO-3,3',4,5,5'-pentachlorobiphenyl binds hTTR more strongly than thyroxine, and close attention should be paid to the possibility that it disrupts the thyroid system by interfering with hTTR-mediated transport of thyroxine.
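The two examples follow the same tiered decision path, which can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `screen` and its arguments are hypothetical, the distances and predictions are taken as already computed inputs, and only the two application-domain limits (0.928 and 1.11) and the sign rule for logRP come from the text.

```python
from typing import Optional

# Application-domain limits stated in the text.
CLASSIFIER_DOMAIN_LIMIT = 0.928
QUANT_DOMAIN_LIMIT = 1.11


def screen(class_distance: float, predicted_active: bool,
           quant_distance: Optional[float] = None,
           predicted_log_rp: Optional[float] = None) -> str:
    """Return a short verdict for one target chemical (hypothetical helper)."""
    # Tier 1: the classification model applies only inside its domain.
    if class_distance >= CLASSIFIER_DOMAIN_LIMIT:
        return "outside classification domain"
    if not predicted_active:
        return "inactive: no further evaluation"
    # Tier 2: active chemicals go on to the quantitative model.
    if quant_distance is None or predicted_log_rp is None:
        return "active: quantitative prediction needed"
    if quant_distance >= QUANT_DOMAIN_LIMIT:
        return "active: outside quantitative domain"
    # The sign of logRP compares binding strength with thyroxine.
    if predicted_log_rp > 0:
        return "active: binds hTTR more strongly than thyroxine"
    if predicted_log_rp < 0:
        return "active: binds hTTR less strongly than thyroxine"
    return "active: binding similar to thyroxine"


# Inputs below are the values reported in Example 1 and Example 2.
print(screen(0.191, False))
print(screen(0.187, True, 0.265, 0.673))
```

Keeping the domain checks ahead of each prediction mirrors the order used in the examples: a prediction is only interpreted once the chemical is known to lie inside the relevant application domain.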

Claims (5)

1. A method for screening human transthyretin (hTTR) interferents using the k-nearest neighbor algorithm, characterized by comprising the following specific steps:
(1) Collecting interference effect data of organic chemicals
Collecting interference effect data of organic chemicals, wherein the interference effect data are the ability of an organic chemical to compete with 125I-T4 for the hTTR binding site, i.e., the concentration producing half of the competitive effect, IC50;
(2) Calculating descriptors
The influence of ionizable-group dissociation is characterized with form-corrected quantitative descriptors: the molecular-state and ionic-state structures of the organic chemical are optimized with Gaussian 16, the quantitative descriptors of the molecular state and the ionic state are extracted directly from, or calculated from, the Gaussian 16 output file, and the form-corrected quantitative descriptor X_corrected is calculated according to formula (1):

$$X_{\text{corrected}} = \delta_M X_M + \delta_I X_I \tag{1}$$

[Formula (2): equation images not reproduced in this text]

wherein X_M and X_I are the descriptor values of the molecular state and the ionic state of the organic chemical, respectively, and δ_M and δ_I are the fractions of the molecular state and the ionic state, respectively; functional-group and molecular-fragment descriptors are calculated with Dragon 6.0 to characterize the influence of the various groups of the organic chemical on the interference effect;
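A form-corrected descriptor can be illustrated with a few lines of Python. The sketch assumes that the correction is a fraction-weighted sum of the molecular-state and ionic-state values, which is what the definitions of X_M, X_I, δ_M and δ_I suggest, and that the two fractions are supplied by the caller and sum to 1. The function name and the sample numbers are hypothetical.

```python
def form_corrected(x_mol: float, x_ion: float,
                   frac_mol: float, frac_ion: float) -> float:
    """Weight the molecular- and ionic-state descriptor values by their fractions.

    Assumption: the two fractions describe a single molecular/ionic pair
    and therefore sum to 1.
    """
    if abs(frac_mol + frac_ion - 1.0) > 1e-9:
        raise ValueError("fractions of the two states must sum to 1")
    return frac_mol * x_mol + frac_ion * x_ion


# Made-up descriptor values: 25 % molecular state, 75 % ionic state.
print(form_corrected(10.0, 20.0, 0.25, 0.75))
```

How the fractions themselves are obtained is not shown here, because the corresponding formula is not legible in this text.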
(3) Construction and characterization of the binary classification model
Establishing a binary classification model with the Euclidean-distance-based kNN algorithm from the collected qualitative data on whether the organic chemicals are active or inactive; characterizing the model according to the guidance of the Organisation for Economic Co-operation and Development (OECD) on model development and validation; and determining an optimal model, wherein the optimal model comprises three descriptors, namely the form-corrected average molecular electrostatic potential V_aver-adj, the fluorine atom attached to an sp3-hybridized carbon atom F-083, and the hydrogen atom attached to an sp3- or sp2-hybridized carbon atom H-047; the number of neighbors k is 3; and the application domain of the binary classification model is a Euclidean distance of less than 0.928;
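A Euclidean-distance-based kNN classifier with k = 3 and a distance-based application domain can be sketched as follows. Several points are assumptions rather than details from the text: the descriptors are taken to be already scaled, the distance used for the domain check is taken to be the mean distance to the k nearest training chemicals, the class is decided by simple majority vote, and the training data are invented toy values.

```python
import math
from collections import Counter
from typing import Optional, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query: Sequence[float],
                 train_x: Sequence[Sequence[float]],
                 train_y: Sequence[int],
                 k: int = 3,
                 domain_limit: float = 0.928) -> Tuple[Optional[int], float]:
    """Return (label, mean distance); label is None outside the domain."""
    if len(train_x) < k:
        raise ValueError("training set smaller than k")
    ranked = sorted(((euclidean(query, x), y) for x, y in zip(train_x, train_y)),
                    key=lambda pair: pair[0])
    nearest = ranked[:k]
    mean_dist = sum(d for d, _ in nearest) / k
    if mean_dist >= domain_limit:   # assumed domain criterion
        return None, mean_dist
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], mean_dist


# Toy training set: three descriptors per chemical, 0 = inactive, 1 = active.
train_x = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
           (1.0, 1.0, 1.0), (1.1, 1.0, 1.0)]
train_y = [0, 0, 0, 1, 1]
label, dist = knn_classify((0.05, 0.05, 0.0), train_x, train_y)
print(label, round(dist, 3))
```

With three descriptors the vectors would hold V_aver-adj, F-083 and H-047 in a fixed order; the order only has to be the same for the query and the training set.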
(4) Construction and characterization of the quantitative prediction model
Selecting quantitative data obtained with the same test method and test conditions, and constructing a quantitative prediction model with the Euclidean-distance-based kNN algorithm; in modeling, the logarithm of the relative potency, logRP, is used to characterize the ability of an organic chemical to compete with 125I-T4 for the hTTR binding site, RP being defined as:

$$RP = \frac{IC_{50}(\mathrm{T4})}{IC_{50}(\text{organic chemical})} \tag{3}$$

wherein IC50(T4) and IC50(organic chemical) are the IC50 values of thyroxine and of the organic chemical, respectively; determining an optimal model, wherein the optimal model comprises four descriptors: the number of sp2-hybridized substituted benzene carbon atoms nCb-, the number of phenolic hydroxyl groups nArOH, the number of intramolecular hydrogen bonds nHBonds, and the form-corrected average dispersion V_adj; the number of neighbors k is 3; and the application domain of the quantitative prediction model is a Euclidean distance of less than 1.11;
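The relation between IC50 values and logRP can be checked with a short calculation. The sketch assumes that RP is the ratio of the IC50 of thyroxine to the IC50 of the organic chemical, which agrees with the rule that logRP > 0 indicates stronger binding than thyroxine, and that both IC50 values are given in the same unit. The function name and the numbers are hypothetical.

```python
import math


def log_rp(ic50_t4: float, ic50_chemical: float) -> float:
    """Log10 relative potency of a chemical versus thyroxine (T4)."""
    if ic50_t4 <= 0 or ic50_chemical <= 0:
        raise ValueError("IC50 values must be positive")
    # A lower IC50 than T4 means a stronger competitor, hence logRP > 0.
    return math.log10(ic50_t4 / ic50_chemical)


# Made-up IC50 values in the same concentration unit.
print(log_rp(100.0, 10.0))   # chemical is the stronger competitor
print(log_rp(50.0, 50.0))    # same potency as T4
```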
(5) Screening for human transthyretin interferents
First, calculating the descriptors required by the classification model, namely the form-corrected average molecular electrostatic potential V_aver-adj, the fluorine atom attached to an sp3-hybridized carbon atom F-083, and the hydrogen atom attached to an sp3- or sp2-hybridized carbon atom H-047, and evaluating whether the target organic chemical is within the application domain of the binary classification model;
if the target organic chemical is in the application domain range of the binary classification model, calculating whether the target organic chemical has the hTTR interference activity or not according to the binary classification model; if the target organic chemical is inactive, no further evaluation is required; if the target organic chemical is active, predicting the interference effect value according to a quantitative prediction model; if the target organic chemicals are not in the application domain range of the binary classification model, the target organic chemicals cannot be predicted by the binary classification model;
Second, for an active target organic chemical, calculating the descriptors required by the quantitative prediction model, namely the number of sp2-hybridized substituted benzene carbon atoms nCb-, the number of phenolic hydroxyl groups nArOH, the number of intramolecular hydrogen bonds nHBonds, and the form-corrected average dispersion V_adj, and evaluating whether the target organic chemical is within the application domain of the quantitative prediction model;
if the target organic chemical is within the application domain of the quantitative prediction model, calculating the logRP value of the target organic chemical toward hTTR with the selected quantitative prediction model; if the target organic chemical is not within the application domain of the quantitative prediction model, it cannot be predicted by the quantitative prediction model;
judging from the logRP value predicted by the quantitative prediction model whether the target organic chemical is able to interfere with the transport of thyroxine by hTTR:
if logRP of the organic chemical > 0, the target organic chemical binds hTTR more strongly than thyroxine;
if logRP of the organic chemical = 0, the target organic chemical binds hTTR about as strongly as thyroxine;
if logRP of the organic chemical < 0, the target organic chemical binds hTTR less strongly than thyroxine.
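The quantitative step of the screening can be sketched as a kNN regression. The following Python example rests on assumptions that the text does not confirm: the predicted logRP is taken as the unweighted mean of the logRP values of the k = 3 nearest training chemicals, the application-domain distance is taken as the mean distance to those neighbors, and all training values are invented for illustration.

```python
import math
from typing import Optional, Sequence


def knn_predict_log_rp(query: Sequence[float],
                       train_x: Sequence[Sequence[float]],
                       train_log_rp: Sequence[float],
                       k: int = 3,
                       domain_limit: float = 1.11) -> Optional[float]:
    """Predict logRP, or return None when the query is outside the domain."""
    if len(train_x) < k:
        raise ValueError("training set smaller than k")
    # Pair each training chemical's distance to the query with its logRP.
    pairs = sorted(((math.dist(query, x), y)
                    for x, y in zip(train_x, train_log_rp)),
                   key=lambda pair: pair[0])
    nearest = pairs[:k]
    mean_dist = sum(d for d, _ in nearest) / k
    if mean_dist >= domain_limit:   # assumed domain criterion
        return None
    # Assumed aggregation: plain average of the neighbours' logRP values.
    return sum(y for _, y in nearest) / k


# Toy training set: four descriptors per chemical and made-up logRP values.
train_x = [(0.0, 0.0, 0.0, 0.0), (0.2, 0.0, 0.0, 0.0),
           (0.0, 0.2, 0.0, 0.0), (2.0, 2.0, 2.0, 2.0)]
train_log_rp = [0.5, 0.7, 0.9, -1.0]
predicted = knn_predict_log_rp((0.1, 0.1, 0.0, 0.0), train_x, train_log_rp)
print(predicted)
```

`math.dist` requires Python 3.8 or later. A distance-weighted average would be an equally plausible aggregation; the unweighted mean is used here only because it is the simplest choice.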
2. The method of claim 1, wherein in step (1), interference effect data are collected for 355 organic chemicals, the classes of organic chemicals including UV sunscreen agents, organotins, organochlorine pesticides, substituted phenols, halobenzenes, alkyl carboxylic acids, bisphenol A and derivatives thereof, per-/polyfluorinated carboxylic acids and per-/polyfluorinated sulfonic acids, hydroxylated polybrominated diphenyl ethers, hydroxylated polychlorinated biphenyls, chloroolefins, phosphate esters, sulfonated polychlorinated biphenyls, sulfonamide antibiotics, dioxins, polybrominated diphenyl ethers, polychlorinated biphenyls, and anilines.
3. The method according to claim 1, wherein in the step (1), the interference effect data is measured by a radioligand competitive binding method or a fluorescent competitive displacement method.
4. The method of claim 2, wherein in step (3), the number of active and inactive organic chemicals in the 355 organic chemicals is 175 and 180, respectively.
5. The method according to claim 1, wherein in step (4), quantitative data obtained by the radioligand competitive binding method at pH = 8.0 are selected, and the quantitative prediction model is constructed with the Euclidean-distance-based kNN algorithm.
CN201910378233.8A 2019-05-08 2019-05-08 Method for screening human transthyretin interferent by adopting k nearest neighbor algorithm Active CN110146695B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910378233.8A CN110146695B (en) 2019-05-08 2019-05-08 Method for screening human transthyretin interferent by adopting k nearest neighbor algorithm

Publications (2)

Publication Number Publication Date
CN110146695A CN110146695A (en) 2019-08-20
CN110146695B true CN110146695B (en) 2021-12-10

Family

ID=67594932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910378233.8A Active CN110146695B (en) 2019-05-08 2019-05-08 Method for screening human transthyretin interferent by adopting k nearest neighbor algorithm

Country Status (1)

Country Link
CN (1) CN110146695B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011157655A1 (en) * 2010-06-15 2011-12-22 Biocrates Life Sciences Ag Use of bile acids for prediction of an onset of sepsis
CN103345544A (en) * 2013-06-11 2013-10-09 大连理工大学 Predicting organic chemical biodegradability according to logistic regression method
CN103650100A (en) * 2011-04-28 2014-03-19 菲利普莫里斯生产公司 Computer-assisted structure identification
CN103761431A (en) * 2014-01-10 2014-04-30 大连理工大学 Method for predicting fish bio-concentration factors of organic chemicals by quantitative structure-activity relationship
CN106407665A (en) * 2016-09-05 2017-02-15 大连理工大学 Virtual screening method of human transthyretin (hTTR) disturbing chemicals
CN107563133A (en) * 2017-08-30 2018-01-09 大连理工大学 Using the method for the chlorine radical reaction rate constant of quantitative structure activity relationship model prediction organic chemicals

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Anionic Phenolic Compounds Bind Stronger with Transthyretin than Their Neutral Forms: Nonnegligible Mechanisms in Virtual Screening of Endocrine Disrupting Chemicals; Xianhai Yang et al.; Chem. Res. Toxicol.; 20130813 (No. 26); pp. 1340-1347 *
Development of classification model and QSAR model for predicting binding affinity of endocrine disrupting chemicals to human sex hormone-binding globulin; Huihui Liu et al.; Chemosphere; 20160506 (No. 156); pp. 1-7 *
Development of liposome/water partition coefficients predictive models for neutral and ionogenic organic chemicals; Shiyu Lin et al.; Ecotoxicology and Environmental Safety; 20190423 (No. 179); pp. 40-49 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant