CN113782110B - Compound toxicity prediction system and method based on humanized chip, molecular fingerprint and deep learning - Google Patents

Info

Publication number
CN113782110B
Authority
CN
China
Prior art keywords
compound
chip
humanized
data
toxicity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111134799.XA
Other languages
Chinese (zh)
Other versions
CN113782110A (en)
Inventor
李健
梁伟成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202111134799.XA
Publication of CN113782110A
Application granted
Publication of CN113782110B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Mathematical Physics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a compound toxicity prediction system and method based on a humanized chip, molecular fingerprints and deep learning. The system comprises: a data preprocessing unit, which downloads and organizes the relevant data and extracts characteristic parameters; a compound similarity comparison unit, which compares the compound to be tested with the poisons in the library, screens out and displays the poisons whose similarity is greater than 0.5, and outputs the similarity; a humanized chip related data unit, which outputs the humanized chip related data of those poisons; a compound toxicity prediction unit, which predicts from the structure of the compound whether it is toxic; and a data visualization unit, which displays each kind of data visually. Based on humanized chip, molecular fingerprint and deep learning technology, the invention can predict the toxicity of a compound from its structure.

Description

Compound toxicity prediction system and method based on humanized chip, molecular fingerprint and deep learning
Technical Field
The invention relates to a compound toxicity prediction system and method based on a humanized chip, molecular fingerprints and deep learning, and belongs to the technical field of bioinformatics research on compound properties.
Background
Molecular fingerprinting is a technique that converts a chemical structure into a binary sequence (the fingerprint) by detecting the presence or absence of specific substructures in the molecule. Molecular fingerprints make the structure of a compound machine-readable, which enables structural retrieval and property prediction.
The humanized chip (organ-on-a-chip) is a microfluidic micro-physiological system that allows high-resolution, real-time imaging and analysis of the structure and function of living human cells at the level of in-vitro tissues and organs. It can be used to construct, in vitro, organ and tissue structures whose function is close to physiological. Compared with traditional wet-lab methods such as clinical experiments, it offers a shorter experimental cycle and lower cost.
At present, traditional methods of assessing compound toxicity, such as clinical testing, have long test cycles and high costs and therefore consume a great deal of time and money. The number of newly discovered compounds has grown in recent years and new compounds appear at an increasing rate, so traditional toxicity assessment can no longer keep up with the demand for toxicity prediction.
Disclosure of Invention
Technical problem: the invention aims to predict the toxicity of unknown compounds by means of molecular fingerprint technology and deep learning methods.
Technical solution: the technical solution adopted to solve this problem is as follows.
A compound toxicity prediction system based on a humanized chip, molecular fingerprints and deep learning comprises:
a data preprocessing module, used for collecting and analyzing compound data and extracting characteristic parameters;
a compound similarity comparison module, used for comparing the compound with the poisons in the library and giving the similarity;
a humanized chip related data module, used for providing humanized chip related experimental data for the poisons in the library;
a compound toxicity prediction module, used for predicting, by machine learning, the toxicity a compound may have from the characteristic parameters extracted by the data preprocessing module;
and a data visualization module, used for visualizing all the data to present the poisons similar to the compound, the related humanized chip data and the possible toxicity.
The compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning comprises the following steps:
S1, downloading the relevant raw data of poisons and compounds from PubChem, including their SMILES strings, basic toxicity information and structures, and screening, extracting and processing their characteristic parameters;
S2, computing molecular fingerprints with the RDKit toolkit from the SMILES strings of the poisons and compounds downloaded in S1, and looping over the database to compute, with the Tanimoto similarity method in RDKit, the similarity between the compound to be tested and each downloaded poison or compound;
S3, searching PubMed for applications of the downloaded poisons and compounds in humanized chips, downloading the related literature information, and extracting important keywords;
S4, establishing a preliminary model based on the characteristic parameters from S1: carrying out regression analysis on the relation between the characteristic parameters of the compounds and their toxicity to judge how strongly each characteristic parameter influences toxicity; dividing the original data set into a training set, a validation set and a test set; and obtaining the toxicity prediction performance with a KNN model;
S5, establishing the toxicity prediction model from the preliminary model of S4, optimizing it and tuning its parameters with the data in the database, and predicting the toxicity of the compound to be tested.
Further, the specific steps of S1 are as follows:
S1.1, downloading from PubChem, by CAS Registry Number (CASRN), the characteristic parameters of the poisons listed in the inventory of the U.S. Toxic Substances Control Act (TSCA), including poison name, SMILES, chemical safety and common toxicology information;
S1.2, downloading the characteristic parameters of non-toxic compounds by CASRN, including compound name, SMILES and chemical safety;
S1.3, extracting the structural features of the poisons and the non-toxic compounds from their SMILES strings, including the number of C atoms, the number of halogen atoms, the number of benzene rings, the number of double bonds, the number of triple bonds, the number of P atoms and the number of S atoms; non-toxic substances are encoded as 0 and poisons as 1 to simplify computation;
S1.4, converting all data into CSV files and cleaning them by removing records with missing or extreme values.
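The structural features of S1.3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the tokenizer is a simplified assumption that handles only common SMILES atom and bond symbols, the name `count_features` is hypothetical, and N and O atoms are counted as well for illustration. Counting benzene rings needs ring perception, which is left to a cheminformatics toolkit such as RDKit and is omitted here; aromatic bonds written in lower-case form are not counted as double bonds.

```python
import re

# Simplified SMILES tokens: bracket atoms, two-letter halogens, organic-subset
# atoms (upper case = aliphatic, lower case = aromatic) and bond symbols.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|=|#")
# Element symbol inside a bracket atom, after an optional isotope number.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")
HALOGENS = {"F", "Cl", "Br", "I"}


def count_features(smiles: str) -> dict:
    """Count atoms and explicit multiple bonds in a SMILES string (sketch only)."""
    counts = {"C": 0, "N": 0, "O": 0, "P": 0, "S": 0,
              "halogen": 0, "double_bonds": 0, "triple_bonds": 0}
    for token in TOKEN.findall(smiles):
        if token == "=":
            counts["double_bonds"] += 1
        elif token == "#":
            counts["triple_bonds"] += 1
        else:
            if token.startswith("["):
                match = BRACKET_ELEMENT.match(token)
                element = match.group(1) if match else ""
            else:
                element = token
            element = element.capitalize()  # aromatic "c" counts as carbon
            if element in HALOGENS:
                counts["halogen"] += 1
            elif element in counts:
                counts[element] += 1
    return counts


features = count_features("ClCC(=O)O")  # chloroacetic acid
print(features)
```

For chloroacetic acid this yields two C atoms, two O atoms, one halogen atom and one double bond. Each compound's counts, together with its 0/1 toxicity code, would form one row of the CSV file described in S1.4.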
Further, the specific steps of calculating the similarity between the compound to be tested and the poison molecules in the database in S2 are as follows:
S2.1, storing the SMILES strings of the downloaded substances and computing their molecular fingerprints;
computing the molecular fingerprint of the compound to be tested with the RDKit toolkit, computing its similarity to each downloaded substance with the Tanimoto similarity method in RDKit, and displaying the similarity;
S2.2, the Tanimoto similarity is calculated as:
$$T(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2 - \sum_{i=1}^{n} a_i b_i}$$
wherein a_1, a_2, ..., a_n is the molecular fingerprint of the downloaded substance and b_1, b_2, ..., b_n is the molecular fingerprint of the compound to be tested;
S2.3, displaying the poisons or compounds similar to the compound to be tested in a visual form.
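The similarity screening can be sketched in plain Python under one simplifying assumption: each fingerprint is represented as the set of positions of its "on" bits, so for binary fingerprints the Tanimoto similarity reduces to shared bits divided by the bits set in either fingerprint. The function names, the toy library and its bit positions are hypothetical; in the method itself the fingerprints come from RDKit. The 0.5 cutoff is the screening threshold the document uses.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def screen_similar(query_fp: set, library: dict, threshold: float = 0.5) -> list:
    """Return (name, similarity) pairs above the threshold, most similar first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints, for illustration only.
library = {
    "poison_A": {1, 4, 7, 9},
    "poison_B": {2, 3, 5},
    "poison_C": {1, 4, 7, 8, 9},
}
query = {1, 4, 7, 9, 12}
hits = screen_similar(query, library)
print(hits)
```

Here `poison_A` scores 4 / (5 + 4 - 4) = 0.8 and `poison_C` scores 4 / (5 + 5 - 4) ≈ 0.67, so both pass the cutoff, while `poison_B` shares no bits with the query and is dropped.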
Further, the specific steps of extracting the related information of the humanized chip in the step S3 are as follows:
S3.1, searching PubMed for humanized chip literature and downloading it, with the search keywords: Organ-on-a-Chip, organic chips, organ chips, liver chips, lung chips, kidney chips, skin chips, brain chips, heart chips, intestine chips, blood vessel chips, tumor chips;
S3.2, extracting the important keywords from the humanized chip literature, including poison name, target organ, main cells, experimental materials, experimental instruments, culture environment, on-chip environment, model type and statistical method;
S3.3, filing each humanized chip document in the database under the poison named in it, and displaying the corresponding extracted keywords;
S3.4, classifying each poison by whether humanized chip research has been carried out on it and, if so, providing the corresponding humanized chip data, so that researchers can carry out humanized chip research on substances structurally similar to that poison;
S3.5, screening out the poisons whose calculated similarity to the compound is greater than 0.5, displaying them in a list, and visually displaying the humanized chip related data on each poison's detail page.
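Steps S3.3 to S3.5 amount to indexing literature records by poison name and looking up the poisons that pass the similarity cutoff. The sketch below assumes a simple in-memory dictionary; the record contents, field names and function names are hypothetical placeholders, not data from the patent.

```python
from collections import defaultdict

# Hypothetical literature records; field names follow the keyword list in S3.2.
records = [
    {"poison_name": "poison_A", "target_organ": "liver", "model_type": "liver chip"},
    {"poison_name": "poison_A", "target_organ": "kidney", "model_type": "kidney chip"},
    {"poison_name": "poison_C", "target_organ": "lung", "model_type": "lung chip"},
]


def build_chip_index(records: list) -> dict:
    """Group humanized chip records under the poison they mention (S3.3)."""
    index = defaultdict(list)
    for record in records:
        index[record["poison_name"]].append(record)
    return dict(index)


def chip_data_for_hits(hits: list, index: dict, threshold: float = 0.5) -> dict:
    """Map each poison above the similarity threshold to its chip records (S3.5).

    An empty list marks a poison with no humanized chip study on file (S3.4).
    """
    return {name: index.get(name, []) for name, score in hits if score > threshold}


index = build_chip_index(records)
hits = [("poison_A", 0.8), ("poison_B", 0.3), ("poison_D", 0.6)]
report = chip_data_for_hits(hits, index)
print(report)
```

Keeping unstudied poisons in the result with an empty list preserves the classification of S3.4: a reader can see both which similar poisons have chip data and which do not.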
Further, the specific steps of creating the preliminary model in S4 are as follows:
S4.1, processing the compound data according to the structural features screened and extracted in S2, and extracting the processed result for each compound;
S4.2, carrying out regression analysis and normalization on the data to judge how strongly each structural feature influences compound toxicity, and eliminating extreme values that would have an outsized influence on the prediction result;
S4.3, establishing a preliminary compound toxicity prediction model, training it as a KNN model on the training set, and testing its accuracy on the test set.
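S4.2 calls for normalization without naming a method. One common choice, assumed here purely for illustration, is min-max scaling, which maps every feature column to the range [0, 1] so that a feature with a large numeric range (for example the number of C atoms) does not dominate the distance used by KNN. The function name and the sample rows are hypothetical.

```python
def min_max_normalize(rows: list) -> list:
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical rows: [C atoms, P atoms, double bonds] for three compounds.
raw = [[2, 0, 1], [6, 0, 3], [4, 0, 2]]
print(min_max_normalize(raw))
```

The middle column is constant, so it carries no information and is mapped to 0.0 rather than causing a division by zero.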
Further, the specific steps of establishing the toxicity prediction model in S5 are as follows:
S5.1, tuning and optimizing the parameters of the preliminary model with the downloaded data;
S5.2, the distance used by the KNN model is the Euclidean distance:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
wherein p is a data point in the model and q is the data point to be predicted;
S5.3, adjusting the n_neighbors and weights values of the model with the downloaded data to optimize the algorithm model;
S5.4, visually displaying the prediction result of the model.
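A minimal pure-Python sketch of KNN classification with Euclidean distance shows what the two tuned parameters do. The parameter names `n_neighbors` and `weights`, and the values "uniform" and "distance", mirror those of scikit-learn's `KNeighborsClassifier`; the document does not state which library is used, so that link is an assumption. The function names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def euclidean(p, q) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train_x, train_y, query, n_neighbors=3, weights="uniform"):
    """Classify `query` by a vote among its n_neighbors nearest training points."""
    nearest = sorted(
        (euclidean(x, query), label) for x, label in zip(train_x, train_y)
    )[:n_neighbors]
    votes = defaultdict(float)
    for distance, label in nearest:
        if weights == "distance":
            votes[label] += 1.0 / (distance + 1e-9)  # closer neighbours count more
        else:
            votes[label] += 1.0  # every neighbour counts equally
    return max(votes, key=votes.get)


# Hypothetical normalized features; label 1 = toxic, 0 = non-toxic.
train_x = [[0, 0], [3, 0], [0, 3]]
train_y = [1, 0, 0]
query = [0.5, 0.5]
print(knn_predict(train_x, train_y, query, n_neighbors=3, weights="uniform"))   # 0
print(knn_predict(train_x, train_y, query, n_neighbors=3, weights="distance"))  # 1
```

With uniform weights the two distant non-toxic points outvote the single nearby toxic point; with distance weights the nearby point wins. This is why the choice of `weights`, together with `n_neighbors`, changes the prediction and is worth tuning.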
By adopting the above technical solution, the invention has the following beneficial effects:
In the compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning, the molecular fingerprints of poison molecules are screened and extracted so that they reflect certain toxicological characteristics of those molecules. Applying molecular fingerprints to compound similarity calculation makes it possible to quickly screen out poisons structurally similar to a compound and so to predict its possible toxicity. In addition, the invention draws on humanized chip technology to provide humanized chip experimental data for the related poisons, which facilitates experimental study of those poisons. More importantly, the invention combines machine learning with substance structure: it analyzes the important structural features of poisons, establishes a machine learning model and trains it on the information in the poison database to predict whether an unknown compound is toxic, thereby addressing the large amounts of time and money that compound toxicity research currently consumes.
Drawings
FIG. 1 is a block diagram of a compound toxicity prediction method based on a humanized chip, molecular fingerprint and deep learning according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for predicting toxicity of a compound based on a humanized chip, molecular fingerprint and deep learning according to an embodiment of the present invention;
FIG. 3 is a flow chart of the humanized chip participating in toxicity prediction in the compound toxicity prediction method based on a humanized chip, molecular fingerprint and deep learning according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a machine learning prediction model of a compound toxicity prediction method based on a humanized chip, molecular fingerprint and deep learning according to an embodiment of the present invention.
Detailed Description
The present invention will be further illustrated with reference to the following specific examples, but the scope of the invention is not limited thereto.
Referring to FIG. 1, the block diagram of the compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning according to an embodiment of the present invention includes:
a data layer, mainly used for storing the data the system needs to operate, which mainly comprises a molecular fingerprint library, a poison library and a humanized chip library;
a business layer, mainly used for compound similarity comparison and compound toxicity prediction, which mainly comprises a compound similarity comparison unit and a compound toxicity prediction unit; the compound similarity comparison unit mainly covers calculating molecular fingerprints, screening candidate compounds and calculating candidate compound similarity; the compound toxicity prediction unit mainly covers extracting the key structures of the compound, constructing the toxicity prediction model and predicting the toxicity of the compound.
The molecular fingerprint library and the poison library are mainly used for compound similarity comparison; the poison library and the humanized chip library are mainly used for predicting the toxicity of the compound.
Referring to FIG. 2, the flow chart of the compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning according to an embodiment of the present invention includes:
a data preprocessing unit, used for collecting and analyzing compound data and extracting characteristic parameters;
a compound similarity comparison unit, used for comparing the compound with the poisons in the library and giving the similarity;
a humanized chip related data unit, used for providing humanized chip related experimental data for the poisons in the library;
a compound toxicity prediction unit, used for predicting, by machine learning, the toxicity the compound may have from the characteristic parameters extracted by the data preprocessing unit;
a data visualization unit, used for visualizing all the data to present the poisons similar to the compound, the related humanized chip data and the possible toxicity.
The method mainly comprises the following steps:
S1, running the data preprocessing unit to download and organize the relevant data.
In some embodiments, S1 collects the data and builds the poison humanized chip database, downloads the relevant raw data of poisons and compounds from PubChem, and screens, extracts and processes their characteristic parameters as follows:
S11, downloading from PubChem, by CAS Registry Number (CASRN), the characteristic parameters of the poisons listed in the inventory of the U.S. Toxic Substances Control Act (TSCA), mainly including poison name, SMILES, chemical safety and common toxicology information;
S12, downloading the characteristic parameters of non-toxic compounds by CASRN, mainly including compound name, SMILES and chemical safety;
S13, extracting the structural features of the poisons and the non-toxic compounds from their SMILES strings, mainly covering 9 aspects: the number of C atoms, the number of N atoms, the number of O atoms, the number of P atoms, the number of S atoms, the number of halogen atoms, the number of benzene rings, the number of double bonds and the number of triple bonds; non-toxic substances are encoded as 0 and poisons as 1 to simplify computation;
S14, converting all data into CSV files and cleaning them by removing records with missing or extreme values.
S2, running the compound similarity comparison unit to output the compound similarity.
In some embodiments, calculating the compound similarity in S2 comprises:
S21, storing the SMILES strings of the downloaded substances and computing their molecular fingerprints. The data processing is as follows: install the RDKit toolkit and convert the molecular structures of all poisons, given in SMILES format, into molecular fingerprints;
S22, inputting the SMILES string of the substance to be tested and computing its molecular fingerprint. The data processing is as follows: using the RDKit toolkit, convert the molecular structure of the compound to be tested, given in SMILES format, into a molecular fingerprint;
S23, computing the similarity between the compound to be tested and each downloaded substance with the Tanimoto similarity method in RDKit, and displaying the similarity;
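S24, the Tanimoto similarity is calculated as:
$$T(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2 - \sum_{i=1}^{n} a_i b_i}$$
wherein a_1, a_2, ..., a_n is the molecular fingerprint of the downloaded substance and b_1, b_2, ..., b_n is the molecular fingerprint of the compound to be tested.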
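A short worked example may make the Tanimoto calculation concrete. The two 8-bit fingerprints below are hypothetical and far shorter than real fingerprints; the calculation uses the standard Tanimoto coefficient for binary vectors, in which the squared sums equal the number of bits set in each fingerprint.

```latex
a = (1, 1, 0, 1, 0, 0, 1, 0), \qquad b = (1, 0, 0, 1, 1, 0, 1, 0)

\sum_{i=1}^{8} a_i b_i = 3, \qquad
\sum_{i=1}^{8} a_i^2 = 4, \qquad
\sum_{i=1}^{8} b_i^2 = 4

T(a, b) = \frac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i}
        = \frac{3}{4 + 4 - 3} = 0.6
```

Since 0.6 is greater than 0.5, a library substance with fingerprint a would pass the similarity screening applied to the compound with fingerprint b.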
S3, running the humanized chip related data unit to output the humanized chip related data.
Referring to FIG. 3, a flow chart of the humanized chip participating in toxicity prediction in the compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning is provided for an embodiment of the present invention. As shown in FIG. 3, in some embodiments, outputting the humanized chip related data in S3 includes:
S31, searching PubMed for humanized chip literature and downloading it, with the search keywords: Organ-on-a-Chip, organic chips, humanized chips, liver chips, lung chips, kidney chips, skin chips, brain chips, heart chips, intestine chips, blood vessel chips, tumor chips;
S32, extracting the important keywords from the humanized chip literature, mainly including poison name, target organ, main cells, experimental materials, experimental instruments, culture environment, on-chip environment, model type, statistical method and the like;
S33, constructing the humanized chip database, filing each humanized chip document under the poison named in it, and displaying the corresponding extracted keywords;
S34, linking to the poison database and classifying each poison by whether humanized chip research has been carried out on it. The humanized chip database is searched by poison name; if humanized chip research has been carried out on the poison, the corresponding humanized chip data are provided, so that researchers can carry out humanized chip research on substances structurally similar to that poison; at the same time, the humanized chip database and the poison database are combined to predict the toxicity of the compound to be predicted;
S35, screening out the poisons whose similarity calculated in S23 is greater than 0.5, displaying them in a list, and displaying the humanized chip related data on the detail page.
S4, running the compound toxicity prediction unit to output the toxicity prediction result.
Referring to FIG. 4, a schematic diagram of the machine learning prediction model of the compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning is provided for an embodiment of the present invention. As shown in FIG. 4, the machine learning prediction model mainly includes:
S41, a data collection module, used for collecting data and establishing a data set of toxic compounds;
S42, a model building module, which establishes the correlation between the structure of a compound and its toxicity from the structural features of the compound;
S43, a model training module, which trains the model established in S42 with the compound data downloaded in S1;
S44, a result output module, which outputs the result of the model trained in S43.
In S41, the data collection module classifies the compounds as toxic or non-toxic according to PubChem data and encodes toxic as 1 and non-toxic as 0 to simplify computation; the data are extracted according to the structural features listed in S13.
In S42, the model building module carries out regression analysis and normalization on the data to judge how strongly each structural feature influences compound toxicity, eliminates extreme values that would have an outsized influence on the prediction result, and divides the processed data into a training set, a validation set and a test set. In total, 81 non-toxic substances and 419 toxic substances were selected as the data set. The substances were sorted by CID and one substance out of every 10 was taken as the test set, with the rest forming the training set; the training set contains 450 substances and the test set 50. The test result shows that the prediction model has an accuracy of 92%.
In S43, the model training module takes the preliminary model established in S42 and feeds the data downloaded in S1 into it for parameter tuning and optimization.
In S43, the model training module mainly uses a KNN model to predict compound toxicity. The main steps are: converting the data into the form the algorithm requires; dividing the data set into a training set, a validation set and a test set; constructing the KNN model with the training set; testing the model parameters with the test set; and validating the model performance with the validation set. The distance used in the KNN algorithm is the Euclidean distance:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
wherein p is a data point in the model and q is the data point to be predicted.
In S43, the model training module feeds the training set into the model and uses cross-validation to avoid overfitting; the n_neighbors and weights values of the KNN model are adjusted according to the performance on the training and validation sets to optimize prediction accuracy.
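The train/test split described in S42 can be sketched as follows. The text does not say which of every 10 substances is taken, so taking the tenth is an assumption; the function name is hypothetical and the records are placeholders that only reproduce the quoted class sizes (81 non-toxic, 419 toxic), with made-up CIDs.

```python
def split_every_nth(records: list, step: int = 10) -> tuple:
    """Sort by CID, send every `step`-th record to the test set, the rest to training."""
    ordered = sorted(records, key=lambda record: record["cid"])
    test = ordered[step - 1::step]
    train = [r for i, r in enumerate(ordered) if (i + 1) % step != 0]
    return train, test


# Placeholder data set: CIDs 1..500 are made up, labels match the quoted counts.
records = [{"cid": cid, "toxic": 0 if cid <= 81 else 1} for cid in range(1, 501)]
train, test = split_every_nth(records)
print(len(train), len(test))  # 450 50
```

With 50 test substances, the reported accuracy of 92% corresponds to 46 correct predictions.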
From the above steps, the toxicity of a compound can be predicted from its structure. The main structural features are: the number of C atoms, the number of N atoms, the number of O atoms, the number of P atoms, the number of S atoms, the number of halogen atoms, the number of benzene rings, the number of double bonds and the number of triple bonds. Toxicity is predicted from the combined effect of these features.
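The cross-validated tuning of n_neighbors and weights mentioned in S43 can be sketched as a small grid search. Everything here is an assumption for illustration: the number of folds, the candidate parameter values, the interleaved fold assignment, the function names and the toy two-cluster data are not taken from the patent.

```python
import math
from itertools import product


def knn_predict(train, query, n_neighbors, weights):
    """train is a list of (features, label) pairs; returns the winning label."""
    nearest = sorted((math.dist(x, query), y) for x, y in train)[:n_neighbors]
    votes = {}
    for distance, label in nearest:
        weight = 1.0 / (distance + 1e-9) if weights == "distance" else 1.0
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


def cross_val_accuracy(data, n_neighbors, weights, folds=5):
    """Mean accuracy over the folds; fold i holds out items i, i + folds, ..."""
    scores = []
    for i in range(folds):
        held_out = data[i::folds]
        train = [item for j, item in enumerate(data) if j % folds != i]
        correct = sum(
            knn_predict(train, x, n_neighbors, weights) == y for x, y in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / folds


def grid_search(data, neighbor_values=(1, 3, 5), weight_values=("uniform", "distance")):
    """Return the (n_neighbors, weights) pair with the best cross-validated accuracy."""
    return max(
        product(neighbor_values, weight_values),
        key=lambda params: cross_val_accuracy(data, *params),
    )


# Toy data: two well-separated clusters, label 0 = non-toxic, 1 = toxic.
data = [((i * 0.1, 0.0), 0) for i in range(10)]
data += [((5.0 + i * 0.1, 5.0), 1) for i in range(10)]
best = grid_search(data)
print(best, cross_val_accuracy(data, *best))
```

Scoring each parameter pair only on held-out folds is what guards against overfitting: a setting that merely memorizes the training points gains nothing on data it has not seen.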
S5, running the data visualization unit to visualize each output result.
In some embodiments, the specific steps of the data visualization in S5 are:
S51, visualizing the similarity data obtained in S2;
S52, visualizing the humanized chip related data downloaded in S3;
S53, visualizing the toxicity prediction model established in S4.
The prediction model indicates whether the compound to be tested is likely to be toxic. The basic information of the downloaded poisons and compounds is then used to list the poisons or compounds structurally similar to the compound to be tested, which helps in judging its properties. At the same time, the humanized chip data of those structurally similar poisons or compounds are provided to support research on applying the compound to be tested in humanized chips.

Claims (5)

1. A compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning, characterized in that the method is performed by a compound toxicity prediction system based on a humanized chip, molecular fingerprints and deep learning, the system comprising:
a data preprocessing module, used for collecting and analyzing compound data and extracting characteristic parameters;
a compound similarity comparison module, used for comparing the compound with the poisons in the library and giving the similarity;
a humanized chip related data module, used for providing humanized chip related experimental data for the poisons in the library;
a compound toxicity prediction module, used for predicting, by machine learning, the toxicity a compound may have from the characteristic parameters extracted by the data preprocessing module;
and a data visualization module, used for visualizing all the data to present the poisons similar to the compound, the related humanized chip data and the possible toxicity;
the method comprises the following steps:
S1, downloading the relevant raw data of poisons and compounds from PubChem, including their SMILES strings, basic toxicity information and structures, and screening, extracting and processing their characteristic parameters;
S2, computing molecular fingerprints with the RDKit toolkit from the SMILES strings of the poisons and compounds downloaded in S1, and looping over the database to compute, with the Tanimoto similarity method in RDKit, the similarity between the compound to be tested and each downloaded poison or compound;
S3, searching PubMed for applications of the downloaded poisons and compounds in humanized chips, downloading the related literature information, and extracting important keywords;
S4, establishing a preliminary model based on the characteristic parameters from S1: carrying out regression analysis on the relation between the characteristic parameters of the compounds and their toxicity to judge how strongly each characteristic parameter influences toxicity; dividing the original data set into a training set, a validation set and a test set; and obtaining the toxicity prediction performance with a KNN model;
S5, establishing the toxicity prediction model from the preliminary model of S4, optimizing it and tuning its parameters with the data in the database, and predicting the toxicity of the compound to be tested;
the specific steps of extracting the humanized chip information in step S3 are as follows:
S3.1, searching PubMed for literature related to humanized chips and downloading it, with the following search keywords:
Organ-on-a-Chip, organic chips, organ chips, liver chips, lung chips, kidney chips, skin chips, brain chips, heart chips, intestine chips, blood vessel chips, tumor chips;
S3.2, extracting important keywords from the humanized chip literature, including toxicant name, target organ, main cells, experimental materials, experimental instruments, culture environment, on-chip environment, model type and statistical method;
S3.3, filing each humanized chip document in the database under the toxicant it names, and displaying the important humanized chip keywords extracted from it;
S3.4, classifying each toxicant by whether humanized chip research has been carried out on it and, if so, providing the corresponding humanized chip data, so that researchers can readily carry out humanized chip research on substances structurally similar to the toxicant;
S3.5, screening for toxicants whose calculated similarity to the compound is greater than 0.5, displaying them in a list, and visually displaying the related humanized chip data on the toxicant detail page.
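Steps S3.3 to S3.5 can be sketched as below. This is only an illustration under stated assumptions: the record fields (`toxicant`, `target_organ`, `model_type`), the function names and the sample values are hypothetical and do not come from the patent; only the 0.5 cutoff is taken from S3.5.

```python
from collections import defaultdict

SIMILARITY_THRESHOLD = 0.5  # cutoff named in S3.5


def index_chip_literature(documents):
    """Group chip-literature records by the toxicant they name (S3.3)."""
    by_toxicant = defaultdict(list)
    for doc in documents:
        by_toxicant[doc["toxicant"]].append(doc)
    return dict(by_toxicant)


def similar_toxicants(similarities, chip_index, threshold=SIMILARITY_THRESHOLD):
    """List toxicants strictly above the cutoff, most similar first (S3.4, S3.5)."""
    hits = []
    for name, score in similarities.items():
        if score > threshold:
            docs = chip_index.get(name, [])
            hits.append({
                "toxicant": name,
                "similarity": score,
                "has_chip_data": bool(docs),  # S3.4: has chip research been done?
                "chip_docs": docs,
            })
    return sorted(hits, key=lambda hit: hit["similarity"], reverse=True)


# Hypothetical, made-up records for illustration only.
documents = [
    {"toxicant": "toxicant A", "target_organ": "liver", "model_type": "liver chip"},
]
similarities = {"toxicant A": 0.82, "toxicant B": 0.61, "toxicant C": 0.34}

chip_index = index_chip_literature(documents)
for hit in similar_toxicants(similarities, chip_index):
    print(hit["toxicant"], hit["similarity"], hit["has_chip_data"])
```

Keeping the chip records in a per-toxicant index means a detail page for any listed toxicant can look up its chip data directly by name.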
2. The compound toxicity prediction method based on humanized chip, molecular fingerprint and deep learning according to claim 1, wherein the specific steps of S1 are as follows:
S1.1, downloading from PubChem, by CASRN and according to the toxic substances control law directory, the characteristic parameters of toxicants, including toxicant name, SMILES, chemical safety and general toxicology information;
S1.2, downloading by CASRN the characteristic parameters of non-toxic compounds, including name, SMILES and chemical safety;
S1.3, extracting structural features of the toxicants and non-toxic compounds from their SMILES, including the number of C atoms, the number of halogen atoms, the number of benzene rings, the number of double bonds, the number of triple bonds, the number of P atoms and the number of S atoms; encoding the toxicity of non-toxic substances as 0 and the toxicity of toxicants as 1 to facilitate computation;
S1.4, converting all data into CSV files, and cleaning missing and extreme values to remove unnecessary data.
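The feature counting and 0/1 toxicity encoding of S1.3 can be illustrated with the sketch below. It is a deliberate simplification, not the patented implementation: it tokenizes the SMILES string with regular expressions instead of using a cheminformatics toolkit, counts only explicitly written `=` and `#` bonds (aromatic rings written in lowercase carry no `=`), and omits the benzene ring count, which needs real ring perception. The function and key names are hypothetical.

```python
import re

# Two-letter halogens first, then bracket atoms, then one-letter atoms and bond symbols.
TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|[BCNOPSFI]|[bcnops]|=|#")
# Element symbol at the start of a bracket atom, after an optional isotope number.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[bcnops])")
HALOGENS = {"F", "Cl", "Br", "I"}


def structural_features(smiles):
    """Count a subset of the S1.3 structural features from a SMILES string."""
    counts = {"n_C": 0, "n_halogen": 0, "n_double": 0,
              "n_triple": 0, "n_P": 0, "n_S": 0}
    for token in TOKEN.findall(smiles):
        if token == "=":
            counts["n_double"] += 1
            continue
        if token == "#":
            counts["n_triple"] += 1
            continue
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            element = match.group(1) if match else ""
        else:
            element = token
        if element in ("C", "c"):
            counts["n_C"] += 1
        elif element in HALOGENS:
            counts["n_halogen"] += 1
        elif element in ("P", "p"):
            counts["n_P"] += 1
        elif element in ("S", "s"):
            counts["n_S"] += 1
    return counts


def encode_record(smiles, is_toxic):
    """Attach the toxicity code from S1.3: 1 for toxicants, 0 for non-toxic compounds."""
    record = structural_features(smiles)
    record["toxic"] = 1 if is_toxic else 0
    return record


print(encode_record("ClC(Cl)(Cl)Cl", True))   # carbon tetrachloride
print(encode_record("C=CC#N", True))          # acrylonitrile
```

Records in this shape can be written straight to a CSV file with `csv.DictWriter` for the cleaning step in S1.4.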
3. The compound toxicity prediction method based on humanized chip, molecular fingerprint and deep learning according to claim 2, wherein the specific steps of calculating, in S2, the similarity between the compound to be tested and the toxicant molecules in the database are as follows:
S2.1, storing the SMILES of the downloaded substances and computing their molecular fingerprint features;
computing the molecular fingerprint of the compound to be tested with the RDKit toolkit, computing the similarity between the compound to be tested and each downloaded substance with the Tanimoto similarity method in the RDKit toolkit, and displaying the similarity;
S2.2, the calculation formula of the Tanimoto similarity method is as follows:
wherein $a_1, a_2, \ldots, a_n$ is the molecular fingerprint of the downloaded substance, and $b_1, b_2, \ldots, b_n$ is the molecular fingerprint of the compound to be tested;
S2.3, visually displaying the toxicants or compounds similar to the compound to be tested.
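For reference, the Tanimoto coefficient in its standard vector form, which is assumed here to be what S2.2 refers to, is $T(a, b) = \frac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i}$; for binary fingerprints it reduces to the number of shared on-bits divided by the number of bits set in either fingerprint. The sketch below implements that formula in plain Python on made-up bit vectors. The function names are hypothetical, and returning 0.0 for two all-zero fingerprints is a convention chosen for this example. With RDKit fingerprint objects the corresponding call would be `DataStructs.TanimotoSimilarity`.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length numeric fingerprint vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Assumed convention: two all-zero fingerprints get similarity 0.0.
    return dot / denominator if denominator else 0.0


def rank_by_similarity(query_fp, library):
    """Compare the compound to be tested against every downloaded substance (S2.1)."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in library.items())
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up 8-bit fingerprints for illustration only.
library = {
    "substance A": [1, 1, 0, 1, 0, 0, 1, 0],
    "substance B": [0, 0, 1, 0, 1, 1, 0, 0],
}
query = [1, 1, 0, 1, 0, 0, 0, 0]

for name, score in rank_by_similarity(query, library):
    print(f"{name}: {score:.2f}")
```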
4. The compound toxicity prediction method based on humanized chip, molecular fingerprint and deep learning according to claim 3, wherein the specific steps of establishing the preliminary model in S4 are as follows:
S4.1, processing the data of the compounds according to the structural features screened and extracted in S2, and extracting the data processing result of each compound;
S4.2, performing regression analysis and normalization on the data, and determining how strongly each structural feature influences compound toxicity; eliminating extreme factors that would otherwise have a large influence on the prediction result;
S4.3, establishing a preliminary compound toxicity prediction model, training the model with the training set through a KNN model, and testing the accuracy of the model with the test set.
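The normalization in S4.2 and the training/validation/test division used by the preliminary model can be sketched as follows. The claim does not state which normalization or which split ratios are used, so min-max scaling and a 70/15/15 split are assumptions made for this example, as are the function names and the fixed random seed.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, low, high in zip(row, lows, highs)]
        for row in rows
    ]


def split_dataset(samples, train=0.7, validation=0.15, seed=0):
    """Shuffle reproducibly, then cut into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


# Made-up feature rows: (number of C atoms, number of halogen atoms).
features = [[1, 4], [3, 0], [5, 2]]
print(min_max_normalize(features))

train_set, validation_set, test_set = split_dataset(range(20))
print(len(train_set), len(validation_set), len(test_set))
```

Normalizing before a distance-based model such as KNN keeps a feature with a large numeric range, for example the carbon count, from dominating the distance.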
5. The compound toxicity prediction method based on humanized chip, molecular fingerprint and deep learning according to claim 4, wherein the specific steps of establishing the toxicity prediction model in S5 are as follows:
S5.1, tuning and optimizing the parameters of the preliminary model with the downloaded data;
S5.2, the expression of KNN is:
wherein p is a point value in the model, and q is the point value to be predicted;
S5.3, adjusting the model parameters with the downloaded data, adjusting the n_neighbors value and the weights value in the model, and optimizing the algorithm model;
S5.4, visually displaying the prediction result of the model.
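A minimal sketch of a KNN classifier with the two parameters named in S5.3 is given below. The distance expression of S5.2 is not reproduced in this text, so the Euclidean distance $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$ is assumed. The names `n_neighbors` and `weights` follow S5.3 (they are also the parameter names of scikit-learn's `KNeighborsClassifier`); the function names and sample data are hypothetical.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    """Assumed distance between a stored point p and the point q to be predicted."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train_points, train_labels, query, n_neighbors=3, weights="uniform"):
    """Predict the 0/1 toxicity label of `query` by voting among its nearest neighbours."""
    if weights not in ("uniform", "distance"):
        raise ValueError("weights must be 'uniform' or 'distance'")
    neighbours = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:n_neighbors]
    votes = defaultdict(float)
    for distance, label in neighbours:
        if weights == "uniform":
            votes[label] += 1.0
        else:
            # Closer neighbours count more; an exact match dominates the vote.
            votes[label] += 1.0 / distance if distance > 0 else float("inf")
    return max(votes, key=votes.get)


# Made-up one-dimensional data: one toxic point close by, two non-toxic further away.
points = [[0.0], [3.0], [3.5]]
labels = [1, 0, 0]

print(knn_predict(points, labels, [1.0], n_neighbors=3, weights="uniform"))
print(knn_predict(points, labels, [1.0], n_neighbors=3, weights="distance"))
```

The two calls disagree on this data, which is why S5.3 treats both `n_neighbors` and `weights` as values to tune against the validation set.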
CN202111134799.XA 2021-09-27 2021-09-27 Compound toxicity prediction system and method based on humanized chip, molecular fingerprint and deep learning Active CN113782110B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111134799.XA CN113782110B (en) 2021-09-27 2021-09-27 Compound toxicity prediction system and method based on humanized chip, molecular fingerprint and deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111134799.XA CN113782110B (en) 2021-09-27 2021-09-27 Compound toxicity prediction system and method based on humanized chip, molecular fingerprint and deep learning

Publications (2)

Publication Number Publication Date
CN113782110A CN113782110A (en) 2021-12-10
CN113782110B CN113782110B (en) 2024-02-13

Family

ID=78853672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111134799.XA Active CN113782110B (en) 2021-09-27 2021-09-27 Compound toxicity prediction system and method based on humanized chip, molecular fingerprint and deep learning

Country Status (1)

Country Link
CN (1) CN113782110B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192571A (en) * 2021-04-29 2021-07-30 Nanjing University of Posts and Telecommunications Small molecule drug hERG toxicity prediction method and device based on graph attention mechanism transfer learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090024547A1 (en) * 2007-07-17 2009-01-22 Ut-Battelle, Llc Multi-intelligent system for toxicogenomic applications (mista)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192571A (en) * 2021-04-29 2021-07-30 Nanjing University of Posts and Telecommunications Small molecule drug hERG toxicity prediction method and device based on graph attention mechanism transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
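Research on compound toxicity prediction based on a denoising autoencoder neural network; Li Hong; Yu Long; Tian Shengwei; Li Li; Wang Mei; Application Research of Computers; 2017-03-21 (No. 03); full text *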

Also Published As

Publication number Publication date
CN113782110A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
Peng et al. A novel feature selection approach for biomedical data classification
Yang Machine learning approaches to bioinformatics
CN108198621B (en) Database data comprehensive diagnosis and treatment decision method based on neural network
Qu et al. Data reduction using a discrete wavelet transform in discriminant analysis of very high dimensionality data
CN108335756B (en) Nasopharyngeal carcinoma database and comprehensive diagnosis and treatment decision method based on database
CN108206056B (en) Nasopharyngeal carcinoma artificial intelligence assists diagnosis and treatment decision-making terminal
JP2009505231A (en) System, method, and computer program for comparing and editing metabolite data obtained from a plurality of samples using a computer system database
CN115335912A (en) Relative synthetic feasibility of inverse synthesis
Hossain et al. Applying machine learning classifiers on ECG dataset for predicting heart disease
CN118312816A (en) Cluster weighted clustering integrated medical data processing method and system based on member selection
CN117476114B (en) Model construction method and system based on biological multi-group data
Samet et al. Predicting and staging chronic kidney disease using optimized random forest algorithm
Kumar et al. Integrating Diverse Omics Data Using Graph Convolutional Networks: Advancing Comprehensive Analysis and Classification in Colorectal Cancer
CN113782110B (en) Compound toxicity prediction system and method based on humanized chip, molecular fingerprint and deep learning
Santos et al. Enabling ubiquitous data mining in intensive care-features selection and data pre-processing
US20070072250A1 (en) Method and system for analysis of cancer biomarkers using proteome image mining
CN116864011A (en) Colorectal cancer molecular marker identification method and system based on multiple sets of chemical data
Kuzmanovski et al. Extensive evaluation of the generalized relevance network approach to inferring gene regulatory networks
CN114141316A (en) Method and system for predicting biological toxicity of organic matters based on spectrogram analysis
US20090006055A1 (en) Automated Reduction of Biomarkers
Kavitha et al. Predicting Breast Cancer Survivability Using Naïve Baysein Classifier And C4. 5 Algorithm
Rababa et al. Predicting Heart Disease and Reducing Survey Time Using Machine Learning Algorithms
Vilohit et al. Improvisation of Decision Tree Classification Performance in Breast Cancer Diagnosis using Elephant Herding Optimization
JP7350112B2 (en) Cancer diagnostic device and cancer diagnostic method using liquid biopsy data
CN117789828B (en) Anti-aging target spot detection system based on single-cell sequencing and deep learning technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant