CN113782110B

CN113782110B - Compound toxicity prediction system and method based on humanized chip, molecular fingerprint and deep learning

Info

Publication number: CN113782110B
Application number: CN202111134799.XA
Authority: CN
Inventors: 李健; 梁伟成
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2024-02-13
Anticipated expiration: 2041-09-27
Also published as: CN113782110A

Abstract

The invention provides a compound toxicity prediction system and method based on a humanized chip, molecular fingerprints and deep learning. The system comprises: a data preprocessing unit, which downloads and organizes the relevant data and extracts characteristic parameters; a compound similarity comparison unit, which compares the compound to be tested with the toxicants in the library, screens out the toxicants whose similarity is greater than 0.5, displays them, and outputs the similarity; a humanized chip related data unit, which outputs the humanized chip data related to each toxicant; a compound toxicity prediction unit, which predicts from the structure of the compound whether it is toxic; and a data visualization unit, which displays all of the data visually. Based on humanized chip, molecular fingerprint and deep learning technology, the invention can predict the toxicity of a compound from its structure.

Description

Compound toxicity prediction system and method based on humanized chip, molecular fingerprint and deep learning

Technical Field

The invention relates to a compound toxicity prediction system and method based on a humanized chip, molecular fingerprints and deep learning, and belongs to the technical field of bioinformatics research on compound properties.

Background

Molecular fingerprinting is a technique that converts the structural formula of a molecule into a binary fingerprint sequence by detecting the presence or absence of specific substructures in the molecule. Molecular fingerprints therefore make the structure of a compound readable by a computer, which enables structure retrieval and property prediction for compounds.

Humanized chip (organ-on-a-chip) technology is a microfluidic microphysiological system that uses microfluidics to perform high-resolution, real-time imaging and analysis of the structure and function of living human cells at the level of in vitro tissues and organs. It can be used to construct, in vitro, organ and tissue structures that approximate physiological function. Compared with traditional wet-lab methods such as clinical trials, it offers advantages such as a short experimental cycle and low cost.

At present, traditional methods such as clinical testing have long test cycles and high costs, so predicting the toxicity of a compound requires a great deal of time and money. As the number of newly discovered compounds has grown in recent years and the pace of discovery has accelerated, traditional toxicity prediction methods can no longer meet the current demand for toxicant prediction.

Disclosure of Invention

Technical problems: the invention aims to realize toxicity prediction of unknown compounds through molecular fingerprint technology and deep learning method.

Technical solution: to solve the above technical problem, the invention adopts the following technical solution.

A compound toxicity prediction system based on a humanized chip, molecular fingerprints and deep learning comprises:

a data preprocessing module, used for collecting and analyzing compound data and extracting characteristic parameters;

a compound similarity comparison module, used for comparing the compound to be tested with the toxicants in the library and reporting the similarity;

a humanized chip related data module, used for providing humanized chip experimental data for the toxicants in the library;

a compound toxicity prediction module, used for applying machine learning to the characteristic parameters extracted by the data preprocessing module and predicting the toxicity that the compound may have; and

a data visualization module, used for visualizing all of the data, so as to present the toxicants similar to the compound, the related humanized chip data and the possible toxicity.

The compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning comprises the following steps:

S1, downloading the raw data on toxicants and compounds from PubChem, including the SMILES strings, basic toxicity information and structures of the toxicants and compounds, and screening, extracting and processing their characteristic parameters;

S2, calculating molecular fingerprints with the RDKit toolkit from the SMILES strings of the toxicants and compounds downloaded in S1, and iterating over the database to calculate, with the Tanimoto similarity method in the RDKit toolkit, the similarity between the compound to be tested and each downloaded toxicant or compound molecule;

S3, searching PubMed for applications of the downloaded toxicants and compounds in humanized chips, downloading the related literature information, and extracting important keywords;

S4, establishing a preliminary model based on the characteristic parameters from S1, performing regression analysis on the relationship between the characteristic parameters of the compounds and their toxicity, and determining how strongly each characteristic parameter influences compound toxicity; dividing the original data set into a training set, a validation set and a test set, and obtaining the toxicity prediction performance with a KNN model;

S5, establishing a toxicity prediction model from the preliminary model established in S4, optimizing the prediction model and tuning its parameters with the data in the database, and predicting the toxicity of the compound to be tested.

Further, the specific steps of S1 are as follows:

S1.1, according to the inventory of the U.S. Toxic Substances Control Act (TSCA), downloading the characteristic parameters of toxicants from PubChem by CAS Registry Number (CASRN), the characteristic parameters comprising toxicant name, SMILES, chemical safety information and common toxicology information;

S1.2, downloading the characteristic parameters of non-toxic compounds by CASRN, the characteristic parameters comprising compound name, SMILES and chemical safety information;

S1.3, extracting the structural features of the toxicants and the non-toxic compounds from their SMILES strings, the structural features comprising the number of C atoms, the number of halogen atoms, the number of benzene rings, the number of double bonds, the number of triple bonds, the number of P atoms and the number of S atoms; non-toxic substances are assigned a toxicity code of 0 and toxicants a toxicity code of 1, to facilitate computation;

S1.4, converting all data into CSV files, and cleaning missing and extreme values to remove unnecessary data.
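The feature extraction in S1.3 can be illustrated with a minimal sketch. It assumes plain-Python token counting over a SMILES string rather than whatever toolkit the method actually uses; the function name `count_features` and the output keys are hypothetical. The sketch counts atoms and explicitly written double and triple bonds only. It does not count benzene rings, and bonds inside aromatic rings written in lowercase are not reported as double bonds; ring perception of that kind is better left to a cheminformatics toolkit.

```python
import re
from collections import Counter

# Tokens of interest: bracket atoms, two-letter halogens, organic-subset atoms
# (upper case = aliphatic, lower case = aromatic), and explicit bond symbols.
_TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]|=|#")
# Element symbol inside a bracket atom, after an optional isotope number.
_BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")

HALOGENS = {"F", "Cl", "Br", "I"}
COUNTED_ATOMS = {"C", "P", "S"}
KEYS = ("C", "P", "S", "halogen", "double_bonds", "triple_bonds")


def count_features(smiles):
    """Count a subset of the S1.3 structural features from a SMILES string."""
    counts = Counter()
    for token in _TOKEN.findall(smiles):
        if token == "=":
            counts["double_bonds"] += 1
        elif token == "#":
            counts["triple_bonds"] += 1
        else:
            if token.startswith("["):
                match = _BRACKET_ELEMENT.match(token)
                element = match.group(1) if match else ""
            else:
                element = token
            element = element.capitalize()  # aromatic 'c' counts as carbon
            if element in HALOGENS:
                counts["halogen"] += 1
            elif element in COUNTED_ATOMS:
                counts[element] += 1
    return {key: counts[key] for key in KEYS}


vinyl_chloride = count_features("C=CCl")
```

Each resulting dictionary, together with the 0/1 toxicity code, would form one row of the CSV file described in S1.4.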

Further, the specific steps for calculating the similarity between the compound to be tested and the toxic molecules in the database in S2 are as follows:

S2.1, storing the SMILES strings of the downloaded substances and calculating their molecular fingerprint features;

calculating the molecular fingerprint of the compound to be tested with the RDKit toolkit, calculating the similarity between the compound to be tested and each downloaded substance with the Tanimoto similarity method in the RDKit toolkit, and displaying the similarity;

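S2.2, the calculation formula of the Tanimoto similarity method is:

$$T(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2 - \sum_{i=1}^{n} a_i b_i}$$

where $a = (a_1, a_2, \ldots, a_n)$ is the molecular fingerprint of a downloaded substance and $b = (b_1, b_2, \ldots, b_n)$ is the molecular fingerprint of the compound to be tested;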

S2.3, displaying the toxicants or compounds similar to the compound to be tested in a visual form.
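The similarity loop of S2 can be sketched as follows. This is a minimal illustration, assuming each binary fingerprint is represented as the set of its "on" bit positions; the names `tanimoto` and `rank_library` and the toy fingerprints are hypothetical. For binary fingerprints the Tanimoto coefficient equals the number of shared on-bits divided by the number of bits set in either fingerprint, which is the same measure RDKit provides as `DataStructs.TanimotoSimilarity`.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two binary fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    # Two all-zero fingerprints share nothing measurable; report 0.0 by convention here.
    return len(a & b) / union if union else 0.0


def rank_library(query_fp, library):
    """Score the query against every library entry, most similar first."""
    scores = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy fingerprints (hypothetical bit positions, not real molecules).
library = {
    "toxicant_a": {1, 2, 3, 4},
    "toxicant_b": {2, 3, 9},
    "toxicant_c": {7, 8},
}
ranking = rank_library({1, 2, 3}, library)
```

Here `ranking` lists `toxicant_a` first with a score of 0.75, followed by `toxicant_b` at 0.5 and `toxicant_c` at 0.0.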

Further, the specific steps of extracting the related information of the humanized chip in the step S3 are as follows:

S3.1, searching PubMed for and downloading literature related to humanized chips, the search keywords being: Organ-on-a-Chip, organ chips, humanized chips, liver chips, lung chips, kidney chips, skin chips, brain chips, heart chips, intestine chips, blood vessel chips, tumor chips;

S3.2, extracting important keywords from the humanized chip literature, including toxicant names, target organs, main cells, experimental materials, experimental instruments, culture environments, on-chip environments, model types and statistical methods;

S3.3, filing each humanized chip document in the database under the toxicants it names, and displaying the important humanized chip keywords extracted from it;

S3.4, classifying each toxicant according to whether humanized chip research has been carried out on it, and, if so, providing the corresponding humanized chip data, so that researchers can conveniently carry out humanized chip research on substances structurally similar to that toxicant;

S3.5, screening out, according to the calculated compound similarity, the toxicants whose similarity is greater than 0.5, displaying them in a list, and displaying the related humanized chip data visually on each toxicant's detail page.
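The screening in S3.5 amounts to a threshold filter followed by a join against the chip data. The sketch below is one hedged way to express it; `screen_similar`, the record fields and the sample values are hypothetical, and only the 0.5 cutoff comes from the text.

```python
def screen_similar(similarities, chip_records, threshold=0.5):
    """Keep toxicants whose similarity exceeds the threshold and attach their chip data."""
    hits = []
    for name, score in similarities.items():
        if score > threshold:  # strictly greater than 0.5, as stated in S3.5
            hits.append({
                "name": name,
                "similarity": score,
                # Toxicants without chip studies get an empty list.
                "chip_data": chip_records.get(name, []),
            })
    hits.sort(key=lambda hit: hit["similarity"], reverse=True)
    return hits


# Hypothetical inputs: similarity scores from S2 and keyword records from S3.2.
similarities = {"toxicant_a": 0.82, "toxicant_b": 0.5, "toxicant_c": 0.61}
chip_records = {"toxicant_c": [{"target_organ": "liver", "model_type": "liver chip"}]}
hits = screen_similar(similarities, chip_records)
```

A toxicant scoring exactly 0.5 is excluded in this sketch because the text says "greater than 0.5".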

Further, the specific steps of creating the preliminary model in S4 are as follows:

S4.1, processing the compound data according to the structural features of the compounds screened and extracted in S2, and extracting the processing result for each compound;

S4.2, performing regression analysis and normalization on the data, and determining how strongly each structural feature influences compound toxicity; extreme factors that would unduly influence the prediction result are eliminated;

S4.3, establishing a preliminary compound toxicity prediction model, training it as a KNN model on the training set, and testing its accuracy on the test set.
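Normalization in S4.2 matters for a distance-based model such as KNN, because a feature with a large numeric range (for example the number of C atoms) would otherwise dominate the distance. The source does not say which normalization is used; the sketch below assumes min-max scaling to [0, 1], and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, low, high in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical feature rows: [number of C atoms, number of triple bonds].
features = [[2, 0], [4, 0], [6, 0]]
normalized = min_max_normalize(features)
```

In practice the minimum and maximum would be computed on the training set only and then reused for the compound to be tested, so that the test data do not leak into the scaling.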

Further, the specific steps of establishing the toxicity prediction model in S5 are as follows:

s5.1, performing parameter adjustment optimization on the preliminary model by using the downloaded data;

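S5.2, the distance expression used by the KNN model is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

where $p = (p_1, \ldots, p_n)$ is a point in the model and $q = (q_1, \ldots, q_n)$ is the point to be predicted;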

s5.3, adjusting model parameters by using the downloaded data, adjusting an n_neighbors value and a weights value in the model, and optimizing an algorithm model;

and S5.4, visually displaying the prediction result of the model.
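The two parameters tuned in S5.3 can be made concrete with a small pure-Python KNN classifier using Euclidean distance. This is a sketch under assumptions: the function name `knn_predict` and the toy data are hypothetical, and the parameter names `n_neighbors` and `weights` merely match those of scikit-learn's `KNeighborsClassifier`; the source does not name the library it uses.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, n_neighbors=3, weights="uniform"):
    """Classify `query` by a vote among its nearest training points."""
    # Euclidean distance from the query to every training point, nearest first.
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:n_neighbors]
    votes = defaultdict(float)
    for distance, label in neighbours:
        if weights == "uniform":
            votes[label] += 1.0             # every neighbour counts equally
        elif weights == "distance":
            if distance == 0.0:
                return label                # an exact match decides the vote
            votes[label] += 1.0 / distance  # closer neighbours count more
        else:
            raise ValueError("weights must be 'uniform' or 'distance'")
    return max(votes, key=votes.get)


# Toy 1-D data: label 1 = toxic, 0 = non-toxic (hypothetical values).
train_X = [(0.0,), (3.0,), (4.0,)]
train_y = [1, 0, 0]
uniform_label = knn_predict(train_X, train_y, (1.0,), n_neighbors=3, weights="uniform")
distance_label = knn_predict(train_X, train_y, (1.0,), n_neighbors=3, weights="distance")
```

With uniform weights the two non-toxic neighbours outvote the single toxic one, while with distance weights the much closer toxic neighbour wins (weight 1.0 against 1/2 + 1/3), which shows why both settings are worth tuning.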

The invention adopts the technical scheme and has the following beneficial effects:

In the compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning, the molecular fingerprints of toxicant molecules are screened and extracted, which effectively reflects certain toxicological characteristics of those molecules. Applying molecular fingerprint technology to compound similarity calculation makes it possible to quickly screen out toxicants that are structurally similar to a compound and thereby predict the toxicity the compound may have. In addition, the invention incorporates humanized chip technology and provides humanized chip experimental data for the related toxicants, which facilitates experimental research on those toxicants. More importantly, the invention combines machine learning with substance structure: it analyzes the important structures of toxicants by machine learning, establishes a machine learning model, and trains the model with the information resources in the toxicant database, so as to predict whether an unknown compound is toxic. This addresses the current situation in which compound toxicity research consumes a great deal of time and money.

Drawings

FIG. 1 is a block diagram of a compound toxicity prediction method based on a humanized chip, molecular fingerprint and deep learning according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for predicting toxicity of a compound based on a humanized chip, molecular fingerprint and deep learning according to an embodiment of the present invention;

FIG. 3 is a flowchart of the participation of the humanized chip in toxicity prediction in the compound toxicity prediction method based on a humanized chip, molecular fingerprint and deep learning according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a machine learning prediction model of a compound toxicity prediction method based on a humanized chip, molecular fingerprint and deep learning according to an embodiment of the present invention.

Detailed Description

The present invention will be further illustrated with reference to the following specific examples, but the scope of the invention is not limited thereto.

Referring to fig. 1, a block diagram of a method for predicting toxicity of a compound based on a humanized chip, molecular fingerprint and deep learning according to an embodiment of the present invention includes:

the data layer is mainly used for storing data required by normal operation of the system and mainly comprises a molecular fingerprint library, a poison library and a humanized chip library;

the business layer is mainly used for carrying out compound similarity comparison and compound toxicity prediction and mainly comprises a compound similarity comparison unit and a compound toxicity prediction unit; the compound similarity comparison unit mainly comprises calculation of molecular fingerprints, screening of candidate compounds and calculation of candidate compound similarity; the compound toxicity prediction unit mainly comprises the steps of extracting key structures of the compound, constructing a toxicity prediction model, and predicting the toxicity of the compound;

the molecular fingerprint library and the poison library are mainly used for compound similarity comparison; the poison library and the humanized chip library are mainly used for predicting the toxicity of the compound;

referring to fig. 2, a flowchart of a method for predicting toxicity of a compound based on a humanized chip, molecular fingerprint and deep learning is provided in an embodiment of the present invention, including:

a data preprocessing unit, used for collecting and analyzing compound data and extracting characteristic parameters;

a compound similarity comparison unit, used for comparing the compound to be tested with the toxicants in the library and reporting the similarity;

a humanized chip related data unit, used for providing humanized chip experimental data for the toxicants in the library;

a compound toxicity prediction unit, used for applying machine learning to the characteristic parameters extracted by the data preprocessing unit and predicting the toxicity that the compound may have;

a data visualization unit, used for visualizing all of the data, so as to present the toxicants similar to the compound, the related humanized chip data and the possible toxicity;

the method mainly comprises the following steps:

s1, operating a data preprocessing unit, and downloading and arranging relevant data.

In some embodiments, S1 collects data and builds the toxicant humanized chip database: the raw data on toxicants and compounds are downloaded from PubChem, and their characteristic parameters are screened, extracted and processed as follows:

S11, according to the inventory of the U.S. Toxic Substances Control Act (TSCA), downloading the characteristic parameters of toxicants from PubChem by CAS Registry Number (CASRN), the characteristic parameters mainly comprising toxicant name, SMILES, chemical safety information and common toxicology information;

S12, downloading the characteristic parameters of non-toxic compounds by CASRN, the characteristic parameters mainly comprising compound name, SMILES and chemical safety information;

S13, extracting the structural features of the toxicants and the non-toxic compounds from their SMILES strings, the structural features mainly comprising 9 items: the number of C atoms, the number of N atoms, the number of O atoms, the number of P atoms, the number of S atoms, the number of halogen atoms, the number of benzene rings, the number of double bonds and the number of triple bonds; non-toxic substances are assigned a toxicity code of 0 and toxicants a toxicity code of 1, to facilitate computation;

S14, converting all the data into CSV files, and cleaning missing and extreme values to remove unnecessary data.
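The CSV cleaning step can be sketched with the standard library alone. The source does not define "extreme data", so this sketch assumes a simple fixed upper cutoff passed in as `max_value`; the function name `clean_rows`, the column names and the sample rows are all hypothetical placeholders, not data from the patent.

```python
import csv
import io


def clean_rows(csv_text, numeric_fields, max_value):
    """Drop rows with missing fields or with numeric values outside [0, max_value]."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # DictReader stores surplus cells under the key None and pads short rows
        # with None values; both indicate a malformed record.
        if None in row or any(value is None or not value.strip() for value in row.values()):
            continue
        try:
            numbers = [float(row[field]) for field in numeric_fields]
        except ValueError:
            continue  # a non-numeric entry in a numeric column
        if any(number < 0 or number > max_value for number in numbers):
            continue  # treated as an extreme value under the assumed cutoff
        kept.append(row)
    return kept


# Placeholder rows: name, SMILES, number of C atoms, toxicity code.
sample = """name,smiles,n_C,toxic
compound_a,c1ccccc1,6,1
compound_b,,3,0
compound_c,C,9999,0
compound_d,CCO,2,0
"""
cleaned = clean_rows(sample, numeric_fields=["n_C"], max_value=100)
```

Only `compound_a` and `compound_d` survive: `compound_b` has a missing SMILES field and `compound_c` exceeds the assumed cutoff.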

And S2, operating a compound similarity comparison unit, and outputting compound similarity.

In some embodiments, S2 calculating the compound similarity comprises:

S21, storing the SMILES strings of the downloaded substances and calculating their molecular fingerprint features. The data processing is as follows: the RDKit toolkit is installed, and the SMILES-format structural formulas of all toxicants are converted into molecular fingerprints;

S22, inputting the SMILES string of the substance to be tested and calculating its molecular fingerprint features. The data processing is as follows: the RDKit toolkit is installed, and the SMILES-format structural formula of the compound to be tested is converted into a molecular fingerprint;

S23, calculating the similarity between the compound to be tested and each downloaded substance with the Tanimoto similarity method in the RDKit toolkit, and displaying the similarity;

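S24, the calculation formula of the Tanimoto similarity method is:

$$T(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2 - \sum_{i=1}^{n} a_i b_i}$$

where $a = (a_1, a_2, \ldots, a_n)$ is the molecular fingerprint of a downloaded substance and $b = (b_1, b_2, \ldots, b_n)$ is the molecular fingerprint of the compound to be tested.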
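As a worked illustration of S24, the following derivation assumes the common vector form of the Tanimoto coefficient and shows how it simplifies for binary fingerprints. The two 5-bit fingerprints are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
% Vector form of the Tanimoto coefficient
T(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}
               {\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2 - \sum_{i=1}^{n} a_i b_i}

% For binary fingerprints, a_i, b_i \in \{0, 1\}, so a_i^2 = a_i and b_i^2 = b_i.
% Writing |A| = \sum_i a_i, |B| = \sum_i b_i and |A \cap B| = \sum_i a_i b_i:
T(a, b) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}

% Worked example with a = (1, 1, 0, 1, 0) and b = (1, 0, 0, 1, 1):
% \sum_i a_i b_i = 2, \quad \sum_i a_i^2 = 3, \quad \sum_i b_i^2 = 3
T(a, b) = \frac{2}{3 + 3 - 2} = \frac{2}{4} = 0.5
```

A value of 0.5 sits exactly on the screening threshold used in S35, so this pair would not be reported as similar under a strict "greater than 0.5" rule.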

S3, operating the related data unit of the humanized chip and outputting related data of the humanized chip.

Referring to fig. 3, which is a flowchart of the participation of the humanized chip in toxicity prediction in the compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning according to an embodiment of the present invention, in some embodiments outputting the humanized chip related data in S3 includes:

S31, searching PubMed for and downloading literature related to humanized chips, the search keywords being: Organ-on-a-Chip, organ chips, humanized chips, liver chips, lung chips, kidney chips, skin chips, brain chips, heart chips, intestine chips, blood vessel chips, tumor chips;

S32, extracting important keywords from the humanized chip literature, mainly including toxicant names, target organs, main cells, experimental materials, experimental instruments, culture environments, on-chip environments, model types, statistical methods and the like;

S33, constructing a humanized chip database, filing each humanized chip document in the database under the toxicants it names, and displaying the important humanized chip keywords extracted from it;

S34, linking to the toxicant database and classifying each toxicant according to whether humanized chip research has been carried out on it. The humanized chip database is searched by toxicant name, and if humanized chip research has been carried out on the toxicant, the corresponding humanized chip data are provided, so that researchers can conveniently carry out humanized chip research on substances structurally similar to that toxicant; at the same time, the toxicity of the compound to be predicted is predicted by combining the humanized chip database and the toxicant database.

S35, screening out, according to the compound similarity calculated in S23, the toxicants whose similarity is greater than 0.5, displaying them in a list, and displaying the related humanized chip data on the detail page.
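Steps S33 and S34 describe filing each document under the toxicants it names and then looking records up by toxicant name. A minimal sketch of such an index follows; the functions `build_chip_index` and `lookup`, the record schema and the sample records are hypothetical, with field names loosely following the keyword types listed in S32.

```python
from collections import defaultdict


def build_chip_index(literature_records):
    """File each literature record under every toxicant it names (case-insensitive)."""
    index = defaultdict(list)
    for record in literature_records:
        for name in record["toxicants"]:
            index[name.lower()].append(record)
    return index


def lookup(index, toxicant_name):
    """Report whether a toxicant has chip studies and return the matching records."""
    records = index.get(toxicant_name.lower(), [])
    return {"studied_on_chip": bool(records), "records": records}


# Placeholder literature records, not taken from any real publication.
records = [
    {"title": "study 1", "toxicants": ["Toxicant A"],
     "target_organ": "liver", "model_type": "liver chip"},
    {"title": "study 2", "toxicants": ["Toxicant A", "Toxicant B"],
     "target_organ": "lung", "model_type": "lung chip"},
]
chip_index = build_chip_index(records)
result_a = lookup(chip_index, "toxicant a")
result_c = lookup(chip_index, "Toxicant C")
```

The `studied_on_chip` flag corresponds to the classification in S34, and the `records` list is what a detail page in S35 would display.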

And S4, operating a compound toxicity prediction unit, and outputting a toxicity prediction result.

Referring to fig. 4, which is a schematic diagram of the machine learning prediction model of the compound toxicity prediction method based on a humanized chip, molecular fingerprints and deep learning according to an embodiment of the present invention, the machine learning prediction model mainly includes:

s41, a data collection module, which is used for collecting and establishing a data set related to toxic compounds;

s42, a model building module builds the correlation between the structure of the compound and the toxicity of the compound according to the structural characteristics of the compound;

s43, training the model established in the S42 by using the compound data downloaded in the S1;

s44, outputting a result trained by the model in S43 by the result output module.

In S41, the data collection module classifies the compounds as toxic or non-toxic according to the PubChem data, coding toxic as 1 and non-toxic as 0 to facilitate computation, and extracts the data according to the structural features listed in S13.

In S42, the model building module performs regression analysis and normalization on the data and determines how strongly each structural feature influences compound toxicity; extreme factors that would unduly influence the prediction result are eliminated; the processed data are divided into a training set, a validation set and a test set. In total, 81 non-toxic substances and 419 toxic substances were selected as the data set. The substances in the data set were sorted by CID, and one substance out of every 10 was extracted for the test set, the remaining substances forming the training set. The training set contains 450 substances and the test set contains 50 substances; the test result shows that the accuracy of the prediction model is 92%.
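The split described in S42 (sort by CID, take one substance out of every 10 for testing) can be sketched as below. The text does not say which of each group of 10 is taken, so this sketch assumes the 10th; the function name `systematic_split` and the placeholder records are hypothetical.

```python
def systematic_split(records, step=10, key=lambda record: record["cid"]):
    """Sort by CID, then send every `step`-th record to the test set."""
    ordered = sorted(records, key=key)
    test = ordered[step - 1::step]  # the 10th, 20th, 30th, ... record
    train = [record for position, record in enumerate(ordered, start=1)
             if position % step != 0]
    return train, test


# 500 placeholder records standing in for the 81 non-toxic and 419 toxic substances.
records = [{"cid": cid} for cid in range(500, 0, -1)]
train, test = systematic_split(records)
```

With 500 substances this yields 450 training and 50 test records, matching the counts in the text; an accuracy of 92% on 50 test substances corresponds to 46 correct predictions.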

In S43, the model training module takes the preliminary model established in S42 and feeds the data downloaded in S1 into it for parameter tuning and optimization.

The model training module of S43 mainly uses a KNN model to predict compound toxicity. The main steps are: converting the data into a form that meets the requirements of the algorithm; dividing the data set into a training set, a validation set and a test set; constructing a KNN model with the training set; testing the model parameters with the test set; and validating the model performance with the validation set. The distance measure used in the KNN algorithm is the Euclidean distance:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

where $p = (p_1, \ldots, p_n)$ is a point in the model and $q = (q_1, \ldots, q_n)$ is the point to be predicted.

In S43, the model training module inputs the training set data into the model and uses cross-validation to avoid overfitting; the n_neighbors and weights values of the KNN model are adjusted according to the performance on the training and validation sets, so as to optimize the prediction accuracy of the model.
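The cross-validated tuning described here can be sketched generically. The source gives neither the number of folds nor the parameter grid, so both are assumptions in this sketch (5 folds, a small grid). All names (`k_fold_indices`, `cross_val_accuracy`, `grid_search`, `nearest_vote`) are hypothetical, and `nearest_vote` is only a one-dimensional stand-in for the real KNN model.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def cross_val_accuracy(predict_fn, X, y, k, **params):
    """Mean validation accuracy of predict_fn(train_X, train_y, query, **params)."""
    scores = []
    for train, validation in k_fold_indices(len(X), k):
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(predict_fn(train_X, train_y, X[i], **params) == y[i]
                      for i in validation)
        scores.append(correct / len(validation))
    return sum(scores) / len(scores)


def grid_search(predict_fn, X, y, grid, k=5):
    """Return (best_accuracy, best_params) over all parameter combinations."""
    best = None
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        accuracy = cross_val_accuracy(predict_fn, X, y, k, **params)
        if best is None or accuracy > best[0]:
            best = (accuracy, params)
    return best


def nearest_vote(train_X, train_y, query, n_neighbors):
    """Stand-in predictor: majority label among the nearest 1-D training values."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: abs(pair[0] - query))
    labels = [label for _, label in nearest[:n_neighbors]]
    return max(set(labels), key=labels.count)


# Toy 1-D data: two well separated classes (hypothetical values).
X = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
best_accuracy, best_params = grid_search(nearest_vote, X, y, {"n_neighbors": [1, 3, 7]})
```

The folds here are contiguous and unshuffled to keep the sketch deterministic; on real data the records would normally be shuffled first. On the toy data, small neighbourhoods classify every held-out point correctly, while `n_neighbors=7` pulls in points from the other class and scores lower, which is the kind of trade-off the tuning step is meant to detect.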

from the above steps, the toxicity of the compound can be predicted according to the structure of the compound, and the main structure comprises: the number of C atoms, the number of N atoms, the number of O atoms, the number of P atoms, the number of S atoms, the number of halogen atoms, the number of benzene rings, the number of double bonds, the number of triple bonds. Prediction of toxicity can be achieved by combining the effects of these important structures.

And S5, operating a data visualization unit to visualize each output result.

In some embodiments, the specific steps of the S5 data visualization are:

s51, visualizing the similarity test data established in the S2;

s52, visualizing the related data of the downloaded humanized chip in the S3;

s53, visualizing the toxicity prediction model established in the S4.

The prediction model indicates whether the compound to be tested is likely to be toxic. Combined with the basic information on the downloaded toxicants and compounds, the system then lists the toxicants or compounds that are structurally similar to the compound to be tested, which helps in judging its properties. At the same time, humanized chip data for those structurally similar toxicants or compounds are provided, to support research on applying the compound to be tested in humanized chips.

Claims

1. A compound toxicity prediction method based on a humanized chip, molecular fingerprint and deep learning, characterized in that the method is performed by a compound toxicity prediction system based on a humanized chip, molecular fingerprint and deep learning, the system comprising a data preprocessing module: the method is used for collecting and analyzing the compound data and extracting characteristic parameters;

and a data visualization module: the method is used for carrying out visual treatment on all data to obtain the compound similar toxicant, related humanized chip data and possible toxicity;

the method comprises the following steps:

s5, establishing a toxicity prediction model according to the preliminary model established in the S4, optimizing the prediction model by utilizing data in a database, and adjusting parameters to realize the prediction of the toxicity of the compound to be tested;

the specific steps of extracting the related information of the humanized chip in the step S3 are as follows:

S3.1, searching PubMed for and downloading literature related to humanized chips, the search keywords being:

Organ-on-a-Chip, organ chips, humanized chips, liver chips, lung chips, kidney chips, skin chips, brain chips, heart chips, intestine chips, blood vessel chips, tumor chips;

2. The method for predicting the toxicity of a compound based on a humanized chip, molecular fingerprint and deep learning according to claim 1, wherein the specific steps of S1 are as follows:

S1.1, according to the inventory of the Toxic Substances Control Act, downloading the characteristic parameters of toxicants from PubChem by CAS Registry Number (CASRN), the characteristic parameters including toxicant name, SMILES, chemical safety information and common toxicology information;

3. The method for predicting toxicity of a compound based on a humanized chip, molecular fingerprint and deep learning according to claim 2, wherein the specific steps of calculating the similarity between the compound to be tested and the toxic molecule in the database in S2 are as follows:

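S2.2, the calculation formula of the Tanimoto similarity method is:

$$T(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2 - \sum_{i=1}^{n} a_i b_i}$$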

4. The method for predicting compound toxicity based on humanized chip, molecular fingerprint and deep learning according to claim 3, wherein the specific steps of creating the preliminary model in S4 are as follows:

5. The method for predicting the toxicity of the compound based on the humanized chip, the molecular fingerprint and the deep learning according to claim 4, wherein the specific steps of establishing the toxicity prediction model by the S5 are as follows:

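S5.2, the distance expression used by the KNN model is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

where $p = (p_1, \ldots, p_n)$ is a point in the model and $q = (q_1, \ldots, q_n)$ is the point to be predicted;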

and S5.4, visually displaying the prediction result of the model.