CN111798919B - Tumor neoantigen prediction method, prediction device and storage medium - Google Patents
- Publication number
- CN111798919B CN202010587400.2A
- Authority
- CN
- China
- Prior art keywords
- amino acid
- codes
- chromatin
- conformation
- peptide sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
Abstract
The invention relates to a tumor neoantigen prediction method, a prediction device and a storage medium based on higher-order chromatin conformation and deep sparse learning. The method provides a deep neural network prediction model based on group selection and trains the model on training data to obtain an immunogenicity prediction value for an object to be predicted (namely a potential tumor neoantigen peptide), wherein each sample used in training the deep neural network prediction model includes chromatin 3D conformation information and features generated based on the polypeptide amino acid sequence. Compared with the prior art, the method offers high prediction accuracy and convenient prediction.
Description
Technical Field
The invention relates to the field of neoantigen prediction for personalized tumor immunotherapy, and in particular to a tumor neoantigen prediction method, a prediction device and a storage medium based on higher-order chromatin conformation and deep sparse learning.
Background
At present, conventional treatment of tumor patients relies mainly on non-individualized means such as surgical excision, chemoradiotherapy and targeted drug therapy. These conventional means have many problems, such as incomplete treatment, severe side effects, and a tendency toward tumor metastasis and drug resistance, and they only temporarily prolong the survival of tumor patients.
In recent years, tumor immunotherapy, in which the patient's own immune system targets the patient's tumor cells, has attracted wide attention. In personalized tumor immunotherapy, the patient-specific target molecules that play the critical role are called tumor neoantigens. Tumor neoantigens are proteins produced by mutations in the tumor genome; because they contain non-synonymous mutations, they differ from abnormally expressed tumor self-protein antigens. In vivo, a tumor neoantigen can be recognized as a foreign antigen by the patient's own immune system and is not affected by central tolerance, which enables the immune system to specifically target the patient's tumor cells. Tumor neoantigens prepared as vaccines or polypeptide preparations for tumor immunotherapy can therefore selectively kill tumor cells, with high safety and a marked effect. The key to this strategy is to select, individually, accurately and efficiently, the tumor neoantigens with good expected efficacy from among the many peptide fragments that distinguish tumor from normal tissue. However, existing techniques for selecting tumor neoantigens still have many technical problems, such as a large selection workload and low accuracy.
The vigorous development of genomics over the last twenty years has provided powerful support for tumor research. By comparing the genomes of tumor cells and normal cells, many genetic variations closely related to tumorigenesis and tumor development have been discovered, and their molecular mechanisms have been partially revealed, providing strong technical support for new approaches to tumor diagnosis, typing, prognosis and clinical treatment guidance. Regarding somatic mutations of the tumor genome, it has been found that a single mutation on chromatin cannot by itself cause a tumor. In the tumor cells of almost every tumor patient, numerous genetic and epigenetic variations can be detected, and it is very common to find several to hundreds of coexisting gene mutations, chromosomal translocations accompanied by gene mutations, chromosomal copy number variations at multiple positions, and the like. A growing body of evidence shows that co-occurring genetic variations (simultaneous or sequential) follow intrinsic rules: some gene mutations are frequently accompanied by particular other gene mutations rather than occurring at random. The genetic structural basis of such co-occurring mutations is not yet clear. Establishing the relevant mechanism would lay a theoretical foundation for a deeper understanding of the molecular mechanisms of tumorigenesis and tumor development, especially the causal relationships among genetic events in tumor development, would provide an effective means of accurately selecting tumor neoantigens, and would further provide a basis for tumor diagnosis and treatment.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a tumor neoantigen prediction method, a prediction device and a storage medium, based on higher-order chromatin conformation and deep sparse learning, that offer high prediction accuracy and convenient prediction.
The object of the invention is achieved by the following technical solution:
A tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning, in which an object to be predicted is processed by a trained deep neural network prediction model based on group selection to obtain tumor neoantigen immunogenicity information corresponding to the object to be predicted;
wherein each sample used in training the deep neural network prediction model includes chromatin 3D conformation information and features generated based on the polypeptide amino acid sequence.
Further, the number of features in each sample is on the order of thousands.
Further, each feature of a sample belongs to a certain group, and during training of the neural network model all features in a given group are selected together or removed together.
Further, the output of the deep neural network prediction model based on group selection comprises the neoantigens predicted to activate immunogenicity and the several features most strongly associated with those neoantigens.
Further, the deep neural network prediction model based on group selection takes the form of fully connected layers.
Further, the chromatin 3D conformation information is derived from a cellular chromatin 3D conformation heat map matrix obtained by Hi-C (a chromatin conformation capture technique) experiments.
Further, the chromatin 3D conformation information is obtained from public Hi-C datasets.
Further, the features generated based on the polypeptide amino acid sequence include features of polypeptides containing an amino acid mutation site and high-expression information for the gene carrying the mutation.
The invention also provides a tumor neoantigen prediction device based on higher-order chromatin conformation and deep sparse learning, which comprises:
a data acquisition unit, which acquires samples for training, each sample including chromatin 3D conformation information and features generated based on the polypeptide amino acid sequence;
a model training unit, which trains on the samples to obtain a deep neural network prediction model based on group selection;
and a prediction unit, which acquires an object to be predicted and processes it with the deep neural network prediction model based on group selection to obtain the tumor neoantigen immunogenicity information corresponding to the object to be predicted.
The invention also provides a computer-readable storage medium comprising a computer program which can be executed by a processor to implement the prediction method.
Compared with the prior art, the invention has the following beneficial effects:
Firstly, building on the inventors' earlier original research on higher-order chromatin conformation, the invention proposes examining, from the perspective of chromatin 3D conformation, whether the amino acid polypeptide antigen corresponding to a mutated DNA site can activate T cell immunogenicity. Chromatin 3D conformation, namely the spatial distribution on chromatin of the DNA mutation site corresponding to a neoantigen peptide, is added to the feature set used for machine learning prediction, which significantly improves the accuracy of predicting whether a neoantigen is immunogenic.
Secondly, the invention independently develops a Group Feature Selection based Deep Neural Network (DNN-GFS) classification model. Prediction is convenient and the prediction workload is small (unnecessary nodes and edges of the input layer of the neural network are pruned). The model has the following advantages:
1. the feature set contains thousands of features, including comprehensive chromatin three-dimensional structure information, and the model better avoids the overfitting that conventional deep neural networks suffer when the number of features is large, thereby improving overall classification accuracy;
2. unlike a conventional deep neural network, which is merely a black box to the user, the method selects input features while performing classification and prediction and identifies the most important features, providing a basis for further mining the correlation between input features and the output result;
3. the invention adopts a group selection strategy, so that features that belong together are selected as a group, that is, the features of one group are either all selected or all removed. The model is therefore compatible with prior knowledge about feature grouping, which improves its self-learning effect.
Drawings
FIG. 1 is a schematic diagram of the principle of the present invention; the problem addressed by the invention is mainly the content within the dashed box;
FIG. 2 is a schematic representation of the true labels of 3909 peptides that are immunopositive or immunonegative;
FIG. 3 shows ROC curves comparing the prediction method of the present invention (DNN-GFS) with other prediction methods, namely Deep Neural Network (DNN), Support Vector Machine (SVM), Logistic Regression (LR), K-nearest neighbor algorithm (KNN), Neopepsee, pTuneos, DeepHLApan, NetMHCpan, NetMHC and IEDB Immunogenicity, under 5-fold and leave-one-out (LOO) cross-validation;
FIG. 4 shows precision-recall curves (P-R curves) comparing the different prediction methods under 5-fold and LOO cross-validation;
FIG. 5 compares the prediction performance, in terms of ROC curves and P-R curves, on different independent validation datasets;
FIG. 6 compares the distributions of the scores assigned by the different methods to positive and negative samples under LOO cross-validation, under 5-fold cross-validation, and on the validation datasets, respectively;
FIG. 7 is a schematic diagram of the deep neural network based on group feature selection (DNN-GFS) of the present invention, in which panel a illustrates features belonging to different large groups, panel b explains the DNN-GFS architecture and the effect of group feature selection, and panel c illustrates the geometric principles of the different regularization terms applied to the neural network weights, shown as two-dimensional projections from three representative viewing angles.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
In recent years, Hi-C technology has made it possible to explore the abnormal chromatin structure of tumor cells more globally, and long-range regulation between chromatin regions has been found to probably play a key role in gene regulation. In previous work, the inventors of the present application found that co-occurring point mutations in almost all tumors are clearly close to one another in the three-dimensional conformation of chromatin, and on that basis proposed and published the concept of "spatial mutation hot spots of tumors". Extending this concept, the inventors consider the notion of a "cell functional block driven by chromatin three-dimensional conformation" to be very important, since it helps to examine tumor development from a new angle. Existing methods for discovering personalized neoantigen peptides for tumor immunotherapy usually focus on the sequence attributes of the antigen peptide, the interaction between the antigen peptide and MHC molecules, the interaction between the antigen peptide-MHC complex (pMHC) and the TCR, and the like, but ignore the source of the antigen peptide, namely the corresponding mutated gene, and its spatial distribution on chromatin. The inventors have, for the first time, systematically analyzed how the genes corresponding to antigen peptides are distributed in chromatin space and found a clear difference between the chromatin spatial distributions of immunogenic neoantigens (those able to activate T cells) and non-immunogenic neoantigens. The spatial distribution information of neoantigens on chromatin was therefore added to the feature set of a machine learning prediction algorithm, and the accuracy of predicting whether a neoantigen is immunogenic was found to improve significantly.
On this basis, the invention realizes a tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning. The method processes an object to be predicted (namely a potential tumor neoantigen peptide) with a trained deep neural network prediction model based on group selection, that is, a deep sparse learning algorithm based on group feature selection (DNN-GFS, Group Feature Selection based Deep Neural Network), to obtain the tumor neoantigen immunogenicity information corresponding to the object to be predicted, wherein each sample used in training the deep neural network prediction model includes chromatin 3D conformation information and features generated based on the polypeptide amino acid sequence. The principle of the method is shown in the dashed box of FIG. 1.
The feature set of each sample is formed by combining chromatin 3D conformation information with other features generated based on the amino acid sequence of a polypeptide, and represents one particular polypeptide. The number of features per sample is on the order of thousands (more than 5000 features) and specifically comprises: the <x, y, z> 3D coordinates, in chromatin 3D space, of the DNA site corresponding to the target peptide; the distance of that site from the nucleus center (or the nuclear membrane); the HLA subtype codes of the MHC molecules presenting the antigen peptide; the occurrence frequency of each of the 20 amino acids in the target peptide; and, for the antigen peptide sequence, the amino acid sparse codes, amino acid BLOSUM codes, amino acid BLOMAP codes, amino acid side chain classification codes, amino acid side chain polarity codes, amino acid side chain charge codes, amino acid side chain hydrophilicity and hydrophobicity codes, amino acid side chain molecular weight codes, codes for the occurrence frequency of the amino acid side chains in a biological population, and codes based on all the amino acid indices listed in the AAindex database. In this embodiment, each sample is a vector containing 5459 features.
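To make the feature layout concrete, the following minimal sketch assembles a small subset of the feature groups listed above (chromatin coordinates, radial distance, amino acid frequency and sparse codes) for a 9-residue peptide and records which group each column belongs to. It is an illustration only: the function names, the group names, the example peptide and coordinates are assumptions, and the resulting 204-dimensional vector is far smaller than the 5459 features of the embodiment.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def aa_frequency(peptide):
    """Occurrence frequency of each of the 20 amino acids in the peptide."""
    counts = np.zeros(len(AMINO_ACIDS))
    for aa in peptide:
        counts[AA_INDEX[aa]] += 1
    return counts / len(peptide)

def sparse_encoding(peptide):
    """Sparse (one-hot) code: one 20-dimensional indicator per position, flattened."""
    onehot = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for pos, aa in enumerate(peptide):
        onehot[pos, AA_INDEX[aa]] = 1.0
    return onehot.ravel()

def build_features(peptide, xyz, radial_distance):
    """Concatenate feature groups; also return the group id of every column."""
    # Hypothetical group names; the patent lists many more groups than these four.
    blocks = {
        "chromatin_xyz": np.asarray(xyz, dtype=float),
        "radial_distance": np.array([radial_distance], dtype=float),
        "aa_frequency": aa_frequency(peptide),
        "aa_sparse": sparse_encoding(peptide),
    }
    vector = np.concatenate(list(blocks.values()))
    group_ids = np.concatenate(
        [np.full(block.size, g) for g, block in enumerate(blocks.values())]
    )
    return vector, group_ids

# Made-up peptide and coordinates, for illustration only.
vector, group_ids = build_features("ACDEFGHIK", xyz=(0.1, -0.4, 0.7), radial_distance=0.81)
```

Keeping a `group_ids` array alongside the feature vector is one convenient way to tell a group-selection model which columns must be kept or dropped together.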
In this embodiment, the other features of a sample, which are based on the polypeptide amino acid sequence, are obtained by high-throughput whole exome sequencing (ExonSeq) and whole transcriptome sequencing (RNASeq). From the whole exome sequencing result, the mutation information of the tumor cells in the sample is obtained, giving the specific coordinate of each mutation and the chromosome on which it lies, and the mutation site of the corresponding encoded amino acid is then identified. From the whole transcriptome sequencing result, it can be determined which genes are highly expressed in the tumor cells. On the basis of these results, the polypeptides containing an amino acid mutation site are enumerated, and the variant polypeptides from highly expressed genes are then selected. The polypeptide length is 9 by default, but is not limited to 9.
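The enumeration step described above can be sketched as follows, assuming the mutation has already been mapped to a 0-based position in a protein sequence. The function names, the example protein, the expression values and the threshold are all hypothetical; the source does not specify how high expression is defined.

```python
def mutant_peptides(protein, position, mutant_aa, length=9):
    """Every window of `length` residues of the mutated protein that covers `position` (0-based)."""
    if not 0 <= position < len(protein):
        raise ValueError("position lies outside the protein")
    mutated = protein[:position] + mutant_aa + protein[position + 1:]
    first = max(0, position - length + 1)            # leftmost window start covering the site
    last = min(position, len(mutated) - length)      # rightmost window start covering the site
    return [mutated[start:start + length] for start in range(first, last + 1)]

def keep_highly_expressed(peptides_by_gene, expression, threshold):
    """Keep candidate peptides only for genes whose expression reaches the threshold."""
    return {
        gene: peptides
        for gene, peptides in peptides_by_gene.items()
        if expression.get(gene, 0.0) >= threshold
    }

# Made-up 20-residue protein with a Q -> L substitution at position 10.
protein = "MKTAYIAKQRQISFVKSHFS"
candidates = mutant_peptides(protein, position=10, mutant_aa="L")

# Hypothetical gene names, expression values and cut-off.
selected = keep_highly_expressed(
    {"GENE_A": candidates, "GENE_B": ["AAAAAAAAA"]},
    expression={"GENE_A": 35.2, "GENE_B": 0.4},
    threshold=1.0,
)
```

A mutation in the interior of a protein yields up to `length` candidate windows; mutations near either end yield fewer because some windows would run off the sequence.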
The chromatin 3D conformation information is a chromatin 3D conformation heat map matrix of tumor cells. In this embodiment, the chromatin 3D conformation information of a sample is obtained by Hi-C experiments, or is replaced by multiple Hi-C datasets from public databases.
The invention uses a molecular dynamics (MD) method to develop a three-dimensional conformation model of the human genome at a resolution (bin size) of 500 kb. The bins are coarse-grained into beads, and the complete genome is represented by a bead structure consisting of 23 polymer chains. The spatial position of the beads is influenced by chromatin connectivity, which keeps linearly adjacent beads close together in 3D, and by chromatin activity, which ensures that active regions lie close to the center of the nucleus. Chromatin activity is determined from an interval index that can be calculated directly from the Hi-C matrix described above. The distance of each bead from the nucleus center is assigned according to this interval index, and the chromatin conformation is then optimized from a random structure by molecular dynamics, applying a bias potential so that these distance constraints are satisfied. For each cell line, 300 feasible conformations were optimized from random starting conformations to reduce possible variation in further analysis.
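The two restraints described above (connectivity between neighboring beads and a radial target set by chromatin activity) can be illustrated with a toy energy function for a single chain. This is a sketch under strong assumptions, not the patented procedure: plain gradient descent stands in for molecular dynamics, the harmonic potentials, force constants, radii and the linear mapping from activity to target radius are all hypothetical, and only one short chain is modeled instead of 23.

```python
import numpy as np

def radial_targets(activity, r_min=0.5, r_max=3.0):
    """Map an activity index to a target radius: more active beads sit nearer the center."""
    a = np.asarray(activity, dtype=float)
    span = a.max() - a.min()
    if span == 0:
        return np.full(a.shape, 0.5 * (r_min + r_max))
    scaled = (a - a.min()) / span
    return r_max - scaled * (r_max - r_min)

def energy_and_grad(pos, target_r, bond_len=1.0, k_bond=1.0, k_rad=1.0):
    """Harmonic bond energy between consecutive beads plus a harmonic radial bias."""
    eps = 1e-12
    grad = np.zeros_like(pos)
    # Connectivity: keep linearly adjacent beads about one bond length apart.
    delta = pos[1:] - pos[:-1]
    dist = np.linalg.norm(delta, axis=1) + eps
    e_bond = 0.5 * k_bond * np.sum((dist - bond_len) ** 2)
    pull = (k_bond * (dist - bond_len) / dist)[:, None] * delta
    grad[1:] += pull
    grad[:-1] -= pull
    # Bias potential: pull each bead toward its assigned distance from the center.
    radius = np.linalg.norm(pos, axis=1) + eps
    e_rad = 0.5 * k_rad * np.sum((radius - target_r) ** 2)
    grad += (k_rad * (radius - target_r) / radius)[:, None] * pos
    return e_bond + e_rad, grad

def relax(start, target_r, steps=500, lr=0.05):
    """Gradient descent from a random structure (a stand-in for an MD optimization)."""
    pos = start.copy()
    for _ in range(steps):
        _, grad = energy_and_grad(pos, target_r)
        pos -= lr * grad
    return pos

rng = np.random.default_rng(0)
activity = np.sin(np.linspace(0.0, 3.0 * np.pi, 40))  # made-up activity profile
targets = radial_targets(activity)
start = rng.normal(size=(40, 3))                      # random initial conformation
final = relax(start, targets)
e_start = energy_and_grad(start, targets)[0]
e_final = energy_and_grad(final, targets)[0]
```

Repeating the relaxation from many different random starts, as the text does with 300 conformations per cell line, gives an ensemble whose bead coordinates can be averaged or compared.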
The prediction method uses the deep neural network prediction model based on group selection to score and classify the polypeptides encoded by the input feature sets, and can thereby identify the polypeptides most likely to activate T cell immunogenicity. In this embodiment, the input of the deep neural network prediction model is the 5459 feature codes of a potential neoantigen peptide sequence, and the output is a score indicating whether the polypeptide can activate immunogenicity; a higher score indicates a greater likelihood of activating T cell immunogenicity. During model training, as shown in FIG. 7, each feature of a sample belongs to at most one group, and each group contains one or more features. This makes a group selection strategy convenient: features that should appear together or be removed together, namely group features, are selected or removed simultaneously, which improves the self-learning effectiveness of the model, reduces the risk of overfitting and improves computational efficiency.
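One common way to obtain the all-or-nothing behavior described above is a group-lasso style penalty on the first-layer weights, whose proximal step shrinks each group's weight block and sets it exactly to zero when its norm is small. The source does not state which regularizer DNN-GFS uses, so the sketch below is an assumption for illustration; the function names, the threshold value and the toy weight matrix are hypothetical, and the usual scaling of the threshold by group size is omitted for brevity.

```python
import numpy as np

def group_soft_threshold(weights, group_ids, threshold):
    """Shrink each group's rows of a (n_features, n_hidden) weight matrix.

    A group whose block norm is at most `threshold` is set to zero as a whole,
    so its features are removed together; other groups are kept together.
    """
    shrunk = weights.copy()
    for g in np.unique(group_ids):
        rows = group_ids == g
        norm = np.linalg.norm(shrunk[rows])
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk[rows] *= scale
    return shrunk

def selected_groups(weights, group_ids):
    """Groups that still have at least one non-zero input weight."""
    return [int(g) for g in np.unique(group_ids) if np.any(weights[group_ids == g] != 0)]

# Toy first layer: 6 input features in 3 groups, 4 hidden units.
group_ids = np.array([0, 0, 1, 1, 1, 2])
weights = np.vstack([
    np.full((2, 4), 1.0),    # group 0: strong weights
    np.full((3, 4), 0.01),   # group 1: weak weights, expected to be pruned
    np.full((1, 4), 0.5),    # group 2: moderate weights
])
pruned = group_soft_threshold(weights, group_ids, threshold=0.1)
kept = selected_groups(pruned, group_ids)
```

In a training loop this step would typically be applied to the input-layer weights after each gradient update, which is what removes unnecessary input nodes and edges as a group.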
The deep neural network prediction model of this embodiment takes the form of fully connected layers. Its output comprises the neoantigens predicted to be immunogenic and the several features most strongly associated with those neoantigens. Because the most important features are screened out during prediction, the relationship between the features and the output result can be clarified more easily.
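A fully connected scoring network with a simple feature ranking can be sketched as below. The source specifies only fully connected layers, so the ReLU activations, the sigmoid output, the layer sizes and the use of first-layer weight norms as a measure of association are all assumptions made for illustration, and the random weights stand in for a trained model.

```python
import numpy as np

def forward(x, params):
    """Fully connected network: ReLU hidden layers, sigmoid output score in (0, 1)."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = params[-1]
    logit = h @ W + b
    return 1.0 / (1.0 + np.exp(-logit))

def top_features(first_layer_weights, k):
    """Indices of the k input features with the largest first-layer weight norm."""
    importance = np.linalg.norm(first_layer_weights, axis=1)
    return np.argsort(-importance)[:k].tolist()

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.1, size=(12, 8))  # 12 input features, 8 hidden units
W1[3] = 2.0                               # pretend training made feature 3 important
W1[7] = 1.0                               # and feature 7 somewhat important
params = [
    (W1, np.zeros(8)),
    (rng.normal(scale=0.1, size=(8, 1)), np.zeros(1)),
]
x = rng.normal(size=12)                   # made-up feature vector of one peptide
score = forward(x, params)
ranking = top_features(W1, k=2)
```

With group selection applied during training, the rows of pruned groups are exactly zero, so they fall to the bottom of this ranking automatically.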
FIGS. 2 to 6 show the prediction results of the above prediction method (DNN-GFS) and of other classification methods, namely DNN, SVM, LR, KNN, Neopepsee, pTuneos, DeepHLApan, NetMHCpan, NetMHC and IEDB Immunogenicity. Taken together, these figures show that the DNN-GFS method has better predictive performance for neoantigens than the other methods compared.
After prediction with the above method, the high-scoring polypeptide sequences classified as positive are synthesized to obtain the predicted tumor neoantigens, whose immune efficacy can then be tested in mice.
In another embodiment, a tumor neoantigen prediction device based on higher-order chromatin conformation and deep sparse learning is provided, comprising a data acquisition unit, a model training unit and a prediction unit. The data acquisition unit acquires samples for training, each sample comprising chromatin 3D conformation information and features generated based on a polypeptide amino acid sequence; the model training unit trains on the samples to obtain a deep neural network prediction model based on group selection; and the prediction unit acquires an object to be predicted, processes it with the deep neural network prediction model based on group selection, and obtains the tumor neoantigen immunogenicity information corresponding to the object to be predicted.
In another embodiment, a computer-readable storage medium is provided, comprising a computer program executable by a processor to implement the prediction method.
In another embodiment, a web page is provided through which, once the object to be predicted has been obtained, the tumor neoantigen prediction result is rapidly obtained using the prediction method.
The preferred embodiments of the invention have been described in detail above. It should be understood that those skilled in the art could devise numerous modifications and variations in light of the present teachings without departing from the inventive concept. Therefore, any technical solution that a person skilled in the art can obtain from the prior art through logical analysis, reasoning or limited experimentation in accordance with the concept of the present invention falls within the scope of protection determined by the claims.
Claims (9)
1. A tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning, characterized in that an object to be predicted is processed by a trained deep neural network prediction model based on group feature selection, and tumor neoantigen immunogenicity information corresponding to the object to be predicted is obtained;
the feature set of each sample used in training the deep neural network prediction model is formed by combining chromatin 3D conformation information with other features generated from polypeptide amino acid sequences, wherein the chromatin 3D conformation information is obtained by a molecular-dynamics-based method for modeling the three-dimensional conformation of the human genome, each feature of a sample belongs to a group, and during training of the neural network model the features within a group are either all selected or all removed;
the features of each sample comprise the <x, y, z> 3D coordinates, in chromatin 3D space, of the DNA site corresponding to the target peptide, the distance between the DNA site corresponding to the target peptide and the nucleus center or the nuclear membrane, HLA subtype codes of the MHC molecules presenting the antigen peptide, the occurrence frequency of the 20 amino acids in the target peptide, amino acid sparse codes of the antigen peptide sequence, amino acid BLOSUM codes of the antigen peptide sequence, amino acid BLOMAP codes of the antigen peptide sequence, amino acid side chain classification codes of the antigen peptide sequence, amino acid side chain polarity codes of the antigen peptide sequence, amino acid side chain charge codes of the antigen peptide sequence, amino acid side chain hydrophilicity and hydrophobicity codes of the antigen peptide sequence, amino acid side chain molecular weight codes of the antigen peptide sequence, codes for the occurrence frequency of the amino acid side chains of the antigen peptide sequence in biological populations, and codes based on all amino acid indices listed in the AAindex database.
2. The method of claim 1, wherein the number of features of each sample is on the order of thousands.
3. The method of claim 1, wherein the output of the deep neural network prediction model based on group feature selection comprises a prediction of the tumor immunogenicity of potential neoantigens and the plurality of features most strongly associated with the neoantigens.
4. The method for predicting tumor neoantigens based on higher-order chromatin conformation and deep sparse learning according to claim 1, wherein the deep neural network prediction model based on group feature selection consists of fully connected layers.
5. The method for predicting tumor neoantigens based on higher-order chromatin conformation and deep sparse learning according to claim 1, wherein the chromatin 3D conformation information is derived from a cellular chromatin 3D conformation heat map matrix obtained by Hi-C experiments.
6. The method for predicting tumor neoantigens based on higher-order chromatin conformation and deep sparse learning according to claim 1, wherein the chromatin 3D conformation information is obtained from a public Hi-C dataset.
7. The method of claim 1, wherein the features generated from the polypeptide amino acid sequence include features of the polypeptide containing the amino acid mutation site and high-expression information for the gene containing the mutation.
8. A tumor neoantigen prediction device based on higher-order chromatin conformation and deep sparse learning, comprising:
a data acquisition unit for acquiring samples for training, wherein the feature set of each sample is formed by combining chromatin 3D conformation information with other features generated from polypeptide amino acid sequences, and the chromatin 3D conformation information is obtained by a molecular-dynamics-based method for modeling the three-dimensional conformation of the human genome;
a model training unit for training on the samples to obtain a deep neural network prediction model based on group feature selection, wherein each feature of a sample belongs to a group, and during training of the neural network model the features within a group are selected or removed together;
a prediction unit for acquiring an object to be predicted, processing the object to be predicted through the deep neural network prediction model based on group feature selection, and obtaining tumor neoantigen immunogenicity information corresponding to the object to be predicted;
the features of each sample comprise the <x, y, z> 3D coordinates, in chromatin 3D space, of the DNA site corresponding to the target peptide, the distance between the DNA site corresponding to the target peptide and the nucleus center or the nuclear membrane, HLA subtype codes of the MHC molecules presenting the antigen peptide, the occurrence frequency of the 20 amino acids in the target peptide, amino acid sparse codes of the antigen peptide sequence, amino acid BLOSUM codes of the antigen peptide sequence, amino acid BLOMAP codes of the antigen peptide sequence, amino acid side chain classification codes of the antigen peptide sequence, amino acid side chain polarity codes of the antigen peptide sequence, amino acid side chain charge codes of the antigen peptide sequence, amino acid side chain hydrophilicity and hydrophobicity codes of the antigen peptide sequence, amino acid side chain molecular weight codes of the antigen peptide sequence, codes for the occurrence frequency of the amino acid side chains of the antigen peptide sequence in biological populations, and codes based on all amino acid indices listed in the AAindex database.
9. A computer-readable storage medium, comprising a computer program executable by a processor to perform the prediction method of any one of claims 1-7.
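A few of the per-sample features listed in claims 1 and 8 can be sketched as follows, purely as an illustration. The helper names, the residue ordering in `AMINO_ACIDS`, the example peptide and coordinates, and the choice of the coordinate origin as the nucleus center are assumptions; the patent does not specify them. The sketch computes the occurrence frequency of the 20 amino acids, a sparse (one-hot) code of the peptide, and the distance of a DNA site from the nucleus center, and keeps the results in named groups so that a group-wise selection step could keep or drop each group as a whole.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; ordering is an assumption

def aa_frequency(peptide):
    """Fraction of each of the 20 amino acids in a non-empty peptide."""
    return [peptide.count(a) / len(peptide) for a in AMINO_ACIDS]

def sparse_encode(peptide):
    """Sparse (one-hot) code: 20 values per residue, flattened."""
    code = []
    for residue in peptide:
        code.extend(1 if residue == a else 0 for a in AMINO_ACIDS)
    return code

def distance_to_center(xyz, center=(0.0, 0.0, 0.0)):
    """Euclidean distance from a DNA site's <x, y, z> position to the nucleus center."""
    return math.dist(xyz, center)

def build_feature_groups(peptide, xyz):
    """Collect features by group name (group names are hypothetical)."""
    return {
        "chromatin_3d": list(xyz) + [distance_to_center(xyz)],
        "aa_frequency": aa_frequency(peptide),
        "aa_sparse": sparse_encode(peptide),
    }

# Illustrative input only: an 8-residue peptide and made-up coordinates.
features = build_feature_groups("SIINFEKL", (3.0, 4.0, 0.0))
print({name: len(values) for name, values in features.items()})
```

The remaining codes named in the claims (BLOSUM, BLOMAP, side chain properties, AAindex) would follow the same pattern, with each residue mapped to a lookup-table vector instead of a one-hot vector.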
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010587400.2A CN111798919B (en) | 2020-06-24 | 2020-06-24 | Tumor neoantigen prediction method, prediction device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111798919A (en) | 2020-10-20 |
CN111798919B (en) | 2022-11-25 |
Family
ID=72803402
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010587400.2A Active CN111798919B (en) | 2020-06-24 | 2020-06-24 | Tumor neoantigen prediction method, prediction device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111798919B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113129998B (en) * | 2021-04-23 | 2022-06-21 | 云测智能科技有限公司 | Method for constructing prediction model of clinical individualized tumor neoantigen |
CN114242159B (en) * | 2022-02-24 | 2022-06-07 | 北京晶泰科技有限公司 | Method for constructing antigen peptide presentation prediction model, and antigen peptide prediction method and device |
WO2023168079A2 (en) * | 2022-03-04 | 2023-09-07 | New York University | Cell type-specific prediction of 3d chromatin architecture |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109073659A (en) * | 2016-02-16 | 2018-12-21 | 新加坡科技研究局 | Epigenomic analysis reveals the somatic promoter landscape of primary gastric adenocarcinomas |
CN110600077A (en) * | 2019-08-29 | 2019-12-20 | 北京优迅医学检验实验室有限公司 | Prediction method of tumor neoantigen and application thereof |
CN110592213A (en) * | 2019-09-02 | 2019-12-20 | 深圳市新合生物医疗科技有限公司 | Gene panel for prediction of neoantigen load and detection of genomic mutations |
CN110770838A (en) * | 2017-12-01 | 2020-02-07 | Illumina公司 | Method and system for determining clonality of somatic mutations |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201608000D0 (en) * | 2016-05-06 | 2016-06-22 | Oxford Biodynamics Ltd | Chromosome detection |
CN107119120A (en) * | 2017-05-04 | 2017-09-01 | 河海大学常州校区 | A kind of key effect molecular detecting method based on chromatin 3D conformation technologies |
CN108300767B (en) * | 2017-10-27 | 2021-08-20 | 清华大学 | Analysis method for interaction of nucleic acid segments in nucleic acid complex |
US20200411135A1 (en) * | 2018-02-27 | 2020-12-31 | Gritstone Oncology, Inc. | Neoantigen Identification with Pan-Allele Models |
CN110853706B (en) * | 2018-08-01 | 2022-07-22 | 中国科学院深圳先进技术研究院 | Tumor clone composition construction method and system integrating epigenetics |
CN109021062B (en) * | 2018-08-06 | 2021-08-20 | 倍而达药业(苏州)有限公司 | Screening method of tumor neoantigen |
CN110277135B (en) * | 2019-08-10 | 2021-06-01 | 杭州新范式生物医药科技有限公司 | Method and system for selecting individualized tumor neoantigen based on expected curative effect |
CN110752041B (en) * | 2019-10-23 | 2023-11-07 | 深圳裕策生物科技有限公司 | Method, device and storage medium for predicting neoantigen based on second-generation sequencing |
Also Published As
Publication number | Publication date |
---|---|
CN111798919A (en) | 2020-10-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111798919B (en) | Tumor neoantigen prediction method, prediction device and storage medium | |
Tampuu et al. | ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples | |
DeWitt III et al. | Human T cell receptor occurrence patterns encode immune history, genetic background, and receptor specificity | |
Luo et al. | Disease gene prediction by integrating ppi networks, clinical rna-seq data and omim data | |
Chen et al. | Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou’s pseudo amino acid composition | |
Sayal et al. | Quantitative perturbation-based analysis of gene expression predicts enhancer activity in early Drosophila embryo | |
US10886007B2 (en) | Methods and systems for identification of biomolecule sequence coevolution and applications thereof | |
A Theofilatos et al. | Computational approaches for the prediction of protein-protein interactions: a survey | |
Peng et al. | A novel codon-based de Bruijn graph algorithm for gene construction from unassembled transcriptomes | |
Bi et al. | Prediction of epitope-associated TCR by using network topological similarity based on deepwalk | |
Palmal et al. | Integrative prognostic modeling for breast cancer: Unveiling optimal multimodal combinations using graph convolutional networks and calibrated random forest | |
Liu et al. | Computational intelligence and bioinformatics | |
EP4350708A1 (en) | Method for diagnosing cancer and predicting cancer type by using terminal sequence motif frequency and size of cell-free nucleic acid fragment | |
Wang et al. | Sequence-based protein-protein interaction prediction via support vector machine | |
Tahmasebipour et al. | Disease-gene association using a genetic algorithm | |
Lesturgie et al. | Ecological and biogeographic features shaped the complex evolutionary history of an iconic apex predator (Galeocerdo cuvier) | |
Azé et al. | Using Kendall-τ meta-bagging to improve protein-protein docking predictions | |
Shao et al. | Computational prediction of human body-fluid protein | |
Zhu et al. | Identifying virus-receptor interactions through matrix completion with similarity fusion | |
Kavousi et al. | A post-method condition analysis of using ensemble machine learning for cancer prognosis and diagnosis: a systematic review | |
BASU | Application of machine learning polymer models explaining hypokalemia in COVID-19 patients | |
Xie et al. | A review of artificial intelligence applications in bacterial genomics | |
Huang | Computational Discovery and Annotations of Cell-Type Specific Long-Range Gene Regulation | |
Zeng et al. | Chrombus-XMBD: A Graph Generative Model Predicting 3D-Genome, ab initio from Chromatin Features | |
Mutalib et al. | Towards applying associative classifier for genetic variants |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20231029; Address after: 200240 No. 800, Dongchuan Road, Shanghai, Minhang District; Patentee after: Shi Yi; Address before: 200240 No. 800, Dongchuan Road, Shanghai, Minhang District; Patentee before: Shanghai Jiao Tong University |