CN111798919B

CN111798919B - Tumor neoantigen prediction method, prediction device and storage medium

Info

Publication number: CN111798919B
Application number: CN202010587400.2A
Authority: CN
Inventors: 石毅; 贺光
Original assignee: Shanghai Jiao Tong University
Current assignee: DePuShi (Hangzhou) Biotechnology Co.,Ltd.
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2022-11-25
Anticipated expiration: 2040-06-24
Also published as: CN111798919A

Abstract

The invention relates to a tumor neoantigen prediction method, prediction device and storage medium based on higher-order chromatin conformation and deep sparse learning. The method provides a deep neural network prediction model based on group selection and trains it on training data to obtain a predicted immunogenicity value for an object to be predicted (that is, a potential tumor neoantigen peptide). Each sample used to train the deep neural network prediction model includes chromatin 3D conformation information and features generated from the polypeptide amino acid sequence. Compared with the prior art, the method offers advantages such as high prediction accuracy and convenient prediction.

Description

A tumor neoantigen prediction method, prediction device and storage medium

Technical Field

The present invention relates to neoantigen prediction for personalized tumor immunotherapy, and in particular to a tumor neoantigen prediction method, prediction device and storage medium based on higher-order chromatin conformation and deep sparse learning.

Background Art

At present, conventional treatment of tumor patients relies mainly on non-individualized approaches such as surgical resection, radiotherapy and chemotherapy, and targeted drug therapy. These approaches have many problems, such as incomplete treatment, severe side effects, and a tendency to promote tumor metastasis and drug resistance, and they prolong patient survival only briefly.

In recent years, tumor immunotherapy, which uses the patient's own immune system to target the patient's tumor cells, has attracted attention. In individualized tumor immunotherapy, the patient-specific target molecules that play the key role are called tumor neoantigens. Tumor neoantigens are proteins produced by mutations in the tumor genome; because they contain non-synonymous mutations, they differ from aberrantly expressed tumor self-protein antigens. In vivo, tumor neoantigens can be recognized by the patient's own immune system as foreign antigens and are not subject to central tolerance, so the immune system can specifically target the patient's tumor cells. Preparing tumor neoantigens as vaccines or polypeptide preparations for tumor immunotherapy can therefore selectively kill tumor cells with high safety and marked efficacy. In this strategy, it is critical to select, accurately, efficiently and for each individual patient, the tumor neoantigens expected to be most effective from among the many peptides that may distinguish tumor from normal tissue. Current neoantigen selection techniques, however, still have many technical problems, such as a heavy selection workload and low accuracy.

The rapid development of genomics over the past two decades has provided strong support for tumor research. By comparing the genomes of tumor cells and normal cells, researchers have discovered many genetic variations closely related to tumor occurrence and development and have partially revealed their molecular mechanisms, providing strong technical support for developing new methods of tumor diagnosis, classification and prognosis and for guiding clinical treatment. Regarding somatic mutations in tumor genomes, a single mutation on chromatin is often insufficient to cause a tumor. Testing the tumor cells of almost any tumor patient reveals numerous genetic and epigenetic variations; the co-existence of several to hundreds of gene mutations, chromosomal translocations accompanied by gene mutations, and chromosome copy number variations at multiple locations are all very common. Growing evidence shows that these co-occurring (simultaneous or successive) genetic variations follow inherent patterns: some gene mutations tend to accompany others rather than occurring at random. The underlying genetic structural basis of such co-occurring mutations is not yet well understood, but clarifying the mechanisms involved would lay a theoretical foundation for a deeper understanding of the molecular mechanisms of tumor occurrence and development, especially the causal relationships among genetic events during tumor development, would provide an effective means for accurately selecting tumor neoantigens, and would in turn provide a basis for tumor diagnosis and treatment.

Summary of the Invention

The purpose of the present invention is to overcome the above defects of the prior art by providing a tumor neoantigen prediction method, prediction device and storage medium, based on higher-order chromatin conformation and deep sparse learning, that offer high prediction accuracy and convenient prediction.

The purpose of the present invention can be achieved through the following technical solutions:

A tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning, in which an object to be predicted is processed by a trained deep neural network prediction model based on group selection to obtain tumor neoantigen immunogenicity information corresponding to the object to be predicted;

wherein each sample used to train the deep neural network prediction model includes chromatin 3D conformation information and features generated from the polypeptide amino acid sequence.

Further, the number of features of each sample is on the order of thousands.

Each feature of a sample belongs to a group, and during training of the neural network model the features within a given group are either all selected or all eliminated.

Further, the output of the group-selection-based deep neural network prediction model includes the neoantigens that activate immunogenicity and the several features most strongly associated with each such neoantigen.

Further, the group-selection-based deep neural network prediction model takes the form of fully connected layers.

Further, the chromatin 3D conformation information is derived from a heat map matrix of cellular chromatin 3D conformation obtained through Hi-C (a chromatin conformation capture technique) experiments.

Further, the chromatin 3D conformation information is obtained from public Hi-C datasets.

Further, the features generated from the polypeptide amino acid sequence include features of polypeptides containing an amino acid mutation site and high-expression information of the gene containing that mutation.

The present invention also provides a tumor neoantigen prediction device based on higher-order chromatin conformation and deep sparse learning, comprising:

a data acquisition unit, which acquires samples for training, each sample including chromatin 3D conformation information and features generated from the polypeptide amino acid sequence;

a model training unit, which trains on the samples to obtain a deep neural network prediction model based on group selection; and

a prediction unit, which acquires an object to be predicted and processes it with the group-selection-based deep neural network prediction model to obtain tumor neoantigen immunogenicity information corresponding to the object to be predicted.

The present invention also provides a computer-readable storage medium comprising a computer program that can be executed by a processor to implement the prediction method described above.

Compared with the prior art, the present invention has the following beneficial effects:

First, building on the inventors' extensive original research on higher-order chromatin conformation, the present invention is the first to examine, from the perspective of chromatin 3D conformation, whether the polypeptide antigen corresponding to a mutated DNA site can activate T-cell immunogenicity. Adding chromatin 3D conformation, that is, the spatial distribution on chromatin of the DNA mutation sites corresponding to neoantigen peptides, to the feature set used for machine learning prediction markedly improves the accuracy of predicting whether a neoantigen is immunogenic.

Second, the present invention independently develops a classification model, the Group Feature Selection based Deep Neural Network (DNN-GFS). It is convenient to use and keeps the prediction workload small (unnecessary nodes and edges in the input layer of the neural network are pruned), and it has the following advantages:

1. The feature set of the present invention comprises thousands of features, including chromatin three-dimensional structure information. The model better avoids the overfitting that conventional deep neural networks suffer when the number of features is large, thereby improving overall classification accuracy;

2. Whereas a conventional deep neural network is a black box to its users, the present invention performs feature selection on the input features while making classification predictions and selects the most critical features, providing a basis for further exploring the correlation between input features and output results;

3. The present invention adopts a group selection strategy under which features that belong together are selected as a group, that is, the features of a group are all selected or all eliminated together, so that the model is compatible with prior knowledge about the grouping and its self-learning effectiveness is improved.

Description of the Drawings

Fig. 1 is a schematic diagram of the principle of the present invention; the problem addressed by the invention is mainly the content inside the dashed box;

Fig. 2 is a schematic diagram of the true immunopositive and immunonegative labels of 3909 peptides;

Fig. 3 compares the ROC curves, under 5-fold and leave-one-out (LOO) cross-validation, of the prediction method of the present invention (DNN-GFS) and other prediction methods: deep neural network (DNN), support vector machine (SVM), logistic regression (LR), K-nearest neighbors (KNN), Neopepsee, pTuneos, DeepHLApan, NetMHCpan, NetMHC and IEDBimmune;

Fig. 4 compares the precision-recall (P-R) curves of the different prediction methods under 5-fold and LOO cross-validation;

Fig. 5 compares the predictive performance of the different prediction methods on an independent validation dataset, in terms of ROC and P-R curves;

Fig. 6 compares the distributions of the scores that the different methods assign to positive and negative samples, under LOO cross-validation, under 5-fold cross-validation, and on the validation dataset;

Fig. 7 is a schematic diagram of the deep neural network based on group feature selection (DNN-GFS) of the present invention, in which (a) illustrates features belonging to groups of different sizes, (b) illustrates the DNN-GFS architecture and the effect of group feature selection, and (c) illustrates, from three representative angles, the geometric principle of applying different regularization terms to the weighted neural network connections, together with their two-dimensional projections.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments. The embodiments are implemented on the basis of the technical solution of the present invention, and detailed implementations and specific operating procedures are given, but the scope of protection of the present invention is not limited to the following embodiments.

In recent years, Hi-C technology has made it possible to survey abnormalities of chromatin structure in tumor cells more globally, and long-range regulation between chromatin regions has been found to potentially play a key role in gene regulation. In earlier work, the inventors of the present application found that in almost all tumors, co-occurring point mutations are markedly close to one another in the three-dimensional conformation of chromatin, and accordingly proposed and published the concept of "spatial mutation hotspots of tumors". Extending this concept, we consider the notion of "cellular functional blocks driven by chromatin three-dimensional conformation" to be very important, since it offers a new perspective on tumor occurrence and development. Existing methods for discovering personalized neoantigen peptides for tumor immunotherapy tend to focus on the sequence properties of the antigen peptide, the interaction between the antigen peptide and the MHC molecule, and the interaction between the peptide-MHC complex (pMHC) and the TCR, while ignoring the origin of the antigen peptide, namely whether the corresponding mutated gene has any special distribution on chromatin. We are the first to systematically analyze how the genes corresponding to antigen peptides are distributed in chromatin space, and we find that immunogenic neoantigens (those able to activate T cells) and non-immunogenic neoantigens differ significantly in their distribution in chromatin space. We therefore add the spatial distribution of neoantigens on chromatin to the feature set of the machine learning prediction algorithm and find that this markedly improves the accuracy of predicting whether a neoantigen is immunogenic.

On this basis, the present invention provides a tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning. The method processes an object to be predicted (that is, a potential tumor neoantigen peptide) with a trained deep neural network prediction model based on group selection (a deep sparse learning algorithm based on group feature selection: DNN-GFS, Group Feature Selection based Deep Neural Network) to obtain tumor neoantigen immunogenicity information corresponding to that object. Each sample used to train the deep neural network prediction model includes chromatin 3D conformation information and features generated from the polypeptide amino acid sequence. The principle of the method is shown inside the dashed box in Fig. 1.

The feature set of each sample combines chromatin 3D conformation information with other features generated from the polypeptide amino acid sequence, and represents a single polypeptide. Each sample has on the order of thousands of features (more than 5000), specifically: the <x, y, z> 3D coordinates, in chromatin 3D space, of the DNA site corresponding to the target peptide; the distance of that site from the center of the nucleus (or from the nuclear membrane); the HLA subtype code of the MHC molecule presenting the antigen peptide; the frequency of each of the 20 amino acids in the target peptide; the sparse amino acid encoding of the antigen peptide sequence; the BLOSM amino acid encoding of the antigen peptide sequence; the BLOMAP amino acid encoding of the antigen peptide sequence; the amino acid side-chain class encoding of the antigen peptide sequence; the side-chain polarity encoding; the side-chain charge encoding; the side-chain hydrophilicity/hydrophobicity encoding; the side-chain molecular weight encoding; the encoding of the frequency with which the amino acid side chains of the antigen peptide sequence occur in biological populations; and encodings based on all amino acid indices listed in the AAindex database (source: https://www.genome.jp/aaindex/). In this embodiment, each sample is a vector of 5459 features.
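Two of the sequence-derived features listed above, the frequency of the 20 amino acids and the sparse encoding, can be sketched as follows. This is only an illustrative sketch: the patent does not specify the residue ordering or the exact definition of the sparse encoding, so the alphabetical ordering in `AMINO_ACIDS`, the reading of "sparse encoding" as a one-hot vector per residue, and the function names are all assumptions.

```python
from typing import List

# The 20 standard amino acids, ordered by one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequency(peptide: str) -> List[float]:
    """Fraction of peptide positions occupied by each of the 20 amino acids."""
    if not peptide:
        raise ValueError("peptide must be non-empty")
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def sparse_encoding(peptide: str) -> List[int]:
    """One-hot encoding (assumed meaning of "sparse"): 20 indicators per residue."""
    encoded: List[int] = []
    for residue in peptide:
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unknown residue: {residue!r}")
        encoded.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return encoded


def sequence_features(peptide: str) -> List[float]:
    """Concatenate the frequency block and the one-hot block into one vector."""
    return amino_acid_frequency(peptide) + [float(v) for v in sparse_encoding(peptide)]
```

For a 9-residue peptide this yields 20 + 9 × 20 = 200 values; the remaining encodings in the list (BLOMAP, side-chain properties, AAindex indices) would be appended in the same way to reach the full feature vector.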

In this embodiment, the other features generated from the polypeptide amino acid sequence are obtained by high-throughput whole-exome sequencing (ExonSeq) and whole-transcriptome sequencing (RNASeq). The whole-exome sequencing results yield the mutations of the tumor cells in the sample, giving for each mutation its chromosome and specific coordinate, and identify those mutation sites that change the encoded amino acid. The whole-transcriptome sequencing results show which genes are highly expressed in the tumor cells. On this basis, all polypeptides containing an amino acid mutation site are first enumerated, and the variant polypeptides corresponding to highly expressed genes are then selected. The polypeptide length defaults to 9 but is not limited to 9.
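The enumeration step described above can be sketched as a sliding window over the mutated protein sequence. The function name, the 0-based indexing and the `(start, peptide)` return format are illustrative assumptions rather than details given in the patent.

```python
from typing import List, Tuple


def peptides_covering_mutation(
    protein: str, mut_index: int, length: int = 9
) -> List[Tuple[int, str]]:
    """Return (start, peptide) for every window of `length` residues in the
    mutated protein sequence that contains the mutated position (0-based)."""
    if not 0 <= mut_index < len(protein):
        raise ValueError("mut_index is outside the protein sequence")
    # Earliest window that still reaches the mutation, clipped at the N-terminus.
    first = max(0, mut_index - length + 1)
    # Latest window that starts at or before the mutation and fits in the protein.
    last = min(mut_index, len(protein) - length)
    return [(s, protein[s:s + length]) for s in range(first, last + 1)]
```

A mutation far from both ends of the protein yields `length` candidate peptides (9 by default); mutations near either terminus yield fewer. The subsequent filter on gene expression is not shown, because the patent does not state the expression threshold.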

The chromatin 3D conformation information is the heat map matrix of the chromatin 3D conformation of tumor cells. In this embodiment, the chromatin 3D conformation information of a sample is obtained through Hi-C experiments or substituted by multiple Hi-C datasets from public databases.

The present invention uses a molecular dynamics (MD) approach to develop a method for modeling the three-dimensional conformation of the human genome at a resolution of 500 kb (bin size). Each bin is a coarse-grained bead, and the complete genome is represented as beads-on-a-string structures made up of 23 polymer chains. The spatial positions of the beads are influenced by chromatin connectivity and chromatin activity: chromatin connectivity constrains linearly adjacent beads to lie close together in 3D, and chromatin activity ensures that active regions lie close to the center of the nucleus. Chromatin activity is determined from the compartment degree, which can be calculated directly from the Hi-C matrix described above. Each bead is assigned a distance from the nuclear center according to its compartment-degree index, and molecular dynamics is then used to optimize the chromatin conformation from random structures, applying bias potentials so that these distance constraints are satisfied. For each cell line, 300 feasible conformations are optimized from random initial conformations to reduce possible variation and facilitate further analysis.
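As a rough sketch of how the bead model could feed the spatial features (the <x, y, z> coordinates and the distance from the nuclear center), the code below maps a genomic position to its 500 kb bead and summarizes that bead over a set of modeled conformations. The patent does not say how the 300 conformations are combined, so averaging is an assumption made here, as are the data layout (one dictionary per conformation keyed by chromosome and bin), the nuclear center at the origin, and all function names.

```python
import math
from typing import Dict, Sequence, Tuple

BIN_SIZE = 500_000  # 500 kb resolution, as stated in the text

Coord = Tuple[float, float, float]
Conformation = Dict[Tuple[str, int], Coord]  # (chromosome, bin index) -> bead position


def bin_index(position_bp: int, bin_size: int = BIN_SIZE) -> int:
    """Index of the coarse-grained bead that contains a genomic coordinate."""
    return position_bp // bin_size


def spatial_features(
    conformations: Sequence[Conformation],
    chrom: str,
    position_bp: int,
    center: Coord = (0.0, 0.0, 0.0),  # assumed location of the nuclear center
) -> Tuple[float, float, float, float]:
    """Mean <x, y, z> of the bead and its mean distance from the nuclear center,
    averaged over all supplied conformations (averaging is an assumption)."""
    key = (chrom, bin_index(position_bp))
    coords = [conf[key] for conf in conformations]
    n = len(coords)
    mean_xyz = tuple(sum(c[axis] for c in coords) / n for axis in range(3))
    mean_dist = sum(math.dist(c, center) for c in coords) / n
    return (mean_xyz[0], mean_xyz[1], mean_xyz[2], mean_dist)
```

The mean distance is computed per conformation and then averaged, rather than taken from the averaged coordinates, because averaging coordinates across independently optimized structures can place the bead artificially close to the center.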

The prediction method uses the group-selection-based deep neural network prediction model to score and classify polypeptides encoded as input feature sets, and can identify the polypeptides most likely to activate T-cell immunogenicity. In this embodiment, the input of the model is the 5459 feature codes of a potential neoantigen peptide sequence, and the output is a score indicating whether the polypeptide can activate immunogenicity; the higher the score, the more strongly the peptide is predicted to activate T-cell immunogenicity. During training, as shown in Fig. 7, each feature of a sample belongs to at most one group, and a group contains one or more features. This allows a group selection strategy in which features that should be kept or discarded together, that is, group features, are selected or eliminated simultaneously, improving the model's self-learning effectiveness, reducing the risk of overfitting, and improving computational efficiency.
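One common way to obtain this all-or-nothing behavior is a group-lasso style penalty on the first-layer weights, combined with block soft-thresholding. The patent refers to regularization terms (Fig. 7c) but does not give the exact formula, so the sketch below is an assumed instance of the idea, not the patented regularizer; the penalty form, the square-root group-size weighting and all names are illustrative.

```python
import math
from typing import Dict, List, Sequence

Matrix = List[List[float]]  # first-layer weights: rows = input features, columns = hidden units


def group_norm(weights: Matrix, members: Sequence[int]) -> float:
    """Euclidean norm of all outgoing weights of the features in one group."""
    return math.sqrt(sum(w * w for i in members for w in weights[i]))


def group_lasso_penalty(weights: Matrix, groups: Dict[str, Sequence[int]], lam: float) -> float:
    """lam * sum over groups of sqrt(group size) * group norm (assumed penalty form)."""
    return lam * sum(math.sqrt(len(m)) * group_norm(weights, m) for m in groups.values())


def group_soft_threshold(
    weights: Matrix, groups: Dict[str, Sequence[int]], threshold: float
) -> Matrix:
    """Shrink each group's weights jointly; a group whose norm falls below its
    cut-off is zeroed as a whole, which removes all of its features at once."""
    shrunk = [row[:] for row in weights]
    for members in groups.values():
        norm = group_norm(weights, members)
        cut = threshold * math.sqrt(len(members))
        scale = 0.0 if norm <= cut else 1.0 - cut / norm
        for i in members:
            shrunk[i] = [scale * w for w in weights[i]]
    return shrunk


def selected_groups(weights: Matrix, groups: Dict[str, Sequence[int]]) -> List[str]:
    """Names of the groups that still have at least one non-zero weight."""
    return [name for name, m in groups.items() if group_norm(weights, m) > 0.0]
```

Because the shrinkage factor is shared by every weight in a group, no group can end up partially selected: either all of its input nodes keep their edges or all are pruned.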

The deep neural network prediction model of this embodiment takes the form of fully connected layers. Its output includes the immunogenic neoantigens and the several features most strongly associated with each; selecting the most critical features at prediction time makes the relationship between features and outputs clearer.
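A minimal sketch of a fully connected scoring network, together with a ranking of feature groups by the size of their first-layer weights, is given below. The patent does not state the activation functions, the layer sizes or the criterion used to rank features, so the ReLU hidden layers, the sigmoid output, the norm-based ranking and all names here are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]          # rows = inputs, columns = outputs
Layer = Tuple[Matrix, List[float]]  # (weights, biases)


def dense(x: Sequence[float], weights: Matrix, biases: Sequence[float]) -> List[float]:
    """Fully connected layer: y_j = sum_i x_i * W[i][j] + b_j."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j]
            for j in range(len(biases))]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def immunogenicity_score(x: Sequence[float], layers: Sequence[Layer]) -> float:
    """Forward pass: ReLU hidden layers (assumed), sigmoid on a single output unit."""
    h = list(x)
    for weights, biases in layers[:-1]:
        h = [max(0.0, v) for v in dense(h, weights, biases)]
    weights, biases = layers[-1]
    return sigmoid(dense(h, weights, biases)[0])


def top_groups(first_layer: Matrix, groups: Dict[str, Sequence[int]], k: int) -> List[str]:
    """Rank feature groups by the norm of their first-layer weights (assumed criterion)."""
    def norm(members: Sequence[int]) -> float:
        return math.sqrt(sum(w * w for i in members for w in first_layer[i]))
    return sorted(groups, key=lambda name: norm(groups[name]), reverse=True)[:k]
```

In use, `immunogenicity_score` would be applied to the 5459-element feature vector of a candidate peptide, and `top_groups` to the trained first-layer weights to report which feature groups the model relies on most.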

Figs. 2 to 6 compare the prediction results of the above prediction method (DNN-GFS) with those of other classification methods: DNN, SVM, LR, KNN, Neopepsee, pTuneos, DeepHLApan, NetMHCpan, NetMHC and IEDB immuno. Taken together, the figures show that DNN-GFS predicts neoantigens more effectively than the other, conventional machine learning algorithms.

After prediction with the above method, the high-scoring polypeptide sequences classified as positive are synthesized to obtain the predicted tumor neoantigens, whose immune efficacy can subsequently be tested in mice.

In another embodiment, a tumor neoantigen prediction device based on higher-order chromatin conformation and deep sparse learning is provided, comprising a data acquisition unit, a model training unit and a prediction unit. The data acquisition unit acquires samples for training, each sample including chromatin 3D conformation information and features generated from the polypeptide amino acid sequence. The model training unit trains on the samples to obtain a deep neural network prediction model based on group selection. The prediction unit acquires an object to be predicted and processes it with the group-selection-based deep neural network prediction model to obtain tumor neoantigen information corresponding to the object to be predicted.

In another embodiment, a computer-readable storage medium is provided, comprising a computer program that can be executed by a processor to implement the prediction method described above.

In another embodiment, a web interface is provided which, after receiving an object to be predicted, uses the above prediction method to quickly obtain a tumor neoantigen prediction result.

Preferred specific embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning or limited experimentation in accordance with the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims

1. A tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning, characterized in that the method processes an object to be predicted with a trained deep neural network prediction model based on group feature selection to obtain tumor neoantigen immunogenicity information corresponding to the object to be predicted;

wherein the feature set of each sample used to train the deep neural network prediction model is composed of chromatin 3D conformation information and other features generated from the polypeptide amino acid sequence, the chromatin 3D conformation information being obtained on the basis of molecular dynamics; each feature of a sample belongs to a group, and during training of the neural network model the features within a given group are either all selected or all eliminated;

the features of each sample include: the <x, y, z> 3D coordinates, in chromatin 3D space, of the DNA site corresponding to the target peptide; the distance of the DNA site corresponding to the target peptide from the center of the nucleus or from the nuclear membrane; the HLA subtype code of the MHC molecule presenting the antigen peptide; the frequency of each of the 20 amino acids in the target peptide; the sparse amino acid encoding of the antigen peptide sequence; the BLOSM amino acid encoding of the antigen peptide sequence; the BLOMAP amino acid encoding of the antigen peptide sequence; the amino acid side-chain class encoding of the antigen peptide sequence; the amino acid side-chain polarity encoding of the antigen peptide sequence; the amino acid side-chain charge encoding of the antigen peptide sequence; the amino acid side-chain hydrophilicity/hydrophobicity encoding of the antigen peptide sequence; the amino acid side-chain molecular weight encoding of the antigen peptide sequence; the encoding of the frequency with which the amino acid side chains of the antigen peptide sequence occur in biological populations; and encodings based on all amino acid AAindex indices listed in the AAindex database.

2. The tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning according to claim 1, wherein the number of features of each sample is on the order of thousands.

3. The tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning according to claim 1, wherein the output of the deep neural network prediction model based on group feature selection includes a prediction of the tumor immunogenicity of a potential neoantigen and the several features most strongly associated with that neoantigen.

4. The tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning according to claim 1, wherein the deep neural network prediction model based on group feature selection takes the form of fully connected layers.

5. The tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning according to claim 1, wherein the chromatin 3D conformation information is derived from a heat map matrix of cellular chromatin 3D conformation obtained by Hi-C experiments.

6. The tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning according to claim 1, wherein the chromatin 3D conformation information is obtained from a public Hi-C dataset.

7. The tumor neoantigen prediction method based on higher-order chromatin conformation and deep sparse learning according to claim 1, wherein the features generated from the polypeptide amino acid sequence include features of polypeptides containing an amino acid mutation site and high-expression information of the gene containing that mutation.

8. A tumor neoantigen prediction device based on higher-order chromatin conformation and deep sparse learning, characterized in that it comprises:

a data acquisition unit, which acquires samples for training, the feature set of each sample being composed of chromatin 3D conformation information and other features generated from the polypeptide amino acid sequence, and the chromatin 3D conformation information being obtained using a human genome three-dimensional conformation modeling method;

a model training unit, which trains on the samples to obtain a deep neural network prediction model based on group feature selection, wherein each feature of a sample belongs to a group, and during training of the neural network model the features within a given group are either all selected or all eliminated; and

a prediction unit, which acquires an object to be predicted and processes it with the deep neural network prediction model based on group feature selection to obtain tumor neoantigen immunogenicity information corresponding to the object to be predicted.

9. A computer-readable storage medium, comprising a computer program that can be executed by a processor to implement the prediction method according to any one of claims 1-7.