CN115497564A - A method for establishing an identification antigen model and an identification method for an antigen - Google Patents
A method for establishing an identification antigen model and an identification method for an antigen
- Publication number
- CN115497564A (application CN202211066490.6A)
- Authority
- CN
- China
- Prior art keywords
- antigen
- neural network
- tcr
- training
- pmhc
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Description
Technical field
The invention relates to the technical field of neural networks and immune response prediction, and in particular to a method for establishing an antigen identification model and a method for identifying an antigen.
Background art
Immunotherapy is a new technology that can be used to treat cancer. The most important step in this approach is activating the human adaptive immune response, that is, mediating the recognition and destruction of diseased cells by cytotoxic T lymphocytes.
In the prior art, T cell antigens are generally identified by direct detection, by modeling and prediction, or by machine learning. However, these methods rely on limited data in practice, so the peptides they can identify are very limited. Because of the complexity of this response, it is still poorly understood which antigens can activate an immune response in T lymphocytes.
Summary of the invention
The technical problem to be solved by the invention is how to identify the antigens associated with activated T cells.
To solve the above problem, the invention provides a method for establishing an antigen identification model, comprising: constructing a neural network, wherein the neural network comprises a TCR feature extraction neural network, a pMHC feature extraction neural network, and an antigen identification neural network; constructing a data set, wherein the data set comprises TCR CDR3, HLA I, and antigen sequences; and inputting the data set into the neural network and training the neural network to establish the antigen identification model.
Preferably, Z descriptors are used to describe the TCR CDR3, HLA I, and antigen sequences in the data set, and the resulting matrices are normalized to obtain a TCR CDR3 sequence matrix, an HLA I sequence matrix, and an antigen sequence matrix, respectively.
Preferably, inputting the data set into the neural network and training the neural network comprises: training the TCR feature extraction neural network with the TCR CDR3 sequence matrix as the training set; inputting the TCR CDR3 sequence matrix into the trained TCR feature extraction neural network to obtain TCR feature vectors;
training the pMHC feature extraction neural network with the HLA I sequence matrix and the antigen sequence matrix as the training set; inputting the HLA I sequence matrix and the antigen sequence matrix into the trained pMHC feature extraction neural network to obtain pMHC feature vectors;
and training the antigen identification neural network with the TCR feature vectors and the pMHC feature vectors as the training set.
Preferably, before the antigen identification neural network is trained, the method further comprises: balancing the training set composed of the TCR feature vectors and the pMHC feature vectors using the SMOTE algorithm.
Preferably, training the antigen identification neural network comprises: using the binding status of each TCR feature vector and pMHC feature vector pair (whether or not the two can bind) as the classification label.
Preferably, the training set composed of the TCR feature vectors and the pMHC feature vectors includes artificial negative binding data.
Preferably, the TCR feature extraction neural network comprises an encoder module, a feature extraction module, and a decoder module;
the convolutional layer module used by the encoder module comprises four Conv1D layers;
the feature extraction module comprises one fully connected layer for outputting TCR sequence features;
the decoder module comprises four Conv1D layers.
Preferably, the pMHC feature extraction neural network comprises an HLA feature extraction module, an antigen feature extraction module, a feature extraction module, and a label training module;
the HLA feature extraction module comprises four Conv1D layers, one Reshape layer, and one fully connected layer;
the antigen feature extraction module comprises four Conv1D layers, one Reshape layer, and one fully connected layer;
the feature extraction module comprises one fully connected layer for outputting pMHC sequence features;
the label training module comprises two fully connected layers.
Preferably, the antigen identification neural network comprises a TCR feature learning module, a pMHC feature learning module, and an output module;
the TCR feature learning module comprises two fully connected layers;
the pMHC feature learning module comprises two fully connected layers;
the output module uses three fully connected layers for outputting the antigen identification result.
The method for establishing an antigen identification model of the invention describes sequence features, uses different neural networks to extract them with dimensionality reduction, and uses a supervised deep learning neural network that learns the reduced sequence features, thereby achieving the goal of identifying antigens associated with activated T cells. To safeguard the training of the model, the invention uses SMOTE to rebuild the training data set so that its classes are more balanced, which avoids overfitting associated with the under-represented class and improves model performance. In addition, with the model structure fixed, several training set to test set ratios were tried, with multiple training runs for each ratio; the consistently high validation rate shows that the model has high stability and prediction accuracy. Finally, the series of neural networks designed in the invention can extract features from the sequence information of TCR CDR3, HLA I, and antigens, and can therefore identify the antigens associated with activated T cells.
The invention also provides a method for identifying an antigen, comprising: obtaining the TCR CDR3, HLA I, and antigen sequences to be tested, and inputting them into the antigen identification model established by any of the model establishment methods described above to obtain the antigen identification result. The advantages of this method over the prior art are the same as those of the model establishment method described above and are not repeated here.
Description of the drawings
Fig. 1 is a flowchart of the method for establishing an antigen identification model according to an embodiment of the invention;
Fig. 2 is a schematic structural diagram of the TCR feature extraction neural network according to an embodiment of the invention;
Fig. 3 is a schematic structural diagram of the pMHC feature extraction neural network according to an embodiment of the invention;
Fig. 4 is a schematic structural diagram of the antigen identification neural network according to an embodiment of the invention;
Fig. 5 is a schematic flowchart of the identification of activated T cell-associated antigens according to an embodiment of the invention;
Fig. 6 is the first schematic structural diagram of the device for establishing an antigen identification model according to an embodiment of the invention;
Fig. 7 is the second schematic structural diagram of the device for establishing an antigen identification model according to an embodiment of the invention.
Detailed description
To make the above objects, features, and advantages of the invention easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, an embodiment of the invention provides a method for establishing an antigen identification model, comprising the following steps: constructing a neural network, wherein the neural network comprises a TCR feature extraction neural network, a pMHC feature extraction neural network, and an antigen identification neural network; constructing a data set, wherein the data set comprises TCR CDR3, HLA I, and antigen sequences; and inputting the data set into the neural network and training the neural network to establish the antigen identification model.
Specifically, the method comprises constructing three networks. For the TCR feature extraction neural network, the TCR CDR3 sequence matrix serves as the input of the TCR feature extraction module for unsupervised training, yielding an autoencoder neural network that can extract TCR sequence features. For the pMHC feature extraction neural network, the HLA I sequence matrix and the antigen sequence matrix serve as the input of the pMHC feature extraction module, yielding a convolution-based neural network that can extract pMHC sequence features. For the antigen identification neural network, the TCR feature vectors and pMHC feature vectors serve as the training set, and the known binding status of each pair serves as the classification label for training the network to identify antigens.
To construct the data set, Z descriptors are used to describe the TCR CDR3, HLA I, and antigen sequences in the data set, and the resulting matrices are normalized to obtain the TCR CDR3 sequence matrix, the HLA I sequence matrix, and the antigen sequence matrix, respectively.
The data set is input into the neural network and the neural network is trained to establish the antigen identification model: the TCR feature extraction neural network is trained with the TCR CDR3 sequence matrix as the training set, and the TCR CDR3 sequence matrix is then input into the trained TCR feature extraction neural network to obtain TCR feature vectors;
the pMHC feature extraction neural network is trained with the HLA I sequence matrix and the antigen sequence matrix as the training set, and the HLA I sequence matrix and the antigen sequence matrix are then input into the trained pMHC feature extraction neural network to obtain pMHC feature vectors;
the antigen identification neural network is trained with the TCR feature vectors and the pMHC feature vectors as the training set.
In this embodiment, the neural network is constructed, the data set is input into it, and the feature extraction networks are trained to produce feature vectors. These feature vectors are then used as the training set for the antigen identification neural network, which establishes the antigen identification model and achieves the goal of identifying antigens associated with activated T cells.
Preferably, the method for establishing an antigen identification model further comprises: using Z descriptors to describe the TCR CDR3, HLA I, and antigen sequences in the data set, and normalizing the resulting matrices to obtain the TCR CDR3 sequence matrix, the HLA I sequence matrix, and the antigen sequence matrix, respectively.
Specifically, Z descriptors are highly condensed descriptors derived by principal component analysis (PCA) of 29 experimentally measured or calculated physicochemical properties of all naturally occurring amino acids. Each amino acid in a protein sequence is represented by three Z descriptors, which represent the hydrophobicity, steric properties, and polarity of the amino acid, respectively. Each protein sequence is described as a 3*N matrix, where N is the length of the protein sequence. After normalization, every value in the matrix lies between -1 and 1.
In this embodiment, normalizing the descriptor matrices makes subsequent data processing more convenient and speeds up learning.
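The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the descriptor values in `Z_TABLE` are placeholders rather than the published Z-descriptor table, the table covers only four residues, and the normalization (division by the largest absolute descriptor value) is an assumed choice, since the text does not say how the matrices are normalized.

```python
from typing import Dict, List, Tuple

# PLACEHOLDER values for illustration only; these are NOT the published
# Z-descriptor table. Substitute the real (z1, z2, z3) triple per amino acid.
Z_TABLE: Dict[str, Tuple[float, float, float]] = {
    "A": (0.1, -1.7, 0.1),
    "C": (0.7, -1.0, 4.1),
    "S": (2.0, -1.6, 0.6),
    "F": (-4.9, 1.3, 0.5),
}


def encode_sequence(
    seq: str, table: Dict[str, Tuple[float, float, float]]
) -> List[List[float]]:
    """Return a 3 x N matrix: one row per Z descriptor, one column per residue."""
    columns = [table[residue] for residue in seq]  # KeyError on unknown residues
    return [[column[i] for column in columns] for i in range(3)]


def normalize(matrix: List[List[float]], scale: float) -> List[List[float]]:
    """Divide every entry by a fixed scale so that all values lie in [-1, 1]."""
    return [[value / scale for value in row] for row in matrix]


# Assumed normalization constant: the largest absolute value in the table.
SCALE = max(abs(value) for triple in Z_TABLE.values() for value in triple)

matrix = normalize(encode_sequence("CASSF", Z_TABLE), SCALE)
print(len(matrix), len(matrix[0]))  # 3 5
```

Using one table-wide scale keeps the encoding of a residue identical across sequences, which per-sequence scaling would not.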
Preferably, inputting the data set into the neural network and training the neural network comprises:
training the TCR feature extraction neural network with the TCR CDR3 sequence matrix as the training set; inputting the TCR CDR3 sequence matrix into the trained TCR feature extraction neural network to obtain TCR feature vectors;
training the pMHC feature extraction neural network with the HLA I sequence matrix and the antigen sequence matrix as the training set; inputting the HLA I sequence matrix and the antigen sequence matrix into the trained pMHC feature extraction neural network to obtain pMHC feature vectors;
training the antigen identification neural network with the TCR feature vectors and the pMHC feature vectors as the training set.
Specifically, the binding affinity between the HLA and the antigen is used as the training label of the pMHC feature extraction module for neural network training. The binding affinity is a value between 0 and 1: the closer the value is to 1, the higher the binding affinity between the two, and the closer it is to 0, the lower. Through this step, the neural network reduces a two-dimensional matrix of HLA features and a two-dimensional matrix of antigen features by feature extraction and extracts binding features according to the binding affinity. The extracted pMHC sequence features are represented as a series of vectors of length 50.
Preferably, before the antigen identification neural network is trained, the method for establishing an antigen identification model further comprises: balancing the training set composed of the TCR feature vectors and the pMHC feature vectors using the SMOTE algorithm.
Specifically, balancing the training set with the SMOTE algorithm further comprises constructing new minority-class samples and combining them with the original data into a new training set. Published data sets report few binding-negative examples, and this imbalance can cause overfitting when a deep learning network is trained. For each existing minority-class sample, SMOTE randomly selects samples from among its nearest neighbors and synthesizes new minority samples, which are added to the training set used to train the neural network.
In this embodiment, the SMOTE algorithm balances the training set by analyzing and simulating the minority-class samples and adding the artificially simulated samples to the data set, so that the classes in the original data set are no longer severely imbalanced. This addresses the imbalance and helps prevent overfitting.
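The neighbor interpolation at the core of SMOTE can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the patent: the function name `smote`, the neighbor count `k`, and the toy two-dimensional samples are assumptions, and a maintained library implementation would normally be preferred in practice.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(
    minority: List[Vector], n_new: int, k: int = 5, seed: int = 0
) -> List[List[float]]:
    """Create n_new synthetic minority samples by neighbor interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest to the base first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: squared_distance(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        gap = rng.random()  # position on the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy stand-ins for binding-negative (minority class) feature vectors.
negatives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(negatives, n_new=6, k=2)
balanced_negatives = negatives + new_samples
print(len(balanced_negatives))  # 10
```

Each synthetic point lies on the segment between a real minority sample and one of its neighbors, so no value falls outside the range already covered by the minority class.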
Preferably, training the antigen identification neural network comprises: using the binding status of each TCR feature vector and pMHC feature vector pair (whether or not the two can bind) as the classification label.
Specifically, the TCR feature vectors and pMHC feature vectors obtained above are used as the training set for the antigen identification neural network, and the known binding status of each pair is used as the classification label for training the antigen identification module. The classification label is 0 or 1, representing binding-negative and binding-positive pairs, respectively.
In this embodiment, representing the binding status of each pair with a classification label helps the neural network identify antigens more accurately in subsequent training.
Preferably, training the antigen identification neural network further comprises: testing the trained neural network on a test set.
Specifically, the neural network structure is kept unchanged while the ratio of data between the validation set and the test set is varied. For each ratio, training is performed 50 times, each trained network is evaluated on the test set, and the best-performing network is selected as the final antigen identification module.
In this embodiment, training is repeated many times, and the validation rate of the model stays at a consistently high level even as the model parameters change, which demonstrates the accuracy and stability of the model.
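The repeated-training procedure can be sketched as a simple sweep. Everything specific here is an assumption made for illustration: the callable `train_and_score` stands in for training the fixed network and scoring it on the held-out data, the ratio is expressed as a held-out fraction, and the dummy scorer in the example merely returns the share of data used for training.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split(data: Sequence, test_fraction: float, rng: random.Random):
    """Shuffle the data and hold out roughly test_fraction of it."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def sweep(
    data: Sequence,
    test_fractions: List[float],
    train_and_score: Callable[[list, list], float],
    runs: int = 50,
    seed: int = 0,
) -> Tuple[float, float, int]:
    """Return (best score, fraction, run index) over all ratios and runs."""
    rng = random.Random(seed)
    best = None
    for fraction in test_fractions:
        for run in range(runs):
            train, test = split(data, fraction, rng)
            score = train_and_score(train, test)
            if best is None or score > best[0]:
                best = (score, fraction, run)
    return best


def dummy_scorer(train: list, test: list) -> float:
    """Hypothetical stand-in for real training and evaluation."""
    return len(train) / (len(train) + len(test))


best = sweep(list(range(10)), [0.2, 0.3], dummy_scorer)
print(best)  # (0.8, 0.2, 0)
```

Note that choosing the final network by its test-set score, as the text describes, means that score is no longer an unbiased estimate of performance on unseen data.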
Preferably, the training set composed of the TCR feature vectors and the pMHC feature vectors includes artificial negative binding data.
Specifically, one or more amino acids in a TCR sequence from the database used in this study are randomly replaced. If the resulting sequence does not exist in the database of real TCR sequences, it is treated as an artificial negative TCR sequence and added to a negative TCR sequence library. Pairing any TCR sequence from the negative library with any known pMHC sequence yields a binding-negative TCR-pMHC pair.
In this embodiment, adding a small amount of artificial negative binding data to the training set lets the antigen identification neural network learn the characteristics of that data, so that the network can learn more error patterns and be trained more accurately.
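The construction of artificial negative pairs can be sketched as follows. The function names, the cap of two substitutions per sequence, the representation of a pMHC as an (HLA name, peptide) tuple, and the example sequences are all illustrative assumptions, not details from the patent.

```python
import random
from typing import List, Set, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutate(seq: str, rng: random.Random, max_mutations: int = 2) -> str:
    """Replace one or more randomly chosen residues with a different one."""
    residues = list(seq)
    n_mutations = rng.randint(1, min(max_mutations, len(residues)))
    for position in rng.sample(range(len(residues)), n_mutations):
        choices = [aa for aa in AMINO_ACIDS if aa != residues[position]]
        residues[position] = rng.choice(choices)
    return "".join(residues)


def make_negative_pairs(
    real_tcrs: Set[str],
    pmhcs: List[Tuple[str, str]],
    n_pairs: int,
    seed: int = 0,
) -> List[Tuple[str, Tuple[str, str], int]]:
    """Return (mutated TCR, pMHC, label 0) triples for the negative class."""
    rng = random.Random(seed)
    pool = sorted(real_tcrs)  # fixed order keeps the sketch reproducible
    negatives = []
    while len(negatives) < n_pairs:
        candidate = mutate(rng.choice(pool), rng)
        if candidate in real_tcrs:
            continue  # the mutation recreated a real TCR, so discard it
        negatives.append((candidate, rng.choice(pmhcs), 0))
    return negatives


# Illustrative inputs only.
real_tcrs = {"CASSLGQAYEQYF", "CASSIRSSYEQYF"}
pmhcs = [("HLA-A*02:01", "GILGFVFTL")]
negative_pairs = make_negative_pairs(real_tcrs, pmhcs, n_pairs=3)
print(len(negative_pairs))  # 3
```

The membership check against `real_tcrs` mirrors the rule in the text that a mutated sequence only counts as negative if it is absent from the database of real TCR sequences.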
Preferably, as shown in Fig. 2, the TCR feature extraction neural network comprises an encoder module, a feature extraction module, and a decoder module;
the convolutional layer module used by the encoder module comprises four Conv1D layers;
the feature extraction module comprises one fully connected layer for outputting TCR sequence features;
the decoder module comprises four Conv1D layers.
Specifically, this network is an autoencoder trained without supervision: its parameters are updated by comparing the similarity between input and output until that similarity reaches a specified threshold. The encoder module and the decoder module are mirror-symmetric in structure. In the encoder module, the first two Conv1D layers have a kernel size of 3 and the last two have a kernel size of 2.
In this embodiment, the TCR sequence feature extraction module and its submodules can effectively extract the TCR CDR3 sequence matrix, which improves the training of the neural network.
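As a small worked example, the sequence length after each encoder convolution can be traced with the standard output-length formula for a one-dimensional convolution. The kernel sizes come from the text; the stride of 1, the absence of padding, and the input length of 20 residues are assumptions, since the text does not specify them.

```python
from typing import List


def conv1d_output_length(
    length: int, kernel_size: int, stride: int = 1, padding: int = 0
) -> int:
    """Output length of a 1D convolution along the sequence axis."""
    return (length + 2 * padding - kernel_size) // stride + 1


# Kernel sizes of the four encoder Conv1D layers, as stated in the text.
ENCODER_KERNELS = [3, 3, 2, 2]


def trace_lengths(input_length: int, kernels: List[int]) -> List[int]:
    """Return the sequence length before and after each convolution."""
    lengths = [input_length]
    for kernel_size in kernels:
        input_length = conv1d_output_length(input_length, kernel_size)
        if input_length < 1:
            raise ValueError("sequence too short for this stack of kernels")
        lengths.append(input_length)
    return lengths


print(trace_lengths(20, ENCODER_KERNELS))  # [20, 18, 16, 15, 14]
```

Under these assumptions the four layers shorten the sequence by six positions in total, so an input shorter than seven residues would need padding.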
Preferably, as shown in Fig. 3, the pMHC feature extraction neural network comprises an HLA feature extraction module, an antigen feature extraction module, a feature extraction module, and a label training module;
the HLA feature extraction module comprises four Conv1D layers, one Reshape layer, and one fully connected layer;
the antigen feature extraction module comprises four Conv1D layers, one Reshape layer, and one fully connected layer;
the feature extraction module comprises one fully connected layer for outputting pMHC sequence features;
the label training module comprises two fully connected layers.
Specifically, the Conv1D layers in the HLA feature extraction module have a kernel size of 2, as do the Conv1D layers in the antigen feature extraction module. The Reshape layer flattens the two-dimensional output of the convolutional layers into a one-dimensional structure.
In this embodiment, the pMHC sequence feature extraction module and its submodules can effectively extract the TCR CDR3 sequence matrix, which improves the training of the neural network.
Preferably, as shown in Fig. 4, the antigen identification neural network comprises a TCR feature learning module, a pMHC feature learning module, and an output module;
the TCR feature learning module comprises two fully connected layers;
the pMHC feature learning module comprises two fully connected layers;
the output module uses three fully connected layers for outputting the antigen identification result.
Specifically, the input of this network is the output of the TCR sequence feature extraction network and of the pMHC sequence feature extraction network.
In this embodiment, the antigen identification module can effectively extract the antigen sequence matrix, which improves the training of the neural network.
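The layer arrangement of the antigen identification network can be sketched as a forward pass in plain Python. Only the layer counts (two, two, and three fully connected layers) and the pMHC feature length of 50 come from the text. The TCR feature length, the layer widths, the ReLU and sigmoid activations, the merging of the two branches by concatenation, the omission of bias terms, and the random untrained weights are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List

Vector = List[float]
Matrix = List[Vector]


def make_dense(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """Random weights for one fully connected layer (untrained placeholder)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def relu(z: float) -> float:
    return max(0.0, z)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dense(x: Vector, weights: Matrix, activation: Callable[[float], float]) -> Vector:
    """Apply one fully connected layer (no bias term, for brevity)."""
    return [activation(sum(w * v for w, v in zip(row, x))) for row in weights]


rng = random.Random(0)
# Hypothetical widths; the text gives only the number of layers per module.
tcr_branch = [make_dense(50, 32, rng), make_dense(32, 16, rng)]
pmhc_branch = [make_dense(50, 32, rng), make_dense(32, 16, rng)]
output_head = [make_dense(32, 16, rng), make_dense(16, 8, rng), make_dense(8, 1, rng)]


def predict(tcr_features: Vector, pmhc_features: Vector) -> float:
    """Return a score in (0, 1); values near 1 would indicate binding."""
    t = tcr_features
    for weights in tcr_branch:
        t = dense(t, weights, relu)
    p = pmhc_features
    for weights in pmhc_branch:
        p = dense(p, weights, relu)
    hidden = t + p  # list concatenation joins the two branch outputs
    for weights in output_head[:-1]:
        hidden = dense(hidden, weights, relu)
    return dense(hidden, output_head[-1], sigmoid)[0]


score = predict([0.5] * 50, [-0.5] * 50)
print(0.0 < score < 1.0)  # True
```

A sigmoid on the final layer matches the 0 and 1 classification labels described earlier, although the text itself does not name the output activation.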
如图5所示,本发明另一实施例还提供一种鉴定抗原方法,包括:获取需要检测的TCR CDR3、HLA I和抗原序列,将需要检测的TCR CDR3、HLA I和抗原序列,输入鉴定抗原模型建立方法所建立的鉴定抗原模型中,得到抗原的鉴定结果。As shown in Figure 5, another embodiment of the present invention also provides a method for identifying antigens, including: obtaining the TCR CDR3, HLA I and antigen sequences that need to be detected, and inputting the TCR CDR3, HLA I and antigen sequences that need to be detected for identification In the identified antigen model established by the antigen model establishment method, the identification result of the antigen is obtained.
如图6所示,本发明另一实施例还提供一种鉴定抗原模型建立装置,包括:构建模块,构建模块用于构建神经网络,其中,神经网络包括TCR特征提取神经网络、pMHC特征提取神经网络和抗原鉴定神经网络;还用于构建构建数据集,其中,数据集包括TCR CDR3、HLA I和抗原序列;训练模块,训练模块用于将数据集输入神经网络,对神经网络进行训练,以建立鉴定抗原模型。As shown in Figure 6, another embodiment of the present invention also provides a device for establishing an identification antigen model, including: a building block, the building block is used to build a neural network, wherein the neural network includes a TCR feature extraction neural network, a pMHC feature extraction neural network Network and antigen identification neural network; also be used for constructing and constructing data set, wherein, data set comprises TCR CDR3, HLA I and antigen sequence; Training module, training module is used for inputting data set into neural network, neural network is trained, with Establish identification antigen model.
As shown in Figure 7, another embodiment of the present invention provides an apparatus for establishing an antigen identification model, comprising a memory and a processor: the memory is configured to store a computer program; the processor is configured to implement the antigen identification method described above when executing the computer program.
Another embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when read and run by a processor, implements the method for establishing an antigen identification model or the antigen identification method described above.
Although the present invention is disclosed as above, its scope of protection is not limited thereto. Those skilled in the art may make various changes and modifications without departing from the spirit and scope of the present disclosure, and all such changes and modifications fall within the scope of protection of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211066490.6A CN115497564A (en) | 2022-09-01 | 2022-09-01 | A method for establishing an identification antigen model and an identification method for an antigen |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115497564A (en) | 2022-12-20 |
Family
ID=84468609
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211066490.6A Pending CN115497564A (en) | 2022-09-01 | 2022-09-01 | A method for establishing an identification antigen model and an identification method for an antigen |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115497564A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117095825A (en) * | 2023-10-20 | 2023-11-21 | 鲁东大学 | Human immune state prediction method based on multi-instance learning |
CN117095825B (en) * | 2023-10-20 | 2024-01-05 | 鲁东大学 | Human immune state prediction method based on multi-instance learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109829057B (en) | Knowledge graph entity semantic space embedding method based on graph second-order similarity | |
CN105138973B (en) | The method and apparatus of face authentication | |
CN105956560B (en) | A kind of model recognizing method based on the multiple dimensioned depth convolution feature of pondization | |
CN111582225A (en) | A kind of remote sensing image scene classification method and device | |
CN114692732B (en) | A method, system, device and storage medium for online label updating | |
CN109977893B (en) | Deep multitask pedestrian re-identification method based on hierarchical saliency channel learning | |
CN109086886A (en) | A kind of convolutional neural networks learning algorithm based on extreme learning machine | |
CN111429977B (en) | Novel molecular similarity search algorithm based on attention of graph structure | |
CN117690178B (en) | A method and system for face image recognition based on computer vision | |
CN103440471A (en) | Human body action identifying method based on lower-rank representation | |
CN112784921A (en) | Task attention guided small sample image complementary learning classification algorithm | |
CN115147641A (en) | A Video Classification Method Based on Knowledge Distillation and Multimodal Fusion | |
CN113066528B (en) | A Protein Classification Method Based on Active Semi-Supervised Graph Neural Networks | |
CN112364974B (en) | YOLOv3 algorithm based on activation function improvement | |
CN116959581A (en) | Training method, device, equipment and storage medium for immunogenicity prediction model | |
CN111310648B (en) | Cross-modal biometric feature matching method and system based on disentanglement expression learning | |
CN111797705A (en) | An Action Recognition Method Based on Character Relationship Modeling | |
CN116416334A (en) | Scene graph generation method of embedded network based on prototype | |
CN115497564A (en) | A method for establishing an identification antigen model and an identification method for an antigen | |
Mohana et al. | Emotion recognition from facial expression using hybrid CNN–LSTM network | |
Kotwal et al. | Yolov5-based convolutional feature attention neural network for plant disease classification | |
CN113838524B (en) | S-nitrosylation site prediction method, model training method and storage medium | |
CN108496174B (en) | Method and system for face recognition | |
Yücel et al. | Classification of tea leaves diseases by developed CNN, feature fusion, and classifier based model | |
CN114972904A (en) | Zero sample knowledge distillation method and system based on triple loss resistance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||