CN117854601A - Method for classifying complementarity-determining regions based on gene type and amino acid sequence

Info

Publication number
CN117854601A
CN117854601A CN202410240576.9A CN202410240576A CN117854601A CN 117854601 A CN117854601 A CN 117854601A CN 202410240576 A CN202410240576 A CN 202410240576A CN 117854601 A CN117854601 A CN 117854601A
Authority
CN
China
Prior art keywords
amino acid
acid sequence
gene
decisive
word vector
Legal status
Granted
Application number
CN202410240576.9A
Other languages
Chinese (zh)
Other versions
CN117854601B (en)
Inventor
刘峻江
周树森
臧睦君
刘通
柳婵娟
王庆军
Current Assignee
Ludong University
Original Assignee
Ludong University
Priority date
2024-03-04
Filing date
2024-03-04
Publication date
2024-04-09
Application filed by Ludong University
Priority to CN202410240576.9A
Publication of CN117854601A
Application granted
Publication of CN117854601B
Current legal status: Active

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention belongs to the field of bioinformatics and relates to a method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence. The method fuses the variable (V) gene segment, the joining (J) gene segment and the amino acid sequence of the CDR of a T cell receptor in order to improve CDR classification performance. A novel preprocessing method combines the gene type and the amino acid sequence of the CDR into new data. Two convolutional neural network modules then extract the features of the variable gene segment and the joining gene segment respectively, and the CDR is classified from the features extracted by convolution. The method comprises three steps: preprocessing of gene types and amino acid sequences, construction of a feature extraction module, and construction of a feature classification module. The method effectively fuses the gene type and amino acid sequence features of the CDR, achieves better classification results, and is significant for research on clinical treatment and immune processes.

Description

Method for classifying complementarity-determining regions based on gene type and amino acid sequence
Technical Field
The invention belongs to the field of bioinformatics and relates to a method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence.
Background
The complementarity-determining region (CDR) lies in the variable region of the T cell receptor, adjacent to the variable (V) gene segment and the joining (J) gene segment. Its function is directly related to the type of the variable gene segment, the type of the joining gene segment, and its own amino acid sequence.
Different CDRs enable T cells to recognize different antigens, so accurate classification of CDRs is important for understanding human immune function and for treating disease.
Most current CDR classification methods rely only on the amino acid sequence of the CDR. A few methods use both the gene type and the amino acid sequence but process the two kinds of data separately, and therefore cannot fuse them effectively. How to fully fuse the variable gene type, the joining gene type and the amino acid sequence of a CDR for classification therefore remains a major difficulty in the field.
Disclosure of Invention
To overcome this difficulty, the invention provides a method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence. The method fully fuses the variable gene segment, the joining gene segment and the amino acid sequence of the CDR of a T cell receptor, thereby improving classification performance.
The CDR classification method based on gene type and amino acid sequence comprises three steps: preprocessing of gene types and amino acid sequences, construction of a feature extraction module, and construction of a feature classification module. The specific steps are as follows:
Step 1: represent each amino acid in the amino acid sequence with a word vector. After the amino acid sequence is converted into its word vector matrix, combine the matrix with the variable gene type data and the joining gene type data respectively to obtain a variable gene feature matrix and a joining gene feature matrix;
Step 2: construct a feature extraction module consisting of two convolution modules of identical structure, each containing two convolutional neural networks arranged in parallel. The two convolution modules extract features from the variable gene feature matrix and the joining gene feature matrix obtained in step 1, respectively;
Step 3: construct a feature classification module consisting of two fully connected layers, which takes the features extracted in step 2 as input and outputs the prediction result.
Step 1 of the method is implemented as follows:
The nn.Embedding function of the PyTorch framework is used to generate one word vector for each type of amino acid, plus one additional padding word vector that carries no meaning. The amino acid sequence of the complementarity-determining region (CDR) is converted into its word vector matrix using the generated word vectors. A feature matrix is generated for the variable gene and for the joining gene of the CDR; each matrix has dimensions (number of gene types) × (amino acid sequence length) × (word vector dimension). In each matrix, the row corresponding to the gene type of the CDR is replaced by the word vector matrix of the amino acid sequence, and the remaining positions are filled with the padding word vector.
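The row-placement idea in step 1 can be sketched in plain Python before any word vectors are involved. This is a minimal illustration, not the patented implementation: the residue ordering in `AMINO_ACIDS`, the right-padding of short sequences, the 0-based row index, the toy sizes, and all function names are assumptions made for the example.

```python
# Hypothetical names throughout; only the layout (one row per gene type,
# index 0 reserved for padding) comes from the description.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, order assumed
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 means padding


def encode_sequence(sequence, max_length):
    """Map residues to indices 1..20 and right-pad with 0 up to max_length."""
    if len(sequence) > max_length:
        raise ValueError("sequence longer than max_length")
    indices = [AA_TO_INDEX[aa] for aa in sequence]
    return indices + [0] * (max_length - len(indices))


def gene_index_matrix(sequence, gene_type, num_gene_types, max_length):
    """Return a num_gene_types x max_length matrix that is all padding
    except for the row of the observed gene type (0-based here)."""
    matrix = [[0] * max_length for _ in range(num_gene_types)]
    matrix[gene_type] = encode_sequence(sequence, max_length)
    return matrix


# Toy sizes for readability: 4 gene types, maximum length 8.
v_matrix = gene_index_matrix("CASSL", gene_type=1, num_gene_types=4, max_length=8)
```

Each index in the resulting matrix is later swapped for its word vector, which adds the third dimension of the feature matrix.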
Step 2 of the method is implemented as follows:
Two convolution modules are constructed using the nn.Conv2d function of the PyTorch framework, each containing two parallel convolutional neural networks with different convolution kernels. The first convolution module takes the variable gene feature matrix from step 1 as input, and the second takes the joining gene feature matrix from step 1 as input.
Step 3 of the method is implemented as follows:
Two fully connected layers are constructed using the nn.Linear function of the PyTorch framework. The features extracted by the two convolution modules in step 2 are concatenated and passed into the fully connected layers, which classify the complementarity-determining region by transforming the dimension of the features.
Drawings
FIG. 1 is a flow chart of the method for classifying complementarity-determining regions based on gene type and amino acid sequence.
FIG. 2 is a flow chart of the preprocessing of gene types and amino acid sequences.
FIG. 3 is a flow chart of the feature extraction module.
FIG. 4 is a flow chart of the feature classification module.
Detailed Description
The invention is described in detail below with reference to the drawings and examples.
The invention provides a method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence.
As shown in the flow chart of FIG. 1, the method comprises three steps: preprocessing of gene types and amino acid sequences, construction of a feature extraction module, and construction of a feature classification module. The specific implementation is as follows:
Step 1: preprocessing of gene types and amino acid sequences. FIG. 2 is a flow chart of this preprocessing, which proceeds as follows:
The complementarity-determining region (CDR) of the T cell receptor is located at the junction of the variable gene and the joining gene, and its function is determined by its amino acid sequence, its variable gene type and its joining gene type. The amino acid sequence is composed of 20 kinds of amino acids and has a maximum length of 40. There are 16 variable gene types and 43 joining gene types. The nn.Embedding function of the PyTorch framework is used to generate 21 word vectors, each of dimension 50; the word vector numbered 0 is a padding word vector that carries no meaning, and the word vectors numbered 1 to 20 represent the 20 amino acids. When the amino acid sequence is preprocessed, every amino acid in the sequence is converted into its word vector number. When the variable gene type is preprocessed, a zero matrix of size 16 × 40 is generated, where 16 is the number of variable gene types and 40 is the maximum length of the amino acid sequence. The row corresponding to the variable gene type is replaced by the word vector number sequence of the amino acid sequence; for example, for the second variable gene type, the second row of the zero matrix is replaced by the number sequence. After the replacement, every number in the matrix is replaced by its word vector, which yields the variable gene feature matrix of size 16 × 40 × 50.
When the joining gene type is preprocessed, a zero matrix of size 43 × 40 is generated, where 43 is the number of joining gene types and 40 is the maximum length of the amino acid sequence. The row corresponding to the joining gene type is replaced by the word vector number sequence of the amino acid sequence; for example, for the first joining gene type, the first row of the zero matrix is replaced by the number sequence. After the replacement, every number in the matrix is replaced by its word vector, which yields the joining gene feature matrix of size 43 × 40 × 50.
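The preprocessing in step 1 can be sketched with PyTorch as follows. The sizes (16 and 43 gene types, maximum length 40, 21 word vectors of dimension 50) come from the description; the helper name `build_feature_matrix`, the 0-based row indices, the right-padding, the residue-to-number mapping behind the toy sequence, and the use of `padding_idx=0` to keep word vector 0 fixed at zero are assumptions made for illustration.

```python
import torch
import torch.nn as nn

NUM_V_TYPES, NUM_J_TYPES = 16, 43   # gene-type counts from the description
MAX_LEN, EMBED_DIM = 40, 50         # maximum sequence length, word-vector size

# 21 vectors: number 0 is the padding vector, 1..20 are the amino acids.
# padding_idx=0 keeps vector 0 at zero; this is an assumption, not stated in the text.
embedding = nn.Embedding(num_embeddings=21, embedding_dim=EMBED_DIM, padding_idx=0)


def build_feature_matrix(seq_indices, gene_type, num_gene_types):
    """seq_indices: LongTensor of word-vector numbers (1..20), length <= MAX_LEN.
    gene_type: 0-based row of the observed gene type (hypothetical convention)."""
    index_matrix = torch.zeros(num_gene_types, MAX_LEN, dtype=torch.long)
    index_matrix[gene_type, : seq_indices.numel()] = seq_indices
    # Replace every number by its word vector.
    return embedding(index_matrix)  # (num_gene_types, MAX_LEN, EMBED_DIM)


seq = torch.tensor([2, 1, 16, 16, 10])  # toy sequence under an assumed numbering
# Second variable gene type -> row 1; first joining gene type -> row 0.
v_features = build_feature_matrix(seq, gene_type=1, num_gene_types=NUM_V_TYPES)
j_features = build_feature_matrix(seq, gene_type=0, num_gene_types=NUM_J_TYPES)
```

Both matrices carry the same embedded sequence; only the row in which it sits differs, which is how the gene type is encoded.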
Step 2: construction of the feature extraction module. FIG. 3 is a flow chart of the feature extraction module, which is built as follows:
Two convolution modules are constructed using the nn.Conv2d function of the PyTorch framework. Each module contains two convolutional neural networks arranged in parallel, and each network consists of a convolutional layer, an activation function, a normalization function and a dropout function. The first convolution module takes the variable gene feature matrix as input. In this module, the first convolutional neural network has 16 input channels, 3 output channels and a 3 × 50 convolution kernel, and extracts the low-dimensional features of the variable gene. The second convolutional neural network has 16 input channels, 3 output channels and a 5 × 50 convolution kernel, and extracts the high-dimensional features of the variable gene. The low-dimensional and high-dimensional features of the variable gene are concatenated to obtain the total feature of the variable gene. The second convolution module takes the joining gene feature matrix as input. In this module, the first convolutional neural network has 43 input channels, 3 output channels and a 3 × 50 convolution kernel, and extracts the low-dimensional features of the joining gene. The second convolutional neural network has 43 input channels, 3 output channels and a 5 × 50 convolution kernel, and extracts the high-dimensional features of the joining gene. The low-dimensional and high-dimensional features of the joining gene are concatenated to obtain the total feature of the joining gene.
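A convolution module of this kind might look like the sketch below. The channel counts and kernel sizes are taken from the description; the class names, the choice of ReLU, BatchNorm2d and a dropout rate of 0.1, and the assumption of stride 1 with no padding are illustrative and not stated in the source. Under those assumptions each branch yields 3 × (40 − 3 + 1) = 114 and 3 × (40 − 5 + 1) = 108 values, so each module outputs 222 features.

```python
import torch
import torch.nn as nn


class ConvBranch(nn.Module):
    """Convolution -> activation -> normalization -> dropout, in the order listed in the text."""

    def __init__(self, in_channels, kernel_height, embed_dim=50, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 3, kernel_size=(kernel_height, embed_dim)),
            nn.ReLU(),            # assumed activation
            nn.BatchNorm2d(3),    # assumed normalization
            nn.Dropout(dropout),  # assumed dropout rate
        )

    def forward(self, x):
        # (batch, 3, 40 - kernel_height + 1, 1) flattened to one vector per sample
        return self.net(x).flatten(start_dim=1)


class GeneConvModule(nn.Module):
    """Two parallel branches with 3x50 and 5x50 kernels, concatenated."""

    def __init__(self, num_gene_types):
        super().__init__()
        self.branch3 = ConvBranch(num_gene_types, kernel_height=3)
        self.branch5 = ConvBranch(num_gene_types, kernel_height=5)

    def forward(self, x):  # x: (batch, num_gene_types, 40, 50)
        return torch.cat([self.branch3(x), self.branch5(x)], dim=1)


v_module = GeneConvModule(16).eval()  # variable gene: 16 input channels
j_module = GeneConvModule(43).eval()  # joining gene: 43 input channels
with torch.no_grad():
    v_out = v_module(torch.randn(2, 16, 40, 50))
    j_out = j_module(torch.randn(2, 43, 40, 50))
```

Because the kernel width equals the word vector dimension, each kernel slides only along the sequence axis, which is why the output collapses to a single column per channel.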
Step 3: construction of the feature classification module. FIG. 4 is a flow chart of the feature classification module, which is built as follows:
Two fully connected layers are constructed using the nn.Linear function of the PyTorch framework. The first fully connected layer has an input dimension of 444 and an output dimension of 128; the second has an input dimension of 128 and an output dimension of 9, where 9 is the number of classes of complementarity-determining regions. The total feature of the variable gene and the total feature of the joining gene obtained in step 2 are concatenated and passed into the first fully connected layer; an activation function is applied to its output, and the result is passed into the second fully connected layer, which yields the classification result for the complementarity-determining region.
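The classification module can be sketched as below. The layer sizes 444, 128 and 9 come from the description; the class name, the choice of ReLU as the activation, and the per-gene feature length of 222 (half of 444, which matches 3 × 38 + 3 × 36 for stride-1, unpadded convolutions) are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ClassifierHead(nn.Module):
    """Two fully connected layers: 444 -> 128 -> 9."""

    def __init__(self, in_features=444, hidden=128, num_classes=9):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.act = nn.ReLU()  # the activation is not named in the text; ReLU is assumed
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, v_total, j_total):
        # Concatenate the variable-gene and joining-gene total features.
        x = torch.cat([v_total, j_total], dim=1)   # (batch, 222 + 222)
        return self.fc2(self.act(self.fc1(x)))     # (batch, 9) class scores


head = ClassifierHead()
# Random stand-ins for the two 222-dimensional total features of a batch of 4.
logits = head(torch.randn(4, 222), torch.randn(4, 222))
predicted = logits.argmax(dim=1)  # index of the highest-scoring class per sample
```

The head returns raw scores; a softmax or a cross-entropy loss would typically be applied on top during training, although the source does not specify the training objective.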
When applied to the classification of complementarity-determining regions (CDRs), the method achieves an AUC of 0.861 on the data set provided by DeepTCR, outperforming DeepTCR (AUC 0.831) and DeepCAT (AUC 0.812) on the same data set. Because the invention fully fuses the gene type and amino acid sequence of the CDR and extracts their features, it performs better than these existing methods.
The optimal model parameters are shown in the following table.
Table 1. Optimal model parameters
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (2)

1. A method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence, characterized in that a novel preprocessing scheme effectively fuses the gene type and the amino acid sequence of the CDR into new data, and a corresponding deep learning model fully extracts the features contained in the data; the method comprises three steps: preprocessing of gene types and amino acid sequences, construction of a feature extraction module, and construction of a feature classification module, with the following specific steps:
step 1: using the nn.Embedding function of the PyTorch framework, generating one word vector for each type of amino acid and additionally one padding word vector that carries no meaning; converting the amino acid sequence of the CDR into its word vector matrix using the generated word vectors; generating a feature matrix for the variable gene and for the joining gene of the CDR, each of dimensions (number of gene types) × (amino acid sequence length) × (word vector dimension); replacing the row corresponding to each of the two gene types with the word vector matrix of the amino acid sequence, and filling the remaining positions with the padding word vector;
step 2: using the nn.Conv2d function of the PyTorch framework, constructing two convolution modules, each containing two parallel convolutional neural networks with different convolution kernels, wherein the first convolution module takes the variable gene feature matrix from step 1 as input and the second convolution module takes the joining gene feature matrix from step 1 as input;
step 3: using the nn.Linear function of the PyTorch framework, constructing two fully connected layers; concatenating the features extracted by the two convolution modules in step 2 and passing them into the fully connected layers, which classify the CDR by transforming the dimension of the features.
2. The method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence according to claim 1, wherein the one-hot encoding of each atom in the atomic sequence is combined with its three-dimensional spatial coordinates to obtain the coordinate one-hot encoding of the atom; each amino acid in the amino acid sequence is represented by a word vector, the amino acid sequence is converted into its word vector matrix, and the matrix is then combined with the variable gene type data and the joining gene type data to obtain the variable gene feature matrix and the joining gene feature matrix; the preprocessing of gene types and amino acid sequences is implemented as follows:
generating 21 word vectors using the nn.Embedding function of the PyTorch framework, each word vector having a dimension of 50, wherein the word vector numbered 0 is a padding word vector that carries no meaning and the word vectors numbered 1 to 20 represent the 20 amino acids; when the amino acid sequence is preprocessed, converting every amino acid in the sequence into its word vector number; when the variable gene type is preprocessed, generating a zero matrix of size 16 × 40, where 16 is the number of variable gene types and 40 is the maximum length of the amino acid sequence, replacing the row corresponding to the variable gene type with the word vector number sequence of the amino acid sequence, and, after the replacement, replacing every number in the matrix with its word vector to obtain the variable gene feature matrix; when the joining gene type is preprocessed, generating a zero matrix of size 43 × 40, where 43 is the number of joining gene types and 40 is the maximum length of the amino acid sequence, replacing the row corresponding to the joining gene type with the word vector number sequence of the amino acid sequence, and, after the replacement, replacing every number in the matrix with its word vector to obtain the joining gene feature matrix.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410240576.9A CN117854601B (en) 2024-03-04 2024-03-04 Method for classifying complementarity-determining regions based on gene type and amino acid sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410240576.9A CN117854601B (en) 2024-03-04 2024-03-04 Method for classifying complementarity-determining regions based on gene type and amino acid sequence

Publications (2)

Publication Number Publication Date
CN117854601A (en) 2024-04-09
CN117854601B (en) 2024-05-14

Family

ID=90544345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410240576.9A Method for classifying complementarity-determining regions based on gene type and amino acid sequence 2024-03-04 2024-03-04 (Active; CN117854601B (en))

Country Status (1)

Country Link
CN (1) CN117854601B (en)

Also Published As

Publication number Publication date
CN117854601B (en) 2024-05-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant