CN117854601A

CN117854601A - Method for classifying complementarity-determining regions based on gene type and amino acid sequence

Info

Publication number: CN117854601A
Application number: CN202410240576.9A
Authority: CN
Inventors: 刘峻江; 周树森; 臧睦君; 刘通; 柳婵娟; 王庆军
Original assignee: Ludong University
Current assignee: Ludong University
Priority date: 2024-03-04
Filing date: 2024-03-04
Publication date: 2024-04-09
Anticipated expiration: 2044-03-04
Also published as: CN117854601B

Abstract

The invention belongs to the field of bioinformatics and relates to a method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence. The method fully fuses the variable gene segment, the joining gene segment, and the amino acid sequence of the CDR of a T cell receptor, with the aim of improving CDR classification. A novel preprocessing method combines the gene type data and the amino acid sequence data of the CDR into new data; two convolutional neural network modules then extract the features of the variable gene segment and of the joining gene segment, respectively, and the CDR is classified from the features extracted by convolution. The method comprises three steps: preprocessing of gene types and amino acid sequences, construction of a feature extraction module, and construction of a feature classification module. The method effectively fuses the gene type and amino acid sequence features of the CDR, achieves a better classification result, and is significant for clinical treatment and the study of immune processes.

Description

Method for classifying complementarity-determining regions based on gene type and amino acid sequence

Technical Field

The invention belongs to the field of bioinformatics and relates to a method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence.

Background

The complementarity-determining region (CDR) lies at the junction between the variable gene segment and the joining gene segment in the variable region of the T cell receptor. The function of a CDR is directly related to its variable gene segment type, its joining gene segment type, and its own amino acid sequence.

Different CDRs enable T cells to recognize different antigens, so accurate classification of CDRs is important for understanding human immune function and for disease treatment.

Most current CDR classification methods rely only on the amino acid sequence of the CDR. The few methods that use both gene type and amino acid sequence handle the two kinds of data separately and cannot fuse them effectively. How to fully fuse the variable gene segment type, the joining gene segment type, and the amino acid sequence of a CDR, and then classify the CDR, is therefore a major difficulty in the field.

Disclosure of Invention

To overcome this difficulty, the invention provides a method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence. The method fully fuses the variable gene segment, the joining gene segment, and the amino acid sequence of the CDR of a T cell receptor, thereby improving classification performance.

The CDR classification method based on gene type and amino acid sequence comprises three steps: preprocessing of gene types and amino acid sequences, construction of a feature extraction module, and construction of a feature classification module. The specific steps are as follows:

Step 1, using word vectors to represent each amino acid in the amino acid sequence. After the amino acid sequence is converted into its corresponding word vector matrix, the word vector matrix is combined with the variable gene type data and the joining gene type data, respectively, to obtain a variable gene feature matrix and a joining gene feature matrix;

Step 2, constructing a feature extraction module consisting of two convolution modules of identical structure, wherein each convolution module comprises two convolutional neural networks arranged in parallel, and the two convolution modules extract features from the variable gene feature matrix and the joining gene feature matrix obtained in step 1, respectively;

Step 3, constructing a feature classification module consisting of two fully connected layers, wherein the features extracted in step 2 are taken as input and the output of the module is taken as the prediction result.

In the method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence, step 1 is implemented as follows:

The nn.Embedding function of the PyTorch framework is used to generate one word vector for each type of amino acid, plus one additional meaningless (padding) word vector. The amino acid sequence of the CDR is converted into its corresponding word vector matrix using the generated word vectors. One feature matrix is generated for the variable gene and one for the joining gene of the CDR; the dimension of each matrix is the number of gene types × the length of the amino acid sequence × the dimension of the word vector. In each matrix, the row corresponding to the gene type of the CDR is replaced by the word vector matrix of the amino acid sequence, and the remaining positions are filled with the meaningless word vector.
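The row placement described in step 1 can be sketched in plain Python, before any word vectors are looked up. This is only an illustration under stated assumptions, not the patented implementation: the function names `encode_sequence` and `build_index_matrix`, the alphabetical ordering of the 20 amino acid letters, right-padding of short sequences, and the 0-based gene index are all chosen for the example and do not come from the source.

```python
# Hypothetical sketch: put an encoded amino acid sequence in the row of its gene type.
# The amino acid ordering, padding side, and all names are assumptions for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, assumed order
PAD = 0  # number reserved for the meaningless (padding) word vector
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # numbers 1..20


def encode_sequence(seq, max_len):
    """Convert an amino acid string to word vector numbers, right-padded with PAD."""
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    return [AA_TO_INDEX[aa] for aa in seq] + [PAD] * (max_len - len(seq))


def build_index_matrix(seq, gene_index, num_gene_types, max_len):
    """Return a num_gene_types x max_len matrix of word vector numbers.

    Only the row of the sequence's gene type (0-based here) holds the encoded
    sequence; every other position stays PAD.
    """
    matrix = [[PAD] * max_len for _ in range(num_gene_types)]
    matrix[gene_index] = encode_sequence(seq, max_len)
    return matrix


# Toy example: 4 gene types, maximum length 6, sequence assigned to gene type 1.
toy = build_index_matrix("CAS", gene_index=1, num_gene_types=4, max_len=6)
```

Building the matrix of numbers first and looking up word vectors afterwards keeps the gene type information purely positional: the same sequence produces a different matrix for each gene type, which is what lets a later convolution see both kinds of data at once.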

In the method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence, step 2 is implemented as follows:

Two convolution modules are constructed using the nn.Conv2d function of the PyTorch framework, each containing two parallel convolutional neural networks with different convolution kernels. The first convolution module takes the variable gene feature matrix from step 1 as input, and the second convolution module takes the joining gene feature matrix from step 1 as input.

In the method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence, step 3 is implemented as follows:

Two fully connected layers are constructed using the nn.Linear function of the PyTorch framework. The features extracted by the two convolution modules in step 2 are concatenated and passed into the fully connected layers, which complete the classification of the CDR by transforming the dimension of the features.

Drawings

FIG. 1 is a flow chart of the method for classifying complementarity-determining regions based on gene type and amino acid sequence.

FIG. 2 is a flow chart of the preprocessing of the gene type and amino acid sequence.

FIG. 3 is a flow chart of the feature extraction module.

FIG. 4 is a flow chart of the feature classification module.

Detailed Description

The invention is described in detail below with reference to the drawings and examples.

The invention provides a method for classifying complementarity-determining regions (CDRs) of T cell receptors based on gene type and amino acid sequence.

FIG. 1 is a flow chart of the method. The method comprises three steps: preprocessing of gene types and amino acid sequences, construction of a feature extraction module, and construction of a feature classification module. The specific implementation is as follows:

Step 1: preprocessing of the gene type and amino acid sequence. FIG. 2 is a flow chart of this preprocessing, which includes the following:

The complementarity-determining region (CDR) of the T cell receptor is located at the junction of the variable gene and the joining gene, and its function is determined by its amino acid sequence, its variable gene type, and its joining gene type. The amino acid sequence is built from 20 types of amino acid and has a maximum length of 40. The variable gene types are divided into 16 types, and the joining gene types into 43 types.

Using the nn.Embedding function of the PyTorch framework, 21 word vectors are generated, each of dimension 50; the word vector numbered 0 is a meaningless (padding) word vector, and the word vectors numbered 1 through 20 represent the 20 amino acids. When the amino acid sequence is preprocessed, every amino acid in the sequence is converted into its corresponding word vector number.

When the variable gene type is preprocessed, a zero matrix of size 16 × 40 is generated, where 16 is the number of variable gene types and 40 is the maximum length of the amino acid sequence. The row corresponding to the variable gene type of the CDR is replaced by the word vector number sequence of the amino acid sequence; for example, for the second variable gene type, the second row of the zero matrix is replaced by the amino acid sequence. After the replacement, every number in the matrix is replaced by its corresponding word vector, which yields the variable gene feature matrix of size 16 × 40 × 50.

When the joining gene type is preprocessed, a zero matrix of size 43 × 40 is generated, where 43 is the number of joining gene types and 40 is the maximum length of the amino acid sequence. The row corresponding to the joining gene type of the CDR is replaced by the word vector number sequence of the amino acid sequence; for example, for the first joining gene type, the first row of the zero matrix is replaced by the amino acid sequence. After the replacement, every number in the matrix is replaced by its corresponding word vector, which yields the joining gene feature matrix of size 43 × 40 × 50.
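The lookup from word vector numbers to word vectors can be illustrated with PyTorch for the variable gene case. This is a hedged sketch rather than the inventors' code: the use of `padding_idx=0`, the 0-based row index, and the example sequence numbers are assumptions; the sizes 21, 50, 16, and 40 are taken from the description.

```python
import torch
import torch.nn as nn

NUM_V_TYPES, MAX_LEN, EMBED_DIM = 16, 40, 50  # sizes stated in the description

# 21 word vectors of dimension 50; number 0 is the meaningless word vector.
# padding_idx=0 is an assumption: it keeps vector 0 at zero and out of training.
embedding = nn.Embedding(num_embeddings=21, embedding_dim=EMBED_DIM, padding_idx=0)

# Hypothetical encoded sequence: word vector numbers in 1..20, zero-padded to MAX_LEN.
seq_numbers = torch.zeros(MAX_LEN, dtype=torch.long)
seq_numbers[:5] = torch.tensor([2, 1, 16, 16, 10])

# Zero matrix of size 16 x 40; the row of the variable gene type receives the sequence.
v_type = 1  # 0-based row index, i.e. the second variable gene type
v_index_matrix = torch.zeros(NUM_V_TYPES, MAX_LEN, dtype=torch.long)
v_index_matrix[v_type] = seq_numbers

# Replace every number with its word vector: resulting shape is (16, 40, 50).
v_feature_matrix = embedding(v_index_matrix)
```

The joining gene feature matrix would follow the same pattern with 43 rows in place of 16.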

Step 2: construction of the feature extraction module. FIG. 3 is a flow chart of the feature extraction module, which includes the following:

Two convolution modules are constructed using the nn.Conv2d function of the PyTorch framework. Each module contains two convolutional neural networks arranged in parallel, and each network consists of a convolutional layer, an activation function, a normalization function, and a dropout function.

The first convolution module takes the variable gene feature matrix as input. In this module, the first convolutional neural network has 16 input channels, 3 output channels, and a 3 × 50 convolution kernel; extracting features with it yields the low-dimensional feature of the variable gene. The second convolutional neural network has 16 input channels, 3 output channels, and a 5 × 50 convolution kernel; extracting features with it yields the high-dimensional feature of the variable gene. The low-dimensional and high-dimensional features of the variable gene are concatenated to obtain the total feature of the variable gene.

The second convolution module takes the joining gene feature matrix as input. In this module, the first convolutional neural network has 43 input channels, 3 output channels, and a 3 × 50 convolution kernel; extracting features with it yields the low-dimensional feature of the joining gene. The second convolutional neural network has 43 input channels, 3 output channels, and a 5 × 50 convolution kernel; extracting features with it yields the high-dimensional feature of the joining gene. The low-dimensional and high-dimensional features of the joining gene are concatenated to obtain the total feature of the joining gene.
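One way to read the two convolution modules is sketched below in PyTorch. The class names `ConvBranch` and `ConvModule`, the choice of ReLU and BatchNorm2d, the dropout rate, and the absence of padding in the convolution are assumptions, since the description does not name the activation or normalization functions; the channel counts and kernel sizes are taken from the text.

```python
import torch
import torch.nn as nn


class ConvBranch(nn.Module):
    """One convolutional network: convolution, activation, normalization, dropout."""

    def __init__(self, in_channels, kernel_height, embed_dim=50, out_channels=3, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            # The kernel spans the full word vector width, so the output width is 1.
            nn.Conv2d(in_channels, out_channels, kernel_size=(kernel_height, embed_dim)),
            nn.ReLU(),                     # assumed activation function
            nn.BatchNorm2d(out_channels),  # assumed normalization function
            nn.Dropout(p_drop),            # assumed dropout rate
        )

    def forward(self, x):
        # x has shape (batch, gene types, max length, word vector dimension).
        return self.net(x).flatten(start_dim=1)


class ConvModule(nn.Module):
    """Two parallel branches with 3 x 50 and 5 x 50 kernels, outputs concatenated."""

    def __init__(self, in_channels):
        super().__init__()
        self.branch3 = ConvBranch(in_channels, kernel_height=3)
        self.branch5 = ConvBranch(in_channels, kernel_height=5)

    def forward(self, x):
        return torch.cat([self.branch3(x), self.branch5(x)], dim=1)


v_module = ConvModule(in_channels=16).eval()  # variable gene module
j_module = ConvModule(in_channels=43).eval()  # joining gene module
v_features = v_module(torch.randn(2, 16, 40, 50))  # random stand-in inputs
j_features = j_module(torch.randn(2, 43, 40, 50))
```

Under these assumptions, the 3 × 50 branch produces 3 × 38 values and the 5 × 50 branch produces 3 × 36 values, so each module outputs 222 features per sample.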

Step 3: construction of the feature classification module. FIG. 4 is a flow chart of the feature classification module, which includes the following:

Two fully connected layers are constructed using the nn.Linear function of the PyTorch framework. The first fully connected layer has an input dimension of 444 and an output dimension of 128; the second fully connected layer has an input dimension of 128 and an output dimension of 9, where 9 is the number of classes of complementarity-determining regions (CDRs). The total feature of the variable gene and the total feature of the joining gene obtained in step 2 are concatenated and passed into the first fully connected layer; an activation function is applied to the result, which is then passed into the second fully connected layer to obtain the classification result for the CDR.
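The classification module can be sketched as follows. The class name `Classifier` and the ReLU activation are assumptions; the layer sizes 444, 128, and 9 come from the description. The arithmetic in the comment is one reading that is consistent with the stated input dimension of 444, assuming unpadded convolutions over sequences of maximum length 40.

```python
import torch
import torch.nn as nn


class Classifier(nn.Module):
    """Two fully connected layers mapping concatenated features to class scores."""

    def __init__(self, in_dim=444, hidden_dim=128, num_classes=9):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.act = nn.ReLU()  # assumed activation function
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, v_features, j_features):
        # Concatenate the variable gene and joining gene total features.
        x = torch.cat([v_features, j_features], dim=1)
        return self.fc2(self.act(self.fc1(x)))


# Each convolution module yields 3*(40-3+1) + 3*(40-5+1) = 114 + 108 = 222 features,
# so two modules together give the 444 inputs of the first fully connected layer.
per_module = 3 * (40 - 3 + 1) + 3 * (40 - 5 + 1)
clf = Classifier(in_dim=2 * per_module)

# Random stand-ins for the features of a batch of two CDRs.
logits = clf(torch.randn(2, per_module), torch.randn(2, per_module))
predicted_class = logits.argmax(dim=1)  # index of the highest-scoring class
```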

When the method is applied to the classification of complementarity-determining regions (CDRs), the AUC obtained on the dataset provided by DeepTCR is 0.861. This exceeds the performance of DeepTCR and DeepCAT on the same dataset, where DeepTCR reaches an AUC of 0.831 and DeepCAT an AUC of 0.812. Because the invention fully fuses and extracts the gene type and amino acid sequence features of the CDR, its performance is higher than that of other existing methods.

The optimal model parameters are shown in the following table.

TABLE 1 optimal model parameters

The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. A method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence, characterized in that a novel preprocessing method is used to effectively fuse the gene types and the amino acid sequence of the CDR into new data, and a corresponding deep learning model is provided to fully extract the features contained in the data; the method comprises three steps, namely preprocessing of gene types and amino acid sequences, construction of a feature extraction module, and construction of a feature classification module, and the specific steps are as follows:

step 1, generating a word vector for each type of amino acid by using the nn.Embedding function of the PyTorch framework, and additionally generating a meaningless (padding) word vector; converting the amino acid sequence of the CDR into its corresponding word vector matrix by using the generated word vectors; generating one feature matrix for the variable gene and one for the joining gene of the CDR, wherein the dimension of each matrix is the number of gene types multiplied by the length of the amino acid sequence multiplied by the dimension of the word vector; replacing the row corresponding to the gene type in each matrix with the word vector matrix of the amino acid sequence, and filling the remaining positions with the meaningless word vector;

step 2, constructing two convolution modules by using the nn.Conv2d function of the PyTorch framework, wherein each convolution module comprises two parallel convolutional neural networks with different convolution kernels, the first convolution module takes the variable gene feature matrix from step 1 as input, and the second convolution module takes the joining gene feature matrix from step 1 as input;

step 3, in the method for classifying complementarity-determining regions based on gene type and amino acid sequence, the implementation process of step 3 is as follows:

2. The method for classifying complementarity-determining regions (CDRs) based on gene type and amino acid sequence according to claim 1, wherein the one-hot encoding of each atom in the atomic sequence is combined with its three-dimensional spatial coordinates to obtain the coordinate one-hot encoding of the atom; a word vector is used to represent each amino acid in the amino acid sequence, the amino acid sequence is converted into its corresponding word vector matrix, and the matrix is then combined with the variable gene type data and the joining gene type data to obtain a variable gene feature matrix and a joining gene feature matrix; and the preprocessing of the gene type and amino acid sequence is implemented as follows:

generating 21 word vectors by using the nn.Embedding function of the PyTorch framework, each word vector having a dimension of 50, wherein the word vector numbered 0 is a meaningless (padding) word vector and the word vectors numbered 1 to 20 represent the 20 amino acids; when the amino acid sequence is preprocessed, converting every amino acid in the sequence into its corresponding word vector number; when the variable gene type is preprocessed, generating a zero matrix of size 16 × 40, wherein 16 is the number of variable gene types and 40 is the maximum length of the amino acid sequence, replacing the row corresponding to the variable gene type with the word vector number sequence of the amino acid sequence, and, after the replacement is completed, replacing every number in the matrix with its corresponding word vector to obtain the variable gene feature matrix; when the joining gene type is preprocessed, generating a zero matrix of size 43 × 40, wherein 43 is the number of joining gene types and 40 is the maximum length of the amino acid sequence, replacing the row corresponding to the joining gene type with the word vector number sequence of the amino acid sequence, and, after the replacement is completed, replacing every number in the matrix with its corresponding word vector to obtain the joining gene feature matrix.