CN117476106B - Multi-class unbalanced protein secondary structure prediction method and system - Google Patents

Info

Publication number: CN117476106B
Authority: CN (China)
Prior art keywords: matrix, layer, secondary structure, output, function
Legal status: Active
Application number: CN202311804115.1A
Other languages: Chinese (zh)
Other versions: CN117476106A
Inventors: 何朝政, 赵君研, 肖秦琨, 付玲
Current Assignee: Shaanxi Qianruanhui Information Technology Co ltd
Original Assignee: Xi'an Huisuan Intelligent Technology Co ltd
Events: application filed by Xi'an Huisuan Intelligent Technology Co ltd; priority to CN202311804115.1A; publication of CN117476106A; application granted; publication of CN117476106B

Classifications

    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Abstract

The invention discloses a multi-class imbalanced protein secondary structure prediction method and system, in the technical field of computer-aided drug development. The method comprises obtaining a target protein sequence to be predicted, inputting it into a pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model, obtaining the predicted secondary structure of the sequence, and outputting the prediction. When predicting the secondary structure of a protein sequence, the method exhibits low global dependence on amino acids and improves the prediction accuracy for rare secondary-structure classes.

Description

A multi-class imbalanced protein secondary structure prediction method and system

Technical Field

The present invention relates to the technical field of computer-aided drug development, and in particular to a multi-class imbalanced protein secondary structure prediction method and system.

Background Art

Proteins are important organic compounds that build and repair human tissue, and a large number of computational methods have emerged around the protein secondary structure prediction problem. Two commonly used methods are DeepACLSTM and MUFold-SS. The DeepACLSTM model uses protein sequence features and profile features, where the profile is a sequence feature spectrum built from multiple sequence alignments, specifically a position-specific scoring matrix (PSSM); it combines an asymmetric convolutional neural network (ACNN), built from 1×42 one-dimensional convolutions and 3×1 two-dimensional convolutions, with a bidirectional long short-term memory (BiLSTM) network to capture local and global correlations between amino-acid residues. The feature matrix of the MUFold-SS model is composed of amino-acid physicochemical properties, PSI-BLAST profiles, and HHblits profiles, and thus contains rich evolutionary information; a deep convolutional network built from parallel convolutions with kernel sizes 1 and 3 extracts local and global correlations between amino acids. However, when predicting protein secondary structure, both methods suffer from strong global dependence on amino acids and low prediction accuracy for rare secondary-structure classes.

Summary of the Invention

To address the deficiencies of the background art above, chiefly the prior art's strong global dependence on amino acids and low prediction accuracy for rare secondary-structure classes when predicting protein secondary structure, the present invention provides a multi-class imbalanced protein secondary structure prediction method and system with low global dependence on amino acids and improved prediction accuracy for rare secondary-structure classes.

To achieve the above objective, a first aspect of the present invention provides a multi-class imbalanced protein secondary structure prediction method, comprising:

obtaining a target protein sequence to be predicted;

inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model to obtain a secondary structure prediction for the sequence. Multi-class protein secondary structure is divided into 8 states, denoted L, B, E, G, I, H, S and T. Among these 8 states, B, G and S occur at low proportions, especially G; such rare classes are easily swamped during training, which lowers their prediction accuracy, so protein secondary structure presents a multi-class imbalance problem. The pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model comprises:

a data preprocessing layer for weighting and batch-normalizing the input sample data;

a pytorch function layer for processing the preprocessed input samples to obtain a first output matrix and a second output matrix;

a MobileNet v2 layer for iteratively processing the second output matrix to obtain a local feature matrix of the protein sequences;

a Transformer layer for processing the third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences;

a two-layer bidirectional gated recurrent unit layer for processing the association matrix between protein sequences to obtain a global feature matrix;

a convolutional layer and a fully connected layer for sequentially processing the global feature matrix to obtain the predicted secondary structure;

outputting the predicted secondary structure of the target protein sequence.

Optionally, weighting and batch-normalizing the input sample data comprises:

processing the input sample data with a feature matrix obtained by horizontally concatenating a position-specific scoring matrix, a hidden Markov model feature matrix, and an amino-acid physicochemical property feature matrix, to obtain weighted sample data;

obtaining a mask matrix with all elements initialized to false and an input feature matrix with all elements initialized to 0;

obtaining an updated mask matrix and an updated input feature matrix according to the protein chain index, protein chain length, and maximum protein chain length found by traversing the protein sequences of each batch of the weighted sample data;

computing the mean and variance over the updated mask matrix and updated input feature matrix using a preset first formula and a preset second formula, respectively;

processing the mean and variance with a batch normalization layer pre-constructed from a preset third formula, to obtain the batch-normalized output matrix.

Optionally, the preset third formula includes:

where μ denotes the mean obtained by applying the preset first formula to the updated mask matrix and updated input feature matrix, var denotes the variance obtained by applying the preset second formula to them, X_B,C,max_L denotes the weighted input sample data, X* denotes the output of the batch normalization layer, and β and γ are the learnable parameters of the batch normalization layer, with ε = 0.01;

where the preset first formula and the preset second formula are, respectively:

where μ denotes the mean of all sample data, var_1 the variance of all sample data, M_bl the mask matrix, and X_bcl the element value of each amino-acid vector in each protein sequence; B is the batch size (Batchsize) set during training, 16 here; b indexes the proteins of the batch, so b = [1, 16]; max_L is the maximum sequence length in the batch; and L is the sequence length of each protein in the batch;

Optionally, the pytorch function layer comprises, connected in sequence, an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function, and an nn.Dropout function, where the nn.Conv1d function has 57 input channels, 448 output channels, a convolution kernel of length 3, padding=1, and bias=False; the input of the F.relu function is the output of the nn.Conv1d function; the dimension of the MaskedBatchNorm1d function is 448; and the probability that a neuron is not activated by the nn.Dropout function is 0.2.

Optionally, iteratively processing the second output matrix comprises:

taking the second output matrix produced by the nn.Dropout function as the initial input matrix of the MobileNet v2 layer, to obtain an initial output matrix;

using the initial output matrix as input to the MobileNet v2 layer, and feeding each intermediate output matrix produced by the MobileNet v2 layer back in as its input over multiple iterations, to obtain the local feature matrix of the protein sequences.

Optionally, the Transformer has two encoder layers, and each encoder has a set of eight attention-head matrices.

Optionally, the convolutional layer and the fully connected layer both have 57 output channels; the convolution kernel of the convolutional layer is 1, and the output size of the fully connected layer is 8.

Optionally, after the convolutional layer and fully connected layer that sequentially process the global feature matrix to obtain the predicted secondary structure, the method further comprises:

computing the prediction error with a label-distribution-aware margin loss function, and back-propagating the error to update the parameters β and γ of the batch normalization layer.
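The label-distribution-aware margin (LDAM) loss referenced above is not written out in the patent; a minimal numpy sketch of the standard formulation (Cao et al., 2019), with hypothetical class counts for the 8 secondary-structure states, might look like:

```python
import numpy as np

def ldam_loss(logits, labels, class_counts, C=0.5, s=30.0):
    """Label-Distribution-Aware Margin (LDAM) loss: class j gets margin
    C / n_j**0.25, so rarer classes receive larger margins."""
    margins = C / np.power(np.asarray(class_counts, dtype=float), 0.25)
    z = np.array(logits, dtype=float)
    idx = np.arange(len(labels))
    z[idx, labels] -= margins[labels]      # push the true-class logit down by its margin
    z = s * z                              # scale before softmax
    z = z - z.max(axis=1, keepdims=True)   # numerically stable log-softmax
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[idx, labels].mean()   # cross-entropy on margin-shifted logits

# Hypothetical counts of the 8 states L, B, E, G, I, H, S, T (G and I rare)
counts = [100, 5, 60, 3, 1, 80, 4, 40]
loss = ldam_loss(np.zeros((2, 8)), np.array([0, 3]), counts)
```

Because the margin shrinks with class frequency, rare classes such as G are pushed further from the decision boundary during training.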

Another aspect of the present invention provides a multi-class imbalanced protein secondary structure prediction system, comprising:

an input module for obtaining the target protein sequence to be predicted;

a processing module for inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model, to obtain its predicted secondary structure; the pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model comprises:

a data preprocessing layer for weighting and batch-normalizing the input sample data;

a pytorch function layer for processing the preprocessed input samples to obtain a first output matrix and a second output matrix;

a MobileNet v2 layer for iteratively processing the second output matrix to obtain a local feature matrix of the protein sequences;

a Transformer layer for processing the third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences;

a two-layer bidirectional gated recurrent unit layer for processing the association matrix between protein sequences to obtain a global feature matrix;

a convolutional layer and a fully connected layer for sequentially processing the global feature matrix to obtain the predicted secondary structure; and

an output module for outputting the predicted secondary structure of the target protein sequence.

Compared with the prior art, the beneficial effects of the present invention are:

The multi-class imbalanced protein secondary structure prediction method and system provided by the present invention input the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model and obtain its predicted secondary structure. The model comprises: a data preprocessing layer for weighting and batch-normalizing the input sample data; a pytorch function layer for processing the preprocessed input samples to obtain a first output matrix and a second output matrix; a MobileNet v2 layer for iteratively processing the second output matrix to obtain a local feature matrix of the protein sequences; a Transformer layer for processing the third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences; a two-layer bidirectional gated recurrent unit layer for processing the association matrix to obtain a global feature matrix; and a convolutional layer and fully connected layer for sequentially processing the global feature matrix to obtain the predicted secondary structure. When predicting protein secondary structure, the method exhibits low global dependence on amino acids and improves the prediction accuracy for rare secondary-structure classes.

Brief Description of the Drawings

Figure 1 is a flow chart of the multi-class imbalanced protein secondary structure prediction method provided by the present invention;

Figure 2 is a histogram comparing model performance under different features;

Figure 3 is a histogram comparing model performance under different batch normalization (BN) schemes;

Figure 4 is a histogram comparing model performance with and without the Transformer;

Figure 5 is a schematic structural diagram of the multi-class imbalanced protein secondary structure prediction system.

Detailed Description

The present invention is further described below with reference to specific embodiments and the drawings, but the embodiments are not intended to limit the present invention.

Figure 1 is a flow chart of the multi-class imbalanced protein secondary structure prediction method provided by the present invention. As shown in Figure 1, the method comprises:

101. Obtain the target protein sequence to be predicted.

102. Input the target protein sequence to be predicted into the pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model, to obtain its predicted secondary structure. The pre-constructed model comprises:

A data preprocessing layer for weighting and batch-normalizing the input sample data.

The weighting and batch normalization of the input sample data comprise the following steps:

Process the input sample data with a feature matrix obtained by horizontally concatenating a position-specific scoring matrix, a hidden Markov model feature matrix, and an amino-acid physicochemical property feature matrix, to obtain weighted sample data.

Specifically, the PSI-BLAST convergence parameter e is set to 0.001, and PSI-BLAST is run against the Uniref database for two iterations to generate a position-specific scoring matrix (PSSM) of size L×20, where L is the protein sequence length.

HHblits is run against the Uniprot20 database for four iterations to generate a hidden Markov model (HMM) feature matrix of size L×30. Uniprot20 is a protein sequence database containing the sequence information of all species and all proteins in UniProtKB (the protein knowledge base), 70,432,686 protein sequences in total. It comprises two parts, Swiss-Prot and TrEMBL: Swiss-Prot contains manually annotated, high-quality protein sequences, while TrEMBL contains automatically annotated protein sequences and unverified predicted sequences. It can be downloaded from http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/.

The position-specific scoring matrix, the hidden Markov model feature matrix, and seven amino-acid physicochemical property feature matrices are concatenated and fused; the physicochemical properties are sheet probability, helix probability, isoelectric point, hydrophobicity, van der Waals volume, polarizability, and graph shape index. This yields a feature matrix of size L×57. The fused matrix provides the model with rich information and effectively improves its prediction accuracy.
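As a sanity check of the dimensions above, the horizontal concatenation of the three feature blocks can be sketched as follows (random placeholder values stand in for real PSI-BLAST/HHblits output):

```python
import numpy as np

L = 100  # placeholder protein length
pssm = np.random.rand(L, 20)      # PSI-BLAST position-specific scoring matrix, L x 20
hmm = np.random.rand(L, 30)       # HHblits hidden Markov model profile, L x 30
physchem = np.random.rand(L, 7)   # seven physicochemical properties per residue, L x 7

features = np.hstack([pssm, hmm, physchem])  # horizontal concatenation -> L x 57
```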

A mask matrix with all elements initialized to false and an input feature matrix with all elements initialized to 0 are obtained.

The protein chain index is obtained by traversing the protein sequences of each batch of the weighted sample data, and the updated mask matrix and updated input feature matrix are obtained using the protein chain length and the maximum protein chain length.

Specifically, a mask matrix is introduced into the batch normalization layer of the network. Each batch of B protein chains is traversed to obtain each chain's index batch_idx within the batch, its length L, and the maximum chain length max_L of the batch. All elements of the mask matrix M_B,max_L are initialized to false and all elements of the input feature matrix X_B,C,max_L to 0. Each batch_idx corresponds to one row of M_B,max_L; according to batch_idx and ProteinLen, elements 0 to ProteinLen-1 of that row are updated to True, where ProteinLen is the true length of the protein sequence, giving the final mask matrix M_B,max_L for each batch. For X_B,C,max_L, each batch_idx corresponds to one protein chain, and X is filled into X_B,C,max_L according to X[batch_idx, ProteinLen] = X, giving the final X_B,C,max_L. In implementation, the input X has dimensions B×L×F and Masks has dimensions B×L; the padded positions of X are masked out, the unpadded positions are extracted into a new tensor whose dimensions are adjusted according to num_features (the length of each amino-acid feature vector), and the mean and variance of the batch normalization layer are computed from this new tensor.
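The mask and padding procedure above can be sketched as follows (a minimal numpy version; names such as build_mask_and_features are illustrative, not from the patent):

```python
import numpy as np

def build_mask_and_features(chains, num_features=57):
    """chains: list of (ProteinLen_i, num_features) per-residue feature arrays.
    Returns the mask M[B, max_L] and the zero-padded features X[B, num_features, max_L]."""
    B = len(chains)
    max_L = max(c.shape[0] for c in chains)           # longest chain in the batch
    M = np.zeros((B, max_L), dtype=bool)              # all elements initially false
    X = np.zeros((B, num_features, max_L))            # all elements initially 0
    for batch_idx, chain in enumerate(chains):
        protein_len = chain.shape[0]
        M[batch_idx, :protein_len] = True             # mark the real residue positions
        X[batch_idx, :, :protein_len] = chain.T       # fill features; tail stays zero-padded
    return M, X

# Two toy chains of lengths 5 and 3 with 57 features each
chains = [np.ones((5, 57)), np.ones((3, 57))]
M, X = build_mask_and_features(chains)
```

Masked positions are then excluded when computing the batch-norm statistics, which is exactly why the mask is introduced in the following paragraph.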

The mask matrix is introduced to prevent the zero-padded positions from degrading the accuracy of feature extraction at other positions during subsequent feature extraction.

The mean and variance over the updated mask matrix and updated input feature matrix are computed using the preset first formula and the preset second formula, respectively.

The mean and variance are processed with a batch normalization layer pre-constructed from the preset third formula, to obtain the batch-normalized output matrix.

Specifically, the preset third formula includes:

where μ denotes the mean obtained by applying the preset first formula to the updated mask matrix and updated input feature matrix, var denotes the variance obtained by applying the preset second formula to them, X_B,C,max_L denotes the weighted input sample data, X* denotes the output of the batch normalization layer, and β and γ are the learnable parameters of the batch normalization layer, with ε = 0.01.

where the preset first formula and the preset second formula are, respectively:

where μ denotes the mean of all sample data, var_1 the variance of all sample data, M_bl the mask matrix, and X_bcl the element value of each amino-acid vector in each protein sequence; B is the batch size (Batchsize) set during training, 16 here; b indexes the proteins of the batch, so b = [1, 16]; max_L is the maximum sequence length in the batch; and L is the sequence length of each protein in the batch.
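The equation images of the three preset formulas are not reproduced in this text; from the surrounding definitions they are plausibly the masked mean, the masked variance, and the standard batch-normalization transform:

```latex
\mu = \frac{\sum_{b}\sum_{l} M_{bl}\,X_{bcl}}{\sum_{b}\sum_{l} M_{bl}}, \qquad
var_1 = \frac{\sum_{b}\sum_{l} M_{bl}\,\left(X_{bcl}-\mu\right)^{2}}{\sum_{b}\sum_{l} M_{bl}}, \qquad
X^{*} = \gamma\,\frac{X_{B,C,max\_L}-\mu}{\sqrt{var+\varepsilon}} + \beta,\quad \varepsilon = 0.01
```

This is a reconstruction under the assumption of a standard masked batch-norm design, not the patent's verbatim equations.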

A pytorch function layer for processing the preprocessed input samples to obtain a first output matrix and a second output matrix.

The pytorch function layer comprises, connected in sequence, an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function, and an nn.Dropout function, where the nn.Conv1d function has 57 input channels, 448 output channels, a convolution kernel of length 3, padding=1, and bias=False; the input of the F.relu function is the output of the nn.Conv1d function; the dimension of the MaskedBatchNorm1d function is 448; and the probability that a neuron is not activated by the nn.Dropout function is 0.2.

Specifically, B weighted and normalized samples are randomly selected from the training samples to form an input matrix X_B,C,L. With X_B,C,L as input, out1 = nn.Conv1d(57, 448, 3, padding=1, bias=False), out = F.relu(out1), out = MaskedBatchNorm1d(448), and out = nn.Dropout(0.2) are invoked in sequence to obtain the matrix X_B,C,L.
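The calls above mix module construction with invocation; a runnable sketch of the same block, with torch's standard nn.BatchNorm1d standing in for the patent's custom MaskedBatchNorm1d, might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureBlock(nn.Module):
    """Sketch of the pytorch function layer: Conv1d -> ReLU -> (masked) BN -> Dropout.
    nn.BatchNorm1d is a stand-in for the patent's MaskedBatchNorm1d(448)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(57, 448, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm1d(448)
        self.drop = nn.Dropout(0.2)  # probability that a neuron is not activated

    def forward(self, x):                       # x: (B, 57, L)
        out1 = self.conv(x)                     # first output matrix, (B, 448, L)
        out = self.drop(self.bn(F.relu(out1)))  # second output matrix, (B, 448, L)
        return out1, out

block = FeatureBlock().eval()
x = torch.randn(2, 57, 16)  # toy batch: 2 chains, 57 features, length 16
with torch.no_grad():
    out1, out = block(x)
```

out1 is the first output matrix later added to the local features, and out is the second output matrix passed to the MobileNet v2 layer.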

A MobileNet v2 layer for iteratively processing the second output matrix to obtain the local feature matrix of the protein sequences.

The second output matrix produced by the nn.Dropout function is taken as the initial input matrix of the MobileNet v2 layer, yielding an initial output matrix from the MobileNet v2 layer.

The initial output matrix is then fed back into the MobileNet v2 layer, and each intermediate output matrix in turn serves as the next input over multiple iterations, yielding the local feature matrix between protein sequences.

Specifically, the local feature-extraction code of the MobileNet v2 network is executed iteratively n times: the matrix X B,C,L is the first input, and the i-th output serves as the (i+1)-th input, where i = 1, 2, 3, …, n-1, yielding the local feature matrix XL between the amino acids in a sequence.
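The feed-back iteration described above (output of pass i becomes input of pass i+1) can be sketched generically in Python; `block` here is a hypothetical stand-in for the MobileNet v2 local-feature-extraction code, which the patent does not reproduce:

```python
def iterate_block(block, x, n):
    """Apply the same feature-extraction block n times,
    feeding the i-th output back in as the (i+1)-th input."""
    out = x
    for _ in range(n):
        out = block(out)
    return out

# Toy stand-in block: an element-wise update that preserves shape,
# mimicking how each MobileNet v2 pass keeps the (B, C, L) dimensions.
double = lambda v: [2 * e for e in v]
print(iterate_block(double, [1, 2, 3], 3))  # [8, 16, 24]
```

With n = 0 the input passes through unchanged, so the iteration count directly controls the depth of local feature refinement.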

A Transformer layer that processes the third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain the association matrix between protein sequences.

The Transformer has two encoder layers, and each encoder holds a set of eight attention-head matrices.

Specifically, the local feature matrix XL and the feature out1 are added to obtain XR. The two-layer, eight-head TransformerEncoder program is then run with XR as input, yielding the association matrix XE over all amino-acid sequences.

A two-layer bidirectional gated recurrent unit layer that processes the association matrix between protein sequences to obtain the global feature matrix.

Specifically, a two-layer bidirectional gated recurrent unit (BiGRU) is run with the association matrix XE between amino-acid sequences as input, producing the global feature matrix XG of the protein sequence.

A convolution layer and a fully connected layer that process the global feature matrix in sequence to obtain the protein secondary-structure prediction.

The output channels of both the convolution layer and the fully connected layer are 57; the convolution kernel of the convolution layer is 1; and the output size of the fully connected layer is 8.

Specifically, a one-dimensional convolution with 57 output channels and kernel size 1 is run with the global feature matrix XG as input to obtain the final feature XF. A fully connected layer with input size 57 and output size 8 is then run with XF as input, yielding the protein secondary-structure prediction P.
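The shape bookkeeping implied above can be checked with the standard 1-D convolution output-length formula (stride 1 and dilation 1 assumed, as nowhere does the patent state otherwise):

```python
def conv1d_out_len(length, kernel, padding=0, stride=1):
    """Output length of a 1-D convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1

# Kernel 3 with padding 1 preserves sequence length (the pytorch function layer) ...
assert conv1d_out_len(700, 3, padding=1) == 700
# ... and so does kernel 1 with no padding (the output convolution layer),
# so every residue keeps a prediction slot through the whole pipeline.
assert conv1d_out_len(700, 1) == 700
```

This is why the per-residue eight-state labels line up position-for-position with the input sequence.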

After the convolution layer and the fully connected layer that sequentially process the global feature matrix to obtain the predicted secondary structure, the model further includes:

A label-distribution-aware margin loss function computes the prediction error, which is back-propagated to update the parameters β and γ of the batch-normalization layer.
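A plain-Python sketch of the idea behind a label-distribution-aware margin (LDAM) loss: the logit of the true class is reduced by a per-class margin that grows as the class count shrinks, so rare secondary-structure classes are separated more aggressively. The constant C, the scale s, and the n^(−1/4) margin follow the commonly used LDAM formulation and are assumptions, not quoted from the patent:

```python
import math

def ldam_loss(logits, label, class_counts, C=0.5, s=1.0):
    """Label-distribution-aware margin loss for one sample.
    Rare classes (small n_j) get a larger margin delta_j = C / n_j**0.25."""
    margins = [C / (n ** 0.25) for n in class_counts]
    adjusted = list(logits)
    adjusted[label] -= margins[label]      # shrink the true-class logit
    scaled = [s * z for z in adjusted]
    log_sum = math.log(sum(math.exp(z) for z in scaled))
    return log_sum - scaled[label]         # cross-entropy on adjusted logits

# A rare class (10 samples) incurs a larger margin, hence a larger loss,
# than the same prediction for a common class (10000 samples).
loss_rare = ldam_loss([2.0, 1.0, 0.5], 0, [10, 10000, 10000])
loss_common = ldam_loss([2.0, 1.0, 0.5], 0, [10000, 10000, 10000])
assert loss_rare > loss_common
```

This asymmetry is what lets the model keep accuracy on common states H and E while pushing up the rare classes B, G, and S.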

Specifically, the label-distribution-aware margin loss code computes the prediction error, which is back-propagated to update the parameters of the batch-normalization layer, including β, γ, and the network weights. The maximum number of iteration steps is set to 200, the learning rate to 0.0001, and the batch size to 16. The model is trained iteratively; training ends when the prediction accuracy on the validation set no longer improves or the maximum number of iteration steps is reached, yielding the Transformer-based multi-class imbalanced protein secondary-structure prediction model.
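The stopping rule above (train until validation accuracy stops improving or 200 steps are reached) can be sketched as follows; the `patience` threshold is an assumption, since the patent only says training stops when accuracy no longer improves:

```python
def train_with_early_stopping(step_fn, max_steps=200, patience=5):
    """Run training steps until validation accuracy stops improving
    for `patience` consecutive evaluations or max_steps is reached.
    step_fn() performs one training iteration and returns the
    current validation accuracy."""
    best_acc, best_step, stale = 0.0, 0, 0
    for step in range(1, max_steps + 1):
        acc = step_fn()
        if acc > best_acc:
            best_acc, best_step, stale = acc, step, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_acc, best_step

# Simulated accuracy curve: improves for three steps, then plateaus.
curve = iter([0.60, 0.65, 0.70, 0.70, 0.69, 0.70, 0.68, 0.70, 0.70, 0.70])
best, at = train_with_early_stopping(lambda: next(curve), patience=3)
print(best, at)  # 0.7 3
```

In the patent's setup, `step_fn` would run one optimizer step over a batch of 16 at learning rate 0.0001 and return the validation-set accuracy.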

Table 1

Table 1 compares the proposed model with state-of-the-art methods on the CB513 dataset; bold indicates the best performance. As the table shows, the model clearly outperforms the other baseline methods on CB513: it preserves the prediction accuracy of the common classes among the eight secondary-structure states while also improving prediction of the rare classes B, G, and S.

Table 2

Table 2 compares with state-of-the-art methods on the CASP12 dataset; bold indicates the best performance. On CASP12 the overall prediction accuracy is 0.02% lower than that of MUFold_SS, but for the rarer classes B, S, and G the accuracy improves substantially over the other methods; because these classes are few in number, they have little effect on the overall accuracy.

Table 3

Table 3 compares with state-of-the-art methods on the CASP13 dataset; bold indicates the best performance. On CASP13 the overall prediction accuracy is the highest, and the rare classes improve markedly, though accuracy on classes L and T still lags behind other methods.

Table 4

Table 4 compares with state-of-the-art methods on the CASP14 dataset; bold indicates the best performance. On CASP14 the overall accuracy improves by 0.05% over MUFold_SS; the overall gain is limited, but the accuracy on rare classes improves considerably. These results show that the lower a structural class's proportion (the smaller its sample size), the lower its Q8 accuracy tends to be; in particular, the "I" class among the eight states occurs with probability below one in a thousand, and almost no method predicts it correctly. There is reason to believe that if the sample sizes of the low-frequency subclasses could be expanded, prediction would improve further; data augmentation for protein secondary-structure prediction is therefore a direction worth studying.

Figure 2 is a histogram comparing the performance of the multi-class imbalanced protein secondary-structure prediction method under different input features;

Figure 3 is a histogram comparing the performance of the method under different BN settings, where BN refers to the batch-normalization layer (Batch Normalization);

Figure 4 is a histogram comparing the performance of the method with and without the Transformer module.

As the information source and basis for structure prediction, different amino-acid encodings carry different amounts of evolutionary information. The study of feature representation was conducted under the model parameters that produced the best results in the experiments above: first the effect of each encoding on prediction accuracy was analyzed individually, and then the encodings were combined. Figure 2 shows the effect of nine different input features on prediction performance, measured by Q8 accuracy on the validation set.

Specifically, PSSM and HMM encodings were first considered separately; one-hot and physicochemical-property encodings are position-independent and contain no evolutionary information, so they were not evaluated alone. The results show PSSM outperforming HMM, possibly because PSSM carries richer evolutionary information. PSSM and HMM were then each combined with one-hot and physicochemical-property encodings, and with each other. When PSSM was combined with one-hot encoding or with physicochemical-property encoding, model performance was similar in the two cases, whereas HMM combined with either performed worse than the corresponding PSSM combinations, again supporting the conclusion that PSSM is richer in information or better suited to secondary-structure prediction. That the PSSM + physicochemical combination performed below PSSM + one-hot was contrary to expectation, but the later experiments indicate why physicochemical properties still matter: as proteins evolve toward stable structures, the differing properties of amino acids shape how each residue interacts with its neighbors in the sequence and thereby influence the structure.

When PSSM and HMM encodings were combined, prediction performance improved markedly over the previous four settings. PSSM captures the reduced probability that a given amino acid mutates into others across different sequences while a stable structure forms, whereas HMM encoding captures the match-state probabilities, transition frequencies, and local diversity of the amino acids; the two therefore contain complementary information useful for secondary-structure prediction. Building on this, the PSSM + HMM combination was further combined with one-hot encoding and with physicochemical-property encoding. The results show that adding physicochemical properties is the best encoding scheme, reaching the highest prediction accuracy, in line with expectation.

For the improved batch-normalization (Batch Normalization) layer, as shown in Figure 3, introducing the mask matrix raises Q8 accuracy by 0.23% and F1 by about 0.05% on the CB513 test set. Evidently the feature vectors at padded positions can become non-zero during feature extraction, which would distort the secondary-structure predictions for the amino acids at non-padded positions; introducing the mask matrix mitigates this to some extent, improving the accuracy of feature extraction and in turn the accuracy of secondary-structure prediction.
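The masked statistics behind the improved BN layer can be illustrated framework-free (a simplified sketch over a flat list of values; the actual MaskedBatchNorm1d is the authors' PyTorch layer operating on (B, C, L) tensors):

```python
import math

def masked_batch_norm(x, mask, beta=0.0, gamma=1.0, eps=0.01):
    """Normalize a list of values using only unmasked (real) positions,
    so zero-padded positions do not distort the mean and variance.
    x: values; mask: 1 for real residues, 0 for padding."""
    n = sum(mask)
    mu = sum(v * m for v, m in zip(x, mask)) / n
    var = sum(m * (v - mu) ** 2 for v, m in zip(x, mask)) / n
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta if m else 0.0
            for v, m in zip(x, mask)]

# Padding zeros (mask = 0) are excluded from the statistics and stay zero,
# so they cannot leak into the normalization of real residues.
out = masked_batch_norm([1.0, 3.0, 0.0, 0.0], [1, 1, 0, 0])
```

Without the mask, the two padding zeros would drag the mean from 2.0 down to 1.0 and inflate the variance, which is exactly the distortion the patent's masked layer avoids.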

Figure 4 compares network performance with and without the feature-enhancement module. As shown, adding the module raises prediction accuracy by 0.16% and the F1 value by 0.79%, indicating that placing a two-layer, eight-head Transformer Encoder before the long-range stage strengthens the expression of associations between residues in the sequence, allowing the subsequent BiGRU to capture long-range dependencies more flexibly.

In summary, the invention obtains the secondary-structure prediction for a target protein sequence by feeding it into a pre-built Transformer-based multi-class imbalanced protein secondary-structure prediction model comprising: a data preprocessing layer that applies weighting and batch normalization to the input samples; a pytorch function layer that processes the preprocessed input samples into the first and second output matrices; a MobileNet v2 layer that iteratively processes the second output matrix into the local feature matrix between protein sequences; a Transformer layer that processes the third output matrix, the sum of the first output matrix and the local feature matrix, into the association matrix between protein sequences; a two-layer bidirectional gated recurrent unit layer that processes the association matrix into the global feature matrix; and a convolution layer and a fully connected layer that process the global feature matrix in sequence into the secondary-structure prediction. In predicting protein secondary structure, the method has low global dependence on amino acids and improves prediction accuracy for rare secondary-structure classes.

103. Output the secondary-structure prediction P of the target protein sequence to be predicted.

Figure 5 is a schematic diagram of the structure 200 of a multi-class imbalanced protein secondary-structure prediction system. As shown in Figure 5, the apparatus includes:

Input module 201, for obtaining the target protein sequence to be predicted.

Processing module 202, for inputting the target protein sequence to be predicted into the pre-built Transformer-based multi-class imbalanced protein secondary-structure prediction model to obtain the predicted secondary structure of the target sequence. The pre-built model includes:

A data preprocessing layer for weighting and batch-normalizing the input sample data.

A pytorch function layer for processing the preprocessed input samples to obtain the first output matrix and the second output matrix.

A MobileNet v2 layer for iteratively processing the second output matrix to obtain the local feature matrix between protein sequences.

A Transformer layer for processing the third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain the association matrix between protein sequences.

A two-layer bidirectional gated recurrent unit layer for processing the association matrix between protein sequences to obtain the global feature matrix.

A convolution layer and a fully connected layer for processing the global feature matrix in sequence to obtain the protein secondary-structure prediction.

Output module 203, for outputting the predicted secondary structure of the target protein sequence to be predicted.

Those skilled in the art will appreciate that the invention may take the form of a computer program product implemented on one or more computer-usable storage media containing computer-usable program code. Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the appended claims and their equivalents.

Claims (6)

1. A multi-class unbalanced protein secondary structure prediction method, comprising:
obtaining a target protein sequence to be predicted;
inputting a target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure prediction value of the target protein sequence to be predicted; the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises the following components:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix;
the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
the Transformer layer is used for processing a third output matrix obtained by adding the first output matrix and the local feature matrix to obtain an association matrix between protein sequences;
the two-layer bidirectional gated recurrent unit layer is used for processing the association matrix among protein sequences to obtain a global feature matrix;
the convolution layer and the full connection layer are used for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
outputting a secondary structure predicted value of the target protein sequence to be predicted;
the pytorch function layer comprises an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function and an nn.Dropout function which are connected in sequence, wherein the parameters of the nn.Conv1d function comprise 57 output channels, 448 input channels, a convolution kernel length of 3, padding=1, and bias=False; the input of the F.relu function is the output of the nn.Conv1d function; the dimension of the MaskedBatchNorm1d function is 448; and the probability that neurons of the nn.Dropout function are not activated is 0.2;
the Transformer has two coding layers, and each encoder is provided with a set of eight attention head matrixes;
the output channels of the convolution layer and the full connection layer are 57, the convolution kernel of the convolution layer is 1, and the output size of the full connection layer is 8.
2. The method of claim 1, wherein the step of weighting and batch normalizing the input sample data comprises:
processing input sample data according to a feature matrix obtained by horizontally splicing a position-specificity scoring matrix, a hidden Markov model feature matrix and an amino acid physicochemical property feature matrix, to obtain weighted sample data;
respectively obtaining mask matrixes with all initial elements being false and input feature matrixes with all initial elements being 0;
obtaining an updated mask matrix and an updated input feature matrix according to the protein chain index, protein chain length and maximum protein chain length obtained by traversing the protein sequences of each batch in the weighted sample data;
respectively calculating the mean value and the variance of the updated mask matrix and the updated input feature matrix by using a preset first formula and a preset second formula;
and processing the mean and variance by using a batch normalization layer pre-constructed according to a preset third formula to obtain a batch normalization output matrix.
3. The multi-class unbalanced protein secondary structure prediction method according to claim 2, wherein the preset third formula comprises:
wherein μ represents the mean obtained by processing the updated mask matrix and the updated input feature matrix using the preset first formula, var represents the variance obtained by processing the updated mask matrix and the updated input feature matrix using the preset second formula, X B,C,max_L represents the input weighted sample data, X* represents the output of the batch normalization layer, and β and γ are parameters to be learned in the batch normalization layer, where ε=0.01;
the preset first formula and the preset second formula are respectively as follows:
wherein μ represents the mean value of all sample data, var1 represents the variance of all sample data, M bl represents the mask matrix, and X bcl is the element value of each amino acid vector in each protein sequence; B is the batch size set during training, namely Batchsize, here set to 16; b refers to traversing the batch of 16 proteins, so b=[1,16]; max_L refers to the maximum sequence length in the batch of proteins; and L represents the sequence length of the proteins in the batch.
4. The method of claim 1, wherein iteratively processing the second output matrix comprises:
taking the second output matrix output by the nn.Dropout function as an initial input matrix of a MobileNet v2 layer to obtain an initial output matrix;
and (3) taking the initial output matrix as the input of the MobileNet v2 layer, and carrying out multiple iterations on the obtained intermediate output matrix processed by the MobileNet v2 layer as the input of the MobileNet v2 layer to obtain the local feature matrix among protein sequences.
5. The multi-class unbalanced protein secondary structure prediction method according to claim 3, wherein after the convolution layer and the full-connection layer for sequentially processing the global feature matrix to obtain the predicted value of the secondary structure of the protein, the method further comprises:
and calculating a prediction error by using a label distribution perception marginal loss function, and reversely propagating and updating parameters beta and gamma of the batch normalization layer by the obtained prediction error.
6. A multi-class unbalanced protein secondary structure prediction system, comprising:
the input module is used for acquiring a target protein sequence to be predicted;
the processing module is used for inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises the following components:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix;
the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
the Transformer layer is used for processing a third output matrix obtained by adding the first output matrix and the local feature matrix to obtain an association matrix between protein sequences;
the two-layer bidirectional gated recurrent unit layer is used for processing the association matrix among protein sequences to obtain a global feature matrix;
the convolution layer and the full connection layer are used for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
the output module is used for outputting the predicted value of the secondary structure of the target protein sequence to be predicted;
the pytorch function layer comprises an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function and an nn.Dropout function which are connected in sequence, wherein the parameters of the nn.Conv1d function comprise 57 output channels, 448 input channels, a convolution kernel length of 3, padding=1, and bias=False; the input of the F.relu function is the output of the nn.Conv1d function; the dimension of the MaskedBatchNorm1d function is 448; and the probability that neurons of the nn.Dropout function are not activated is 0.2;
the Transformer has two coding layers, and each encoder is provided with a set of eight attention head matrixes;
the output channels of the convolution layer and the full connection layer are 57, the convolution kernel of the convolution layer is 1, and the output size of the full connection layer is 8.
CN202311804115.1A 2023-12-26 2023-12-26 Multi-class unbalanced protein secondary structure prediction method and system Active CN117476106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311804115.1A CN117476106B (en) 2023-12-26 2023-12-26 Multi-class unbalanced protein secondary structure prediction method and system


Publications (2)

Publication Number Publication Date
CN117476106A (en) 2024-01-30
CN117476106B (en) 2024-04-02


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118658528B (en) * 2024-08-20 2025-01-21 电子科技大学长三角研究院(衢州) A method for constructing a specific myoglobin prediction model

Citations (7)

Publication number Priority date Publication date Assignee Title
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 A Convolutional Neural Network Model for Predicting Protein Interactions Using Protein Primary Sequences Based on Attention Mechanism
CN112767997A (en) * 2021-02-04 2021-05-07 齐鲁工业大学 Protein secondary structure prediction method based on multi-scale convolution attention neural network
CN113178229A (en) * 2021-05-31 2021-07-27 吉林大学 Deep learning-based RNA and protein binding site recognition method
CN114974397A (en) * 2021-02-23 2022-08-30 腾讯科技(深圳)有限公司 Training method of protein structure prediction model and protein structure prediction method
CN115458039A (en) * 2022-08-08 2022-12-09 北京分子之心科技有限公司 Single-sequence protein structure prediction method and system based on machine learning
CN115662501A (en) * 2022-10-25 2023-01-31 浙江大学杭州国际科创中心 Protein generation method based on position specificity weight matrix
CN116486900A (en) * 2023-04-25 2023-07-25 徐州医科大学 Drug target affinity prediction method based on depth mode data fusion

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20220122689A1 (en) * 2020-10-15 2022-04-21 Salesforce.Com, Inc. Systems and methods for alignment-based pre-training of protein prediction models


Non-Patent Citations (1)

Title
Yang Lu, Dong Hongwei. Protein secondary structure prediction based on self-attention mechanism and GAN. China Science and Technology Papers Online Premium Papers, 2023-06-15, Vol. 16, No. 02: 148-159. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250121

Address after: Room 501, 5th Floor, Building 15, Northwest University Science and Technology Park, Yongchang Road, High tech Industrial Development Zone, Xianyang City, Shaanxi Province, 712023

Patentee after: Shaanxi Qianruanhui Information Technology Co.,Ltd.

Country or region after: China

Address before: Room 001, F2005, 20th Floor, Building 4-A, Xixian Financial Port, Fengdong New City Energy Jinmao District, Xixian New District, Xi'an City, Shaanxi Province, 710000

Patentee before: Xi'an Huisuan Intelligent Technology Co.,Ltd.

Country or region before: China