CN117476106B - Multi-class unbalanced protein secondary structure prediction method and system - Google Patents

Info

Publication number: CN117476106B
Authority: CN (China)
Prior art keywords: matrix, layer, secondary structure, output, function
Legal status: Active
Application number: CN202311804115.1A
Other languages: Chinese (zh)
Other versions: CN117476106A
Inventors: 何朝政, 赵君研, 肖秦琨, 付玲
Current Assignee: Shaanxi Qianruanhui Information Technology Co ltd
Original Assignee: Xi'an Huisuan Intelligent Technology Co ltd
Events: application filed by Xi'an Huisuan Intelligent Technology Co ltd; priority to CN202311804115.1A; publication of CN117476106A; application granted; publication of CN117476106B

Classifications

    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Abstract

The invention discloses a multi-class imbalanced protein secondary structure prediction method and system, in the technical field of computer-aided drug development. The method comprises obtaining a target protein sequence to be predicted, inputting it into a pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model, obtaining the predicted secondary structure of the sequence, and outputting the prediction. When predicting the secondary structure of a protein sequence, the method exhibits low global dependence on amino acids and improves the prediction accuracy for rare secondary-structure classes.

Description

A multi-class imbalanced protein secondary structure prediction method and system

Technical Field

The present invention relates to the technical field of computer-aided drug development, and in particular to a multi-class imbalanced protein secondary structure prediction method and system.

Background Art

Proteins are important organic compounds that build and repair human tissue, and a large number of computational methods have emerged around the protein secondary structure prediction problem. Two commonly used methods are DeepACLSTM and MUFold-SS. The DeepACLSTM model uses protein sequence features and profile features, where the profile is a sequence feature spectrum built from multiple sequence alignments, specifically a position-specific scoring matrix (PSSM); it combines an asymmetric convolutional neural network (ACNN), built from 1×42 one-dimensional convolutions and 3×1 two-dimensional convolutions, with a bidirectional long short-term memory (BiLSTM) network to capture local and global correlations between amino-acid residues. The feature matrix of the MUFold-SS model is composed of amino-acid physicochemical properties, PSI-BLAST profiles, and HHblits profiles, and thus contains rich evolutionary information; a deep convolutional network built from parallel convolutions with kernel sizes 1 and 3 extracts local and global correlations between amino acids. However, when predicting protein secondary structure, both methods suffer from strong global dependence on amino acids and low prediction accuracy for rare secondary-structure classes.

Summary of the Invention

To address the deficiencies of the background art above, chiefly the prior art's strong global dependence on amino acids and low prediction accuracy for rare secondary-structure classes when predicting protein secondary structure, the present invention provides a multi-class imbalanced protein secondary structure prediction method and system with low global dependence on amino acids and improved prediction accuracy for rare secondary-structure classes.

To achieve the above objective, a first aspect of the present invention provides a multi-class imbalanced protein secondary structure prediction method, comprising:

obtaining a target protein sequence to be predicted;

inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model to obtain a secondary structure prediction for the sequence. Multi-class protein secondary structure is divided into 8 states, denoted L, B, E, G, I, H, S and T. Among these 8 states, B, G and S occur at low proportions, especially G; such rare classes are easily swamped during training, which lowers their prediction accuracy, so protein secondary structure presents a multi-class imbalance problem. The pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model comprises:

a data preprocessing layer for weighting and batch-normalizing the input sample data;

a pytorch function layer for processing the preprocessed input samples to obtain a first output matrix and a second output matrix;

a MobileNet v2 layer for iteratively processing the second output matrix to obtain a local feature matrix of the protein sequences;

a Transformer layer for processing the third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences;

a two-layer bidirectional gated recurrent unit layer for processing the association matrix between protein sequences to obtain a global feature matrix;

a convolutional layer and a fully connected layer for sequentially processing the global feature matrix to obtain the predicted secondary structure;

outputting the predicted secondary structure of the target protein sequence.

Optionally, weighting and batch-normalizing the input sample data comprises:

processing the input sample data with a feature matrix obtained by horizontally concatenating a position-specific scoring matrix, a hidden Markov model feature matrix, and an amino-acid physicochemical property feature matrix, to obtain weighted sample data;

obtaining a mask matrix with all elements initialized to false and an input feature matrix with all elements initialized to 0;

obtaining an updated mask matrix and an updated input feature matrix according to the protein chain index, protein chain length, and maximum protein chain length found by traversing the protein sequences of each batch of the weighted sample data;

computing the mean and variance over the updated mask matrix and updated input feature matrix using a preset first formula and a preset second formula, respectively;

processing the mean and variance with a batch normalization layer pre-constructed from a preset third formula, to obtain the batch-normalized output matrix.

Optionally, the preset third formula includes:

where μ denotes the mean obtained by applying the preset first formula to the updated mask matrix and updated input feature matrix, var denotes the variance obtained by applying the preset second formula to them, X_B,C,max_L denotes the weighted input sample data, X* denotes the output of the batch normalization layer, and β and γ are the learnable parameters of the batch normalization layer, with ε = 0.01;

where the preset first formula and the preset second formula are, respectively:

where μ denotes the mean of all sample data, var_1 the variance of all sample data, M_bl the mask matrix, and X_bcl the element value of each amino-acid vector in each protein sequence; B is the batch size (Batchsize) set during training, 16 here; b indexes the proteins of the batch, so b = [1, 16]; max_L is the maximum sequence length in the batch; and L is the sequence length of each protein in the batch;

Optionally, the pytorch function layer comprises, connected in sequence, an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function, and an nn.Dropout function, where the nn.Conv1d function has 57 input channels, 448 output channels, a convolution kernel of length 3, padding=1, and bias=False; the input of the F.relu function is the output of the nn.Conv1d function; the dimension of the MaskedBatchNorm1d function is 448; and the probability that a neuron is not activated by the nn.Dropout function is 0.2.

Optionally, iteratively processing the second output matrix comprises:

taking the second output matrix produced by the nn.Dropout function as the initial input matrix of the MobileNet v2 layer, to obtain an initial output matrix;

using the initial output matrix as input to the MobileNet v2 layer, and feeding each intermediate output matrix produced by the MobileNet v2 layer back in as its input over multiple iterations, to obtain the local feature matrix of the protein sequences.

Optionally, the Transformer has two encoder layers, and each encoder has a set of eight attention-head matrices.

Optionally, the convolutional layer and the fully connected layer both have 57 output channels; the convolution kernel of the convolutional layer is 1, and the output size of the fully connected layer is 8.

Optionally, after the convolutional layer and fully connected layer that sequentially process the global feature matrix to obtain the predicted secondary structure, the method further comprises:

computing the prediction error with a label-distribution-aware margin loss function, and back-propagating the error to update the parameters β and γ of the batch normalization layer.
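The label-distribution-aware margin (LDAM) loss referenced above is not written out in the patent; a minimal numpy sketch of the standard formulation (Cao et al., 2019), with hypothetical class counts for the 8 secondary-structure states, might look like:

```python
import numpy as np

def ldam_loss(logits, labels, class_counts, C=0.5, s=30.0):
    """Label-Distribution-Aware Margin (LDAM) loss: class j gets margin
    C / n_j**0.25, so rarer classes receive larger margins."""
    margins = C / np.power(np.asarray(class_counts, dtype=float), 0.25)
    z = np.array(logits, dtype=float)
    idx = np.arange(len(labels))
    z[idx, labels] -= margins[labels]      # push the true-class logit down by its margin
    z = s * z                              # scale before softmax
    z = z - z.max(axis=1, keepdims=True)   # numerically stable log-softmax
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[idx, labels].mean()   # cross-entropy on margin-shifted logits

# Hypothetical counts of the 8 states L, B, E, G, I, H, S, T (G and I rare)
counts = [100, 5, 60, 3, 1, 80, 4, 40]
loss = ldam_loss(np.zeros((2, 8)), np.array([0, 3]), counts)
```

Because the margin shrinks with class frequency, rare classes such as G are pushed further from the decision boundary during training.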

Another aspect of the present invention provides a multi-class imbalanced protein secondary structure prediction system, comprising:

an input module for obtaining the target protein sequence to be predicted;

a processing module for inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model, to obtain its predicted secondary structure; the pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model comprises:

a data preprocessing layer for weighting and batch-normalizing the input sample data;

a pytorch function layer for processing the preprocessed input samples to obtain a first output matrix and a second output matrix;

a MobileNet v2 layer for iteratively processing the second output matrix to obtain a local feature matrix of the protein sequences;

a Transformer layer for processing the third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences;

a two-layer bidirectional gated recurrent unit layer for processing the association matrix between protein sequences to obtain a global feature matrix;

a convolutional layer and a fully connected layer for sequentially processing the global feature matrix to obtain the predicted secondary structure; and

an output module for outputting the predicted secondary structure of the target protein sequence.

Compared with the prior art, the beneficial effects of the present invention are:

The multi-class imbalanced protein secondary structure prediction method and system provided by the present invention input the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model and obtain its predicted secondary structure. The model comprises: a data preprocessing layer for weighting and batch-normalizing the input sample data; a pytorch function layer for processing the preprocessed input samples to obtain a first output matrix and a second output matrix; a MobileNet v2 layer for iteratively processing the second output matrix to obtain a local feature matrix of the protein sequences; a Transformer layer for processing the third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain an association matrix between protein sequences; a two-layer bidirectional gated recurrent unit layer for processing the association matrix to obtain a global feature matrix; and a convolutional layer and fully connected layer for sequentially processing the global feature matrix to obtain the predicted secondary structure. When predicting protein secondary structure, the method exhibits low global dependence on amino acids and improves the prediction accuracy for rare secondary-structure classes.

Brief Description of the Drawings

Figure 1 is a flow chart of the multi-class imbalanced protein secondary structure prediction method provided by the present invention;

Figure 2 is a histogram comparing model performance under different features;

Figure 3 is a histogram comparing model performance under different batch normalization (BN) schemes;

Figure 4 is a histogram comparing model performance with and without the Transformer;

Figure 5 is a schematic structural diagram of the multi-class imbalanced protein secondary structure prediction system.

Detailed Description

The present invention is further described below with reference to specific embodiments and the drawings, but the embodiments are not intended to limit the present invention.

Figure 1 is a flow chart of the multi-class imbalanced protein secondary structure prediction method provided by the present invention. As shown in Figure 1, the method comprises:

101. Obtain the target protein sequence to be predicted.

102. Input the target protein sequence to be predicted into the pre-constructed Transformer-based multi-class imbalanced protein secondary structure prediction model, to obtain its predicted secondary structure. The pre-constructed model comprises:

A data preprocessing layer for weighting and batch-normalizing the input sample data.

The weighting and batch normalization of the input sample data comprise the following steps:

Process the input sample data with a feature matrix obtained by horizontally concatenating a position-specific scoring matrix, a hidden Markov model feature matrix, and an amino-acid physicochemical property feature matrix, to obtain weighted sample data.

Specifically, the PSI-BLAST convergence parameter e is set to 0.001, and PSI-BLAST is run against the Uniref database for two iterations to generate a position-specific scoring matrix (PSSM) of size L×20, where L is the protein sequence length.

HHblits is run against the Uniprot20 database for four iterations to generate a hidden Markov model (HMM) feature matrix of size L×30. Uniprot20 is a protein sequence database containing the sequence information of all species and all proteins in UniProtKB (the protein knowledge base), 70,432,686 protein sequences in total. It comprises two parts, Swiss-Prot and TrEMBL: Swiss-Prot contains manually annotated, high-quality protein sequences, while TrEMBL contains automatically annotated protein sequences and unverified predicted sequences. It can be downloaded from http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/old-releases/.

The position-specific scoring matrix, the hidden Markov model feature matrix, and seven amino-acid physicochemical property feature matrices are concatenated and fused; the physicochemical properties are sheet probability, helix probability, isoelectric point, hydrophobicity, van der Waals volume, polarizability, and graph shape index. This yields a feature matrix of size L×57. The fused matrix provides the model with rich information and effectively improves its prediction accuracy.
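As a sanity check of the dimensions above, the horizontal concatenation of the three feature blocks can be sketched as follows (random placeholder values stand in for real PSI-BLAST/HHblits output):

```python
import numpy as np

L = 100  # placeholder protein length
pssm = np.random.rand(L, 20)      # PSI-BLAST position-specific scoring matrix, L x 20
hmm = np.random.rand(L, 30)       # HHblits hidden Markov model profile, L x 30
physchem = np.random.rand(L, 7)   # seven physicochemical properties per residue, L x 7

features = np.hstack([pssm, hmm, physchem])  # horizontal concatenation -> L x 57
```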

A mask matrix with all elements initialized to false and an input feature matrix with all elements initialized to 0 are obtained.

The protein chain index is obtained by traversing the protein sequences of each batch of the weighted sample data, and the updated mask matrix and updated input feature matrix are obtained using the protein chain length and the maximum protein chain length.

Specifically, a mask matrix is introduced into the batch normalization layer of the network. Each batch of B protein chains is traversed to obtain each chain's index batch_idx within the batch, its length L, and the maximum chain length max_L of the batch. All elements of the mask matrix M_B,max_L are initialized to false and all elements of the input feature matrix X_B,C,max_L to 0. Each batch_idx corresponds to one row of M_B,max_L; according to batch_idx and ProteinLen, elements 0 to ProteinLen-1 of that row are updated to True, where ProteinLen is the true length of the protein sequence, giving the final mask matrix M_B,max_L for each batch. For X_B,C,max_L, each batch_idx corresponds to one protein chain, and X is filled into X_B,C,max_L according to X[batch_idx, ProteinLen] = X, giving the final X_B,C,max_L. In implementation, the input X has dimensions B×L×F and Masks has dimensions B×L; the padded positions of X are masked out, the unpadded positions are extracted into a new tensor whose dimensions are adjusted according to num_features (the length of each amino-acid feature vector), and the mean and variance of the batch normalization layer are computed from this new tensor.
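The mask and padding procedure above can be sketched as follows (a minimal numpy version; names such as build_mask_and_features are illustrative, not from the patent):

```python
import numpy as np

def build_mask_and_features(chains, num_features=57):
    """chains: list of (ProteinLen_i, num_features) per-residue feature arrays.
    Returns the mask M[B, max_L] and the zero-padded features X[B, num_features, max_L]."""
    B = len(chains)
    max_L = max(c.shape[0] for c in chains)           # longest chain in the batch
    M = np.zeros((B, max_L), dtype=bool)              # all elements initially false
    X = np.zeros((B, num_features, max_L))            # all elements initially 0
    for batch_idx, chain in enumerate(chains):
        protein_len = chain.shape[0]
        M[batch_idx, :protein_len] = True             # mark the real residue positions
        X[batch_idx, :, :protein_len] = chain.T       # fill features; tail stays zero-padded
    return M, X

# Two toy chains of lengths 5 and 3 with 57 features each
chains = [np.ones((5, 57)), np.ones((3, 57))]
M, X = build_mask_and_features(chains)
```

Masked positions are then excluded when computing the batch-norm statistics, which is exactly why the mask is introduced in the following paragraph.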

The mask matrix is introduced to prevent the zero-padded positions from degrading the accuracy of feature extraction at other positions during subsequent feature extraction.

The mean and variance over the updated mask matrix and updated input feature matrix are computed using the preset first formula and the preset second formula, respectively.

The mean and variance are processed with a batch normalization layer pre-constructed from the preset third formula, to obtain the batch-normalized output matrix.

Specifically, the preset third formula includes:

where μ denotes the mean obtained by applying the preset first formula to the updated mask matrix and updated input feature matrix, var denotes the variance obtained by applying the preset second formula to them, X_B,C,max_L denotes the weighted input sample data, X* denotes the output of the batch normalization layer, and β and γ are the learnable parameters of the batch normalization layer, with ε = 0.01.

where the preset first formula and the preset second formula are, respectively:

where μ denotes the mean of all sample data, var_1 the variance of all sample data, M_bl the mask matrix, and X_bcl the element value of each amino-acid vector in each protein sequence; B is the batch size (Batchsize) set during training, 16 here; b indexes the proteins of the batch, so b = [1, 16]; max_L is the maximum sequence length in the batch; and L is the sequence length of each protein in the batch.
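The equation images of the three preset formulas are not reproduced in this text; from the surrounding definitions they are plausibly the masked mean, the masked variance, and the standard batch-normalization transform:

```latex
\mu = \frac{\sum_{b}\sum_{l} M_{bl}\,X_{bcl}}{\sum_{b}\sum_{l} M_{bl}}, \qquad
var_1 = \frac{\sum_{b}\sum_{l} M_{bl}\,\left(X_{bcl}-\mu\right)^{2}}{\sum_{b}\sum_{l} M_{bl}}, \qquad
X^{*} = \gamma\,\frac{X_{B,C,max\_L}-\mu}{\sqrt{var+\varepsilon}} + \beta,\quad \varepsilon = 0.01
```

This is a reconstruction under the assumption of a standard masked batch-norm design, not the patent's verbatim equations.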

A pytorch function layer for processing the preprocessed input samples to obtain a first output matrix and a second output matrix.

The pytorch function layer comprises, connected in sequence, an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function, and an nn.Dropout function, where the nn.Conv1d function has 57 input channels, 448 output channels, a convolution kernel of length 3, padding=1, and bias=False; the input of the F.relu function is the output of the nn.Conv1d function; the dimension of the MaskedBatchNorm1d function is 448; and the probability that a neuron is not activated by the nn.Dropout function is 0.2.

Specifically, B weighted and normalized samples are randomly selected from the training samples to form an input matrix X_B,C,L. With X_B,C,L as input, out1 = nn.Conv1d(57, 448, 3, padding=1, bias=False), out = F.relu(out1), out = MaskedBatchNorm1d(448), and out = nn.Dropout(0.2) are invoked in sequence to obtain the matrix X_B,C,L.
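The calls above mix module construction with invocation; a runnable sketch of the same block, with torch's standard nn.BatchNorm1d standing in for the patent's custom MaskedBatchNorm1d, might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureBlock(nn.Module):
    """Sketch of the pytorch function layer: Conv1d -> ReLU -> (masked) BN -> Dropout.
    nn.BatchNorm1d is a stand-in for the patent's MaskedBatchNorm1d(448)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(57, 448, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm1d(448)
        self.drop = nn.Dropout(0.2)  # probability that a neuron is not activated

    def forward(self, x):                       # x: (B, 57, L)
        out1 = self.conv(x)                     # first output matrix, (B, 448, L)
        out = self.drop(self.bn(F.relu(out1)))  # second output matrix, (B, 448, L)
        return out1, out

block = FeatureBlock().eval()
x = torch.randn(2, 57, 16)  # toy batch: 2 chains, 57 features, length 16
with torch.no_grad():
    out1, out = block(x)
```

out1 is the first output matrix later added to the local features, and out is the second output matrix passed to the MobileNet v2 layer.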

A MobileNet v2 layer for iteratively processing the second output matrix to obtain the local feature matrix of the protein sequences.

The second output matrix produced by the nn.Dropout function is taken as the initial input matrix of the MobileNet v2 layer, yielding an initial output matrix from the MobileNet v2 layer.

The initial output matrix is then fed back into the MobileNet v2 layer, and each intermediate output matrix in turn serves as the next input over multiple iterations, yielding the local feature matrix between protein sequences.

Specifically, the local feature-extraction code of the MobileNet v2 network is executed iteratively n times: the matrix X B,C,L is the first input, and the i-th output serves as the (i+1)-th input, where i = 1, 2, 3, …, n-1, yielding the local feature matrix XL between the amino acids in a sequence.
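The feed-back iteration described above (output of pass i becomes input of pass i+1) can be sketched generically in Python; `block` here is a hypothetical stand-in for the MobileNet v2 local-feature-extraction code, which the patent does not reproduce:

```python
def iterate_block(block, x, n):
    """Apply the same feature-extraction block n times,
    feeding the i-th output back in as the (i+1)-th input."""
    out = x
    for _ in range(n):
        out = block(out)
    return out

# Toy stand-in block: an element-wise update that preserves shape,
# mimicking how each MobileNet v2 pass keeps the (B, C, L) dimensions.
double = lambda v: [2 * e for e in v]
print(iterate_block(double, [1, 2, 3], 3))  # [8, 16, 24]
```

With n = 0 the input passes through unchanged, so the iteration count directly controls the depth of local feature refinement.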

A Transformer layer that processes the third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain the association matrix between protein sequences.

The Transformer has two encoder layers, and each encoder holds a set of eight attention-head matrices.

Specifically, the local feature matrix XL and the feature out1 are added to obtain XR. The two-layer, eight-head TransformerEncoder program is then run with XR as input, yielding the association matrix XE over all amino-acid sequences.

A two-layer bidirectional gated recurrent unit layer that processes the association matrix between protein sequences to obtain the global feature matrix.

Specifically, a two-layer bidirectional gated recurrent unit (BiGRU) is run with the association matrix XE between amino-acid sequences as input, producing the global feature matrix XG of the protein sequence.

A convolution layer and a fully connected layer that process the global feature matrix in sequence to obtain the protein secondary-structure prediction.

The output channels of both the convolution layer and the fully connected layer are 57; the convolution kernel of the convolution layer is 1; and the output size of the fully connected layer is 8.

Specifically, a one-dimensional convolution with 57 output channels and kernel size 1 is run with the global feature matrix XG as input to obtain the final feature XF. A fully connected layer with input size 57 and output size 8 is then run with XF as input, yielding the protein secondary-structure prediction P.
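The shape bookkeeping implied above can be checked with the standard 1-D convolution output-length formula (stride 1 and dilation 1 assumed, as nowhere does the patent state otherwise):

```python
def conv1d_out_len(length, kernel, padding=0, stride=1):
    """Output length of a 1-D convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1

# Kernel 3 with padding 1 preserves sequence length (the pytorch function layer) ...
assert conv1d_out_len(700, 3, padding=1) == 700
# ... and so does kernel 1 with no padding (the output convolution layer),
# so every residue keeps a prediction slot through the whole pipeline.
assert conv1d_out_len(700, 1) == 700
```

This is why the per-residue eight-state labels line up position-for-position with the input sequence.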

After the convolution layer and the fully connected layer that sequentially process the global feature matrix to obtain the predicted secondary structure, the model further includes:

A label-distribution-aware margin loss function computes the prediction error, which is back-propagated to update the parameters β and γ of the batch-normalization layer.
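A plain-Python sketch of the idea behind a label-distribution-aware margin (LDAM) loss: the logit of the true class is reduced by a per-class margin that grows as the class count shrinks, so rare secondary-structure classes are separated more aggressively. The constant C, the scale s, and the n^(−1/4) margin follow the commonly used LDAM formulation and are assumptions, not quoted from the patent:

```python
import math

def ldam_loss(logits, label, class_counts, C=0.5, s=1.0):
    """Label-distribution-aware margin loss for one sample.
    Rare classes (small n_j) get a larger margin delta_j = C / n_j**0.25."""
    margins = [C / (n ** 0.25) for n in class_counts]
    adjusted = list(logits)
    adjusted[label] -= margins[label]      # shrink the true-class logit
    scaled = [s * z for z in adjusted]
    log_sum = math.log(sum(math.exp(z) for z in scaled))
    return log_sum - scaled[label]         # cross-entropy on adjusted logits

# A rare class (10 samples) incurs a larger margin, hence a larger loss,
# than the same prediction for a common class (10000 samples).
loss_rare = ldam_loss([2.0, 1.0, 0.5], 0, [10, 10000, 10000])
loss_common = ldam_loss([2.0, 1.0, 0.5], 0, [10000, 10000, 10000])
assert loss_rare > loss_common
```

This asymmetry is what lets the model keep accuracy on common states H and E while pushing up the rare classes B, G, and S.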

Specifically, the label-distribution-aware margin loss code computes the prediction error, which is back-propagated to update the parameters of the batch-normalization layer, including β, γ, and the network weights. The maximum number of iteration steps is set to 200, the learning rate to 0.0001, and the batch size to 16. The model is trained iteratively; training ends when the prediction accuracy on the validation set no longer improves or the maximum number of iteration steps is reached, yielding the Transformer-based multi-class imbalanced protein secondary-structure prediction model.
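The stopping rule above (train until validation accuracy stops improving or 200 steps are reached) can be sketched as follows; the `patience` threshold is an assumption, since the patent only says training stops when accuracy no longer improves:

```python
def train_with_early_stopping(step_fn, max_steps=200, patience=5):
    """Run training steps until validation accuracy stops improving
    for `patience` consecutive evaluations or max_steps is reached.
    step_fn() performs one training iteration and returns the
    current validation accuracy."""
    best_acc, best_step, stale = 0.0, 0, 0
    for step in range(1, max_steps + 1):
        acc = step_fn()
        if acc > best_acc:
            best_acc, best_step, stale = acc, step, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_acc, best_step

# Simulated accuracy curve: improves for three steps, then plateaus.
curve = iter([0.60, 0.65, 0.70, 0.70, 0.69, 0.70, 0.68, 0.70, 0.70, 0.70])
best, at = train_with_early_stopping(lambda: next(curve), patience=3)
print(best, at)  # 0.7 3
```

In the patent's setup, `step_fn` would run one optimizer step over a batch of 16 at learning rate 0.0001 and return the validation-set accuracy.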

Table 1

Table 1 compares the proposed model with state-of-the-art methods on the CB513 dataset; bold indicates the best performance. As the table shows, the model clearly outperforms the other baseline methods on CB513: it preserves the prediction accuracy of the common classes among the eight secondary-structure states while also improving prediction of the rare classes B, G, and S.

Table 2

Table 2 compares with state-of-the-art methods on the CASP12 dataset; bold indicates the best performance. On CASP12 the overall prediction accuracy is 0.02% lower than that of MUFold_SS, but for the rarer classes B, S, and G the accuracy improves substantially over the other methods; because these classes are few in number, they have little effect on the overall accuracy.

Table 3

Table 3 compares with state-of-the-art methods on the CASP13 dataset; bold indicates the best performance. On CASP13 the overall prediction accuracy is the highest, and the rare classes improve markedly, though accuracy on classes L and T still lags behind other methods.

Table 4

Table 4 compares with state-of-the-art methods on the CASP14 dataset; bold indicates the best performance. On CASP14 the overall accuracy improves by 0.05% over MUFold_SS; the overall gain is limited, but the accuracy on rare classes improves considerably. These results show that the lower a structural class's proportion (the smaller its sample size), the lower its Q8 accuracy tends to be; in particular, the "I" class among the eight states occurs with probability below one in a thousand, and almost no method predicts it correctly. There is reason to believe that if the sample sizes of the low-frequency subclasses could be expanded, prediction would improve further; data augmentation for protein secondary-structure prediction is therefore a direction worth studying.

Figure 2 is a histogram comparing the performance of the multi-class imbalanced protein secondary-structure prediction method under different input features;

Figure 3 is a histogram comparing the performance of the method under different BN settings, where BN refers to the batch-normalization layer (Batch Normalization);

Figure 4 is a histogram comparing the performance of the method with and without the Transformer module.

As the information source and basis for structure prediction, different amino-acid encodings carry different amounts of evolutionary information. The study of feature representation was conducted under the model parameters that produced the best results in the experiments above: first the effect of each encoding on prediction accuracy was analyzed individually, and then the encodings were combined. Figure 2 shows the effect of nine different input features on prediction performance, measured by Q8 accuracy on the validation set.

Specifically, PSSM and HMM encodings were first considered separately; one-hot and physicochemical-property encodings are position-independent and contain no evolutionary information, so they were not evaluated alone. The results show PSSM outperforming HMM, possibly because PSSM carries richer evolutionary information. PSSM and HMM were then each combined with one-hot and physicochemical-property encodings, and with each other. When PSSM was combined with one-hot encoding or with physicochemical-property encoding, model performance was similar in the two cases, whereas HMM combined with either performed worse than the corresponding PSSM combinations, again supporting the conclusion that PSSM is richer in information or better suited to secondary-structure prediction. That the PSSM + physicochemical combination performed below PSSM + one-hot was contrary to expectation, but the later experiments indicate why physicochemical properties still matter: as proteins evolve toward stable structures, the differing properties of amino acids shape how each residue interacts with its neighbors in the sequence and thereby influence the structure.

When PSSM and HMM encodings were combined, prediction performance improved markedly over the previous four settings. PSSM captures the reduced probability that a given amino acid mutates into others across different sequences while a stable structure forms, whereas HMM encoding captures the match-state probabilities, transition frequencies, and local diversity of the amino acids; the two therefore contain complementary information useful for secondary-structure prediction. Building on this, the PSSM + HMM combination was further combined with one-hot encoding and with physicochemical-property encoding. The results show that adding physicochemical properties is the best encoding scheme, reaching the highest prediction accuracy, in line with expectation.

For the improved batch-normalization (Batch Normalization) layer, as shown in Figure 3, introducing the mask matrix raises Q8 accuracy by 0.23% and F1 by about 0.05% on the CB513 test set. Evidently the feature vectors at padded positions can become non-zero during feature extraction, which would distort the secondary-structure predictions for the amino acids at non-padded positions; introducing the mask matrix mitigates this to some extent, improving the accuracy of feature extraction and in turn the accuracy of secondary-structure prediction.
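The masked statistics behind the improved BN layer can be illustrated framework-free (a simplified sketch over a flat list of values; the actual MaskedBatchNorm1d is the authors' PyTorch layer operating on (B, C, L) tensors):

```python
import math

def masked_batch_norm(x, mask, beta=0.0, gamma=1.0, eps=0.01):
    """Normalize a list of values using only unmasked (real) positions,
    so zero-padded positions do not distort the mean and variance.
    x: values; mask: 1 for real residues, 0 for padding."""
    n = sum(mask)
    mu = sum(v * m for v, m in zip(x, mask)) / n
    var = sum(m * (v - mu) ** 2 for v, m in zip(x, mask)) / n
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta if m else 0.0
            for v, m in zip(x, mask)]

# Padding zeros (mask = 0) are excluded from the statistics and stay zero,
# so they cannot leak into the normalization of real residues.
out = masked_batch_norm([1.0, 3.0, 0.0, 0.0], [1, 1, 0, 0])
```

Without the mask, the two padding zeros would drag the mean from 2.0 down to 1.0 and inflate the variance, which is exactly the distortion the patent's masked layer avoids.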

Figure 4 compares network performance with and without the feature-enhancement module. As shown, adding the module raises prediction accuracy by 0.16% and the F1 value by 0.79%, indicating that placing a two-layer, eight-head Transformer Encoder before the long-range stage strengthens the expression of associations between residues in the sequence, allowing the subsequent BiGRU to capture long-range dependencies more flexibly.

In summary, the invention obtains the secondary-structure prediction for a target protein sequence by feeding it into a pre-built Transformer-based multi-class imbalanced protein secondary-structure prediction model comprising: a data preprocessing layer that applies weighting and batch normalization to the input samples; a pytorch function layer that processes the preprocessed input samples into the first and second output matrices; a MobileNet v2 layer that iteratively processes the second output matrix into the local feature matrix between protein sequences; a Transformer layer that processes the third output matrix, the sum of the first output matrix and the local feature matrix, into the association matrix between protein sequences; a two-layer bidirectional gated recurrent unit layer that processes the association matrix into the global feature matrix; and a convolution layer and a fully connected layer that process the global feature matrix in sequence into the secondary-structure prediction. In predicting protein secondary structure, the method has low global dependence on amino acids and improves prediction accuracy for rare secondary-structure classes.

103. Output the secondary-structure prediction P of the target protein sequence to be predicted.

Figure 5 is a schematic diagram of the structure 200 of a multi-class imbalanced protein secondary-structure prediction system. As shown in Figure 5, the apparatus includes:

Input module 201, for obtaining the target protein sequence to be predicted.

Processing module 202, for inputting the target protein sequence to be predicted into the pre-built Transformer-based multi-class imbalanced protein secondary-structure prediction model to obtain the predicted secondary structure of the target sequence. The pre-built model includes:

A data preprocessing layer for weighting and batch-normalizing the input sample data.

A pytorch function layer for processing the preprocessed input samples to obtain the first output matrix and the second output matrix.

A MobileNet v2 layer for iteratively processing the second output matrix to obtain the local feature matrix between protein sequences.

A Transformer layer for processing the third output matrix, obtained by adding the first output matrix and the local feature matrix, to obtain the association matrix between protein sequences.

A two-layer bidirectional gated recurrent unit layer for processing the association matrix between protein sequences to obtain the global feature matrix.

A convolution layer and a fully connected layer for processing the global feature matrix in sequence to obtain the protein secondary-structure prediction.

Output module 203, for outputting the predicted secondary structure of the target protein sequence to be predicted.

Those skilled in the art will appreciate that the invention may take the form of a computer program product implemented on one or more computer-usable storage media containing computer-usable program code. Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the appended claims and their equivalents.

Claims (6)

1. A multi-class unbalanced protein secondary structure prediction method, comprising:
obtaining a target protein sequence to be predicted;
inputting a target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure prediction value of the target protein sequence to be predicted; the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises the following components:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix;
the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
the Transformer layer is used for processing a third output matrix obtained by adding the first output matrix and the local feature matrix to obtain an association matrix between protein sequences;
the two-layer bidirectional gated recurrent unit layer is used for processing the association matrix among protein sequences to obtain a global feature matrix;
the convolution layer and the full connection layer are used for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
outputting a secondary structure predicted value of the target protein sequence to be predicted;
the pytorch function layer comprises an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function and an nn.Dropout function which are connected in sequence, wherein the parameters of the nn.Conv1d function comprise 57 output channels, 448 input channels, a convolution kernel length of 3, padding=1, and bias=False; the input of the F.relu function is the output of the nn.Conv1d function; the dimension of the MaskedBatchNorm1d function is 448; and the probability that neurons of the nn.Dropout function are not activated is 0.2;
the Transformer has two coding layers, and each encoder is provided with a set of eight attention head matrixes;
the output channels of the convolution layer and the full connection layer are 57, the convolution kernel of the convolution layer is 1, and the output size of the full connection layer is 8.
2. The method of claim 1, wherein the step of weighting and batch normalizing the input sample data comprises:
processing input sample data according to a feature matrix obtained by horizontally splicing a position-specificity scoring matrix, a hidden Markov model feature matrix and an amino acid physicochemical property feature matrix, to obtain weighted sample data;
respectively obtaining mask matrixes with all initial elements being false and input feature matrixes with all initial elements being 0;
obtaining an updated mask matrix and an updated input feature matrix according to the protein chain index, protein chain length and maximum protein chain length obtained by traversing the protein sequences of each batch in the weighted sample data;
respectively calculating the mean value and the variance of the updated mask matrix and the updated input feature matrix by using a preset first formula and a preset second formula;
and processing the mean and variance by using a batch normalization layer pre-constructed according to a preset third formula to obtain a batch normalization output matrix.
3. The multi-class unbalanced protein secondary structure prediction method according to claim 2, wherein the preset third formula comprises:
wherein μ represents the mean obtained by processing the updated mask matrix and the updated input feature matrix using the preset first formula, var represents the variance obtained by processing the updated mask matrix and the updated input feature matrix using the preset second formula, X B,C,max_L represents the input weighted sample data, X* represents the output of the batch normalization layer, and β and γ are parameters to be learned in the batch normalization layer, where ε=0.01;
the preset first formula and the preset second formula are respectively as follows:
wherein μ represents the mean value of all sample data, var1 represents the variance of all sample data, M bl represents the mask matrix, and X bcl is the element value of each amino acid vector in each protein sequence; B is the batch size set during training, namely Batchsize, here set to 16; b refers to traversing the batch of 16 proteins, so b=[1,16]; max_L refers to the maximum sequence length in the batch of proteins; and L represents the sequence length of the proteins in the batch.
4. The method of claim 1, wherein iteratively processing the second output matrix comprises:
taking the second output matrix output by the nn.Dropout function as an initial input matrix of a MobileNet v2 layer to obtain an initial output matrix;
and (3) taking the initial output matrix as the input of the MobileNet v2 layer, and carrying out multiple iterations on the obtained intermediate output matrix processed by the MobileNet v2 layer as the input of the MobileNet v2 layer to obtain the local feature matrix among protein sequences.
5. The multi-class unbalanced protein secondary structure prediction method according to claim 3, wherein after the convolution layer and the full-connection layer for sequentially processing the global feature matrix to obtain the predicted value of the secondary structure of the protein, the method further comprises:
and calculating a prediction error by using a label distribution perception marginal loss function, and reversely propagating and updating parameters beta and gamma of the batch normalization layer by the obtained prediction error.
6. A multi-class unbalanced protein secondary structure prediction system, comprising:
the input module is used for acquiring a target protein sequence to be predicted;
the processing module is used for inputting the target protein sequence to be predicted into a pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model to obtain a secondary structure predicted value of the target protein sequence to be predicted; the pre-constructed Transformer-based multi-class unbalanced protein secondary structure prediction model comprises the following components:
a data preprocessing layer for performing weighting processing and batch normalization processing on the input sample data;
the pytorch function layer is used for processing the preprocessed multiple input sample data to obtain a first output matrix and a second output matrix;
the MobileNet v2 layer is used for iteratively processing the second output matrix to obtain a local feature matrix among protein sequences;
the Transformer layer is used for processing a third output matrix obtained by adding the first output matrix and the local feature matrix to obtain an association matrix between protein sequences;
the two-layer bidirectional gated recurrent unit layer is used for processing the association matrix among protein sequences to obtain a global feature matrix;
the convolution layer and the full connection layer are used for sequentially processing the global feature matrix to obtain a protein secondary structure predicted value;
the output module is used for outputting the predicted value of the secondary structure of the target protein sequence to be predicted;
the pytorch function layer comprises an nn.Conv1d function, an F.relu function, a MaskedBatchNorm1d function and an nn.Dropout function which are connected in sequence, wherein the parameters of the nn.Conv1d function comprise 57 output channels, 448 input channels, a convolution kernel length of 3, padding=1, and bias=False; the input of the F.relu function is the output of the nn.Conv1d function; the dimension of the MaskedBatchNorm1d function is 448; and the probability that neurons of the nn.Dropout function are not activated is 0.2;
the Transformer has two coding layers, and each encoder is provided with a set of eight attention head matrixes;
the output channels of the convolution layer and the full connection layer are 57, the convolution kernel of the convolution layer is 1, and the output size of the full connection layer is 8.
CN202311804115.1A 2023-12-26 2023-12-26 Multi-class unbalanced protein secondary structure prediction method and system Active CN117476106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311804115.1A CN117476106B (en) 2023-12-26 2023-12-26 Multi-class unbalanced protein secondary structure prediction method and system


Publications (2)

Publication Number Publication Date
CN117476106A (en) 2024-01-30
CN117476106B (en) 2024-04-02


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118658528B (en) * 2024-08-20 2025-01-21 电子科技大学长三角研究院(衢州) A method for constructing a specific myoglobin prediction model

Citations (7)

Publication number Priority date Publication date Assignee Title
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 A Convolutional Neural Network Model for Predicting Protein Interactions Using Protein Primary Sequences Based on Attention Mechanism
CN112767997A (en) * 2021-02-04 2021-05-07 齐鲁工业大学 Protein secondary structure prediction method based on multi-scale convolution attention neural network
CN113178229A (en) * 2021-05-31 2021-07-27 吉林大学 Deep learning-based RNA and protein binding site recognition method
CN114974397A (en) * 2021-02-23 2022-08-30 腾讯科技(深圳)有限公司 Training method of protein structure prediction model and protein structure prediction method
CN115458039A (en) * 2022-08-08 2022-12-09 北京分子之心科技有限公司 Single-sequence protein structure prediction method and system based on machine learning
CN115662501A (en) * 2022-10-25 2023-01-31 浙江大学杭州国际科创中心 Protein generation method based on position specificity weight matrix
CN116486900A (en) * 2023-04-25 2023-07-25 徐州医科大学 Drug target affinity prediction method based on depth mode data fusion

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20220122689A1 (en) * 2020-10-15 2022-04-21 Salesforce.Com, Inc. Systems and methods for alignment-based pre-training of protein prediction models


Non-Patent Citations (1)

Title
Yang Lu, Dong Hongwei. Protein secondary structure prediction based on self-attention mechanism and GAN. China Science and Technology Papers Online Premium Papers, 2023-06-15, Vol. 16, No. 02: 148-159. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250121

Address after: Room 501, 5th Floor, Building 15, Northwest University Science and Technology Park, Yongchang Road, High tech Industrial Development Zone, Xianyang City, Shaanxi Province, 712023

Patentee after: Shaanxi Qianruanhui Information Technology Co.,Ltd.

Country or region after: China

Address before: Room 001, F2005, 20th Floor, Building 4-A, Xixian Financial Port, Fengdong New City Energy Jinmao District, Xixian New District, Xi'an City, Shaanxi Province, 710000

Patentee before: Xi'an Huisuan Intelligent Technology Co.,Ltd.

Country or region before: China