CN112037849A

CN112037849A - Device and method for predicting protein-protein interaction based on alternative direction multiplier method

Info

Publication number: CN112037849A
Application number: CN202010952823.XA
Authority: CN
Inventors: 陈际秋; 钟裕荣; 吴昊; 袁野
Original assignee: Chongqing University; Chongqing Institute of Green and Intelligent Technology of CAS
Current assignee: Chongqing University; Chongqing Institute of Green and Intelligent Technology of CAS
Priority date: 2020-09-04
Filing date: 2020-09-11
Publication date: 2020-12-04
Anticipated expiration: 2040-09-11
Also published as: CN112037849B

Abstract

The invention discloses a device and a method for predicting protein-protein interactions based on the alternating direction multiplier method. The method comprises the following steps: S1: inputting initial protein-protein interaction data and constructing a symmetric sparse matrix W; S2: constructing an augmented Lagrangian function and initializing parameters; S3: iteratively optimizing the augmented Lagrangian function to obtain an optimized latent feature matrix; S4: calculating the predicted values of the missing protein-protein interactions. By applying a symmetric non-negative latent feature decomposition based on the alternating direction multiplier method, the invention provides high-precision prediction of protein interaction data with low time and space complexity. This improves the prediction accuracy for missing protein-protein interactions while respecting the non-negativity and symmetry of the data, which is valuable for scientific research.

Description

Apparatus and method for protein-protein interaction prediction based on alternating direction multiplier method

Technical Field

The present invention relates to the technical field of data processing, and in particular to a device and method for predicting protein-protein interactions based on the alternating direction multiplier method.

Background

A species contains a wide variety of proteins, and our understanding of life processes is inseparable from the interactions between them. Traditional biological experiments can hardly determine all protein-protein interactions of a species; computational methods, however, can be used to predict the full set of interactions. How to predict missing protein-protein interactions efficiently and accurately by computational means has therefore drawn growing attention in the field.

In general, because a species contains many different kinds of proteins and only part of the interactions between them can be determined in practice, the network formed by the protein-protein interactions of a species is an undirected, high-dimensional, sparse network. In recent years, many researchers have proposed algorithms for predicting missing protein-protein interactions; among them, singular value decomposition can predict missing values effectively. However, this approach cannot handle high-dimensional data, and it ignores the non-negativity and symmetry of the data; that is, its model is not designed for the undirected network formed by protein-protein interactions. On the other hand, some researchers have applied symmetric non-negative matrix factorization to predict missing values in other symmetric-data problems, but symmetric non-negative matrix factorization cannot process very large high-dimensional networks efficiently. For the undirected, high-dimensional, sparse network built from protein-protein interaction data, predicting the missing interactions accurately and efficiently while respecting the non-negativity and symmetry of the data has become a popular yet difficult research problem.

Summary of the Invention

To address the low accuracy with which the prior art predicts missing protein-protein interactions, the present invention proposes a device and method for predicting protein-protein interactions based on the alternating direction multiplier method. Using a symmetric non-negative latent feature decomposition based on the alternating direction multiplier method, it provides high-precision prediction of protein interaction data with low time and space complexity and improves the prediction accuracy for missing protein-protein interactions.

To achieve the above object, the present invention provides the following technical solutions:

A protein-protein interaction prediction device based on the alternating direction multiplier method comprises a data conversion module, a data initialization module, an alternating direction multiplier training module and a prediction data generation module connected in sequence; wherein,

the data conversion module is used to construct the received initial protein-protein interaction data into a corresponding symmetric sparse matrix, and to store all non-missing values of the symmetric sparse matrix;

the data initialization module is used to generate the initial latent feature matrix, linear deviation vector, multiplier matrix and multiplier vector, then to construct the corresponding augmented Lagrangian function from the non-missing values, the latent feature matrix, the linear deviation vector, the multiplier matrix and the multiplier vector, and to initialize that function;

the alternating direction multiplier training module is used to first slice the latent feature matrix along the latent feature dimension, and then, slice by slice, to iteratively optimize and update the iterative parameters, the non-negative parameters and the multiplier parameters in turn, thereby obtaining the converged latent feature matrix;

the prediction data generation module is used to calculate the predicted values of the missing protein-protein interactions from the converged latent feature matrix.

Preferably, the data conversion module comprises a symmetric sparse matrix generation unit and a protein-protein interaction data storage unit; wherein,

the symmetric sparse matrix generation unit is used to construct the received initial protein-protein interaction data into a symmetric sparse matrix W;

the protein-protein interaction data storage unit is used to store all non-missing values of the constructed symmetric sparse matrix W.

Preferably, the data initialization module comprises a linear deviation data generation unit, an augmented Lagrangian function construction unit and an initialization unit; wherein,

the linear deviation data generation unit is used to generate the initial protein-protein interaction linear deviation vector;

the augmented Lagrangian function construction unit is used to construct the corresponding augmented Lagrangian function from the initial protein-protein interaction linear deviation vector and the non-missing values;

the initialization unit is used to initialize the parameters involved in the protein-protein interaction prediction process.

The present invention also provides a protein-protein interaction prediction method based on the alternating direction multiplier method, which specifically comprises the following steps:

S1: Input the initial protein-protein interaction data and construct a symmetric sparse matrix W;

S2: Construct an augmented Lagrangian function and initialize the parameters;

S3: Iteratively optimize the augmented Lagrangian function to obtain the optimized latent feature matrix;

S4: Calculate the predicted values of the missing protein-protein interactions.

Preferably, S1 comprises:

S1-1: Construct the symmetric sparse matrix W;

The received initial protein-protein interaction data is stored as triple entries of the form (p_i, p_j, v_ij), where p_i denotes the i-th protein, p_j denotes the j-th protein, and v_ij denotes the interaction value between the i-th protein and the j-th protein. The symmetric counterpart of each triple entry is generated, and the symmetric sparse matrix W is constructed from the result.
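The mirroring step in S1-1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_symmetric_sparse`, the dictionary-of-keys storage, and the sample triples are all assumptions made for the example.

```python
def build_symmetric_sparse(triples):
    """Build a symmetric sparse matrix (dictionary of keys) from (i, j, v) triples.

    Hypothetical helper: only the non-missing entries are stored, and each
    input triple is mirrored so that W[(i, j)] == W[(j, i)].
    """
    w = {}
    for i, j, v in triples:
        w[(i, j)] = v
        w[(j, i)] = v  # symmetric counterpart, since v_ij = v_ji
    return w


# Made-up sample data: two known interactions among three proteins.
triples = [(0, 1, 0.8), (1, 2, 0.3)]
W = build_symmetric_sparse(triples)
```

Storing only the known entries keeps memory proportional to the number of observed interactions rather than to the M × M size of the full matrix.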

Preferably, S2 comprises:

S2-1: Construct the objective loss function Q;

From the symmetric sparse matrix W, obtain the set Γ of all non-missing values. Combining Γ with the generated linear deviation vectors H and G, and taking the Euclidean distance as the optimization objective, construct the corresponding objective loss function Q:

s.t. E = F, E ≥ 0; G = H, G ≥ 0;    (1)

In formula (1), E and F are latent feature matrices with M rows and D columns; the linear deviation vectors H and G each have M elements; Γ denotes the set of non-missing values of the symmetric sparse matrix W corresponding to the protein-protein interaction data; D denotes the latent feature dimension; w_i,j denotes the interaction value between protein i and protein j; e_i,d ∈ E denotes the d-th element of the latent feature vector of the i-th protein in the latent feature matrix E; f_i,d ∈ F denotes the d-th element of the latent feature vector of the i-th protein in the latent feature matrix F; g_i ∈ G denotes the i-th element of the linear deviation vector G; h_i ∈ H denotes the i-th element of the linear deviation vector H; h_j ∈ H denotes the j-th element of the linear deviation vector H;
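As a hedged illustration only, one objective that is consistent with the symbols defined for formula (1) and with the Euclidean-distance criterion is shown below. The factor 1/2 and the choice of F and H inside the residual are assumptions, not confirmed by the text.

$$
Q = \frac{1}{2}\sum_{w_{i,j}\in\Gamma}\Big(w_{i,j} - \sum_{d=1}^{D} f_{i,d}\,f_{j,d} - h_i - h_j\Big)^{2}
\quad \text{s.t. } E = F,\ E \ge 0;\ G = H,\ G \ge 0 .
$$

Under this reading, F and H carry the unconstrained fit to the observed data, while E and G are copies that carry the non-negativity constraint.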

S2-2: Construct the augmented Lagrangian function.

Following the principle of the alternating direction multiplier method, the corresponding augmented Lagrangian function ε is obtained, expressed by the following formula:

In formula (2), Γ denotes the set of non-missing values of the symmetric sparse matrix W corresponding to the protein-protein interaction data; M denotes the number of proteins; D denotes the latent feature dimension; w_i,j denotes the interaction value between protein i and protein j; e_i,d ∈ E denotes the d-th element of the latent feature vector of the i-th protein in the latent feature matrix E; f_i,d ∈ F denotes the d-th element of the latent feature vector of the i-th protein in the latent feature matrix F; f_j,d ∈ F denotes the d-th element of the latent feature vector of the j-th protein in the latent feature matrix F; g_i ∈ G denotes the i-th element of the linear deviation vector G; h_i ∈ H denotes the i-th element of the linear deviation vector H; h_j ∈ H denotes the j-th element of the linear deviation vector H; κ_i,d ∈ K denotes the d-th element of the multiplier matrix K for the i-th protein; δ_i ∈ Z denotes the i-th element of the multiplier vector Z; ρ_i and u_i are penalty parameters, which are non-negative integers;
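For orientation, the standard augmented Lagrangian for the equality constraints E = F and G = H, written with the symbols defined for formula (2), would take the form below. This is a sketch of the textbook construction, with Q the objective of formula (1); the sign convention and the 1/2 factors on the penalty terms are assumptions.

$$
\varepsilon = Q
+ \sum_{i=1}^{M}\sum_{d=1}^{D}\Big[\kappa_{i,d}\,(e_{i,d}-f_{i,d}) + \frac{\rho_i}{2}\,(e_{i,d}-f_{i,d})^{2}\Big]
+ \sum_{i=1}^{M}\Big[\delta_i\,(g_i-h_i) + \frac{u_i}{2}\,(g_i-h_i)^{2}\Big]
$$

The multiplier terms enforce the constraints at convergence, and the quadratic penalty terms weighted by ρ_i and u_i stabilize the iteration before that point.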

S2-3: Initialize the parameters used for prediction;

The parameters to be initialized comprise the latent feature matrix F, the latent feature matrix E, the multiplier matrix K, the linear deviation vector H, the linear deviation vector G, the multiplier vector Z, the latent feature dimension D, the maximum number of training iterations T, the iteration counter t, the convergence termination threshold τ, the learning rate η, and the penalty parameters ρ_i and u_i.

Preferably, S3 comprises:

S3-1: Iteratively update the iterative parameters, the non-negative parameters and the multiplier parameters in turn. The update strategy is as follows:

for d = 1～D

In formula (3), F_(1～M),d denotes the d-th column vector of the latent feature matrix F, which contains M elements, and H_d denotes the d-th element of the linear deviation vector H. Formula (3) also refers to the following quantities:

the latent feature sub-matrix of M rows and (d-1) columns composed of the 1st to (d-1)-th column vectors of the latent feature matrix F in the (t+1)-th iteration;

the latent feature sub-matrix of M rows and (D-d) columns composed of the (d+1)-th to D-th column vectors of the latent feature matrix F in the t-th iteration;

the latent feature sub-matrix of M rows and (D-d+1) columns composed of the d-th to D-th column vectors of the latent feature matrix F in the t-th iteration;

the linear deviation sub-vector of size (d-1) composed of the 1st to (d-1)-th elements of the linear deviation vector H in the (t+1)-th iteration;

the linear deviation sub-vector of size (D-d+1) composed of the d-th to D-th elements of the linear deviation vector H in the t-th iteration;

the linear deviation sub-vector of size (D-d) composed of the (d+1)-th to D-th elements of the linear deviation vector H in the t-th iteration.

F^t+1 denotes the latent feature matrix F in the (t+1)-th iteration; H^t+1 denotes the linear deviation vector H in the (t+1)-th iteration; G^t+1 and G^t denote the linear deviation vector G in the (t+1)-th and t-th iterations; E^t+1 and E^t denote the latent feature matrix E in the (t+1)-th and t-th iterations; K^t+1 and K^t denote the multiplier matrix K in the (t+1)-th and t-th iterations; Z^t+1 and Z^t denote the multiplier vector Z in the (t+1)-th and t-th iterations; η controls the step size taken in each iterative parameter update.

The update also uses the partial derivative of the augmented Lagrangian function ε with respect to the multiplier matrix K, and the partial derivative of ε with respect to the multiplier vector Z.
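The multiplier part of this update can be sketched in code. The sketch assumes that ε contains the usual multiplier terms κ_i,d·(e_i,d − f_i,d) and δ_i·(g_i − h_i), so that the partial derivatives with respect to K and Z are E − F and G − H; that form, the function name `update_multipliers`, and the sample values are assumptions rather than the patent's formula (3).

```python
def update_multipliers(E, F, G, H, K, Z, eta):
    """One gradient-ascent step on the multipliers with step size eta.

    Assumes d(eps)/d(kappa_id) = e_id - f_id and d(eps)/d(delta_i) = g_i - h_i,
    which holds for a standard augmented Lagrangian (an assumption here).
    K and Z are updated in place and also returned.
    """
    for i in range(len(K)):
        for d in range(len(K[i])):
            K[i][d] += eta * (E[i][d] - F[i][d])
        Z[i] += eta * (G[i] - H[i])
    return K, Z


# Made-up values for a single protein with one latent dimension.
K_new, Z_new = update_multipliers(
    E=[[1.0]], F=[[0.5]], G=[2.0], H=[1.0], K=[[0.0]], Z=[0.0], eta=0.5
)
```

When the constraints E = F and G = H are satisfied, both differences vanish and the multipliers stop changing.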

S3-2: Iteratively optimize the augmented Lagrangian objective loss function ε;

The training iteration formulas are as follows:

In formula (4), Γ(i) denotes the subset of the non-missing value set Γ that relates to protein i; D denotes the latent feature dimension; w_i,j denotes the interaction value between protein i and protein j; e_i,d ∈ E is the d-th element of the latent feature vector of the i-th protein in the latent feature matrix E; f_i,d ∈ F is the d-th element of the latent feature vector of the i-th protein in the latent feature matrix F; f_j,d ∈ F is the d-th element of the latent feature vector of the j-th protein in the latent feature matrix F; g_i ∈ G is the i-th element of the linear deviation vector G; h_j ∈ H is the j-th element of the linear deviation vector H; h_i ∈ H is the i-th element of the linear deviation vector H; κ_i,d ∈ K is the d-th element of the multiplier matrix K for the i-th protein; δ_i ∈ Z is the i-th element of the multiplier vector Z; ρ_i and u_i are penalty parameters;

S3-3: Determine whether the iterative optimization of the augmented Lagrangian objective loss function ε should terminate:

The iteration terminates under either of two conditions. After each iteration of ε, the iteration counter t is incremented by 1, and training stops when t reaches the maximum number of training iterations T. Alternatively, training stops when the absolute difference between the value of ε computed after the current iteration and its value after the previous iteration falls below the convergence termination threshold τ.
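The two stopping conditions of S3-3 can be sketched as a small driver loop. The names `train`, `step`, and `make_toy_step` are hypothetical; `step` stands in for one full round of updates and is assumed to return the current value of ε.

```python
def train(step, T, tau):
    """Run step() until t reaches T or the change in the loss drops below tau.

    Returns the number of iterations performed and the last loss value.
    """
    t = 0
    prev = None
    loss = None
    while t < T:
        loss = step()
        t += 1
        # Convergence check needs a previous value, so it starts at round 2.
        if prev is not None and abs(loss - prev) < tau:
            break
        prev = loss
    return t, loss


def make_toy_step():
    """Toy stand-in for one training round: the loss halves on each call."""
    state = {"loss": 2.0}

    def step():
        state["loss"] /= 2.0
        return state["loss"]

    return step


rounds_converged, final_loss = train(make_toy_step(), T=100, tau=0.1)
rounds_capped, _ = train(make_toy_step(), T=3, tau=0.1)
```

In the toy run the first call stops on the threshold τ and the second stops on the iteration cap T, which exercises both branches of the criterion.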

Preferably, in S4, the predicted value of a missing protein-protein interaction is calculated by the following formula:

In formula (5), the left-hand side denotes the computed estimate of the interaction between protein i and protein j; g_i ∈ G is the i-th element of the linear deviation vector G; g_j ∈ G is the j-th element of the linear deviation vector G; e_i,d ∈ E is the d-th element of the latent feature vector of the i-th protein in the latent feature matrix E; e_j,d ∈ E is the d-th element of the latent feature vector of the j-th protein in the latent feature matrix E; D denotes the latent feature dimension.
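Given the symbols listed for formula (5), a natural reading is that the estimate is the inner product of the two proteins' latent feature vectors plus their two linear deviation terms. The sketch below assumes exactly that form; the function name `predict` and the sample values are made up for illustration.

```python
def predict(E, G, i, j):
    """Estimate the interaction between proteins i and j.

    Assumed form: sum over d of E[i][d] * E[j][d], plus G[i] + G[j].
    """
    inner = sum(e_id * e_jd for e_id, e_jd in zip(E[i], E[j]))
    return inner + G[i] + G[j]


# Made-up converged parameters for two proteins and D = 2.
E = [[1.0, 2.0], [0.5, 0.0]]
G = [0.25, 0.5]
estimate = predict(E, G, 0, 1)
```

Because the expression is symmetric in i and j and every term is non-negative when E and G are, the predictions inherit the symmetry and non-negativity of the data.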

In summary, by adopting the above technical solutions, the present invention has at least the following beneficial effects compared with the prior art:

By using a symmetric non-negative latent feature decomposition based on the alternating direction multiplier method, the present invention provides high-precision prediction of protein interaction data with low time and space complexity, improving the prediction accuracy for missing protein-protein interactions while respecting the non-negativity and symmetry of the data, which is valuable for scientific research.

Brief Description of the Drawings:

FIG. 1 is a schematic diagram of a protein-protein interaction prediction device based on the alternating direction multiplier method according to an exemplary embodiment of the present invention.

FIG. 2 is a schematic diagram of a protein-protein interaction prediction method based on the alternating direction multiplier method according to an exemplary embodiment of the present invention.

Detailed Description

The present invention is described in further detail below with reference to examples and specific embodiments. This should not be understood as limiting the scope of the above subject matter to the following embodiments; all techniques realized on the basis of the content of the present invention fall within its scope.

In the description of the present invention, it should be understood that orientations or positional relationships indicated by terms such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore cannot be construed as limiting the invention.

As shown in FIG. 1, the present invention proposes a protein-protein interaction prediction device based on the alternating direction multiplier method, comprising a data conversion module 10, a data initialization module 20, an alternating direction multiplier training module 30 and a prediction data generation module 40. The input of the data conversion module 10 receives the initial protein-protein interaction data; the output of the data conversion module 10 is connected to the input of the data initialization module 20; the output of the data initialization module 20 is connected to the input of the alternating direction multiplier training module 30; the output of the alternating direction multiplier training module 30 is connected to the input of the prediction data generation module 40; and the output of the prediction data generation module 40 delivers the protein-protein interaction prediction data.

The data conversion module 10 is used to construct the received initial protein-protein interaction data into a corresponding symmetric sparse matrix W, and to store all non-missing values of the symmetric sparse matrix W.

The data initialization module 20 is used to generate an initial protein-protein interaction linear deviation vector Z, then to construct the corresponding augmented Lagrangian function from all non-missing values of the generated symmetric sparse matrix and the linear deviation data, and to initialize the parameters involved in that function.

The alternating direction multiplier training module 30 is used to first slice the latent feature matrix along the latent feature dimension, and then, slice by slice, to iteratively optimize and update the iterative parameters, the non-negative parameters and the dual parameters in turn, thereby obtaining the converged latent feature matrix.

The prediction data generation module 40 is used to calculate the predicted values of the missing protein-protein interactions from the latent feature matrix obtained by alternating direction multiplier training.

In this embodiment, the data conversion module 10 comprises a symmetric sparse matrix generation unit 101 and a protein-protein interaction data storage unit 102. The output of the symmetric sparse matrix generation unit 101 is connected to the input of the protein-protein interaction data storage unit 102.

The symmetric sparse matrix generation unit 101 is used to construct the received initial protein-protein interaction data into a symmetric sparse matrix W. Each interaction in the received initial data is stored as a triple of the form ppi = (p_i, p_j, v_ij), where p_i denotes the i-th protein, p_j denotes the j-th protein, and v_ij denotes the interaction value between the i-th protein and the j-th protein. The received initial data is not the complete set of protein-protein interaction entries: taking the interaction between protein i and protein j as an example, the initial data set contains only the entry (p_i, p_j, v_ij) and not the corresponding entry (p_j, p_i, v_ij). Therefore, before any other processing, the symmetric counterpart of each entry in the initial data is generated, and a symmetric matrix W is constructed from the result. The rows and the columns of W are indexed by the same sequence of proteins. Because the number of proteins is large, the number of known protein-protein interactions is far smaller than the total number of elements of W, so W is a symmetric sparse matrix.

The protein-protein interaction data storage unit 102 is used to store all non-missing values of the constructed symmetric sparse matrix W, each of which is also stored as a triple.

The data initialization module 20 comprises a linear deviation data generation unit 201, an augmented Lagrangian function construction unit 202, and an initialization unit 203.

The linear deviation data generation unit 201 is used to generate the initial protein-protein interaction linear deviation data.

The augmented Lagrangian function construction unit 202 is used to construct the corresponding objective loss function from the initial protein-protein interaction linear deviation data and the non-missing values extracted from the symmetric sparse matrix W.

The initialization unit 203 is used to initialize the parameters involved in the protein-protein interaction prediction process. These comprise the latent feature matrix F responsible for iteration, the latent feature matrix E responsible for non-negativity control, the dual matrix K, the linear deviation vector H responsible for iteration, the linear deviation vector G responsible for non-negativity control, the dual vector Z, the latent feature dimension D, the maximum number of training iterations T, the iteration counter t, the convergence termination threshold τ, the learning rate η, and the penalty parameters ρ_i and u_i. The latent feature dimension D determines the dimension of the latent feature space of E and F and is initialized to a positive integer. The size of E and F is determined by the number M of proteins involved in the received initial protein-protein interaction data and by D; that is, E and F are latent feature matrices with M rows and D columns. The latent feature matrices E and F and the linear deviation vectors G and H are each initialized with small random positive numbers. The maximum number of training iterations T bounds the iterative process and is initialized to a large positive integer. The iteration counter t is initialized to 0. The convergence termination threshold τ is used to judge whether the iteration has converged and is initialized to a very small positive number. The learning rate η controls the step size taken in each iteration of the optimization and is initialized to a very small positive number. The penalty parameters ρ_i and u_i control the effect of the augmentation terms in the Lagrangian function and are initialized to positive numbers.
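The initialization rules above can be sketched as follows. The function name `init_params`, the `seed` and `scale` arguments, the zero initialization of K and Z, and the concrete numeric values of T, τ, η, ρ_i and u_i are assumptions chosen for illustration; the text only specifies their sign and rough magnitude.

```python
import random


def init_params(M, D, seed=0, scale=0.01):
    """Initialize the training parameters for M proteins and D latent dimensions."""
    rng = random.Random(seed)

    def small_positive():
        # 1.0 - rng.random() lies in (0, 1], so the result is strictly positive.
        return scale * (1.0 - rng.random())

    return {
        "F": [[small_positive() for _ in range(D)] for _ in range(M)],
        "E": [[small_positive() for _ in range(D)] for _ in range(M)],
        "H": [small_positive() for _ in range(M)],
        "G": [small_positive() for _ in range(M)],
        "K": [[0.0] * D for _ in range(M)],  # assumption: multipliers start at zero
        "Z": [0.0] * M,                      # assumption: multipliers start at zero
        "D": D,
        "T": 1000,    # large positive integer (illustrative value)
        "t": 0,       # iteration counter starts at 0
        "tau": 1e-5,  # very small positive threshold (illustrative value)
        "eta": 1e-3,  # very small positive learning rate (illustrative value)
        "rho": [1.0] * M,  # positive penalty parameters (illustrative value)
        "u": [1.0] * M,
    }


params = init_params(M=4, D=3)
```

Fixing the seed makes a run reproducible, which helps when comparing different choices of D, η, or the penalty parameters.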

交替方向乘子训练模块30，用于对矩阵F，E，K按隐特征维度D来进行分片，并根据分片依次对迭代参数、非负参数及对偶参数来进行迭代更新。The alternating direction multiplier training module 30 is used for slicing the matrices F, E, K according to the hidden feature dimension D, and iteratively updates the iterative parameters, non-negative parameters and dual parameters in sequence according to the slicing.

本实施例中，预测数据生成模块40包括预测数据存储单元，用于存储预测的缺失蛋白质间相互作用值，其中每个缺失蛋白质间相互作用预测值也是以三元组的形式来进行存储的。In this embodiment, the prediction data generation module 40 includes a prediction data storage unit for storing the predicted missing protein interaction values, wherein each missing protein interaction prediction value is also stored in the form of triples.

本装置可部署于一个现有的服务器中，也可部署于一个单独设置的、专用于进行蛋白质间相互作用预测的服务器中。The device can be deployed in an existing server, or can be deployed in a separate server dedicated to predicting interactions between proteins.

Based on the above device, the present invention also provides a protein-protein interaction prediction method based on the alternating direction multiplier method. The method is directed at predicting missing protein-protein interactions and can do so efficiently and with high accuracy. As shown in Figure 2, it comprises the following steps:

S1: Input the initial protein-protein interaction data and construct a symmetric sparse matrix W.

In this embodiment, the server sends the device an instruction to predict protein-protein interactions together with the initial protein-protein interaction data; the instruction may be issued periodically, upon a notification from the device, upon a notification from the server, and so on.

S1-1: Construct the symmetric sparse matrix W.

In this embodiment, the received initial protein-protein interaction data is stored in the form of triples, written as ppi = (p_i, p_j, v_ij), where p_i denotes the i-th protein, p_j denotes the j-th protein, and v_ij denotes the interaction value between the i-th protein and the j-th protein.

The initial protein-protein interaction data received at this point is not the complete set of protein-protein interaction data. Taking the interaction between protein i and protein j as an example, the initial data set contains only the entry (p_i, p_j, v_ij) and not the corresponding entry (p_j, p_i, v_ij). Because the matrix formed by the protein interaction data is symmetric, v_ij = v_ji, so storing only the entry (p_i, p_j, v_ij) saves storage space. Therefore, before any other data processing, the symmetric entry corresponding to each entry in the received initial data is generated, and the symmetric sparse matrix W is constructed from the result. The rows and columns of W are indexed by the same sequence of proteins. Because the number of proteins is large, the number of known protein-protein interactions is far smaller than the total number of elements in W.
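The mirroring described in step S1-1 can be sketched in a few lines. This is a minimal illustration, assuming proteins are identified by integer indices and W is held as a dictionary keyed by (row, column); the function name `build_symmetric_sparse` and the sample triples are hypothetical and do not come from the patent.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[int, int, float]  # (p_i, p_j, v_ij), with proteins as integer indices


def build_symmetric_sparse(triples: Iterable[Triple]) -> Dict[Tuple[int, int], float]:
    """Return W as {(i, j): v}, holding every known entry and its mirror entry."""
    W: Dict[Tuple[int, int], float] = {}
    for i, j, v in triples:
        W[(i, j)] = v
        W[(j, i)] = v  # symmetric counterpart, since v_ij = v_ji
    return W


# Hypothetical input: three known interactions among four proteins.
triples = [(0, 1, 0.9), (0, 2, 0.4), (1, 3, 0.7)]
W = build_symmetric_sparse(triples)
```

Only the known entries are stored, so memory grows with the number of observed interactions rather than with the square of the number of proteins.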

S2: Construct the augmented Lagrangian function and initialize the parameters.

S2-1: Construct the objective loss function Q.

In this step, the non-missing elements in the upper triangle of the generated symmetric sparse matrix W are traversed. For each such element, the corresponding non-missing element in the lower triangle is generated according to the symmetry of the matrix, and both elements are added to the set of non-missing values. When the traversal is complete, the set Γ of all non-missing values is obtained.

The symmetric sparse matrix W is decomposed into a latent feature matrix F and a linear deviation vector H. The size of F is determined by the number M of proteins involved in the received initial protein-protein interaction data and by the latent feature dimension D: F is a latent feature matrix with M rows and D columns, initialized with random positive numbers in the open interval (0, 0.004). The size of H is determined by M: H is a vector of M elements, likewise initialized with random positive numbers in the open interval (0, 0.004). The latent feature matrix E has the same size as F, and the linear deviation vector G has the same size as H. Each element of E is initialized to the value of the corresponding element of F, and each element of G to the value of the corresponding element of H. Combining the set Γ with the latent feature matrices E and F and the linear deviation vectors G and H, and taking the Euclidean distance as the optimization objective, the corresponding objective loss function Q is constructed, expressed by the following formula:

s.t. E = F, E ≥ 0; G = H, G ≥ 0; (1)

In formula (1), E and F are latent feature matrices with M rows and D columns; the linear deviation vectors H and G are vectors of M elements; Γ denotes the set of non-missing values in the symmetric sparse matrix W corresponding to the protein-protein interaction data; D denotes the latent feature dimension; w_i,j denotes the interaction value between protein i and protein j; f_i,d ∈ F denotes the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix F; g_i ∈ G denotes the i-th element of the linear deviation vector G; h_i ∈ H denotes the i-th element of the linear deviation vector H; and h_j ∈ H denotes the j-th element of the linear deviation vector H.
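The body of the objective is not reproduced in the text; only its constraints are. One form consistent with the symbols defined for formula (1) and with the stated Euclidean-distance objective is sketched below. The factor 1/2 and the additive placement of h_i and h_j are assumptions rather than the patent's confirmed expression.

```latex
Q \;=\; \frac{1}{2} \sum_{w_{i,j} \in \Gamma}
        \Bigl( w_{i,j} - \sum_{d=1}^{D} f_{i,d}\, f_{j,d} - h_i - h_j \Bigr)^{2}
\qquad \text{s.t.}\quad E = F,\; E \ge 0;\quad G = H,\; G \ge 0
```

Under this reading, non-negativity is imposed on the copies E and G rather than on F and H directly, which is what lets the later update steps handle the fitting term and the constraint separately.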

S2-2: Construct the augmented Lagrangian function.

First, a multiplier matrix K with M rows and D columns and a multiplier vector Z of M elements are constructed, where M denotes the number of proteins and D denotes the latent feature dimension; K is initialized to the zero matrix and Z to the zero vector. Then, following the principle of the alternating direction multiplier method, the corresponding augmented Lagrangian function ε is obtained, expressed by the following formula:

In formula (2), Γ denotes the set of non-missing values in the symmetric sparse matrix W corresponding to the protein-protein interaction data; M denotes the number of proteins and D the latent feature dimension; w_i,j denotes the interaction value between protein i and protein j; e_i,d ∈ E denotes the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix E; f_i,d ∈ F denotes the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix F; f_j,d ∈ F denotes the d-th element of the latent feature corresponding to the j-th protein in the latent feature matrix F; g_i ∈ G denotes the i-th element of the linear deviation vector G; h_i ∈ H denotes the i-th element of the linear deviation vector H; h_j ∈ H denotes the j-th element of the linear deviation vector H; κ_i,d ∈ K denotes the d-th element of the latent feature corresponding to the i-th protein in the multiplier matrix K; δ_i ∈ Z denotes the i-th element of the multiplier vector Z; and ρ_i and u_i are penalty parameters, which are non-negative integers.
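The body of formula (2) is likewise not reproduced in the text. A form consistent with the symbols listed for it, and with the multiplier updates κ_i,d ← κ_i,d + ηρ_i(f_i,d - e_i,d) and δ_i ← δ_i + ηu_i(h_i - g_i) stated in claim 7, is sketched below. The fitting term (including its factor 1/2) and the factor 1/2 on the penalty terms are assumptions.

```latex
\varepsilon \;=\; \frac{1}{2} \sum_{w_{i,j} \in \Gamma}
      \Bigl( w_{i,j} - \sum_{d=1}^{D} f_{i,d}\, f_{j,d} - h_i - h_j \Bigr)^{2}
  \;+\; \sum_{i=1}^{M} \sum_{d=1}^{D} \kappa_{i,d}\,\bigl(f_{i,d} - e_{i,d}\bigr)
  \;+\; \sum_{i=1}^{M} \frac{\rho_i}{2} \sum_{d=1}^{D} \bigl(f_{i,d} - e_{i,d}\bigr)^{2}
  \;+\; \sum_{i=1}^{M} \delta_i\,\bigl(h_i - g_i\bigr)
  \;+\; \sum_{i=1}^{M} \frac{u_i}{2}\,\bigl(h_i - g_i\bigr)^{2}
```

With this sign convention, the partial derivative of ε with respect to κ_i,d is f_i,d - e_i,d and with respect to δ_i is h_i - g_i, so the multiplier updates in claim 7 are gradient ascent steps on the dual variables.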

S2-3: Initialize the parameters used for prediction.

In this step, the parameters used for prediction are initialized: the latent feature matrix F responsible for iteration; the latent feature matrix E responsible for non-negativity control; the multiplier matrix K; the linear deviation vector H responsible for iteration; the linear deviation vector G responsible for non-negativity control; the multiplier vector Z; the latent feature dimension D; the maximum number of training iterations T; the iteration counter t used during training; the convergence termination threshold τ; the learning rate η; and the penalty parameters ρ_i and u_i.

Where:

The latent feature dimension D determines the dimension of the latent feature space of the latent feature matrices E and F and is initialized to a positive integer;

The sizes of the latent feature matrices E and F are determined by the number M of proteins involved in the initially received protein-protein interaction data and by the latent feature dimension D, that is, E and F are latent feature matrices with M rows and D columns; the latent feature matrices E and F and the linear deviation vectors G and H are each initialized with small random positive numbers;

The multiplier matrix K is initialized to the zero matrix and the multiplier vector Z to the zero vector;

The maximum number of training iterations T is a variable that bounds the iteration process and is initialized to a large positive integer;

The iteration counter t is initialized to 0;

The convergence termination threshold τ is a parameter used to judge whether the iteration process has converged and is initialized to a very small positive number;

The learning rate η controls the step size taken in each iteration of the optimization process and is initialized to a very small positive number;

The penalty parameters ρ_i and u_i control the effect of the augmentation term in the Lagrangian function and are initialized to positive numbers.
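The initialization in step S2-3 can be sketched as follows. The open interval (0, 0.004) comes from step S2-1; the concrete values chosen for T, τ, η, ρ_i and u_i, the function name `init_parameters`, and the fixed random seed are illustrative assumptions, since the text only requires "large", "very small" or "positive" values.

```python
import random


def init_parameters(M, D, seed=0):
    """Create F, E, K, H, G, Z and the scalar settings for M proteins and D latent dimensions."""
    rng = random.Random(seed)

    def small_positive():
        # uniform() may return an endpoint; redraw to stay inside the open interval (0, 0.004).
        x = rng.uniform(0.0, 0.004)
        while not 0.0 < x < 0.004:
            x = rng.uniform(0.0, 0.004)
        return x

    F = [[small_positive() for _ in range(D)] for _ in range(M)]
    H = [small_positive() for _ in range(M)]
    E = [row[:] for row in F]            # E starts equal to F
    G = H[:]                             # G starts equal to H
    K = [[0.0] * D for _ in range(M)]    # multiplier matrix, zero
    Z = [0.0] * M                        # multiplier vector, zero
    settings = {
        "T": 1000,          # maximum number of training iterations (assumed value)
        "t": 0,             # iteration counter
        "tau": 1e-5,        # convergence termination threshold (assumed value)
        "eta": 1e-2,        # learning rate (assumed value)
        "rho": [1.0] * M,   # penalty parameters rho_i (assumed value)
        "u": [1.0] * M,     # penalty parameters u_i (assumed value)
    }
    return F, E, K, H, G, Z, settings


F, E, K, H, G, Z, settings = init_parameters(M=4, D=3)
```

E and G are copied rather than aliased so that later updates to F and H do not silently change them.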

S3: Iteratively optimize the augmented Lagrangian function to obtain the optimized latent feature matrix.

S3-1: Slice and update the augmented Lagrangian function.

In this step, the augmented Lagrangian function ε is sliced along the latent feature dimension D, and the iteration parameters, the non-negative parameters, and the multiplier parameters are then iteratively updated in sequence, slice by slice.

The specific update strategy is as follows:

for d = 1 to D

In formula (3), F_{(1~M),d} denotes the d-th column vector of the latent feature matrix F, which contains M elements, and H_d denotes the d-th element of the linear deviation vector H; the (d+1)-th through D-th elements of the linear deviation vector H in the t-th iteration form a linear deviation sub-vector of size (D - d); F^{t+1} denotes the latent feature matrix F in the (t+1)-th iteration; H^{t+1} denotes the linear deviation vector H in the (t+1)-th iteration; G^{t+1} denotes the linear deviation vector G in the (t+1)-th iteration; G^t denotes the linear deviation vector G in the t-th iteration; E^{t+1} denotes the latent feature matrix E in the (t+1)-th iteration; E^t denotes the latent feature matrix E in the t-th iteration; K^{t+1} denotes the multiplier matrix K in the (t+1)-th iteration; K^t denotes the multiplier matrix K in the t-th iteration; Z^{t+1} denotes the multiplier vector Z in the (t+1)-th iteration; Z^t denotes the multiplier vector Z in the t-th iteration; η controls the step size in the iterative optimization of each parameter; ∇_K ε(E^{t+1}, G^{t+1}, F^{t+1}, H^{t+1}, K, Z^t) denotes the partial derivative of the augmented Lagrangian function ε with respect to the multiplier matrix K; and ∇_Z ε(E^{t+1}, G^{t+1}, F^{t+1}, H^{t+1}, K^t, Z) denotes the partial derivative of the augmented Lagrangian function ε with respect to the multiplier vector Z.

As the above update strategy shows, for each slice the decision parameters of the corresponding dimension of the latent feature matrix F and the linear deviation vector H are updated first; the decision parameters of the corresponding dimension of the latent feature matrix E and the linear deviation vector G are updated next; and the decision parameters of the corresponding dimension of the multiplier matrix K and the multiplier vector Z are updated last.

S3-2: Iteratively optimize the augmented Lagrangian objective loss function ε.

According to the slice-wise update strategy and based on the obtained set Γ of non-missing values, the objective loss function Q, that is, the augmented Lagrangian objective loss function ε, is iteratively optimized.

In this step, according to the slice-wise update strategy and the obtained set Γ of non-missing values, the training iteration formulas for the iteration parameters, the non-negative parameters, and the multiplier parameters of slice d are as follows:

In formula (4), Γ(i) denotes the set of all non-missing values in Γ that are related to protein i; D denotes the latent feature dimension; w_i,j denotes the interaction value between protein i and protein j; e_i,d ∈ E is the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix E; f_i,d ∈ F is the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix F; f_j,d ∈ F is the d-th element of the latent feature corresponding to the j-th protein in the latent feature matrix F; g_i ∈ G is the i-th element of the linear deviation vector G; h_i ∈ H is the i-th element of the linear deviation vector H; h_j ∈ H is the j-th element of the linear deviation vector H; κ_i,d ∈ K is the d-th element of the latent feature corresponding to the i-th protein in the multiplier matrix K; δ_i ∈ Z is the i-th element of the multiplier vector Z; and ρ_i and u_i are the penalty parameters.

In this step, according to the slicing and update strategy, because the latent feature dimension is D, each round of iterative updating comprises D sub-updates in total. After each round of iterative optimization is complete, the latent feature matrices E and F, the linear deviation vectors G and H, the multiplier matrix K, and the multiplier vector Z produced by that round are obtained.
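One training round, with its D sub-updates in the order (F, H), then (E, G), then (K, Z), can be sketched as below. Only the multiplier updates are taken from the text (claim 7). The closed-form F and H steps and the projection used for E and G are assumptions, derived from a squared-error objective with additive biases by holding all other variables fixed; they are not confirmed as the patent's formula (4). The function name `admm_round`, the adjacency structure `gamma_by_i` (protein index to a list of (j, w_ij) pairs) and the sample data are hypothetical.

```python
def admm_round(gamma_by_i, F, E, K, H, G, Z, rho, u, eta):
    """Run one training round in place: D slice updates, each ordered (F, H) -> (E, G) -> (K, Z)."""
    M, D = len(F), len(F[0])

    def dot(i, j, skip=None):
        # Inner product of the latent features of proteins i and j, optionally omitting one dimension.
        return sum(F[i][k] * F[j][k] for k in range(D) if k != skip)

    for d in range(D):
        for i in range(M):
            nbrs = gamma_by_i.get(i, [])
            # F step (assumed closed form): minimise over f_i,d with everything else held fixed.
            num = rho[i] * E[i][d] - K[i][d]
            den = rho[i]
            for j, w in nbrs:
                num += (w - dot(i, j, skip=d) - H[i] - H[j]) * F[j][d]
                den += F[j][d] ** 2
            F[i][d] = num / den
            # H step (assumed closed form): the same idea applied to the bias h_i.
            num_h = u[i] * G[i] - Z[i] + sum(w - dot(i, j) - H[j] for j, w in nbrs)
            H[i] = num_h / (u[i] + len(nbrs))
        for i in range(M):
            # E and G steps (assumed): project onto the non-negative values.
            E[i][d] = max(0.0, F[i][d] + K[i][d] / rho[i])
            G[i] = max(0.0, H[i] + Z[i] / u[i])
            # K and Z steps: dual updates as stated in claim 7.
            K[i][d] += eta * rho[i] * (F[i][d] - E[i][d])
            Z[i] += eta * u[i] * (H[i] - G[i])


# Hypothetical data: three proteins, two latent dimensions, two known interactions (mirrored).
gamma_by_i = {0: [(1, 0.9)], 1: [(0, 0.9), (2, 0.5)], 2: [(1, 0.5)]}
M, D = 3, 2
F = [[0.001 * (i + d + 1) for d in range(D)] for i in range(M)]
E = [row[:] for row in F]
K = [[0.0] * D for _ in range(M)]
H = [0.002] * M
G = H[:]
Z = [0.0] * M
rho, u, eta = [1.0] * M, [1.0] * M, 1.0
for _ in range(50):
    admm_round(gamma_by_i, F, E, K, H, G, Z, rho, u, eta)
```

Because every denominator contains a strictly positive penalty parameter, the F and H steps are well defined even for a protein with no known interactions.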

S3-3: Determine whether the training iteration of the augmented Lagrangian objective loss function ε on Γ has reached a termination condition.

In this step, the training iteration of the augmented Lagrangian objective loss function ε on Γ terminates in either of two cases. In the first, the iteration counter t is increased by 1 after each round of iteration of Q, and training stops when t reaches the maximum number of training iterations T. In the second, training stops when, after the current round of iteration, the absolute value of the difference between the newly computed value of Q and its value in the previous round is smaller than the convergence termination threshold τ.
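The two termination conditions can be expressed as a single check. The function name `should_stop` and the convention of passing `None` when no previous loss value exists yet are illustrative assumptions.

```python
def should_stop(t, T, q_curr, q_prev, tau):
    """Return True when either termination condition of step S3-3 holds."""
    if t >= T:
        # The iteration counter has reached the maximum number of training iterations.
        return True
    # The change in the loss between consecutive rounds has fallen below the threshold.
    return q_prev is not None and abs(q_curr - q_prev) < tau
```

A training loop would call this after each round, once t has been incremented and the new loss value has been computed.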

S4: Calculate the predicted values of the missing protein-protein interactions.

In this step, after the objective loss function Q, that is, the augmented Lagrangian objective loss function ε, has converged on Γ, the latent feature matrix E and the linear deviation vector G obtained from the training that minimizes Q are used to calculate the predicted value of the interaction between protein i and protein j, where i, j ∈ {1, ..., M} and M denotes the number of proteins. The calculation formula is:
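The body of formula (5) is not reproduced in the text. Given that the prediction uses E and G, one consistent reading is the inner product of the two proteins' latent features plus both biases. The sketch below assumes that form; the function name `predict` and the sample values are hypothetical.

```python
def predict(E, G, i, j):
    """Assumed form of the prediction: sum over d of e_i,d * e_j,d, plus g_i and g_j."""
    return sum(a * b for a, b in zip(E[i], E[j])) + G[i] + G[j]


# Hypothetical trained values for two proteins with two latent dimensions.
E_demo = [[1.0, 2.0], [3.0, 4.0]]
G_demo = [0.5, 0.25]
value = predict(E_demo, G_demo, 0, 1)
```

Under this form the prediction is symmetric in i and j, and non-negative whenever E and G are non-negative, which matches the symmetric non-negative structure of W.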

In formula (5),

As the above technical solution shows, the embodiment of the present invention provides a method for predicting missing protein-protein interactions by symmetric non-negative latent feature analysis using the alternating direction multiplier method. The method acts specifically on data with missing protein-protein interactions and can provide high-precision prediction of protein interaction data with low time and space complexity, thereby addressing the problem of predicting missing protein-protein interactions while respecting the non-negativity and symmetry of the data.

Those of ordinary skill in the art will understand that the above embodiments are specific examples of implementing the present invention, and that in practical applications various changes in form and detail can be made to them without departing from the spirit and scope of the present invention.

Claims

1. A protein-protein interaction prediction device based on the alternating direction multiplier method, characterized in that it comprises a data conversion module, a data initialization module, an alternating direction multiplier training module, and a prediction data generation module; wherein,

The data conversion module is used to construct the received initial protein-protein interaction data into a corresponding symmetric sparse matrix and to store all non-missing values in the symmetric sparse matrix;

The data initialization module is used to generate the initial latent feature matrix, linear deviation vector, multiplier matrix, and multiplier vector, then to construct the corresponding augmented Lagrangian function from the non-missing values, the latent feature matrix, the linear deviation vector, the multiplier matrix, and the multiplier vector, and to initialize the function;

The alternating direction multiplier training module is used to first slice the latent feature matrix along the latent feature dimension and then iteratively optimize and update the iteration parameters, the non-negative parameters, and the multiplier parameters in sequence, so as to obtain the converged latent feature matrix;

The predicted data generation module is used to calculate the predicted value of the interaction between missing proteins according to the converged latent feature matrix.

2. The protein-protein interaction prediction device based on the alternating direction multiplier method according to claim 1, wherein the data conversion module comprises a symmetric sparse matrix generation unit and a protein-protein interaction data storage unit; wherein,

The symmetric sparse matrix generation unit is used to construct the received initial protein-protein interaction data into a symmetric sparse matrix W;

The protein-protein interaction data storage unit is used to store all non-missing values in the constructed symmetric sparse matrix W.

3. The protein-protein interaction prediction device based on the alternating direction multiplier method according to claim 1, wherein the data initialization module comprises a linear deviation data generation unit, an augmented Lagrangian function construction unit, and an initialization unit; wherein,

The linear deviation data generation unit is used to generate an initial linear deviation vector of protein-protein interaction;

The augmented Lagrangian function construction unit is used to construct the corresponding augmented Lagrangian function according to the initial protein-protein interaction linear deviation vector and the non-missing values;

The initialization unit is used to initialize the parameters involved in the prediction process of protein-protein interaction.

4. A protein-protein interaction prediction method based on the alternating direction multiplier method, characterized in that it specifically comprises the following steps:

S1: Input the initial protein-protein interaction data and construct a symmetric sparse matrix W;

S2: Build an augmented Lagrangian function and initialize parameters;

S3: Iteratively optimize the augmented Lagrangian function to obtain the optimized latent feature matrix;

S4: Calculate the predicted value of missing protein-protein interactions.

5. The protein-protein interaction prediction method based on the alternating direction multiplier method as claimed in claim 4, wherein the S1 comprises:

S1-1: Construct a symmetric sparse matrix W;

The received initial protein-protein interaction data is stored as triple entries, each represented as (p_i, p_j, v_ij), where p_i denotes the i-th protein, p_j denotes the j-th protein, and v_ij denotes the interaction value between the i-th protein and the j-th protein; the symmetric entry corresponding to each triple entry is generated to construct a symmetric sparse matrix W.

6. The protein-protein interaction prediction method based on the alternating direction multiplier method as claimed in claim 4, wherein the S2 comprises:

S2-1: Construct the target loss function Q;

According to the symmetric sparse matrix W, the set Γ of all non-missing values is obtained. Combining the set Γ with the generated linear deviation vectors H and G, and taking the Euclidean distance as the optimization objective, the corresponding objective loss function Q is constructed:

s.t. E = F, E ≥ 0; G = H, G ≥ 0; (1)

In formula (1), E and F are latent feature matrices with M rows and D columns; the linear deviation vectors H and G each contain M elements; Γ denotes the set of non-missing values in the symmetric sparse matrix W corresponding to the protein-protein interaction data; D denotes the latent feature dimension; w_i,j denotes the interaction value between protein i and protein j; e_i,d ∈ E denotes the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix E; f_i,d ∈ F denotes the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix F; g_i ∈ G denotes the i-th element of the linear deviation vector G; h_i ∈ H denotes the i-th element of the linear deviation vector H; h_j ∈ H denotes the j-th element of the linear deviation vector H;

S2-2: Construct an augmented Lagrangian function.

According to the principle of the alternating direction multiplier method, the corresponding augmented Lagrangian function ε can be obtained, which is expressed by the following formula:

In formula (2), Γ denotes the set of non-missing values in the symmetric sparse matrix W corresponding to the protein-protein interaction data; M denotes the number of proteins and D the latent feature dimension; w_i,j denotes the interaction value between protein i and protein j; e_i,d ∈ E denotes the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix E; f_i,d ∈ F denotes the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix F; f_j,d ∈ F denotes the d-th element of the latent feature corresponding to the j-th protein in the latent feature matrix F; g_i ∈ G denotes the i-th element of the linear deviation vector G; h_i ∈ H denotes the i-th element of the linear deviation vector H; h_j ∈ H denotes the j-th element of the linear deviation vector H; κ_i,d ∈ K denotes the d-th element of the latent feature corresponding to the i-th protein in the multiplier matrix K; δ_i ∈ Z denotes the i-th element of the multiplier vector Z; ρ_i and u_i are penalty parameters, which are non-negative integers;

S2-3: Initialize the parameters used for prediction;

The parameters used for prediction are initialized; they include the latent feature matrix F, the latent feature matrix E, the multiplier matrix K, the linear deviation vector H, the linear deviation vector G, the multiplier vector Z, the latent feature dimension D, the maximum number of training iterations T, the iteration counter t, the convergence termination threshold τ, the learning rate η, and the penalty parameters ρ_i and u_i.

7. The protein-protein interaction prediction method based on the alternating direction multiplier method as claimed in claim 4, wherein the S3 comprises:

S3-1: Iteratively update the iterative parameters, non-negative parameters and multiplier parameters in turn. The update strategy is as follows:

for d = 1 to D

In formula (3), F_{(1~M),d} denotes the d-th column vector of the latent feature matrix F, which contains M elements, and H_d denotes the d-th element of the linear deviation vector H; the (d+1)-th through D-th elements of the linear deviation vector H in the t-th iteration form a linear deviation sub-vector of size (D - d); F^{t+1} denotes the latent feature matrix F in the (t+1)-th iteration; H^{t+1} denotes the linear deviation vector H in the (t+1)-th iteration; G^{t+1} denotes the linear deviation vector G in the (t+1)-th iteration; G^t denotes the linear deviation vector G in the t-th iteration; E^{t+1} denotes the latent feature matrix E in the (t+1)-th iteration; E^t denotes the latent feature matrix E in the t-th iteration; K^{t+1} denotes the multiplier matrix K in the (t+1)-th iteration; K^t denotes the multiplier matrix K in the t-th iteration; Z^{t+1} denotes the multiplier vector Z in the (t+1)-th iteration; Z^t denotes the multiplier vector Z in the t-th iteration; η controls the step size in the iterative optimization of each parameter; ∇_K ε(E^{t+1}, G^{t+1}, F^{t+1}, H^{t+1}, K, Z^t) denotes the partial derivative of the augmented Lagrangian function ε with respect to the multiplier matrix K; ∇_Z ε(E^{t+1}, G^{t+1}, F^{t+1}, H^{t+1}, K^t, Z) denotes the partial derivative of the augmented Lagrangian function ε with respect to the multiplier vector Z;

S3-2: Iteratively optimize the augmented Lagrangian objective loss function ε;

The training iteration formulas are as follows:

κ_i,d ← κ_i,d + ηρ_i(f_i,d - e_i,d)

δ_i ← δ_i + ηu_i(h_i - g_i)

In formula (4), Γ(i) denotes the set of all non-missing values in Γ that are related to protein i; D denotes the latent feature dimension; w_i,j denotes the interaction value between protein i and protein j; e_i,d ∈ E is the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix E; f_i,d ∈ F is the d-th element of the latent feature corresponding to the i-th protein in the latent feature matrix F; f_j,d ∈ F is the d-th element of the latent feature corresponding to the j-th protein in the latent feature matrix F; g_i ∈ G is the i-th element of the linear deviation vector G; h_j ∈ H is the j-th element of the linear deviation vector H; h_i ∈ H is the i-th element of the linear deviation vector H; κ_i,d ∈ K is the d-th element of the latent feature corresponding to the i-th protein in the multiplier matrix K; δ_i ∈ Z is the i-th element of the multiplier vector Z; ρ_i and u_i are the penalty parameters;

S3-3: Determine whether the iterative process of the augmented Lagrangian objective loss function ε is terminated:

The termination condition is as follows: after each iteration of the augmented Lagrangian objective loss function ε, the value of the iteration counter t is increased by 1, and training stops when t reaches the maximum number of training iterations T; alternatively, training stops when, after the current iteration, the absolute value of the difference between the newly computed value of ε and its value in the previous round is smaller than the convergence termination threshold τ.

8. The protein-protein interaction prediction method based on the alternating direction multiplier method as claimed in claim 4, wherein in the S4, the calculation formula of the missing protein-protein interaction prediction value is:

In formula (5),