WO2023125114A1 - Method for mining protein functional modules, computer device and storage medium - Google Patents

Method for mining protein functional modules, computer device and storage medium

Info

Publication number
WO2023125114A1
Authority
WO
WIPO (PCT)
Prior art keywords
protein
node
nodes
core
complex
Prior art date
Application number
PCT/CN2022/140090
Other languages
English (en)
French (fr)
Other versions
WO2023125114A9 (zh)
Inventor
陈宏威
吴红艳
纪超杰
蔡云鹏
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2023125114A1 publication Critical patent/WO2023125114A1/zh
Publication of WO2023125114A9 publication Critical patent/WO2023125114A9/zh

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The invention relates to the technical field of systems biology, and in particular to a method for mining protein functional modules, a computer device, and a computer-readable storage medium.
  • In a protein-protein interaction (PPI) network, a collection of proteins that accomplish a specific molecular process through their interactions is called a protein functional module, such as a protein complex or a protein signaling pathway.
  • Mining protein functional modules not only reveals the functional organization of cells and how they perform physiological functions, but also helps people understand various biological processes, uncover the mechanisms of disease, and find new drug targets. Mining the functional modules of protein interactions is therefore of great significance.
  • Based on protein interaction networks, existing techniques usually predict protein function by identifying protein complexes composed of multiple small protein molecules.
  • A protein complex consists of multiple small protein molecules that are tightly connected and interact pairwise, so it appears as a dense region of the network; a shallow graph neural network algorithm can usually learn the relevant structural information well and thereby identify protein complexes.
  • However, a protein interaction network also contains chains of small protein molecules that assemble into protein signaling pathways; compared with complexes, signaling pathways are sparse, and high-order information about the network structure is needed to identify them.
  • The article "Protein complexes identification based on GO attributed network embedding" (Bo Xu et al., 2018) discloses a method for identifying protein complexes: protein node representations are learned with the accelerated attributed network embedding (AANE) model, all maximal clique structures (more than three protein nodes) are found in the protein interaction network, the density of each maximal clique is computed from the similarity of the node representations, and the cliques are expanded over several iterations according to the principle that the overall density must increase, to obtain protein complexes.
  • However, this method learns only the structural information of the first-order neighbors of protein nodes with AANE, so it is suitable only for mining protein complexes in dense subgraphs and ignores the protein signaling pathways in the protein interaction network, even though both are equally important bases for discovering protein function.
  • In view of this, the present invention provides a method for mining protein functional modules, a computer device, and a computer-readable storage medium, to solve the problem of how to mine both protein complexes and protein signaling pathways from a protein interaction network.
  • One aspect of the present invention provides a method for mining protein functional modules, comprising the steps described below.
  • Step S1 includes: [G^k X]_i denotes the graph representation of node v_i at the k-th convolution layer; h_i^(k−1) and h_i^(k) denote the state values of node v_i at the (k−1)-th and k-th convolution layers; S() is a nonlinear transformation function; W_h and b_h are the learnable network parameters of the node-level adaptive graph convolutional network model; σ denotes the activation function; p_i^(k) denotes the probability value of node v_i at the k-th convolution layer.
  • Step S2 includes: λ_tig denotes the first loss coefficient and λ_sep the second loss coefficient, both constants; L_tig characterizes the similarity between nodes within a cluster, and L_sep the similarity between nodes of different clusters.
  • The number m of clusters is set as follows: center_i is the center point of cluster C_i; a curve is fitted to the SSE values, the inflection point where the decline of the SSE value turns from fast to slow is located on the curve, and the m value at that point is selected as the final number of clusters.
  • Step S4 includes: based on the adjacency matrix A and the protein node vector representations, computing the similarity of protein nodes v_i and v_j with a cosine similarity formula and constructing a weighted adjacency matrix W, where a_ij is an element of the adjacency matrix A and w_ij an element of the weighted adjacency matrix W.
  • The present invention also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the above method for mining protein functional modules are implemented.
  • The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method for mining protein functional modules.
  • The method mines protein functional modules with a node-level adaptive graph convolutional network (NASGC): through an adaptive mechanism, each protein node learns high-order and low-order neighbor information separately, and the learned protein node vector representations fuse high-order and low-order structural information on top of the gene ontology attribute features of the nodes, yielding more generalizable protein node representations. Both protein complexes and protein signaling pathways can thus be mined from the protein interaction network, the mined protein complexes better match reality, and the accuracy of protein function identification is improved.
  • Fig. 1 is a workflow diagram of the method for mining protein functional modules in an embodiment of the present invention.
  • Fig. 2 is a process illustration of the method for mining protein functional modules in an embodiment of the present invention.
  • Fig. 3 is a structural block diagram of a computer device in an embodiment of the present invention.
  • Fig. 1 is a schematic flowchart of the method for mining protein functional modules provided by an embodiment of the present invention.
  • The method for mining protein functional modules of the present application is applied to a terminal device, which may be a server, a mobile device, or a system in which a server and a mobile device cooperate.
  • Accordingly, the parts of the terminal device, such as its units, subunits, modules, and submodules, may all be located in the server, all in the mobile device, or distributed between the server and the mobile device.
  • The terminal device is, for example, a computer device.
  • The above server may be hardware or software.
  • When the server is hardware, it can be implemented as a distributed cluster of multiple servers or as a single server.
  • When the server is software, it can be implemented as multiple pieces of software or software modules, for example software or software modules used to provide a distributed server, or as a single piece of software or software module.
  • A method for mining protein functional modules provided by an embodiment of the present invention includes the following steps.
  • Step S1: input the protein interaction network (PPI network) into the node-level adaptive graph convolutional network (NASGC) model for learning and training, and obtain protein node vector representations.
  • In the model, each protein node learns high-order and low-order neighbor information through the adaptive mechanism, and the protein node vector representations are learned.
  • High-order and low-order structural information is thereby fused onto the gene ontology attribute features of the protein nodes, giving more generalizable protein node representations.
  • Step S1 of this embodiment specifically includes the following sub-steps.
  • Step S11: obtain the protein interaction network and construct the corresponding adjacency matrix A and gene ontology attribute feature matrix X.
  • Step S12: compute the normalized Laplacian matrix Ls = I − D^(−1/2) A D^(−1/2) from the adjacency matrix A and construct the low-pass filter G = U(I − Λ/2)U^T, where Λ is the diagonal matrix of the eigenvalues of Ls, I is the diagonal matrix whose diagonal entries are all 1, and U is the matrix of eigenvectors of Ls.
  • Step S14: apply the low-pass graph filter G to the gene ontology attribute feature matrix X to perform the k-th convolution, giving the graph representation G^k X = G · (G^(k−1) X) of the current convolution layer of the protein interaction network.
  • Step S15: based on the graph representation G^k X of the current convolution layer, compute the state value of each protein node in the protein interaction network as h_i^(k) = S([G^k X]_i, h_i^(k−1)), where [G^k X]_i is the graph representation of node v_i at the k-th convolution, h_i^(k−1) and h_i^(k) are the state values of node v_i at the (k−1)-th and k-th convolutions, and S() is a nonlinear transformation function (an RNN/GRU).
  • Step S16: from the state values, compute the probability that each node stops iterating, p_i^(k) = σ(W_h · h_i^(k) + b_h), where W_h and b_h are the learnable network parameters of the node-level adaptive graph convolutional network model, σ is the activation function (sigmoid), and p_i^(k) is the probability value of node v_i at the k-th convolution.
  • Step S18: compute the probability value of the last convolution layer of each node v_i as the remainder p_i^(N_i) = 1 − Σ_{k=1..N_i−1} p_i^(k).
  • Step S19: for each node v_i, linearly combine the graph representations of its first N_i convolution layers with the probability values to obtain the protein node vector representation z_i = Σ_{k=1..N_i} p_i^(k) · [G^k X]_i.
  • Each protein node thus learns high-order and low-order neighbor information through the adaptive mechanism, and the vector representation information of the protein node is learned.
  • Because high-order and low-order structural information is fused onto the gene ontology attribute features of the protein nodes, a more generalizable representation of the protein nodes is obtained.
  • Step S2: based on the protein node vector representations, cluster with the K-means algorithm to obtain soft labels of the clustering result for the protein nodes, set a loss function according to the soft labels, perform backpropagation, and update the network parameters of the model.
  • Step S2 of this embodiment specifically includes the following sub-steps.
  • Step S21: set the number m of clusters, m being a positive integer.
  • The number m of clusters is set as described below.
  • center_i is the center point of cluster C_i.
  • Step S23: according to the soft labels C of the clustering result, set the loss function L,
  • where λ_tig is the first loss coefficient,
  • λ_sep is the second loss coefficient,
  • λ_tig and λ_sep are both constants,
  • L_tig characterizes the similarity between nodes within a cluster,
  • and L_sep characterizes the similarity between nodes of different clusters.
  • A good cluster partition should have small intra-cluster distances, so the parameter L_tig, characterizing the similarity between nodes within a cluster, is introduced.
  • A good cluster partition should also have large inter-cluster distances, so the parameter L_sep, characterizing the similarity between nodes of different clusters, is introduced.
  • A larger λ_tig drives the nodes within a cluster closer together, while λ_sep drives the nodes of different clusters to separate well.
  • λ_tig and λ_sep are adversarial parameters that control the trade-off between the two objectives of compactness and separation.
  • The ratio λ_tig : λ_sep is, for example, 1:3, 1:5, 1:10, 1:15, 1:20, 1:25, 1:30, 1:35, 1:40, 1:45, or 1:50.
  • Step S3: iterate steps S1 to S2 until the model converges or the maximum number of iterations is reached, and obtain the final protein node vector representations and clustering result of the last iteration.
  • Specifically, based on the preset maximum number of iterations, steps S1 to S2 are repeated until the model converges or the maximum number of iterations is reached; in the last iteration, the final protein node vector representations are obtained in step S1 and the final clustering result in step S2.
  • Step S4: based on the final protein node vector representations, compute the similarity of protein nodes with the cosine similarity formula and construct a weighted adjacency matrix.
  • Specifically, step S4 includes: based on the adjacency matrix A and the protein node vector representations, compute the similarity of protein nodes v_i and v_j with the cosine similarity formula and construct the weighted adjacency matrix W,
  • where a_ij is an element of the adjacency matrix A
  • and w_ij is an element of the weighted adjacency matrix W.
  • Step S5: screen basic protein complex structures out of the protein interaction network and expand them based on calculations over the weighted adjacency matrix, to obtain protein complexes.
  • Step S5 of this embodiment specifically includes the following sub-steps.
  • Step S51: set up and initialize the sets Alternative_core, Complex_Seed_core, and Complex_set.
  • Step S52: use a clique mining method to screen maximal clique structures Clique_q out of the protein interaction network and put them into the set Alternative_core.
  • Here q is the serial number of the maximal clique structure.
  • Step S53: based on the weighted adjacency matrix W, compute the density scores of all maximal cliques Clique_q in the set Alternative_core and sort them by density score in descending order.
  • Step S54: remove the maximal clique with the highest density score from the set Alternative_core and put it into the set Complex_Seed_core as a basic protein complex structure.
  • For example, step S53 sorts the cliques by descending density score as Clique_1, Clique_2, Clique_3, ...; the maximal clique with the highest density score is Clique_1, so Clique_1 is removed from the set Alternative_core and put into the set Complex_Seed_core as a basic protein complex structure.
  • Step S55: traverse the remaining maximal clique structures of the set Alternative_core; if protein nodes of another maximal clique overlap with those of the maximal clique with the highest density score: if fewer than 2 nodes overlap, the overlapping nodes are deleted from the other maximal clique, which is kept only if more than 3 nodes remain; if 2 or more nodes overlap, the overlapping nodes are not deleted.
  • For example, Clique_1 is put into the set Complex_Seed_core as a basic protein complex structure, and the remaining maximal clique structures of the set Alternative_core include Clique_2, Clique_3, Clique_4, ...
  • In reality, multiple protein complexes can share a common internal structure, so the probability of generating protein complexes that share a maximal clique should be increased.
  • By using maximal clique structures as the basic frameworks of protein complexes and retaining the protein nodes shared between maximal cliques as far as possible, protein complexes that share a maximal clique can be reflected, which better matches the real situation.
  • Step S56: repeat steps S53–S55 until the set Alternative_core is empty, obtaining several basic protein complex structures in the set Complex_Seed_core.
  • Step S57: for a maximal clique Clique_j in the set Complex_Seed_core and any neighbor node p_i of a protein node in Clique_j, compute the correlation score between p_i and Clique_j from the protein node similarities; if the correlation score exceeds a preset threshold θ_1, embed the protein node p_i into the maximal clique Clique_j.
  • Step S58: after all neighbor protein nodes of the maximal clique Clique_j have been traversed, the node expansion of the basic structure is complete and one protein complex is determined; remove it from the set Complex_Seed_core and put it into the set Complex_set.
  • Step S59: repeat steps S57–S58 until the set Complex_Seed_core is empty, obtaining the finally mined protein complexes in the set Complex_set.
  • Step S6: screen basic protein signaling pathway structures out of each cluster of the clustering result and expand them based on calculations over the weighted adjacency matrix, to obtain protein signaling pathways.
  • Specifically, the shortest paths within each cluster are selected as basic protein signaling pathway structures, correlations are computed from the weighted adjacency matrix, and neighbor nodes meeting a predetermined condition are embedded at the endpoints of the basic structures, giving the second kind of protein functional module: protein signaling pathways.
  • Step S6 of this embodiment specifically includes the following sub-steps.
  • Step S61: set up and initialize the sets Pathway_Seed_core and Pathway_set.
  • Step S62: based on the protein interaction network, traverse the m clusters, find the shortest path between each pair of nodes in a cluster whose length does not exceed 3, and put all paths so screened into the set Pathway_Seed_core as basic protein signaling pathway structures.
  • Step S63: based on the weighted adjacency matrix W, compute the density scores of all shortest paths shortest_path_q in the set Pathway_Seed_core and sort them by density score in descending order.
  • Step S64: take the shortest path shortest_path_j with the highest density in the set Pathway_Seed_core; for any neighbor node p_i at an end of shortest_path_j, compute the correlation score between the neighbor protein node and the shortest path from the protein node similarities; if the correlation score exceeds a preset threshold θ_2, embed the neighbor protein node p_i at the end of shortest_path_j.
  • Step S65: after all neighbor protein nodes at the ends of shortest_path_j have been traversed, the node expansion of the basic structure is complete and one protein signaling path is determined; remove it from the set Pathway_Seed_core and put it into the set Pathway_set.
  • Step S66: repeat steps S64–S65 until the set Pathway_Seed_core is empty, obtaining the finally mined protein signaling paths in the set Pathway_set.
  • The method for mining protein functional modules provided by the above embodiment is based on the node-level adaptive graph convolutional network (NASGC): through an adaptive mechanism, each protein node learns high-order and low-order neighbor information separately, and the learned protein node vector representations fuse high-order and low-order structural information on top of the gene ontology attribute features of the nodes, yielding more generalizable protein node representations, so that both protein complexes and protein signaling pathways can be mined from the protein interaction network.
  • The mined protein complexes better match the real situation, improving the accuracy of protein function identification.
  • An embodiment of the present invention also provides a computer device. As shown in Fig. 3, the computer device includes a processor 10, a memory 20, an input device 30, and an output device 40.
  • The processor 10 is provided with a GPU, and there may be one or more processors 10.
  • One processor 10 is taken as an example in Fig. 3.
  • The processor 10, the memory 20, the input device 30, and the output device 40 of the computer device can be connected by a bus or in other ways.
  • The memory 20, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules.
  • The processor 10 runs the software programs, instructions, and modules stored in the memory 20 to execute the various functional applications and data processing of the device, that is, to realize the steps of the method for mining protein functional modules described in the foregoing embodiments of the present invention.
  • The input device 30 is used to receive image data and input numbers or character information, and to generate key signal inputs related to the user settings and function control of the device.
  • The output device 40 may include a display device such as a display screen, for example for displaying images.
  • An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method for mining protein functional modules in the foregoing embodiments of the present invention are implemented. The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to magnetic memory, optical memory, and semiconductor memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Biotechnology (AREA)
  • Physiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for mining protein functional modules, a computer device, and a computer-readable storage medium. Based on a node-level adaptive graph convolutional network (NASGC) model, an adaptive mechanism lets each protein node learn high-order and low-order neighbor information separately. The learned vector representation of each protein node fuses high-order and low-order structural information on top of the gene ontology attribute features of the node, yielding a more generalizable protein node representation. Both protein complexes and protein signaling pathways can thus be mined from a protein interaction network, the mined protein complexes better match the real situation, and the accuracy of protein function identification is improved.

Description

Method for mining protein functional modules, computer device and storage medium
Technical Field
The present invention relates to the technical field of systems biology, and in particular to a method for mining protein functional modules, a computer device, and a computer-readable storage medium.
Background
Protein function is usually expressed through interactions between proteins or between nucleic acids. These interactions occur throughout the life activities of cells and cross-link to form a protein-protein interaction (PPI) network. In a PPI network, a collection of proteins that accomplish a specific molecular process through their interactions is called a protein functional module, for example a protein complex or a protein signaling pathway. Mining protein functional modules not only reveals the functional organization of cells and how they perform physiological functions, but also helps people understand various biological processes, uncover the mechanisms of disease, and find new drug targets. Mining the functional modules of protein interactions is therefore of great significance.
Based on protein interaction networks, existing techniques usually predict protein function by identifying protein complexes assembled from multiple small protein molecules. A protein complex consists of multiple small protein molecules that are tightly connected and interact pairwise, so it appears as a dense region of the network; a shallow graph neural network algorithm can usually learn the relevant structural information well and thereby identify protein complexes. However, a protein interaction network also contains chains of small protein molecules that assemble into protein signaling pathways; compared with complexes, signaling pathways are sparse, and high-order information about the network structure is needed to identify them.
The article "Protein complexes identification based on GO attributed network embedding" (Bo Xu et al., 2018) discloses a method for identifying protein complexes: protein node representations are learned with the accelerated attributed network embedding (AANE) model, all maximal clique structures (more than three protein nodes) are found in the protein interaction network, the density of each maximal clique is computed from the similarity of the node representations, and the cliques are expanded over several iterations according to the principle that the overall density must increase after a new protein node is added, to obtain protein complexes. However, this method learns only the structural information of the first-order neighbors of protein nodes with AANE, so it is suitable only for mining protein complexes in dense subgraphs and ignores the protein signaling pathways in the protein interaction network, even though both are equally important bases for discovering protein function.
Summary of the Invention
In view of this, the present invention provides a method for mining protein functional modules, a computer device, and a computer-readable storage medium, to solve the problem of how to mine both protein complexes and protein signaling pathways from a protein interaction network.
To solve the above technical problem, one aspect of the present invention provides a method for mining protein functional modules, comprising the steps of:
S1. inputting a protein interaction network into a node-level adaptive graph convolutional network model for learning and training, so that each protein node learns high-order and low-order neighbor information, and obtaining protein node vector representations;
S2. based on the protein node vector representations, clustering with the K-means algorithm to obtain soft labels of the clustering result for the protein nodes, setting a loss function according to the soft labels of the clustering result, performing backpropagation, and updating the network parameters of the model;
S3. iterating steps S1 to S2 until the model converges or a maximum number of iterations is reached, and obtaining the final protein node vector representations and clustering result of the last iteration;
S4. based on the final protein node vector representations, computing the similarity of protein nodes with a cosine similarity formula and constructing a weighted adjacency matrix;
S5. screening basic protein complex structures out of the protein interaction network and expanding them based on calculations over the weighted adjacency matrix, to obtain protein complexes;
S6. screening basic protein signaling pathway structures out of each cluster of the clustering result and expanding them based on calculations over the weighted adjacency matrix, to obtain protein signaling pathways.
In a preferred scheme, step S1 comprises:
S11. obtaining the protein interaction network and constructing the corresponding adjacency matrix A and gene ontology attribute feature matrix X, wherein the nodes of the protein interaction network are denoted v = {v_1, v_2, ..., v_n}, the gene ontology attribute feature matrix is X = {x_1, x_2, ..., x_n}^T, n is the total number of protein nodes, each x_i has dimension d, n and d are positive integers, and i = 1..n;
S12. computing the normalized Laplacian matrix Ls = I − D^(−1/2) A D^(−1/2) from the adjacency matrix A and constructing the low-pass filter
G = U(I − Λ/2)U^T,
where D = diag(d_1, d_2, ..., d_n) is the degree matrix of the adjacency matrix A, d_i is the number of edges of node v_i, Λ is the diagonal matrix of the eigenvalues of Ls, I is the diagonal matrix whose diagonal entries are all 1, and U is the matrix of eigenvectors of Ls;
S13. setting the convolution layer index k = t and looping t from 0 to M over the following steps S14 to S18, M being the maximum number of convolution layers and a positive integer;
S14. applying the low-pass graph filter G to the gene ontology attribute feature matrix X to perform the k-th convolution, giving the graph representation of the current convolution layer of the protein interaction network:
G^k X = G · (G^(k−1) X);
S15. based on the graph representation G^k X of the current convolution layer, computing the state value of each protein node in the protein interaction network:
h_i^(k) = S([G^k X]_i, h_i^(k−1)),
where [G^k X]_i is the graph representation of node v_i at the k-th convolution, h_i^(k−1) is the state value of node v_i at the (k−1)-th convolution, h_i^(k) is the state value of node v_i at the k-th convolution, and S() is a nonlinear transformation function;
S16. computing from the state values the probability that each node stops the iterative convolution:
p_i^(k) = σ(W_h · h_i^(k) + b_h),
where W_h and b_h are the learnable network parameters of the node-level adaptive graph convolutional network model, σ is the activation function, and p_i^(k) is the probability value of node v_i at the k-th convolution;
S17. setting a probability threshold ε and, for each node v_i, computing the cumulative probability of its first k convolution layers and comparing it with ε: if the cumulative probability reaches ε, stopping the iterative convolution of node v_i and recording its number of convolution layers as N_i = k′; if the layer index k reaches M while the cumulative probability is still below ε, setting the number of convolution layers of node v_i to N_i = M; the depth N_i at which node v_i stops the iterative convolution being
N_i = min{ k′ ≤ M : Σ_{k=1..k′} p_i^(k) ≥ ε }, with N_i = M if no such k′ exists;
S18. computing the probability value of the last convolution layer of each node v_i:
p_i^(N_i) = 1 − Σ_{k=1..N_i−1} p_i^(k);
S19. for each node v_i, linearly combining the graph representations of its first N_i convolution layers with the probability values to obtain the protein node vector representation:
z_i = Σ_{k=1..N_i} p_i^(k) · [G^k X]_i.
In a preferred scheme, step S2 comprises:
S21. setting the number m of clusters, m being a positive integer;
S22. based on the protein node vector representations, clustering with the K-means algorithm to obtain the soft labels of the clustering result for the protein nodes;
S23. according to the soft labels of the clustering result, setting the loss function L as
L = λ_tig · L_tig + λ_sep · (1 / L_sep),
where λ_tig is the first loss coefficient, λ_sep is the second loss coefficient, λ_tig and λ_sep are both constants, L_tig characterizes the similarity between nodes within a cluster, and L_sep characterizes the similarity between nodes of different clusters [the formulas for L_tig and L_sep appear as images in the original publication];
S24. performing backpropagation according to the loss function and updating the network parameters of the node-level adaptive graph convolutional network model.
Step S3 comprises: based on a preset maximum number of iterations, repeating steps S1 to S2 until the model converges or the maximum number of iterations is reached; in the last iteration, the final protein node vector representations are obtained in step S1, and the final clustering result C = {C_1, C_2, ..., C_m}, forming m clusters, is obtained in step S2.
In a preferred scheme, the number m of clusters is set as follows:
based on the protein interaction network, setting m = r and clustering with the K-means algorithm to obtain C = {C_1, C_2, ..., C_r}, where r takes values from 2 to R and R is a positive integer;
for each candidate value of m, computing the sum of squared errors (SSE) of the distances from each node to its cluster center with the elbow method:
SSE = Σ_{i=1..m} Σ_{p∈C_i} ‖p − center_i‖²,
where p ranges over the nodes of cluster C_i and center_i is the center point of cluster C_i;
fitting a curve of the computed SSE values against the candidate values of m, locating on the curve the inflection point where the decline of the SSE value turns from fast to slow, and selecting the m value at that point as the final number of clusters.
In a preferred scheme, in the loss function L, the ratio λ_tig : λ_sep takes a value such as from 1:3 to 1:50 [the exact expression appears as an image in the original publication].
In a preferred scheme, step S4 comprises: based on the adjacency matrix A and the protein node vector representations, computing the similarity of protein nodes v_i and v_j with the following cosine similarity formula and constructing the weighted adjacency matrix W:
w_ij = a_ij · (z_i · z_j) / (‖z_i‖ · ‖z_j‖),
where a_ij is an element of the adjacency matrix A and w_ij is an element of the weighted adjacency matrix W.
In a preferred scheme, step S5 comprises:
S51. setting up and initializing the sets Alternative_core, Complex_Seed_core, and Complex_set;
S52. applying a clique mining method to screen maximal clique structures Clique_q out of the protein interaction network and putting them into the set Alternative_core;
S53. based on the weighted adjacency matrix W, computing the density scores of all maximal cliques Clique_q in the set Alternative_core and sorting them by density score in descending order [the density-score formula appears as an image in the original publication];
S54. removing the maximal clique with the highest density score from the set Alternative_core and putting it into the set Complex_Seed_core as a basic protein complex structure;
S55. traversing the remaining maximal clique structures of the set Alternative_core and, when the protein nodes of another maximal clique overlap with those of the maximal clique currently having the highest density score: if fewer than 2 nodes overlap, deleting the overlapping nodes from the other maximal clique and keeping it only if more than 3 nodes remain; if 2 or more nodes overlap, not deleting the overlapping nodes;
S56. repeating steps S53–S55 until the set Alternative_core is empty, obtaining several basic protein complex structures in the set Complex_Seed_core;
S57. for a maximal clique Clique_j in the set Complex_Seed_core and any neighbor node p_i of a protein node in Clique_j, computing the correlation score between the neighbor node p_i and the maximal clique Clique_j from the protein node similarities and, if the correlation score exceeds a preset threshold θ_1, embedding the neighbor protein node p_i into the maximal clique Clique_j [the correlation-score formula appears as an image in the original publication];
S58. after all neighbor protein nodes of the maximal clique Clique_j have been traversed, determining one protein complex, removing it from the set Complex_Seed_core, and putting it into the set Complex_set;
S59. repeating steps S57–S58 until the set Complex_Seed_core is empty, obtaining the finally mined protein complexes in the set Complex_set.
In a preferred scheme, step S6 comprises:
S61. setting up and initializing the sets Pathway_Seed_core and Pathway_set;
S62. based on the protein interaction network, traversing the m clusters, finding the shortest paths between each pair of nodes within a cluster whose length does not exceed 3, and putting all paths so screened into the set Pathway_Seed_core as basic protein signaling pathway structures;
S63. based on the weighted adjacency matrix W, computing the density scores of all shortest paths shortest_path_q in the set Pathway_Seed_core and sorting them by density score in descending order [the density-score formula appears as an image in the original publication];
S64. taking the shortest path shortest_path_j with the highest density in the set Pathway_Seed_core and, for any neighbor node p_i at an end of shortest_path_j, computing the correlation score between the neighbor protein node and the shortest path from the protein node similarities; if the correlation score exceeds a preset threshold θ_2, embedding the neighbor protein node p_i at the end of shortest_path_j [the correlation-score formula appears as an image in the original publication];
S65. after all neighbor protein nodes at the ends of the shortest path shortest_path_j have been traversed, determining one protein signaling path, removing it from the set Pathway_Seed_core, and putting it into the set Pathway_set;
S66. repeating steps S64–S65 until the set Pathway_Seed_core is empty, obtaining the finally mined protein signaling paths in the set Pathway_set.
To solve the above technical problem, the present invention further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method for mining protein functional modules described above.
To solve the above technical problem, the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for mining protein functional modules described above.
The method for mining protein functional modules provided by the embodiments of the present invention is based on the node-level adaptive graph convolutional network (NASGC): through an adaptive mechanism, each protein node learns high-order and low-order neighbor information separately, and the learned protein node vector representations fuse high-order and low-order structural information on top of the gene ontology attribute features of the protein nodes, yielding more generalizable protein node representations. Both protein complexes and protein signaling pathways can thus be mined from the protein interaction network, the mined protein complexes better match the real situation, and the accuracy of protein function identification is improved.
Brief Description of the Drawings
Fig. 1 is a workflow diagram of the method for mining protein functional modules in an embodiment of the present invention;
Fig. 2 is a process illustration of the method for mining protein functional modules in an embodiment of the present invention;
Fig. 3 is a structural block diagram of a computer device in an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the drawings. The embodiments of the present invention shown in and described according to the drawings are merely exemplary, and the present invention is not limited to these embodiments.
It should also be noted that, to avoid obscuring the present invention with unnecessary detail, only the structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details of little relevance to the present invention are omitted.
Fig. 1 is a schematic flowchart of the method for mining protein functional modules provided by an embodiment of the present invention.
The method for mining protein functional modules of the present application is applied to a terminal device, which may be a server, a mobile device, or a system in which a server and a mobile device cooperate. Accordingly, the parts of the terminal device, such as its units, subunits, modules, and submodules, may all be located in the server, all in the mobile device, or distributed between the server and the mobile device. The terminal device is, for example, a computer device.
Further, the above server may be hardware or software. When the server is hardware, it can be implemented as a distributed cluster of multiple servers or as a single server. When the server is software, it can be implemented as multiple pieces of software or software modules, for example software or software modules used to provide a distributed server, or as a single piece of software or software module.
Referring to Fig. 1 and Fig. 2, a method for mining protein functional modules provided by an embodiment of the present invention includes the following steps.
Step S1: input the protein interaction network (PPI network) into the node-level adaptive graph convolutional network (NASGC) model for learning and training, and obtain protein node vector representations.
Specifically, in the node-level adaptive graph convolutional network model, an adaptive mechanism lets each protein node learn high-order and low-order neighbor information separately, and the protein node vector representations are learned. High-order and low-order structural information is thereby fused onto the gene ontology attribute features of the protein nodes, giving more generalizable protein node representations.
Step S1 of this embodiment specifically includes the following sub-steps.
Step S11: obtain the protein interaction network and construct the corresponding adjacency matrix A and gene ontology attribute feature matrix X.
The nodes of the protein interaction network are denoted v = {v_1, v_2, ..., v_n}, and the gene ontology attribute feature matrix is X = {x_1, x_2, ..., x_n}^T, where n is the total number of protein nodes, each x_i has dimension d, n and d are positive integers, and i = 1..n.
It should be noted that for the elements a_ij of the adjacency matrix A: a_ij = 1 when nodes v_i and v_j interact, and a_ij = 0 when they do not.
Step S12: compute the normalized Laplacian matrix Ls from the adjacency matrix A and construct the low-pass filter G.
Specifically, the normalized Laplacian matrix is Ls = I − D^(−1/2) A D^(−1/2), and the low-pass filter is
G = U(I − Λ/2)U^T,
where D = diag(d_1, d_2, ..., d_n) is the degree matrix of the adjacency matrix A, d_i is the number of edges of node v_i, Λ is the diagonal matrix of the eigenvalues of Ls, I is the diagonal matrix whose diagonal entries are all 1, and U is the matrix of eigenvectors of Ls.
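For illustration, the construction of steps S11–S12 can be sketched in a few lines of NumPy. This is a minimal sketch, not the patent's implementation: the toy adjacency matrix and the random feature matrix are placeholders, and the closed form G = I − Ls/2 (equal to U(I − Λ/2)U^T for the symmetric matrix Ls) is used instead of an explicit eigendecomposition.

```python
import numpy as np

def low_pass_filter(A: np.ndarray) -> np.ndarray:
    """Build the low-pass graph filter G = I - Ls/2 from adjacency matrix A,
    where Ls = I - D^(-1/2) A D^(-1/2) is the normalized Laplacian (step S12)."""
    n = A.shape[0]
    deg = A.sum(axis=1)                               # d_i: number of edges of node v_i
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    Ls = np.eye(n) - d_inv_sqrt @ A @ d_inv_sqrt
    # Spectral form: G = U (I - Λ/2) U^T for Ls = U Λ U^T.
    return np.eye(n) - 0.5 * Ls

# Toy PPI network with 4 proteins and GO attribute features of dimension d = 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.rand(4, 3)
G = low_pass_filter(A)
GX = G @ X   # graph representation after one convolution (step S14, k = 1)
```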
Step S13: set the convolution layer index k = t and loop t from 0 to M over the following steps S14 to S18, where M is the maximum number of convolution layers and is a positive integer.
Step S14: apply the low-pass graph filter G to the gene ontology attribute feature matrix X to perform the k-th convolution, giving the graph representation of the current convolution layer of the protein interaction network:
G^k X = G · (G^(k−1) X).
Step S15: based on the graph representation G^k X of the current convolution layer, compute the state value of each protein node in the protein interaction network:
h_i^(k) = S([G^k X]_i, h_i^(k−1)),
where [G^k X]_i is the graph representation of node v_i at the k-th convolution, h_i^(k−1) is the state value of node v_i at the (k−1)-th convolution, h_i^(k) is the state value of node v_i at the k-th convolution, and S() is a nonlinear transformation function (an RNN/GRU).
Step S16: from the state values, compute the probability that each node stops iterating:
p_i^(k) = σ(W_h · h_i^(k) + b_h),
where W_h and b_h are the learnable network parameters of the node-level adaptive graph convolutional network model, σ is the activation function (sigmoid), and p_i^(k) is the probability value of node v_i at the k-th convolution.
Step S17: set a probability threshold ε; for each node v_i, compute the cumulative probability of its first k convolution layers and compare it with ε. If the cumulative probability reaches ε, stop the iterative convolution of node v_i and record its number of convolution layers as N_i = k′; if the layer index k reaches M while the cumulative probability is still below ε, set the number of convolution layers of node v_i to N_i = M. The depth N_i at which node v_i stops the iterative convolution is
N_i = min{ k′ ≤ M : Σ_{k=1..k′} p_i^(k) ≥ ε }, with N_i = M if no such k′ exists.
Step S18: compute the probability value of the last convolution layer of each node v_i:
p_i^(N_i) = 1 − Σ_{k=1..N_i−1} p_i^(k).
When all nodes have stopped iterating or the maximum number of convolution layers M is reached, the loop ends.
Step S19: for each node v_i, linearly combine the graph representations of its first N_i convolution layers with the probability values to obtain the protein node vector representation:
z_i = Σ_{k=1..N_i} p_i^(k) · [G^k X]_i.
The above protein node vector representations are output, and the node information of the protein interaction network is updated.
Through the above process, based on the node-level adaptive graph convolutional network (NASGC), the adaptive mechanism lets each protein node learn high-order and low-order neighbor information separately; the learned vector representation z_i of each protein node fuses high-order and low-order structural information on top of the gene ontology attribute features of the protein node, so a more generalizable protein node representation is obtained.
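The per-node adaptive depth mechanism of steps S13–S19 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the parameter shapes (Ws for the state update, W_h and b_h for the halting gate) and the tanh recurrence standing in for the RNN/GRU S() are illustrative choices, not the patent's exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_representations(G, X, Ws, W_h, b_h, M=5, eps=0.9):
    """Sketch of S13-S19: each node accumulates halting probabilities and stops
    convolving once the cumulative probability reaches eps (or at layer M).
    Assumed shapes: X (n, d), Ws (d, d), W_h (d, 1), b_h scalar."""
    n, d = X.shape
    h = np.zeros((n, d))                   # h_i^(0)
    GkX = X.copy()
    z = np.zeros((n, d))                   # output representations z_i
    cum_p = np.zeros(n)                    # cumulative halting probability
    active = np.ones(n, dtype=bool)        # nodes still convolving
    for k in range(1, M + 1):
        GkX = G @ GkX                      # S14: G^k X
        h = np.tanh(GkX @ Ws + h)          # S15: h_i^(k) = S([G^k X]_i, h_i^(k-1))
        p = sigmoid(h @ W_h + b_h).ravel() # S16: halting probability p_i^(k)
        # S17/S18: a node halts when cum_p + p >= eps or k == M; on its last
        # layer the remainder 1 - cum_p is used so the weights sum to 1.
        last = active & ((cum_p + p >= eps) | (k == M))
        p_eff = np.where(last, 1.0 - cum_p, np.where(active, p, 0.0))
        z += p_eff[:, None] * GkX          # S19: z_i = sum_k p_i^(k) [G^k X]_i
        cum_p += np.where(active, p, 0.0)
        active &= ~last                    # halted nodes stop (N_i = k)
        if not active.any():
            break
    return z
```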
Step S2: based on the protein node vector representations, cluster with the K-means algorithm to obtain soft labels of the clustering result for the protein nodes, set a loss function according to the soft labels, perform backpropagation, and update the network parameters of the model.
Step S2 of this embodiment specifically includes the following sub-steps.
Step S21: set the number m of clusters, m being a positive integer.
In this embodiment, the number m of clusters is set as follows.
(1) Based on the gene ontology attribute feature matrix X, for the protein interaction network, set m = r and cluster with the K-means algorithm to obtain C = {C_1, C_2, ..., C_r}, where r takes values from 2 to R and R is a positive integer.
(2) For each candidate value of m, compute the sum of squared errors (SSE) of the distances from each node to its cluster center with the elbow method:
SSE = Σ_{i=1..m} Σ_{p∈C_i} ‖p − center_i‖²,
where p ranges over the nodes of cluster C_i and center_i is the center point of cluster C_i.
(3) Fit a curve of the computed SSE values against the candidate values of m (for example as shown in Fig. 2), locate on the curve the inflection point where the decline of the SSE value turns from fast to slow, and select the m value at that point as the final number of clusters (see the sketch below).
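The elbow selection can be sketched with scikit-learn, whose KMeans `inertia_` attribute is exactly the SSE defined above; the helper name and the default range R = 10 are illustrative choices.

```python
from sklearn.cluster import KMeans

def elbow_sse(Z, R=10, random_state=0):
    """SSE curve for m = 2..R over the node representations Z (n x d array).
    inertia_ is the sum of squared node-to-center distances, i.e. the SSE."""
    sse = {}
    for m in range(2, R + 1):
        km = KMeans(n_clusters=m, n_init=10, random_state=random_state).fit(Z)
        sse[m] = km.inertia_
    return sse  # plot m vs. SSE and pick the elbow as the final m
```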
Step S22: based on the protein node vector representations, cluster with the K-means algorithm to obtain the soft labels of the clustering result for the protein nodes, C = {C_1, C_2, ..., C_m}.
Step S23: according to the soft labels C of the clustering result, set the loss function L as
L = λ_tig · L_tig + λ_sep · (1 / L_sep),
where λ_tig is the first loss coefficient, λ_sep is the second loss coefficient, λ_tig and λ_sep are both constants, L_tig characterizes the similarity between nodes within a cluster, and L_sep characterizes the similarity between nodes of different clusters.
[The formulas for L_tig and L_sep appear as images in the original publication.]
A good cluster partition should have small intra-cluster distances, so the parameter L_tig, characterizing the similarity between nodes within a cluster, is introduced. A good cluster partition should also have large inter-cluster distances, so the parameter L_sep, characterizing the similarity between nodes of different clusters, is introduced.
As for the loss coefficients λ_tig and λ_sep, a larger λ_tig drives the nodes within a cluster closer together, while λ_sep drives the nodes of different clusters to separate well. λ_tig and λ_sep are adversarial parameters that control the trade-off between the two objectives of compactness and separation. For their concrete values, one can first observe the ratio of L_tig to (1/L_sep), which can be roughly estimated from the values obtained after the first iteration, and then choose λ_tig and λ_sep so as to balance the two terms λ_tig·L_tig and λ_sep·(1/L_sep) of the loss function L. In a more preferred embodiment, the ratio λ_tig : λ_sep in the loss function L is, for example, 1:3, 1:5, 1:10, 1:15, 1:20, 1:25, 1:30, 1:35, 1:40, 1:45, or 1:50.
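A hedged sketch of the S23 loss follows. Because the patent gives L_tig and L_sep only as formula images, the sketch substitutes plausible stand-ins: the mean intra-cluster squared distance for L_tig and the mean squared distance between cluster centers for L_sep. These are assumptions, not the patent's exact definitions.

```python
import numpy as np

def clustering_loss(Z, labels, lam_tig=1.0, lam_sep=10.0):
    """L = lam_tig * L_tig + lam_sep * (1 / L_sep), with assumed stand-ins.
    Z: (n, d) node representations; labels: cluster ids 0..m-1 (assumed)."""
    centers = np.stack([Z[labels == c].mean(axis=0) for c in np.unique(labels)])
    # Assumed L_tig: mean squared distance of each node to its cluster center.
    L_tig = np.mean([np.sum((z - centers[c]) ** 2) for z, c in zip(Z, labels)])
    # Assumed L_sep: mean squared distance between distinct cluster centers.
    diff = centers[:, None, :] - centers[None, :, :]
    pair = np.sum(diff ** 2, axis=-1)
    L_sep = pair[np.triu_indices(len(centers), k=1)].mean()
    return lam_tig * L_tig + lam_sep / L_sep
```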
S24: perform backpropagation according to the loss function and update the network parameters of the node-level adaptive graph convolutional network model.
Step S3: iterate steps S1 to S2 until the model converges or the maximum number of iterations is reached, and obtain the final protein node vector representations and clustering result of the last iteration.
Specifically, based on the preset maximum number of iterations, steps S1 to S2 are repeated until the model converges or the maximum number of iterations is reached; in the last iteration, the final protein node vector representations are obtained in step S1, and the final clustering result C = {C_1, C_2, ..., C_m}, forming m clusters, is obtained in step S2.
Step S4: based on the final protein node vector representations, compute the similarity of protein nodes with the cosine similarity formula and construct a weighted adjacency matrix.
Specifically, step S4 includes: based on the adjacency matrix A and the protein node vector representations, compute the similarity of protein nodes v_i and v_j with the following cosine similarity formula and construct the weighted adjacency matrix W:
w_ij = a_ij · (z_i · z_j) / (‖z_i‖ · ‖z_j‖),
where a_ij is an element of the adjacency matrix A and w_ij is an element of the weighted adjacency matrix W.
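Under the reading assumed above (cosine similarity masked by the adjacency matrix, since the source gives the formula only as an image), step S4 reduces to a few lines of NumPy:

```python
import numpy as np

def weighted_adjacency(A, Z, eps=1e-12):
    """Step S4 sketch: w_ij = a_ij * cos(z_i, z_j), keeping only edges of A."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    cos = (Z @ Z.T) / (norms * norms.T + eps)   # pairwise cosine similarities
    return A * cos                              # mask by existing interactions
```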
Step S5: screen basic protein complex structures out of the protein interaction network and expand them based on calculations over the weighted adjacency matrix, to obtain protein complexes.
Specifically, a clique mining method is applied together with the weighted adjacency matrix to screen basic protein complex structures out of the protein interaction network, and neighbor nodes meeting a predetermined condition are embedded into the basic structures, giving the first kind of protein functional module: protein complexes.
Step S5 of this embodiment specifically includes the following sub-steps.
Step S51: set up and initialize the sets Alternative_core, Complex_Seed_core, and Complex_set.
Step S52: apply the clique mining method to screen maximal clique structures Clique_q out of the protein interaction network and put them into the set Alternative_core, where q is the serial number of the maximal clique structure.
Step S53: based on the weighted adjacency matrix W, compute the density scores of all maximal cliques Clique_q in the set Alternative_core and sort them by density score in descending order. [The density-score formula appears as an image in the original publication.]
Step S54: remove the maximal clique with the highest density score from the set Alternative_core and put it into the set Complex_Seed_core as a basic protein complex structure.
For example, step S53 sorts the cliques by descending density score as Clique_1, Clique_2, Clique_3, ...; the maximal clique with the highest density score is Clique_1, so Clique_1 is removed from the set Alternative_core and put into the set Complex_Seed_core as a basic protein complex structure.
Step S55: traverse the remaining maximal clique structures of the set Alternative_core; if the protein nodes of another maximal clique overlap with those of the maximal clique currently having the highest density score: if fewer than 2 nodes overlap, delete the overlapping nodes from the other maximal clique and keep it only if more than 3 nodes remain; if 2 or more nodes overlap, do not delete the overlapping nodes.
For example, in the first round Clique_1 is put into the set Complex_Seed_core as a basic protein complex structure, and the remaining maximal clique structures of the set Alternative_core include Clique_2, Clique_3, Clique_4, .... Taking Clique_2 as an example, if the protein nodes of Clique_2 overlap with those of Clique_1, the maximal clique with the currently highest density score: when fewer than 2 nodes overlap, the overlapping nodes are deleted from Clique_2, and Clique_2 is kept in the set Alternative_core only if more than 3 of its nodes remain, otherwise it is removed from Alternative_core; when 2 or more nodes overlap, the overlapping nodes of Clique_2 are not deleted.
In reality, multiple protein complexes can share a common internal structure, so the probability of generating protein complexes that share a maximal clique should be increased. Through the above filtering of maximal cliques, with the maximal clique structures as the basic frameworks of protein complexes, the protein nodes shared between maximal cliques are retained as far as possible, so protein complexes that share a maximal clique can be reflected, which better matches the real situation. (A sketch of this seed-selection procedure is given below.)
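The seed selection of steps S52–S55 can be sketched with networkx, whose `find_cliques` enumerates maximal cliques (Bron–Kerbosch). The weighted density used for sorting (mean edge weight within the clique) and the size threshold are assumptions, since the patent gives the density-score formula only as an image; nodes are assumed to be integer indices into the weighted adjacency array W.

```python
import networkx as nx
import numpy as np

def complex_seeds(G_nx: nx.Graph, W: np.ndarray):
    """Sketch of S52-S55: mine maximal cliques, rank by an assumed weighted
    density, and filter overlaps against the current highest-density clique."""
    # S52: maximal cliques; the >3-node threshold follows the background remark.
    cliques = [frozenset(c) for c in nx.find_cliques(G_nx) if len(c) > 3]

    def density(c):                      # assumed score: mean edge weight
        nodes = sorted(c)
        pairs = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]]
        return sum(W[u, v] for u, v in pairs) / len(pairs)

    seeds, alternative = [], cliques
    while alternative:
        alternative.sort(key=density, reverse=True)   # S53
        best = alternative.pop(0)                     # S54: highest density
        seeds.append(best)
        kept = []
        for c in alternative:                         # S55: overlap filtering
            overlap = c & best
            if len(overlap) < 2:
                rest = c - overlap        # few shared nodes: drop them
                if len(rest) > 3:         # keep only if enough nodes remain
                    kept.append(rest)
            else:
                kept.append(c)            # shared core of >=2 nodes: keep as-is
        alternative = kept
    return seeds
```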
Step S56: repeat steps S53–S55 until the set Alternative_core is empty, obtaining several basic protein complex structures in the set Complex_Seed_core.
Step S57: for a maximal clique Clique_j in the set Complex_Seed_core and any neighbor node p_i of a protein node in Clique_j, compute the correlation score between p_i and Clique_j from the protein node similarities; if the correlation score exceeds the preset threshold θ_1, embed the protein node p_i into the maximal clique Clique_j. [The correlation-score formula appears as an image in the original publication.]
Step S58: after all neighbor protein nodes of the maximal clique Clique_j have been traversed, the node expansion of the basic structure is complete and one protein complex is determined; remove it from the set Complex_Seed_core and put it into the set Complex_set.
Step S59: repeat steps S57–S58 until the set Complex_Seed_core is empty, obtaining the finally mined protein complexes in the set Complex_set.
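Steps S57–S59 then grow each seed by absorbing qualifying neighbors. The correlation score below (mean weight of the edges from p_i to the clique members) is an assumed stand-in for the formula image in the source:

```python
def expand_complexes(G_nx, seeds, W, theta1=0.5):
    """Sketch of S57-S59: embed neighbor nodes whose correlation score with
    the seed clique exceeds theta1."""
    complexes = []
    for clique in seeds:
        members = set(clique)
        neighbors = {p for v in members for p in G_nx.neighbors(v)} - members
        for p in sorted(neighbors):
            score = sum(W[p, v] for v in members) / len(members)  # assumed score
            if score > theta1:            # S57: correlation check against theta1
                members.add(p)
        complexes.append(members)         # S58: one complex determined
    return complexes                      # S59: all seeds processed
```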
Step S6: screen basic protein signaling pathway structures out of each cluster of the clustering result and expand them based on calculations over the weighted adjacency matrix, to obtain protein signaling pathways.
Specifically, the shortest paths within each cluster of the clustering result C = {C_1, C_2, ..., C_m} are screened out as basic protein signaling pathway structures, correlations are computed from the weighted adjacency matrix, and neighbor nodes meeting a predetermined condition are embedded at the endpoints of the basic structures, giving the second kind of protein functional module: protein signaling pathways.
Step S6 of this embodiment specifically includes the following sub-steps.
Step S61: set up and initialize the sets Pathway_Seed_core and Pathway_set.
Step S62: based on the protein interaction network, traverse the m clusters, find the shortest paths between each pair of nodes within a cluster whose length does not exceed 3, and put all paths so screened into the set Pathway_Seed_core as basic protein signaling pathway structures.
Step S63: based on the weighted adjacency matrix W, compute the density scores of all shortest paths shortest_path_q in the set Pathway_Seed_core and sort them by density score in descending order. [The density-score formula appears as an image in the original publication.]
Step S64: take the shortest path shortest_path_j with the highest density in the set Pathway_Seed_core; for any neighbor node p_i at an end of shortest_path_j, compute the correlation score between the neighbor protein node and the shortest path from the protein node similarities; if the correlation score exceeds the preset threshold θ_2, embed the neighbor protein node p_i at the end of shortest_path_j. [The correlation-score formula appears as an image in the original publication.]
Step S65: after all neighbor protein nodes at the ends of the shortest path shortest_path_j have been traversed, the node expansion of the basic structure is complete and one protein signaling path is determined; remove it from the set Pathway_Seed_core and put it into the set Pathway_set.
Step S66: repeat steps S64–S65 until the set Pathway_Seed_core is empty, obtaining the finally mined protein signaling paths in the set Pathway_set.
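Steps S61–S66 can be sketched as follows. The same assumed correlation score is reused, networkx's `all_pairs_shortest_path` enumerates intra-cluster shortest paths up to 3 hops, and only the first qualifying neighbor is attached at each endpoint in this simplified sketch.

```python
import networkx as nx

def mine_pathways(G_nx, clusters, W, theta2=0.5, max_hops=3):
    """Sketch of S61-S66: intra-cluster shortest paths of at most 3 hops are
    the pathway seeds; each endpoint is extended by the first neighbor whose
    assumed correlation score (mean weight to the path nodes) exceeds theta2."""
    def score(p, path):
        return sum(W[p, q] for q in path) / len(path)

    pathways = []
    for cluster in clusters:
        sub = G_nx.subgraph(cluster)
        # S62: pairwise shortest paths of length <= 3 within the cluster
        # (each path appears once per direction; deduplication is omitted here).
        seeds = [path
                 for _, targets in nx.all_pairs_shortest_path(sub, cutoff=max_hops)
                 for path in targets.values() if len(path) > 1]
        for path in seeds:                    # S64: try to extend both endpoints
            for end in (0, -1):
                for p in G_nx.neighbors(path[end]):
                    if p not in path and score(p, path) > theta2:
                        path.insert(0 if end == 0 else len(path), p)
                        break
            pathways.append(path)             # S65: one pathway determined
    return pathways                           # S66: all seeds processed
```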
The method for mining protein functional modules provided by the above embodiment is based on the node-level adaptive graph convolutional network (NASGC): through an adaptive mechanism, each protein node learns high-order and low-order neighbor information separately, and the learned protein node vector representations fuse high-order and low-order structural information on top of the gene ontology attribute features of the protein nodes, yielding more generalizable protein node representations. Both protein complexes and protein signaling pathways can thus be mined from the protein interaction network, the mined protein complexes better match the real situation, and the accuracy of protein function identification is improved.
Based on the method for mining protein functional modules provided by the above embodiments, an embodiment of the present invention further provides a computer device. As shown in Fig. 3, the computer device includes a processor 10, a memory 20, an input device 30, and an output device 40. The processor 10 is provided with a GPU, and there may be one or more processors 10; one processor 10 is taken as an example in Fig. 3. The processor 10, the memory 20, the input device 30, and the output device 40 of the computer device can be connected by a bus or in other ways.
The memory 20, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules. The processor 10 runs the software programs, instructions, and modules stored in the memory 20 to execute the various functional applications and data processing of the device, that is, to realize the steps of the method for mining protein functional modules described in the foregoing embodiments of the present invention. The input device 30 is used to receive image data and input numbers or character information, and to generate key signal inputs related to the user settings and function control of the device. The output device 40 may include a display device such as a display screen, for example for displaying images.
Based on the method for mining protein functional modules provided by the above embodiments, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the method for mining protein functional modules in the foregoing embodiments of the present invention. The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to magnetic memory, optical memory, and semiconductor memory.
It should be pointed out that the above embodiments merely illustrate the technical concept and features of the present invention; their purpose is to enable those familiar with the art to understand the content of the present invention and implement it accordingly, and they should not be used to limit the scope of protection of the present invention. Any equivalent change or modification made according to the spirit and essence of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A method for mining protein functional modules, characterized by comprising the steps of:
    S1. inputting a protein interaction network into a node-level adaptive graph convolutional network model for learning and training, so that each protein node learns high-order and low-order neighbor information, and obtaining protein node vector representations;
    S2. based on the protein node vector representations, clustering with the K-means algorithm to obtain soft labels of the clustering result for the protein nodes, setting a loss function according to the soft labels of the clustering result, performing backpropagation, and updating the network parameters of the model;
    S3. iterating steps S1 to S2 until the model converges or a maximum number of iterations is reached, and obtaining the final protein node vector representations and clustering result of the last iteration;
    S4. based on the final protein node vector representations, computing the similarity of protein nodes with a cosine similarity formula and constructing a weighted adjacency matrix;
    S5. screening basic protein complex structures out of the protein interaction network and expanding them based on calculations over the weighted adjacency matrix, to obtain protein complexes;
    S6. screening basic protein signaling pathway structures out of each cluster of the clustering result and expanding them based on calculations over the weighted adjacency matrix, to obtain protein signaling pathways.
  2. The method for mining protein functional modules according to claim 1, characterized in that step S1 comprises:
    S11. obtaining the protein interaction network and constructing the corresponding adjacency matrix A and gene ontology attribute feature matrix X, wherein the nodes of the protein interaction network are denoted v = {v_1, v_2, ..., v_n}, the gene ontology attribute feature matrix is X = {x_1, x_2, ..., x_n}^T, n is the total number of protein nodes, each x_i has dimension d, n and d are positive integers, and i = 1..n;
    S12. computing the normalized Laplacian matrix Ls = I − D^(−1/2) A D^(−1/2) from the adjacency matrix A and constructing the low-pass filter
    G = U(I − Λ/2)U^T,
    where D = diag(d_1, d_2, ..., d_n) is the degree matrix of the adjacency matrix A, d_i is the number of edges of node v_i, Λ is the diagonal matrix of the eigenvalues of Ls, I is the diagonal matrix whose diagonal entries are all 1, and U is the matrix of eigenvectors of Ls;
    S13. setting the convolution layer index k = t and looping t from 0 to M over the following steps S14 to S18, M being the maximum number of convolution layers and a positive integer;
    S14. applying the low-pass graph filter G to the gene ontology attribute feature matrix X to perform the k-th convolution, giving the graph representation of the current convolution layer of the protein interaction network:
    G^k X = G · (G^(k−1) X);
    S15. based on the graph representation G^k X of the current convolution layer, computing the state value of each protein node in the protein interaction network:
    h_i^(k) = S([G^k X]_i, h_i^(k−1)),
    where [G^k X]_i is the graph representation of node v_i at the k-th convolution, h_i^(k−1) is the state value of node v_i at the (k−1)-th convolution, h_i^(k) is the state value of node v_i at the k-th convolution, and S() is a nonlinear transformation function;
    S16. computing from the state values the probability that each node stops the iterative convolution:
    p_i^(k) = σ(W_h · h_i^(k) + b_h),
    where W_h and b_h are the learnable network parameters of the node-level adaptive graph convolutional network model, σ is the activation function, and p_i^(k) is the probability value of node v_i at the k-th convolution;
    S17. setting a probability threshold ε and, for each node v_i, computing the cumulative probability of its first k convolution layers and comparing it with ε:
    if the cumulative probability reaches ε, stopping the iterative convolution of node v_i and recording its number of convolution layers as N_i = k′;
    if the layer index k reaches M while the cumulative probability is still below ε, setting the number of convolution layers of node v_i to N_i = M;
    the depth N_i at which node v_i stops the iterative convolution being
    N_i = min{ k′ ≤ M : Σ_{k=1..k′} p_i^(k) ≥ ε }, with N_i = M if no such k′ exists;
    S18. computing the probability value of the last convolution layer of each node v_i:
    p_i^(N_i) = 1 − Σ_{k=1..N_i−1} p_i^(k);
    S19. for each node v_i, linearly combining the graph representations of its first N_i convolution layers with the probability values to obtain the protein node vector representation:
    z_i = Σ_{k=1..N_i} p_i^(k) · [G^k X]_i.
  3. The method for mining protein functional modules according to claim 1 or 2, characterized in that step S2 comprises:
    S21. setting the number m of clusters, m being a positive integer;
    S22. based on the protein node vector representations, clustering with the K-means algorithm to obtain the soft labels of the clustering result for the protein nodes;
    S23. according to the soft labels of the clustering result, setting the loss function L as
    L = λ_tig · L_tig + λ_sep · (1 / L_sep),
    where λ_tig is the first loss coefficient, λ_sep is the second loss coefficient, λ_tig and λ_sep are both constants, L_tig characterizes the similarity between nodes within a cluster, and L_sep characterizes the similarity between nodes of different clusters [the formulas for L_tig and L_sep appear as images in the original publication];
    S24. performing backpropagation according to the loss function and updating the network parameters of the node-level adaptive graph convolutional network model;
    step S3 comprising: based on a preset maximum number of iterations, repeating steps S1 to S2 until the model converges or the maximum number of iterations is reached, the final protein node vector representations being obtained in step S1 of the last iteration, and the final clustering result C = {C_1, C_2, ..., C_m}, forming m clusters, being obtained in step S2.
  4. The method for mining protein functional modules according to claim 3, characterized in that the number m of clusters is set as follows:
    based on the protein interaction network, setting m = r and clustering with the K-means algorithm to obtain C = {C_1, C_2, ..., C_r}, where r takes values from 2 to R and R is a positive integer;
    for each candidate value of m, computing the sum of squared errors (SSE) of the distances from each node to its cluster center with the elbow method:
    SSE = Σ_{i=1..m} Σ_{p∈C_i} ‖p − center_i‖²,
    where p ranges over the nodes of cluster C_i and center_i is the center point of cluster C_i;
    fitting a curve of the computed SSE values against the candidate values of m, locating on the curve the inflection point where the decline of the SSE value turns from fast to slow, and selecting the m value at that point as the final number of clusters.
  5. The method for mining protein functional modules according to claim 3, characterized in that, in the loss function L, the ratio λ_tig : λ_sep takes a value such as from 1:3 to 1:50 [the exact expression appears as an image in the original publication].
  6. The method for mining protein functional modules according to claim 3, characterized in that step S4 comprises: based on the adjacency matrix A and the protein node vector representations, computing the similarity of protein nodes v_i and v_j with the following cosine similarity formula and constructing the weighted adjacency matrix W:
    w_ij = a_ij · (z_i · z_j) / (‖z_i‖ · ‖z_j‖),
    where a_ij is an element of the adjacency matrix A and w_ij is an element of the weighted adjacency matrix W.
  7. The method for mining protein functional modules according to claim 6, characterized in that step S5 comprises:
    S51. setting up and initializing the sets Alternative_core, Complex_Seed_core, and Complex_set;
    S52. applying a clique mining method to screen maximal clique structures Clique_q out of the protein interaction network and putting them into the set Alternative_core;
    S53. based on the weighted adjacency matrix W, computing the density scores of all maximal cliques Clique_q in the set Alternative_core and sorting them by density score in descending order [the density-score formula appears as an image in the original publication];
    S54. removing the maximal clique with the highest density score from the set Alternative_core and putting it into the set Complex_Seed_core as a basic protein complex structure;
    S55. traversing the remaining maximal clique structures of the set Alternative_core and, when the protein nodes of another maximal clique overlap with those of the maximal clique currently having the highest density score: if fewer than 2 nodes overlap, deleting the overlapping nodes from the other maximal clique and keeping it only if more than 3 nodes remain; if 2 or more nodes overlap, not deleting the overlapping nodes;
    S56. repeating steps S53–S55 until the set Alternative_core is empty, obtaining several basic protein complex structures in the set Complex_Seed_core;
    S57. for a maximal clique Clique_j in the set Complex_Seed_core and any neighbor node p_i of a protein node in Clique_j, computing the correlation score between the neighbor node p_i and the maximal clique Clique_j from the protein node similarities and, if the correlation score exceeds a preset threshold θ_1, embedding the neighbor protein node p_i into the maximal clique Clique_j [the correlation-score formula appears as an image in the original publication];
    S58. after all neighbor protein nodes of the maximal clique Clique_j have been traversed, determining one protein complex, removing it from the set Complex_Seed_core, and putting it into the set Complex_set;
    S59. repeating steps S57–S58 until the set Complex_Seed_core is empty, obtaining the finally mined protein complexes in the set Complex_set.
  8. The method for mining protein functional modules according to claim 6, characterized in that step S6 comprises:
    S61. setting up and initializing the sets Pathway_Seed_core and Pathway_set;
    S62. based on the protein interaction network, traversing the m clusters, finding the shortest paths between each pair of nodes within a cluster whose length does not exceed 3, and putting all paths so screened into the set Pathway_Seed_core as basic protein signaling pathway structures;
    S63. based on the weighted adjacency matrix W, computing the density scores of all shortest paths shortest_path_q in the set Pathway_Seed_core and sorting them by density score in descending order [the density-score formula appears as an image in the original publication];
    S64. taking the shortest path shortest_path_j with the highest density in the set Pathway_Seed_core and, for any neighbor node p_i at an end of shortest_path_j, computing the correlation score between the neighbor protein node and the shortest path from the protein node similarities; if the correlation score exceeds a preset threshold θ_2, embedding the neighbor protein node p_i at the end of shortest_path_j [the correlation-score formula appears as an image in the original publication];
    S65. after all neighbor protein nodes at the ends of the shortest path shortest_path_j have been traversed, determining one protein signaling path, removing it from the set Pathway_Seed_core, and putting it into the set Pathway_set;
    S66. repeating steps S64–S65 until the set Pathway_Seed_core is empty, obtaining the finally mined protein signaling paths in the set Pathway_set.
  9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method for mining protein functional modules according to any one of claims 1 to 8.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method for mining protein functional modules according to any one of claims 1 to 8.
PCT/CN2022/140090 2021-12-29 2022-12-19 Method for mining protein functional modules, computer device and storage medium WO2023125114A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111649446.3A CN116417060A (zh) 2021-12-29 2021-12-29 Method for mining protein functional modules, computer device and storage medium
CN202111649446.3 2021-12-29

Publications (2)

Publication Number Publication Date
WO2023125114A1 (zh)
WO2023125114A9 (zh)

Family

ID=86997714

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140090 WO2023125114A1 (zh) Method for mining protein functional modules, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN116417060A (zh)
WO (1) WO2023125114A1 (zh)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778349A (zh) * 2014-01-29 2014-05-07 思博奥科生物信息科技(北京)有限公司 Method for analyzing biomolecular networks based on functional modules
CN103923216A (zh) * 2014-04-22 2014-07-16 吉林大学 Multifunctional-module fusion protein and its use in improving the oral bioavailability of protein drugs
US20180357363A1 (en) * 2015-11-10 2018-12-13 Ofek - Eshkolot Research And Development Ltd Protein design method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ma, Ye; Sun, Hongmei: "Identification of Functional Modules from Protein-Protein Interaction Network", Bulletin of Science and Technology, no. 8, 15 August 2012 (2012-08-15), XP009547440 *
Zhu, Rongxiang; Ji, Chaojie; Wang, Yingying; Cai, Yunpeng; Wu, Hongyan: "Heterogeneous Graph Convolutional Networks and Matrix Completion for miRNA-Disease Association Prediction", Frontiers in Bioengineering and Biotechnology, vol. 8, XP093075866, DOI: 10.3389/fbioe.2020.00901 *

Also Published As

Publication number Publication date
WO2023125114A9 (zh) 2023-08-17
CN116417060A (zh) 2023-07-11

Similar Documents

Publication Publication Date Title
Casale et al. Probabilistic neural architecture search
Fischer et al. Training restricted Boltzmann machines: An introduction
CN110263227B (zh) 基于图神经网络的团伙发现方法和系统
US9317540B2 (en) Method, system and aggregation engine for providing structural representations of physical entities
Liu et al. A survey on computationally efficient neural architecture search
CN112087447A (zh) 面向稀有攻击的网络入侵检测方法
CN113987236B (zh) 基于图卷积网络的视觉检索模型的无监督训练方法和装置
CN112580728B (zh) 一种基于强化学习的动态链路预测模型鲁棒性增强方法
García-Pérez et al. Precision as a measure of predictability of missing links in real networks
WO2023124342A1 (zh) 一种针对图像分类的神经网络结构低成本自动搜索方法
CN115293919A (zh) 面向社交网络分布外泛化的图神经网络预测方法及系统
WO2021058096A1 (en) Node disambiguation
CN114463596A (zh) 一种超图神经网络的小样本图像识别方法、装置及设备
CN116090504A (zh) 图神经网络模型训练方法及装置、分类方法、计算设备
US10956129B1 (en) Using genetic programming to create generic building blocks
WO2023125114A1 (zh) 蛋白质功能模块的挖掘方法、计算机设备和存储介质
Luo et al. Sampling-based adaptive bounding evolutionary algorithm for continuous optimization problems
CN108898227A (zh) 学习率计算方法及装置、分类模型计算方法及装置
Qiao et al. Genetic feature fusion for object skeleton detection
Liu et al. hpGAT: High-order proximity informed graph attention network
Nagy Data-driven analysis of fractality and other characteristics of complex networks
CN112949590A (zh) 一种跨域行人重识别模型构建方法及构建系统
Ellouze Social Network Community Detection by Combining Self‐Organizing Maps and Genetic Algorithms
Stegehuis et al. Efficient inference in stochastic block models with vertex labels
Tang et al. Dynamic Network Embedding by Using Sparse Deep Autoencoder

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22914368

Country of ref document: EP

Kind code of ref document: A1