CN111370060A - Protein interaction network co-location co-expression complex recognition system and method - Google Patents

Info

Publication number
CN111370060A
Authority
CN
China
Prior art keywords
protein
data
complex
expression
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010204246.6A
Other languages
Chinese (zh)
Inventor
张锦雄
钟诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University
Original Assignee
Guangxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University
Priority to CN202010204246.6A
Publication of CN111370060A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Public Health (AREA)
  • Molecular Biology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Genetics & Genomics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)

Abstract

The invention belongs to the technical field of protein complex identification and discloses a system and method for identifying co-localized, co-expressed complexes in a protein interaction network. The system comprises a data extraction module, a matrix data generation module, an identification and evaluation module, a core mining module, an attachment adding module and a complex screening module. The protein complex identification method comprises: organizing protein localization data, gene expression data, protein interaction data and protein GO similarity data as matrices; and using a seed expansion strategy to identify co-localized, co-expressed protein complexes based on the core-attachment structure. By discovering protein complexes in the protein interaction network, the invention helps in understanding the topology of the protein network and the biological significance of the complexes, and plays an important role in predicting the functions of unknown proteins and designing disease-targeted drugs.

Description

Protein interaction network co-location co-expression complex recognition system and method
Technical Field
The invention belongs to the technical field of protein complex identification, and particularly relates to a protein interaction network co-localization co-expression complex identification system and method.
Background
With the advent of the post-genomic era, the proteome has become another important subject of research. In cells, proteins rarely work alone; they must bind to and interact with other proteins to perform their biological functions. Protein-protein interaction (PPI) is essential to all vital activities and is the basis of all metabolic activity in cells. Revealing and mapping the network of interactions among proteins has therefore become a hot spot in proteomics research and remains a difficult problem of the post-genomic era. Among the various biological networks, protein-protein interaction networks (PPINs) are the basis of cellular function and control a large number of life processes. Dysregulation caused by abnormal perturbation of protein-protein interactions is a major cause of many diseases, so PPINs are becoming a major tool for revealing disease mechanisms at the molecular level.
Proteins are the products of gene expression; they carry out the physiological functions of organisms and are the direct manifestation of life phenomena. Proteomics is the systematic study of the properties of proteins and can provide detailed descriptions of the structure, function and regulation of biological systems in healthy and diseased states. Almost all biological processes are accomplished through a series of protein interactions. From the perspective of systems biology, using protein interaction networks to study and analyze biological functions has important prospects and practical value. A protein complex is a set of proteins that interact at the same time and place to form a multi-molecular machine, which is the primary form in which proteins perform their functions. Identifying protein complexes not only facilitates the understanding of complex life activities, but also provides a valuable theoretical reference for discovering the mechanisms of complex diseases and designing targeted drugs.
Currently, methods for mining protein complexes can be roughly classified into three types. The first type is identification based on traditional graph theory, for example the RNSC algorithm (partition-based clustering), the MCODE algorithm (density-based clustering) and the GN algorithm (hierarchical clustering). These methods save some time cost, but their sensitivity to cluster centers, data, parameters and the like affects the overall efficiency of the algorithm to some extent. The second type is identification based on multi-omics data fusion, which generally integrates biological information data into an existing protein network to improve the accuracy and reliability of the network and thereby mitigate the false positives and false negatives in interaction data; however, inherent limitations make it difficult to meet the performance requirements of the algorithm. The third type is identification based on intelligent optimization, which simulates the collective behaviors of organisms in nature and uses cooperation among individuals to search for a near-optimal solution; examples include ant colony optimization and particle swarm optimization, which show good performance.
Meanwhile, constructing a protein interaction network (PPIN) from existing protein interaction data (PPI data) and finding meaningful substructures in it, such as protein complexes, functional modules and motifs, has become a hot spot of research at home and abroad. To make these substructures easier to find, it is common practice to represent the protein interaction network as a graph, with proteins as vertices and interactions between proteins as edges, and then to mine biologically meaningful substructures, such as protein complexes, using various algorithms.
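The graph representation described above can be illustrated with a minimal Python sketch. The function name `build_ppi_graph` and the toy interaction list are illustrative assumptions and are not part of the patent or of ICJointLE.

```python
from collections import defaultdict


def build_ppi_graph(pairs):
    """Return an undirected adjacency map: protein -> set of interaction partners."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


# Toy interactions written with yeast systematic names (illustrative only).
pairs = [
    ("YKL171W", "YML096W"),
    ("YML096W", "YFL017W-A"),
    ("YKL171W", "YML096W"),  # duplicate edges collapse because partners are sets
]
graph = build_ppi_graph(pairs)
print(sorted(graph["YML096W"]))
```

Storing partners in sets keeps the graph simple and undirected, which is the form most complex-mining algorithms assume.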
In summary, the problems of the prior art are as follows:
(1) Existing identification methods based on traditional graph theory are sensitive to cluster centers, data, parameters and the like, which affects the overall efficiency of the algorithm to some extent and results in low accuracy.
(2) Existing identification methods based on multi-omics data fusion have inherent limitations that make it difficult to meet the performance requirements of the algorithm, and their accuracy is low.
(3) Existing identification methods based on intelligent optimization are time-consuming and laborious, converge slowly, search inefficiently and easily fall into local optima.
The difficulty of solving the technical problems is as follows:
(1) existing identification methods based on traditional graph theory can hardly identify protein complexes accurately, so the algorithm needs to be redesigned around the co-localization and co-expression attributes of protein complexes;
(2) most existing identification methods based on multi-omics biological data fusion use only two types of biological data; using more types means that there are many possible ways to fuse the data, and the best fusion mode must be selected;
(3) NP-hard problems cannot be solved by exhaustive search, and falling into local optima is a problem that existing identification methods based on intelligent optimization cannot avoid; combining seed expansion with a greedy strategy can effectively improve search efficiency.
The significance of solving the technical problems is as follows:
(1) Co-localization and co-expression are fundamental attributes of protein complex assembly, and redesigning the algorithm around them is a prerequisite for accurately identifying protein complexes.
(2) Fusing more biological data into the algorithm makes the identified protein complexes more biologically meaningful.
(3) Combining seed expansion with a greedy strategy makes efficient and accurate identification of protein complexes feasible.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a protein interaction network co-localization co-expression complex recognition system and a protein interaction network co-localization co-expression complex recognition method.
The invention is realized in such a way that a protein interaction network co-localization co-expression complex recognition method comprises the following steps:
step one, a matrix data preparation stage: extracting protein localization data, gene expression data, protein interaction data and protein GO annotation data;
step two, analyzing and calculating to sequentially generate an interaction matrix with reliability scores among the proteins, a protein localization matrix, a gene expression matrix, a CC-based protein similarity matrix, an MF-based protein similarity matrix and a BP-based protein similarity matrix;
step three, identifying protein complexes under tuned parameter settings by the core algorithm ICJointLE:
(1) protein complex core mining stage: mining densely and reliably connected, jointly co-localized and jointly co-expressed protein cores by applying a seed expansion strategy according to the core-attachment structure;
(2) protein complex attachment addition stage: adding strongly and reliably connected, jointly co-localized and jointly co-expressed protein attachments;
(3) overlapping protein complex screening stage: deleting overlapping complexes with low reliable connection density;
step four, evaluating the quality of the identified complexes with CYC2008 as the reference.
Further, the protein interaction network co-localization co-expression complex identification method uses Saccharomyces cerevisiae (yeast) data sets.
Further, CYC2008, as a set of known complexes, comprises 408 manually curated heteromeric protein complexes; the gene expression data GSE3431 contains not only gene expression data for 3 consecutive metabolic cycles, but also GO term annotations in 3 categories for the expressed genes.
Further, ICJointLE identifies complexes in CYC2008 that contain proteins lacking localization data as follows: some proteins in the CYC2008 and PPI datasets have no protein localization data, and when the joint co-localization count is calculated for a group of proteins that contains such proteins, the localization vector of each protein lacking localization data is set to all 1's.
Another object of the present invention is to provide a protein-interacting network co-localized co-expression complex recognition system for implementing the protein-interacting network co-localized co-expression complex recognition method, the protein-interacting network co-localized co-expression complex recognition system comprising:
the data extraction module is used for extracting protein positioning data, gene expression data, protein interaction data and protein GO labeling data;
the matrix data generation module is used for sequentially generating an interaction matrix with reliability scores among proteins, a protein positioning matrix, a gene expression matrix, a CC-based protein similarity matrix, an MF-based protein similarity matrix and a BP-based protein similarity matrix;
the identification and evaluation module is used for identifying protein complexes under tuned parameter settings by the core algorithm ICJointLE, and then evaluating the quality of the identified complexes with CYC2008 as the reference;
the core mining module is used for mining the densely and reliably connected joint co-localization joint co-expression protein core;
an attachment addition module for adding a strongly reliably linked joint co-localized joint co-expressed protein attachment;
and the complex screening module is used for deleting the overlapped complexes with low reliable connection density.
The invention also aims to provide an information data processing terminal for realizing the protein interaction network co-localization co-expression complex identification method.
It is another object of the present invention to provide a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the protein interaction network co-localized co-expression complex identification method.
In summary, the advantages and positive effects of the invention are as follows: the invention identifies co-localized, co-expressed protein complexes in a protein interaction network by means of the software suite ICJointLE (Identifying protein Complexes with the features of Joint co-Localization and joint co-Expression) V1.0. By discovering protein complexes in the protein interaction network (PPIN), the invention helps in understanding both the topology of the protein network and the biological significance of the complexes, and plays an important role in predicting the functions of unknown proteins and human disease-causing genes.
Drawings
FIG. 1 is a schematic diagram of a protein interaction network co-localization co-expression complex recognition system provided in an embodiment of the present invention;
In the figure: 1. data extraction module; 2. matrix data generation module; 3. complex identification module; 4. core mining module; 5. attachment adding module; 6. complex screening module; 7. complex evaluation module.
FIGS. 2 and 3 are flow charts of methods for identifying co-localized co-expression complexes of protein interaction networks according to embodiments of the present invention.
Fig. 4 is a schematic diagram of the folder in which ICJointLE V1.0 is initially installed and its structure, according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of preparing a data set folder according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of creating a STRING folder according to an embodiment of the present invention.
Fig. 7 is a diagram of a PPI file for preparing a STRING data set according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of a matrix data file generation process of the STRING data set according to the embodiment of the present invention.
Fig. 9 is a schematic diagram of a matrix data file of a STRING data set according to an embodiment of the present invention.
FIG. 10 is a schematic diagram of a process for identifying and evaluating complexes in a STRING interaction network, according to an embodiment of the present invention.
FIG. 11 is a schematic diagram of the complexes identified in the STRING interaction network and their evaluation provided by embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Aiming at the problems in the prior art, the invention provides a protein interaction network co-localization co-expression complex recognition system and a protein interaction network co-localization co-expression complex recognition method, and the invention is described in detail below with reference to the attached drawings.
As shown in fig. 1, the system for identifying co-localized co-expression complexes of protein interaction networks provided in the embodiments of the present invention includes: a data extraction module 1, a matrix data generation module 2, an identification and evaluation module 3, a core mining module 4, an attachment adding module 5 and a complex screening module 6.
The data extraction module 1 is used for extracting protein positioning data, gene expression data, protein interaction data and protein GO labeling data;
the matrix data generation module 2 is used for sequentially generating an interaction matrix with reliability scores among proteins, a protein positioning matrix, a gene expression matrix, a CC-based protein similarity matrix, an MF-based protein similarity matrix and a BP-based protein similarity matrix;
the identification and evaluation module 3 is used for identifying protein complexes under tuned parameter settings by the core algorithm ICJointLE, and then evaluating the quality of the identified complexes with CYC2008 as the reference;
the core mining module 4 is used for mining a densely and reliably connected combined co-localization combined co-expression protein core;
an attachment adding module 5 for adding a strongly reliable linked joint co-localized joint co-expressed protein attachment;
and a complex screening module 6 for deleting overlapping complexes with low reliable connection density.
The data adopted by the system provided by the embodiment of the invention is a saccharomyces cerevisiae (yeast) related data set.
CYC2008, used in the embodiments of the invention, is a collection of known complexes that includes 408 manually curated heteromeric protein complexes. The gene expression data GSE3431 contains not only gene expression data for 3 consecutive metabolic cycles, but also GO term annotations in 3 categories for the expressed genes.
In the embodiments of the invention, ICJointLE identifies complexes in CYC2008 that contain proteins lacking localization data as follows:
some proteins in the CYC2008 and PPI datasets have no protein localization data. To accurately identify as many protein complexes in CYC2008 as possible, the localization vector of each protein lacking localization data is set to all "1"s when the joint co-localization count is calculated for a group of proteins that contains such proteins.
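The all-ones rule can be made concrete with a small Python sketch. It assumes, as an illustration only, that each localization vector is a binary list over a fixed set of compartments and that the joint co-localization count is the number of compartments shared by every protein in the group; the patent text does not spell out the exact formula, and the names below are hypothetical.

```python
def joint_colocalization_count(proteins, localization, n_compartments):
    """Count the compartments in which every protein of the group may be located.

    Assumption: the count is the number of 1s in the element-wise AND of the
    binary localization vectors. Proteins without localization data get an
    all-ones vector, so they never veto a shared compartment.
    """
    joint = [1] * n_compartments
    for protein in proteins:
        vector = localization.get(protein, [1] * n_compartments)
        joint = [j & v for j, v in zip(joint, vector)]
    return sum(joint)


# Hypothetical data: four compartments, protein "P3" has no localization record.
localization = {
    "P1": [1, 1, 0, 0],
    "P2": [0, 1, 1, 0],
}
print(joint_colocalization_count(["P1", "P2"], localization, 4))
print(joint_colocalization_count(["P1", "P2", "P3"], localization, 4))
```

Under this assumption, adding a protein with unknown localization leaves the count of the group unchanged, which is why such complexes can still be identified.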
As shown in fig. 2, the method for identifying a co-localized co-expression complex of a protein interaction network provided in the embodiment of the present invention includes the following steps:
S101: matrix data preparation stage: extracting protein localization data, gene expression data, protein interaction data and protein GO annotation data.
S102: analyzing and calculating to sequentially generate an interaction matrix with reliability scores among the proteins, a protein localization matrix, a gene expression matrix, a CC-based protein similarity matrix, an MF-based protein similarity matrix and a BP-based protein similarity matrix.
S103: identifying protein complexes under tuned parameter settings by the core algorithm ICJointLE.
S103-1: protein complex core mining stage: mining densely and reliably connected, jointly co-localized and jointly co-expressed protein cores by applying a seed expansion strategy according to the core-attachment structure.
S103-2: protein complex attachment addition stage: adding strongly and reliably connected, jointly co-localized and jointly co-expressed protein attachments.
S103-3: overlapping protein complex screening stage: deleting overlapping complexes with low reliable connection density.
S104: evaluation of protein complexes: evaluating the quality of the identified complexes with CYC2008 as the reference.
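The core mining and attachment steps (S103-1 and S103-2) can be sketched as a greedy seed expansion over reliability-weighted edges. This is a minimal illustration, not the ICJointLE algorithm: the density definition, the thresholds `min_density` and `min_link`, and all function names are assumptions, and the co-localization and co-expression checks are omitted for brevity.

```python
def adjacency(weights):
    """Build protein -> neighbours from a dict {frozenset({a, b}): reliability}."""
    adj = {}
    for edge in weights:
        a, b = tuple(edge)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def density(nodes, weights):
    """Reliability-weighted density: sum of edge weights over all possible pairs."""
    members = sorted(nodes)
    n = len(members)
    if n < 2:
        return 0.0
    total = sum(
        weights.get(frozenset((members[i], members[j])), 0.0)
        for i in range(n) for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


def grow_core(seed, weights, adj, min_density):
    """Greedily add the neighbour that keeps the weighted density highest."""
    core = {seed}
    while True:
        frontier = set().union(*(adj[p] for p in core)) - core
        best, best_d = None, 0.0
        for cand in sorted(frontier):
            d = density(core | {cand}, weights)
            if d >= min_density and (best is None or d > best_d):
                best, best_d = cand, d
        if best is None:
            return core
        core.add(best)


def add_attachments(core, weights, adj, min_link):
    """Attach neighbours whose mean link reliability to the core is high enough."""
    frontier = set().union(*(adj[p] for p in core)) - core
    attached = {
        c for c in frontier
        if sum(weights.get(frozenset((c, p)), 0.0) for p in core) / len(core) >= min_link
    }
    return core | attached


# Toy weighted network (hypothetical reliability scores in [0, 1]).
weights = {
    frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.9,
    frozenset(("B", "C")): 0.8, frozenset(("C", "D")): 0.6,
    frozenset(("D", "E")): 0.2,
}
adj = adjacency(weights)
core = grow_core("A", weights, adj, min_density=0.7)
candidate = add_attachments(core, weights, adj, min_link=0.15)
print(sorted(core), sorted(candidate))
```

In this toy network the dense triangle forms the core and the more loosely linked protein joins as an attachment, which mirrors the core-attachment idea in the steps above.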
The technical solution of the present invention is further described below.
The invention points out that a group of proteins must interact with each other at the same time and place to form a complex. In other words, the proteins in a complex are largely co-localized and co-expressed, and appear as densely connected regions in the static PPI network (SPPIN). The software suite ICJointLE V1.0 mines clusters of proteins that are co-localized, co-expressed, densely and reliably connected and similar in biological function from the SPPIN according to the core-attachment structure, and thereby generates protein complexes. To this end, ICJointLE V1.0 implements a co-localization criterion for a group of proteins based on their joint localization vector; it then calculates the co-expression level of a group of proteins from their joint gene expression pattern; in addition, it combines the similarities of several Gene Ontology (GO) features of proteins into a criterion for protein functional similarity, to ensure that the identified protein complexes are consistent in biological function.
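One way to quantify the co-expression level of a group of proteins is the mean pairwise Pearson correlation of their expression profiles. The sketch below uses that measure purely as an assumption; the patent does not state which statistic ICJointLE computes from the joint gene expression pattern, and the gene names and profiles are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sx * sy)


def mean_coexpression(proteins, expression):
    """Average Pearson correlation over all protein pairs in the group."""
    pairs = list(combinations(proteins, 2))
    if not pairs:
        return 0.0
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


# Hypothetical expression profiles over four time points.
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [4.0, 3.0, 2.0, 1.0],
}
print(mean_coexpression(["G1", "G2"], expression))
print(mean_coexpression(["G1", "G2", "G3"], expression))
```

A group whose score falls below a chosen cutoff would fail the co-expression criterion; the cutoff itself would be one of the tunable parameters.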
1. Overview of software
1.1 principle
The software suite ICJointLE V1.0 organizes protein localization data, gene expression data, protein interaction data and protein GO similarity data as matrices, and then identifies jointly co-localized and co-expressed protein complexes with a seed expansion strategy in three steps (protein core mining, attachment protein addition and candidate protein complex screening) based on the core-attachment structure.
The software suite ICJointLE V1.0 works in two stages. The first is the matrix data preparation stage, which generates in turn an interaction matrix with reliability scores between proteins, a protein localization matrix, a gene expression matrix, a CC-based protein similarity matrix, an MF-based protein similarity matrix and a BP-based protein similarity matrix. The second is the protein complex identification stage. In this stage, a seed expansion strategy is applied according to the core-attachment structure: first, densely and reliably connected, jointly co-localized and jointly co-expressed protein cores are mined; then strongly and reliably connected, jointly co-localized and jointly co-expressed protein attachments are added; and finally overlapping complexes with low reliable connection density are deleted.
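The final screening step can be sketched as follows. The overlap measure (shared members divided by the size of the smaller complex), the `max_overlap` threshold and the function names are assumptions made for illustration; the patent only states that overlapping complexes with low reliable connection density are deleted.

```python
def overlap(a, b):
    """Fraction of the smaller complex that is shared with the other one."""
    return len(a & b) / min(len(a), len(b))


def screen_overlapping(candidates, max_overlap):
    """Keep denser complexes first; drop any that overlap a kept one too much.

    candidates: list of (set_of_proteins, density) tuples.
    """
    kept = []
    for members, _density in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(overlap(members, other) <= max_overlap for other in kept):
            kept.append(members)
    return kept


# Hypothetical candidate complexes with their reliable connection densities.
candidates = [
    ({"A", "B", "D"}, 0.5),
    ({"A", "B", "C"}, 0.9),
    ({"E", "F"}, 0.7),
]
kept = screen_overlapping(candidates, max_overlap=0.5)
print([sorted(c) for c in kept])
```

Processing candidates in descending density order means that, of two heavily overlapping candidates, the one with the lower density is the one removed.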
1.2 scheme
The operation flow of the present system is shown in fig. 3.
2. Operating environment
The experimental environment is shown in table 1.
TABLE 1 Experimental environment (the table is an image in the original publication and is not reproduced in this text)
3. Instructions for use
3.1 software suite deployment
The software suite ICJointLE V1.0 is composed of a set of program modules running under a console and a plurality of related public data sets, and a user can deploy the software suite ICJointLE V1.0 into a designated folder.
3.1.1 software suite Structure and specific files
Under the folder designated by the user, the directory structure of the package of files is as follows.
(The directory structure is an image in the original publication and is not reproduced in this text.)
3.1.2 software suite usage
The software suite ICJointLE V1.0 is run in the following two steps.
(1) Data preparation phase
preparing_data                          creates a default directory "yourdata" under the current directory
Or
preparing_data datadir                  creates a directory "datadir" under the current directory
Or
preparing_data datadir your_PPIs.txt    generates all matrix data files within "datadir", which must contain your_PPIs.txt
The PPIs file format:
After creating the directory "yourdata" or "datadir", copy your PPIs file (e.g., your_PPIs.txt) into that directory. Note that the PPIs file must conform to the following format.
your_PPIs.txt: one pair of systematic names per line, separated by a tab
YKL171W	YML096W
YFL017W-A	YFR031C-A
...
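A file in this format can be checked before it is handed to the suite. The following Python sketch is a hypothetical helper, not part of ICJointLE; it only enforces the rule stated above (two tab-separated names per line).

```python
def parse_ppi_lines(lines):
    """Parse lines of the form '<name><TAB><name>' into a list of pairs.

    Blank lines are skipped; any other malformed line raises ValueError.
    """
    pairs = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2 or not all(f.strip() for f in fields):
            raise ValueError(
                f"line {line_no}: expected two tab-separated names, got {line!r}"
            )
        pairs.append((fields[0].strip(), fields[1].strip()))
    return pairs


sample = ["YKL171W\tYML096W\n", "YFL017W-A\tYFR031C-A\n"]
print(parse_ppi_lines(sample))
```

To validate a real file, the same function can be called on an open file object, for example `parse_ppi_lines(open("your_PPIs.txt", encoding="utf-8"))`.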
Thus, all matrix data files can be generated using the following format.
preparing_data yourdata your_PPIs.txt
Or
preparing_data datadir your_PPIs.txt
After the data preparation phase is completed, the directory "yourdata" or "datadir" (assumed to be "yourdata") contains the files listed in table 2.
TABLE 2 Associated data files (the table is an image in the original publication and is not reproduced in this text)
(2) Identification and evaluation phase
In this stage, protein complexes are first identified by the core algorithm ICJointLE under the tuned parameter settings listed in Table 3, and the quality of the identified complexes is then evaluated with CYC2008 as the reference.
Optional parameters
TABLE 3 Description of optional parameters (the table is an image in the original publication and is not reproduced in this text)
Examples
Setting all optional parameters
identify_and_analyze yourdata your_PPIs.txt -L 1 -r 999 -d 0.3 -c 0.7 -f 0.75 -p 0.3 -m 0.08 -u 0.01 -e 0.9
Using defaults for some optional parameters
identify_and_analyze yourdata your_PPIs.txt -r 990 -c 0.6 -f 0.8 -p 0.1 -m 0.4 -e 0.7
Using defaults for all optional parameters (default values as listed in Table 3)
identify_and_analyze STRING STRING_PPIs.txt
identify_and_analyze BioGrid BioGrid_PPIs.txt
identify_and_analyze DIP DIP_PPIs.txt
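When many parameter settings are tried, the command lines above can be assembled programmatically. The sketch below is a hypothetical wrapper: it only uses the command name and the flag letters that appear in the examples, and it does not know what each flag means, since Table 3 is not reproduced in this text.

```python
import shlex

# Flag letters taken from the example commands; their meanings are defined in Table 3.
KNOWN_FLAGS = ("L", "r", "d", "c", "f", "p", "m", "u", "e")


def build_command(data_dir, ppi_file, **options):
    """Return the argument list for one identify_and_analyze run."""
    unknown = set(options) - set(KNOWN_FLAGS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    args = ["identify_and_analyze", data_dir, ppi_file]
    for flag in KNOWN_FLAGS:
        if flag in options:
            args += [f"-{flag}", str(options[flag])]
    return args


command = build_command("STRING", "STRING_PPIs.txt", r=990, c=0.6)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` from the program module folder; omitted flags fall back to the defaults of the suite.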
3.2 correlation data
The data currently used by the software suite ICJointLE V1.0 are Saccharomyces cerevisiae (yeast) data sets. Saccharomyces cerevisiae has been studied extensively as a model organism, and a large amount of biological data is available for it, which is the main reason why this study uses Saccharomyces cerevisiae data. Six yeast PPI datasets were selected for the experiments. The first dataset is from version 10 of the STRING database and contains 6418 proteins and 939998 interactions, each with a reliability score. The second dataset consists of 5811 proteins and 256516 interactions and is derived from the yeast PPI data of version 3.4.128 of the BioGrid database. The third yeast PPI dataset is derived from the DIP database release of 2015/07/01 and comprises 5022 proteins and 22381 interactions. The 3 additional datasets are yeast binary interaction data generated by yeast two-hybrid experiments: Uetz, Ito and Yu. The Uetz dataset contains 910 proteins and 823 interactions, the Ito dataset consists of 765 proteins and 733 interactions, and the Yu dataset consists of 1203 proteins and 1610 interactions.
CYC2008, as a set of known complexes, contains 408 manually curated heteromeric protein complexes. The gene expression data GSE3431 contains not only gene expression data for 3 consecutive metabolic cycles, but also GO term annotations in 3 categories for the expressed genes. Yeast protein localization data was derived from http:// yeastgfp. The present invention notes that some proteins in the CYC2008 and PPI datasets have no protein localization data. To accurately identify as many protein complexes in CYC2008 as possible, the localization vector of each protein lacking localization data is set to all "1"s when the joint co-localization count is calculated for a group of proteins that contains such proteins. Thus, the method of the invention, ICJointLE, can still identify complexes in CYC2008 that contain proteins lacking localization data.
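The dataset sizes quoted above differ by orders of magnitude, which a short calculation makes concrete. The sketch computes the average degree 2E/N for each network from the protein and interaction counts given in the text; the helper name is illustrative.

```python
# (proteins, interactions) as stated in the description above.
datasets = {
    "STRING": (6418, 939998),
    "BioGrid": (5811, 256516),
    "DIP": (5022, 22381),
    "Uetz": (910, 823),
    "Ito": (765, 733),
    "Yu": (1203, 1610),
}


def average_degree(n_proteins, n_interactions):
    """Each interaction contributes to the degree of two proteins."""
    return 2 * n_interactions / n_proteins


for name, (n, m) in datasets.items():
    print(f"{name}: average degree {average_degree(n, m):.2f}")
```

The much higher average degree of the STRING network compared with the two-hybrid datasets suggests why per-interaction reliability scores matter when mining dense regions.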
3.3 output results
The software suite ICJointLE V1.0 outputs the identified complexes and their quality evaluation; the output is stored as files in the "complexes" subdirectory, as listed in Table 4.
TABLE 4 Identified complexes and their quality evaluation (the table is an image in the original publication and is not reproduced in this text)
Example 2: example of user operation
As shown in FIG. 4, assume that the software suite ICJointLE V1.0 is installed in the folder d:\ICJointLE V1.0.
1. Data preparation phase
The related data generation process is described by taking the STRING data set as an example.
Creating a data set folder
At the command line, enter the program module folder of the software suite ICJointLE V1.0 and execute the batch command preparing_data.bat; the operation is shown in fig. 5.
As shown in fig. 6, a folder named STRING is created.
Preparing PPI dataset files
The PPI file (STRING_PPIs.txt) that meets the format requirements is copied into \STRING, as shown in fig. 7.
Generating matrix data files
At the command line, enter the program module folder of the software suite ICJointLE V1.0 and execute the batch command preparing_data.bat in the following format; the operation is shown in fig. 8.
preparing_data STRING STRING_PPIs.txt
After the data preparation phase is complete, a series of matrix data files is generated in the folder STRING (see FIG. 9).
2. Identification and evaluation of protein complexes
The software suite ICJointLE V1.0 program module set folder is entered in the command line state, and then the following format batch commands are executed, the operation process is shown in FIG. 10.
identify_and_analyze STRING STRING_PPIs.txt -L 1 -r 999 -d 0.3 -c 0.7 -f 0.75 -p 0.3 -m 0.08 -u 0.01 -e 0.9
After the identification and evaluation phase is complete, the files shown in FIG. 11 are generated in the subfolder complexes of the folder STRING.
The technical effects of the present invention will be described in detail with reference to experiments.
To reflect the quality of the protein complexes identified by the software suite, Tables 5-7 compare nine algorithms, including ICJointLE, on the STRING PPI data set from three perspectives: exact matching, approximate matching and biological relevance.
Table 5 compares the numbers of exactly matched complexes. The total number of complexes identified exactly by the software suite ICJointLE is clearly larger than that of the other algorithms, especially for complexes of size 2-3.
TABLE 5 comparison of quantity distributions of accurately identified protein complexes on different scales
Figure BDA0002420421610000141
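Exact matching can be illustrated as follows. This is a minimal Python sketch, assuming that a predicted complex matches exactly when its member set equals that of a reference complex; the function name exact_matches_by_size and the toy protein identifiers are hypothetical and are not taken from the STRING or CYC2008 data.

```python
from collections import Counter


def exact_matches_by_size(predicted, reference):
    """Count predicted complexes whose member set equals a reference
    complex, grouped by complex size."""
    reference_sets = {frozenset(c) for c in reference}
    matched = {frozenset(c) for c in predicted} & reference_sets
    return Counter(len(c) for c in matched)


# Toy data (hypothetical identifiers).
reference = [{"A", "B"}, {"C", "D", "E"}, {"F", "G", "H", "I"}]
predicted = [{"B", "A"}, {"C", "D", "E"}, {"F", "G"}, {"X", "Y"}]
counts = exact_matches_by_size(predicted, reference)
print(sorted(counts.items()))  # [(2, 1), (3, 1)]
```

Using frozenset makes the comparison independent of member order and removes duplicate predictions before counting.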
Table 6 compares the evaluation indexes of approximate matching. The software suite ICJointLE outperforms the other algorithms on every index except Sn.
TABLE 6 comparison of evaluation indexes for identified protein complexes
Figure BDA0002420421610000142
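The Sn index is not defined in this section. The sketch below assumes the complex-wise sensitivity commonly used in the protein complex literature, Sn = sum_i max_j T_ij / sum_i N_i, where T_ij is the number of proteins shared by reference complex i and predicted complex j, and N_i is the size of reference complex i. The function name and toy data are hypothetical.

```python
def sensitivity(predicted, reference):
    """Complex-wise sensitivity: for each reference complex take its best
    overlap with any predicted complex, then normalise by the total
    number of proteins in the reference complexes."""
    if not predicted or not reference:
        return 0.0
    covered = sum(max(len(ref & pred) for pred in predicted)
                  for ref in reference)
    total = sum(len(ref) for ref in reference)
    return covered / total


# Toy data (hypothetical identifiers).
reference = [{"A", "B", "C"}, {"D", "E"}]
predicted = [{"A", "B"}, {"D", "E", "F"}]
sn = sensitivity(predicted, reference)
print(sn)  # 0.8
```

Under this definition a single predicted cluster containing every protein would reach Sn = 1, which is why Sn is normally read together with the other indexes rather than on its own.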
Table 7 compares the significance of functional enrichment in BP terms. The percentage of complexes identified by ICJointLE that are significantly enriched in BP terms is greater than that of the other algorithms, both overall and within each size group.
TABLE 7 comparison of the BP enrichment significance of identified protein complexes
Figure BDA0002420421610000151
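This section does not state how enrichment significance is computed. A common choice, assumed here purely for illustration, is the hypergeometric test: given a population of N proteins of which K carry a BP term, the p-value of a complex of size n containing k annotated proteins is the probability of drawing at least k annotated proteins by chance. The function and parameter names are hypothetical, and multiple-testing correction is omitted.

```python
from math import comb


def enrichment_p_value(population, annotated, complex_size, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, complex_size)."""
    upper = min(annotated, complex_size)
    total = comb(population, complex_size)
    tail = sum(comb(annotated, i) * comb(population - annotated, complex_size - i)
               for i in range(hits, upper + 1))
    return tail / total


# Toy numbers: 20 proteins in total, 5 annotated with the term,
# a complex of 4 proteins of which 3 carry the term.
p = enrichment_p_value(population=20, annotated=5, complex_size=4, hits=3)
print(round(p, 4))  # 0.032
```

A complex would then be counted as significantly enriched when its smallest p-value over all BP terms falls below a chosen cutoff; the cutoff itself is not given in this section.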
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (7)

1. A protein interaction network co-localization co-expression complex identification method is characterized by comprising the following steps:
step one, a matrix data preparation stage: extracting protein localization data, gene expression data, protein interaction data and protein GO annotation data;
step two, analyzing and calculating to sequentially generate an inter-protein interaction matrix with reliability scores, a protein localization matrix, a gene expression matrix, a CC-based protein similarity matrix, an MF-based protein similarity matrix and a BP-based protein similarity matrix;
step three, identifying protein complexes with the core algorithm ICJointLE under tuned parameter settings, the process being divided into the following 3 sequential steps:
(1) protein complex core identification stage: mining densely and reliably connected, jointly co-localized and jointly co-expressed protein cores by applying a seed expansion strategy according to the core-attachment structure;
(2) protein complex attachment addition stage: adding strongly and reliably connected, jointly co-localized and jointly co-expressed protein attachments;
(3) overlapping protein complex screening stage: deleting overlapping complexes with a low reliable connection density;
and step four, evaluating the quality of the identified complexes by taking CYC2008 as a reference.
2. The method for identifying protein interaction network co-localized co-expression complexes of claim 1, wherein the method employs a Saccharomyces cerevisiae (yeast) dataset.
3. The method for identifying protein interaction network co-localized co-expression complexes of claim 1, wherein the CYC2008 comprises 408 manually curated heteromeric protein complexes and serves as the set of known complexes; and the gene expression data GSE3431 contain not only gene expression data for 3 consecutive metabolic cycles but also three categories of GO term annotations for the expressed genes.
4. The method for identifying protein interaction network co-localized co-expression complexes of claim 1, wherein ICJointLE identifies complexes in CYC2008 that contain proteins lacking localization data as follows: some of the proteins in the CYC2008 and PPI datasets have no protein localization data, and when the joint co-localization count is calculated for a protein group containing such proteins, the localization vectors of the proteins lacking localization data are set to all 1's.
5. A protein-interacting network co-localized co-expression complex recognition system for implementing the method of any one of claims 1 to 4, wherein the protein-interacting network co-localized co-expression complex recognition system comprises:
the data extraction module is used for extracting protein localization data, gene expression data, protein interaction data and protein GO annotation data;
the matrix data generation module is used for sequentially generating an inter-protein interaction matrix with reliability scores, a protein localization matrix, a gene expression matrix, a CC-based protein similarity matrix, an MF-based protein similarity matrix and a BP-based protein similarity matrix;
the identification and evaluation module is used for identifying protein complexes with the core algorithm ICJointLE under tuned parameter settings, and then evaluating the quality of the identified complexes by taking CYC2008 as a reference;
the protein complex core mining module is used for mining densely and reliably connected, jointly co-localized and jointly co-expressed protein cores;
the protein complex attachment addition module is used for adding strongly and reliably connected, jointly co-localized and jointly co-expressed protein attachments;
the protein complex screening module is used for deleting overlapping complexes with a low reliable connection density.
6. An information data processing terminal for implementing the method for identifying a co-localized and co-expressed complex in a protein interaction network according to any one of claims 1 to 4.
7. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method for protein-interacting network co-localized co-expression complex identification of any one of claims 1-4.
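The handling of proteins without localization data recited in claim 4 can be illustrated with a minimal Python sketch. It assumes, as an illustration only, that each localization vector is binary over a fixed list of cellular compartments and that the joint co-localization count of a protein group is the number of compartments shared by all of its members; neither the vector layout nor this definition of the count is specified in the claims, and all names below are hypothetical.

```python
def joint_colocalization_count(proteins, localization, n_compartments):
    """Number of compartments in which every protein of the group is located.

    Proteins missing from `localization` receive an all-ones vector, so
    they never reduce the count contributed by the annotated proteins.
    """
    all_ones = [1] * n_compartments
    shared = list(all_ones)
    for protein in proteins:
        vector = localization.get(protein, all_ones)
        shared = [s & v for s, v in zip(shared, vector)]
    return sum(shared)


# Toy data: three compartments; P3 has no localization record.
localization = {"P1": [1, 1, 0], "P2": [0, 1, 1]}
with_missing = joint_colocalization_count(["P1", "P2", "P3"], localization, 3)
without_missing = joint_colocalization_count(["P1", "P2"], localization, 3)
print(with_missing, without_missing)  # 1 1
```

Under these assumptions the all-ones vector is the neutral element of the element-wise AND, so a protein without localization data neither adds to nor removes from the compartments shared by the rest of the group.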
CN202010204246.6A 2020-03-21 2020-03-21 Protein interaction network co-location co-expression complex recognition system and method Pending CN111370060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010204246.6A CN111370060A (en) 2020-03-21 2020-03-21 Protein interaction network co-location co-expression complex recognition system and method

Publications (1)

Publication Number Publication Date
CN111370060A true CN111370060A (en) 2020-07-03

Family

ID=71210532

Country Status (1)

Country Link
CN (1) CN111370060A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050260663A1 (en) * 2004-05-18 2005-11-24 Neal Solomon Functional proteomics modeling system
US20070072226A1 (en) * 2005-09-27 2007-03-29 Indiana University Research & Technology Corporation Mining protein interaction networks
CN103559426A (en) * 2013-11-06 2014-02-05 北京工业大学 Protein functional module excavating method for multi-view data fusion
CN106021988A (en) * 2016-05-26 2016-10-12 河南城建学院 Recognition method of protein complexes
US20190139621A1 (en) * 2016-04-27 2019-05-09 Zhong Wang Method for identifying key module or key node in biomolecular network
CN109887544A (en) * 2019-01-22 2019-06-14 广西大学 RNA sequence parallel sorting method based on Non-negative Matrix Factorization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张锦雄等: "A method for identifying protein complexes with the features of joint co-localization and joint co-expression in static PPI networks", 《COMPUTERS IN BIOLOGY AND MEDICINE》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200703
