WO2012093869A2

WO2012093869A2 - Method for predicting location of protein within cell and molecular function of protein for each condition

Info

Publication number: WO2012093869A2
Application number: PCT/KR2012/000118
Authority: WO
Inventors: 이기영
Original assignee: 아주대학교산학협력단
Priority date: 2011-01-07
Filing date: 2012-01-05
Publication date: 2012-07-12
Also published as: WO2012093869A3

Abstract

The present invention provides a method for predicting the location of a target protein in a cell under predetermined conditions by inputting static characteristics of an individual protein or gene, protein interaction information with neighboring proteins, an expression profile of proteins or genes, and the like. With the method of the present invention, the location of a target protein in a cell under specific conditions can be effectively predicted by inputting a known target protein and predetermined conditions; conversely, conditions such as disease stages can be ascertained from the location of the target protein within a cell. In addition, a method is provided for predicting the biological process or molecular function of a protein for each condition by applying the method of the present invention with the biological process or molecular function of the target protein as input.

Description

Method for Predicting the Intracellular Location and Molecular Function of a Protein for Each Condition

The technical field of the present invention is bioinformatics.

Proteins perform different functions depending on conditions such as external stresses, stages of disease development, and/or stages of cell differentiation. These endogenous or exogenous conditions affect protein function and give rise to regulatory mechanisms at the genomic and/or proteomic level. Accordingly, many efforts have been made to identify these condition-dependent functions.

One successful example of such efforts is the Gene Ontology (GO) project. GO provides three distinct, well-structured sets of clearly defined terms. However, the current GO is not tied to any condition.

Information about the subcellular location and translocation of proteins among cellular compartments is important for understanding proteins and cellular functions.

However, existing experimental approaches have been able to identify the locations of only a few proteins, and most prediction methods predict only an unconditional location rather than a condition-specific one.

Protein location prediction may be performed by comparing basic information of the target protein with that of other proteins whose cellular location is known. Such prediction methods may be based on known protein sequences or structural features. However, these existing methods lack accuracy, make poor use of the available information, and, more importantly, are not designed to predict the location of a protein for each condition.

Accordingly, an object of the present invention is to propose a method that can effectively predict the location of a protein for each specific condition.

That is, a dynamic protein interaction network is generated for the specific condition from known information, and information on the target protein and its neighbors is applied to that network. The object is thus to propose an efficient condition-specific prediction method that predicts and outputs the location of the target protein in the cell under that condition.

Another object of the present invention is to propose a method that can effectively predict the location of a protein under a specific condition even from a single expression profile.

That is, even when there is only one expression profile sample, as in cell differentiation, a dynamic network is generated from known information, and information on the target protein and its neighbors is applied to that network. The object is thus to propose an efficient condition-specific prediction method that predicts and outputs the location of the target protein in the cell under such a condition.

A further object of the present invention is to propose a method that can effectively predict dynamic functional information, including the location of a protein, for each specific time and external stimulus condition.

That is, a dynamic interaction network is generated for a specific external stimulus condition from known information, and information on the target protein and its neighbors is applied to that network. The object is thus to propose a method for predicting dynamic functional information that predicts and outputs the location of the target protein in the cell at a specific time and under a specific external stimulus condition.

A further object of the present invention is to propose a method that can effectively predict, in a similar manner, not only the intracellular location but also other functional information of a protein (e.g., molecular functions and biological processes).

In order to solve the above problems, a first embodiment of the present invention provides a method for predicting the intracellular location of a protein for each condition, comprising the steps of: (a) generating a static feature from input static characteristics of a target protein; (b) generating a static network from input static protein-protein interaction information; (c) generating a network feature by applying static characteristics of neighboring proteins of the target protein to the static network; (d) generating a location-feature model using the static feature and the network feature; (e) calculating a coherence score using an expression profile under a specific condition; (f) generating a dynamic network by assigning the coherence score to the static network as a weight; (g) generating a protein feature under the specific condition by applying the static characteristics and location information of the target protein to the dynamic network; and (h) determining the location of the target protein under the specific condition using the protein feature and the location-feature model.

Preferably, the method further comprises (i) repeating steps (a) to (h) under a plurality of conditions to determine a translocational protein, that is, a protein whose location changes as the condition changes.

Preferably, step (e) includes determining the major neighboring proteins from the coherence score.

Preferably, step (h) further includes outputting the location of the target protein as a possibility degree, that is, the degree to which the protein is likely to exist at a specific location.

Preferably, the coherence score is calculated from one or both of the similarity of expression levels and the similarity of expression profile patterns between the target protein (or gene) and the neighboring protein (or gene).

Preferably, in step (a), the static characteristics include one or more items selected from the group consisting of sequence information, chemical information, motif information, and function information of a single protein.

Preferably, in step (b), the static protein interaction information includes interaction information at the protein level or the gene level.

Preferably, in step (c), the neighboring proteins of the target protein are determined from the static characteristics and the static protein interaction information.

Preferably, step (d) includes selecting the main features among the static features and the network features for each specific location using a Divide-and-Conquer k-Nearest Neighbor (DCkNN) classifier.

Preferably, the condition is a disease grade, and more particularly a cancer grade.

In order to solve the above problems, a second embodiment of the present invention provides a method for predicting the intracellular location of a protein for each condition according to cell differentiation, comprising the steps of: (a) generating a static feature from input static characteristics of a target protein; (b) generating a location-feature model using the static feature; (c) calculating a coherence score using expression levels under a predetermined condition; (d) generating a dynamic network by assigning the coherence score to the static feature as a weight; (e) generating a protein feature under the predetermined condition by applying the static characteristics and location information of the target protein to the dynamic network; and (f) determining the location of the target protein under the predetermined condition using the protein feature and the location-feature model, wherein the condition is a condition according to cell differentiation.

Preferably, the method further comprises repeating steps (a) to (f) under a plurality of conditions to determine a translocational protein, that is, a protein whose intracellular location changes as the condition changes with cell differentiation.

Preferably, step (c) includes determining the major neighboring proteins from the coherence score.

Preferably, step (f) further includes outputting the location of the target protein as a possibility degree, that is, the degree to which the protein is likely to exist at a predetermined location.

Preferably, the coherence score is calculated from the similarity of expression patterns or expression levels between the target protein (or gene) and the neighboring protein (or gene).

Preferably, step (b) includes selecting the main features among the static features for each intracellular location using a Divide-and-Conquer k-Nearest Neighbor (DCkNN) classifier.

Preferably, the conditions include one or more of a neural stem cell (HB1.F3) condition and an oligodendrocyte (F3.Olig2) condition.

In order to solve the above problems, a third embodiment of the present invention provides a method for predicting the intracellular location of a protein according to time and external stimulus conditions, comprising the steps of: (a) generating a static feature from input static characteristics of a target protein; (b) generating a static network from input static protein-protein interaction information; (c) generating a network feature by applying static characteristics of neighboring proteins of the target protein to the static network; (d) generating a location-feature model using the static feature and the network feature; (e) calculating a coherence score using an expression profile for each time point under a specific external stimulus condition; (f) generating a dynamic network by assigning the coherence score to the static network as a weight; (g) generating a protein feature under the specific external stimulus condition by applying the static characteristics and location information of the target protein to the dynamic network; and (h) determining the location of the target protein at a specific time and under the specific external stimulus condition using the protein feature and the location-feature model, wherein the expression profile is a microarray result expressed as a time series.

Preferably, the method further comprises (i) repeating steps (a) to (h) at a plurality of times and under a plurality of external stimulus conditions to determine a translocational protein, that is, a protein whose intracellular location changes as the time and external stimulus condition change.

Preferably, step (h) further includes outputting the location of the target protein as a possibility degree, that is, the degree to which the protein is likely to exist at a predetermined location.

Preferably, step (d) includes selecting the main features among the static features and the network features for each intracellular location using a Divide-and-Conquer k-Nearest Neighbor (DCkNN) classifier.

Any externally applied condition may be used as the external stimulus condition; for example, one or more of a DTT (dithiothreitol) condition and an MMS (methyl methanesulfonate) condition may be used.

In addition, the method according to the present invention may further comprise predicting the biological process or molecular function of a protein for each condition by using the biological process or molecular function of the target protein as input.

According to the present invention, the location of a target protein under a specific condition can be effectively predicted by inputting information on the target protein, protein interaction information, and expression information of proteins or genes under that condition. The method can predict the location of a target protein across all proteins (proteome-wide) and under any condition (condition-wide). As shown in FIG. 4 and described below, the prediction is highly accurate.

In addition, the specific condition can be effectively predicted from information on the target protein and its location. For example, when networks are built for each cancer grade, information on a target protein and its location may be input to determine whether the condition is normal, low-grade cancer, or high-grade cancer.
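The reverse inference described above can be sketched as a lookup over per-condition location predictions. This is only an illustration under stated assumptions: the function name `infer_conditions` is hypothetical, and the degrees are made-up numbers that merely mirror the qualitative KIF13A outcome reported later for FIG. 4 (Golgi under the normal condition, nucleus at low and high cancer grades).

```python
def infer_conditions(location_map, observed_location):
    """Return the conditions whose most likely location matches the observation.

    location_map: {condition: {location: possibility degree}}
    """
    matches = []
    for condition, degrees in location_map.items():
        predicted = max(degrees, key=degrees.get)  # location with the highest degree
        if predicted == observed_location:
            matches.append(condition)
    return matches


# Made-up degrees (assumption); only the ranking per condition follows the text.
kif13a_map = {
    "Normal": {"GL": 0.7, "NU": 0.3},
    "Low": {"GL": 0.2, "NU": 0.8},
    "High": {"GL": 0.1, "NU": 0.9},
}

print(infer_conditions(kif13a_map, "GL"))
print(infer_conditions(kif13a_map, "NU"))
```

An observed location can match more than one condition, so the sketch returns a list rather than a single answer.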

In addition, according to the present invention, even if only one expression profile is available for a specific condition, the location of the target protein under that condition can be effectively predicted by inputting information on the target protein and expression information of proteins or genes under that condition. The method can therefore effectively predict the location of a target protein under the conditions that arise during cell differentiation. As shown in FIG. 7 and described below, the prediction is highly accurate.

In addition, the method according to the present invention makes it possible to effectively predict the location of a target protein across all proteins (proteome-wide) and at any time and under any external stimulus condition (time- and condition-wide). As shown in FIGS. 9 to 11 and described below, the prediction is highly accurate.

Prediction is possible not only according to time and external stimulus conditions but also, of course, under normal conditions. This makes it possible to predict locations under normal conditions that were previously unknown and to verify those that were already known.

In addition, by using the biological process or molecular function of the target protein as input, the biological process or molecular function of the protein under various conditions can be accurately predicted.

FIGS. 1 and 2 are a flowchart and a reference diagram for performing a method according to the first embodiment of the present invention.

FIG. 3 is a view for explaining neighboring proteins and their locations according to the coherence score.

FIG. 4 shows the predicted location information of a specific protein for each condition and the result of verifying it.

FIGS. 5 and 6 are a flowchart and a reference diagram for performing a method according to the second embodiment of the present invention.

FIG. 7 shows the predicted location information of target proteins for each condition and the result of verifying it.

FIG. 8 is a flowchart and a reference diagram for performing a method according to the third embodiment of the present invention.

FIGS. 9 to 11 show the results of predicting the intracellular location of yeast proteins according to time and external stimulus conditions and the results of verifying those predictions.

In the present invention, information is input through an input device (not shown) to a control unit (not shown) capable of computer processing and is output through an output device (not shown). The control unit may be any device capable of computing information; the input device may be any device capable of inputting information to the control unit, such as a keyboard or a mouse; and the output device may be any device capable of presenting the result visually to the user, such as a monitor or a printer.

Hereinafter, the "location" of a protein means the subcellular localization of the protein. For example, the location may be any one of Actin (AT), Cell Cortex (CC), Centrosome (CT), Cytosol (CY), Endoplasmic Reticulum (ER), Golgi Apparatus (GL), Lysosome (LS), Mitochondrion (MT), Nucleolus (NO), Nucleus (NU), Peroxisome (PX), Plasma Membrane (PM), and Vacuole (VU) (see FIG. 4).

Hereinafter, a "neighboring protein" refers to a protein that is expected to be closely related to the protein of interest and to be located at the same intracellular location under a specific condition. Which proteins are neighbors depends on the condition and on the protein of interest. In addition, under a given condition a target protein may have several neighboring proteins, each with a different degree of neighborhood or coherence. Thus, as will be described later, the location prediction is preferably expressed as a possibility degree (see FIGS. 3 and 4).

Example 1 Method for Predicting Intracellular Location of Conditional Proteins

With reference to FIGS. 1 and 2, a method for predicting the intracellular location of a protein for each condition according to the first embodiment of the present invention will be described.

In the first embodiment of the present invention, a large amount of previously known data is used to predict the location of the target protein under a specific condition.

First, the static characteristics of a single protein are input to the control unit to generate a static feature (S110). The input static characteristics may include sequence information, chemical information, motif information, function information, and the like. Since the content of such information is known in the art, a detailed description is omitted.

Next, static protein-protein interaction information related to the target protein is input to the control unit to generate a static network (S120).
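The static network of step S120 can be pictured as an undirected graph keyed by protein. The sketch below is illustrative only: the function name `build_static_network` is hypothetical, and the three pairs simply reuse the protein names from the description of FIG. 3 as toy input rather than an actual interaction dataset.

```python
from collections import defaultdict


def build_static_network(interactions):
    """Return an undirected adjacency map {protein: set of interacting proteins}."""
    network = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        network[a].add(b)
        network[b].add(a)
    return dict(network)


# Toy input (assumption): pairs built from the proteins named for FIG. 3.
pairs = [("KIF13A", "AP1G1"), ("KIF13A", "COG2"), ("KIF13A", "ATF7IP")]
static_network = build_static_network(pairs)
print(sorted(static_network["KIF13A"]))
```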

Next, a network feature is generated by applying the static characteristics of the neighboring proteins of the target protein to the static network generated in step S120 (S130). Neighboring proteins can be determined using known static characteristics, static protein-protein interactions, and the like. When the intracellular location of a neighboring protein is known, that location information can be input as well.
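One simple way to picture a network feature is the fraction of a protein's annotated neighbors found at each of the 13 locations. This is a minimal sketch under that assumption, not the feature definition used in the specification; `network_feature` is a hypothetical name, and the neighbor locations are the ones stated in the text for AP1G1, COG2 (Golgi) and ATF7IP (nucleus).

```python
LOCATIONS = ["AT", "CC", "CT", "CY", "ER", "GL", "LS",
             "MT", "NO", "NU", "PX", "PM", "VU"]


def network_feature(target, network, known_location):
    """Fraction of the target's annotated neighbors at each location."""
    counts = {loc: 0 for loc in LOCATIONS}
    annotated = 0
    for neighbor in network.get(target, set()):
        loc = known_location.get(neighbor)
        if loc in counts:  # skip neighbors with no (or unknown) annotation
            counts[loc] += 1
            annotated += 1
    if annotated == 0:
        return [0.0] * len(LOCATIONS)
    return [counts[loc] / annotated for loc in LOCATIONS]


network = {"KIF13A": {"AP1G1", "COG2", "ATF7IP"}}
known_location = {"AP1G1": "GL", "COG2": "GL", "ATF7IP": "NU"}
feature = network_feature("KIF13A", network, known_location)
print(dict(zip(LOCATIONS, feature)))
```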

Static and network features are shown at the top of A of FIG. 2.

Next, a location-feature model is generated by selecting good features for each intracellular location (S140). Features can be selected for each of the 13 intracellular locations described above, and the selection can use a DCkNN classifier, which selects good features automatically. The DCkNN classifier and the method of using it are known in the art, and a detailed description is therefore omitted. The optimal features and their combinations can be selected through the DCkNN classifier. The location-feature model is shown at the bottom of A of FIG. 2, where the selected features are marked in black.
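The idea of a per-location feature subset can be illustrated with an ordinary k-nearest-neighbor vote that measures distance only over the features selected for one location. Plain kNN is used here as a stand-in, since the description does not spell out DCkNN; the function name `knn_vote`, the feature vectors, and the selected indices are all assumptions made for illustration.

```python
import math


def knn_vote(query, training, selected, k=3):
    """Fraction of the k nearest training examples that are at the location.

    training: list of (feature_vector, is_at_location) pairs
    selected: indices of the features chosen for this location
    """
    def distance(u, v):
        # Euclidean distance restricted to the selected features
        return math.sqrt(sum((u[i] - v[i]) ** 2 for i in selected))

    nearest = sorted(training, key=lambda ex: distance(query, ex[0]))[:k]
    return sum(1 for _, label in nearest if label) / len(nearest)


# Made-up training data (assumption); the third feature is noise and is not selected.
training = [
    ([0.9, 0.1, 5.0], True),
    ([0.8, 0.2, -3.0], True),
    ([0.1, 0.9, 4.0], False),
    ([0.2, 0.8, -2.0], False),
]
score = knn_vote([0.85, 0.15, 0.0], training, selected=[0, 1], k=3)
print(score)
```

Restricting the distance to the selected indices is what lets each location rely on its own subset of features.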

Next, a coherence score is calculated using an expression profile under a specific condition (S150), and a dynamic network is generated by assigning this score as a weight to each protein-protein interaction of the static network generated in S120 (S160). The weighted dynamic network is shown in B of FIG. 2.
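Steps S150 and S160 amount to keeping the static topology and attaching a condition-specific weight to every edge. The sketch below assumes a pluggable scoring function; `toy_score` is a deliberately simple placeholder and is not the coherence score of the specification, and the protein names and expression values are made up.

```python
def build_dynamic_network(static_edges, expression, score_fn):
    """Weight each static edge with a score computed from condition-specific expression."""
    weighted = {}
    for a, b in static_edges:
        if a in expression and b in expression:  # skip proteins without expression data
            weighted[(a, b)] = score_fn(expression[a], expression[b])
    return weighted


def toy_score(x, y):
    """Placeholder score (assumption): 1 / (1 + mean absolute difference)."""
    diff = sum(abs(p - q) for p, q in zip(x, y)) / len(x)
    return 1.0 / (1.0 + diff)


# Hypothetical target "T" with two neighbors and made-up expression values.
edges = [("T", "N1"), ("T", "N2"), ("T", "N3")]
expression = {"T": [1.0, 2.0, 3.0], "N1": [1.0, 2.0, 3.0], "N2": [3.0, 4.0, 5.0]}
dynamic = build_dynamic_network(edges, expression, toy_score)
print(dynamic)
```

Running the same static edges against the expression data of another condition yields a different weighting, which is what makes the network dynamic.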

The coherence score is calculated from one or both of the similarity of expression profile patterns and the similarity of expression levels between the protein (or gene) of interest and the neighboring protein (or gene).

More specifically, the coherence score can be calculated in various ways; in one embodiment, the following equation may be used.

Equation 1

Here, Φ(a, b) is the coherence score of a and b, a is the target protein, b is the neighboring protein, ρ(a, b) is the Pearson correlation coefficient of the expression levels of a and b, med(a) is the median of the expression level of a, med(b) is the median of the expression level of b, MEDIAN is the median over the genes used for a and b, and Ψ(x) is the p-value of x. According to the above equation, the coherence score is positive, and the more closely related the two proteins are, the larger its value.
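The definitions above name two measurable ingredients: the Pearson correlation ρ(a, b) and the medians med(a) and med(b). The sketch below computes only those ingredients from two expression vectors. It does not reproduce Equation 1, does not compute the p-value Ψ, and makes no claim about how the specification combines the terms; the function names and the sample values are assumptions.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def coherence_ingredients(expr_a, expr_b):
    """Pattern similarity (rho) and level gap (difference of medians)."""
    return {
        "rho": pearson(expr_a, expr_b),
        "median_gap": abs(median(expr_a) - median(expr_b)),
    }


# Made-up expression levels across four samples of one condition (assumption).
parts = coherence_ingredients([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(parts)
```

In this toy case the two profiles have the same pattern (ρ close to 1) but different levels, which is why the text treats pattern similarity and level similarity as separate inputs.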

The major neighboring proteins can be determined from the coherence score. In the example shown in FIG. 3, the AP1G1 and COG2 proteins are determined to be the most closely related neighboring proteins of KIF13A under the normal condition, whereas the ATF7IP protein is determined to be the most closely related neighboring protein at the low and high cancer grades. The thicker the edge, the larger the coherence score.
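Picking the major neighbors from a weighted network is a ranking over the edges that touch the target. In the sketch below the function name `major_neighbors` is hypothetical and the weights are made-up numbers chosen only so that the ranking matches the outcome described for FIG. 3.

```python
def major_neighbors(target, weighted_edges, top=1):
    """Return the `top` neighbors of `target` with the largest edge weights."""
    scored = []
    for (a, b), weight in weighted_edges.items():
        if a == target:
            scored.append((weight, b))
        elif b == target:
            scored.append((weight, a))
    # Highest weight first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top]]


# Made-up weights (assumption) for two conditions.
normal = {("KIF13A", "AP1G1"): 0.9, ("KIF13A", "COG2"): 0.8, ("KIF13A", "ATF7IP"): 0.1}
low = {("KIF13A", "AP1G1"): 0.2, ("KIF13A", "COG2"): 0.1, ("KIF13A", "ATF7IP"): 0.95}

print(major_neighbors("KIF13A", normal, top=2))
print(major_neighbors("KIF13A", low, top=1))
```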

Next, the static characteristics and location information of the target protein are applied to the dynamic network generated in S160 to generate a protein feature under the specific condition (S170).

Next, the location of the target protein under the specific condition may be predicted using the location-feature model generated in S140 and the protein feature generated in S170 (S180). As shown in C of FIG. 2, the location of the target protein is preferably output as a possibility degree of existing at each specific location.
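Outputting a degree per location rather than a single label can be illustrated with a weighted vote over the locations of the target's neighbors, normalized to sum to 1. This is a simplified stand-in for the classifier-based prediction of S180, not the method of the specification; `possibility_degrees` is a hypothetical name and the weights are made up.

```python
def possibility_degrees(target, weighted_edges, neighbor_location):
    """Sum edge weights per neighbor location and normalize them to sum to 1."""
    totals = {}
    for (a, b), weight in weighted_edges.items():
        if a == target:
            neighbor = b
        elif b == target:
            neighbor = a
        else:
            continue  # edge does not involve the target
        location = neighbor_location.get(neighbor)
        if location is None:
            continue  # neighbor without a known location
        totals[location] = totals.get(location, 0.0) + weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {loc: value / grand_total for loc, value in totals.items()}


# Made-up weights (assumption); neighbor locations as stated in the text.
edges_normal = {("KIF13A", "AP1G1"): 0.5, ("KIF13A", "COG2"): 0.3, ("KIF13A", "ATF7IP"): 0.2}
locations = {"AP1G1": "GL", "COG2": "GL", "ATF7IP": "NU"}
degrees = possibility_degrees("KIF13A", edges_normal, locations)
print(degrees)
```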

In addition, by repeating the above process, a conditional location map (CLM) can be generated that allows the location of a protein to be compared across comparable conditions, and translocational proteins, whose location changes as the condition changes, can be determined from it.

A CLM is shown on the left side of C of FIG. 2. As shown there, a protein whose location differs among the normal condition (Normal), the low cancer grade (Low), and the high cancer grade (High) is determined to be a protein whose location changes with the condition.
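A conditional location map can be held as a nested mapping from protein to condition to per-location degrees, and a translocational protein is then one whose top location is not the same under every condition. The sketch below assumes that representation; the function name, the placeholder protein `PROTEIN_X`, and all degrees are made up for illustration.

```python
def translocational_proteins(clm):
    """Return {protein: {condition: top location}} for proteins whose top location changes.

    clm: {protein: {condition: {location: possibility degree}}}
    """
    moved = {}
    for protein, by_condition in clm.items():
        top = {cond: max(deg, key=deg.get) for cond, deg in by_condition.items()}
        if len(set(top.values())) > 1:  # more than one distinct top location
            moved[protein] = top
    return moved


# Made-up degrees (assumption); PROTEIN_X is a hypothetical non-moving protein.
clm = {
    "KIF13A": {
        "Normal": {"GL": 0.7, "NU": 0.3},
        "Low": {"GL": 0.2, "NU": 0.8},
        "High": {"GL": 0.1, "NU": 0.9},
    },
    "PROTEIN_X": {
        "Normal": {"CY": 0.6, "NU": 0.4},
        "Low": {"CY": 0.7, "NU": 0.3},
        "High": {"CY": 0.8, "NU": 0.2},
    },
}
moved = translocational_proteins(clm)
print(moved)
```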

As described above, the method according to the first embodiment of the present invention is applicable proteome-wide, across all proteins, and condition-wide, across all conditions.

Such a condition may be any disease grade, and more specifically a cancer grade or stage. FIG. 2B, FIG. 3, and FIG. 4 show the normal condition (Normal), the low cancer grade (Low), and the high cancer grade (High).

FIG. 3 illustrates an example of determining the location of the KIF13A protein.

In the manner described above, the AP1G1 and COG2 proteins located in the Golgi apparatus (GL) were determined to be the most closely related neighboring proteins under the normal condition. At the low and high cancer grades, the ATF7IP protein located in the nucleus (NU) was determined to be the most closely related neighboring protein.

When the neighboring proteins are determined in this way, the control unit calculates their location as the location of the target protein under the specific condition (S800).

FIG. 4 illustrates an example of verifying the location of the KIF13A protein. "A" of FIG. 4 shows the predicted conditional location information of the KIF13A protein, expressed as possibility degrees obtained from the locations of the neighboring proteins determined as described above. KIF13A is most likely to be located in the Golgi apparatus (GL) under the normal condition and in the nucleus (NU) at the low and high cancer grades.

The result of verifying this is shown in "B" to "G" of the same figure. Yellow in "B" and cyan in "E" and "G" indicate that the location marker overlaps the KIF13A protein. It was therefore confirmed that the KIF13A protein was actually located at the corresponding locations (marked with "O") and was not located at the locations shown in "C", "D", and "F" (marked with "X"), which shows that the prediction according to the first embodiment of the present invention is correct.

Example 2 Method for Predicting Intracellular Location of Conditional Proteins According to Cell Differentiation

A method according to a second embodiment of the present invention will be described with reference to FIGS. 5 and 6.

As in the first embodiment, a large amount of previously known data is used in the second embodiment of the present invention to predict the location of the target protein under a specific condition.

First, as in the first embodiment, the static characteristics of a single protein are input to the control unit to generate a static feature (S110). The input static characteristics may include sequence information, chemical information, motif information, function information, and the like.

Next, a location-feature model is generated by selecting good features for each intracellular location (S140). Features can be selected for each of the 13 intracellular locations described above, and the selection can use a DCkNN classifier, which selects good features automatically. The DCkNN classifier and the method of using it are known in the art, and a detailed description is therefore omitted. The optimal features and their combinations can be selected through the DCkNN classifier.

Next, a coherence score is calculated using the expression level under a specific condition (S150), and a dynamic network is generated by assigning the score as a weight (S160). The weighted dynamic network is shown in A of FIG. 6.

As mentioned above, the second embodiment of the present invention is directed to a method for predicting the location of a protein regardless of the number of expression profiles available under a given condition, even when there is only one. With a single expression profile, the coherence score is calculated from the similarity of expression levels between the target protein (or gene) and the neighboring protein (or gene).

Equation 2

Here, Φ(a, b) is the coherence score of a and b, a is the target protein, b is the neighboring protein, ρ(a, b) is the Pearson correlation coefficient of the expression levels of a and b, and Ψ(x) is the p-value of x. The values x_i for a and y_i for b are range values of the probability distribution of the degree of correlation of all interacting protein pairs in the input. ρ(a, b) ranges from -1 to 1. In addition, n is the number of samples, S_x is the covariance of x, and γ_x is the relative expression level of x.

The major neighboring proteins can be determined from the coherence score.

A of FIG. 6 shows the result of using values obtained from microarrays of cell lines at different stages of cell differentiation, with neural stem cells (HB1.F3) and oligodendrocytes (F3.Olig2) as the conditions. As shown, ITGA5 was determined to be the major neighboring protein of SFRP2 under the F3.Olig2 condition. The thicker the edge, the larger the coherence score.
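With a single expression profile there is only one value per gene, so a correlation across samples is not available and neighbors have to be ranked by how close their expression levels are. The sketch below uses a simple inverse-distance similarity as a placeholder; it is not Equation 2, and the expression values, the function names, and the extra neighbor `NEIGHBOR_X` are made up for illustration.

```python
def level_similarity(level_a, level_b):
    """Similarity in (0, 1]; equals 1 when the two expression levels coincide."""
    return 1.0 / (1.0 + abs(level_a - level_b))


def rank_neighbors(target, neighbors, levels):
    """Rank neighbors of `target` by expression-level similarity, best first."""
    scored = [
        (level_similarity(levels[target], levels[n]), n)
        for n in neighbors
        if n in levels  # skip neighbors without a measured level
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


# Made-up single-sample expression levels for one condition (assumption).
levels_olig2 = {"SFRP2": 7.2, "ITGA5": 7.0, "NEIGHBOR_X": 2.2}
ranking = rank_neighbors("SFRP2", ["ITGA5", "NEIGHBOR_X"], levels_olig2)
print(ranking)
```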

Next, the location of the target protein under the specific condition may be predicted using the location-feature model generated in S140 and the protein feature generated in S170 (S180). As shown in B of FIG. 6, the location of the target protein is preferably output as a possibility degree of existing at each specific location.

As described above, the method according to the second embodiment of the present invention is applicable proteome-wide and condition-wide even when only one expression profile is available, as in cell differentiation.

A, B, C, and D of FIG. 7 show the conditional location maps and verification results of MYC, STAT3, SOX10, and REV3L, respectively. Of the two conditions shown on the x-axis of each conditional location map, the left is the HB1.F3 condition and the right is the F3.Olig2 condition.

As shown in FIG. 7A, the method according to the second embodiment of the present invention predicted that MYC would be located in the nucleus (NU) with high probability under both the HB1.F3 and F3.Olig2 conditions. Anti-MYC is shown in green and the nuclear marker in blue, and their overlap is shown in the third image, confirming that the prediction by the second embodiment was correct.

As shown in FIG. 7B, the method according to the second embodiment predicted that STAT3 would be located in the nucleus (NU) under both the HB1.F3 and F3.Olig2 conditions. Anti-STAT3 is shown in green and the nuclear marker in blue; the overlap result confirms that the prediction by the second embodiment was correct.

As shown in FIG. 7C, the method according to the second embodiment predicted that SOX10 would be located in the nucleus (NU) under both the HB1.F3 and F3.Olig2 conditions. Anti-SOX10 is shown in green, the nuclear marker in blue, and the cell membrane marker in red; the overlap result confirms that the prediction by the second embodiment was correct.

As shown in FIG. 7D, the method according to the second embodiment predicted that REV3L would be located in the nucleus (NU) under the HB1.F3 condition but only in the endoplasmic reticulum (ER) under the F3.Olig2 condition. In other words, REV3L was predicted to translocate from the nucleus (NU) to the endoplasmic reticulum (ER). The overlap results at each location, as shown, confirm that the prediction according to the second embodiment was correct.

Example 3 Method for Predicting Dynamic Function Including Cellular Location Information of Proteins over Time and External Stimulation Conditions

A method according to a third embodiment of the present invention will be described with reference to FIGS. 2 and 8.

Steps S110 to S140 are applied similarly to the method according to the first embodiment.

Next, a coherence score is calculated using an expression profile under a specific external stimulus condition (S150), and a dynamic network is generated by assigning this score as a weight to each protein-protein interaction of the static network generated in S120 (S160). The weighted dynamic network is shown in B of FIG. 2.

Here, microarray results are used as the input expression profile and expression level. The microarray result is expressed as a time series with time as a variable, so the coherence score, and hence the weights of the dynamic network, also have time as a variable.
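One way to obtain a weight that varies with time is to compute a correlation inside a sliding window along the two time series. The windowing scheme is an assumption made for illustration, since the description only states that the score has time as a variable; the function names and the series values are likewise made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def windowed_correlation(series_a, series_b, window=3):
    """Correlation of two time series in each sliding window, one value per window."""
    if len(series_a) != len(series_b):
        raise ValueError("series must have the same length")
    return [
        pearson(series_a[i:i + window], series_b[i:i + window])
        for i in range(len(series_a) - window + 1)
    ]


# Made-up time series for a target and one neighbor (assumption).
target_series = [1.0, 2.0, 3.0, 2.0, 1.0]
neighbor_series = [1.0, 2.0, 3.0, 4.0, 5.0]
weights_over_time = windowed_correlation(target_series, neighbor_series, window=3)
print(weights_over_time)
```

In this toy case the two genes move together early in the series and in opposite directions later, so the edge weight changes sign over time.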

More specifically, the coherence score can be calculated in various ways; as an example, Equation 1 described above may be used.

The major neighboring proteins can be determined from the coherence score. In the example shown in the lower part of B of FIG. 2, a protein located in the ER (square) is determined to be the major neighboring protein of the target protein (the central white circle) under the normal condition, whereas a protein located in the nucleus (NU) (green circle) is determined to be the major neighboring protein at the low (Low) and high (High) cancer grades. The thicker the edge, the larger the coherence score.

Next, the static characteristics and location information of the target protein is applied to the dynamic network generated in S160 to generate a protein feature of a specific external stimulus condition (S170). Because dynamic networks have time as a variable, protein features also have time as a variable.

Next, the position of the target protein may be predicted at a specific time and external stimulus condition using the position-feature model generated in S140 and the protein feature generated in S170 (S180). As shown in C of FIG. 2, the position of the target protein is preferably output as the degree of probability that it exists at a specific position.
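To illustrate what an output in the form of a degree per position might look like, the sketch below scores each location by the share of positive neighbor weight annotated with that location. This weighted vote is an assumed simplification and is not the position-feature model of S140. The names `location_degrees`, `edges`, and `known` are hypothetical.

```python
def location_degrees(target, weighted_edges, known_locations):
    """Share of positive neighbor weight at each location (values in [0, 1])."""
    totals, weight_sum = {}, 0.0
    for (a, b), w in weighted_edges.items():
        if target not in (a, b) or w <= 0:
            continue  # ignore unrelated edges and non-positive weights
        other = b if a == target else a
        weight_sum += w
        for loc in known_locations.get(other, ()):
            totals[loc] = totals.get(loc, 0.0) + w
    if weight_sum == 0:
        return {}
    return {loc: t / weight_sum for loc, t in totals.items()}

# Hypothetical weighted edges and neighbor annotations
edges = {("T", "A"): 0.75, ("T", "B"): 0.25}
known = {"A": ["ER"], "B": ["NU", "ER"]}
degrees = location_degrees("T", edges, known)
```

Because a neighbor may be annotated with several locations, the degrees in this sketch need not sum to one.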

In addition, by repeating the above process, a location map for each time and condition can be generated, which allows the location information of a protein to be compared across times and external stimulus conditions. Through this, it is possible to determine a translocational protein whose position changes as time and external stimulus conditions change.
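The comparison across a location map can be sketched as below. The sketch assumes the map is stored as nested dictionaries and flags a protein as translocational when its highest-scoring location differs between conditions; both the data layout and this decision rule are assumptions, and the protein names and values are invented.

```python
def top_location(degrees):
    """Location with the highest degree."""
    return max(degrees, key=degrees.get)

def find_translocations(location_map):
    """location_map: {protein: {condition: {location: degree}}}."""
    moved = {}
    for protein, by_condition in location_map.items():
        tops = {c: top_location(d) for c, d in by_condition.items() if d}
        if len(set(tops.values())) > 1:
            moved[protein] = tops  # top location changes across conditions
    return moved

# Hypothetical location map for two proteins under two conditions
location_map = {
    "P_a": {"normal": {"CY": 0.8, "NU": 0.2}, "MMS_t2": {"CY": 0.3, "NU": 0.7}},
    "P_b": {"normal": {"CY": 0.9, "NU": 0.1}, "MMS_t2": {"CY": 0.8, "NU": 0.2}},
}
moved = find_translocations(location_map)
```

A stricter rule, for example requiring a minimum change in degree, could be substituted for the top-location comparison.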

As such, the method according to the third embodiment of the present invention is proteome-wide and covers all times and external stimulus conditions.

In addition, a biological process, a molecular function, and the like can be predicted by utilizing this method.

FIG. 9 shows an example of the position-feature model generated in step S140. In an experiment on yeast proteins having at least one known function in each function category, nine types of single-protein static features and 20 types of network features were used. As described above, the optimal features were selected using the DCkNN classifier, and the selected features are marked in black.

FIG. 10 shows the results of verifying the method according to the third embodiment of the present invention using the yeast proteins described above with reference to FIG. 9. These results also show that the method can uncover previously unknown locations and correct erroneously known information.

In particular, the case where the position is predicted to be the same irrespective of time is shown at the top, and the case where the position is predicted to change with time is shown at the bottom.

On the right side of the top of FIG. 11, a location map for each time and condition for YBL072C / RPS8A according to the third embodiment of the present invention is shown. Dithiothreitol (DTT) and methyl methanesulfonate (MMS) conditions were used for contrast with normal conditions. According to the third embodiment of the present invention, YBL072C / RPS8A was predicted to be located in the cytosol under normal conditions, DTT conditions, and MMS conditions.

The upper left side of FIG. 11 shows the verification result. For the DTT condition, 2.5 mM DTT was added and the cells were observed after 2 hours, and the hourly expression profiles were used; the MMS condition was handled in the same way. The result for the normal condition is shown as "before". As shown, it was confirmed that YBL072C / RPS8A is located in the cytosol under normal conditions, DTT conditions, and MMS conditions.

FIG. 11 shows a location map for each time and condition for YJL146W / IDS2 according to the third embodiment of the present invention, and a verification result thereof. Under normal conditions, the probability of being located in the cytoplasm (CY) was high and the probability of being located in the nucleus (NU) was low. However, under the MMS condition, the probability of being located in the cytoplasm (CY) was predicted to decrease gradually with time, and the probability of being located in the nucleus (NU) was predicted to increase gradually. Overlapping and verifying a plurality of experimental results confirmed that the prediction according to the third embodiment of the present invention was correct.

In FIG. 10, the location map for each time and condition for YNL278W / CAF120 according to the third embodiment of the present invention and the verification result thereof are shown. Under normal conditions, it was predicted to be located at various positions including the bud neck (BN); under MMS conditions, the probability of being located in the cytoplasm (CY) was predicted to increase gradually over time, while the probability of being located at the other positions was predicted to decrease gradually. The verification result confirmed that the prediction according to the third embodiment of the present invention was correct.

In addition, the location map for each time and condition for YIL090W / ICE2 according to the third embodiment of the present invention and the verification result thereof are shown. Under normal conditions, the probability of being located in the ER was high, but this probability was predicted to decrease gradually with time under the DTT condition. Similarly, the verification result confirmed that the prediction according to the third embodiment of the present invention was correct.

FIG. 10 shows a location map for each time and condition for YDL060W / TSR1 according to the third embodiment of the present invention and a verification result thereof. Under normal conditions, the probabilities of being located in the nucleolus (NO), the nucleus (NU), and the cytoplasm (CY) were very high, and the probability of being located in the cytoplasm (CY) was predicted to decrease gradually over time. Similarly, the verification result confirmed that the prediction according to the third embodiment of the present invention was correct.

The method according to the third embodiment of the present invention can not only predict the intracellular location of proteins by time and external stimulus condition, but was also confirmed to perform well in predicting the position under normal conditions.

By using the method according to the third embodiment of the present invention as it is, together with a known normal gene expression profile, a location map for the normal condition was generated. This was validated using 33 biological processes (BPs), 22 molecular functions (MFs), and 22 locations from GO. With the method according to the third embodiment of the present invention, the performance increased from the conventional 0.90 (BPs), 0.93 (position), and 0.94 (MFs) to 0.96 (BPs), 0.98 (position), and 0.98 (MFs). This means that the method according to the third embodiment of the present invention, to which the weight is applied, is more effective than the conventional method, to which the weight is not applied.
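The size of the improvement can be worked out directly from the figures quoted above. The short sketch below only does that arithmetic; the text does not name the performance metric, so none is assumed, and the variable names are illustrative.

```python
# Scores quoted in the text; the metric itself is not named there.
conventional = {"BPs": 0.90, "position": 0.93, "MFs": 0.94}
weighted = {"BPs": 0.96, "position": 0.98, "MFs": 0.98}

# Absolute gain per category, rounded to two decimals to avoid float noise.
gain = {k: round(weighted[k] - conventional[k], 2) for k in conventional}

for category, delta in gain.items():
    print(f"{category}: +{delta:.2f}")
```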

As shown in the upper right of FIG. 11, by using the method according to the third embodiment of the present invention, it is possible to predict unknown steady-state locations genome-wide. For yeast, information such as interactions is conventionally known for 5,776 yeast proteins, but the position of, for example, 1,867 yeast proteins could not be accurately predicted. However, by using the third embodiment of the present invention, it was confirmed that unknown positions under normal conditions are also predictable.

As described above, the method according to the third embodiment of the present invention has high performance even under normal conditions, and thus can correct un-identified or mis-identified results under normal conditions, as shown in the lower left of FIG. 10.

For example, although YLR074C / BUD20 has been reported to be located in the nucleus (NU) and the endoplasmic reticulum (ER), the method according to the third embodiment of the present invention predicted a high probability of being located in the nucleus (NU) and a probability close to zero of being located in the endoplasmic reticulum (ER). As a result of the verification, it was not found in the endoplasmic reticulum (ER), as shown below the lower left end of FIG.

In another example, YPL012W / RRP12 has been reported to be located in the nucleus (NU) and cytoplasm (CY), but the method according to the third embodiment of the present invention predicted that the probability of being located in the nucleolus (NO) was higher than that of being located in the nucleus (NU) or cytoplasm (CY) (NO: 0.8, NU: 0.6, CY: 0.6). The verification showed that it is located mainly in the nucleolus, as shown in the lower right of FIG. 10, and is only weakly spread in the nucleus and cytoplasm.

Although the above has been described with reference to preferred embodiments of the present invention, it will be appreciated that those skilled in the art may variously modify and change the present invention without departing from the spirit and scope of the present invention as set forth in the claims below.

Claims

(a) inputting static characteristics of a target protein to generate a static feature;

(b) generating static networks by inputting static protein-protein interaction information;

(c) applying a static property of a neighbor protein of the target protein to the static network to generate a network feature;

(d) generating a location-feature model using the static feature and the network feature;

(e) calculating an adhesion score using an expression profile under a predetermined condition;

(f) assigning the adhesion score to the static network as a weight to generate a dynamic network;

(g) applying the static properties and location information of the target protein to the dynamic network to generate a protein feature under the predetermined condition; and

(h) determining the location of the target protein under the predetermined condition by using the protein feature and the location-feature model.

Method for predicting intracellular location of conditional proteins.
The method of claim 1,

(i) repeating the steps (a) to (h) in a plurality of conditions to determine a translocational protein whose position in the cell is changed according to the change of conditions

Characterized in that it further comprises,

Method for predicting intracellular location of conditional proteins.
The method according to claim 1 or 2,

Step (e) comprises determining the major neighboring protein by the adhesion score,

Method for predicting intracellular location of conditional proteins.
The method according to claim 1 or 2,

Step (h) further comprises outputting the position of the target protein as a degree of probability (possibility degree) of existing at a predetermined position,

Method for predicting intracellular location of conditional proteins.
The method according to claim 1 or 2,

The adhesion score is calculated from any one or more of the similarity of the expression profile pattern and the expression level between the target protein and the neighboring protein,

Method for predicting intracellular location of conditional proteins.
The method according to claim 1 or 2,

The adhesion score is calculated by any one or more of the similarity of the expression profile pattern and the expression level between the gene of the target protein and the gene of the neighboring protein,

Method for predicting intracellular location of conditional proteins.
The method of claim 5,

The adhesion score is calculated by the following formula,

a is the target protein,

b is the neighboring protein,

Φ (a, b) is the adhesion score of a, b,

ρ (a, b) is the Pearson correlation coefficient of the expression level of a, b,

med (a) is the median of the expression levels of a,

med (b) is the median of the expression levels of b,

MEDIAN is the median of the expression levels of the genes used for a, b,

Ψ (x) is a p-value of x,

Method for predicting intracellular location of conditional proteins.
The method according to claim 1 or 2,

In step (a), the static property includes at least one selected from the group consisting of sequence information, chemistry information, motif information, and function information of a single protein,

Method for predicting intracellular location of conditional proteins.
The method according to claim 1 or 2,

In step (b), the static protein-protein interaction information comprises interaction information at the protein level or the gene level,

Method for predicting intracellular location of conditional proteins.
The method according to claim 1 or 2,

In step (c), the neighboring protein of the target protein is determined by the static properties and the static protein-protein interaction information,

Method for predicting intracellular location of conditional proteins.
The method according to claim 1 or 2,

The step (d) is characterized in that the main feature of the static feature and the network feature is selected for each position within the cell using a DCkNN classifier (Divide-and-Conquer k Nearest Neighbor method classifier),

Method for predicting intracellular location of conditional proteins.
The method according to claim 1 or 2,

The condition is the grade of a disease,

Method for predicting intracellular location of conditional proteins.
The method of claim 12,

The condition is a cancer grade,

A method for generating a conditional protein interaction network.
(a) inputting static characteristics of a target protein to generate a static feature;

(b) generating a location-feature model using the static feature;

(c) calculating an adhesion score using expression levels under a predetermined condition;

(d) assigning the adhesion score to the static feature as a weight to generate a dynamic network;

(e) applying the static characteristics and location information of the target protein to the dynamic network to generate a protein feature under the predetermined condition; and

(f) determining the location of the target protein under the predetermined condition using the protein feature and the location-feature model,

The condition is a condition according to cell differentiation,

A method for predicting the intracellular location of a conditional protein according to cell differentiation.
The method of claim 14,

(g) repeating steps (a) to (f) under a plurality of conditions to determine a protein (translocational protein) in which the position in the cell is changed in accordance with the change of conditions according to the differentiation of cells

Characterized in that it further comprises,

A method for predicting the intracellular location of a conditional protein according to cell differentiation.
The method according to claim 14 or 15,

Step (b) comprises determining the major neighboring protein by the adhesion score,

A method for predicting the intracellular location of a conditional protein according to cell differentiation.
The method according to claim 14 or 15,

The step (f) further comprises the step of outputting the position of the target protein as a degree of probability (possibility degree) that exists at a predetermined position.

A method for predicting the intracellular location of a conditional protein according to cell differentiation.
The method according to claim 14 or 15,

The adhesion score is calculated by the similarity of the expression level between the target protein and the neighboring protein,

A method for predicting the intracellular location of a conditional protein according to cell differentiation.
The method according to claim 14 or 15,

The adhesion score is calculated by the similarity of the expression level between the gene of the target protein and the gene of the neighboring protein,

A method for predicting the intracellular location of a conditional protein according to cell differentiation.
The method according to claim 14 or 15,

The adhesion score is calculated by the following formula,

a is the target protein,

b is the neighboring protein,

Φ (a, b) is the adhesion score of a, b,

ρ (a, b) is the Pearson correlation coefficient of the expression level of a, b,

Ψ (x) is the p-value of x,

The values of a for X i and the values for b for Y i are the range of the probability distribution of the degree of correlation of all interacting protein pairs from the input,

n is the number of samples,

S x is the covariance of x,

γ x is the relative expression level of x,

A method for predicting the intracellular location of a conditional protein according to cell differentiation.
The method according to claim 14 or 15,

In step (a), the static property includes at least one selected from the group consisting of sequence information, chemistry information, motif information, and function information of a single protein,

A method for predicting the intracellular location of a conditional protein according to cell differentiation.
The method according to claim 14 or 15,

Step (b) is characterized in that the main feature of the static feature is selected for each position in the cell using a DCkNN classifier (Divide-and-Conquer k Nearest Neighbor method classifier),

A method for predicting the intracellular location of a conditional protein according to cell differentiation.
The method according to claim 14 or 15,

The condition comprises any one or more of a neural stem cell (HB1.F3) condition and an oligodendrocyte (F3.Olig2) condition,

A method for predicting the intracellular location of a conditional protein according to cell differentiation.
(a) inputting static characteristics of a target protein to generate a static feature;

(b) generating static networks by inputting static protein-protein interaction information;

(c) applying a static property of a neighbor protein of the target protein to the static network to generate a network feature;

(d) generating a location-feature model using the static feature and the network feature;

(e) calculating an adhesion score using an expression profile over time under a predetermined external stimulus condition;

(f) assigning the adhesion score to the static network as a weight to generate a dynamic network;

(g) applying the static characteristics and location information of the target protein to the dynamic network to generate a protein feature under the predetermined external stimulus condition; and

(h) determining the location of the target protein at a predetermined time and at the predetermined external stimulus condition using the protein feature and the location-feature model,

The expression profile is a microarray result expressed as a time series,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 24,

(i) repeating steps (a) to (h) in a plurality of time and external stimulation conditions to determine a protein (translocational protein) in which the position in the cell is changed according to the change of time and external stimulation conditions

Characterized in that it further comprises,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 24 or 25,

Step (e) comprises determining the major neighboring protein by the adhesion score,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 24 or 25,

Step (h) further comprises outputting the position of the target protein as a degree of probability (possibility degree) of existing at a predetermined position,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 24 or 25,

The adhesion score is calculated from any one or more of the similarity of the expression profile pattern and the expression level between the target protein and the neighboring protein,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 24 or 25,

The adhesion score is calculated by any one or more of the similarity of the expression profile pattern and the expression level between the gene of the target protein and the gene of the neighboring protein,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 27,

The adhesion score is calculated by the following formula,

a is the target protein,

b is the neighboring protein,

Φ (a, b) is the adhesion score of a, b,

ρ (a, b) is the Pearson correlation coefficient of the expression level of a, b,

med (a) is the median of the expression levels of a,

med (b) is the median of the expression levels of b,

MEDIAN is the median of the expression levels of the genes used for a, b,

Ψ (x) is a p-value of x,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 24 or 25,

In step (a), the static property includes at least one selected from the group consisting of sequence information, chemistry information, motif information, and function information of a single protein,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 24 or 25,

In step (b), the static protein-protein interaction information comprises interaction information at the protein level or the gene level,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 24 or 25,

In step (c), the neighboring protein of the target protein is determined by the static properties and the static protein-protein interaction information,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 24 or 25,

The step (d) is characterized in that the main feature of the static feature and the network feature is selected for each position within the cell using a DCkNN classifier (Divide-and-Conquer k Nearest Neighbor method classifier),

Method for predicting intracellular location of proteins by time and external stimulus conditions.
The method of claim 24 or 25,

The external stimulus condition is any one or more of a DTT (dithiothreitol) condition and an MMS (methyl methanesulfonate) condition,

Method for predicting intracellular location of proteins by time and external stimulus conditions.
Based on the location of the target protein determined according to claim 24,

predicting any one or more of a biological process and a molecular function of the target protein,

A method for predicting the biological or molecular function of a protein by time and external stimulus conditions.