WO2016134659A1

WO2016134659A1 - Method for constructing protein-protein interaction network using text data

Info

Publication number: WO2016134659A1
Application number: PCT/CN2016/074496
Authority: WO
Inventors: 朱斐; 刘全; 王辉; 凌兴宏; 杨洋; 伏玉琛
Original assignee: 苏州大学张家港工业技术研究院
Priority date: 2015-02-25
Filing date: 2016-02-24
Publication date: 2016-09-01
Also published as: CN104657626A

Abstract

Disclosed is a method for constructing a protein-protein interaction network using text data, comprising: (1) establishing a protein set; (2) recording a probability value of interaction for every pair of proteins in the protein set; (3) constructing an initial network structure according to the magnitude of the probability values; (4) repeatedly selecting proteins, giving a positive or negative feedback value, and iterating on the initial network structure to obtain a final network structure. Through repeated selection and interaction, and on the basis of positive feedback, negative feedback and prohibiting feedback, the invention constructs a probability graph of the interaction network by reinforcement learning, seamlessly combining biological knowledge and biological data.

Description

Method for constructing protein interaction network by using text data

[0001] The present invention relates to the field of biology, and more particularly to a method of constructing a protein interaction network using text data.

Background art

[0002] Biological systems contain many networks at different levels and in different organizational forms. The complexity of a living system lies not only in the complexity of its components but also in the complexity of the relationships between them. In analyzing biomolecular networks it is therefore necessary not only to understand the various molecular entities in the network but, more importantly, to understand the interrelationships among them. Proteins are an important class of biological molecules. By interacting with one another they form protein interaction networks that participate in all aspects of life processes, such as biological signal transmission, regulation of gene expression, energy and material metabolism, and cell cycle regulation, and these networks are the basis on which such processes are carried out. Interactions between proteins play a crucial role in the formation of almost all living systems and in the regulation of various physiological and pathological processes. Protein interactions not only provide clues for studying the biological functions of unknown proteins, but also provide the information needed to understand fully the biological mechanisms of a cell or a biological pathway. In biomedicine, the study of protein-protein interactions has very important practical significance: systematically analyzing the interactions of disease-related proteins, understanding how these proteins work in biological systems, understanding the mechanisms of biological signaling and energy metabolism under specific physiological conditions, and understanding the functional links between disease-related proteins are all of great importance.

[0003] At present there are a variety of methods for constructing protein interaction networks, including establishing interaction networks from high-throughput experiments, mining interaction networks from data already available in the literature, and establishing interaction networks by prediction with computational techniques. In general, however, the existing methods for constructing protein interaction networks have a number of problems.

[0004] First, establishing protein interaction networks from high-throughput experiments is generally subject to cost constraints. Many high-throughput experimental methods are still limited to a small number of proteins when studying a disease, and do not construct and analyze the network from the perspective of a broader protein spectrum. The main reason is the high cost of the biochemical experiments used to analyze protein-protein interactions, so that only a small number of proteins, rather than all proteins, are taken as candidates for analysis. Selecting only a small number of proteins for analysis is likely to miss proteins associated with the disease and to miss some biomedical facts; it also limits the perspectives and ideas of the analysis, making it harder to discover new information and new knowledge.

[0005] Second, establishing protein interaction networks from literature data alone is affected by data quality and by the associated biological analysis. Data from different publications may lead to different biological interpretations and conclusions about the same biological phenomenon, and the same batch of data may also be given different biological interpretations and conclusions. This is because complex biological phenomena are not yet comprehensively understood, so that different perspectives lead to different interpretations and conclusions. Therefore, in researching and analyzing complex biological problems, such as the construction of protein interaction networks, it is necessary to integrate data and related information from different sources, to screen this information, discarding the false and retaining the true, and so to deepen a comprehensive, in-depth understanding of disease mechanisms.

[0006] In addition, many computational methods for constructing protein interaction networks focus on the design and improvement of computational models but do not integrate biological knowledge and biological facts well, so that they reach some erroneous conclusions that contradict basic biological knowledge and facts.

Technical solution

[0007] The object of the present invention is to provide a method for constructing a protein interaction network using text data: a new method for constructing protein interaction networks that integrates existing biological domain knowledge, makes full use of the data obtained in the post-genomic era, and takes the characteristics of complex networks into account.

[0008] In order to achieve the above object, the technical solution adopted by the present invention is: A method for constructing a protein interaction network by using text data, comprising:

[0009] (1) establishing a protein collection;

[0010] (2) recording a probability value for the interaction between every pair of proteins in the protein collection;

[0011] (3) constructing an initial network structure according to the size of the probability value;

[0012] (4) repeatedly selecting proteins, giving positive or negative feedback values, and iterating over the initial network structure to obtain the network structure of the final protein interaction network.

[0013] In the above technical solution, the “probability value of interaction between all proteins” is obtained as follows: one protein in the protein collection is arbitrarily selected as the main interacting protein and the other proteins are treated as interacted proteins; the main interacting protein interacts with each interacted protein to form an action relationship; the main interacting protein is then replaced, and the new main interacting protein interacts with the other interacted proteins to form another action relationship; this cycle repeats until the number of cycles reaches a predetermined value. In the repetitive selection case, the calculation proceeds iteratively, and the final action relationship is taken as the probability value of the interaction between the corresponding two proteins.

[0014] In the above technical solution, the “repetitive selection case” is the case in which one protein in the protein set acts as the main interacting protein, interacts with the other, interacted proteins, and these interaction attempts are made repeatedly.

[0015] In a further technical solution, the predetermined value is reached when one or several of the following hold: every protein in the set has acted as the main interacting protein and interacted with the other interacted proteins; no update has occurred over a long interval of the cycle; or the rated number of iteration steps has been reached.

[0016] In the above technical solution, the initial network structure is constructed as follows: each protein in the protein set is a node, and the interaction between two proteins is an edge; the larger the edge value, the greater the probability that the two proteins interact, and the smaller the edge value, the smaller that probability. During construction, interactions with large edge values are strengthened until there is no further update over a long interval, and interactions with small edge values are weakened until the probability value is zero. This yields the network structure formed by the nodes and edges; the final network structure is then obtained from this initial network structure.

[0017] In a further technical solution, the final network structure is obtained as follows: the network is constructed using an entropy weight method, in which the entropy weight of each protein node is calculated and then the entropy weight of the network is calculated; a smaller entropy weight indicates a more stable network, and the initial network structure is updated accordingly.

[0018] In the above technical solution, the protein collection is established as follows:

[0019] a. obtaining the required text through a biomedical literature database;

[0020] b. obtaining the protein names and their identification numbers from a protein interaction database;

[0021] c. according to the protein names obtained in step b, identifying the protein names in the text obtained in step a and marking the corresponding identification numbers;

[0022] d. Constructing the protein set P = {pi}, where pi represents the identification number corresponding to the i-th protein.

Advantageous effects of the invention

[0023] Owing to the above technical solutions, the present invention has the following advantages over the prior art:

[0024] 1. The present invention uses repeated interaction attempts to increase or decrease the edge value of the interaction between two proteins, and the constructed network structure emerges as a result of these dynamics, which ensures the scale-free character of the complex biological network;

[0025] 2. The construction method of the invention suits the characteristics of an unknown biological problem: it obtains the best behavior in an unknown random environment and constructs an unknown protein interaction network, and it can ensure that the network converges to an optimal stable state;

[0026] 3. Biological knowledge and biological data are seamlessly integrated in the process of establishing the network, so that biological facts are reinforced and the network is not constructed at random, which ensures that the network conforms to the basic characteristics of a complex biological network.

Brief description of the drawing

[0027] FIG. 1 is a flow chart of the implementation steps for constructing a protein interaction network using text data;

[0028]-[0029] FIG. 2 is a schematic diagram of the node degree probability distribution of a protein interaction network constructed from text data by the reinforcement learning method using average reward values;

[0030] FIG. 3 is a schematic diagram of the node degree probability density distribution of a protein interaction network constructed from text data by the reinforcement learning method using average reward values.

Embodiments of the invention

[0032] The present invention will be further described below in conjunction with the accompanying drawings and embodiments:

[0033] Embodiment 1: Referring to FIG. 1, a method for constructing a protein interaction network using text data includes:

[0034] (1) establishing a protein collection;

[0035] (2) recording a probability value for the interaction between every pair of proteins in the protein collection;

[0036] (3) constructing an initial network structure according to the size of the probability value;

[0037] (4) repeatedly selecting proteins, giving positive or negative feedback values, and iterating over the initial network structure to obtain the network structure of the final protein interaction network.

[0038] The protein collection is established as follows:

[0039] a. obtaining the required text through the biomedical literature database;

[0040] b. obtaining the protein names and their identification numbers from the protein interaction database;

[0041] c. according to the protein names obtained in step b, identifying the protein names in the text obtained in step a and marking the corresponding identification numbers;

[0042] d. Constructing the protein set P = {pi}, where pi represents the identification number corresponding to the i-th protein.
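Steps a to d can be illustrated with a minimal sketch. It assumes the name-to-identifier table from step b is already available as an in-memory dictionary and that names are matched case-sensitively as whole tokens; the function name `tag_proteins`, the `name[ID]` marking style, and the names and identifiers in the example are all hypothetical and are not taken from the patent or from any database.

```python
import re

def tag_proteins(text, name_to_id):
    """Mark known protein names in text with their IDs; return (annotated_text, protein_set)."""
    if not name_to_id:
        return text, set()
    # Try longer names first so a long name is preferred over a shorter prefix of it.
    names = sorted(name_to_id, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])("
        + "|".join(re.escape(n) for n in names)
        + r")(?![A-Za-z0-9])"
    )
    found = set()

    def mark(match):
        pid = name_to_id[match.group(1)]
        found.add(pid)                      # step d: collect the set P = {pi}
        return f"{match.group(1)}[{pid}]"   # step c: mark the identification number

    return pattern.sub(mark, text), found

# Hypothetical names and identification numbers, for illustration only.
name_to_id = {"ProtA": "ID0001", "ProtB": "ID0002"}
annotated, protein_set = tag_proteins("ProtA binds ProtB; ProtAB is unrelated.", name_to_id)
```

The lookarounds keep a known name from matching inside a longer token, which is why `ProtAB` is left untagged in the example. Real protein nomenclature (synonyms, case variants, hyphenation) would need a more careful matcher than this.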

[0043] The "probability value of interaction between all proteins" is obtained as follows: one protein in the protein collection is arbitrarily selected as the main interacting protein and the other proteins are treated as interacted proteins; the main interacting protein interacts with each interacted protein to form an action relationship; the main interacting protein is then replaced, and the new main interacting protein interacts with the other interacted proteins to form another action relationship; this cycle repeats until the number of cycles reaches a predetermined value. In the repetitive selection case, the calculation proceeds iteratively, and the final action relationship is taken as the probability value of the interaction between the corresponding two proteins.

[0044] The "repetitive selection case" is the case in which one protein in the protein set acts as the main interacting protein, interacts with the other, interacted proteins, and these interaction attempts are made repeatedly.

[0045] The predetermined value is reached when one or several of the following hold: every protein in the set has acted as the main interacting protein and interacted with the other interacted proteins; no update has occurred over a long interval; or the rated number of iteration steps has been reached.

[0046] The network structure is constructed as follows: each protein in the protein set is a node, and the interaction between two proteins is an edge; the larger the edge value, the greater the probability that the two proteins interact, and the smaller the edge value, the smaller that probability. During construction, interactions with large edge values are strengthened until there is no further update over a long interval, and interactions with small edge values are weakened until the probability value is zero. This finally yields the network structure formed by nodes and edges, which emerges as a result of the dynamics of the learning behavior.

[0047] The final network structure is obtained as follows: the network is constructed using an entropy weight method, in which the entropy weight of each protein node is calculated and then the entropy weight of the network is calculated; a smaller entropy weight indicates a more stable network, and the initial network structure is updated accordingly.

[0048] In this embodiment, a reinforcement learning method is used to construct the network structure, and the protein interaction network is established within a reinforcement learning framework. Nodes represent proteins and are denoted node 1, ..., node n; edges represent the interactions between proteins. A node obtains an action under the decision of the reinforcement learning agent. The action may be that the protein has a cooperative relationship with another protein, indicating that the two proteins interact; it may be that the protein and another protein are mutually exclusive, indicating that the two proteins do not interact; or it may be that whether the two proteins interact cannot be determined. The node receives a reward each time an interaction attempt is made, and the value of the reward determines which interactions are strengthened. The selections are made repeatedly. As time advances, a protein can adjust its strategy and decide again, and randomness is introduced so that it can explore and adapt to the environment. If the results obtained are unsatisfactory, it can either change its strategy or switch to other proteins. In this way, both the evolution of the protein interaction network and the evolution of individual strategies are taken into account. The final protein interaction network emerges as a result of the dynamics of the agent's learning behavior.

[0049] A node i randomly selects other nodes to access, and the selection probability is obtained from the relative selection weights that node i assigns to each of the other nodes. Each node has a selection policy for accessing other nodes, and this policy is reinforced on each access. Node i has a selection weight vector <wi1,...,wiN> that is used to calculate the probability of selecting each of the other nodes as time advances. Since a new node added in each round connects to the existing node i with a certain probability, the selection weight vector wi(t) of any node i at time t is a random variable. If the selection weight of node i at time t-1 is wi(t-1), then its selection weight wi(t) at time t depends only on its selection weight at time t-1 and on the new nodes that may connect at time t, regardless of the history before time t-1.
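One way to read "relative selection weights" is as plain normalization of node i's weight vector, sketched below. The normalization, the uniform fallback when all weights are zero, and the names `selection_probabilities` and `choose_partner` are assumptions made for illustration; the source text does not give the exact formula. Weights are assumed to be non-negative.

```python
import random

def selection_probabilities(weights_i):
    """Normalise node i's selection weights {j: w_ij} into selection probabilities."""
    total = sum(weights_i.values())
    if total <= 0:
        # Assumption: with no information yet, every other node is equally likely.
        n = len(weights_i)
        return {j: 1.0 / n for j in weights_i}
    return {j: w / total for j, w in weights_i.items()}

def choose_partner(weights_i, rng=random):
    """Randomly pick one node to access, in proportion to its selection probability."""
    probs = selection_probabilities(weights_i)
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[j] for j in nodes], k=1)[0]

# Illustrative weight vector of one node over three other nodes.
w_i = {"p2": 3.0, "p3": 1.0, "p4": 0.0}
probs = selection_probabilities(w_i)
partner = choose_partner(w_i, random.Random(0))
```

Because the choice at time t uses only the current weight vector, the sketch also reflects the property stated above that wi(t) depends on wi(t-1) and not on earlier history.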

[0050] Most reinforcement learning algorithms use a function, called a value function, that evaluates how good a given state (or state-action pair) is. The function Vh is the state value function of policy h. Under policy h, the value of taking action u in state x is the expected return obtained by starting from state x, taking action u, and following policy h thereafter; it is denoted Qh(x,u). Qh is the action value function of policy h and measures how good it is to take action u in state x.

[0051] The V value and the Q value need to be updated as the time steps advance. A common method is to use discounted cumulative rewards. However, this method has shortcomings: the discount factor must be set manually, and the parameter settings depend on the application. In the process of network evolution, the formation of the edges that represent relationships between nodes should be independent of the order in which they appear. In practice, however, many factors cause the evolution order to differ, such as the order in which data are read. In constructing the interaction network, the same data set and the same method should yield the same final result regardless of the intermediate evolution order. The discounted cumulative reward method is therefore unsuitable, and the constructed network must be evaluated with a method of calculating Q and V values that is independent of the network evolution order.

[0052] The present invention measures the stability of a network by using entropy, a measure of randomness or irregularity. The greater the entropy, the greater the randomness; the smaller the entropy, the smaller the randomness, which is consistent with the changing state of a biological system. If wdi denotes the weighted degree (wd) of node i, the local entropy (le) of node i is defined as shown in Equation 1.

[0053] Here wdi is the sum of the weights of the edges between node i and all nodes associated with it, and wij is the weight of the edge between node i and node j.

[0054] The network entropy (ne) of a network is the sum of the entropies of all nodes, as shown in Equation 2.
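Equations 1 and 2 are not reproduced in this text. One form consistent with the surrounding definitions, offered here as an assumption rather than as the patent's own formulas, takes the local entropy to be the Shannon entropy of a node's normalized edge weights and the network entropy to be the sum over nodes:

```latex
wd_i = \sum_{j} w_{ij}, \qquad
le_i = -\sum_{j} \frac{w_{ij}}{wd_i} \log \frac{w_{ij}}{wd_i} \tag{1}

ne = \sum_{i} le_i \tag{2}
```

Under this reading, a node whose weight is concentrated on a few edges has low local entropy, and a node whose weight is spread evenly over many edges has high local entropy.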

[0055] After long-term iteration, the resulting protein interaction network is not a random network and has a stable topology. The optimal topology of the protein interaction network is therefore the one with the smallest network entropy, which gives the most stable final network structure.
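The comparison of candidate topologies by network entropy can be sketched as follows. The Shannon form of the local entropy used here is an assumption, since the original formula is not legible in this text; only the statements that wdi sums the edge weights of node i and that the network entropy sums the node entropies come from the description. The function names and the two toy networks are illustrative.

```python
import math

def local_entropy(weights_i):
    """Local entropy of one node from its edge weights {j: w_ij} (assumed Shannon form)."""
    wd_i = sum(weights_i.values())      # weighted degree of node i
    if wd_i <= 0:
        return 0.0                      # a node with no weighted edges contributes nothing
    le = 0.0
    for w in weights_i.values():
        if w > 0:                       # terms with zero weight are taken as 0
            p = w / wd_i
            le -= p * math.log(p)
    return le

def network_entropy(weights):
    """Network entropy: the sum of the local entropies; weights is {i: {j: w_ij}}."""
    return sum(local_entropy(row) for row in weights.values())

# Two toy networks on the same nodes: evenly spread weights vs. concentrated weights.
uniform = {"a": {"b": 1.0, "c": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
skewed = {"a": {"b": 9.0, "c": 1.0}, "b": {"a": 9.0}, "c": {"a": 1.0}}
```

With these inputs the concentrated network has the lower entropy, so under the criterion of the description it would be preferred as the more stable topology.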

[0056] The specific implementation steps are:

[0057] Step (1): obtaining the required text from the PubMed biomedical literature database through the E-utilities interface that PubMed provides;

[0058] Step (2): downloading the protein names and their identification numbers from the protein interaction databases DIP, IntAct and STRING;

[0059] Step (3): identifying the protein names in the text and marking them with their identification numbers;

[0060] Step (4): The user gives a set of proteins in the protein interaction network to be constructed P = {pi}, where pi represents the identification number corresponding to the i-th protein;

[0061] Step (5): taking every pair of proteins from the protein set P = {pi} to form the set of candidate protein action pairs all_pairs;

[0062] Step (6): setting the available candidate protein action pair set avaiable_pairs = all_pairs;

[0063] Step (7): if the available candidate protein action pair set avaiable_pairs still contains unprocessed action pairs, taking one action pair (pi, pj) and going to the next step; otherwise going to step (14);

[0064] Step (8): removing the action pair (pi, pj) from the available candidate protein action pair set: avaiable_pairs = avaiable_pairs - {(pi, pj)};

[0065] Step (9): initializing the weight of the interaction between protein pi and protein pj: weight(pi,pj) = 0.0;

[0066] Step (10): searching the protein interaction databases DIP, IntAct and STRING, in turn, for an interaction between proteins pi and pj;

[0067] Step (11): if the DIP database reports an interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) + preset reward value; otherwise, if the DIP database clearly indicates that there is no interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) - preset penalty value; otherwise, if no information on an interaction between proteins pi and pj is found in the DIP database, the value of weight(pi,pj) remains unchanged;

[0068] Step (12): if the IntAct database reports an interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) + preset reward value; otherwise, if the IntAct database clearly indicates that there is no interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) - preset penalty value; otherwise, if no information on an interaction between proteins pi and pj is found in the IntAct database, the value of weight(pi,pj) remains unchanged;

[0069] Step (13): if the STRING database reports an interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) + preset reward value; otherwise, if the STRING database clearly indicates that there is no interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) - preset penalty value; otherwise, if no information on an interaction between proteins pi and pj is found in the STRING database, the value of weight(pi,pj) remains unchanged;

[0070] Because the protein interaction databases DIP, IntAct and STRING contain rich biological domain knowledge, setting the initial values in this way increases the weight of protein interactions that are known to exist and reduces the weight of protein interactions that are known not to exist.
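Steps (5) to (13) can be sketched as a single pass over all protein pairs. This is a minimal illustration under stated assumptions: each database is represented by an in-memory lookup table rather than by a real query to DIP, IntAct or STRING, the reward and penalty values of 1.0 are placeholders, and the function name `initial_weights` and the protein labels are hypothetical.

```python
from itertools import combinations

REWARD = 1.0    # preset reward value (placeholder)
PENALTY = 1.0   # preset penalty value (placeholder)

def initial_weights(proteins, evidence_by_db, reward=REWARD, penalty=PENALTY):
    """Initialise weight(pi, pj) for every protein pair from database evidence.

    evidence_by_db maps a database name to {frozenset({pi, pj}): True or False}:
    True means an interaction is reported, False means the database explicitly
    reports no interaction, and a missing key means no information.
    """
    weights = {}
    for pi, pj in combinations(sorted(proteins), 2):    # steps (5)-(8): every pair once
        w = 0.0                                         # step (9)
        for evidence in evidence_by_db.values():        # steps (10)-(13): one database at a time
            verdict = evidence.get(frozenset((pi, pj)))
            if verdict is True:
                w += reward
            elif verdict is False:
                w -= penalty
            # no information: the weight stays unchanged
        weights[(pi, pj)] = w
    return weights

# Stand-in evidence tables, for illustration only.
evidence = {
    "DIP": {frozenset(("p1", "p2")): True},
    "IntAct": {frozenset(("p1", "p2")): True, frozenset(("p1", "p3")): False},
    "STRING": {},
}
weights = initial_weights({"p1", "p2", "p3"}, evidence)
```

Keying the evidence by `frozenset` makes the lookup independent of the order in which a pair is written, which matches the undirected pairs formed in step (5).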

[0071] Step (14): obtaining a protein interaction network rich in biomedical knowledge together with its initialized weight matrix, N = (pi,pj,weight(pi,pj));

[0072] Step (15): initializing the candidate protein set candidate_protein by adding all proteins to it;

[0073] Step (16): selecting one protein pi from the candidate protein set candidate_protein;

[0074] Step (17): removing protein pi from the candidate protein set: candidate_protein = candidate_protein - {pi};

[0075] Step (18): initializing the candidate paired protein set candidate_pair_protein by adding all proteins to it;

[0076] Step (19): if the candidate paired protein set candidate_pair_protein is not empty, selecting one paired protein pj from candidate_pair_protein; otherwise, going to step (17);

[0077] Step (20): calculating the current network entropy using the network entropy formula (Equation 2);

[0078] Step (21): using a greedy strategy to decide whether there is an interaction between proteins pi and pj;

[0079] Step (22): if Qf is less than Q', it is considered that there is no interaction between pi and pj, and weight(pi,pj) is set to 0.0; otherwise it is considered that there is an interaction between pi and pj, and weight(pi,pj) = Qf;

[0080] Step (23): updating the probability of interaction between proteins pi and pj;

[0081] Step (24): updating the protein interaction network N with a new weight(pi, pj) value;

[0082] Step (25): once the rated number of iteration steps has been reached, no further updates are made, and the final network structure is obtained.

[0083] The termination condition may be that the weight matrix has not been updated within a long interval or that a predetermined number of iterations has been reached. The weight matrix can be used for action selection, that is, for choosing interactions between nodes, with the selection probabilities derived from the weights. The final weight matrix can therefore be regarded as the topology of the network, and the process of updating the weight matrix as the evolution process that builds the network.
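The two termination conditions can be sketched as a generic driver loop. This is an illustration only: the update rule passed in as `update_step` stands in for steps (16) to (24), and `capped_update`, `patience`, `tol` and the return labels are hypothetical names and parameters that do not appear in the patent.

```python
def iterate_until_stable(weights, update_step, max_steps=1000, patience=50, tol=1e-9):
    """Apply update_step repeatedly; stop after `patience` unchanged steps or max_steps.

    update_step(weights) must return a new {pair: weight} dict with the same keys.
    Returns (final_weights, steps_taken, reason).
    """
    unchanged = 0
    for step in range(1, max_steps + 1):
        new_weights = update_step(weights)
        changed = any(abs(new_weights[k] - weights[k]) > tol for k in weights)
        weights = new_weights
        unchanged = 0 if changed else unchanged + 1
        if unchanged >= patience:          # no update within a long interval
            return weights, step, "stable"
    return weights, max_steps, "max_steps"  # rated number of iteration steps reached

def capped_update(weights):
    # Placeholder update rule: strengthen every edge by 0.25 up to a cap of 1.0.
    return {pair: min(1.0, w + 0.25) for pair, w in weights.items()}

final, steps, reason = iterate_until_stable(
    {("p1", "p2"): 0.0}, capped_update, max_steps=100, patience=3
)
short, short_steps, short_reason = iterate_until_stable(
    {("p1", "p2"): 0.0}, capped_update, max_steps=3, patience=3
)
```

In the first call the weight reaches its cap after four updates and the loop stops three unchanged steps later; in the second call the step limit is reached first.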

Claims

[Claim 1] A method for constructing a protein interaction network using text data, comprising:

(1) establishing a protein collection;

(2) recording a probability value for the interaction between every pair of proteins in the protein collection;

(3) constructing an initial network structure according to the magnitude of the probability values;

(4) repeatedly selecting proteins, giving positive or negative feedback values, and iterating over the initial network structure to obtain the network structure of the final protein interaction network.

[Claim 2] The method for constructing a protein interaction network according to claim 1, wherein the "probability value of interaction between all proteins" is obtained as follows: one protein in the protein collection is arbitrarily selected as the main interacting protein and the other proteins are treated as interacted proteins; the main interacting protein interacts with each interacted protein to form an action relationship; the main interacting protein is then replaced, and the new main interacting protein interacts with the other interacted proteins to form another action relationship; this cycle repeats until the number of cycles reaches a predetermined value; and, in the repetitive selection case, the calculation proceeds iteratively and the final action relationship is taken as the probability value of the interaction between the corresponding two proteins.

[Claim 3] The method for constructing a protein interaction network according to claim 2, wherein the "repetitive selection case" is the case in which one protein in the protein collection acts as the main interacting protein, interacts with the other, interacted proteins, and these interaction attempts are made repeatedly.

[Claim 4] The method for constructing a protein interaction network according to claim 2, wherein the predetermined value is reached when one or several of the following hold: every protein in the set has acted as the main interacting protein and interacted with the other interacted proteins; no update has occurred over a long interval of the cycle; or the rated number of iteration steps has been reached.

[Claim 5] The protein interaction network construction method according to claim 1, wherein the initial network structure is constructed as follows: each protein in the protein set is a node, and the interaction between two proteins is an edge; the larger the edge value, the greater the probability that the two proteins interact, and the smaller the edge value, the smaller that probability; during construction, interactions with large edge values are strengthened until there is no further update over a long interval, and interactions with small edge values are weakened until the probability value is zero, finally yielding the network structure formed by the nodes and edges; and the final network structure is then obtained from the initial network structure so constructed.

[Claim 6] The protein interaction network construction method according to claim 5, wherein the final network structure is obtained as follows: the network is constructed using an entropy weight method, in which the entropy weight of each protein node is calculated and then the entropy weight of the network is calculated; a smaller entropy weight indicates a more stable network, and the initial network structure is updated accordingly.

[Claim 7] The protein interaction network construction method according to claim 1, wherein the protein collection is established as follows:

a. obtaining the required text through the biomedical literature database;

b. obtaining the protein names and their identification numbers from the protein interaction database;

c. according to the protein names obtained in step b, identifying the protein names in the text obtained in step a, and marking the corresponding identification numbers;

d. constructing the protein set P = {pi}, where pi represents the identification number corresponding to the i-th protein.