WO2016134659A1 - Method for constructing protein-protein interaction network using text data - Google Patents

Publication number
WO2016134659A1
Authority
WO
WIPO (PCT)
Prior art keywords
protein
interaction
network
proteins
network structure
Application number
PCT/CN2016/074496
Other languages
French (fr)
Chinese (zh)
Inventor
朱斐
刘全
王辉
凌兴宏
杨洋
伏玉琛
Original Assignee
苏州大学张家港工业技术研究院
Application filed by 苏州大学张家港工业技术研究院
Publication of WO2016134659A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B5/20 Probabilistic models
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • The present invention relates to the field of biology, and more particularly to a method for constructing a protein interaction network using text data.
  • Biomolecular systems contain many networks at different levels and in different organizational forms. The most important feature of the complexity of a living system lies not only in the complexity of its components but, even more, in the complexity of the relationships among those components. Therefore, when analyzing biomolecular networks, it is necessary not only to understand the individual molecular entities in the network but, more importantly, to understand the interrelationships among them. Proteins are an important class of biological molecules. Through their interactions with one another they form protein interaction networks that take part in every aspect of life processes, such as biological signal transmission, regulation of gene expression, energy and material metabolism, and cell cycle regulation, and these networks are the basis on which many biological functions are realized. Interactions between proteins play a crucial role in the formation of almost all living systems and in the regulation of various physiological and pathological processes.
  • Protein interactions not only provide clues for studying the biological functions of unknown proteins, but also provide the information needed to fully understand the biological mechanisms of a cell or a biological pathway. In biomedicine, the study of protein-protein interactions is of great practical significance. Systematic analysis of the interactions among proteins related to a given class of disease is important for understanding how these proteins work in biological systems, for understanding the response mechanisms of biological signaling and of energy and material metabolism under particular physiological conditions, and for understanding the functional links between disease-related proteins.
  • The object of the present invention is to provide a method for constructing a protein interaction network using text data: a new construction method that integrates existing knowledge of the biological domain, makes full use of the data obtained in the post-genomic era, and at the same time takes the characteristics of complex networks into account.
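  • A method for constructing a protein interaction network using text data, comprising: (1) establishing a protein set; (2) recording, for every pair of proteins in the set, the probability value that the two interact; (3) constructing an initial network structure according to the probability values; (4) repeatedly selecting proteins, assigning positive or negative feedback values, and iterating on the initial network structure to obtain the final network structure.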
  • The "probability value that every two proteins interact" is obtained as follows: one protein in the protein set is arbitrarily selected as the main interacting protein, and the other proteins serve as interacted proteins.
  • The main interacting protein interacts with each interacted protein, forming an action relationship; the main interacting protein is then replaced, and the new one again interacts with the other interacted proteins, forming another action relationship. This cycle repeats until the number of cycles reaches a predetermined value.
  • In the case of repeated selection, the value is calculated iteratively, and the final action relationship is taken as the probability value that the corresponding two proteins interact.
  • The "case of repeated selection" covers the case where one protein in the protein set and another protein interact with each other in the roles of main and interacted protein, and the case where they are repeatedly selected again to interact with each other.
  • The predetermined value is one or more of the following: every protein in the set has interacted, as the main interacting protein, with the other interacted proteins; there has been no update over a relatively long period of cycling; or the rated number of iteration steps has been reached.
  • The initial network structure is constructed as follows: each protein in the protein set is a node, and each pairwise interaction is an edge. The larger the edge value, the greater the probability that the two proteins interact; the smaller the edge value, the smaller that probability.
  • During construction, interactions with large edge values are strengthened until there is no further update over a relatively long period, and the others are weakened until their probability value is zero, which finally yields the network structure built from nodes and edges. The final network structure is then derived from the constructed initial network structure.
  • The final network structure is obtained as follows: the network is constructed using the entropy weight method; the entropy weight of each protein node is calculated, and then the network entropy weight is calculated. The smaller the entropy weight, the more stable the network, and the initial network structure is updated accordingly.
  • Owing to the above technical solution, the present invention has the following advantages over the prior art:
  • 1. The present invention repeatedly attempts interactions and increases or decreases the edge value of each pairwise interaction, so that the constructed network structure emerges as the result of a dynamic process, which ensures the scale-free property that a complex biological network should have;
  • 2. The construction method of the invention suits the unknown nature of biological problems: it obtains the best behavior in an unknown random environment, constructs a protein interaction network whose structure is unknown in advance, and can ensure that the network converges to an optimal stable state;
  • FIG. 1 is a flow chart of the implementation steps of the method for constructing a protein interaction network using text data;
  • FIG. 2 is a schematic diagram of the node degree probability distribution of a protein interaction network constructed from text data by the reinforcement learning method using average reward values;
  • FIG. 3 is a schematic diagram of the node degree probability density distribution of a protein interaction network constructed from text data by the reinforcement learning method using average reward values.
  • Embodiment 1: Referring to FIG. 1, a method for constructing a protein interaction network using text data includes:
  • The "probability value that every two proteins interact" is obtained as follows: one protein in the protein set is arbitrarily selected as the main interacting protein, and the other proteins serve as interacted proteins. The main interacting protein interacts with each interacted protein, forming an action relationship; the main interacting protein is then replaced, and the new one again interacts with the other interacted proteins, forming another action relationship. This cycle repeats until the number of cycles reaches a predetermined value. In the case of repeated selection, the value is calculated iteratively, and the final action relationship is taken as the probability value that the corresponding two proteins interact.
  • The "case of repeated selection" covers the case where one protein in the protein set and another protein interact with each other in the roles of main and interacted protein, and the case where they are repeatedly selected again to interact with each other.
  • The predetermined value is one or more of the following: every protein in the set has interacted, as the main interacting protein, with the other interacted proteins; there has been no update over a relatively long period of cycling; or the rated number of iteration steps has been reached.
  • Each protein in the protein set is a node, and each pairwise interaction is an edge. The larger the edge value, the greater the probability that the two proteins interact; the smaller the edge value, the smaller that probability.
  • The network structure emerges as a result of the dynamics of the learning behavior.
  • The final network structure is obtained as follows: the network is constructed using the entropy weight method; the entropy weight of each protein node is calculated, and then the network entropy weight is calculated. The smaller the entropy weight, the more stable the network, and the initial network structure is updated accordingly.
  • A reinforcement learning method is used to construct the network structure. The protein interaction network is built within a reinforcement learning framework: nodes represent proteins, denoted node 1, ..., node n, and edges represent the actions between proteins.
  • A node obtains an action under the decision of the reinforcement learning agent.
  • The action may be that the protein has a cooperative relationship with another protein, indicating that the two proteins interact; it may be that the protein has a mutually exclusive relationship with another protein, indicating that the two do not interact; or it may be that whether the two proteins interact is still to be determined.
  • The node receives a reward each time it attempts an interaction, and the value of the reward determines which interactions will be strengthened.
  • A protein can also adjust its strategy, make a new decision, and introduce randomness in order to explore and adapt to the environment. If the result obtained is unsatisfactory, it can choose to change its strategy or to switch to other proteins. In this way both the evolution of the protein interaction network and the evolution of individual strategies are taken into account. The final protein interaction network emerges as a result of the dynamics of the agents' learning behavior.
  • A node i randomly selects other nodes to access; the selection probability is obtained by calculating the relative selection weights assigned to each of the other nodes.
  • Each node has its own preferred policy for selecting which nodes to access, and the policy is reinforced with each access.
  • Node i has a selection weight vector <w_i1, ..., w_iN> that is used to calculate the probability of selecting each of the other nodes.
  • The selection weight vector w_i(t) of any node i at time t is a random variable. If the selection weight of node i at time t-1 is w_i(t-1), then its selection weight w_i(t) at time t depends only on its selection weight at time t-1
  • and on the new node that can be connected at time t, regardless of the history before time t-1.
  • Most reinforcement learning algorithms have a function, called a value function, that evaluates how good a given state (or state-action pair) is.
  • The function V^h is the state value function of policy h.
  • Under policy h, the value of taking action u in state x is the expected return obtained by starting from state x, taking action u, and thereafter following policy h; it is denoted Q^h(x,u).
  • Q^h is the action value function of policy h, which measures how good it is to take action u in state x.
  • The V value and the Q value need to be updated at each time step.
  • A common method is to use the discounted cumulative reward.
  • However, this method requires the discount factor to be set manually, and the parameter setting depends on the application.
  • The formation of edges between nodes, which represent relationships, should be independent of the order in which the nodes appear.
  • Many factors can cause the order of evolution to differ, such as the order in which the data are read.
  • The same data processed by the same method should yield a consistent final result. Discounted cumulative reward weights each reward by the time at which it is received, so it is not suitable here.
  • The present invention measures the stability of the network with entropy, which quantifies randomness or irregularity. The greater the entropy, the greater the randomness; the smaller the entropy, the smaller the randomness, which is consistent with the changing state of a biological system. If wd_i denotes the weighted degree (wd) of node i, the local entropy (le) of node i is defined as shown in Equation 1.
  • wd_i is the sum of the weights of the actions between node i and all nodes associated with it.
  • w_ij is the weight of the edge between node i and node j.
  • The network entropy (ne) of a network is the sum of the local entropies of all its nodes, as shown in Equation 2.
  • The resulting protein interaction network is not a random network and has a stable topology. The optimal topology of the protein interaction network therefore has the smallest network entropy, and minimizing the network entropy yields the most stable final network structure.
  • Step (1): obtaining the required text from the biomedical literature database PubMed through the E-utilities interface that PubMed provides;
  • Step (2): downloading protein names and their identification numbers from the protein interaction databases DIP, IntAct and STRING;
  • Step (3): identifying the protein names in the text, using the identification number
  • Step (10): searching for an interaction between proteins pi and pj in each of the protein interaction databases DIP, IntAct and STRING;
  • Step (11): if the DIP database contains an interaction between proteins pi and pj, then weight(pi, pj)
  • weight(pi, pj) - preset penalty value; otherwise, if the DIP database contains no information on proteins pi and pj, the value of weight(pi, pj) remains unchanged;
  • Step (13): if in STRING
  • Step (19): if the candidate paired protein set candidate_pair_protein is not empty, select a pair of proteins pi, pj from candidate_pair_protein; otherwise, go to step (17);
  • Step (20): apply the formula:
  • Step (21): using a greedy strategy to decide whether there is an interaction between proteins pi and pj;
  • Step (22): if Qf is less than Q', there is no interaction between pi and pj, and weight(pi, pj) is set
  • Step (23): updating the probability of interaction between proteins pi and pj to
  • Step (24): updating the protein interaction network N with the new weight(pi, pj) value;
  • Step (25): once the rated number of iteration steps is reached, no further updates are made, and the final network structure is obtained.
  • The termination condition may be that the matrix weight has not been updated for a relatively long period, or that a predetermined number of iterations has been reached.
  • The matrix weight can be used for action selection, that is, for choosing which nodes interact.
  • The selection probability is computed from the matrix weight; the final matrix weight can therefore be regarded as the topology of the network.
  • The update process of the matrix weight can be regarded as the evolution of the network being built.
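
The text says that the matrix weight drives the choice of interaction partner, but the selection-probability formula itself is not reproduced here. As a minimal sketch, assuming the probability is simply proportional to the relative weights in row i of the matrix (the function name `selection_probabilities` and the uniform fallback are illustrative assumptions, not part of the patent text):

```python
def selection_probabilities(weight, i):
    """Probability that node i selects each node j.

    Assumption (not from the source): p_ij = w_ij / sum_k w_ik over k != i.
    If node i has no positive weight yet, fall back to a uniform choice.
    """
    row = [0.0 if j == i else float(w) for j, w in enumerate(weight[i])]
    total = sum(row)
    others = len(row) - 1
    if total == 0.0:
        if others == 0:
            return [0.0]
        return [0.0 if j == i else 1.0 / others for j in range(len(row))]
    return [w / total for w in row]


# Hypothetical 3-protein weight matrix, symmetric, zero diagonal.
weight = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],
]
probs = selection_probabilities(weight, 0)
```

Under this assumption a node never selects itself, and a row with no evidence yet gives every other node the same chance, which keeps exploration possible early in the iteration.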

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Physiology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed is a method for constructing a protein-protein interaction network using text data, comprising: (1) establishing a protein set; (2) recording the probability value of interaction between every two proteins in the protein set; (3) constructing an initial network structure according to the size of the probability values; (4) repeatedly selecting proteins, giving a positive or negative feedback value, and continuously iterating on the initial network structure so as to obtain a final network structure. By means of repeated selection and interaction, and on the basis of positive feedback, negative feedback and prohibiting feedback, the present invention constructs a probability graph of the interaction network through reinforcement learning and combines it seamlessly with biological knowledge and biological data.

Description

Method for constructing a protein interaction network using text data
Technical Field
[0001] The present invention relates to the field of biology, and more particularly to a method for constructing a protein interaction network using text data.
Background Art
[0002] Biological systems contain many networks at different levels and in different organizational forms. The most important feature of the complexity of a living system lies not only in the complexity of its components but, even more, in the complexity of the relationships among those components. Therefore, when analyzing biomolecular networks, it is necessary not only to understand the individual molecular entities in the network but, more importantly, to understand the interrelationships among them. Proteins are an important class of biological molecules. Through their interactions with one another they form protein interaction networks that take part in every aspect of life processes, such as biological signal transmission, regulation of gene expression, energy and material metabolism, and cell cycle regulation, and these networks are the basis on which many biological functions are realized. Interactions between proteins play a crucial role in the formation of almost all living systems and in the regulation of various physiological and pathological processes. Protein interactions not only provide clues for studying the biological functions of unknown proteins, but also provide the information needed to fully understand the biological mechanisms of a cell or a biological pathway. In biomedicine, the study of protein-protein interactions is of great practical significance. Systematic analysis of the interactions among proteins related to a given class of disease is important for understanding how these proteins work in biological systems, for understanding the response mechanisms of biological signaling and of energy and material metabolism under particular physiological conditions, and for understanding the functional links between disease-related proteins.
[0003] At present, various methods are used to construct protein interaction networks, mainly including building interaction networks on the basis of high-throughput experiments, mining interaction networks from data already available in the literature, and building interaction networks by computational prediction. On the whole, however, many methods for constructing protein-protein interaction networks have shortcomings.
[0004] First, building protein interaction networks on the basis of high-throughput experiments is generally constrained by cost. When studying a disease, many high-throughput experimental methods are still limited to a small number of proteins and do not construct and analyze the network from the perspective of a broader protein map. The main reason is that biochemical experiments for analyzing protein-protein interactions are expensive, so only a few proteins can be selected, and it is not possible to analyze all proteins as a broad pool of candidates. Selecting only a few proteins for analysis not only makes it very likely that proteins related to the disease are overlooked and some biomedical facts are missed, but also limits the perspective and approach of the analysis, making it harder to discover new information and new knowledge.
[0005] Second, building protein interaction networks from literature data alone is affected by data quality and by the associated biological analysis. Sometimes data from different publications give different biological interpretations and conclusions for the same biological phenomenon; sometimes the same batch of data is given different biological interpretations and conclusions. This is because the understanding of complex biological phenomena is incomplete, so that analyzing the same phenomenon from different angles produces different interpretations and conclusions. Therefore, when studying complex biological problems such as constructing protein interaction networks, it is necessary to fully integrate data and related information from different sources and to screen the information, discarding the false and retaining the true, in order to deepen the comprehensive, multi-level understanding of the disease mechanism.
[0006] In addition, many computational methods for constructing protein-protein interaction networks concentrate on the design and improvement of the computational model but fail to integrate biological knowledge and biological facts well, so that they reach some erroneous conclusions that contradict basic biological knowledge and facts.
Technical Problem
Solution to Problem
Technical Solution
[0007] The object of the present invention is to provide a method for constructing a protein interaction network using text data: a new construction method that integrates existing knowledge of the biological domain, makes full use of the data obtained in the post-genomic era, and at the same time takes the characteristics of complex networks into account.
[0008] To achieve the above object, the technical solution adopted by the present invention is a method for constructing a protein interaction network using text data, comprising:
[0009] (1) establishing a protein set;
[0010] (2) recording the probability value that every two proteins in the protein set interact;
[0011] (3) constructing an initial network structure according to the size of the probability values;
[0012] (4) repeatedly selecting proteins, assigning positive or negative feedback values, and iterating continuously on the initial network structure to obtain the network structure of the final protein interaction network.
[0013] In the above technical solution, the "probability value that every two proteins interact" is obtained as follows: one protein in the protein set is arbitrarily selected as the main interacting protein, and the other proteins serve as interacted proteins. The main interacting protein interacts with each interacted protein, forming an action relationship; the main interacting protein is then replaced, and the new one again interacts with the other interacted proteins, forming another action relationship. This cycle repeats until the number of cycles reaches a predetermined value. In the case of repeated selection, the value is calculated iteratively, and the final action relationship is taken as the probability value that the corresponding two proteins interact.
[0014] In the above technical solution, the "case of repeated selection" covers the case where one protein in the protein set and another protein interact with each other in the roles of main and interacted protein, and the case where they are repeatedly selected again to interact with each other.
[0015] In a further technical solution, the predetermined value is one or more of the following: every protein in the set has interacted, as the main interacting protein, with the other interacted proteins; there has been no update over a relatively long period of cycling; or the rated number of iteration steps has been reached.
[0016] In the above technical solution, the initial network structure is constructed as follows: each protein in the protein set is a node, and each pairwise interaction is an edge. The larger the edge value, the greater the probability that the two proteins interact; the smaller the edge value, the smaller that probability. During construction, interactions with large edge values are strengthened until there is no further update over a relatively long period, and the others are weakened until their probability value is zero, which finally yields the network structure built from nodes and edges. The final network structure is then derived from the constructed initial network structure.
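
The node-and-edge construction in [0016] can be sketched as a weighted undirected graph whose edge values are strengthened or weakened by feedback. This is only an illustration: the names `build_initial_network` and `apply_feedback`, the additive update, and the clamping of edge values to the range 0 to 1 are assumptions, since the text gives no update formula at this point.

```python
from itertools import combinations


def build_initial_network(proteins, prob):
    """One node per protein, one edge per unordered pair.

    `prob` maps frozenset({a, b}) to an initial interaction probability;
    pairs with no recorded value start at 0.0 (an assumption).
    """
    return {
        frozenset(pair): prob.get(frozenset(pair), 0.0)
        for pair in combinations(proteins, 2)
    }


def apply_feedback(edges, a, b, feedback):
    """Strengthen (feedback > 0) or weaken (feedback < 0) the edge a-b.

    The value is clamped to [0, 1] so that a repeatedly weakened edge
    bottoms out at probability zero, as described in the text.
    """
    key = frozenset((a, b))
    edges[key] = min(1.0, max(0.0, edges[key] + feedback))
    return edges[key]


# Hypothetical protein identifiers and starting probability.
proteins = ["P1", "P2", "P3"]
net = build_initial_network(proteins, {frozenset(("P1", "P2")): 0.6})
strengthened = apply_feedback(net, "P1", "P2", 0.25)
weakened = apply_feedback(net, "P1", "P3", -2.0)
```

Keying edges by `frozenset` makes the pair order irrelevant, which matches the requirement that an edge should not depend on which protein was selected first.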
[0017] In a further technical solution, the final network structure is obtained as follows: the network is constructed using the entropy weight method; the entropy weight of each protein node is calculated, and then the network entropy weight is calculated. The smaller the entropy weight, the more stable the network, and the initial network structure is updated accordingly.
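
The entropy calculation in [0017] can be illustrated with a short sketch. The exact per-node formula is not reproduced in this text, so the code below assumes a Shannon-style entropy over each node's normalized edge weights; only the summation over nodes to get the network value follows directly from the description. The function names are hypothetical.

```python
import math


def local_entropy(weight, i):
    """Entropy of node i over its normalized edge weights.

    Assumption (not from the source): le_i = -sum_j p_ij * ln(p_ij),
    with p_ij = w_ij / wd_i and wd_i the sum of node i's edge weights.
    """
    wd = sum(w for j, w in enumerate(weight[i]) if j != i)
    if wd == 0:
        return 0.0
    entropy = 0.0
    for j, w in enumerate(weight[i]):
        if j != i and w > 0:
            p = w / wd
            entropy -= p * math.log(p)
    return entropy


def network_entropy(weight):
    """Network entropy: the sum of the entropies of all nodes."""
    return sum(local_entropy(weight, i) for i in range(len(weight)))


# Hypothetical matrices: evenly spread weights versus one dominant edge.
uniform = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
concentrated = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
ne_uniform = network_entropy(uniform)
ne_concentrated = network_entropy(concentrated)
```

With this assumed form, a network whose weight is spread evenly over all edges has higher entropy than one where each node has committed to a single partner, which is consistent with the statement that smaller entropy indicates a more stable network.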
[0018] In the above technical solution, the protein set is established as follows:
[0019] a. obtaining the required text from a biomedical literature database;
[0020] b. obtaining protein names and their identification numbers from protein interaction databases;
[0021] c. using the protein names obtained in step b, identifying the protein names in the text obtained in step a and labeling them with the corresponding identification numbers;
[0022] d. constructing the protein set P = {pi}, where pi denotes the identification number of the i-th protein.
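
Steps c and d above amount to dictionary-based name recognition followed by collecting identifiers. A minimal sketch, assuming case-insensitive whole-token matching (the patent does not specify the matching rule) and using made-up identification numbers:

```python
import re


def tag_proteins(text, name_to_id):
    """Return the sorted identification numbers of proteins named in `text`.

    `name_to_id` plays the role of the name/identifier table from step b.
    Matching is whole-token and case-insensitive, which is an assumption.
    """
    found = set()
    for name, protein_id in name_to_id.items():
        pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(protein_id)
    return sorted(found)


# The identifiers below are hypothetical placeholders, not real database IDs.
name_to_id = {"TP53": "ID:1", "MDM2": "ID:2", "BRCA1": "ID:3"}
text = "MDM2 binds TP53 and promotes its degradation."
protein_set = tag_proteins(text, name_to_id)
```

The lookaround guards stop a short name from matching inside a longer token, for example "TP53" inside "TP53BP1".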
Advantageous Effects of Invention
Advantageous Effects
[0023] Owing to the above technical solution, the present invention has the following advantages over the prior art:
[0024] 1. The present invention repeatedly attempts interactions and increases or decreases the edge value of each pairwise interaction, so that the constructed network structure emerges as the result of a dynamic process, which ensures the scale-free property that a complex biological network should have;
[0025] 2. The construction method of the invention suits the unknown nature of biological problems: it obtains the best behavior in an unknown random environment, constructs a protein interaction network whose structure is unknown in advance, and can ensure that the network converges to an optimal stable state;
[0026] 3. While the network is being built, the method combines seamlessly with biological knowledge and biological data and reinforces biological facts rather than building the network at random, which ensures that the network conforms to the basic properties of complex biological networks.
Brief Description of Drawings
Description of Drawings
[0027] FIG. 1 is a flow chart of the implementation steps of the method for constructing a protein interaction network using text data;
[0028]-[0029] FIG. 2 is a schematic diagram of the node degree probability distribution of a protein interaction network constructed from text data by the reinforcement learning method using average reward values;
[0030]-[0031] FIG. 3 is a schematic diagram of the node degree probability density distribution of a protein interaction network constructed from text data by the reinforcement learning method using average reward values.
Embodiments of the Invention
[0032] The present invention is further described below with reference to the drawings and embodiments:
[0033] Embodiment 1: Referring to FIG. 1, a method for constructing a protein interaction network using text data includes:
[0034] (1) establishing a protein set;
[0035] (2) recording the probability value that every two proteins in the protein set interact;
[0036] (3) constructing an initial network structure according to the size of the probability values;
[0037] (4) repeatedly selecting proteins, assigning positive or negative feedback values, and iterating continuously on the initial network structure to obtain the network structure of the final protein interaction network.
[0038] The protein set is established as follows:
[0039] a. obtaining the required text from a biomedical literature database;
[0040] b. obtaining protein names and their identification numbers from protein interaction databases;
[0041] c. using the protein names obtained in step b, identifying the protein names in the text obtained in step a and labelling them with the corresponding identification numbers;
[0042] d. constructing the protein set P = {pi}, where pi denotes the identification number of the i-th protein.
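Steps a to d can be sketched as follows. This is a minimal illustration, assuming a small in-memory name-to-identifier dictionary in place of a real interaction database export; the function name `tag_proteins`, the sample protein names, and the identifiers are hypothetical, and case-insensitive whole-word matching stands in for whatever name recognition method is used in practice.

```python
import re

def tag_proteins(text, name_to_id):
    """Annotate protein names in `text` and collect their identifiers.

    name_to_id: dict mapping protein names to identification numbers
    (a hypothetical stand-in for the names obtained in step b).
    Returns (annotated_text, protein_set).
    """
    found = set()
    if not name_to_id:
        return text, found
    # Try longer names first so a short name does not pre-empt a longer
    # name that contains it.
    names = sorted(name_to_id, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {n.lower(): pid for n, pid in name_to_id.items()}

    def annotate(match):
        pid = lookup[match.group(0).lower()]
        found.add(pid)  # step d: the set P holds identifiers, not names
        return f"{match.group(0)}[{pid}]"

    return pattern.sub(annotate, text), found
```

Applied to a sentence that mentions two of three known names, the function returns the sentence with identifiers appended and a set containing the two matching identifiers.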
[0043] The "probability value that every pair of proteins interacts" is obtained as follows: one protein in the protein set is chosen arbitrarily as the main interacting protein, and the other proteins serve as interacted proteins. The main interacting protein interacts with each interacted protein, forming an action relationship. The main interacting protein is then replaced, and the new one again interacts with the other interacted proteins, forming another action relationship. This cycle repeats until the number of cycles reaches a predetermined value. Where pairs are selected repeatedly, the value is computed iteratively, and the final action relationship is taken as the probability value of the interaction between the corresponding two proteins.
[0044] "Repeated selection" refers to the case in which one protein in the set and another protein interact with each other, each taking the main and the interacted role in turn, and the case in which the same two proteins are again selected to interact with each other.
[0045] The predetermined value is one or more of the following: every protein in the set has served as the main interacting protein and interacted with the other interacted proteins; no update has occurred over a long period of cycling; or a set number of iteration steps has been reached.
[0046] The network structure is constructed as follows: each protein in the protein set is a node, and each pairwise interaction is an edge. The larger the edge value, the greater the probability that the two proteins interact, and vice versa. During construction, interactions with large edge values are strengthened until no update occurs over a long period, while the others are weakened until their probability value is zero. The result is a network structure built from nodes and edges; this interaction network emerges as a result of the dynamics of the learning behavior.
[0047] The final network structure is obtained as follows: the network is constructed using the entropy weight method; the entropy weight of each protein node is computed, and then the network entropy weight; the smaller the entropy weight, the more stable the network; the initial network structure is updated accordingly.
[0048] This embodiment uses a reinforcement learning method to construct the network structure. The protein interaction network is built within a reinforcement learning framework: nodes represent proteins, denoted node 1, ..., node n, and an edge represents an interaction between proteins. Under the decision of the reinforcement learning agent, a node obtains an action. The action may be that the protein cooperates with another protein, meaning that the two proteins interact; that the protein and another protein are mutually exclusive, meaning that the two proteins cannot interact; or that it cannot be determined whether the two proteins interact. After each attempted interaction the node receives a reward, and the reward value determines which interactions are strengthened. Selection is repeated. Over time a protein adjusts its strategy, and may also decide on a strategy anew, while randomness is introduced for exploration so as to adapt to the environment. When an unsatisfactory result is obtained, the protein can change its strategy or choose a different partner protein. This allows for the evolution of the protein interaction network while also taking the evolution of individual strategies into account. The final protein interaction network emerges as a result of the dynamics of the agent's learning behavior.
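The decision loop described in this paragraph can be illustrated with a small sketch. It assumes an epsilon-greedy rule for the exploration the text mentions and a running sample average as the reward estimate; both are illustrative choices rather than details given in the text, and the names `ACTIONS`, `choose_action` and `update_average` are hypothetical.

```python
import random

# The three possible outcomes for a protein pair named in the text.
ACTIONS = ("interact", "exclude", "unknown")

def choose_action(q_values, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon,
    otherwise take the action with the highest estimated reward."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_values[a])

def update_average(q_values, counts, action, reward):
    """Incremental sample-average update of the reward estimate
    for `action` (assumed estimator, not taken from the text)."""
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]
```

With `epsilon` set to 0 the choice is purely greedy; raising it introduces the randomness used for exploration.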
[0049] A node i randomly chooses other nodes to visit; the selection probability is computed from the relative selection weight that each node is assigned by the other nodes. Each node has a strategy for choosing which other nodes to visit, and each visit yields a reinforcement. Node i has a selection weight vector <w_i1, ..., w_iN> from which the probability of choosing each other node is computed as the selection weight of node j divided by the sum of the selection weights of all nodes associated with node i:
$P(i \to j) = \frac{w_{ij}}{\sum_{k} w_{ik}}$
Because the new node added at each time step connects to an existing node i with some probability, the selection weight vector w_i(t) of any node i at time t is a random variable. Moreover, if the selection weight of node i at time t-1 is w_i(t-1), its selection weight w_i(t) at time t depends only on its selection weight at time t-1. Whether a new node can be connected at time t is independent of the history before time t-1.
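The selection rule can be sketched as below, assuming the selection probability is the relative weight w_ij divided by the sum of node i's weights. The function names and the uniform fallback for the all-zero case are assumptions made for this illustration.

```python
import random

def selection_probabilities(weights_i):
    """Turn node i's selection weights {j: w_ij} into probabilities.

    Falls back to a uniform distribution when all weights are zero
    (an assumption; the text does not cover this case).
    """
    total = sum(weights_i.values())
    if total <= 0:
        n = len(weights_i)
        return {j: 1.0 / n for j in weights_i}
    return {j: w / total for j, w in weights_i.items()}

def pick_partner(weights_i, rng):
    """Sample one node j to visit according to the relative weights."""
    probs = selection_probabilities(weights_i)
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[j] for j in nodes], k=1)[0]
```

Because each draw uses only the current weights, the sketch also reflects the statement that the choice at time t depends on the weights at time t-1 and not on earlier history.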
[0050] Most reinforcement learning algorithms use a function that estimates how good it is for the agent to be in a given state (or to perform a given action in a given state); this is called a value function. The function V^h is the state value function of policy h. Under policy h, the value of taking action u in state x is the expected return obtained by starting from state x, taking action u, and thereafter following policy h, denoted Q^h(x,u). Q^h is the action value function of policy h and measures how good it is to take action u in state x.
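In symbols, the two value functions can be written as follows. The return R_t is left abstract here, since this paragraph does not fix whether it is a discounted sum or an average reward; the conditional-expectation notation and the time index t are assumptions of this sketch.

```latex
V^{h}(x) = \mathbb{E}_{h}\left[ R_t \mid x_t = x \right],
\qquad
Q^{h}(x, u) = \mathbb{E}_{h}\left[ R_t \mid x_t = x,\ u_t = u \right]
```

For a deterministic policy the two are linked by V^h(x) = Q^h(x, h(x)), since following h from x means taking action h(x) first.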
[0051] The V values and Q values need to be updated at each time step. A common approach is to use the discounted cumulative reward. This approach has shortcomings, however: the discount factor must be chosen manually, and the parameter settings depend on the application. During network evolution, the formation of an edge between nodes, which represents an action relationship, should be independent of the order in which edges appear. In practice, many factors can change the order of evolution, such as the order in which the data are read. In constructing the interaction network, however, the same data and the same method should yield the same final result regardless of the intermediate order of evolution. The discounted cumulative reward is therefore unsuitable. A method of computing Q values and V values that is independent of the order of network evolution is needed to evaluate the constructed network.
[0052] The invention uses entropy to measure randomness or irregularity, and thereby the stability of the network. The greater the entropy, the greater the randomness; the smaller the entropy, the smaller the randomness, which matches the changing states of biological systems. If wd_i denotes the weighted degree (wd) of node i, the local entropy (le) of node i is defined as in Equation 1.
Figure imgf000009_0001
[0053] where wd_i is the sum of the interaction weights of all nodes associated with node i, and w_ij is the weight of the edge between node i and node j.
[0054] The network entropy (ne) of a network is the sum of the entropies of all its nodes, $ne = \sum_{i} le_i$, as in Equation 2.
Figure imgf000009_0002
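The entropy computation can be sketched as below. Equation 1 is available here only as an image, so the sketch assumes the usual Shannon form over the normalized edge weights w_ij / wd_i; that form, the natural logarithm, and the function names are assumptions. The network entropy follows the text directly as the sum of the local entropies.

```python
import math

def local_entropy(i, weight):
    """Local entropy of node i, assuming the Shannon form
    -sum_j (w_ij / wd_i) * log(w_ij / wd_i).

    weight: dict of dicts with weight[i][j] = w_ij >= 0.
    """
    wd_i = sum(weight[i].values())  # weighted degree of node i
    if wd_i <= 0:
        return 0.0
    le = 0.0
    for w_ij in weight[i].values():
        if w_ij > 0:  # zero-weight edges contribute nothing
            p = w_ij / wd_i
            le -= p * math.log(p)
    return le

def network_entropy(weight):
    """Network entropy: the sum of the local entropies of all nodes."""
    return sum(local_entropy(i, weight) for i in weight)
```

Under this assumed form, a node whose weight is concentrated on a single edge has local entropy 0, while a node that spreads its weight evenly over many edges has the largest local entropy, which matches the reading that lower entropy means less randomness.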
[0055] After long-term iteration, the resulting protein interaction network is not a random network but has a stable topology. The protein interaction network with the optimal topology therefore has the minimum network entropy, which yields the most stable final network structure.
[0056] The specific implementation steps are:
[0057] Step (1): obtain the required text from the biomedical literature database PubMed using the E-utility interface that PubMed provides;
[0058] Step (2): download protein names and their identification numbers from the protein interaction databases DIP, IntAct and STRING;
[0059] Step (3): identify the protein names in the text and represent them by their identification numbers;
[0060] Step (4): the user specifies the protein set P = {pi} of the protein interaction network to be constructed, where pi denotes the identification number of the i-th protein;
[0061] Step (5): take every pair of proteins in the protein set P = {pi} to form the set of candidate protein interaction pairs, all_pairs;
[0062] Step (6): set the set of available candidate protein interaction pairs avaiable_pairs = all_pairs;
[0063] Step (7): if the set of available candidate protein interaction pairs avaiable_pairs still contains unprocessed pairs, take any one pair (pi,pj) and go to the next step; otherwise go to step (14);
[0064] Step (8): remove the pair (pi,pj) from the set of available candidate pairs: avaiable_pairs = avaiable_pairs - {(pi,pj)};
[0065] Step (9): initialize the weight of the interaction between protein pi and protein pj: weight(pi,pj) = 0.0;
[0066] Step (10): search each of the protein interaction databases DIP, IntAct and STRING for interactions between proteins pi and pj;
[0067] Step (11): if the DIP database records an interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) + the preset reward value; otherwise, if the DIP database explicitly states that there is no interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) - the preset penalty value; otherwise, if no information on an interaction between pi and pj is found in the DIP database, weight(pi,pj) is left unchanged;
[0068] Step (12): if the IntAct database records an interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) + the preset reward value; otherwise, if the IntAct database explicitly states that there is no interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) - the preset penalty value; otherwise, if no information on an interaction between pi and pj is found in the IntAct database, weight(pi,pj) is left unchanged;
[0069] Step (13): if the STRING database records an interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) + the preset reward value; otherwise, if the STRING database explicitly states that there is no interaction between proteins pi and pj, then weight(pi,pj) = weight(pi,pj) - the preset penalty value; otherwise, if no information on an interaction between pi and pj is found in the STRING database, weight(pi,pj) is left unchanged;
[0070] Because the protein interaction databases DIP, IntAct and STRING contain a wealth of biological domain knowledge, setting the initial values in this way raises the weights of protein interactions that are known to occur and lowers the weights of pairs that are known not to interact.
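The weight initialization in steps (5) to (13) can be sketched as below. The databases are represented by hypothetical in-memory dictionaries that map an unordered protein pair to True (an interaction is recorded) or False (an interaction is explicitly ruled out), with a missing key meaning no information; real lookups against DIP, IntAct and STRING are not shown. The default reward and penalty of 1.0 are placeholders, since the preset values are not given in the text.

```python
from itertools import combinations

def initial_weights(proteins, databases, reward=1.0, penalty=1.0):
    """Build the initial weight for every unordered protein pair.

    proteins:  iterable of protein identifiers (the set P).
    databases: iterable of dicts keyed by frozenset({pi, pj}) with
               True / False values (hypothetical stand-ins).
    Returns a dict {(pi, pj): weight} with pi < pj.
    """
    weight = {}
    # Steps (5)-(8): enumerate every candidate pair exactly once.
    for pi, pj in combinations(sorted(proteins), 2):
        w = 0.0  # step (9)
        # Steps (10)-(13): one reward, penalty, or no change per database.
        for db in databases:
            evidence = db.get(frozenset((pi, pj)))
            if evidence is True:
                w += reward
            elif evidence is False:
                w -= penalty
        weight[(pi, pj)] = w
    return weight
```

A pair confirmed by two databases ends up with weight 2.0 under these defaults, a pair ruled out by one database with -1.0, and a pair with no evidence stays at 0.0.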
[0071] Step (14): obtain a protein interaction network rich in biomedical knowledge, together with the initialized weight matrix, stored in N = (pi,pj,weight(pi,pj));
[0072] Step (15): initialize the candidate protein set candidate_protein by adding all proteins to it;
[0073] Step (16): choose any protein pi from the candidate protein set candidate_protein;
[0074] Step (17): remove protein pi from the candidate protein set: candidate_protein = candidate_protein - {pi};
[0075] Step (18): initialize the paired protein set candidate_pair_protein by adding all proteins to it;
[0076] Step (19): if the paired protein set candidate_pair_protein is not empty, choose any paired protein pj from it; otherwise go to step (17);
[0077] Step (20): compute the current network entropy using the network entropy formula (Equation 2);
[0078] Step (21): use a greedy strategy to decide whether proteins pi and pj interact;
[0079] Step (22): if Qf is less than Q', consider that pi and pj do not interact and set weight(pi,pj) = 0.0; otherwise consider that pi and pj interact and set weight(pi,pj) = Qf;
[0080] Step (23): update the probability that proteins pi and pj interact to the relative weight $weight(pi,pj) / \sum_{k} weight(pi,pk)$;
[0081] Step (24): update the protein interaction network N with the new weight(pi,pj) value;
[0082] Step (25): once the set number of iteration steps has been reached, stop updating; the final network structure is obtained.
[0083] The termination condition may be that the matrix weight has not been updated over a long period, or that the predetermined number of iteration steps has been reached. The matrix weight can be used for action selection, that is, for choosing interactions between nodes, with selection probabilities given by the relative weights. The final matrix weight can therefore be regarded as the topology of the network, and the process of updating the matrix weight can be regarded as the evolution of the network being constructed.
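The two termination conditions can be combined in a loop such as the following sketch. The function `iterate_until_stable`, the `patience` parameter (how many consecutive unchanged iterations count as "a long period"), and the callback `update_step` are assumptions made for illustration; the update itself (steps 16 to 24) is deliberately left to the caller.

```python
def iterate_until_stable(weight, update_step, max_steps, patience):
    """Apply `update_step` until the weight matrix stops changing.

    weight:      dict {(pi, pj): value}, the current weight matrix.
    update_step: function taking the current dict and returning a NEW
                 dict (it must not modify its argument in place, or the
                 change detection below cannot work).
    max_steps:   hard cap on the number of iterations.
    patience:    number of consecutive unchanged iterations that
                 counts as "no update over a long period".
    Returns (final_weight, steps_run).
    """
    unchanged = 0
    steps = 0
    while steps < max_steps and unchanged < patience:
        new_weight = update_step(weight)
        steps += 1
        if new_weight == weight:
            unchanged += 1
        else:
            unchanged = 0  # any change restarts the quiet period
        weight = new_weight
    return weight, steps
```

Keeping the stopping rule separate from the update makes it possible to try different values of `patience` and `max_steps` without touching the update logic.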

Claims

[Claim 1] A method for constructing a protein interaction network using text data, characterized by comprising:
(1) establishing a protein set;
(2) recording, for every pair of proteins in the protein set, the probability value that the two proteins interact;
(3) constructing an initial network structure according to the magnitudes of the probability values;
(4) repeatedly selecting proteins, assigning positive or negative feedback values, and iterating on the initial network structure to obtain the network structure of the final protein interaction network.
[Claim 2] The method for constructing a protein interaction network according to claim 1, characterized in that the "probability value that every pair of proteins interacts" is obtained as follows: one protein in the protein set is chosen arbitrarily as the main interacting protein, and the other proteins serve as interacted proteins; the main interacting protein interacts with each interacted protein, forming an action relationship; the main interacting protein is then replaced, and the new one again interacts with the other interacted proteins, forming another action relationship; this cycle repeats until the number of cycles reaches a predetermined value; where pairs are selected repeatedly, the value is computed iteratively, and the final action relationship is taken as the probability value of the interaction between the corresponding two proteins.
[Claim 3] The method for constructing a protein interaction network according to claim 2, characterized in that "repeated selection" refers to the case in which one protein in the set and another protein interact with each other, each taking the main and the interacted role in turn, and the case in which the same two proteins are again selected to interact with each other.
[Claim 4] The method for constructing a protein interaction network according to claim 2, characterized in that the predetermined value is one or more of the following: every protein in the set has served as the main interacting protein and interacted with the other interacted proteins; no update has occurred over a long period of cycling; or a set number of iteration steps has been reached.
[Claim 5] The method for constructing a protein interaction network according to claim 1, characterized in that the initial network structure is constructed as follows: each protein in the protein set is a node, and each pairwise interaction is an edge; the larger the edge value, the greater the probability that the two proteins interact, and vice versa; during construction, interactions with large edge values are strengthened until no update occurs over a long period, while the others are weakened until their probability value is zero, finally yielding a network structure built from nodes and edges; the final network structure is then obtained from the constructed initial network structure.
[Claim 6] The method for constructing a protein interaction network according to claim 5, characterized in that the final network structure is obtained as follows: the network is constructed using the entropy weight method; the entropy weight of each protein node is computed, and then the network entropy weight; the smaller the entropy weight, the more stable the network; the initial network structure is updated accordingly.
[Claim 7] The method for constructing a protein interaction network according to claim 1, characterized in that the protein set is established by:
a. obtaining the required text from a biomedical literature database;
b. obtaining protein names and their identification numbers from protein interaction databases;
c. using the protein names obtained in step b, identifying the protein names in the text obtained in step a and labelling them with the corresponding identification numbers;
d. constructing the protein set P = {pi}, where pi denotes the identification number of the i-th protein.
PCT/CN2016/074496 2015-02-25 2016-02-24 Method for constructing protein-protein interaction network using text data WO2016134659A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510086244.0A CN104657626A (en) 2015-02-25 2015-02-25 Method for constructing protein interaction network by using text data
CN201510086244.0 2015-02-25

Publications (1)

Publication Number Publication Date
WO2016134659A1 true WO2016134659A1 (en) 2016-09-01

Family

ID=53248740

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/074496 WO2016134659A1 (en) 2015-02-25 2016-02-24 Method for constructing protein-protein interaction network using text data

Country Status (2)

Country Link
CN (1) CN104657626A (en)
WO (1) WO2016134659A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109686402A (en) * 2018-12-26 2019-04-26 扬州大学 Based on key protein matter recognition methods in dynamic weighting interactive network
CN111667881A (en) * 2020-06-04 2020-09-15 大连民族大学 Protein function prediction method based on multi-network topological structure
CN113066524A (en) * 2021-05-19 2021-07-02 江南大学 Multi-protein interaction network comparison method based on simulated annealing
CN113470738A (en) * 2021-07-03 2021-10-01 中国科学院新疆理化技术研究所 Overlapping protein complex identification method and system based on fuzzy clustering and gene ontology semantic similarity

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657626A (en) * 2015-02-25 2015-05-27 苏州大学 Method for constructing protein interaction network by using text data
CN105138864B (en) * 2015-09-24 2017-10-13 大连理工大学 Protein interactive relation data base construction method based on Biomedical literature
CN115328117B (en) * 2022-07-15 2023-07-14 大理大学 Protein dynamic ligand channel optimal path analysis method based on reinforcement learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102012977A (en) * 2010-12-21 2011-04-13 福建师范大学 Signal peptide prediction method based on probabilistic neural network ensemble
CN103235900A (en) * 2013-03-28 2013-08-07 中山大学 Weight assembly clustering method for excavating protein complex
WO2014054526A1 (en) * 2012-10-01 2014-04-10 独立行政法人科学技術振興機構 Approval prediction device, approval prediction method, and program
CN104298651A (en) * 2014-09-09 2015-01-21 大连理工大学 Biomedicine named entity recognition and protein interactive relationship extracting on-line system based on deep learning
CN104657626A (en) * 2015-02-25 2015-05-27 苏州大学 Method for constructing protein interaction network by using text data

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109686402A (en) * 2018-12-26 2019-04-26 扬州大学 Based on key protein matter recognition methods in dynamic weighting interactive network
CN109686402B (en) * 2018-12-26 2023-11-03 扬州大学 Method for identifying key proteins in interaction network based on dynamic weighting
CN111667881A (en) * 2020-06-04 2020-09-15 大连民族大学 Protein function prediction method based on multi-network topological structure
CN111667881B (en) * 2020-06-04 2023-06-06 大连民族大学 Protein function prediction method based on multi-network topology structure
CN113066524A (en) * 2021-05-19 2021-07-02 江南大学 Multi-protein interaction network comparison method based on simulated annealing
CN113066524B (en) * 2021-05-19 2022-12-20 江南大学 Multi-protein interaction network comparison method based on simulated annealing
CN113470738A (en) * 2021-07-03 2021-10-01 中国科学院新疆理化技术研究所 Overlapping protein complex identification method and system based on fuzzy clustering and gene ontology semantic similarity
CN113470738B (en) * 2021-07-03 2023-07-14 中国科学院新疆理化技术研究所 Overlapping protein complex identification method and system based on fuzzy clustering and gene ontology semantic similarity

Also Published As

Publication number Publication date
CN104657626A (en) 2015-05-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16754768

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16754768

Country of ref document: EP

Kind code of ref document: A1