CN112116951A - Proteome data management method, medium and equipment based on graph database

Info

Publication number
CN112116951A
CN112116951A CN202010816554.4A CN202010816554A CN112116951A CN 112116951 A CN112116951 A CN 112116951A CN 202010816554 A CN202010816554 A CN 202010816554A CN 112116951 A CN112116951 A CN 112116951A
Authority
CN
China
Legal status
Granted
Application number
CN202010816554.4A
Other languages
Chinese (zh)
Other versions
CN112116951B (en)
Inventor
范晓宣
曹华伟
叶笑春
范东睿
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN202010816554.4A
Publication of CN112116951A
Application granted
Publication of CN112116951B
Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9014 Indexing; Data structures therefor; Storage structures hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

An embodiment of the invention provides a proteome data management method, medium and equipment based on a graph database. The method comprises: acquiring graph data corresponding to a proteome, wherein the graph data comprises a plurality of nodes and edges, each node records the protein it represents, and each edge records the relationship between the two nodes it connects; establishing a bottom-layer doubly linked list from the graph data, in which the nodes are arranged in lexicographic order of the names of the proteins they represent; and, starting from the bottom-layer doubly linked list, extracting one node from every two nodes into the index layer above so as to establish a singly linked index list in each index layer, until the top index list has only two nodes, thereby obtaining a fast index comprising multiple index layers. The invention establishes the fast index on top of the original graph database so as to improve the efficiency of indexing large-scale proteomes.

Description

Proteome data management method, medium and equipment based on graph database
Technical Field
The invention relates to the field of databases, in particular to the technical field of database indexing, and more particularly to a method, a medium and equipment for managing proteome data based on a graph database.
Background
With the development of protein assay techniques (e.g., mass spectrometry), research has gradually focused on the complex interactions between protein molecules and the networks derived from them. As research directions such as protein interaction prediction and protein function prediction have emerged, the volume of experimental proteome data has grown exponentially. To store, manage, analyze and use these large volumes of proteome data efficiently, databases are commonly employed. The relational databases in common use today are poorly suited to storing, aggregating and updating massive semi-structured data because they require frequent join operations. Graph databases, represented by Neo4j and TigerGraph, offer fast response, good scalability and high reliability when processing unstructured data such as proteomes, particularly where the connections are complex. In a biological network abstracted onto the graph database data structure, nodes (Node) record the proteins and relationships (Relationship) record the interactions between proteins; a node carries a label attribute, and properties (Properties) are attached to a relationship to represent its weight. Systematically analyzing the interactions of large numbers of proteins in a biological system is important for understanding how proteins work in that system, the mechanisms of biological signaling and of energy and substance metabolism under special physiological states such as disease, and the functional relationships among proteins.
Current graph databases typically use native graph storage. Taking the Neo4j graph database as an example, nodes, relationships, and the attributes of nodes and relationships are stored separately, and the physical addresses of nodes, relationships and attributes can be located directly. Since a relationship is the physical storage of an edge, "edge" is used below as shorthand for "relationship". The physical record of an edge contains the previous and next edges of the edge's start node and the previous and next edges of its end node. In physical storage, every edge is stored only once. Proteome index lookup in current graph databases relies mainly on traversing the native graph storage structure. The original inverted index, based on the search engine Elasticsearch, performs well on small proteome networks, but it does not perform well on large-scale, complex proteome networks. There is therefore a need to improve graph databases that rely on an inverted index.
Disclosure of Invention
Accordingly, it is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a method, medium, and apparatus for proteome data management based on a graph database.
The purpose of the invention is realized by the following technical scheme:
According to a first aspect of the invention, a graph-database-based proteome data management method comprises: acquiring graph data corresponding to a proteome, wherein the graph data comprises a plurality of nodes and edges, each node records the protein it represents, and each edge records the relationship between the two nodes it connects; establishing a bottom-layer doubly linked list from the graph data, in which the nodes are arranged in lexicographic order of the names of the proteins they represent; and, starting from the bottom-layer doubly linked list, extracting one node from every two nodes into the index layer above so as to establish a singly linked index list in each index layer, until the top index list has only two nodes, thereby establishing a fast index comprising multiple index layers.
In some embodiments of the invention, the method comprises: in response to a signal that the graph data corresponding to any proteome in the graph database has reached a preset scale, establishing a fast index, in addition to the original inverted index, only for the proteome that has reached the preset scale, in the manner described above. A proteome that has not reached the preset scale continues to use the original inverted index.
In some embodiments of the invention, the method further comprises: when a node corresponding to a new protein is inserted into the graph data corresponding to a proteome for which a fast index has been established, generating a random variable that determines how the fast index of that graph data is updated, with a different update mode set for each numerical range to which the random variable may belong.
In some embodiments of the present invention, the random variable obeys a geometric distribution with a parameter p, where p is 0.5.
In some embodiments of the invention, the numerical ranges comprise: a first numerical range consisting only of the value 1; a second numerical range (1, k+1]; and a third numerical range (k+1, +∞), where k is the total number of index layers in the current fast index. The update modes are as follows. When the generated random variable falls in the first numerical range, the node corresponding to the new protein is inserted into the bottom-layer doubly linked list and the fast index is not updated. When it falls in the second numerical range, the node is inserted into the bottom-layer doubly linked list and is also added to every index layer whose layer number is less than the value of the random variable. When it falls in the third numerical range, the node is inserted into the bottom-layer doubly linked list, a new index layer is added at the top, and the node is added to every index layer; the new top index layer extracts one node from every two nodes of the index layer below it and forms a singly linked list together with the node corresponding to the new protein.
In some embodiments of the invention, the method further comprises: in response to a request to delete the node corresponding to a protein in the graph data of a proteome for which a fast index has been established, searching downward from the top layer of the multi-layer index and deleting the node corresponding to that protein from each index layer in turn, and then deleting the node corresponding to that protein from the bottom-layer doubly linked list.
In some embodiments of the invention, the method further comprises: establishing, for the node corresponding to each protein in the graph data of a proteome for which a fast index has been established, a hash table that records the relationships between nodes; computing, with a pre-designed hash function, a storage position in the hash table for each neighbor node of the node from the name of the protein that neighbor represents; and storing the name of the protein represented by each neighbor node at its computed storage position.
In some embodiments of the invention, the hash function is:
[Formula image BDA0002632923660000031, not reproduced in this text]
where x denotes the name of the protein, N the length of the character string of the name, $x_n$ the n-th character of the name, ASCII($x_n$ - a) the difference between the ASCII code of the n-th character of the name and the ASCII code of the character a, MOD the modulo operation, and prime the modulus.
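The per-node neighbor hash table described above can be sketched in Python as follows. Because the hash formula itself is contained in an image that is not reproduced in this text, `stand_in_hash` is only a placeholder built from the ingredients the text names (character differences from the character a, reduced modulo a prime); it should not be read as the patented formula. The table size `PRIME = 31`, the use of chained buckets for collisions, and the neighbor names passed in are likewise assumptions made for illustration.

```python
from typing import Callable, Iterable, List

PRIME = 31  # hypothetical modulus and table size; the source gives no value


def stand_in_hash(name: str, prime: int = PRIME) -> int:
    """Placeholder hash, NOT the formula from the source (which is an image).

    Sums ord(ch) - ord('a') over the characters of the name and reduces the
    sum modulo `prime`. Python's % returns a non-negative result here.
    """
    return sum(ord(ch) - ord("a") for ch in name) % prime


def build_neighbor_table(
    neighbor_names: Iterable[str],
    hash_fn: Callable[[str, int], int] = stand_in_hash,
    prime: int = PRIME,
) -> List[List[str]]:
    """Store each neighbor's protein name at the slot computed from its name."""
    table: List[List[str]] = [[] for _ in range(prime)]
    for name in neighbor_names:
        bucket = table[hash_fn(name, prime)]
        if name not in bucket:  # chaining resolves collisions (an assumption)
            bucket.append(name)
    return table


def is_neighbor(
    table: List[List[str]],
    name: str,
    hash_fn: Callable[[str, int], int] = stand_in_hash,
) -> bool:
    """Check adjacency by looking only at the slot the name hashes to."""
    return name in table[hash_fn(name, len(table))]


# Hypothetical adjacency for one node; the real edges are those of the graph data.
table_for_node = build_neighbor_table(["FAA4", "FAT1"])
print(is_neighbor(table_for_node, "FAA4"), is_neighbor(table_for_node, "PAY2"))
```

With such a table, checking whether two proteins are connected inspects a single bucket instead of walking the node's edge list.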
According to a second aspect of the present invention, there is provided an electronic apparatus comprising: one or more processors; and a memory, wherein the memory is to store one or more executable instructions; the one or more processors are configured to implement the steps of the method of the first aspect via execution of the one or more executable instructions.
Compared with the prior art, the invention has the advantages that:
the invention provides a proteome data management method based on a graph database.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart illustrating the construction of an initial fast index according to an embodiment of the present invention;
FIG. 2 is a schematic representation of a simplified proteome;
FIG. 3 is a schematic representation of a fast index constructed from the above-described simplified proteome;
FIG. 4 is a schematic diagram of a query based on the above-described fast index of simplified proteome construction;
FIG. 5 is a diagram illustrating a query when a new node is inserted into an initial fast index, according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of deleting a node corresponding to a protein according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating a process of constructing a hash table corresponding to a node according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a hash table corresponding to a node according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As mentioned in the Background section, proteome index lookup in current graph databases relies mainly on traversing the native graph storage structure. On small proteome networks the original search-engine-based inverted index performs well; on large-scale, complex proteome networks, however, the inverted index does not perform well. The invention therefore provides a graph-database-based proteome data management method that establishes a fast index on top of the original graph database so as to improve the indexing efficiency of large-scale proteomes.
Before describing embodiments of the present invention in detail, some of the terms used therein will be explained as follows:
A graph database (GDB) is a database that uses graph-structured data (graph data) for semantic queries; graph data comprises nodes and edges.
A proteome is the complete set of proteins expressed by one genome, one cell, or one tissue.
According to an embodiment of the present invention, there is provided a graph-database-based proteome data management method, comprising: establishing a fast index for the graph data corresponding to a proteome, in addition to the original inverted index. This embodiment establishes a fast index for the graph data corresponding to the proteome; the advantage is that the indexing efficiency of large-scale proteomes is significantly improved, although the indexing efficiency of small-scale proteomes may decrease.
According to another embodiment of the present invention, there is provided a graph-database-based proteome data management method, comprising: in response to a signal that the graph data corresponding to any proteome in the graph database has reached a preset scale, establishing a fast index for the graph data corresponding to that proteome, in addition to the original inverted index, as follows: acquiring the graph data corresponding to the proteome, wherein the graph data comprises a plurality of nodes and edges, each node records the protein it represents, and each edge records the relationship between the two nodes it connects; establishing a bottom-layer doubly linked list from the graph data, in which the nodes are arranged in lexicographic order of the names of the proteins they represent; and, starting from the bottom-layer doubly linked list, extracting one node from every two nodes into the index layer above so as to establish a singly linked index list in each index layer, until the top index list has only two nodes, thereby obtaining a fast index comprising multiple index layers. Preferably, in the bottom-layer doubly linked list, each node is associated with the storage address of the data corresponding to the protein it represents. Preferably, in response to a request to query a specific protein, if that protein is stored in the graph data of a proteome for which a fast index has been established, the protein is queried and the result returned using the fast index; otherwise, the protein is queried and the result returned using the inverted index.
In this embodiment, a fast index is established only for proteomes that reach the preset scale, while small-scale proteomes that have not reached it are still indexed with the original inverted index. For different data scales, the invention can therefore serve queries with an indexing mechanism suited to that scale, improving the efficiency of querying or analyzing proteome data.
As for the graph database, the invention may be applied to graph databases based on an inverted index, such as the Neo4j graph database, and is not limited in this respect. To determine whether the graph data corresponding to a given proteome has reached the preset scale, the invention may maintain a separate daemon process for each proteome, which counts the data scale of that proteome's graph data in the background, in real time or periodically, and sends a signal that the graph data has reached the preset scale when the count shows that it has. The preset scale may be set by data volume, for example 10 GB; by number of nodes, for example 100,000 nodes; or by number of edges, for example 500,000 edges. Other settings are of course possible, and the invention is not limited in this regard.
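The scale check performed by the daemon process can be sketched in Python as below. The class and function names (`GraphStats`, `PresetScale`, `reached_preset_scale`, `choose_index`) are hypothetical. The text offers the three thresholds as alternative ways of setting the preset scale; treating any one of them as sufficient, and reading 10 GB as 10 × 1024³ bytes, are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphStats:
    """Counts gathered for the graph data of one proteome."""
    n_bytes: int
    n_nodes: int
    n_edges: int


@dataclass(frozen=True)
class PresetScale:
    """Example thresholds taken from the text above."""
    n_bytes: int = 10 * 1024**3   # 10 GB, read here as binary gigabytes
    n_nodes: int = 100_000
    n_edges: int = 500_000


def reached_preset_scale(stats: GraphStats, preset: PresetScale = PresetScale()) -> bool:
    # Assumption: reaching any single threshold counts as reaching the preset scale.
    return (
        stats.n_bytes >= preset.n_bytes
        or stats.n_nodes >= preset.n_nodes
        or stats.n_edges >= preset.n_edges
    )


def choose_index(stats: GraphStats, preset: PresetScale = PresetScale()) -> str:
    """Pick the index that would serve queries for this proteome."""
    return "fast" if reached_preset_scale(stats, preset) else "inverted"


print(choose_index(GraphStats(n_bytes=2 * 1024**3, n_nodes=120_000, n_edges=300_000)))
```

A daemon would call `reached_preset_scale` on each statistics pass and emit the signal the first time it returns true.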
To establish the fast index, the nodes must first be sorted by the names of the proteins they represent, in either ascending or descending order; ascending order is used as the example here. The process of building an initial fast index is shown in FIG. 1 and may comprise the following steps: S110, acquiring graph data corresponding to the proteome, wherein the graph data comprises a plurality of nodes and edges, each node records the protein it represents, and each edge records the relationship between the two nodes it connects; S120, establishing a bottom-layer doubly linked list from the graph data, in which the nodes are arranged in ascending lexicographic order of the names of the proteins they represent; S130, taking one node from every two nodes of the most recently constructed layer (layer i-1, where layer 0 is the bottom-layer doubly linked list) to construct the i-th index layer, with i initially 1; S140, checking whether the number of nodes in the current i-th index layer is less than or equal to 2, and if so ending, otherwise going to step S150; S150, setting i = i + 1 and returning to step S130. Taking the Neo4j graph database as an example, the file size of each node in Neo4j is fixed; however, to reduce the time cost of moving data, the bottom layer of the invention uses a doubly linked list to maintain the order of the nodes, so that when a node is deleted its predecessor and successor can be obtained quickly, which makes deletion at the bottom layer more efficient. In the data structure of the fast index, several index layers are built over the bottom-layer doubly linked list, and each index layer is an ordered singly linked list.
Suppose the doubly linked list has n nodes. The index layer closest to the doubly linked list that stores the nodes is called the layer-1 index and has n/2 nodes; by analogy, the number of nodes in the layer-i index is twice the number of nodes in the layer-(i+1) index. The height h of the index and the length n of the linked list satisfy $2^{h+1} = n$. For ease of understanding, a subgraph of a protein-protein interaction (PPI) network, shown in FIG. 2, is used for illustration. The subgraph contains the proteins FAA1, FAA4, FAT1, GEM1, HMG1, PGC1, PAY2 and PAY3 and is treated here as a simplified proteome. Each circle represents the node corresponding to one protein, and a line connecting two nodes is an edge. The relationship between nodes, an interaction between proteins (Experiments), means that experiments have shown that the two or more proteins can form a protein complex through non-covalent bonds. Referring to FIG. 3, a bottom-layer doubly linked list is established from the graph data corresponding to the proteome, with the nodes arranged in ascending lexicographic order of the names of the proteins they represent, giving the bottom-layer doubly linked list FAA1, FAA4, FAT1, GEM1, HMG1, PAY2, PAY3, PGC1. Then, starting from FAA1 in the bottom-layer doubly linked list, one node is extracted from every two nodes to serve as a node of index layer 1; index layer 1 contains the 4 nodes FAA1, FAT1, HMG1 and PAY3, each of which stores a pointer to the corresponding node in the bottom-layer linked list. Starting from FAA1 in index layer 1, one node is extracted from every two nodes to serve as a node of index layer 2; index layer 2 contains the 2 nodes FAA1 and HMG1, each of which stores a pointer to the corresponding node in index layer 1. Since index layer 2 has only 2 nodes, construction of the index is complete. The result is a multi-layer index in which index layer 1 is a singly linked list composed of FAA1, FAT1, HMG1 and PAY3, and index layer 2 is a singly linked list composed of FAA1 and HMG1.
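Steps S110 to S150 can be sketched in Python as follows, using the eight example proteins as input. The names `Node`, `build_fast_index` and `names` are hypothetical, and plain Python string ordering stands in for the lexicographic order of the text; it gives the order listed above for these names, but it is not guaranteed to match the letter-versus-digit rule the description mentions for lookups.

```python
from typing import List, Optional


class Node:
    """One entry of a layer; `prev` is used only in the bottom doubly linked list."""

    def __init__(self, name: str, down: Optional["Node"] = None) -> None:
        self.name = name
        self.next: Optional["Node"] = None
        self.prev: Optional["Node"] = None
        self.down = down  # same protein one layer below (None at the bottom)


def build_fast_index(protein_names: List[str]) -> List[Node]:
    """Return the head of every layer: heads[0] is the bottom list, heads[i] index layer i."""
    if not protein_names:
        raise ValueError("at least one protein name is required")
    # S120: bottom-layer doubly linked list in ascending order of name.
    layer = [Node(name) for name in sorted(protein_names)]
    for left, right in zip(layer, layer[1:]):
        left.next, right.prev = right, left
    heads = [layer[0]]
    while True:
        # S130: take one node out of every two from the layer built last.
        layer = [Node(node.name, down=node) for node in layer[::2]]
        for left, right in zip(layer, layer[1:]):
            left.next = right  # index layers are singly linked
        heads.append(layer[0])
        # S140: stop once the newest index layer has at most two nodes.
        if len(layer) <= 2:
            return heads


def names(head: Optional[Node]) -> List[str]:
    """Collect the protein names of one layer by following `next` pointers."""
    out = []
    while head is not None:
        out.append(head.name)
        head = head.next
    return out


EXAMPLE = ["FAA1", "FAA4", "FAT1", "GEM1", "HMG1", "PGC1", "PAY2", "PAY3"]
for level, head in enumerate(build_fast_index(EXAMPLE)):
    print(level, names(head))
```

For the example input the sketch yields the bottom list plus two index layers, matching the layers enumerated in the description of FIG. 3.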
A node in a proteome with a fast index is looked up as follows. Let pro denote the protein to be queried and mid the protein in the middle of the index ordering (that is, if the index has n nodes, mid is the node corresponding to the protein at the position given by formula image BDA0002632923660000061, not reproduced in this text, in lexicographic order; when a letter has to be compared with a digit at the same character position, the letter takes priority over the digit). In a linked list of length n, pro is compared with mid: if pro equals mid, the query ends and the result is returned; if pro precedes mid in lexicographic order, none of the proteins ordered after mid in the index can satisfy the condition; if pro follows mid in lexicographic order, none of the proteins ordered before mid can satisfy the condition. Suppose the top layer is the k-th layer. Comparing against the two elements of the k-th layer excludes n/2 nodes; by the same principle layer k-1 excludes n/4 nodes, and so on until a result is returned. Since each layer only needs to compare the values of 2 nodes, the time complexity T of a query is determined mainly by the number of layers, which grows logarithmically with the number of nodes n, that is, T(n) = O(log n). Taking the fast index built from the simplified proteome of FIG. 2 as an example, suppose the protein PAY2 is to be queried. Referring to FIG. 4, the query proceeds as follows: at index layer 2, PAY2 is compared with FAA1 and HMG1 in lexicographic order, and since PAY2 > HMG1 the search descends from HMG1 to index layer 1; at index layer 1, PAY2 is compared with HMG1 and PAY3, and since HMG1 < PAY2 < PAY3 the search descends along HMG1 to the bottom-layer doubly linked list, locates HMG1, and then moves right to find the node corresponding to protein PAY2.
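The lookup just described can be sketched in Python as below. The sketch rebuilds the example index with a compact hypothetical `build` helper, then `find` moves right within a layer while the next name does not exceed the target and otherwise descends. The optional `trace` argument, the helper names, and the use of plain Python string comparison in place of the letter-over-digit rule are all assumptions of the sketch.

```python
from typing import List, Optional


class Node:
    def __init__(self, name: str, down: Optional["Node"] = None) -> None:
        self.name = name
        self.next: Optional["Node"] = None
        self.prev: Optional["Node"] = None  # bottom layer only
        self.down = down                    # None at the bottom layer


def build(protein_names: List[str]) -> Node:
    """Build the layers as in FIG. 1 and return the head of the top index layer."""
    if not protein_names:
        raise ValueError("at least one protein name is required")
    layer = [Node(name) for name in sorted(protein_names)]
    for left, right in zip(layer, layer[1:]):
        left.next, right.prev = right, left
    while True:
        layer = [Node(node.name, down=node) for node in layer[::2]]
        for left, right in zip(layer, layer[1:]):
            left.next = right
        if len(layer) <= 2:
            return layer[0]


def find(top: Node, target: str, trace: Optional[List[str]] = None) -> Optional[Node]:
    """Return the bottom-layer node named `target`, or None if it is absent."""
    node = top
    if trace is not None:
        trace.append(node.name)
    if target < node.name:  # sorts before the first protein: cannot be present
        return None
    while True:
        # Move right while the next entry does not overshoot the target.
        while node.next is not None and node.next.name <= target:
            node = node.next
            if trace is not None:
                trace.append(node.name)
        if node.down is None:  # reached the bottom doubly linked list
            return node if node.name == target else None
        node = node.down       # descend one layer along this node's pointer
        if trace is not None:
            trace.append(node.name)


EXAMPLE = ["FAA1", "FAA4", "FAT1", "GEM1", "HMG1", "PGC1", "PAY2", "PAY3"]
visited: List[str] = []
hit = find(build(EXAMPLE), "PAY2", visited)
print(hit.name if hit else None, visited)
```

For PAY2 the visited nodes follow the path of FIG. 4: FAA1 and HMG1 in index layer 2, HMG1 in index layer 1, then HMG1 and PAY2 in the bottom list.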
According to an embodiment of the present invention, when a node corresponding to a new protein is inserted into the graph data corresponding to a proteome for which a fast index has been established, a random variable is generated to determine how the fast index of that graph data is updated, and a different update mode is set for each numerical range to which the random variable may belong. Preferably, the random variable follows a geometric distribution with parameter p, where p = 0.5. Preferably, the numerical ranges comprise: a first numerical range consisting only of the value 1; a second numerical range (1, k+1]; and a third numerical range (k+1, +∞), where k is the total number of index layers in the current fast index. The update modes are as follows. When the generated random variable falls in the first numerical range, the node corresponding to the new protein is inserted into the bottom-layer doubly linked list and the fast index is not updated. When it falls in the second numerical range, the node is inserted into the bottom-layer doubly linked list and is also added to every index layer whose layer number is less than the value of the random variable. When it falls in the third numerical range, the node is inserted into the bottom-layer doubly linked list, a new index layer is added at the top, and the node is added to every index layer; the new top index layer extracts one node from every two nodes of the index layer below it and forms a singly linked list together with the node corresponding to the new protein.
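The three update modes can be sketched in Python as follows. For brevity the sketch models each layer as a sorted Python list of protein names instead of a linked list, so it illustrates only the randomized update rule, not the pointer manipulation. The function names, the made-up protein names MMM1, NNN1 and ZZZ1, and the choice to form the new top layer from every second entry of the former top layer after the insertion are assumptions.

```python
import bisect
import random
from typing import List, Optional


def geometric(p: float = 0.5, rng: Optional[random.Random] = None) -> int:
    """Number of trials up to and including the first success (1, 2, 3, ...)."""
    rng = rng or random.Random()
    value = 1
    while rng.random() >= p:
        value += 1
    return value


def insert(layers: List[List[str]], name: str, K: Optional[int] = None) -> int:
    """Insert `name`; layers[0] is the bottom list, layers[1:] are index layers 1..k.

    K may be supplied to make the example deterministic; otherwise it is drawn
    from a geometric distribution with p = 0.5.
    """
    if K is None:
        K = geometric(0.5)
    if K < 1:
        raise ValueError("K must be at least 1")
    if name in layers[0]:
        raise ValueError(f"{name} is already present")
    k = len(layers) - 1
    bisect.insort(layers[0], name)            # every mode inserts into the bottom list
    for level in range(1, min(K - 1, k) + 1):  # K = 1 leaves the index layers untouched
        bisect.insort(layers[level], name)
    if K - 1 > k:                             # third range: grow the index by one layer
        new_top = layers[k][::2]              # one of every two nodes of the old top
        if name not in new_top:
            bisect.insort(new_top, name)
        layers.append(new_top)
    return K


layers = [
    ["FAA1", "FAA4", "FAT1", "GEM1", "HMG1", "PAY2", "PAY3", "PGC1"],
    ["FAA1", "FAT1", "HMG1", "PAY3"],
    ["FAA1", "HMG1"],
]
insert(layers, "MMM1", K=1)  # bottom list only
insert(layers, "NNN1", K=2)  # bottom list and index layer 1
insert(layers, "ZZZ1", K=5)  # all layers, plus a new top layer
for level, layer in enumerate(layers):
    print(level, layer)
```

Because K exceeds 1 with probability 1/2, exceeds 2 with probability 1/4, and so on, roughly half of the inserted nodes reach index layer 1 and a quarter reach index layer 2, which is how the rule approximates the one-in-two extraction without rebuilding the index.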
When proteins are inserted, if they keep being added between two adjacent index nodes of the bottom-layer doubly linked list, the number of nodes between those two index nodes can in the extreme case grow very large, so that the time complexity of a query degrades to O(n). To keep the index layers efficient, the fast index must be updated when proteins are inserted. However, if the fast index were updated by strictly extracting one node from every two nodes into the index layer above on every insertion, updating the fast index of a large-scale proteome would incur a very large computational overhead. The invention therefore defines a random variable K that follows a geometric distribution with p = 1/2. If K = 1, the node corresponding to the protein is inserted only into the bottom-layer doubly linked list; if 1 ≤ K - 1 ≤ k, the node is also inserted into index layers 1 to K - 1; if K - 1 > k, an index layer is added at the top, the node is inserted into all index layers, and the newly added top index layer extracts one node from every two nodes of the layer below it. This makes inserting a protein more efficient. Because K follows a geometric distribution with p = 1/2, this update mode promotes roughly one of every two new nodes to the index layer above, which approximately preserves the balance and the indexing efficiency of the fast index structure, and updating the fast index when a new protein is inserted does not consume excessive computing resources, so the time complexity of an insertion is determined mainly by locating the insertion position. Referring to FIG. 5, for the case where the node corresponding to a newly added protein has been inserted into some layer of the fast index, a protein node is searched for as follows. The search starts from the top layer. Suppose the top layer has only node a and node b. If the protein pro to be queried is the same as node a or node b, the result is found directly. If pro lies between a and b in lexicographic order, the search descends to the next index layer along the pointer of node a; if pro comes after node b, it descends along the pointer of node b. Suppose the search has descended from node a to the current layer, and that in the ordered linked list of this layer the next node after a is node x, the next after x is the inserted node y (assuming the random variable generated when y was inserted was 2, so that y was inserted only into index layer 1), and the next after y is node b. pro must then be compared with x and y in lexicographic order. If pro is the same as a, x, y or b, the result, namely the storage address of the data corresponding to the protein represented by pro, can be obtained directly from the bottom-layer doubly linked list. If pro lies between a and x, the search descends to the next layer along a; if between x and y, along x; if between y and b, along y; if pro is greater than b, along b. The search continues in this way until a result is returned. Assuming the node pro being searched for is in fact node y, the search follows the path shown by the dotted arrows in FIG. 5.
According to an embodiment of the present invention, in response to a request to delete the node corresponding to a protein to be deleted from the graph data corresponding to a proteome for which a fast index has been established, the nodes corresponding to the protein to be deleted are searched for and deleted in each index layer in turn, from the top layer of the multi-layer index downward, and then the node corresponding to the protein to be deleted is deleted from the bottom-layer doubly linked list. When a protein is deleted, in order to keep the fast index correct, the nodes corresponding to the protein in the multi-layer index must be deleted in addition to the node in the bottom-layer doubly linked list. The time complexity of the deletion operation depends mainly on locating the position of the protein to be deleted. When deleting a node, the index performs the following operations: following the search procedure, search from the top layer k; if the i-th index layer contains the node to be deleted, delete it from the i-th layer; repeat this process down to the 1st index layer; finally, delete the node to be deleted corresponding to the protein from the bottom-layer doubly linked list. Referring to FIG. 6, assuming that the protein PAY3 is to be deleted, the node corresponding to PAY3 is searched for from the top layer along the search path shown by the dashed arrows. Because the top layer has no node corresponding to PAY3, the search continues in the next index layer, where a node corresponding to PAY3 is found and deleted. Finally, the node corresponding to PAY3 is deleted from the bottom-layer doubly linked list, which yields the fast index shown in FIG. 6 after the deletion of the node corresponding to PAY3.
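The top-down deletion order can be sketched as follows. This is a simplified illustration rather than the structure used in the patent: each layer is represented by a sorted Python list instead of a linked list, the function name `delete_protein` is hypothetical, and the layer contents are invented for the example because FIG. 6 is not reproduced in the text.

```python
from bisect import bisect_left


def delete_protein(layers, name):
    """Delete `name` from every layer that holds it, top index layer first, bottom list last.

    layers[0] is the bottom list and layers[1:] are the index layers (lowest first);
    each layer is a sorted Python list standing in for a linked list.
    Returns the layer numbers the name was removed from, in the order visited.
    """
    removed_from = []
    for level in range(len(layers) - 1, -1, -1):
        layer = layers[level]
        position = bisect_left(layer, name)
        if position < len(layer) and layer[position] == name:
            del layer[position]
            removed_from.append(level)
    return removed_from


# Hypothetical layer contents: the top layer has no PAY3 node, index layer 1 does.
layers = [
    ["FAA1", "FAT1", "GEM1", "HMG1", "PAY2", "PAY3", "PGC1"],  # bottom list
    ["FAA1", "GEM1", "PAY3"],                                   # index layer 1
    ["FAA1", "GEM1"],                                           # top index layer
]
removed = delete_protein(layers, "PAY3")
```

Here `removed` lists index layer 1 before the bottom list (layer 0), matching the order in which the text deletes the PAY3 nodes.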
According to an embodiment of the invention, the method further comprises: for the node corresponding to each protein in the graph data corresponding to a proteome for which the fast index has been established, establishing a hash table that records the relationships between nodes; calculating, according to a pre-designed hash function, the storage positions of the neighbor nodes of the node in the hash table based on the names of the proteins represented by those neighbor nodes; and storing the names of the proteins represented by the neighbor nodes at the calculated storage positions. Referring to FIG. 7, the process of creating the hash table includes: S210, calculating the hash value of a neighbor node of a node according to the pre-designed hash function; S220, determining whether the hash value conflicts with an existing entry in the hash table of the node, and if so, going to step S240, otherwise going to step S230; S230, storing the neighbor node directly at the position corresponding to the hash value; and S240, inserting the neighbor node at the tail of the linked list for that hash value. That is, hash conflicts are resolved by chaining: neighbor nodes whose names yield the same hash value form a singly linked list in order of insertion. Preferably, the pre-designed hash function is:
$$H(x) = \left( \sum_{n=1}^{N} \mathrm{ASCII}(x_n - \mathrm{A}) \cdot 2^{\,n-1} \right) \ \mathrm{MOD}\ \mathrm{prime}$$
where x represents the name of the protein, N represents the length of the character string of the name of the protein, $x_n$ represents the n-th character of the name of the protein, $\mathrm{ASCII}(x_n - \mathrm{A})$ represents the difference between the ASCII code of the n-th character of the name of the protein and the ASCII code of the character A, MOD represents the modulo operation, and prime represents the modulus. The value of prime may be set by the user as needed. Preferably, prime is set, according to the average degree (Degree) of the entire protein graph database, to a prime number greater than 2 times the average degree. The index of the interaction relationships between proteins is designed by establishing a hash table (Hash Table) for each protein node to store the information of its adjacent protein nodes.
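The preferred choice of prime can be sketched in a few lines. The text only requires some prime number greater than twice the average degree; picking the smallest such prime is an assumption of this sketch, as are the function names and the star-shaped example graph (FAA4 connected to seven neighbors and no other edges), which is hypothetical.

```python
def is_prime(n):
    """Trial division; adequate for the small moduli discussed here."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def smallest_prime_above(bound):
    """Smallest prime strictly greater than `bound` (bound >= 0)."""
    candidate = int(bound) + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


def choose_prime(adjacency):
    """Pick a modulus from the average degree of a graph given as {node: [neighbors]}."""
    average_degree = sum(len(nbrs) for nbrs in adjacency.values()) / len(adjacency)
    return smallest_prime_above(2 * average_degree)


# Hypothetical star graph: FAA4 in the centre, each neighbor linked only to FAA4.
neighbors = ["FAA1", "FAT1", "GEM1", "HMG1", "PAY2", "PAY3", "PGC1"]
star = {"FAA4": neighbors, **{name: ["FAA4"] for name in neighbors}}
modulus = choose_prime(star)
```

For a graph with an average degree of 50, the same rule gives `smallest_prime_above(100)`, which is 101.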
According to an embodiment of the invention, the method further comprises: in response to a request to query whether a specific node is a neighbor node of a given node, calculating the hash value of the name of the protein represented by the specific node according to the hash function, and querying, at the storage position corresponding to the hash value, whether the name of the protein represented by the specific node exists in the hash table of the given node; if so, returning that the specific node is a neighbor node of the given node; if not, returning that the specific node is not a neighbor node of the given node. Preferably, each protein name stored in the hash table is associated with the storage address of the information (relationship) of the edge between the node corresponding to that protein and the node to which the hash table belongs. Thus, once a neighbor node of a node has been found through the hash table, the storage address of the information of the edge between the two nodes can be obtained through the associated information, which improves index efficiency. With the method of this embodiment, the question of whether an interaction relationship exists between two proteins is converted into the following: for node a, look up in its hash table whether a given node is adjacent to node a. Without this structure, searching for an edge requires traversing from the first edge of node a and checking the edges one by one until a result is returned.
With this method, to query whether a node (say node b) is adjacent to node a, the hash value of node b is calculated directly according to the pre-designed hash function, and the storage position corresponding to that hash value in the hash table of node a is then checked for node b. If node b is found there, node a and node b are adjacent; if not, they are not adjacent. The graph structure therefore does not need to be traversed, which improves the efficiency of edge indexing in a large-scale proteome. Ideally, the time complexity of finding an edge in the hash table of a node is O(1), which is equivalent to reducing the time complexity from O(n) to O(1).
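A per-node neighbor table with chaining (steps S210 to S240) and the adjacency query can be sketched as below. This is an illustration under stated assumptions: `NeighborTable`, `protein_hash`, and the `"edge:..."` address strings are hypothetical names, and the hash function follows the weighting used in the worked FAA4 example of the description (the first character of the name carries the largest weight). A deliberately small modulus of 7 is used so that collisions actually occur.

```python
def protein_hash(name, prime):
    """Weighted sum of (ASCII code - 65); the first character has weight 2**(N-1)."""
    length = len(name)
    total = sum((ord(ch) - ord("A")) * 2 ** (length - 1 - i) for i, ch in enumerate(name))
    return total % prime  # Python's % is non-negative for a positive modulus


class NeighborTable:
    """Hash table of one node's neighbors; each slot holds a chain of entries."""

    def __init__(self, prime=101):
        self.prime = prime
        self.slots = [[] for _ in range(prime)]

    def add(self, neighbor, edge_address):
        # S210: hash the neighbor name. S220-S240: an empty slot stores the entry
        # directly; an occupied slot gets the entry appended at the tail of its chain.
        self.slots[protein_hash(neighbor, self.prime)].append((neighbor, edge_address))

    def edge_address(self, candidate):
        """Storage address of the edge to `candidate`, or None if it is not a neighbor."""
        for name, address in self.slots[protein_hash(candidate, self.prime)]:
            if name == candidate:
                return address
        return None

    def is_neighbor(self, candidate):
        return self.edge_address(candidate) is not None


faa4 = NeighborTable(prime=7)
for neighbor in ["FAA1", "FAT1", "GEM1", "HMG1", "PAY2", "PAY3", "PGC1"]:
    faa4.add(neighbor, edge_address="edge:FAA4-" + neighbor)  # hypothetical address
```

With modulus 7, GEM1 and HMG1 hash to the same slot and are chained in insertion order; a query only scans that one chain instead of every edge of the node.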
According to an example of the present invention, the process of constructing the hash table is described by taking the node corresponding to the protein FAA4 in the subgraph shown in FIG. 2 as an example; the hash tables of the other nodes are constructed in the same way as that of FAA4. Assuming that the user sets the modulus prime to 101, the pre-designed hash function is expressed as
$$H(x) = \left( \sum_{n=1}^{N} \mathrm{ASCII}(x_n - \mathrm{A}) \cdot 2^{\,n-1} \right) \ \mathrm{MOD}\ 101$$
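The ASCII code of the character A is 65 and N is 4, so this corresponds to $H(x) = ((x_4 - 65) \cdot 8 + (x_3 - 65) \cdot 4 + (x_2 - 65) \cdot 2 + (x_1 - 65))\ \mathrm{MOD}\ 101$. In the calculations below, the characters are numbered from the end of the name, so that $x_4$ is the first character (weight 8) and $x_1$ is the last character (weight 1). The process of constructing the hash table corresponding to FAA4 includes: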
calculating H(FAA1) = ((70 - 65) × 8 + (65 - 65) × 4 + (65 - 65) × 2 + (49 - 65)) MOD 101 = 24; the hash table is empty at position 24 at this time, so FAA1 is stored directly at position 24;
calculating H(FAT1) = ((70 - 65) × 8 + (65 - 65) × 4 + (84 - 65) × 2 + (49 - 65)) MOD 101 = 62; the hash table is empty at position 62 at this time, so FAT1 is stored directly at position 62;
calculating H(GEM1) = ((71 - 65) × 8 + (69 - 65) × 4 + (77 - 65) × 2 + (49 - 65)) MOD 101 = 72; the hash table is empty at position 72 at this time, so GEM1 is stored directly at position 72;
calculating H(HMG1) = ((72 - 65) × 8 + (77 - 65) × 4 + (71 - 65) × 2 + (49 - 65)) MOD 101 = 100 MOD 101 = 100; the hash table is empty at position 100 at this time, so HMG1 is stored directly at position 100;
calculating H(PAY2) = ((80 - 65) × 8 + (65 - 65) × 4 + (89 - 65) × 2 + (50 - 65)) MOD 101 = 153 MOD 101 = 52; the hash table is empty at position 52 at this time, so PAY2 is stored directly at position 52;
calculating H(PAY3) = ((80 - 65) × 8 + (65 - 65) × 4 + (89 - 65) × 2 + (51 - 65)) MOD 101 = 154 MOD 101 = 53; the hash table is empty at position 53 at this time, so PAY3 is stored directly at position 53;
calculating H(PGC1) = ((80 - 65) × 8 + (71 - 65) × 4 + (67 - 65) × 2 + (49 - 65)) MOD 101 = 132 MOD 101 = 31; the hash table is empty at position 31 at this time, so PGC1 is stored directly at position 31, thereby obtaining the hash table corresponding to FAA4 shown in FIG. 8.
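The arithmetic of this worked example can be checked with a short Python sketch. The function names `weighted_sum` and `slot_for` are hypothetical; the weighting (first character × 8, then × 4, × 2, × 1 for a four-character name) is taken from the calculations above.

```python
def weighted_sum(name):
    """Sum of (ASCII code - 65) per character, the first character weighted 2**(N-1)."""
    length = len(name)
    return sum((ord(ch) - 65) * 2 ** (length - 1 - i) for i, ch in enumerate(name))


def slot_for(name, prime=101):
    """Storage position of a neighbor name in a table with modulus `prime`."""
    return weighted_sum(name) % prime


neighbors_of_faa4 = ["FAA1", "FAT1", "GEM1", "HMG1", "PAY2", "PAY3", "PGC1"]
sums = {name: weighted_sum(name) for name in neighbors_of_faa4}
slots = {name: slot_for(name) for name in neighbors_of_faa4}

for name in neighbors_of_faa4:
    print(f"H({name}) = {sums[name]} MOD 101 = {slots[name]}")
```

All seven positions are distinct with prime = 101, which is why every neighbor in the example is stored directly without chaining.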
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for proteomic data management based on a graph database, comprising:
acquiring graph data corresponding to the proteome, wherein the graph data comprises a plurality of nodes and edges, the nodes record the proteins represented by the nodes, and the edges record the relationship between two connected nodes;
establishing a bottom-layer doubly linked list according to the graph data corresponding to the proteome, wherein the nodes in the doubly linked list are arranged in the lexicographic order of the names of the proteins represented by the nodes;
starting from the bottom-layer doubly linked list, extracting one node from every two nodes to the index layer above so as to establish a unidirectional index linked list at each index layer, with the index linked list at the top layer having only two nodes, thereby establishing a fast index comprising multiple index layers.
2. The method for proteomic data management based on a graph database of claim 1, wherein the method comprises: in response to a signal that the graph data corresponding to any proteome in the graph database has reached a preset scale, establishing, in the above manner, a fast index in addition to the original inverted index for the proteome that has reached the preset scale.
3. The method for proteomic data management based on a graph database of claim 2, wherein the method further comprises:
when a node corresponding to a new protein is inserted into the graph data corresponding to a proteome for which a fast index has been established, generating a random variable used to determine how the fast index of the graph data corresponding to the proteome is updated, and setting different update modes for the fast index according to the numerical range to which the random variable belongs.
4. The method for proteomic data management based on a graph database according to claim 3, wherein the random variable obeys a geometric distribution with parameter p, where p = 0.5.
5. The method for proteomic data management based on a graph database of claim 4, wherein the numerical ranges comprise:
a first numerical range, which contains only the value 1;
a second numerical range (1, k+1];
a third numerical range (k+1, +∞), where k represents the total number of index layers in the current fast index;
wherein setting different update modes for the fast index according to the numerical range to which the random variable belongs comprises:
when the currently generated random variable belongs to the first numerical range, inserting the node corresponding to the new protein into the bottom-layer doubly linked list without updating the fast index;
when the currently generated random variable belongs to the second numerical range, inserting the node corresponding to the new protein into the bottom-layer doubly linked list, and adding the node corresponding to the inserted new protein to each index layer whose layer number is less than the value of the currently generated random variable;
when the currently generated random variable belongs to the third numerical range, inserting the node corresponding to the new protein into the bottom-layer doubly linked list, adding an index layer at the top, and adding the node corresponding to the inserted new protein to every index layer, wherein the index layer at the top extracts one node from every two nodes of the index layer below it and forms a doubly linked list from the extracted nodes together with the node corresponding to the inserted new protein.
6. The method for proteomic data management based on a graph database of claim 1, wherein the method further comprises:
in response to a request to delete the node corresponding to a protein to be deleted from the graph data corresponding to a proteome for which a fast index has been established, searching for and deleting the nodes corresponding to the protein to be deleted in each index layer in turn, from the top layer of the multi-layer index downward, and then deleting the node corresponding to the protein to be deleted from the bottom-layer doubly linked list.
7. The method for proteome data management based on a graph database according to any one of claims 1 to 6, said method further comprising:
for the node corresponding to each protein in the graph data corresponding to a proteome for which the fast index has been established, establishing a hash table that records the relationships between nodes, calculating, according to a pre-designed hash function, the storage positions of the neighbor nodes of the node in the hash table based on the names of the proteins represented by those neighbor nodes, and storing the names of the proteins represented by the neighbor nodes at the calculated storage positions.
8. The method for proteomic data management based on a graph database of claim 7, wherein the hash function is:
$$H(x) = \left( \sum_{n=1}^{N} \mathrm{ASCII}(x_n - \mathrm{A}) \cdot 2^{\,n-1} \right) \ \mathrm{MOD}\ \mathrm{prime}$$
where x represents the name of the protein, N represents the length of the character string of the name of the protein, $x_n$ represents the n-th character of the name of the protein, $\mathrm{ASCII}(x_n - \mathrm{A})$ represents the difference between the ASCII code of the n-th character of the name of the protein and the ASCII code of the character A, MOD represents the modulo operation, and prime represents the modulus.
9. A computer-readable storage medium, having embodied thereon a computer program, the computer program being executable by a processor to perform the steps of the method of any one of claims 1 to 8.
10. An electronic device, comprising:
one or more processors; and
a memory, wherein the memory is to store one or more executable instructions;
the one or more processors are configured to implement the steps of the method of any one of claims 1-8 via execution of the one or more executable instructions.
CN202010816554.4A 2020-08-14 2020-08-14 Proteome data management method, medium and equipment based on graph database Active CN112116951B (en)

Publications (2)

Publication Number Publication Date
CN112116951A (en) 2020-12-22
CN112116951B (en) 2023-04-07

Family

ID=73804050

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant