WO2024041376A1 - Distributed graph data processing system, method, apparatus, device, and storage medium - Google Patents

Distributed graph data processing system, method, apparatus, device, and storage medium

Info

Publication number
WO2024041376A1
Authority
WO
WIPO (PCT)
Prior art keywords
graph data
data processing
graph
data block
distributed
Prior art date
Application number
PCT/CN2023/111773
Other languages
English (en)
French (fr)
Inventor
陈宏智
林鹏
马林
吴凯
石海洋
刘啸
许伟
宫大伟
张帅
Original Assignee
抖音视界有限公司
脸萌有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 抖音视界有限公司, 脸萌有限公司
Publication of WO2024041376A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Definitions

  • the present disclosure relates to the field of data processing, and in particular, to a distributed graph data processing system, method, device, equipment and storage medium.
  • graph data needs to be stored and managed in the process of performing graph processing tasks such as graph computing and graph mining based on distributed graph data processing systems.
  • In a first aspect, the present disclosure provides a distributed graph data processing system.
  • The system includes computing nodes adopting a distributed architecture. Each computing node includes a computing unit and a persistent storage unit that correspond to each other, and a storage manager is deployed in the computing unit.
  • The computing unit is configured to receive an assigned graph data processing task, obtain the graph data shard corresponding to the graph data processing task, and store the graph data shard in the corresponding graph data adjacency list. The storage manager is configured to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, determine the key corresponding to each graph data block, and record and store the correspondence between the key of each graph data block and the storage location information of that graph data block; the graph data blocks include the node data and edge data in the graph data adjacency list, and the storage location information is used to obtain the graph data block. The persistent storage unit is configured to persistently store the graph data blocks based on the storage location information, so as to support the corresponding computing unit in executing the graph data processing task.
  • In a second aspect, the present disclosure provides a distributed graph data processing method.
  • The method includes: receiving a graph data processing task through a computing node in a distributed architecture, obtaining the graph data shard corresponding to the graph data processing task, and storing the graph data shard in the corresponding graph data adjacency list; controlling the computing unit in the computing node to call the storage manager to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, where the graph data blocks include the node data and edge data in the graph data adjacency list;
  • determining the key corresponding to each graph data block, and recording and storing the correspondence between the key of each graph data block and the storage location information of that graph data block, where the storage location information is used to obtain the graph data block; storing the graph data blocks in a persistent storage unit based on the storage location information; and controlling the computing unit in the computing node to read, based on the correspondence between the keys of the graph data blocks and their storage location information, the graph data blocks corresponding to the graph data processing task from the persistent storage unit for graph data processing.
  • In a third aspect, the present disclosure provides a distributed graph data processing apparatus.
  • The apparatus includes: a first acquisition module configured to receive a graph data processing task through a computing node in a distributed architecture, obtain the graph data shard corresponding to the graph data processing task, and store the graph data shard in the corresponding graph data adjacency list;
  • a first storage module configured to control the computing unit in the computing node to call the storage manager to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, where the graph data blocks include the node data and edge data in the graph data adjacency list, determine the key corresponding to each graph data block, and record and store the correspondence between the key of each graph data block and the storage location information of that graph data block, where the storage location information is used to obtain the graph data block;
  • a second storage module configured to store the graph data blocks in a persistent storage unit based on the storage location information;
  • and a graph data processing module configured to control the computing unit in the computing node to read, based on the correspondence between the keys of the graph data blocks and their storage location information, the graph data blocks corresponding to the graph data processing task from the persistent storage unit for graph data processing.
  • In a fourth aspect, the present disclosure provides a computer-readable storage medium in which instructions are stored. When the instructions are run on a terminal device, the terminal device implements the above method.
  • In a fifth aspect, the present disclosure provides a data processing device, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the above method is implemented.
  • In a sixth aspect, the present disclosure provides a computer program product. The computer program product includes a computer program/instructions; when the computer program/instructions are executed by a processor, the above method is implemented.
  • Figure 1 is a schematic structural diagram of a distributed graph data processing system provided by an embodiment of the present disclosure.
  • Figure 2 is a schematic diagram of a node adjacency list and an edge adjacency list provided by an embodiment of the present disclosure.
  • Figure 3 is a schematic diagram of a search tree based on graph data mining triangles provided by an embodiment of the present disclosure.
  • Figure 4 is a partial architecture diagram of a distributed graph data processing system provided by an embodiment of the present disclosure.
  • Figure 5 is a flow chart of a distributed graph data processing method provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a distributed graph data processing device provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a distributed graph data processing device provided by an embodiment of the present disclosure.
  • FIG. 1 is a schematic structural diagram of a distributed graph data processing system provided by an embodiment of the present disclosure.
  • The distributed graph data processing system 10 includes computing nodes adopting a distributed architecture; for example, the computing node 101, the computing node 102 and the computing node 103 shown in Figure 1 are all computing nodes adopting the distributed architecture.
  • Each computing node includes a computing unit and a persistent storage unit that correspond to each other, and a storage manager is deployed in each computing unit. As shown in Figure 1, taking the computing node 101 as an example, it includes a computing unit 1011 and a persistent storage unit 1012, and a storage manager 1013 is deployed in the computing unit 1011.
  • The computing unit is configured to receive an assigned graph data processing task, obtain the graph data shard corresponding to the graph data processing task, and store the graph data shard in the corresponding graph data adjacency list.
  • As shown in Figure 1, an elastic engine 100 is deployed in the distributed graph data processing system 10; it receives graph data processing tasks issued by users and allocates computing resources and storage resources based on the load of each graph data processing task.
  • After a graph data processing task is assigned to a computing node having a computing unit and a persistent storage unit, the graph data processing task is executed on the assigned computing node. For the computing unit in each computing node, after receiving the assigned task, it first obtains the corresponding graph data shard from a remote file system, such as the remote file system 110 shown in Figure 1, which stores the original graph data corresponding to graph data processing tasks in file format.
  • After the computing unit obtains the graph data shard corresponding to the graph data processing task from the remote file system, it stores the graph data shard in the corresponding graph data adjacency list.
  • the graph data processing tasks in the embodiments of the present disclosure may be graph computing tasks, graph mining tasks, etc.
  • Graph computing tasks can include iterative graph computing tasks based on the Think Like A Vertex (TLAV) programming paradigm, such as the PageRank algorithm, community detection algorithms, and other related graph computing tasks (a minimal TLAV-style sketch follows).
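  • The sketch below illustrates one superstep of a TLAV-style iteration for a PageRank-like computation. It is a minimal illustration only; the vertex-program interface, function names, and damping constant are assumptions for exposition and are not the SDK described by the disclosure.

```python
# Illustrative "Think Like A Vertex" (TLAV) superstep for a PageRank-style
# computation; the interface below is hypothetical, not the disclosure's SDK.
def pagerank_superstep(vertices, out_edges, messages, damping=0.85):
    """Each vertex folds its incoming messages into a new rank and sends
    rank / out_degree to its neighbors for the next round."""
    new_messages = {}
    for v in vertices:
        rank = (1.0 - damping) + damping * sum(messages.get(v, []))
        vertices[v] = rank                          # update vertex state
        targets = out_edges.get(v, [])
        if targets:
            share = rank / len(targets)
            for u in targets:
                new_messages.setdefault(u, []).append(share)
    return new_messages

# Toy usage: three vertices in a cycle, ten supersteps.
vertices = {0: 1.0, 1: 1.0, 2: 1.0}
out_edges = {0: [1], 1: [2], 2: [0]}
messages = {v: [1.0] for v in vertices}
for _ in range(10):
    messages = pagerank_superstep(vertices, out_edges, messages)
```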
  • Graph mining tasks refer to the process of discovering information of interest to users from graph data such as social networks and the deep web.
  • Graph mining results (such as triangles, frequent subgraphs, and graph patterns) can be used directly for graph data analysis or indirectly for other tasks related to graph data; the application scenarios of graph mining in science and business are therefore broad and practical.
  • Both graph computing tasks and graph mining tasks are executed based on the graph data stored in the graph data adjacency list.
  • the graph data in the graph data adjacency list includes node data and edge data.
  • The graph data adjacency list includes a node adjacency table (Vertex table) and an edge adjacency table (Edge table).
  • The node data is stored in the node adjacency table (Vertex table), while the edge data is stored in the edge adjacency table (Edge table), which stores the edges of each node contiguously and preserves the order of the nodes in the node adjacency table.
  • The disclosed embodiments adopt a simple shared-nothing design: in a distributed environment, the node adjacency table is partitioned across the distributed cluster, and each partition is associated with its corresponding edges.
  • To maximize concurrency with load balancing, embodiments of the present disclosure allow system users to override the graph partitioner and partition by the nodes or edges of the graph data to divide the graph data blocks.
  • Figure 2 is a schematic diagram of the node adjacency table and the edge adjacency table; an illustrative layout is sketched below.
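  • The following is a minimal sketch, under illustrative assumptions, of the Vertex table / Edge table layout described above, using a CSR-like encoding in which each node's edges are stored contiguously in the order the nodes appear in the node adjacency table. The class and field names are hypothetical.

```python
# CSR-like sketch of the node adjacency table (Vertex table) and the
# edge adjacency table (Edge table) with contiguous per-vertex edge runs.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GraphShard:
    vertex_ids: List[int] = field(default_factory=list)    # Vertex table
    edge_offsets: List[int] = field(default_factory=list)  # start of each vertex's edge run
    edge_targets: List[int] = field(default_factory=list)  # Edge table (contiguous edges)

    def add_vertex(self, vid: int, neighbors: List[int]) -> None:
        self.vertex_ids.append(vid)
        self.edge_offsets.append(len(self.edge_targets))
        self.edge_targets.extend(neighbors)

    def neighbors(self, index: int) -> List[int]:
        start = self.edge_offsets[index]
        end = (self.edge_offsets[index + 1]
               if index + 1 < len(self.edge_offsets) else len(self.edge_targets))
        return self.edge_targets[start:end]

# Usage: vertex 0 -> {1, 2}, vertex 1 -> {2}.
shard = GraphShard()
shard.add_vertex(0, [1, 2])
shard.add_vertex(1, [2])
assert shard.neighbors(0) == [1, 2]
```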
  • The storage manager is used to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, determine the key corresponding to each graph data block, and record and store the correspondence between the key of each graph data block and the storage location information of that graph data block; the graph data blocks include the node data and edge data in the graph data adjacency list, and the storage location information of a graph data block is used to obtain that graph data block.
  • the storage manager calculates the key corresponding to each graph data block based on preset rules, and the computing unit stores each graph data block in the persistent storage unit to implement key-based graph data storage.
  • the preset rules may be key value generation rules based on hash functions, etc.
  • After calculating the corresponding key for each graph data block, the storage manager records and stores the correspondence between the key of the graph data block and the storage location information of the graph data block, so that the graph data block can subsequently be read from the persistent storage unit based on its key.
  • The persistent storage unit is configured to persistently store the graph data blocks based on the storage location information, to support the computing unit corresponding to the persistent storage unit in executing the graph data processing task.
  • the persistent storage unit can adopt a hierarchical structure of memory and persistent storage media (such as persistent memory Optane Persistent Memory, PMEM), or can also adopt a hierarchical structure of memory, persistent storage media and hard disks.
  • the persistent storage unit is specifically used to store graph data blocks to the corresponding storage layer in descending order of priority of the three-level storage layers of memory, persistent storage media, and hard disk.
  • the hard disk can be a solid state drive (SSD) or the like.
  • Embodiments of the present disclosure can use different storage media to store graph data of different magnitudes. Specifically, when the amount of graph data is small, memory can be used first to achieve better data-read efficiency; for medium-scale graph data, most of the data can be stored in persistent memory (PMEM), which offers large capacity and performance close to that of memory and can therefore meet the storage requirements of somewhat larger graph data; for larger graph data, a larger-capacity hard disk (such as an SSD) can be used, and continuous but infrequent data spills allow the larger graph data to be stored on the hard disk.
  • In some embodiments, the persistent storage unit may be designed as hybrid storage combining memory (also called main memory, such as DRAM), a persistent storage medium (such as PMEM), and a hard disk (such as an SSD).
  • Specifically, because main memory has the fastest access speed, the persistent storage unit provides main memory first; the access speed of the persistent storage medium PMEM is faster than that of the hard disk, so when main memory is occupied the persistent storage unit preferentially provides the persistent storage medium such as PMEM, and when PMEM is occupied it provides the hard disk (such as an SSD), thereby realizing a tiered persistent storage unit. This both guarantees the access speed of the storage space and ensures that the storage space is large enough.
  • In addition, because the embodiments of the present disclosure adopt a tiered persistent storage design, the amount of data that each computing node can process increases, which reduces the overall number of required computing devices and lowers system cost.
  • In practice, for performance, DRAM is used to store metadata, such as the storage indexes of the graph data blocks stored in the PMEM layer and the SSD layer. Unless otherwise specified, a graph data block can first be written to the DRAM layer and then asynchronously written to the PMEM layer; if the PMEM layer has insufficient free resources, the graph data block can be flushed from the PMEM layer to the SSD layer and stored there. A simplified placement policy is sketched below.
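  • Below is a simplified sketch of the tier-placement policy described above (memory first, then persistent storage medium, then hard disk). The Tier class, the capacities, and the key format are illustrative assumptions, not the disclosure's implementation.

```python
# Store a graph data block in the highest-priority tier with free space,
# mirroring the memory > persistent medium > hard disk priority order.
class Tier:
    def __init__(self, name, capacity_bytes):
        self.name = name
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = {}                       # key -> block bytes

    def has_room(self, size):
        return self.used + size <= self.capacity

    def put(self, key, block):
        self.blocks[key] = block
        self.used += len(block)

def store_block(key, block, tiers):
    for tier in tiers:                         # tiers ordered DRAM, PMEM, SSD
        if tier.has_room(len(block)):
            tier.put(key, block)
            return tier.name
    raise RuntimeError("all storage tiers are full")

tiers = [Tier("DRAM", 64), Tier("PMEM", 256), Tier("SSD", 4096)]
print(store_block("kVertex:0", b"x" * 100, tiers))   # falls through to PMEM
```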
  • In other embodiments, the persistent storage unit 102 can be designed as hybrid storage combining memory and a persistent storage medium. Because a persistent storage medium such as PMEM both guarantees access speed and provides storage space far larger than memory, this implementation can likewise guarantee the access speed of the storage space while keeping the storage space large enough.
  • In addition, the computing unit is also configured to read, based on the correspondence between the keys of the graph data blocks stored in the storage manager and the storage location information of the graph data blocks, the graph data blocks corresponding to the graph data processing task from the persistent storage unit for graph data processing.
  • In practice, during execution of a graph data processing task, the computing unit reads the required graph data from the persistent storage unit through the storage manager. Specifically, the storage manager generates a key based on the computing unit's graph data read request, determines the storage location information corresponding to that key, and the corresponding graph data block is then read from the persistent storage unit to obtain the required graph data; a sketch of this read path follows.
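  • The following sketch illustrates the read path just described: a key is derived from the computing unit's read request, the recorded storage location is looked up, and the block is fetched from the persistent storage unit. All names here (index_table, storage_unit, the key format) are assumptions for illustration.

```python
# Read path sketch: request -> key -> storage location -> graph data block.
def read_graph_block(request, index_table, storage_unit):
    """request: e.g. {"kind": "kVertex", "block_id": 3}."""
    key = f'{request["kind"]}:{request["block_id"]}'     # key generated from the request
    location = index_table.get(key)                      # key -> storage location info
    if location is None:
        raise KeyError(f"no graph data block recorded for {key}")
    tier_name, offset = location
    return storage_unit[tier_name][offset]               # read block from that tier

# Usage with toy structures.
index_table = {"kVertex:3": ("PMEM", 0)}
storage_unit = {"PMEM": [b"vertex-block-3"], "SSD": []}
print(read_graph_block({"kind": "kVertex", "block_id": 3}, index_table, storage_unit))
```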
  • the computing unit may include a first computing unit and/or a second computing unit.
  • the first computing unit is used to perform graph computing tasks
  • the second computing unit is used to perform graph mining tasks.
  • The distributed graph data processing system unifies the two programming paradigms of graph computing and graph mining in the same graph data processing system; that is, the distributed graph data processing system 10 can process both graph computing tasks and graph mining tasks.
  • Although graph computing and graph mining engines differ in their underlying computing models, the distributed graph data processing system of the embodiments of the present disclosure provides a unified interface that keeps them consistent externally, for example keeping them consistent with the developer-facing high-level SDK and with low-level components such as tiered storage and graph data indexing.
  • the distributed graph data processing system includes computing nodes adopting a distributed architecture.
  • Each computing node includes a computing unit and a persistent storage unit that correspond to each other.
  • A storage manager is deployed in each computing unit.
  • After the computing unit receives a graph data processing task and obtains the graph data shard corresponding to the task, it stores the graph data shard in the corresponding graph data adjacency list.
  • The storage manager deployed in the computing unit divides the graph data adjacency list into graph data blocks and, after determining the key corresponding to each graph data block, records and stores the correspondence between the key of each graph data block and its storage location information.
  • After the persistent storage unit corresponding to the computing unit persistently stores the graph data blocks based on the storage location information, the computing unit can read graph data blocks from the persistent storage unit, based on the correspondence between the keys of the graph data blocks stored in the storage manager and the storage location information, to execute the graph data processing task.
  • It can be seen that the distributed graph data processing system provided by the embodiments of the present disclosure uses the storage manager to store graph data in the persistent storage unit. Because the persistent storage unit serves as external storage, it can meet the storage requirements of graph data with large data volumes while having little impact on the efficiency of graph data processing. The distributed graph data processing system provided by the embodiments of the present disclosure can therefore meet the storage resource requirements of graph data processing.
  • On the basis of the above embodiments, the computing unit generates intermediate state data during execution of the graph data processing task. To support the fault tolerance function of the distributed graph data processing system, the intermediate state data corresponding to the graph data processing task can also be persistently stored.
  • In some embodiments, the storage manager is also used to receive the intermediate state data generated during execution of the graph data processing task, divide the intermediate state data into intermediate state data blocks, determine the key corresponding to each intermediate state data block, and store the correspondence between each key and the storage location information of the corresponding intermediate state data block; the storage location information of an intermediate state data block is used to obtain that intermediate state data block.
  • the persistent storage unit is also used to store the intermediate state data block based on the storage location information, and is used to support the fault tolerance function of the distributed graph data processing system.
  • In the embodiments of the present disclosure, the intermediate state data generated during execution of the graph data processing task can be stored in the persistent storage unit based on the tiered storage scheme; for details, refer to the above embodiments.
  • Assuming the graph data processing task is a graph computing task, the intermediate state data corresponding to the graph data processing task includes message data and intermediate state snapshot data, where the intermediate state snapshot data includes node snapshot data and message snapshot data.
  • In practice, because graph computing tasks are usually iterative, to support fault tolerance the embodiments of the present disclosure can obtain and store the intermediate state data corresponding to the graph data processing task at a preset frequency, for example obtaining the intermediate state data once every 100 rounds of the graph computing task. When an abnormality is later detected in the distributed graph data processing system, the graph data processing task can be rolled back, based on the latest round of intermediate state data, to limit losses. The embodiments of the present disclosure do not restrict how fault tolerance is implemented based on the intermediate state data.
  • Assuming the graph data processing task is a graph mining task, the intermediate state data corresponding to the graph data processing task includes the current growth level in the search tree corresponding to the graph data processing task and the current growth position in that growth level.
  • FIG. 3 is a schematic diagram of a search tree based on graph data mining triangles provided by an embodiment of the present disclosure.
  • In practice, a graph mining task is usually a process of growing a search tree. Taking Figure 3 as an example, when mining triangles from graph data, the nodes at the first level (the nodes corresponding to Level-1 in Figure 3) are determined first; node growth is then performed on the first-level nodes to obtain the second-level nodes (the nodes corresponding to Level-2 in Figure 3); and node growth is performed on the second-level nodes to obtain the third-level nodes (the nodes corresponding to Level-3 in Figure 3).
  • By judging whether the third-level nodes satisfy the triangle condition, the result of the graph mining task is determined.
  • During execution of the above graph mining task, the intermediate state data includes the current growth level in the search tree corresponding to the graph data processing task and the current growth position in that growth level.
  • Suppose the intermediate state data is recorded when the search tree grows to growth position 301 of the second level. The current growth level contained in the intermediate state data at this moment is the second level, and the current growth position within that level is growth position 301. Because growth position 301 is the third growth position of the second level, the intermediate state data can be recorded as (Level-2; 3); a minimal encoding is sketched below.
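  • A minimal sketch of recording and parsing the (current growth level; current growth position) pair, e.g. (Level-2; 3) in the example above, is given below; the checkpoint string format is an assumption for illustration only.

```python
# Record / parse the search-tree position so a graph mining task can resume.
def record_mining_checkpoint(level: int, position: int) -> str:
    """Encode the current search-tree state, e.g. (Level-2; 3)."""
    return f"(Level-{level}; {position})"

def parse_mining_checkpoint(checkpoint: str) -> tuple:
    body = checkpoint.strip("()")
    level_part, position_part = body.split(";")
    return int(level_part.split("-")[1]), int(position_part)

snapshot = record_mining_checkpoint(2, 3)      # growth position 301 in the example
assert parse_mining_checkpoint(snapshot) == (2, 3)
```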
  • The distributed graph data processing system provided by the embodiments of the present disclosure can use the storage manager to store graph data in the persistent storage unit, safeguarding the storage resources required for graph data processing along the two dimensions of storage space and processing efficiency.
  • It can also use the storage manager to persistently store the intermediate state data generated during execution of graph data processing tasks, supporting the fault tolerance function of the distributed graph data processing system. The distributed graph data processing system provided by the embodiments of the present disclosure can therefore meet the storage resource requirements of graph data processing and supports rolling back graph data processing tasks to limit losses.
  • To aid understanding of the computing units, persistent storage units, and storage managers in the distributed graph data processing system, the embodiments of the present disclosure provide a partial architecture diagram of the distributed graph data processing system; refer to Figure 4.
  • The distributed graph data processing system may include both a computing unit 401 for executing graph computing tasks and a computing unit 402 for executing graph mining tasks.
  • The embodiments of the present disclosure set up one storage manager for each computing unit in the distributed graph data processing system, decoupling the computing units from the persistent storage units and making it possible to flexibly extend the storage media available between the application layer (i.e., the computing units) and the storage layer (i.e., the persistent storage units).
  • the computing unit 401 is deployed with a storage manager 403
  • the computing unit 402 is deployed with a storage manager 404.
  • The storage manager has two important parts: a rule-based index manager and an index table. The rule-based index manager is used to define the index type; the index type is then used as a prefix and combined with an index value to obtain the complete key, which uniquely identifies a graph data block or an intermediate state data block.
  • The index table can be implemented as a concurrent hash table and is used to store the correspondence between keys and the storage addresses of graph data blocks and intermediate state data blocks; a sketch of both parts follows.
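  • The sketch below illustrates the two parts just named: a rule-based index manager that builds a key from an index-type prefix and an index value, and an index table mapping keys to storage addresses. A dictionary guarded by a lock stands in for the concurrent hash table; all names are illustrative.

```python
# Index manager (key = "<index_type>:<index_value>") plus a thread-safe index table.
import threading

class IndexManager:
    def make_key(self, index_type: str, index_value: int) -> str:
        return f"{index_type}:{index_value}"

class IndexTable:
    def __init__(self):
        self._lock = threading.Lock()
        self._table = {}                       # key -> storage address

    def put(self, key, address):
        with self._lock:
            self._table[key] = address

    def get(self, key):
        with self._lock:
            return self._table.get(key)

manager, table = IndexManager(), IndexTable()
key = manager.make_key("kEdge", 7)
table.put(key, ("PMEM", 0x2000))               # record key -> location
assert table.get("kEdge:7") == ("PMEM", 0x2000)
```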
  • Based on the index manager and the index table, graph data blocks and intermediate state data blocks can be stored in the persistent storage units.
  • As shown in Figure 4, the computing unit 401 can store graph data blocks and intermediate state data blocks in the persistent storage unit 405.
  • the computing unit 402 may store the graph data block and the intermediate state data block in the persistent storage unit 406.
  • Each persistent storage unit can store graph data blocks and intermediate state data blocks based on a hierarchical structure.
  • the persistent storage unit with a hierarchical structure can be understood with reference to the above embodiments, and will not be described again here.
  • embodiments of the present disclosure also provide a distributed graph data processing method.
  • Figure 5 is a flow chart of a distributed graph data processing method provided by an embodiment of the present disclosure.
  • the distributed graph data processing method provided by the embodiment of the present disclosure can be applied to the computing unit in the distributed graph data processing system.
  • the computing unit is mounted with a persistent storage unit.
  • The method includes: S501: Receive a graph data processing task through a computing node in a distributed architecture, obtain the graph data shard corresponding to the graph data processing task, and store the graph data shard in the corresponding graph data adjacency list.
  • The graph data adjacency list includes the graph data shard corresponding to the graph data processing task, and the graph data blocks include the node data and edge data in the graph data adjacency list.
  • In the embodiments of the present disclosure, after receiving the assigned graph data processing task, the computing unit obtains the graph data shard corresponding to the task from the remote file system, stores the graph data shard in a graph data adjacency list, and divides the graph data adjacency list storing the graph data shard into graph data blocks.
  • the specific dividing method is not limited in this disclosed embodiment.
  • the computing unit in the embodiment of the present disclosure is deployed with a storage manager, and the storage manager divides the graph data adjacency list corresponding to the graph data processing task to obtain graph data blocks.
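  • The sketch below shows one possible way of dividing an adjacency-list shard into graph data blocks containing node data and edge data. Because the disclosure does not limit the specific dividing method, the fixed vertices-per-block policy and the data layout here are illustrative assumptions.

```python
# One possible division of an adjacency-list shard into graph data blocks.
def divide_into_blocks(shard, vertices_per_block=2):
    """shard exposes vertex_ids (list) and neighbors(index) -> list of targets."""
    blocks = []
    for start in range(0, len(shard.vertex_ids), vertices_per_block):
        stop = min(start + vertices_per_block, len(shard.vertex_ids))
        indices = range(start, stop)
        blocks.append({
            "vertices": [shard.vertex_ids[i] for i in indices],                   # node data
            "edges": {shard.vertex_ids[i]: shard.neighbors(i) for i in indices},  # edge data
        })
    return blocks

# Usage with a tiny stub shard: vertex 0 -> {1, 2}, vertex 1 -> {2}, vertex 2 -> {}.
class _StubShard:
    vertex_ids = [0, 1, 2]
    def neighbors(self, index):
        return [[1, 2], [2], []][index]

print(divide_into_blocks(_StubShard()))
```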
  • S502: Control the computing unit in the computing node to call the storage manager to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, where the graph data blocks include the node data and edge data in the graph data adjacency list; determine the key corresponding to each graph data block, and record and store the correspondence between the key of each graph data block and the storage location information of that graph data block.
  • the storage location information of the graph data block is used to obtain the graph data block.
  • In the embodiments of the present disclosure, after the graph data blocks are obtained by division, the key corresponding to each graph data block is determined.
  • The key corresponding to each graph data block is composed of a prefix and the graph data block ID.
  • The prefix can be an index type, and the index type can be determined by the type of data stored in the graph data block: for example, the index type corresponding to node data is kVertex, and the index type corresponding to edge data can be kEdge.
  • The graph data block ID is appended after the prefix to distinguish different graph data blocks of the same index type.
  • In addition, based on the index-type prefix, operations such as reading all graph data blocks of the same index type can be performed, improving the efficiency of graph data reads, as sketched below.
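  • The helper below illustrates how an index-type prefix such as kVertex or kEdge allows all graph data blocks of one type to be enumerated in a single pass; the "prefix:block_id" key format is an assumption consistent with the text, not a specified format.

```python
# Enumerate all graph data blocks whose key carries a given index-type prefix.
def blocks_with_prefix(index_table: dict, prefix: str):
    """Yield (key, location) pairs for every block whose key starts with prefix."""
    for key, location in index_table.items():
        if key.startswith(prefix + ":"):
            yield key, location

index_table = {"kVertex:0": ("DRAM", 0), "kVertex:1": ("PMEM", 64), "kEdge:0": ("SSD", 0)}
vertex_blocks = dict(blocks_with_prefix(index_table, "kVertex"))
assert set(vertex_blocks) == {"kVertex:0", "kVertex:1"}
```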
  • In addition, to facilitate subsequent reads of graph data blocks, the correspondence between the keys of the graph data blocks and the storage location information can be persistently stored. Based on this correspondence, the storage location information corresponding to a graph data block can be determined from its key, and the graph data in the block can then be read based on that storage location information.
  • In some embodiments, a key can be generated for each graph data block based on a hash function and used as the storage index of the graph data block.
  • the embodiment of the present disclosure does not limit other ways of generating key values.
  • S503: Store the graph data blocks in a persistent storage unit based on the storage location information.
  • S504: Control the computing unit in the computing node to read, based on the correspondence between the keys of the graph data blocks and the storage location information of the graph data blocks, the graph data blocks corresponding to the graph data processing task from the persistent storage unit for graph data processing.
  • the storage unit used to persistently store graph data blocks can be implemented based on a hierarchical structure.
  • The hierarchical structure can use memory, a persistent storage medium such as PMEM, and a hard disk, or it can use memory and a persistent storage medium such as PMEM.
  • The storage priority of memory is higher than that of the persistent storage medium, and the storage priority of the persistent storage medium is higher than that of the hard disk.
  • In some embodiments, when the hierarchical structure is implemented with memory and a persistent storage medium such as PMEM, after receiving a graph data block the embodiments of the present disclosure preferentially use memory to store the graph data block based on the storage location information; if the free resources in memory are determined to be insufficient, the persistent storage medium is used to store the graph data block.
  • In addition, when the hierarchical structure is implemented with memory, a persistent storage medium such as PMEM, and a hard disk, after receiving a graph data block the embodiments of the present disclosure preferentially use memory to store the graph data block based on the storage location information; if the free resources in memory are determined to be insufficient, the persistent storage medium is used to store the graph data block; and if the free resources in the persistent storage medium are determined to be insufficient, the hard disk is used to store the graph data block.
  • The distributed graph data processing method provided by the embodiments of the present disclosure stores graph data in a persistent storage unit that can meet the storage requirements of graph data with large data volumes while having little impact on the efficiency of graph data processing.
  • The method provided by the disclosed embodiments can therefore meet the storage resource requirements of graph data processing.
  • On the basis of the above embodiments, to support the fault tolerance function of the distributed graph data processing method, the embodiments of the present disclosure can also persistently store the intermediate state data generated during execution of the graph data processing task for subsequent use in fault-tolerant handling by the distributed graph data processing method.
  • In practice, the computing unit in the computing node is controlled to call the storage manager to divide the intermediate state data generated during execution of the graph data processing task into intermediate state data blocks; the key corresponding to each intermediate state data block is then determined, and the correspondence between each key and the storage location information of the corresponding intermediate state data block is stored, where the storage location information of an intermediate state data block is used to obtain that intermediate state data block; the intermediate state data blocks are then persistently stored based on the storage location information.
  • intermediate state data blocks can be stored in key-value form on persistent memory PMEM.
  • embodiments of the present disclosure can perform fault-tolerant processing on the distributed graph data processing system based on the intermediate state data blocks in the persistent storage unit.
  • the persistent storage unit can be implemented using a hierarchical structure.
  • The hierarchical structure can include memory, a persistent storage medium such as PMEM, and a hard disk, or include memory and a persistent storage medium such as PMEM, with memory having the highest storage priority and the hard disk the lowest.
  • In some embodiments, because the access speed of PMEM is faster than that of the hard disk, intermediate state data blocks are preferentially stored in PMEM when PMEM has free storage resources; when PMEM has no free storage resources, the intermediate state data blocks are stored on the hard disk (such as an SSD), realizing tiered storage.
  • the graph data processing task may be a graph computing task or a graph mining task.
  • Assuming the graph data processing task is a graph computing task, the intermediate state data corresponding to the graph data processing task includes message data and intermediate state snapshot data, and the intermediate state snapshot data includes node snapshot data and message snapshot data.
  • Message data refers to the messages transmitted between the current computing node and other computing nodes during the execution of the graph computing task, and is used for the execution of the graph computing task of the current computing node.
  • The intermediate state snapshot data refers to the snapshot data captured from the current round's node data and message data every N rounds of the graph computing task, and is used by the fault tolerance function to roll back and limit losses; N can be set as required. A sketch of this snapshot cadence follows.
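  • Below is a hedged sketch of taking an intermediate state snapshot every N rounds of an iterative graph computing task; N, the snapshot contents, and the persist callback are illustrative assumptions.

```python
# Take an intermediate-state snapshot (node + message snapshot data) every n rounds.
def run_with_snapshots(num_rounds, step, persist, n=100):
    """step(round) returns (node_state, messages); persist() writes a snapshot."""
    node_state, messages = None, None
    for r in range(1, num_rounds + 1):
        node_state, messages = step(r)
        if r % n == 0:                                   # e.g. every 100 rounds
            persist({"round": r,
                     "node_snapshot": node_state,        # node snapshot data
                     "message_snapshot": messages})      # message snapshot data
    return node_state

# Usage with stub step/persist functions.
snapshots = []
run_with_snapshots(300,
                   step=lambda r: ({"rank": r}, [f"msg@{r}"]),
                   persist=snapshots.append,
                   n=100)
assert [s["round"] for s in snapshots] == [100, 200, 300]
```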
  • Assuming the graph data processing task is a graph mining task, the intermediate state data corresponding to the graph data processing task includes the current growth level in the search tree and the current growth position in that growth level.
  • Executing a graph mining task is a process of growing a search tree; to provide fault tolerance during execution, the computing unit obtains, at a preset frequency, the current growth level of the search tree and the current growth position within that level, which the fault tolerance function uses to roll back and limit losses.
  • Based on the above system and method embodiments, the present disclosure also provides a distributed graph data processing apparatus; Figure 6 is a schematic structural diagram of the apparatus provided by an embodiment of the present disclosure.
  • The apparatus includes: a first acquisition module 601, configured to receive a graph data processing task through a computing node in a distributed architecture, obtain the graph data shard corresponding to the graph data processing task, and store the graph data shard in the corresponding graph data adjacency list; a first storage module 602, configured to control the computing unit in the computing node to call the storage manager to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, where the graph data blocks include the node data and edge data in the graph data adjacency list, determine the key corresponding to each graph data block, and record and store the correspondence between the key of each graph data block and the storage location information of that graph data block, where the storage location information is used to obtain the graph data block; a second storage module 603, configured to store the graph data blocks in a persistent storage unit based on the storage location information; and a graph data processing module 604, configured to control the computing unit in the computing node to read, based on the correspondence between the keys of the graph data blocks and their storage location information, the graph data blocks corresponding to the graph data processing task from the persistent storage unit for graph data processing.
  • In some embodiments, the apparatus further includes: a dividing module, configured to control the computing unit in the computing node to call the storage manager to divide the intermediate state data generated during execution of the graph data processing task into intermediate state data blocks; a third storage module, configured to determine the key corresponding to each intermediate state data block and to record and store the correspondence between the key of each intermediate state data block and the storage location information of that intermediate state data block, where the storage location information of an intermediate state data block is used to obtain that intermediate state data block; and a fourth storage module, configured to store the intermediate state data blocks in the persistent storage unit based on the storage location information.
  • In some embodiments, the graph data processing task is a graph computing task, the intermediate state data corresponding to the graph data processing task includes message data and intermediate state snapshot data, and the intermediate state snapshot data includes node snapshot data and message snapshot data.
  • the message data is calculation result data transmitted between computing nodes during the execution of the graph data processing task
  • the intermediate state snapshot data is used to resume execution of the graph data processing task.
  • In some embodiments, the graph data processing task is a graph mining task, and the intermediate state data corresponding to the graph data processing task includes the current growth level in the search tree corresponding to the graph data processing task and the current growth position in that growth level.
  • In some embodiments, the second storage module includes: a first storage sub-module, configured to preferentially use memory to store the graph data block based on the storage location information after the graph data block is received; a second storage sub-module, configured to use the persistent storage medium to store the graph data block if the free resources in memory are determined to be insufficient; and a third storage sub-module, configured to use the hard disk to store the graph data block if the free resources in the persistent storage medium are determined to be insufficient.
  • the second storage module includes: a fourth storage sub-module, configured to preferentially use memory to store the graph data block based on the storage location information after receiving the graph data block;
  • the fifth storage submodule is configured to use a persistent storage medium to store the graph data block if it is determined that the free resources in the memory are insufficient.
  • the persistent storage medium includes persistent memory.
  • The distributed graph data processing apparatus provided by the embodiments of the present disclosure can not only use the storage manager to store graph data in the persistent storage unit, safeguarding the storage resources required for graph data processing along the two dimensions of storage space and processing efficiency, but can also use the storage manager to persistently store the intermediate state data generated during execution of graph data processing tasks, supporting the fault tolerance function of the distributed graph data processing system. The distributed graph data processing apparatus provided by the embodiments of the present disclosure can therefore meet the storage resource requirements of graph data processing and supports rolling back graph data processing tasks to limit losses.
  • In addition to the above method and apparatus, embodiments of the present disclosure also provide a computer-readable storage medium in which instructions are stored; when the instructions are run on a terminal device, the terminal device implements the distributed graph data processing method described in the embodiments of the present disclosure.
  • An embodiment of the present disclosure also provides a computer program product.
  • the computer program product includes a computer program/instruction.
  • the computer program/instruction is executed by a processor, the distributed graph data processing method described in the embodiment of the present disclosure is implemented.
  • the embodiment of the present disclosure also provides a distributed graph data processing device, as shown in FIG. 7 , which may include: a processor 701, a memory 702, an input device 703, and an output device 704.
  • the number of processors 701 in the distributed graph data processing device may be one or more. In FIG. 7 , one processor is taken as an example.
  • the processor 701, the memory 702, the input device 703, and the output device 704 may be connected through a bus or other means, wherein the connection through the bus is taken as an example in FIG. 7 .
  • the memory 702 can be used to store software programs and modules.
  • the processor 701 executes various functional applications and data processing of the distributed graph data processing device by running the software programs and modules stored in the memory 702 .
  • the memory 702 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, at least one application program required for a function, and the like.
  • memory 702 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
  • The input device 703 may be used to receive input numeric or character information and to generate signal input related to user settings and function control of the distributed graph data processing device.
  • Specifically, in this embodiment, the processor 701 loads the executable files corresponding to the processes of one or more application programs into the memory 702 according to the following instructions, and the processor 701 runs the application programs stored in the memory 702, thereby realizing the various functions of the above distributed graph data processing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a distributed graph data processing system, method, apparatus, device, and storage medium. The system includes computing nodes adopting a distributed architecture; each computing node includes a computing unit and a persistent storage unit that correspond to each other, and a storage manager is deployed in each computing unit. The storage manager divides the graph data adjacency list into graph data blocks and, after determining the key corresponding to each graph data block, records and stores the correspondence between the key of each graph data block and its storage location information. After the persistent storage unit persistently stores the graph data blocks based on the storage location information, the computing unit can read graph data blocks from the persistent storage unit, based on the correspondence between the keys of the graph data blocks stored in the storage manager and the storage location information, to execute graph data processing tasks.

Description

Distributed graph data processing system, method, apparatus, device, and storage medium
Cross-Reference to Related Applications
The present disclosure is based on and claims priority to Chinese Application No. 202211010841.1, filed on August 23, 2022, the disclosure of which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to the field of data processing, and in particular to a distributed graph data processing system, method, apparatus, device, and storage medium.
Background
In the era of big data, applications such as social networks, the Internet of Things, and e-commerce generate large amounts of data, which are organized in graph format to form graph data. Distributed graph data processing systems have been proposed to process large-scale graph data efficiently.
Because graph data is large in volume and demanding on storage resources, the graph data needs to be stored and managed in the process of performing graph processing tasks such as graph computing and graph mining based on a distributed graph data processing system.
Summary
In a first aspect, the present disclosure provides a distributed graph data processing system. The system includes: computing nodes adopting a distributed architecture, where each computing node includes a computing unit and a persistent storage unit that correspond to each other, and a storage manager is deployed in the computing unit; the computing unit is configured to receive an assigned graph data processing task, obtain the graph data shard corresponding to the graph data processing task, and store the graph data shard in the corresponding graph data adjacency list; the storage manager is configured to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, determine the key corresponding to each graph data block, and record and store the correspondence between the key of each graph data block and the storage location information of that graph data block, where the graph data blocks include the node data and edge data in the graph data adjacency list and the storage location information of a graph data block is used to obtain that graph data block; and the persistent storage unit is configured to persistently store the graph data blocks based on the storage location information, to support the computing unit corresponding to the persistent storage unit in executing the graph data processing task.
In a second aspect, the present disclosure provides a distributed graph data processing method. The method includes: receiving a graph data processing task through a computing node in a distributed architecture, obtaining the graph data shard corresponding to the graph data processing task, and storing the graph data shard in the corresponding graph data adjacency list; controlling the computing unit in the computing node to call the storage manager to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, where the graph data blocks include the node data and edge data in the graph data adjacency list, determining the key corresponding to each graph data block, and recording and storing the correspondence between the key of each graph data block and the storage location information of that graph data block, where the storage location information of a graph data block is used to obtain that graph data block; storing the graph data blocks in a persistent storage unit based on the storage location information; and controlling the computing unit in the computing node to read, based on the correspondence between the keys of the graph data blocks and the storage location information of the graph data blocks, the graph data blocks corresponding to the graph data processing task from the persistent storage unit for graph data processing.
In a third aspect, the present disclosure provides a distributed graph data processing apparatus. The apparatus includes: a first acquisition module, configured to receive a graph data processing task through a computing node in a distributed architecture, obtain the graph data shard corresponding to the graph data processing task, and store the graph data shard in the corresponding graph data adjacency list; a first storage module, configured to control the computing unit in the computing node to call the storage manager to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, where the graph data blocks include the node data and edge data in the graph data adjacency list, determine the key corresponding to each graph data block, and record and store the correspondence between the key of each graph data block and the storage location information of that graph data block, where the storage location information of a graph data block is used to obtain that graph data block; a second storage module, configured to store the graph data blocks in a persistent storage unit based on the storage location information; and a graph data processing module, configured to control the computing unit in the computing node to read, based on the correspondence between the keys of the graph data blocks and the storage location information of the graph data blocks, the graph data blocks corresponding to the graph data processing task from the persistent storage unit for graph data processing.
In a fourth aspect, the present disclosure provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a terminal device, the terminal device implements the above method.
In a fifth aspect, the present disclosure provides a data processing device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the above method is implemented.
In a sixth aspect, the present disclosure provides a computer program product. The computer program product includes a computer program/instructions; when executed by a processor, the computer program/instructions implement the above method.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
To describe the technical solutions in the embodiments of the present disclosure or in the related art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the related art. Apparently, a person of ordinary skill in the art may still derive other drawings from these drawings without creative effort.
Figure 1 is a schematic structural diagram of a distributed graph data processing system provided by an embodiment of the present disclosure.
Figure 2 is a schematic diagram of a node adjacency table and an edge adjacency table provided by an embodiment of the present disclosure.
Figure 3 is a schematic diagram of a search tree for mining triangles from graph data provided by an embodiment of the present disclosure.
Figure 4 is a partial architecture diagram of a distributed graph data processing system provided by an embodiment of the present disclosure.
Figure 5 is a flow chart of a distributed graph data processing method provided by an embodiment of the present disclosure.
Figure 6 is a schematic structural diagram of a distributed graph data processing apparatus provided by an embodiment of the present disclosure.
Figure 7 is a schematic structural diagram of a distributed graph data processing device provided by an embodiment of the present disclosure.
Detailed Description
To enable a clearer understanding of the above objects, features, and advantages of the present disclosure, the solutions of the present disclosure are further described below. It should be noted that, in the absence of conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a full understanding of the present disclosure, but the present disclosure may also be implemented in ways other than those described here; obviously, the embodiments in the specification are only some, not all, of the embodiments of the present disclosure.
An embodiment of the present disclosure provides a distributed graph data processing system. Figure 1 is a schematic structural diagram of a distributed graph data processing system provided by an embodiment of the present disclosure. The distributed graph data processing system 10 includes computing nodes adopting a distributed architecture; for example, the computing node 101, the computing node 102, and the computing node 103 shown in Figure 1 are all computing nodes adopting the distributed architecture. Each computing node includes a computing unit and a persistent storage unit that correspond to each other, and a storage manager is deployed in each computing unit. As shown in Figure 1, taking the computing node 101 as an example, it includes a computing unit 1011 and a persistent storage unit 1012, and a storage manager 1013 is deployed in the computing unit 1011.
The computing unit is configured to receive an assigned graph data processing task, obtain the graph data shard corresponding to the graph data processing task, and store the graph data shard in the corresponding graph data adjacency list.
As shown in Figure 1, an elastic engine 100 is deployed in the distributed graph data processing system 10 to receive graph data processing tasks issued by users and to allocate computing resources, storage resources, and the like based on the load of each graph data processing task. After a graph data processing task is assigned to a computing node having a computing unit and a persistent storage unit, the graph data processing task is executed on the assigned computing node.
For the computing unit in each computing node, after receiving the assigned graph data processing task, it first obtains the graph data shard corresponding to the task from a remote file system. The remote file system 110 shown in Figure 1 is used to store, in file format, the original graph data corresponding to graph data processing tasks.
In the embodiments of the present disclosure, after the computing unit obtains the graph data shard corresponding to the graph data processing task from the remote file system, it stores the graph data shard in the corresponding graph data adjacency list.
The graph data processing tasks in the embodiments of the present disclosure may be graph computing tasks, graph mining tasks, and the like.
Graph computing tasks may include iterative graph computing tasks based on the Think Like A Vertex (TLAV) programming paradigm, such as the PageRank algorithm, community detection algorithms, and other related graph computing tasks.
Graph mining tasks refer to the process of discovering information of interest to users from graph data such as social networks and the deep web. Graph mining results (such as triangles, frequent subgraphs, and graph patterns) can be used directly for graph data analysis or indirectly for other tasks related to graph data; the application scenarios of graph mining in science and business are therefore broad and practical.
Both graph computing tasks and graph mining tasks are executed based on the graph data stored in the graph data adjacency list, and the graph data in the adjacency list includes node data and edge data. The graph data adjacency list includes a node adjacency table (Vertex table) and an edge adjacency table (Edge table): the node data is stored in the Vertex table, while the edge data is stored in the Edge table, which stores the edges of each node contiguously and preserves the order of the nodes in the Vertex table. The embodiments of the present disclosure adopt a simple shared-nothing design: in a distributed environment, the node adjacency table is partitioned across the distributed cluster, and each partition is associated with its corresponding edges. To maximize concurrency with load balancing, the embodiments of the present disclosure allow system users to override the graph partitioner and partition by the nodes or edges of the graph data to divide the graph data blocks. Figure 2 is a schematic diagram of the node adjacency table and the edge adjacency table.
The storage manager is configured to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, determine the key corresponding to each graph data block, and record and store the correspondence between the key of each graph data block and the storage location information of that graph data block; the graph data blocks include the node data and edge data in the graph data adjacency list, and the storage location information of a graph data block is used to obtain that graph data block.
The storage manager calculates the key corresponding to each graph data block based on a preset rule, and the computing unit stores each graph data block in the persistent storage unit, realizing key-based graph data storage. The preset rule may be, for example, a key generation rule based on a hash function.
After calculating the corresponding key for each graph data block, the storage manager records and stores the correspondence between the key of the graph data block and the storage location information of the graph data block, so that the graph data block can subsequently be read from the persistent storage unit based on its key.
In addition, the persistent storage unit is configured to persistently store the graph data blocks based on the storage location information, to support the computing unit corresponding to the persistent storage unit in executing the graph data processing task.
The persistent storage unit may adopt a tiered structure of memory and a persistent storage medium (such as Optane Persistent Memory, PMEM), or a tiered structure of memory, a persistent storage medium, and a hard disk. Accordingly, the persistent storage unit is specifically configured to store graph data blocks in the corresponding storage tier in descending order of priority of the three storage tiers of memory, persistent storage medium, and hard disk. The hard disk may be a solid state drive (SSD) or the like.
The embodiments of the present disclosure can use different storage media to store graph data of different magnitudes. Specifically, when the amount of graph data is small, memory can be used first to achieve better data-read efficiency; for medium-scale graph data, most of the data can be stored in persistent memory (PMEM), which offers large capacity and performance close to that of memory and can therefore meet the storage requirements of somewhat larger graph data; for larger graph data, a larger-capacity hard disk (such as an SSD) can be used, and continuous but infrequent data spills allow the larger graph data to be stored on the hard disk.
In some embodiments, the persistent storage unit may be designed as hybrid storage combining memory (also called main memory, such as DRAM), a persistent storage medium (such as PMEM), and a hard disk (such as an SSD). Specifically, because main memory has the fastest access speed, the persistent storage unit provides main memory first; the access speed of the persistent storage medium PMEM is faster than that of the hard disk, so when main memory is occupied the persistent storage unit preferentially provides the persistent storage medium such as PMEM, and when PMEM is occupied it provides the hard disk (such as an SSD), thereby realizing a tiered persistent storage unit. This both guarantees the access speed of the storage space and ensures that the storage space is large enough. In addition, because the embodiments of the present disclosure adopt a tiered persistent storage design, the amount of data that each computing node can process increases, which reduces the overall number of required computing devices and lowers system cost.
In practice, for performance, DRAM is used to store metadata, such as the storage indexes of the graph data blocks stored in the PMEM layer and the SSD layer. Unless otherwise specified, a graph data block can first be written to the DRAM layer and then asynchronously written to the PMEM layer; if the PMEM layer has insufficient free resources, the graph data block can be flushed from the PMEM layer to the SSD layer and stored there.
In other embodiments, the persistent storage unit 102 may be designed as hybrid storage combining memory and a persistent storage medium. Because a persistent storage medium such as PMEM both guarantees access speed and provides storage space far larger than memory, this implementation can likewise guarantee the access speed of the storage space while keeping the storage space large enough.
In addition, the computing unit is further configured to read, based on the correspondence between the keys of the graph data blocks stored in the storage manager and the storage location information of the graph data blocks, the graph data blocks corresponding to the graph data processing task from the persistent storage unit for graph data processing.
In practice, during execution of a graph data processing task, the computing unit reads the required graph data from the persistent storage unit through the storage manager for the execution of the graph data processing task. Specifically, the storage manager generates a key based on the computing unit's graph data read request, determines the storage location information corresponding to that key, and the corresponding graph data block is then read from the persistent storage unit for the computing unit, thereby obtaining the required graph data.
The computing unit may include a first computing unit and/or a second computing unit, where the first computing unit is used to execute graph computing tasks and the second computing unit is used to execute graph mining tasks.
The distributed graph data processing system provided by the embodiments of the present disclosure unifies the two programming paradigms of graph computing and graph mining in the same graph data processing system; that is, the distributed graph data processing system 10 can process both graph computing tasks and graph mining tasks. Although graph computing and graph mining engines differ in their underlying computing models, the distributed graph data processing system of the embodiments of the present disclosure provides a unified interface that keeps them consistent externally, for example keeping them consistent with the developer-facing high-level SDK and with low-level components such as tiered storage and graph data indexing.
The distributed graph data processing system provided by the embodiments of the present disclosure includes computing nodes adopting a distributed architecture; each computing node includes a computing unit and a persistent storage unit that correspond to each other, and a storage manager is deployed in each computing unit. After the computing unit receives a graph data processing task and obtains the graph data shard corresponding to the task, it stores the graph data shard in the corresponding graph data adjacency list. The storage manager deployed in the computing unit divides the graph data adjacency list into graph data blocks and, after determining the key corresponding to each graph data block, records and stores the correspondence between the key of each graph data block and its storage location information. After the persistent storage unit corresponding to the computing unit persistently stores the graph data blocks based on the storage location information, the computing unit can read graph data blocks from the persistent storage unit, based on the correspondence between the keys of the graph data blocks stored in the storage manager and the storage location information, to execute the graph data processing task.
It can be seen that the distributed graph data processing system provided by the embodiments of the present disclosure uses the storage manager to store graph data in the persistent storage unit. Because the persistent storage unit serves as external storage, it can meet the storage requirements of graph data with large data volumes while having little impact on the efficiency of graph data processing. The distributed graph data processing system provided by the embodiments of the present disclosure can therefore meet the storage resource requirements of graph data processing.
On the basis of the above embodiments, the computing unit generates intermediate state data during execution of the graph data processing task. To support the fault tolerance function of the distributed graph data processing system, the intermediate state data corresponding to the graph data processing task can also be persistently stored.
In some embodiments, the storage manager is further configured to receive the intermediate state data generated during execution of the graph data processing task, divide the intermediate state data into intermediate state data blocks, determine the key corresponding to each intermediate state data block, and store the correspondence between each key and the storage location information of the corresponding intermediate state data block; the storage location information of an intermediate state data block is used to obtain that intermediate state data block.
The persistent storage unit is further configured to store the intermediate state data blocks based on the storage location information, to support the fault tolerance function of the distributed graph data processing system.
In the embodiments of the present disclosure, the intermediate state data generated during execution of the graph data processing task can be stored in the persistent storage unit based on the tiered storage scheme; for details, refer to the above embodiments.
Assuming the graph data processing task is a graph computing task, the intermediate state data corresponding to the graph data processing task includes message data and intermediate state snapshot data, and the intermediate state snapshot data includes node snapshot data and message snapshot data.
In practice, because graph computing tasks are usually iterative, to support fault tolerance the embodiments of the present disclosure can obtain and store the intermediate state data corresponding to the graph data processing task at a preset frequency, for example obtaining the intermediate state data once every 100 rounds of the graph computing task. When an abnormality is later detected in the distributed graph data processing system, the graph data processing task can be rolled back, based on the latest round of intermediate state data, to limit losses. In some embodiments, the present disclosure does not restrict how fault tolerance is implemented based on the intermediate state data.
Assuming the graph data processing task is a graph mining task, the intermediate state data corresponding to the graph data processing task includes the current growth level in the search tree corresponding to the graph data processing task and the current growth position in that growth level.
In practice, a graph mining task is usually a process of growing a search tree. Taking Figure 3 as an example, Figure 3 is a schematic diagram of a search tree for mining triangles from graph data provided by an embodiment of the present disclosure. When mining triangles from graph data, the nodes at the first level (the nodes corresponding to Level-1 in Figure 3) are determined first; node growth is then performed on the first-level nodes to obtain the second-level nodes (the nodes corresponding to Level-2 in Figure 3); and node growth is then performed on the second-level nodes to obtain the third-level nodes (the nodes corresponding to Level-3 in Figure 3). By judging whether the third-level nodes satisfy the triangle condition, the result of the graph mining task is determined.
During execution of the above graph mining task, the intermediate state data includes the current growth level in the search tree corresponding to the graph data processing task and the current growth position in that growth level. Suppose the intermediate state data is recorded when the search tree grows to growth position 301 of the second level: the current growth level contained in the intermediate state data at this moment is the second level, and the current growth position within that level is growth position 301. Because growth position 301 is the third growth position of the second level, the intermediate state data can be recorded as (Level-2; 3).
The distributed graph data processing system provided by the embodiments of the present disclosure can not only use the storage manager to store graph data in the persistent storage unit, safeguarding the storage resources required for graph data processing along the two dimensions of storage space and processing efficiency, but can also use the storage manager to persistently store the intermediate state data generated during execution of graph data processing tasks, supporting the fault tolerance function of the distributed graph data processing system. The distributed graph data processing system provided by the embodiments of the present disclosure can therefore meet the storage resource requirements of graph data processing and supports rolling back graph data processing tasks to limit losses.
To facilitate further understanding of the computing units, persistent storage units, and storage managers in the distributed graph data processing system provided by the embodiments of the present disclosure, the embodiments of the present disclosure provide a partial architecture diagram of the distributed graph data processing system; refer to Figure 4.
The distributed graph data processing system may include both a computing unit 401 for executing graph computing tasks and a computing unit 402 for executing graph mining tasks. The embodiments of the present disclosure set up one storage manager for each computing unit in the distributed graph data processing system, decoupling the computing units from the persistent storage units and making it possible to flexibly extend the storage media available between the application layer (i.e., the computing units) and the storage layer (i.e., the persistent storage units). As shown in Figure 4, the computing unit 401 is deployed with a storage manager 403, and the computing unit 402 is deployed with a storage manager 404.
The storage manager has two important parts: a rule-based index manager and an index table. The rule-based index manager is used to define the index type; the index type is then used as a prefix and combined with an index value to obtain the complete key, which uniquely identifies a graph data block or an intermediate state data block. The index table can be implemented as a concurrent hash table and is used to store the correspondence between keys and the storage addresses of graph data blocks and intermediate state data blocks.
Based on the index manager and the index table, graph data blocks and intermediate state data blocks can be stored in the persistent storage units. As shown in Figure 4, the computing unit 401 can store graph data blocks and intermediate state data blocks in the persistent storage unit 405, and the computing unit 402 can store graph data blocks and intermediate state data blocks in the persistent storage unit 406. Each persistent storage unit can store graph data blocks and intermediate state data blocks based on a tiered structure. In some embodiments, the persistent storage unit with a tiered structure can be understood with reference to the above embodiments and is not described again here.
Based on the above understanding of the distributed graph data processing system, an embodiment of the present disclosure also provides a distributed graph data processing method. Figure 5 is a flow chart of a distributed graph data processing method provided by an embodiment of the present disclosure.
In some embodiments, the distributed graph data processing method provided by the embodiments of the present disclosure can be applied to a computing unit in the distributed graph data processing system, the computing unit being mounted with a persistent storage unit. The method includes: S501: Receive a graph data processing task through a computing node in a distributed architecture, obtain the graph data shard corresponding to the graph data processing task, and store the graph data shard in the corresponding graph data adjacency list.
The graph data adjacency list includes the graph data shard corresponding to the graph data processing task, and the graph data blocks include the node data and edge data in the graph data adjacency list.
In the embodiments of the present disclosure, after receiving the assigned graph data processing task, the computing unit obtains the graph data shard corresponding to the task from the remote file system, stores the graph data shard in a graph data adjacency list, and divides the graph data adjacency list storing the graph data shard into graph data blocks; the specific dividing method is not limited by the embodiments of the present disclosure.
The computing unit in the embodiments of the present disclosure is deployed with a storage manager, and the storage manager divides the graph data adjacency list corresponding to the graph data processing task to obtain the graph data blocks.
S502: Control the computing unit in the computing node to call the storage manager to divide the graph data adjacency list corresponding to the graph data processing task into graph data blocks, where the graph data blocks include the node data and edge data in the graph data adjacency list; determine the key corresponding to each graph data block, and record and store the correspondence between the key of each graph data block and the storage location information of that graph data block.
The storage location information of a graph data block is used to obtain that graph data block.
In the embodiments of the present disclosure, after the graph data blocks are obtained by division, the key corresponding to each graph data block is determined. The key corresponding to each graph data block is composed of a prefix and the graph data block ID. The prefix can be an index type, and the index type can be determined by the type of data stored in the graph data block: for example, the index type corresponding to node data is kVertex, and the index type corresponding to edge data can be kEdge. The graph data block ID is appended after the prefix to distinguish different graph data blocks of the same index type. In addition, based on the index-type prefix, operations such as reading graph data blocks of the same index type can be performed, improving the efficiency of graph data reads.
In addition, to facilitate subsequent reads of graph data blocks, the correspondence between the keys of the graph data blocks and the storage location information can be persistently stored. Based on this correspondence, the storage location information corresponding to a graph data block can be determined from its key, and the graph data in the block can then be read based on that storage location information.
In some embodiments, a key can be generated for each graph data block based on a hash function and used as the storage index of the graph data block. The embodiments of the present disclosure do not limit other ways of generating keys.
S503: Store the graph data blocks in a persistent storage unit based on the storage location information.
S504: Control the computing unit in the computing node to read, based on the correspondence between the keys of the graph data blocks and the storage location information of the graph data blocks, the graph data blocks corresponding to the graph data processing task from the persistent storage unit for graph data processing.
The storage unit used to persistently store graph data blocks can be implemented as a tiered structure. For example, the tiered structure can use memory, a persistent storage medium such as PMEM, and a hard disk, or it can use memory and a persistent storage medium such as PMEM, where the storage priority of memory is higher than that of the persistent storage medium and the storage priority of the persistent storage medium is higher than that of the hard disk.
In some embodiments, when the tiered structure is implemented with memory and the persistent storage medium PMEM, after receiving a graph data block the embodiments of the present disclosure preferentially use memory to store the graph data block based on the storage location information; if the free resources in memory are determined to be insufficient, the persistent storage medium is used to store the graph data block.
In addition, when the tiered structure is implemented with memory, a persistent storage medium such as PMEM, and a hard disk, after receiving a graph data block the embodiments of the present disclosure preferentially use memory to store the graph data block based on the storage location information; if the free resources in memory are determined to be insufficient, the persistent storage medium is used to store the graph data block; and if the free resources in the persistent storage medium are determined to be insufficient, the hard disk is used to store the graph data block.
The distributed graph data processing method provided by the embodiments of the present disclosure stores graph data in a persistent storage unit that can meet the storage requirements of graph data with large data volumes while having little impact on the efficiency of graph data processing; the distributed graph data processing system provided by the embodiments of the present disclosure can therefore meet the storage resource requirements of graph data processing.
On the basis of the above embodiments, to support the fault tolerance function of the distributed graph data processing method, the embodiments of the present disclosure can also persistently store the intermediate state data generated during execution of the graph data processing task for subsequent use in fault-tolerant handling by the distributed graph data processing method.
In practice, the computing unit in the computing node is controlled to call the storage manager to divide the intermediate state data generated during execution of the graph data processing task into intermediate state data blocks; the key corresponding to each intermediate state data block is then determined, and the correspondence between each key and the storage location information of the corresponding intermediate state data block is stored, where the storage location information of an intermediate state data block is used to obtain that intermediate state data block; the intermediate state data blocks are then persistently stored based on the storage location information. For example, intermediate state data blocks can be stored in key-value form on persistent memory (PMEM).
In addition, the embodiments of the present disclosure can perform fault-tolerant processing on the distributed graph data processing system based on the intermediate state data blocks in the persistent storage unit.
In some embodiments, the persistent storage unit can be implemented as a tiered structure; the tiered structure can include memory, a persistent storage medium such as PMEM, and a hard disk, or include memory and a persistent storage medium such as PMEM, with memory having the highest storage priority and the hard disk the lowest.
In some embodiments, because the access speed of PMEM is faster than that of the hard disk, intermediate state data blocks are preferentially stored in PMEM when PMEM has free storage resources; when PMEM has no free storage resources, the intermediate state data blocks are stored on the hard disk (such as an SSD), realizing tiered storage.
In some embodiments, the graph data processing task may be a graph computing task or a graph mining task.
Assuming the graph data processing task is a graph computing task, the intermediate state data corresponding to the graph data processing task includes message data and intermediate state snapshot data, and the intermediate state snapshot data includes node snapshot data and message snapshot data. The message data refers to the messages transmitted between the current computing node and other computing nodes during execution of the graph computing task and is used for the execution of the graph computing task on the current computing node. The intermediate state snapshot data refers to the snapshot data captured from the current round's node data and message data every N rounds of the graph computing task, and is used by the fault tolerance function to roll back and limit losses; N can be set as required.
Assuming the graph data processing task is a graph mining task, the intermediate state data corresponding to the graph data processing task includes the current growth level in the search tree and the current growth position in that growth level. Executing a graph mining task is a process of growing a search tree; to provide fault tolerance during execution of the graph mining task, the computing unit obtains, at a preset frequency, the current growth level of the search tree and the current growth position within that level, which the fault tolerance function uses to roll back and limit losses.
Based on the above system and method embodiments, the present disclosure further provides a distributed graph data processing apparatus. Referring to FIG. 6, a schematic structural diagram of a distributed graph data processing apparatus provided by an embodiment of the present disclosure, the apparatus includes: a first acquisition module 601, configured to receive a graph data processing task through a computing node in a distributed architecture, obtain the graph data shard corresponding to the graph data processing task, and store the graph data shard in a corresponding graph data adjacency list; a first storage module 602, configured to control the computing unit in the computing node to invoke a storage manager to partition the graph data adjacency list corresponding to the graph data processing task into graph data blocks, where the graph data blocks include node data and edge data from the graph data adjacency list, determine the key corresponding to each graph data block, and record and store the correspondence between the key of each graph data block and the storage location information of that block, where the storage location information of a graph data block is used to retrieve that block; a second storage module 603, configured to store the graph data blocks into a persistent storage unit based on the storage location information; and a graph data processing module 604, configured to control the computing unit in the computing node to read, based on the correspondence between the keys of the graph data blocks and their storage location information, the graph data blocks corresponding to the graph data processing task from the persistent storage unit and perform graph data processing.
In some embodiments, the apparatus further includes: a partitioning module, configured to control the computing unit in the computing node to invoke the storage manager to partition the intermediate state data generated during execution of the graph data processing task into intermediate state data blocks; a third storage module, configured to determine the key corresponding to each intermediate state data block and record and store the correspondence between the key of each intermediate state data block and the storage location information of that block, where the storage location information of an intermediate state data block is used to retrieve that block; and a fourth storage module, configured to store the intermediate state data blocks into the persistent storage unit based on the storage location information.
In some embodiments, the graph data processing task is a graph computing task, and the intermediate state data corresponding to the graph data processing task includes message data and intermediate state snapshot data, where the intermediate state snapshot data includes node snapshot data and message snapshot data, the message data is the computation result data transmitted between computing nodes during execution of the graph data processing task, and the intermediate state snapshot data is used for resuming execution of the graph data processing task.
In some embodiments, the graph data processing task is a graph mining task, and the intermediate state data corresponding to the graph data processing task includes the current growth level in the search tree corresponding to the graph data processing task and the current growth position within that level.
In some embodiments, the second storage module includes: a first storage submodule, configured to, after a graph data block is received, preferentially store the graph data block in memory based on the storage location information; a second storage submodule, configured to store the graph data block in a persistent storage medium if it is determined that memory does not have enough free resources; and a third storage submodule, configured to store the graph data block on a hard disk if it is determined that the persistent storage medium does not have enough free resources.
In some embodiments, the second storage module includes: a fourth storage submodule, configured to, after a graph data block is received, preferentially store the graph data block in memory based on the storage location information; and a fifth storage submodule, configured to store the graph data block in a persistent storage medium if it is determined that memory does not have enough free resources.
In some embodiments, the persistent storage medium includes persistent memory.
The distributed graph data processing apparatus provided by the embodiments of the present disclosure can not only use the storage manager to store graph data in a persistent storage unit, safeguarding the storage resources required for graph data processing in terms of both storage capacity and processing efficiency, but can also use the storage manager to persistently store the intermediate state data generated during execution of the graph data processing task, thereby supporting the fault-tolerance function of the distributed graph data processing system. Therefore, the distributed graph data processing apparatus provided by the embodiments of the present disclosure can meet the storage resource requirements of graph data processing and supports rollback and loss mitigation for graph data processing tasks.
In addition to the above method and apparatus, the embodiments of the present disclosure further provide a computer-readable storage medium storing instructions that, when executed on a terminal device, cause the terminal device to implement the distributed graph data processing method described in the embodiments of the present disclosure.
The embodiments of the present disclosure further provide a computer program product including a computer program/instructions that, when executed by a processor, implement the distributed graph data processing method described in the embodiments of the present disclosure.
In addition, the embodiments of the present disclosure further provide a distributed graph data processing device. As shown in FIG. 7, the device may include a processor 701, a memory 702, an input apparatus 703, and an output apparatus 704. There may be one or more processors 701 in the distributed graph data processing device; one processor is taken as an example in FIG. 7. In some embodiments of the present disclosure, the processor 701, memory 702, input apparatus 703, and output apparatus 704 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 7.
The memory 702 may be used to store software programs and modules, and the processor 701 executes the various functional applications and data processing of the distributed graph data processing device by running the software programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application required by at least one function, and the like. In addition, the memory 702 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The input apparatus 703 may be used to receive input numeric or character information and to generate signal inputs related to user settings and function control of the distributed graph data processing device.
Specifically, in this embodiment, the processor 701 loads the executable files corresponding to the processes of one or more application programs into the memory 702 according to the following instructions, and runs the application programs stored in the memory 702, thereby implementing the various functions of the above distributed graph data processing device.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any of their variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above are only specific embodiments of the present disclosure, enabling those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure is not to be limited to the embodiments described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (20)

  1. A distributed graph data processing system, comprising:
    a computing node in a distributed architecture, the computing node including a computing unit and a corresponding persistent storage unit, wherein a storage manager is deployed in the computing unit;
    the computing unit is configured to receive an assigned graph data processing task, obtain a graph data shard corresponding to the graph data processing task, and store the graph data shard in a corresponding graph data adjacency list;
    the storage manager is configured to partition the graph data adjacency list corresponding to the graph data processing task into graph data blocks, determine a key corresponding to each graph data block, and record and store a correspondence between the key of each graph data block and storage location information of that graph data block, wherein the graph data blocks include node data and edge data from the graph data adjacency list, and the storage location information of a graph data block is used to retrieve that graph data block; and
    the persistent storage unit is configured to persistently store the graph data blocks based on the storage location information, so as to support the computing unit corresponding to the persistent storage unit in executing the graph data processing task.
  2. The distributed graph data processing system according to claim 1, wherein:
    the storage manager is further configured to receive intermediate state data generated during execution of the graph data processing task, partition the intermediate state data into intermediate state data blocks, determine a key corresponding to each intermediate state data block, and store a correspondence between the key and storage location information of the intermediate state data block, wherein the storage location information of an intermediate state data block is used to retrieve that intermediate state data block; and
    the persistent storage unit is further configured to store the intermediate state data blocks based on the storage location information.
  3. The distributed graph data processing system according to claim 2, wherein the graph data processing task is a graph computing task, the intermediate state data corresponding to the graph data processing task includes message data and intermediate state snapshot data, the intermediate state snapshot data includes node snapshot data and message snapshot data, the message data is computation result data transmitted between computing nodes during execution of the graph data processing task, and the intermediate state snapshot data is used for resuming execution of the graph data processing task.
  4. The distributed graph data processing system according to claim 2 or 3, wherein the graph data processing task is a graph mining task, and the intermediate state data corresponding to the graph data processing task includes a current growth level in a search tree corresponding to the graph data processing task and a current growth position within the current growth level.
  5. The distributed graph data processing system according to any one of claims 1 to 4, wherein the persistent storage unit adopts a tiered structure of memory, a persistent storage medium, and a hard disk; and
    the persistent storage unit is configured to store the graph data blocks into the corresponding storage tier in descending order of priority of the three storage tiers: memory, persistent storage medium, and hard disk.
  6. The distributed graph data processing system according to any one of claims 1 to 5, wherein the persistent storage unit adopts a tiered structure of memory and a persistent storage medium; and
    the persistent storage unit is configured to, after receiving a graph data block, preferentially store it in the memory, and, when the memory does not have enough free resources, store the graph data block in the persistent storage medium.
  7. The distributed graph data processing system according to claim 5 or 6, wherein the persistent storage medium includes persistent memory.
  8. The distributed graph data processing system according to any one of claims 1 to 7, wherein the computing unit includes at least one of a first computing unit and a second computing unit, and the graph data processing task includes at least one of a graph computing task and a graph mining task;
    the first computing unit is configured to execute the graph computing task; and
    the second computing unit is configured to execute the graph mining task.
  9. A distributed graph data processing method, comprising:
    receiving a graph data processing task through a computing node in a distributed architecture, obtaining a graph data shard corresponding to the graph data processing task, and storing the graph data shard in a corresponding graph data adjacency list;
    controlling a computing unit in the computing node to invoke a storage manager to partition the graph data adjacency list corresponding to the graph data processing task into graph data blocks, wherein the graph data blocks include node data and edge data from the graph data adjacency list, determining a key corresponding to each graph data block, and recording and storing a correspondence between the key of each graph data block and storage location information of that graph data block, wherein the storage location information of a graph data block is used to retrieve that graph data block;
    storing the graph data blocks into a persistent storage unit based on the storage location information; and
    controlling the computing unit in the computing node to read, based on the correspondence between the keys of the graph data blocks and the storage location information of the graph data blocks, the graph data blocks corresponding to the graph data processing task from the persistent storage unit and perform graph data processing.
  10. The distributed graph data processing method according to claim 9, further comprising:
    controlling the computing unit in the computing node to invoke the storage manager to partition intermediate state data generated during execution of the graph data processing task into intermediate state data blocks;
    determining a key corresponding to each intermediate state data block, and recording and storing a correspondence between the key of each intermediate state data block and storage location information of that intermediate state data block, wherein the storage location information of an intermediate state data block is used to retrieve that intermediate state data block; and
    storing the intermediate state data blocks into the persistent storage unit based on the storage location information.
  11. The distributed graph data processing method according to claim 10, wherein the graph data processing task is a graph computing task, the intermediate state data corresponding to the graph data processing task includes message data and intermediate state snapshot data, the intermediate state snapshot data includes node snapshot data and message snapshot data, the message data is computation result data transmitted between computing nodes during execution of the graph data processing task, and the intermediate state snapshot data is used for resuming execution of the graph data processing task.
  12. The distributed graph data processing method according to claim 10 or 11, wherein the graph data processing task is a graph mining task, and the intermediate state data corresponding to the graph data processing task includes a current growth level in a search tree corresponding to the graph data processing task and a current growth position within the current growth level.
  13. The distributed graph data processing method according to any one of claims 9 to 12, wherein storing the graph data blocks into the persistent storage unit based on the storage location information comprises:
    after a graph data block is received, preferentially storing the graph data block in memory based on the storage location information;
    when it is determined that the memory does not have enough free resources, storing the graph data block in a persistent storage medium; and
    when it is determined that the persistent storage medium does not have enough free resources, storing the graph data block on a hard disk.
  14. The distributed graph data processing method according to any one of claims 9 to 13, wherein storing the graph data blocks into the persistent storage unit based on the storage location information comprises:
    after a graph data block is received, preferentially storing the graph data block in memory based on the storage location information; and
    when it is determined that the memory does not have enough free resources, storing the graph data block in a persistent storage medium.
  15. The distributed graph data processing method according to claim 13 or 14, wherein the persistent storage medium includes persistent memory.
  16. A distributed graph data processing apparatus, comprising:
    a first acquisition module, configured to receive a graph data processing task through a computing node in a distributed architecture, obtain a graph data shard corresponding to the graph data processing task, and store the graph data shard in a corresponding graph data adjacency list;
    a first storage module, configured to control a computing unit in the computing node to invoke a storage manager to partition the graph data adjacency list corresponding to the graph data processing task into graph data blocks, wherein the graph data blocks include node data and edge data from the graph data adjacency list, determine a key corresponding to each graph data block, and record and store a correspondence between the key of each graph data block and storage location information of that graph data block, wherein the storage location information of a graph data block is used to retrieve that graph data block;
    a second storage module, configured to store the graph data blocks into a persistent storage unit based on the storage location information; and
    a graph data processing module, configured to control the computing unit in the computing node to read, based on the correspondence between the keys of the graph data blocks and the storage location information of the graph data blocks, the graph data blocks corresponding to the graph data processing task from the persistent storage unit and perform graph data processing.
  17. A computer-readable storage medium storing instructions that, when executed on a terminal device, cause the terminal device to implement the distributed graph data processing method according to any one of claims 9 to 15.
  18. A distributed graph data processing device, comprising:
    a memory; and
    a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the distributed graph data processing method according to any one of claims 9 to 15.
  19. A computer program product comprising a computer program/instructions that, when executed by a processor, implement the distributed graph data processing method according to any one of claims 9 to 15.
  20. A computer program, comprising:
    instructions that, when executed by a processor, cause the processor to perform the distributed graph data processing method according to any one of claims 9 to 15.
PCT/CN2023/111773 2022-08-23 2023-08-08 Distributed graph data processing system, method, apparatus, device and storage medium WO2024041376A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211010841.1A CN115391341A (zh) 2022-08-23 2022-08-23 Distributed graph data processing system, method, apparatus, device and storage medium
CN202211010841.1 2022-08-23

Publications (1)

Publication Number Publication Date
WO2024041376A1 true WO2024041376A1 (zh) 2024-02-29

Family

ID=84119769

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/111773 WO2024041376A1 (zh) 2022-08-23 2023-08-08 Distributed graph data processing system, method, apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN115391341A (zh)
WO (1) WO2024041376A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115391341A (zh) * 2022-08-23 2022-11-25 抖音视界有限公司 分布式图数据处理系统、方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522428A (zh) * 2018-09-17 2019-03-26 华中科技大学 External memory access method for a graph computing system based on index positioning
CN110737659A (zh) * 2019-09-06 2020-01-31 平安科技(深圳)有限公司 Graph data storage and query method, apparatus, and computer-readable storage medium
CN111881326A (zh) * 2020-07-24 2020-11-03 北京浪潮数据技术有限公司 Graph data storage method, apparatus, device, and readable storage medium
CN114329096A (zh) * 2022-03-03 2022-04-12 深圳市一号互联科技有限公司 Native graph database processing method and system
CN115391341A (zh) * 2022-08-23 2022-11-25 抖音视界有限公司 Distributed graph data processing system, method, apparatus, device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336808B (zh) * 2013-06-25 2017-12-15 中国科学院信息工程研究所 Real-time graph data processing system and method based on the BSP model
CN103885856B (zh) * 2014-03-10 2017-01-25 北京大学 Graph computing fault-tolerance method and system based on a message regeneration mechanism
CN105426375B (zh) * 2014-09-22 2019-01-18 阿里巴巴集团控股有限公司 Relationship network computing method and apparatus
CN106777351B (zh) * 2017-01-17 2019-08-09 中国人民解放军国防科学技术大学 Distributed graph storage and computing system based on the ART tree, and method thereof
CN110751275B (zh) * 2019-08-03 2022-09-02 北京达佳互联信息技术有限公司 Graph training system, data access method and apparatus, electronic device, and storage medium


Also Published As

Publication number Publication date
CN115391341A (zh) 2022-11-25

Similar Documents

Publication Publication Date Title
US20200257450A1 (en) Data hierarchical storage and hierarchical query method and apparatus
US11693789B2 (en) System and method for mapping objects to regions
US10356150B1 (en) Automated repartitioning of streaming data
CN110134335B (zh) 一种基于键值对的rdf数据管理方法、装置及存储介质
JP2014500544A (ja) オブジェクトモデルとkey−valueモデル間のマッピング
JP2013541083A (ja) 重複排除に基づくストレージシステムにおけるスケーラブル参照管理のためのシステム及び方法
US11809382B2 (en) System and method for supporting versioned objects
WO2024041376A1 (zh) 分布式图数据处理系统、方法、装置、设备及存储介质
US11294931B1 (en) Creating replicas from across storage groups of a time series database
US10747739B1 (en) Implicit checkpoint for generating a secondary index of a table
US11520759B2 (en) Processing time series metrics data
US20240061712A1 (en) Method, apparatus, and system for creating training task on ai training platform, and medium
US11429311B1 (en) Method and system for managing requests in a distributed system
US9395930B2 (en) Information processing system, control method of information processing system, and recording medium
CN111930684A (zh) 基于hdfs的小文件处理方法、装置、设备及存储介质
CN108121807B (zh) Hadoop环境下多维索引结构OBF-Index的实现方法
CN112307272B (zh) 确定对象之间关系信息的方法、装置、计算设备及存储介质
US11860835B1 (en) Efficient drop column requests in a non-relational data store
US11340964B2 (en) Systems and methods for efficient management of advanced functions in software defined storage systems
CN113946542A (zh) 数据处理方法以及装置
US11249952B1 (en) Distributed storage of data identifiers
CN113687935A (zh) 一种基于超融合设计的云原生存储调度方式
US11586608B1 (en) Handling requests to access separately stored items in a non-relational database
US11550760B1 (en) Time-based partitioning to avoid in-place updates for data set copies
US11816088B2 (en) Method and system for managing cross data source data access requests

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23856466

Country of ref document: EP

Kind code of ref document: A1