CN117112692A - Mixed distributed graph data storage and calculation method


Info

Publication number
CN117112692A
Authority
CN
China
Prior art keywords
graph
data
transaction
distributed
query
Prior art date
Legal status
Pending
Application number
CN202310889520.1A
Other languages
Chinese (zh)
Inventor
岳丽军
管东林
李勇
孙煜飞
贾丽
唐琳
文叙菠
张宁燕
陈超
谢德晓
张平
产世兵
Current Assignee
Unit 91977 of PLA
Original Assignee
Unit 91977 of PLA
Application filed by Unit 91977 of PLA
Priority to CN202310889520.1A
Publication of CN117112692A


Classifications

    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G06F16/182 Distributed file systems
    • G06F16/9024 Graphs; Linked lists
    • G06F16/90335 Query processing
    • G06F9/466 Transaction processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hybrid distributed graph data storage and calculation method, which comprises the following steps: performing container orchestration and scheduling so as to shield applications from the influence of the underlying software and hardware infrastructure; fusing transaction locks with a high-performance graph storage structure; resolving resource conflicts when transaction and analysis tasks run simultaneously through a scheduling strategy, so as to satisfy the transaction-analysis hybrid processing function and performance of the graph database; accelerating mainstream graph algorithms by selecting an optimized and improved parallel processing model through coding optimization, graph computation fault tolerance and an adaptive model; decomposing queries developed in openCypher into multiple parallel tasks, and performing plan optimization, distributed parallel execution optimization and deep query optimization; performing high-performance cross-graph-model queries and cross-cluster queries; and establishing a graph database prototype system integrating hybrid transaction analysis and completing test verification. The method realizes storage of graph data at the trillion-node, trillion-edge scale, and provides underlying storage support for high-performance graph queries, transaction processing and high-throughput graph data loading.

Description

Mixed distributed graph data storage and calculation method
Technical Field
The invention relates to the technical field of data storage, in particular to a hybrid distributed graph data storage and calculation method.
Background
Aiming at the growing demand for complex and efficient multi-source heterogeneous information and knowledge data and intelligence analysis business scenarios, and against the national strategic background of autonomy and controllability, this work mainly addresses problems such as weak support for and extension of a unified standard graph development language, insufficient timeliness of massive graph data analysis, the lack of a unified hybrid graph transaction-analysis computing system, and cross-domain, cross-cluster graph query. It covers research contents such as the architecture of a distributed graph database for domestic heterogeneous computing platforms, distributed graph database storage management, distributed graph database computing architecture, hybrid transaction support, and heterogeneous graph federated query, and breaks through key technologies such as container-based heterogeneous resource orchestration and scheduling, high-performance distributed graph data storage, replica consistency, distributed graph query optimization, and graph algorithm acceleration. A high-performance distributed transaction-analysis hybrid graph data prototype system is developed, achieving highly reliable storage of trillion-node, trillion-edge graph data, high-performance hybrid graph transaction-analysis processing, and standard openCypher V9 language and extension support, and reaching technical indicators such as more than 3000 ops on the LDBC-SNB benchmark (SF-300) and second-level response for six-degree relationship queries over typical open-source person-relationship data. The distributed hybrid graph data prototype system is deployed in a domestic operating system and hardware environment and is verified through LDBC benchmark tests, graph database functional tests, and typical open-source intelligence analysis scenario applications, providing high-performance graph analysis capability for mainstream graph application systems.
Research on distributed graph database storage management focuses on breaking through high-performance distributed graph data storage technology, solving problems such as power-law distribution, data skew and attribute heterogeneity in large-scale graph storage, and providing efficient underlying storage support for random reads in graph computation, transaction processing and skewed query loads.
Disclosure of Invention
The invention aims to provide a mixed distributed graph data storage and calculation method which is used for solving the problems in the prior art.
In order to achieve the above object, the hybrid distributed graph data storage and calculation method of the present invention includes:
step 1, centering on a distributed graph architecture for domestic heterogeneous computing platforms, mainstream graph database evaluation benchmarks, mainstream graph database architectures, cloud-oriented hybrid load scheduling and benchmark testing, carrying out container orchestration and scheduling based on the cloud-native technologies Docker and Kubernetes, and shielding applications from the influence of the software and hardware infrastructure;
step 2, carrying out large-scale graph data storage management on the basis of the mainstream graph database architecture, and fusing transaction locks with the high-performance graph storage structure on the basis of graph storage structure multi-copy, replica consistency and online graph data disaster recovery;
step 3, carrying out transaction-analysis integration based on the high-performance graph storage structure, replica consistency and read-only copies, generating graph snapshots without copying, and providing graph data satisfying transaction semantics for graph analysis; based on transaction-analysis resource control, resolving resource conflicts when transaction and analysis tasks run simultaneously through a scheduling strategy, so as to satisfy the transaction-analysis hybrid processing function and performance of the graph database;
step 4, carrying out distributed graph computation for the distributed graph structure; on the basis of mainstream graph algorithm capability analysis and algorithm invocation support, carrying out parallel graph analysis based on the GAS graph programming model, and accelerating mainstream graph algorithms by selecting an optimized parallel processing model through coding optimization, graph computation fault tolerance and an adaptive model;
step 5, optimizing the distributed graph query based on openCypher; through openCypher compilation, openCypher language extension and stored-procedure support, all queries can be developed in openCypher, decomposed into multiple parallel tasks and executed in a distributed manner across the cluster; performing plan optimization, distributed parallel execution optimization and deep query optimization, and applying various optimization strategies during logical execution plan generation and physical execution, so as to provide technical support for high-performance queries of the graph database;
step 6, based on the heterogeneous graph query federation technique, performing high-performance cross-graph-model queries and cross-cluster queries around heterogeneous graph unified metadata, heterogeneous graph model query and cross-cluster query distribution, and providing a unified view for queries over heterogeneous graphs whose vertices and/or edges differ in type and in the number and types of attributes;
and step 7, carrying out graph database prototype software design and test verification based on the graph database architecture, establishing a graph database prototype system integrating hybrid transaction analysis and completing test verification, including software architecture design, core function design, software function development, management and operation-and-maintenance function integration, test environment construction, benchmark testing and typical scenario test verification.
Furthermore, in step 1, the cloud-oriented hybrid load scheduling, aimed at the policy requirement of domestic autonomy and controllability, adopts a containerization scheme for the graph storage service, graph query service and graph computation service according to the different operating systems and hardware conditions of domestic heterogeneous clusters, including comprehensive perception of heterogeneous resources, reasonable and elastic scheduling of hybrid loads, and fine-grained resource management and control, so as to implement a scheduling policy oriented to heterogeneous resource characteristics.
Further, the graph data storage management in step 2 includes distributed storage of graph data, data multi-copy, data disaster recovery, distributed transactions, and data authority management;
the distributed storage includes the in-memory structure design of point and edge files and the storage file structure design, and is used for point queries, associated-edge queries and attribute filtering; the storage structure keeps the storage load balanced within the cluster;
the data multi-copy includes copy creation and consistency management, and supports service switching and data recovery when a storage service fails;
the data disaster recovery includes data backup and data recovery, wherein backup data is stored in HDFS and can be restored as online data by another cluster;
the distributed transactions include transaction level support, transaction performance and transaction conflict handling; transactions at the Serializable level are supported, transaction performance is improved through multi-version concurrency control, transaction conflicts are handled through optimistic locks, and transaction requests spanning multiple physical nodes are served through a two-phase commit scheme and a global transaction ID manager;
the data authority includes management and control of data authority for multiple users; users, user groups and roles are defined with an RBAC model, and read authority, write authority and management authority are provided for the graph storage system, wherein the read and write authority covers the graph level, label level and attribute level.
Furthermore, the graph data storage management adopts an efficient LSM-Tree-based KV storage technology to realize massive graph data storage; inserted data, updated data and deleted data are written into the Active MemTable buffer in memory and are each distinguished by an identification bit for subsequent data merging; after the amount of data in the Active MemTable buffer reaches a threshold, the data is sorted, consolidated and converted into an Immutable MemTable, so that the write thread can asynchronously persist it as Segments on disk.
Further, step 2 also includes distributed multi-copy storage; after distributed storage is adopted, the graph data is split into sub-graphs stored on each node, with the following splitting modes: hashing by start-point ID, hashing by end-point ID, and hashing by edge ID.
Further, the distributed graph computation in step 4 includes the graph computation model, graph data partitioning, graph data encoding, and fault tolerance of algorithm execution.
Further, the distributed graph query in step 5 includes an openCypher grammar compiler, a stored-procedure compiler, a query optimizer, a distributed executor, and resource pool allocation.
Further, in step 6, the heterogeneous graph federated query includes meta-information identification, query distribution and query encryption for multiple local or remote heterogeneous graphs; the federated query distinguishes the graph data of the multiple heterogeneous graphs through namespaces, and supports multi-graph aggregation queries and multi-graph association queries.
Furthermore, the hybrid transaction-analysis integration in step 7 avoids copying data between the graph database and graph computation; a data-copy technique is adopted to integrate analysis algorithms and transactional queries, and a read-only data copy is added to the multiple copies of the data; the read-only copy synchronizes updates of the primary copy in real time, and the graph computation snapshot is built on the read-only copy, which reduces the access pressure on the graph database while still obtaining the latest data.
Further, the hybrid transaction-analysis integrated distributed transactions based on MVCC and optimistic locks in step 7 use two-phase commit for concurrency control, wherein two-phase commit blocks conflicting transactions through a lock mechanism to avoid any non-serializable execution and is divided into two phases: in the first phase, all objects accessed by the transaction are locked; according to the access type, the locks are generally divided into read locks and write locks, where the read lock is a shared lock and the write lock is an exclusive lock; after all accessed data has been successfully locked, the transaction's modifications to the database are committed in the second phase, and then the locks are released. MVCC performs control by saving a snapshot of the data at a certain point in time; it allows the same data record to have multiple different versions, and the version of the data a user wants is obtained by adding the corresponding constraint conditions at query time.
The method of the invention has the following advantages:
following the technical routes of the LSM-Tree-based high-performance KV storage structure, distributed transactions based on MVCC and optimistic locks, and snapshot-based online disaster recovery, the invention achieves breakthroughs in high-performance exact point and edge queries, outgoing-edge queries, incoming-edge queries, attribute-filtering queries, high-performance transaction processing at the Serializable level, lightweight data disaster recovery and other areas, completes a KV-based high-performance distributed graph data storage technology, realizes storage of graph data at the trillion-node, trillion-edge scale, and provides underlying storage support for high-performance graph queries, transaction processing and high-throughput graph data loading.
Drawings
FIG. 1 illustrates a key technology implementation overall architecture diagram;
FIG. 2 shows a technical roadmap;
FIG. 3 shows a primary graph storage structure based on LSM-Tree;
FIG. 4 illustrates a graph data multiple copy mechanism;
FIG. 5 illustrates a distributed transaction technology implementation based on MVCC and optimistic locks;
FIG. 6 is a logical schematic diagram of a dot edge data structure;
FIG. 7 illustrates a two-phase transaction lock implementation mechanism.
Detailed Description
The technical solution of the present invention will be described clearly and completely in conjunction with specific embodiments, but it should be understood by those skilled in the art that the embodiments described below are only intended to illustrate the invention and should not be construed as limiting its scope. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
The invention adopts a distributed graph storage and computation architecture, and carries out the technical research of a distributed transaction-analysis hybrid graph database along the lines of architecture research, storage management technology, graph computation architecture, hybrid transaction-analysis support, mainstream-language query optimization, heterogeneous federated query, prototype system design and test verification; it is used to quickly search association relationships among data and provides strong algorithm analysis capability. It solves the problem of storing massive associated graph data and, through a custom graph storage format and clustered storage, realizes the low-latency multi-hop relationship queries that traditional databases cannot provide, offering strong platform support for military service scenarios.
The overall architecture of the key technical implementation is shown in FIG. 1.
The invention supports distributed computing and storage as well as horizontal scaling; multiple computing nodes execute tasks in parallel in a balanced manner, providing computing capability for ultra-large-scale graphs together with automatic data partitioning and sharded storage.
The invention uses an independently developed distributed computing engine, Spark. To improve the computational performance of the graph database, the compute engine and the storage engine are coupled, so that the compute engine can directly exploit data locality for high-performance computation.
The invention designs a dedicated graph storage structure and improves memory and disk utilization through an efficient compression algorithm. An efficient distributed storage algorithm is designed so that graph data is automatically and evenly stored on each node of the cluster, providing automatic data partitioning and sharded storage and ensuring the linear scalability of the cluster.
The invention also creates multiple copies of the data through the Star Ring distributed storage engine, ensuring fault tolerance, consistency and high availability of the data. It supports horizontal scaling, with multiple computing nodes executing tasks in parallel in a balanced manner, provides computing capability for ultra-large-scale graphs, and can easily store, compute and analyze ultra-large-scale graphs at the trillion-vertex, trillion-edge scale.
The invention supports partitioning strategies such as hash and range partitioning, and can provide the ACID properties of transactions in a distributed deployment.
The invention adopts a storage-compute separation architecture, supports capacity expansion of storage nodes and computing nodes, and supports dynamic data migration to cope with bursts of service requests.
The invention adopts a distributed graph storage and computation architecture and carries out the technical research of a distributed transaction-analysis hybrid graph database along the lines of architecture research, storage management technology, graph computation architecture, hybrid transaction-analysis support, mainstream-language query optimization, heterogeneous federated query, prototype system design and test verification; the overall technical route is shown in FIG. 2.
First, the distributed architecture for domestic heterogeneous computing platforms and mainstream graph database evaluation benchmarks are studied, mainly around the mainstream graph database architecture, cloud-native-oriented hybrid load scheduling technology and benchmark testing. The characteristics and advantages of mainstream graph database architectures are analyzed from the aspects of storage, computation, query and service, providing an infrastructure reference for the invention. Based on cloud-native technologies represented by Docker and Kubernetes, container orchestration and scheduling technology is studied to shield applications from the influence of the software and hardware infrastructure, improve resource utilization and operation-and-maintenance efficiency, meet the graph database's requirements at the heterogeneous resource scheduling layer, and provide a technical foundation for hybrid deployment and operation of the graph database on domestic heterogeneous computing platforms. Research on mainstream graph database benchmarks analyzes the structure and scale of benchmark data sets, the graph query characteristics of the benchmarks, and their coverage of military graph query scenarios, and develops a test benchmark that is more targeted and covers military scenarios more broadly, providing a theoretical basis for the prototype system design of the graph database and for typical military scenario test verification.
And secondly, developing large-scale graph data storage management technology research on the basis of graph database architecture research. Firstly, a high-performance graph storage structure technical research is carried out to research a graph storage structure meeting the requirements of mass graph storage and high-performance calculation. Based on the graph storage structure, the multi-copy technology, the copy consistency technology and the online graph data disaster recovery technology are researched, and technical support is provided for stable and reliable storage of massive graph data. And meanwhile, a fusion technology of the transaction lock and the high-performance graph storage structure is researched, and storage layer support is provided for graph database transaction analysis. And (3) researching a high-throughput graph loading technology based on a graph storage structure, and realizing rapid graph import of a graph database.
And thirdly, researching a transaction analysis integrated technology based on read-only copy based on a high-performance graph storage structure and a copy consistency technology, realizing graph snapshot generation without copying, and providing graph data of transaction specifications for graph analysis. And meanwhile, a transaction analysis resource control technology is researched, resource conflict during the simultaneous running of transaction analysis tasks is solved through a scheduling strategy, and the transaction analysis mixed processing function and performance of the graph database are met.
And fourthly, developing a distributed graph computing architecture technology research, researching a graph analysis parallel processing technology based on a GAS graph programming model on the basis of main stream algorithm capability analysis and algorithm call support, improving the parallel processing model through technologies such as coding technology optimization, graph computing fault tolerance, adaptive model selection optimization and the like, realizing main stream graph algorithm acceleration, solving the problem that a single programming model cannot be suitable for all graph algorithms, and providing a technical basis for developing a high-performance graph computing engine.
Fifthly, researching an openCypher-based distributed query optimization technology based on the characteristics of easy use and comprehensive grammar expression of the openCypher. The research on the openCypher compiling technology, the openCypher language expanding technology and the storage process supporting technology meets the requirement that all queries can be developed and decomposed into a plurality of parallel tasks through the openCypher and are executed in a distributed mode in a cluster. And meanwhile, technologies such as study execution plan optimization, distributed parallel execution optimization, deep query optimization and the like are researched, and various optimization strategies are researched in the processes of logic execution plan generation, physical execution and the like, so that technical support is provided for high-performance query of a graph database.
Sixth, heterogeneous graph query federation technology is studied, mainly around the heterogeneous graph unified metadata technology, heterogeneous graph model query and cross-cluster query distribution technology, solving the problem of a unified view over queries of heterogeneous graphs (different vertex/edge types, different numbers and types of attributes), and providing technical support for high-performance cross-graph-model queries and cross-cluster queries.
Finally, graph database prototype software design and test verification are carried out based on the graph database architecture and the above technical research results, mainly including software architecture design, core function design, software function development, management and operation-and-maintenance function integration, test environment construction, benchmark testing and typical scenario test verification. A graph database prototype system that integrates hybrid transaction analysis and meets the index requirements of the invention is developed, test verification is completed, the research content and key technologies are validated, and project acceptance is passed.
The main sub-technologies include the following aspects.
(1) Cloud-native-oriented hybrid load scheduling technology
Aimed at the policy requirement of domestic autonomy and controllability, and for the different operating systems and hardware conditions of domestic heterogeneous clusters, a containerization scheme is adopted for the graph storage service, graph query service and graph computation service; the research covers comprehensive perception of heterogeneous resources, reasonable and elastic scheduling of hybrid loads, fine-grained resource management and control, and scheduling policies oriented to heterogeneous resource characteristics.
(2) Graph data storage technology
The graph data storage technology researches comprise technologies such as distributed storage of graph data, multiple copies of the data, data disaster tolerance, distributed transactions, data authority and the like.
The distributed storage technology includes the in-memory structure design of point and edge files and the storage file structure design; the storage supports highly concurrent, continuous data writes, satisfying point queries, associated-edge queries and attribute filtering. The storage structure has good scalability and keeps the storage load balanced within the cluster.
The data multi-copy technology comprises copy creation and consistency management technology, supports service switching and data recovery when storage service fails, and improves availability of storage service.
The data disaster recovery technology comprises data backup and data recovery, and reduces backup cost in a mode of combining full backup and incremental backup. The backup data is to be stored in the HDFS and may be restored by another cluster as online data.
The distributed transaction technology includes transaction level support, transaction performance and transaction conflict handling; it is intended to support transactions at the Serializable level, to improve transaction performance through multi-version concurrency control, and to handle transaction conflicts through optimistic locks. For transaction requests spanning multiple physical nodes, a two-phase commit scheme and a global transaction ID manager are to be used.
The data authority technology includes data authority management and control for multiple users; the RBAC model to be supported defines users, user groups and roles, and provides read authority, write authority and management authority for the graph storage system, where the read and write authority covers the graph level, label level and attribute level.
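As an illustration of such an RBAC authority model, the following Java example (a minimal sketch; class and method names such as Grant, Role and RbacDemo are assumptions, not part of the patent) shows how read/write/management grants could be checked at the graph, label and attribute levels.

```java
import java.util.*;

// Hypothetical sketch of graph-level / label-level / attribute-level grants for an RBAC role.
enum Action { READ, WRITE, ADMIN }

final class Grant {
    final Action action;
    final String graph;      // graph name, "*" = any
    final String label;      // vertex/edge label, "*" = any
    final String attribute;  // attribute name, "*" = any
    Grant(Action action, String graph, String label, String attribute) {
        this.action = action; this.graph = graph; this.label = label; this.attribute = attribute;
    }
    boolean covers(Action a, String g, String l, String attr) {
        return action == a
                && (graph.equals("*") || graph.equals(g))
                && (label.equals("*") || label.equals(l))
                && (attribute.equals("*") || attribute.equals(attr));
    }
}

final class Role {
    final String name;
    final List<Grant> grants = new ArrayList<>();
    Role(String name) { this.name = name; }
    boolean allowed(Action a, String graph, String label, String attr) {
        return grants.stream().anyMatch(gr -> gr.covers(a, graph, label, attr));
    }
}

public class RbacDemo {
    public static void main(String[] args) {
        Role analyst = new Role("analyst");
        // read the whole "social" graph, but write only the "Person.note" attribute
        analyst.grants.add(new Grant(Action.READ, "social", "*", "*"));
        analyst.grants.add(new Grant(Action.WRITE, "social", "Person", "note"));

        System.out.println(analyst.allowed(Action.READ,  "social", "Person", "age"));  // true
        System.out.println(analyst.allowed(Action.WRITE, "social", "Person", "age"));  // false
    }
}
```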
(3) Distributed graph query technique
The graph data query technology includes an openCypher grammar compiler, a stored-procedure compiler, a query optimizer, a distributed executor, resource pool allocation and other technologies.
The openCypher grammar compilation technology supports the openCypher V9 standard and can satisfy most graph query requirements. In particular, the invention supports graph modeling grammar, data import grammar, data export grammar, and extensible user-defined function plug-ins.
The stored-procedure compiler is intended to support stored procedures for graphs, including common types, flow control, cursors and exception handling, and to support transactional insert, delete and query operations on graph data.
Query optimizers include the Rule-based Optimizer, Cost-based Optimizer, CFG Optimizer, Parallel Optimizer and DAG Optimizer. The Rule-based Optimizer adjusts the execution order through built-in rules, for example pushing down filter operators. The Cost-based Optimizer adjusts the execution order using graph statistics. The CFG Optimizer is mainly used to optimize stored-procedure code, completing loop unrolling, eliminating redundant code, and mainly optimizing function inlining. The Parallel Optimizer parallelizes logic that was originally serial and uses the computing power of the cluster to improve overall execution speed; the performance improvement for some key functions such as cursors is very significant. The DAG Optimizer performs secondary optimization on the generated DAG, producing a more reasonable physical execution plan and mainly reducing task overheads such as Shuffle.
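For illustration, the following Java sketch shows the kind of rewrite a rule-based optimizer performs when it pushes a filter operator below an edge expansion; the plan node classes and the rule itself are simplified assumptions rather than the patent's actual optimizer.

```java
// Minimal sketch of a rule-based optimizer pass that pushes a vertex-attribute filter
// below an edge expansion (names are illustrative, not from the patent).
abstract class PlanNode { PlanNode child; }
class ScanVertices extends PlanNode { String label; ScanVertices(String l) { label = l; } }
class ExpandEdges  extends PlanNode { String edgeType; ExpandEdges(PlanNode c, String t) { child = c; edgeType = t; } }
class Filter       extends PlanNode { String predicate; boolean onVertexAttrOnly;
    Filter(PlanNode c, String p, boolean v) { child = c; predicate = p; onVertexAttrOnly = v; } }

public class FilterPushdown {
    // Rule: Filter(Expand(x)) -> Expand(Filter(x)) when the predicate touches vertex attributes only.
    static PlanNode apply(PlanNode node) {
        if (node instanceof Filter f && f.child instanceof ExpandEdges e && f.onVertexAttrOnly) {
            e.child = new Filter(e.child, f.predicate, true); // evaluate the filter before expansion
            return e;                                          // fewer edges are expanded afterwards
        }
        return node;
    }

    public static void main(String[] args) {
        PlanNode plan = new Filter(new ExpandEdges(new ScanVertices("Person"), "KNOWS"),
                                   "age > 30", true);
        PlanNode optimized = apply(plan);
        System.out.println(optimized.getClass().getSimpleName()); // ExpandEdges: filter was pushed down
    }
}
```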
The distributed executor has higher concurrent processing performance based on the MPP model. To better accommodate various data scenarios, the execution engine of the multi-storage engine database contains two execution modes: a low latency mode and a high throughput mode. The low-latency mode is mainly applied to a scene with smaller data volume, the execution engine can generate a physical execution plan with low execution latency, and shorter execution time is ensured by reducing or avoiding some high-latency tasks (such as IO, network and the like). The high throughput mode is mainly applied to a complex query scene, and the performance of complex graph query on a large data volume is improved through reasonable distributed execution. The distributed executor also comprises a multi-task execution and multi-level scheduling computing framework, so that the optimization of the whole resource utilization rate can be realized, and the whole job throughput of the database can be improved.
The resource pool allocation comprises modeling and unified management of cluster computing resources, and smooth execution of the mixed load task is realized through allocation of computing task resources.
(4) Distributed graph computing technique
The distributed graph computing technology comprises a graph computing model, graph data segmentation, graph data coding, algorithm execution fault tolerance and other technologies.
The graph calculation model relates to the execution process of the graph algorithm, and aims to acquire better concurrency performance by adopting the GAS calculation model, realize synchronous execution and asynchronous execution and expand the algorithm supporting capability.
The graph data segmentation is used for balancing the calculation load of the graph algorithm, and the calculation load of the data edges is balanced by adopting a point cutting mode, so that the communication cost between the nodes is reduced.
The graph data coding relates to a memory representation structure of graph algorithm data, and reduces memory overhead.
The algorithm execution fault tolerance comprises fault tolerance of intermediate results in the multi-round execution process, recalculation cost after calculation failure is reduced, and intermediate calculation results are reserved in a checkpoint mode.
(5) Hybrid transaction analysis integration technique
In order to reduce the complexity of the system and avoid copying data between the graph database and the graph computation, a data copy technology is adopted to fuse an analysis algorithm and transaction inquiry, a read-only copy is added into multiple copies of the data, the copy synchronizes the updating of a main copy in real time, the graph computation snapshot is acquired based on the copy of the data, and the access pressure to the graph database is reduced while the latest data is acquired.
In order to enable the graph algorithm to acquire consistent graph data in a distributed environment, a read transaction technology is adopted to acquire data conforming to transaction levels from a graph storage system so as to form a graph snapshot.
(6) Graph federation query technique
The graph federation query technology study includes meta-information identification, query distribution and query encryption for multiple local or remote heterogeneous graphs. Federated queries involve multiple heterogeneous graphs, distinguishing graph data by namespace, and are intended to support multi-graph aggregation queries and multi-graph association queries. For data queries across clusters, considering that the network may be untrusted, the security of query content and results is ensured through data encryption.
Examples:
1) LSM-Tree-based efficient KV storage
The invention adopts the high-efficiency KV storage technology based on LSM-Tree to realize the mass graph data storage, and the storage structure is shown in figure 3.
(1)Active MemTable
The frequent updating of the graph data can cause larger pressure on the disk, and the query of the graph data can also bring a large number of random read requests, so that the disk pressure is further increased. The invention is intended to write the insertion data, the update data and the deletion data into the Active MemTable buffer of the memory. The inserted data, the updated data and the deleted data are respectively distinguished by an identification bit and used for subsequent data merging.
(2)Immutable MemTable
After the amount of data in the in-memory buffer reaches a threshold, the data is sorted, consolidated and converted into an Immutable MemTable, so that the write thread can asynchronously persist it as Segments on disk. A graph partition in the system is allowed to have 1 Active MemTable and multiple Immutable MemTables. An Immutable MemTable stays in memory until the asynchronous write thread has written it to disk. To manage Immutable MemTables in a unified way and satisfy concurrent queries and asynchronous writes at the same time, they are managed using reference counting. For each query, a snapshot is taken on the system and the reference count of the Immutable MemTable is incremented by one; it is released after the query completes. The asynchronous write thread does not directly release the corresponding memory after finishing the write of an Immutable MemTable; the memory is released only when no snapshot in the system still holds the Immutable MemTable.
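A minimal sketch of this write path is given below, assuming hypothetical class names: each write is tagged with an operation flag, buffered in an ordered Active MemTable, and the buffer is frozen into an Immutable MemTable once a size threshold is reached, leaving flushing to an asynchronous thread.

```java
// Minimal sketch (hypothetical names) of the write path described above: insert/update/delete
// records are tagged with an operation flag, buffered in an ordered Active MemTable and,
// once a size threshold is reached, frozen into an Immutable MemTable for asynchronous flushing.
import java.util.*;
import java.util.concurrent.ConcurrentSkipListMap;

enum Op { INSERT, UPDATE, DELETE }

record Entry(Op op, byte[] value) {}

class MemTableBuffer {
    private final int threshold;
    private ConcurrentSkipListMap<String, Entry> active = new ConcurrentSkipListMap<>();
    private final Deque<NavigableMap<String, Entry>> immutables = new ArrayDeque<>();

    MemTableBuffer(int threshold) { this.threshold = threshold; }

    synchronized void write(String key, Op op, byte[] value) {
        active.put(key, new Entry(op, value));          // the operation flag is kept for later merging
        if (active.size() >= threshold) {
            immutables.addLast(active);                  // freeze: becomes an Immutable MemTable
            active = new ConcurrentSkipListMap<>();      // new empty Active MemTable
        }
    }

    // Called by the asynchronous flush thread; the real system would write a Segment file here
    // and would only reclaim the memory once no snapshot still references the Immutable MemTable.
    synchronized NavigableMap<String, Entry> pollImmutable() { return immutables.pollFirst(); }
}
```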
Segment
Each Segment is a globally ordered KV file consisting of several compressed data blocks (DataBlocks), several index blocks, and a file tail block recording meta information such as the version number. Each DataBlock is about 64 KB, compressed with the LZF compression algorithm, and the storage space of the data is further reduced through prefix encoding. In addition, to speed up finding a specified point or edge record within a DataBlock, a more complete index is built with a small amount of extra space using a hash index, a Bloom filter and similar means. The index block records coarse-grained index information of each DataBlock, such as the minimum Key, the maximum Key, and the number of records. Since more DataBlocks produce more scattered index records, a secondary index is also used inside the index block, merging multiple index records into one coarser-grained index record.
Three read modes are supported for Segment files: exact lookup, prefix lookup and data traversal.
Exact lookup: given the ID information of a point or an edge, the file offset of the DataBlock is first determined through the index, and the Bloom filter attached to the DataBlock is consulted first to decide whether the current DataBlock may contain the data. If the Bloom filter indicates that the data may exist, the DataBlock is read and decompressed, and the specified record is located quickly through the hash index.
Prefix lookup obtains the corresponding outgoing or incoming edges through the ID of a given point. Similar to exact lookup, the range of DataBlocks is determined through the index block, the K DataBlocks satisfying the condition are read in sequence, and the point IDs are used for further filtering.
Data traversal is simple: the index block is no longer needed, and all DataBlocks are read sequentially from front to back.
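The exact-lookup path can be illustrated with the following Java sketch (interfaces such as BloomFilter and DataBlock stand in for the real on-disk structures and are assumptions): the coarse index locates a candidate DataBlock, the Bloom filter is consulted to avoid unnecessary IO, and only then is the block decompressed and probed.

```java
// Sketch of the exact-lookup path over a Segment file: locate the candidate DataBlock via the
// coarse-grained index, consult its Bloom filter, and only then decompress and probe the block.
// All types and method names here are illustrative assumptions.
import java.util.*;

interface BloomFilter { boolean mightContain(String key); }
interface DataBlock   { Optional<byte[]> get(String key); }   // decompressed, hash-indexed block

class IndexEntry {
    final String minKey, maxKey;
    final long offset;
    IndexEntry(String min, String max, long off) { minKey = min; maxKey = max; offset = off; }
}

class SegmentReader {
    private final List<IndexEntry> index;                 // one coarse entry per DataBlock
    private final Map<Long, BloomFilter> blooms;          // Bloom filter attached to each block
    private final Map<Long, DataBlock> blocks;            // stands in for reading + LZF decompression

    SegmentReader(List<IndexEntry> idx, Map<Long, BloomFilter> bf, Map<Long, DataBlock> blk) {
        index = idx; blooms = bf; blocks = blk;
    }

    Optional<byte[]> get(String key) {
        for (IndexEntry e : index) {
            if (key.compareTo(e.minKey) >= 0 && key.compareTo(e.maxKey) <= 0) {
                if (!blooms.get(e.offset).mightContain(key)) return Optional.empty(); // skip the IO
                return blocks.get(e.offset).get(key);      // decompress block, probe its hash index
            }
        }
        return Optional.empty();
    }
}
```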
SkipList
The graph data needs to be kept ordered in the MemTable for fast queries and Segment file generation, so a simple and fast data structure that supports ordered, fast reads is required. For the MemTable buffer, a skip list data structure is to be used to store the data.
The skip list is a simple data structure; built on top of a linked list, it supports data lookup with O(log N) complexity, and once the target position has been located, the insertion or deletion of a node itself is an O(1) pointer splice. A skip list consists of an ordered linked list and multi-level indexes, and its key point is the creation and maintenance of the multi-level indexes. The multi-level indexes point to different positions in the linked list, with the upper levels being sparser, which enables fast jumps to positions in the linked list. A data query on the skip list jumps to the designated position in the linked list through index comparisons from top to bottom, and then the query result can be returned after a few Key comparisons. For data writes and deletes, the designated position is found through the indexes and the node is inserted or deleted.
To improve the insertion and update performance of point and edge data, each record carries an identifier that distinguishes inserted data, updated data and deleted data. On the premise of preserving transactional consistency, a user is allowed to update part of the attributes instead of reading the complete record in advance. With multiple updates of the same point or edge record, there may be several pieces of partially updated data in the skip list, which need to be merged before a Segment is generated.
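The merging of several partially updated records for the same point or edge key before Segment generation can be sketched as follows; the record layout and names are illustrative assumptions.

```java
// Sketch of merging several partially-updated records for the same point/edge key before a
// Segment is generated, as described above; field and type names are assumptions for illustration.
import java.util.*;

enum RecOp { INSERT, UPDATE, DELETE }

record VersionedRecord(RecOp op, Map<String, String> attrs) {}

public class PartialUpdateMerge {
    // Records are given oldest-first; later partial updates overwrite individual attributes.
    static Optional<Map<String, String>> merge(List<VersionedRecord> records) {
        Map<String, String> merged = new HashMap<>();
        boolean exists = false;
        for (VersionedRecord r : records) {
            switch (r.op()) {
                case INSERT -> { merged = new HashMap<>(r.attrs()); exists = true; }
                case UPDATE -> { if (exists) merged.putAll(r.attrs()); }   // only the touched attributes
                case DELETE -> { merged.clear(); exists = false; }
            }
        }
        return exists ? Optional.of(merged) : Optional.empty();
    }

    public static void main(String[] args) {
        List<VersionedRecord> history = List.of(
            new VersionedRecord(RecOp.INSERT, Map.of("name", "v1", "age", "30")),
            new VersionedRecord(RecOp.UPDATE, Map.of("age", "31")));   // partial update, no read-back
        System.out.println(merge(history));                            // merged record: name=v1, age=31
    }
}
```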
Hierarchical data merge policy
On top of the LSM-Tree, Keys may be duplicated across the generated Segment files; a query would then have to access multiple Segments, causing additional IO load and reducing query performance. Therefore, a reasonable strategy is needed to merge the generated files and eliminate duplicate and deleted data.
A leveled merge strategy is adopted. The Segment files are divided into K logical levels, and except for level 0 there is no Key duplication between the files of the same level. After a Segment is generated, over its life cycle it is gradually moved from level 0 towards level K-1, so files at shallower levels hold newer data than files at deeper levels. The files at level 0 are written directly from the MemTable, so their Key ranges overlap each other. When a level-i file is moved to level i+1, it must be merged together with all files of level i+1 whose Key ranges overlap it, producing new level-i+1 files.
When reading data, all Segment files that may contain the Key of the data to be read are first identified according to that Key, and the data is read from top to bottom in the order of levels 0 to k-1. Since the files of levels 1 to k-1 do not overlap, the total number of files read is at most k-1+M, where M is the number of level-0 Segment files involved.
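A sketch of this leveled read path is shown below, assuming a simplified SegmentFile type: all overlapping level-0 files are probed newest-first, and at most one candidate file per deeper level, so at most k-1+M files are touched for one key.

```java
// Sketch of the leveled read path: all overlapping level-0 Segments plus at most one Segment per
// deeper level are probed, newest data first. SegmentFile and its fields are illustrative assumptions.
import java.util.*;

class SegmentFile {
    final int level; final String minKey, maxKey;
    final Map<String, byte[]> data;                       // stands in for an on-disk lookup
    SegmentFile(int level, String min, String max, Map<String, byte[]> data) {
        this.level = level; minKey = min; maxKey = max; this.data = data;
    }
    boolean mayContain(String k) { return k.compareTo(minKey) >= 0 && k.compareTo(maxKey) <= 0; }
}

public class LeveledRead {
    // level0 is ordered newest-first; levels 1..k-1 have non-overlapping key ranges per level.
    static Optional<byte[]> get(String key, List<SegmentFile> level0, List<List<SegmentFile>> deeperLevels) {
        for (SegmentFile s : level0)
            if (s.mayContain(key) && s.data.containsKey(key)) return Optional.of(s.data.get(key));
        for (List<SegmentFile> level : deeperLevels) {
            for (SegmentFile s : level) {
                if (s.mayContain(key)) {
                    byte[] v = s.data.get(key);
                    if (v != null) return Optional.of(v); // found at this level: it is the newest version
                    break;                                // not here: only one candidate file per level
                }
            }
        }
        return Optional.empty();
    }
}
```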
Reference count policy
The system at any moment holds an Active MemTable, Immutable MemTables and a series of Segment files; under concurrent read queries and write operations, a MemTable may be pinned by a query request and cannot be released immediately. In addition, merge operations between Segment files are handled asynchronously by background threads, so a mechanism is needed to handle concurrent read and write requests without locks and without blocking. Reference counting solves this problem: a reference count is maintained for every Active MemTable, Immutable MemTable and Segment file. When the resource is acquired, the reference count is incremented; when the object is released, the reference count is decremented. When the reference count returns to zero, the corresponding resource can enter the reclamation state: for an Active MemTable or Immutable MemTable, the memory can be freed; for a Segment file, the file can be deleted.
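The reference-counting policy can be sketched as follows (hypothetical names): queries and the background merge thread acquire and release resources without locking, and reclamation happens only when the count drops to zero.

```java
// Sketch of the reference-counting policy described above: queries and the background merge thread
// acquire and release MemTables / Segment files without locks; the resource is reclaimed only when
// its count drops back to zero. Names are illustrative.
import java.util.concurrent.atomic.AtomicInteger;

class RefCounted {
    private final AtomicInteger refs = new AtomicInteger(1);  // 1 = owned by the storage engine
    private final Runnable reclaim;                           // e.g. free memory or delete the file
    RefCounted(Runnable reclaim) { this.reclaim = reclaim; }

    void acquire() { refs.incrementAndGet(); }                // called when a snapshot/query takes it
    void release() {
        if (refs.decrementAndGet() == 0) reclaim.run();       // last holder triggers reclamation
    }
}

public class RefCountDemo {
    public static void main(String[] args) {
        RefCounted segment = new RefCounted(() -> System.out.println("segment file deleted"));
        segment.acquire();    // a concurrent query pins the segment
        segment.release();    // the storage engine drops it after a merge...
        segment.release();    // ...and the file is only deleted once the query finishes
    }
}
```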
For efficient writing of batch data, the invention adopts batch sorting, sequential writing and batch mounting to improve system throughput. The imported data is sorted in batches by the distributed computing engine, and after sorting is completed, disk Segment files are generated by sequential writes. After all files are generated, they are handed over to the storage engine, which incorporates them into the current data set by mounting.
2) Distributed multi-copy storage
After distributed storage is adopted, the graph data needs to be split into sub-graphs and stored on each node, as shown in FIG. 4. Common splitting modes are: hashing by start-point ID, hashing by end-point ID, and hashing by edge ID. Hashing by start-point ID benefits queries dominated by outgoing edges, hashing by end-point ID benefits queries dominated by incoming edges, and hashing by edge ID benefits even distribution and load balancing of edge data. Each of the three splitting modes solves only part of the problem, while property graph queries may mix outgoing-edge queries, incoming-edge queries and queries over both directions at the same time, so a more reasonable approach is needed to accelerate edge data queries.
The invention aims to accelerate edge queries by creating a reverse index of the edge data at the cost of a certain amount of extra memory. For point data, the partition number is determined by the hash value of the ID; for edge data, one index record is stored in the partition of the start point and one index record is stored in the partition of the end point.
When data is written, a data record and an index record are generated at the client and submitted to the two partitions respectively, and the atomicity of the write is guaranteed by a distributed transaction. When point data is queried, the designated partition can be located through the node ID; when outgoing edges are queried, the data query is sent to the partition of the start point; when incoming edges are queried, the data query is sent to the partition of the end point; and when both outgoing and incoming edges are queried, the query is sent to the partitions of the start point and the end point. This partitioning scheme effectively avoids broadcasting queries to the whole cluster and brings a considerable performance improvement when the cluster is large.
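The routing scheme above can be sketched as follows; the partition count, record format and class names are assumptions for illustration only. Each edge write yields one forward index record in the start point's partition and one reverse index record in the end point's partition, so outgoing-edge, incoming-edge and bidirectional queries each touch only the relevant partitions.

```java
// Sketch of the partition-routing scheme above: points are hashed by ID; each edge stores one index
// record in the start point's partition and one in the end point's partition. Names are assumed.
import java.util.*;

public class EdgeRouting {
    static final int PARTITIONS = 8;

    static int partitionOf(long vertexId) { return (int) Math.floorMod(vertexId, (long) PARTITIONS); }

    // An edge write produces two index records submitted to (at most) two partitions in one
    // distributed transaction, which guarantees the atomicity of the write.
    static Map<Integer, List<String>> edgeWrites(long srcId, long dstId, String edgeType) {
        Map<Integer, List<String>> byPartition = new HashMap<>();
        byPartition.computeIfAbsent(partitionOf(srcId), p -> new ArrayList<>())
                   .add("OUT:" + srcId + "-" + edgeType + "->" + dstId);
        byPartition.computeIfAbsent(partitionOf(dstId), p -> new ArrayList<>())
                   .add("IN:" + dstId + "<-" + edgeType + "-" + srcId);
        return byPartition;
    }

    public static void main(String[] args) {
        System.out.println(edgeWrites(42L, 7L, "KNOWS"));
        // outgoing-edge query for vertex 42 only touches partitionOf(42); incoming edges of 7 only partitionOf(7)
        System.out.println("out-edges of 42 served by partition " + partitionOf(42L));
    }
}
```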
The data multi-copy mechanism created here provides reliability of the stored data and maintains data consistency through the Raft protocol. Each partition of the graph data creates a Raft Group, and the node where the Raft Leader resides is responsible for data writes and queries.
Raft is an easily understood distributed consensus protocol that strengthens the ordering constraints between log entries, simplifying the design of the protocol. Each node in Raft maintains a monotonically increasing variable called the term. The term is essentially a logical clock maintained jointly by the nodes, through which a node can discover outdated messages. Specifically, when a message is sent between nodes, it carries the sender's current term. If the term carried by a message received by a node is smaller than the node's current term, the message is rejected; otherwise, the node updates its own term. When a new log entry is appended to the log, the node also saves its current term in the log entry, which becomes the term of that log entry.
Nodes in Raft have 3 roles: the master node (Leader), the slave node (Follower), and the candidate node (Candidate). In the initial state, all nodes are Followers. The Raft protocol mainly consists of two parts: electing the Leader, and the Leader synchronizing the log to the Followers. Normally, the Leader maintains its Leader identity by sending heartbeat signals to the other nodes at regular intervals. When a Follower receives no heartbeat for a period of time, it converts into a Candidate. The Candidate first increments its own term and then sends it to all nodes to initiate an election. After receiving the election request, a node compares its own term with the term carried by the request. If its own term is larger, or it has already voted for another candidate within this term, it rejects the election request. The candidate that receives a majority of the votes becomes the new Leader.
Raft's majority voting mechanism guarantees Election Safety: in each term there is at most one Leader. To guarantee Leader Completeness for the new Leader, i.e. that its log contains all committed log entries, Raft introduces the following rule: if the candidate's log is older than the node's own log, the node rejects the election request. Nodes judge whether a log is newer or older by comparing the index and term of the last log entry in the log (called LastIndex and LastTerm respectively).
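The vote decision that enforces these two properties can be sketched as follows (field and method names are assumptions): a node rejects a RequestVote when the candidate's term is stale, when its own log is newer as judged by (LastTerm, LastIndex), or when it has already voted in the current term.

```java
// Sketch of the vote decision that enforces Election Safety and Leader Completeness as described
// above. Field and method names are assumptions, not the patent's actual implementation.
public class VoteDecision {
    long currentTerm;
    Integer votedFor;          // candidate id voted for in currentTerm, null if none
    long lastLogTerm, lastLogIndex;

    boolean grantVote(int candidateId, long term, long candLastLogTerm, long candLastLogIndex) {
        if (term < currentTerm) return false;                          // stale term: reject
        if (term > currentTerm) { currentTerm = term; votedFor = null; }
        boolean logUpToDate = candLastLogTerm > lastLogTerm
                || (candLastLogTerm == lastLogTerm && candLastLogIndex >= lastLogIndex);
        if (!logUpToDate) return false;                                // candidate's log is older: reject
        if (votedFor != null && votedFor != candidateId) return false; // already voted in this term
        votedFor = candidateId;
        return true;
    }
}
```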
The Leader synchronizes log entries to the Followers in log-index order, and the Followers must accept the Leader's log entries in that order. The Leader and Followers maintain the next acceptable index through an acknowledgement (ACK) mechanism. Raft establishes a total order over all log entries by their indexes. Through this restriction, the logs of Raft nodes cannot contain holes, and log consistency between nodes is guaranteed: if the logs on two nodes contain the same log entry at the same position, then the log entries at all earlier positions must also be the same. From this log consistency, if a certain log entry has been committed, then all log entries with smaller indexes have also been committed.
The data writing process comprises the following steps: 1) the write data is submitted to the node where the Leader resides; 2) the Leader receives the data, writes it into the local WAL log and sends it to the Followers; 3) a Follower receives the data, writes it into its local WAL log, and returns a response to the Leader; 4) after the Leader has received responses from more than half of the Followers, the data is written into the Active MemTable of the storage system and a write response is returned.
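A simplified sketch of this write flow is given below; the Wal, Follower and MemTable interfaces are assumptions standing in for the real components. The Leader appends to its local WAL, replicates to the Followers, and applies the entry to the Active MemTable only after a majority of replicas have acknowledged.

```java
// Simplified sketch of the write flow listed above: the Leader appends the entry to its local WAL,
// replicates it to the Followers, and applies it to the Active MemTable once a majority of the
// replicas have acknowledged. Interfaces and names are illustrative, not from the patent.
import java.util.*;

interface Wal      { void append(byte[] entry); }
interface Follower { boolean replicate(byte[] entry); }   // true = written to the follower's WAL
interface MemTable { void apply(byte[] entry); }

class RaftLeader {
    private final Wal wal; private final List<Follower> followers; private final MemTable memTable;
    RaftLeader(Wal wal, List<Follower> followers, MemTable memTable) {
        this.wal = wal; this.followers = followers; this.memTable = memTable;
    }

    boolean write(byte[] entry) {
        wal.append(entry);                                  // 2) Leader writes its local WAL
        int acks = 1;                                       // the Leader counts as one replica
        for (Follower f : followers)
            if (f.replicate(entry)) acks++;                 // 3) Followers write their WALs and respond
        if (acks * 2 > followers.size() + 1) {              // 4) a majority of replicas acknowledged
            memTable.apply(entry);                          //    apply to the storage system
            return true;                                    //    acknowledge the client
        }
        return false;
    }
}
```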
Under high-concurrency writes of graph data, generating a complete Raft synchronization flow for every write request brings a large system load, and most of the overhead is wasted on network communication and system calls. In addition, a storage process may hold multiple graphs and multiple graph partitions, so multiple Raft Groups are created; if one communication port were maintained for each Raft Group, the system's ports would be exhausted quickly. Therefore, a Multi-Raft optimization is used: only K communication ports are maintained for one storage process, and all network communication of the Raft Groups shares these K ports. Under high-concurrency writes, multiple write requests are sent to the other nodes together by batching the data packets, which effectively improves write performance.
The standard Follower flow is to receive the data transmitted by the Leader and write it into the Raft log. After the Leader announces the applied log index X through heartbeat messages, an asynchronous IO thread reads all data with index less than or equal to X and writes it into the Active MemTable. This Follower flow brings extra IO operations, and when there are many write operations, switching the Leader causes additional delay. Therefore, the lifetime of the in-memory buffers holding the data transmitted by the Leader is extended on the Follower side; under high-concurrency writes the Follower can obtain the Leader's latest applied log index sooner, write the in-memory data directly, and avoid reading from disk.
The Raft log keeps growing as data is written; after the data is confirmed to have been written into the storage system, the log should be cleaned up in time to avoid excessive disk usage. The point and edge data should contain a long field recording the version number of the Raft log entry in which the current record resides. After the data has been turned into Segments by the asynchronous thread, the latest data version number that has been persisted should be maintained and updated in the system. The Raft process should periodically check the persisted data version number, and log data smaller than that version number can be cleaned up in time.
3) MVCC and optimistic lock based distributed transactions
As shown in fig. 5, in concurrency control of distributed transactions, two-phase commit is a very important approach and is used as the primary concurrency control protocol in industrial systems. Two-phase commit blocks conflicting transactions based on a lock mechanism to avoid any non-serializable execution. Two-phase commit is generally divided into two phases, the first phase, where all objects accessed by a transaction are locked, and the locks can be generally divided into read locks (shared locks) and write locks (exclusive locks) according to the type of access, and after it is determined that all accessed data is successfully locked, the transaction is committed to modify the database in the second phase, and then the locks are released. In distributed transactions, some means such as "Wait-to-death" or "injury-to-Wait" are typically taken in addition, because deadlock detection of a single machine is not possible like when a single machine transaction is being processed. In a distributed environment, these strategies typically require two additional stages to complete.
In the processing process of concurrent transactions, simultaneous modification conditions may exist in multi-user point-to-point edge data, and conflicts are easy to generate. Typical conflicts can be categorized into dirty reads, phantom reads, nonrepeatable reads, and update lost. To resolve data conflicts, lock technology is generally used, and there are pessimistic locks and optimistic locks common methods.
A pessimistic lock is a lock with strong exclusivity, and it resolves transaction concurrency with a pessimistic attitude. Pessimistic locking is based on the assumption that concurrency conflicts are bound to occur and that the cost of a failed concurrent update is significant, so the data is locked first, whether or not a conflict actually occurs. After a transaction reads a piece of data it locks it, blocking other transactions from reading and updating it. The pessimistic lock avoids lost modifications by holding the resources occupied by the transaction and thereby guarantees data consistency. Only after the transaction commits, or fails and its rollback completes, is the lock released and the resource freed.
Optimistic locks solve the problems of pessimistic locks to some extent: they resolve transaction concurrency with an optimistic attitude and use a more relaxed locking mechanism. The design of the optimistic lock assumes that concurrent operations on the same data do not occur frequently, so conflicts do not need to be handled in advance; instead, the operation is retried after a conflict occurs. Optimistic locking requires adding a version field to each piece of data, which is used to determine whether the data has been modified by other transactions during the transaction. If it has been modified by another transaction, the transaction must abort the commit and roll back.
MVCC (Multi-Version Concurrency Control) is controlled by saving a snapshot of the data at a point in time. MVCC allows the same data record to have multiple different versions, and the corresponding version of data desired by the user can be obtained by adding corresponding constraint conditions during query.
The present invention contemplates a distributed transaction technique based on MVCC and optimistic locks, as shown in FIG. 5. Distributed transactions are realized with a two-phase commit technique, multi-version concurrency control, an optimistic lock technique and a global transaction ID technique, and the transaction isolation level is intended to support Serializable.
The global transaction ID manager provides an increasing-ID service for the cluster; the ID acts as the credential of a distributed transaction. The global transaction ID manager meets the persistence requirement and still guarantees the monotonic increase of IDs after the system goes down or restarts. After a begin transaction statement is explicitly used, the query client automatically applies for an ID.
To provide a highly available transaction ID manager and avoid service unavailability after a single node process fails, transaction ID manager processes are intended to be created on multiple nodes and managed by the Raft protocol. The transaction ID manager on the Leader node pre-allocates an ID segment, e.g. [K1, K2], writes the record into the Raft log and copies it to other nodes. IDs within [K1, K2] are then allocated directly from memory, and after the segment is exhausted a new transaction ID segment [K2, K3] is written. When the current transaction ID manager process node fails, the service can migrate directly to a Follower node and begin allocation from a new ID segment. Although the IDs may not be contiguous, global monotonic increase is guaranteed and higher performance is achieved.
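A minimal sketch of this segment-based allocation, assuming a replicateSegment call that stands in for writing the segment record to the Raft log (all names are illustrative):

    public class GlobalTxnIdManager {
        private long nextId;            // next ID handed out from memory
        private long segmentEnd;        // exclusive end of the segment already recorded in the Raft log
        private final long segmentSize;

        public GlobalTxnIdManager(long segmentSize) { this.segmentSize = segmentSize; }

        public synchronized long allocate() {
            if (nextId >= segmentEnd) {
                long newEnd = segmentEnd + segmentSize;
                replicateSegment(segmentEnd, newEnd);  // e.g. record [K2, K3] and copy it to Followers
                nextId = segmentEnd;                   // after a failover the new Leader starts from its
                segmentEnd = newEnd;                   // replicated segmentEnd, so IDs may skip but never repeat
            }
            return nextId++;
        }

        private void replicateSegment(long from, long to) {
            System.out.printf("raft-replicate id segment [%d, %d)%n", from, to); // placeholder
        }
    }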
FIG. 6 is a schematic diagram of a point-edge data structure, wherein the point-edge data has a version number field, and the reading rule includes: 1) data that is not committed within the present transaction is allowed to be read, 2) data that is less than the present transaction ID is allowed to be read, 3) data that is greater than or equal to the present transaction ID is not allowed to be read, 4) data that is not committed by other transactions is not allowed to be read.
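By way of illustration, the four reading rules can be condensed into a single visibility predicate; the record fields below (the writing transaction's ID used as the version, plus a committed flag) are assumptions of this sketch:

    public final class MvccVisibility {
        // currentTxnId is the ID applied for at begin transaction.
        public static boolean visible(long recordTxnId, boolean recordCommitted, long currentTxnId) {
            if (recordTxnId == currentTxnId) return true;  // rule 1: this transaction's own uncommitted writes
            if (!recordCommitted) return false;            // rule 4: other transactions' uncommitted data is hidden
            return recordTxnId < currentTxnId;             // rules 2 and 3: only versions older than this transaction
        }
    }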
For a write operation in a transaction, the rules should include: 1) create a transaction buffer for recording uncommitted data, 2) control the buffer size to avoid long transactions from submitting large amounts of data, 3) post the transaction to Raft after commit and release the buffer, 4) empty and release the buffer when the transaction conflicts or rolls back.
Optimistic locks are used to resolve transaction conflicts and reduce locking overhead. The rules should include: 1) before a write transaction starts, apply to the storage for a transaction ID and generate a data snapshot; 2) before the write transaction is committed, check for each modified piece of data whether the latest version number in the current system is consistent with the version number in the snapshot; if not, the transaction conflicts and is rolled back.
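A minimal sketch of the commit-time version check (the snapshotVersions map and the lookup function are assumptions of the sketch):

    import java.util.Map;
    import java.util.function.ToLongFunction;

    public class OptimisticCommitCheck {
        // snapshotVersions: for every point/edge modified by the transaction, the version number seen
        // in the snapshot taken when the transaction started.
        public boolean canCommit(Map<String, Long> snapshotVersions,
                                 ToLongFunction<String> latestVersionInStore) {
            for (var e : snapshotVersions.entrySet()) {
                if (latestVersionInStore.applyAsLong(e.getKey()) != e.getValue()) {
                    return false; // another transaction changed this record: conflict, roll back
                }
            }
            return true;          // no conflict: append the buffered writes to Raft and commit
        }
    }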
Since the graph data modified by a transaction may be stored on different machines, two-phase commit is intended to be used to resolve distributed transactions across nodes; the two-phase transaction lock mechanism is shown in FIG. 7, and a sketch of the client-side flow follows the numbered steps below.
(1) The client analyzes the data labels involved in the point-edge data of the transaction and requests the write locks of the corresponding labels from the storage system.
(2) The client sends a prepare request to every storage server involved in the write.
(3) After receiving the request, a storage server checks whether the current transaction conflicts and returns whether the preparation succeeded.
(4) After receiving the prepare responses of all servers, the client sends an abort request and releases the write locks of the corresponding labels if any server cannot commit. If all servers can commit, a commit request is sent.
(5) If a storage server receives a commit request, the commit operation is performed. If an abort request is received, the transaction is rolled back.
(6) The client releases the write locks of the corresponding labels after all storage servers respond to the commit request.
(7) After the timeout expires, the storage system actively closes the unfinished transaction and releases the label write locks.
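The client-side flow of steps (1)-(7) can be sketched as follows; StorageServer and LockService are assumed interfaces standing in for the real RPC and lock services, not the invention's actual API:

    import java.util.List;

    public class TwoPhaseCommitClient {
        interface StorageServer {
            boolean prepare(long txnId);  // false if the server detects a conflict
            void commit(long txnId);
            void abort(long txnId);
        }
        interface LockService {
            void acquireLabelWriteLocks(long txnId, List<String> labels);
            void releaseLabelWriteLocks(long txnId, List<String> labels);
        }

        public boolean execute(long txnId, List<String> labels,
                               List<StorageServer> servers, LockService locks) {
            locks.acquireLabelWriteLocks(txnId, labels);          // step (1)
            try {
                boolean allPrepared = true;
                for (StorageServer s : servers) {                 // steps (2)(3): prepare on every server
                    if (!s.prepare(txnId)) { allPrepared = false; break; }
                }
                if (!allPrepared) {                               // step (4): any failure -> abort everywhere
                    for (StorageServer s : servers) s.abort(txnId);
                    return false;
                }
                for (StorageServer s : servers) s.commit(txnId);  // steps (4)(5): all prepared -> commit
                return true;
            } finally {
                locks.releaseLabelWriteLocks(txnId, labels);      // steps (6)(7): release the label write locks
            }
        }
    }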
4) High performance graph loading
Data loading is a basic function of a database. The traditional loading path is generally designed for small-batch imports and essentially cannot handle large-batch imports, because its processing model works on one record at a time: processing a large batch incurs heavy extra overhead such as network communication and random disk access, and it cannot efficiently exploit modern CPU features such as the CPU cache, branch prediction and the instruction pipeline. Different databases therefore provide a high-throughput graph loading function according to their storage model, compute engine model and architectural characteristics.
The high-throughput graph data loading technique studied by the present invention provides the ability to import data from a database table into the graph library. Its main working principle is: combine distributed computing with batch processing and improve overall computing efficiency by means of distributed vectorized computation; guarantee disk IO performance by directly generating the storage file data; and optimize the storage engine so that the generated files can be mounted into the storage engine efficiently and directly.
(1) Data computation
When data is imported from a database table into the graph library, localized computation is performed as far as possible according to the distribution of the table data. Because the computation flow of data loading is relatively fixed, vectorized computation is easier to implement, so a vectorized computation mode is used to better exploit CPU performance. Because the distributed computation involves a shuffle stage with serialization and deserialization, encoding is moved as early as possible in the flow so that binary data is used directly in the subsequent computation, reducing serialization and deserialization costs.
(2) Data writing
When data is written, the data file and the index file are generated at the same time, and if the storage node and the computing node are on the same physical machine, the data file is written directly into the storage directory. Because the storage engine adopts a multi-copy strategy, when data must be sent to the copies on other storage nodes over the network, it is sent file by file, and a pipelined sending mode is used among the copies. This mode reduces the bandwidth consumption of the sending end and the network communication overhead, allows data to be written to the copies in parallel, and improves overall write throughput.
(3) Data mount
After the file generation and replication of each copy is completed, the file needs to be dynamically mounted into the system. When there is no data in the system, the externally generated file is placed directly on the last level, so that extra, useless Segment merging is avoided; when data already exists in the system, the newly mounted data is an update of the existing data, so it must be mounted at level 0 and merged by a background thread.
For data mounting across multiple partitions, atomicity must be satisfied to avoid the situation where some partitions have mounted the data and others have not. A two-phase commit approach is to be adopted (a sketch follows the numbered steps below):
(1) checking the states of all the partitions, and stopping mounting if abnormal partitions exist;
(2) submitting a preparation mounting request to each partition, wherein the preparation mounting request is written into a Raft log and is transferred to a copy of each partition;
(3) if any partition fails, the current mount is abandoned and a rollback request is submitted to each partition; the rollback request is also written into the Raft log and transferred to the copies of each partition;
(4) if all partitions have completed the prepare request, a commit request is submitted to all partitions; the commit request is written to the Raft log and passed to the copies of each partition.
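As referenced above, a minimal sketch of the atomic multi-partition mount; Partition is an assumed interface whose prepare/commit/rollback calls stand for the Raft-logged requests of steps (2)-(4):

    import java.util.ArrayList;
    import java.util.List;

    public class AtomicMount {
        interface Partition {
            boolean healthy();
            boolean prepareMount(String fileSet);  // written to the Raft log and replicated to copies
            void commitMount(String fileSet);
            void rollbackMount(String fileSet);
        }

        public boolean mount(List<Partition> partitions, String fileSet) {
            for (Partition p : partitions)                           // step (1): abort if any partition is abnormal
                if (!p.healthy()) return false;
            List<Partition> prepared = new ArrayList<>();
            for (Partition p : partitions) {                         // step (2): prepare on every partition
                if (p.prepareMount(fileSet)) {
                    prepared.add(p);
                } else {                                             // step (3): failure -> roll back the prepared ones
                    for (Partition q : prepared) q.rollbackMount(fileSet);
                    return false;
                }
            }
            for (Partition p : partitions) p.commitMount(fileSet);   // step (4): all prepared -> commit
            return true;
        }
    }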
5) On-line graph data disaster recovery
The graph database storage system of the present invention supports an online data backup and recovery technique: the data of a graph can be backed up or restored while the database continues to serve normally. The main working principle is that a snapshot of the graph storage state is created at a certain moment, and the snapshot data is copied to HDFS or loaded from HDFS back into the graph storage:
(1) Graph data snapshot
The graph data maintained by the graph database storage system of the present invention mainly comprises four parts: graph basic meta information, graph data files, graph index files, and Raft log files. When a graph data snapshot is created, the snapshot contains the current latest state of these four parts of data at the same time.
The graph basic meta information is a small amount of descriptive information such as the graph model and the storage encoding. It occupies little storage space, does not change constantly, and can be included directly in the snapshot.
The graph data file is coded graph data, and the graph index file is data for accelerating index field searching during query. Both files are stored and managed by the LSM-Tree model, and have the properties of "files are not modified after writing", "files with a certain file name are created only once". When the graph snapshot is created, the file names of all the graph data files and the graph index files in the current graph storage system are added into the graph snapshot to be backed up, and the files are set to be in a protection state, so that the files cannot be deleted by other operations before the backup is finished.
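For illustration, creating such a snapshot amounts to recording the live file names and marking each file protected so that compaction cannot delete it before the backup finishes; the sketch below assumes a protectFile callback provided by the storage engine:

    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.function.Consumer;

    public class GraphSnapshot {
        private final Set<String> fileNames = new HashSet<>();

        public static GraphSnapshot create(Collection<String> liveDataFiles,
                                           Collection<String> liveIndexFiles,
                                           Consumer<String> protectFile) {
            GraphSnapshot s = new GraphSnapshot();
            for (String f : liveDataFiles)  { s.fileNames.add(f); protectFile.accept(f); }
            for (String f : liveIndexFiles) { s.fileNames.add(f); protectFile.accept(f); }
            return s;
        }

        public Set<String> fileNames() { return fileNames; }
    }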
The Raft log file is the synchronized write-operation log used by the graph database storage system of the present invention to maintain multi-copy data consistency. Unlike the graph data files, the Raft log file keeps growing dynamically while the database operates. When a graph snapshot is created, the continuously written Raft log is cut at a suitable position, the outdated log files are discarded, and a separate copy of the latest log files is made for backup; the original Raft log file continues to serve the running database storage.
(2) Online graph data backup
The graph database storage system of the present invention provides two backup modes: full data backup and incremental data backup. A full data backup uses a graph data snapshot, copies every data file corresponding to the recorded file names/file paths to HDFS one by one, and records the content of the snapshot (i.e. the data file names) in the graph database storage. An incremental data backup requires that a full data backup has been performed on the graph before. The system compares the content of the current snapshot with the previously recorded full-backup snapshot content, finds the newly added graph data files and graph index files, and copies only these new data files. When the incremental data is far smaller than the base data, the incremental backup function significantly reduces the file-copy cost of graph data backup and the storage occupied by backup data.
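Because LSM-Tree files are written once and never modified, an incremental backup reduces to a set difference on file names; a minimal sketch follows (the copyToHdfs callback is a placeholder, not a real HDFS client call):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.function.Consumer;

    public class IncrementalBackup {
        public List<String> run(Set<String> currentSnapshotFiles,
                                Set<String> lastFullBackupFiles,
                                Consumer<String> copyToHdfs) {
            List<String> added = new ArrayList<>();
            for (String f : currentSnapshotFiles) {
                if (!lastFullBackupFiles.contains(f)) { // only files created since the full backup
                    copyToHdfs.accept(f);
                    added.add(f);
                }
            }
            return added; // recorded as the content of this incremental backup
        }
    }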
At backup, the graph database store will create a snapshot of the graph data for all copies. When partial data node downtime occurs in the backup process, backup operation can be completed by switching backup data source copies, so that high availability of functions is realized.
(3) Online graph data recovery
The graph database storage system of the present invention provides an online graph data recovery function: restoring the data of one graph does not affect the availability of the whole graph database or of other graphs.
When the graph data is restored, the storage system firstly creates the graph of the same model according to the basic meta information in the backup data, then copies the files in the backup data from the HDFS to the data catalog of the newly created graph one by one, and adds the files into the graph data file list managed in the memory. When the recovery is completed, the graph database storage system can directly provide service by using the data file list recorded in the memory.
The graph database storage system can also recover graph data from "one full backup + several subsequent incremental backups". Graph recovery restores the basic meta information and the Raft log file from the latest incremental backup, while data files/index files are searched for in each backup in reverse chronological order. This restores the graph to the data state at the last incremental backup.
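A minimal sketch of this lookup order, assuming (for illustration) that each backup is represented by a map from file name to file content and the list is ordered newest first:

    import java.util.List;
    import java.util.Map;

    public class RestoreFromBackups {
        record Backup(Map<String, byte[]> filesByName) {}

        // Meta information and Raft logs are taken from the newest incremental backup; each data or
        // index file is searched for in the backups from newest to oldest.
        public byte[] findFile(String fileName, List<Backup> newestFirst) {
            for (Backup b : newestFirst) {
                byte[] content = b.filesByName().get(fileName);
                if (content != null) return content; // first hit reflects the state at the last increment
            }
            return null; // file not present in any backup
        }
    }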
6) RBAC-based multi-granularity rights control
Data security is an important problem in databases, and in military applications the security of graph data needs to be controlled to prevent the leakage of sensitive information. An access control policy meeting the requirements of military applications should satisfy three aspects of control:
confidentiality control: ensuring that resources are not illegally read;
integrity control: ensuring that resources are not illegally added, deleted, or rewritten;
effectiveness control: ensuring that resources are not used or destroyed by illegal access subjects.
RBAC (Role-Based Access Control) determines a subject's access rights by authenticating its roles. The system verifies only the roles, regardless of the specific identity of the subject. In RBAC, a role is a group of users that share the same scope of behavior. The mapping between visitors and roles is many-to-many: one visitor may hold multiple roles, and one role may contain multiple visitors. RBAC effectively resolves the problems of system flexibility and management difficulty found in the conventional discretionary access control (DAC) and mandatory access control (MAC) methods. RBAC is a flexible security policy with advantages such as convenient authorization management, least privilege, hierarchy by job function, and separation of duties.
The invention is intended to control the user's data rights by means of an RBAC-based model. The read right, the write right, the update right, the delete right and the management right are to be supported.
The read rights may be subdivided into graph-level read rights, label-level read rights, and attribute-level read rights.
The write rights may be subdivided into graph-level point-edge write rights and label-level point-edge write rights.
The update rights may be subdivided into graph-level update rights, label-level update rights, and attribute-level update rights.
The deletion rights may be subdivided into graph-level data deletion rights and label-level data deletion rights.
The management rights include meta-information modification, graph deletion, graph backup and restoration, and graph renaming.
In addition to rights (Permission), the rights management system of the present invention will include: users (User), roles (Role), and Session (Session). The invention provides a user management system for user registration, user management and authority setting.
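A minimal sketch of how such a multi-granularity check could be expressed; the enum, record and the "null means any" convention are assumptions of the sketch, not the invention's concrete data model:

    import java.util.Map;
    import java.util.Set;

    public class RbacChecker {
        enum Action { READ, WRITE, UPDATE, DELETE, MANAGE }

        // A null graph/label/attribute means the grant covers that whole level.
        record Permission(Action action, String graph, String label, String attribute) {}

        public boolean allowed(Set<String> userRoles, Map<String, Set<Permission>> rolePermissions,
                               Action action, String graph, String label, String attribute) {
            for (String role : userRoles) {                                   // a user may hold several roles
                for (Permission p : rolePermissions.getOrDefault(role, Set.of())) {
                    if (p.action() == action
                            && matches(p.graph(), graph)
                            && matches(p.label(), label)
                            && matches(p.attribute(), attribute)) {
                        return true;
                    }
                }
            }
            return false;                                                     // default deny
        }

        private boolean matches(String granted, String requested) {
            return granted == null || granted.equals(requested);
        }
    }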
While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.

Claims (10)

1. A hybrid distributed graph data storage and computation method, comprising:
step 1, centering on a distributed graph architecture for a domestic heterogeneous computing platform, mainstream graph database evaluation benchmarks, the mainstream graph database architecture, cloud-oriented hybrid load scheduling and benchmark testing, container orchestration and scheduling are carried out based on the cloud-native technologies Docker and Kubernetes, shielding application operation from the influence of the software and hardware infrastructure;
step 2, carrying out large-scale graph data storage management on the basis of the main stream architecture of the graph database, and carrying out transaction lock and high-performance graph storage structure fusion on the basis of multiple copies of the graph storage structure, copy consistency and online graph data disaster tolerance;
step 3, based on the transaction analysis integration of the high-performance graph storage structure, the copy consistency and the data read-only copy, generating graph snapshots without copying, and providing graph data of transaction specifications for graph analysis; based on transaction analysis resource control, resource conflict during simultaneous running of transaction analysis tasks is solved through a scheduling strategy, and the transaction analysis mixed processing function and performance of a graph database are met;
step 4, carrying out distributed graph calculation aiming at a distributed graph structure, carrying out graph analysis parallel processing based on a GAS graph programming model on the basis of main flow graph algorithm capability analysis and algorithm call support, and accelerating the main flow graph algorithm by selecting an optimization parallel processing model through coding optimization, graph calculation fault tolerance and self-adaptive model;
step 5, optimizing the distributed graph query based on openCypher, with openCypher compilation, openCypher language extension and stored procedure support satisfying all queries; decomposing a query through openCypher into a plurality of parallel tasks and executing the tasks in a distributed manner in the cluster; performing plan optimization, distributed parallel execution optimization and deep query optimization, and implementing various optimization strategies during logical execution plan generation and physical execution, to provide technical support for high-performance query of the graph database;
step 6, based on the heterogeneous graph query federation technology, performing high-performance heterogeneous graph model query and cluster-crossing query around heterogeneous graph unified metadata, heterogeneous graph model query and cluster-crossing query distribution, aiming at unified views of heterogeneous graph queries with different types of vertexes and/or edges and different attribute numbers and types;
and 7, developing graph database prototype software design and test verification based on the graph database architecture, wherein the graph database prototype system integrating hybrid transaction analysis is established and the test verification is completed, wherein the graph database prototype software design and test verification comprises software architecture design, core function design, software function development, management and operation and maintenance function integration, test environment construction, reference test and typical scene test verification.
2. The method for storing and calculating the data of the hybrid distributed graph according to claim 1, wherein the cloud-oriented hybrid load scheduling in step 1 includes a strategy requirement oriented to autonomous control of domestic resources, and a scheduling strategy oriented to characteristics of the heterogeneous resources is implemented by adopting a containerization scheme for graph storage service, graph query service and graph calculation service according to different operating systems and different hardware conditions of domestic heterogeneous clusters, including comprehensive perception of heterogeneous resources, reasonable and flexible scheduling of hybrid loads and resource management and control of fine granularity.
3. The method for storing and computing hybrid distributed graph data according to claim 1, wherein the graph data storing management in step 2 includes distributed storing of graph data, multiple copies of data, data disaster tolerance, distributed transaction, and data rights management;
the distributed storage comprises a memory structure design of point and side files and a storage file structure design, and is used for carrying out point inquiry, associated side inquiry and attribute filtering, and the storage structure can keep the load balance of storage in a cluster;
the data multi-copy comprises copy creation and consistency management, and service switching and data recovery when storage service faults are supported;
The data disaster recovery comprises data backup and data recovery, wherein backup data is stored in the HDFS and can be recovered by the online data of another cluster;
the distributed transaction comprises transaction grade support, transaction performance and transaction conflict processing, the transaction of a Serializable grade is supported, the transaction performance is improved through a multi-version control mode, the transaction conflict is processed through an optimistic lock, and for a transaction request crossing a plurality of physical nodes, the transaction request is provided by a two-stage submitting mode and a global transaction ID manager mode;
the data authority comprises management and control of the data authority of multiple users, defines users, user groups and roles by adopting an RBAC model, and provides read authority, write authority and management authority for the graph storage system, wherein the read authority, the write authority cover graph level, label level and attribute level.
4. The method for storing and calculating hybrid distributed graph data according to claim 3, wherein the graph data storage management adopts an efficient KV storage technology based on LSM-Tree to realize mass graph data storage, and writes inserted data, updated data and deleted data into an Active MemTable buffer in memory, the three being distinguished by an identification bit for subsequent data merging; after the amount of data in the Active MemTable buffer reaches a threshold, the data is sorted and consolidated into an Immutable MemTable, so that a write thread asynchronously persists it as Segments on disk.
5. The method of claim 4, wherein the step 2 further comprises a distributed multi-copy storage, and the graph data is split into sub-graphs and stored in each node after the distributed storage is adopted, and the splitting manner is as follows: hash by start ID, hash by end ID, hash by side ID.
6. The method according to claim 1, wherein the distributed graph computation in step 4 includes graph computation model, graph data partitioning, graph data encoding, algorithm execution fault tolerance.
7. The method according to claim 1, wherein the distributed graph query in step 5 includes an openCypher grammar compiler, a stored procedure compiler, a query optimizer, a distributed executor, and resource pool allocation.
8. The method according to claim 1, wherein the heterogeneous map federation query in step 6 includes meta information identification, query distribution, and query encryption for a plurality of local or remote heterogeneous maps, and the federation query supports multi-map aggregation queries and multi-map association queries by distinguishing map data through namespaces for the plurality of heterogeneous maps.
9. The method for storing and computing data of a hybrid distributed graph according to claim 1, wherein the integration of the hybrid transaction analysis in step 7 avoids copying data between the graph database and the graph computation, and adopts a data copy technology to fuse the analysis algorithm and the transaction query, and the snapshot of the graph computation is based on the read-only copy of the data by adding a read-only copy of the data to the multiple copies of the data, and the access pressure to the graph database is reduced while the latest data is acquired.
10. The hybrid distributed graph data storage and computation method of claim 9, wherein the hybrid transaction analysis integration MVCC and optimistic lock based distributed transactions in step 7 are controlled concurrently using a two-phase commit method that blocks conflicting transactions based on a lock mechanism to avoid any non-serializable execution, the two-phase commit being divided into two phases: in the first stage, all objects accessed by the transaction are locked, the locks are divided into a read lock and a write lock according to different access types, the read lock is a shared lock, the write lock is a mutual exclusive lock, after all accessed data are successfully locked, the modification of the transaction to the database is submitted in the second stage, and then the lock is released; the MVCC is controlled by storing a snapshot of the data at a certain time point, and allows the same data record to have a plurality of different versions, and the corresponding version data desired by the user is obtained by adding corresponding constraint conditions when inquiring.
CN202310889520.1A 2023-07-19 2023-07-19 Mixed distributed graph data storage and calculation method Pending CN117112692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310889520.1A CN117112692A (en) 2023-07-19 2023-07-19 Mixed distributed graph data storage and calculation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310889520.1A CN117112692A (en) 2023-07-19 2023-07-19 Mixed distributed graph data storage and calculation method

Publications (1)

Publication Number Publication Date
CN117112692A true CN117112692A (en) 2023-11-24

Family

ID=88806371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310889520.1A Pending CN117112692A (en) 2023-07-19 2023-07-19 Mixed distributed graph data storage and calculation method

Country Status (1)

Country Link
CN (1) CN117112692A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117808180A (en) * 2023-12-27 2024-04-02 北京科技大学 Path planning method, application and device based on knowledge and data combination
CN117972154A (en) * 2024-03-27 2024-05-03 支付宝(杭州)信息技术有限公司 Graph data processing method and graph calculation engine
CN118069076A (en) * 2024-04-22 2024-05-24 江苏华存电子科技有限公司 Pre-disc filling method and system for NAND product
CN117808180B (en) * 2023-12-27 2024-07-05 北京科技大学 Path planning method, application and device based on knowledge and data combination


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination