CN111639082B - Object storage management method and system of billion-level node scale knowledge graph based on Ceph - Google Patents


Info

Publication number
CN111639082B
CN111639082B
Authority
CN
China
Prior art keywords
data
graph
vertex
index
ceph
Prior art date
Legal status
Active
Application number
CN202010514803.4A
Other languages
Chinese (zh)
Other versions
CN111639082A (en)
Inventor
曹亮 (Cao Liang)
刘魁 (Liu Kui)
李超 (Li Chao)
Current Assignee
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202010514803.4A
Publication of CN111639082A
Application granted
Publication of CN111639082B
Status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2219 Large Object storage; Management thereof
    • G06F16/2228 Indexing structures
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention discloses an object storage management method and system for a billion-node-scale knowledge graph based on Ceph. The method comprises: constructing a graph storage architecture; acquiring entity data of a plurality of entities corresponding to a target service; generating the knowledge graph corresponding to the target service according to the entity data and storing it, with Ceph serving as the distributed resource store; adding an external index background mechanism; and using a distributed computing engine to decompose a large task into multiple subtasks, distribute the subtasks to different machines for execution, and aggregate the results after execution finishes, thereby providing large-scale data processing capacity that supports OLAP requirements so users can perform data analysis based on the knowledge graph. The invention also provides a corresponding object storage management system for the Ceph-based billion-node-scale knowledge graph. The scheme introduces a distributed resource manager, is scalable and highly available, can store and express massive knowledge, supports data volumes of billions of nodes, and is reliable, easy to use, and efficient.

Description

Object storage management method and system of billion-level node scale knowledge graph based on Ceph
Technical Field
The invention relates to the technical field of information processing, in particular to an object storage management method and system for a billion-level node scale knowledge graph based on Ceph.
Background
A knowledge graph describes knowledge resources and their carriers using visualization techniques, mining, analyzing, constructing, drawing, and displaying knowledge and the interrelations among knowledge resources and carriers. A knowledge graph can extract hidden knowledge from large-scale data to construct a graph-based data model. The ultimate purpose of the technology is to organize collected data into structured, reusable, rationally stored form for further use, and the storage format of the knowledge graph matches this requirement almost perfectly. A knowledge graph describes the entities or concepts existing in the real world and the associations among them: first, each entity is identified by a globally unique ID, analogous to each person having an identity-card number; second, attribute-value pairs characterize the intrinsic properties of entities, while relationships connect two entities and characterize the association between them.
The biggest defect of existing graph storage systems is that they are not truly distributed. In the big-data era, ever more data accumulates while the capacity of a single machine is limited, so data volumes beyond a single machine's bearing capacity are hard to process; the underlying storage falls far short of block-storage and object-storage approaches; graph query and graph analysis are inefficient; disaster tolerance and real-time performance are poor; and at the scale of hundreds of millions of nodes, dynamic capacity expansion is difficult and node-association queries are slow.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method and system for object storage management of a billion-level node scale knowledge graph based on Ceph. The scheme can process knowledge-graph data at billion-node scale, supports large-scale graph data storage with elastic, linear expansion, offers high availability and fault tolerance, has OLTP and CRUD characteristics, and simultaneously supports OLAP data analysis and external indexing.
The purpose of the invention is realized by the following technical scheme:
the object storage management method of the billion-level node scale knowledge graph based on the Ceph comprises the following steps:
s1: constructing a graph storage architecture, acquiring entity data of a plurality of entities corresponding to a target service, generating a knowledge graph corresponding to the target service according to the entity data, storing the knowledge graph, taking Ceph as a distributed resource storage, constructing a Ceph cluster by using a small cluster consisting of a plurality of monitors by using a Client/Server architecture, and simultaneously storing graph data by using a plurality of OSD (on screen displays) under a single Monitor small cluster;
s2, constructing an external index background, namely mapping the knowledge map data into a fixed index data structure, using an elastic search/Solr retrieval engine as an external index plug-in to realize non-equivalent query, and simultaneously combining an efficient indexing mechanism to construct the external index background;
s3, constructing an integrated distributed computing engine framework, constructing a distributed computing engine by using a Spark computing engine framework, converting the graph relation into a Spark operator by using a graph X library, storing the graph data on the nodes of the Ceph cluster in a distributed manner by using RDD (resource description language) by using the graph X library, and respectively and correspondingly storing a vertex set and an edge set by using the vertex RDD and the edge RDD;
and S4, managing the graph storage architecture: on the basis of the graph storage architecture, the external index background, and the distributed computing engine, providing three-layer expansion query, data writing, data reading, cluster capacity expansion, metadata backup, metadata snapshot, online object analysis, and online analytical processing operations to realize management of the knowledge graph's data.
Specifically, the efficient indexing mechanism in step S2 includes a graph index and a vertex center index, where the graph index is a global index structure of the entire knowledge graph; the vertex center index is a local index structure built for each vertex.
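As an illustrative sketch (the class and method names below are hypothetical, not any real graph-database API), the two index types can be modeled as a global attribute-value to vertex-ID map for whole-graph equality retrieval, plus a sorted per-vertex list of (edge label, neighbor) pairs that supports leftmost matching:

```python
from collections import defaultdict
import bisect

class GraphIndexes:
    """Toy model of a global graph index plus per-vertex (vertex center) indexes."""

    def __init__(self):
        self.graph_index = defaultdict(set)    # (attribute, value) -> set of vertex IDs
        self.vertex_index = defaultdict(list)  # vertex ID -> sorted [(edge label, neighbor)]

    def add_vertex(self, vid, **attrs):
        for key, value in attrs.items():
            self.graph_index[(key, value)].add(vid)

    def add_edge(self, src, label, dst):
        bisect.insort(self.vertex_index[src], (label, dst))

    def find_vertices(self, attribute, value):
        # graph index: equality lookup over the entire knowledge graph
        return self.graph_index.get((attribute, value), set())

    def neighbors_by_label(self, vid, label):
        # vertex center index: leftmost match on the sorted (label, neighbor) list
        edges = self.vertex_index[vid]
        lo = bisect.bisect_left(edges, (label, ""))
        return [dst for (lbl, dst) in edges[lo:] if lbl == label]
```

The sorted per-vertex list is why only leftmost (prefix) matching is supported: the lookup key must start with the leading component of the stored sort order.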
Specifically, the step S3 further includes a partitioning operation, and specifically includes the following sub-steps:
s101, carrying out Hash partitioning on the vertex RDD according to the ID of the vertex, and distributing vertex data on a cluster in a multi-partition mode;
s102, partitioning the RDD according to a specified partition strategy, and distributing the edge data on the cluster in a multi-partition mode;
s103, storing a routing table for recording the relation between the vertex and all the side RDD partitions in the partitions of the vertex RDD, and when the side RDD needs vertex data, the vertex data is sent to the side RDD partitions by the vertex RDD according to the routing table.
Specifically, the data writing step in step S4 includes the following substeps:
s201, connecting a client to a Monitor, acquiring the Map information of the cluster, and requesting a corresponding main OSD data node;
s202, the main OSD data node writes the data of the other two replica nodes simultaneously, and waits for the main node and the other two replica nodes to finish the data writing state, and after the main node and the replica nodes are successful in writing state, a finishing signal is returned to the Client, and the data writing is finished.
Specifically, the cluster capacity expansion step in step S4 includes the following substeps:
s301, the Client connects the Monitor to obtain Map information of the cluster, the OSD1 of the new main node uploads a request to the Monitor, and the OSD2 node takes over the OSD1 node to become a temporary main node;
s302, the temporary main node OSD2 synchronizes the total data to the new main node OSD1, and the ClientIO read-write is directly connected with the temporary main node OSD2 for data read-write;
s303, the temporary main node OSD2 receives the read-write IO and writes the data in the other two copy nodes at the same time, and after the data in the temporary main node OSD2 and the data in the other two copy nodes are written successfully, a signal is returned to the Client, and the read-write of the Client IO is finished;
s304, if the OSD1 data of the nodes are synchronized, the temporary main node OSD2 uploads a request to the Monitor to give out the role of the main node, the OSD1 nodes become the main node again, and the OSD2 nodes become copy nodes;
and S305, at the same time, on the graph data level, after the node capacity expansion is realized, the graph data is cut according to a graph data cutting mode and is respectively stored on a plurality of machines.
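The role handover in S301–S304 can be reduced to a toy simulation (hypothetical names; this is not the Ceph recovery/backfill protocol, only the observable invariant that writes served by the temporary primary end up on the backfilled new primary):

```python
class ToyNode:
    """Minimal stand-in for an OSD node with a key-value object store."""
    def __init__(self, name, data=None):
        self.name = name
        self.data = dict(data or {})

def expand_cluster(osd2, writes, make_new_primary):
    """S301-S304 sketch: OSD2 serves IO as temporary primary while the new
    primary OSD1 is backfilled, then hands the primary role back."""
    osd1 = make_new_primary()      # S301: new, empty main node joins
    for oid, val in writes:        # S303: OSD2 keeps serving client writes
        osd2.data[oid] = val
    osd1.data.update(osd2.data)    # S302/S304: full-data sync to OSD1
    return osd1, osd2              # roles after: OSD1 main, OSD2 replica
```

After the handover both nodes hold identical data, including writes accepted during the expansion window.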
Specifically, the graph data cutting mode comprises two modes: point cutting and edge cutting. In the point-cut mode, the data is cut at the vertices of the graph: the cutting line passes through vertices, each edge is stored only once and appears on only one machine, and a vertex with many neighbor vertices may be distributed across several different machines. In the edge-cut mode, the data is cut at the graph edges: the cutting line passes only through edges connecting vertices, each vertex is stored only once, and the cut edges are distributed across several different machines.
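A minimal sketch of the two cutting modes (plain Python, with a simple modulo placement standing in for a real placement policy):

```python
def point_cut(edges, n_machines):
    """Point (vertex) cut: each edge is stored exactly once; endpoint
    vertices are replicated onto every machine holding one of their edges."""
    machines = [{"vertices": set(), "edges": []} for _ in range(n_machines)]
    for i, (src, dst) in enumerate(edges):
        m = machines[i % n_machines]   # place each edge on exactly one machine
        m["edges"].append((src, dst))
        m["vertices"].update((src, dst))
    return machines

def edge_cut(edges, n_machines):
    """Edge cut: each vertex is stored exactly once; an edge whose endpoints
    live on different machines is replicated onto both machines."""
    owner = lambda v: v % n_machines
    machines = [{"vertices": set(), "edges": []} for _ in range(n_machines)]
    for src, dst in edges:
        machines[owner(src)]["vertices"].add(src)
        machines[owner(dst)]["vertices"].add(dst)
        machines[owner(src)]["edges"].append((src, dst))
        if owner(dst) != owner(src):
            machines[owner(dst)]["edges"].append((src, dst))
    return machines
```

The trade-off the text describes falls out directly: point cut duplicates high-degree vertices but never edges, edge cut duplicates crossing edges but never vertices.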
Specifically, the metadata snapshot step in step S4 includes: effectively recovering a previous data state according to the metadata information, and restoring a program to a historical system state; saving system data at a specific time point and generating the system report for that time point; and exporting the snapshot data for offline work.
Specifically, the three-layer expansion query step in step S4 includes the following substeps:
S401, setting a user-given vertex set Vset as the basic data of the first-layer expansion query, setting the first-layer query filtering condition as a filtering condition ConditionA on vertex Label/vertex attributes, and performing the first-layer vertex expansion query;
S402, taking the vertex set satisfying the first-layer filtering condition as the basic data of the second-layer expansion query, setting the second-layer query filtering condition as a filtering condition ConditionB on edge Label/edge attributes, and performing the second-layer edge expansion query;
and S403, taking the edge set satisfying the second-layer filtering condition as the basic data of the third-layer expansion query, setting the attribute query condition, performing the third-layer attribute expansion query, and outputting the query result of the third-layer expansion query.
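The three-layer pipeline of S401–S403 can be sketched over a plain adjacency list (illustrative data model, not the system's storage format):

```python
def three_layer_query(vertices, adjacency, vset, condition_a, condition_b, condition_c):
    """vertices: {vid: attrs}; adjacency: {vid: [(edge_attrs, dst_vid)]}."""
    # S401: first layer, filter the seed vertex set by vertex ConditionA
    layer1 = [v for v in vset if condition_a(vertices[v])]
    # S402: second layer, expand along edges whose label/attributes pass ConditionB
    layer2 = [(src, edge, dst)
              for src in layer1
              for edge, dst in adjacency.get(src, [])
              if condition_b(edge)]
    # S403: third layer, check the target vertices' attributes and output
    return [(src, edge, dst) for src, edge, dst in layer2
            if condition_c(vertices[dst])]
```

The third layer is needed because the second-layer result carries only target-vertex IDs, so the target vertices' attribute conditions cannot be checked any earlier.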
The object storage management system of the Ceph-based billion-level node scale knowledge graph comprises a graph data storage module, a distributed computing module, an index module, and a metadata management module. The graph data storage module is used for storing the object data of the large-scale knowledge graph in a distributed mode and providing object storage, block device storage, and file system services;
the distributed computing module is used for decomposing a large task into a plurality of subtasks through Spark RDD in-memory computing, deploying the subtasks to different machines for execution, and summarizing the results after completion, so as to provide efficient large-scale data processing capacity that supports OLAP requirements and provides knowledge-graph-based data analysis for users;
the index module is used for mapping the knowledge graph data into a fixed index data structure and providing graph index, vertex center index, and external index functions for users;
the metadata management module is used for backup of metadata, snapshot of the metadata, program recovery, generation of a time point report and offline work of the system.
The invention has the beneficial effects that: the scheme adds a big-data distributed architecture and introduces a distributed resource manager; its main performance characteristics are scalability and high availability, embodied chiefly in the distributed cluster, the external index, data reliability, and the distributed resource manager. Meanwhile, the system can store and express massive knowledge, supports data volumes of billions of nodes, and is reliable, easy to use, and efficient.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is an overall distribution architecture diagram of the present invention.
FIG. 3 is a diagram of the distributed resource management architecture of the present invention.
FIG. 4 is a diagram of the integrated distributed computing engine architecture of the present invention.
FIG. 5 is a diagram of the external index plug architecture of the present invention.
Fig. 6 is a data writing flow chart of the present invention.
FIG. 7 is a flow chart of cluster capacity expansion of the present invention.
Fig. 8 is a functional block diagram of the system of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
In this embodiment, as shown in figs. 1-2, the object storage management method of the Ceph-based billion-level node scale knowledge graph includes the following steps:
step 1: and constructing a graph storage framework, acquiring entity data of a plurality of entities corresponding to the target service, and generating and storing a knowledge graph corresponding to the target service according to the entity data. As shown in fig. 3, for the overall distributed architecture, ceph is used as a distributed resource storage, a Client/Server architecture is adopted, a Ceph cluster is constructed by a small cluster composed of multiple monitors, and multiple OSDs are used for storing graph data under a single Monitor small cluster.
Step 2: first mapping the knowledge graph data into a fixed index data structure. In order to handle knowledge data at billion-node scale, as shown in fig. 5, an external index background mechanism is added, using the Elasticsearch/Solr search engine as an external index plug-in so that the index can also serve non-equality queries; combined with the efficient indexing mechanism, this constructs the external index background. The external index background and the index engine exchange data through an API.
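A hedged sketch (plain Python, not the Elasticsearch/Solr client API) of what "mapping graph data into a fixed index data structure" can look like: each vertex or edge is flattened into a fixed-schema document, over which a non-equality query such as a range scan can then run:

```python
def to_index_doc(element_id, kind, attrs):
    """Flatten a graph element into a fixed-schema index document."""
    doc = {"_id": element_id, "_kind": kind}   # fixed fields of the schema
    doc.update(attrs)                          # indexed attribute fields
    return doc

def range_query(docs, field, lo, hi):
    """A non-equality query the external index enables: lo <= field < hi."""
    return [d["_id"] for d in docs
            if field in d and lo <= d[field] < hi]
```

In a real deployment the documents would be pushed to Elasticsearch/Solr over its API and the range scan executed by the engine; the fixed schema is what makes that handoff possible.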
Step 3: constructing an integrated distributed computing engine framework: a distributed computing engine is built with the Spark computing engine framework; the GraphX library converts the graph relations into Spark operators and stores the graph data on the Ceph cluster nodes in a distributed manner as RDDs (Resilient Distributed Datasets), with the vertex RDD and the edge RDD respectively storing the vertex set and the edge set.
Step 4: managing the graph storage architecture: on the basis of the constructed graph storage architecture, external index background, and distributed computing engine, providing three-layer expansion query, data writing, data reading, cluster capacity expansion, metadata backup, metadata snapshot, online object analysis, and online analytical processing operations to realize management of the knowledge graph's data.
In this embodiment, the efficient indexing mechanism includes a graph index and a vertex center index. The graph index is a global index structure over the entire knowledge graph; by indexing the attributes of entities or edges it obtains better selectivity, speeding up graph traversal, and it performs equality retrieval through a fixed attribute combination composed of one attribute or a group of attributes. The vertex center index is a local index structure established for each vertex: in a large graph where a vertex may have thousands of edges or more, traversing a vertex requires filtering its edges and would otherwise be inefficient, so the vertex center index supports only leftmost matching.
For the index-based three-layer expansion query, first a user-given vertex set Vset is set as the basic data of the first-layer expansion query; the first-layer query filtering condition is set as the filtering condition ConditionA on vertex Label/vertex attributes, and the first-layer vertex expansion query is performed. Then the vertex set satisfying the first-layer filtering condition is taken as the basic data of the second-layer expansion query; the second-layer filtering condition is set as the filtering condition ConditionB on edge Label/edge attributes, and the second-layer edge expansion query is performed. This query returns only edges satisfying ConditionB, but the vertices attached to those edges carry only vertex IDs, without any attribute information, so whether they satisfy the attribute conditions is still undetermined. Finally, the edge set satisfying the second-layer filtering condition is taken as the basic data of the third-layer expansion query; the attribute query condition is set, the third-layer attribute expansion query is performed, and the query result is output. The efficient index thus takes effect.
In this embodiment, as shown in fig. 4, in order to support the OLAP requirement, a set of high-performance computing framework APIs is further extended and Spark is supported. The GraphX library converts the graph relations into Spark operators; GraphX stores graph data on the cluster nodes in a distributed manner as RDDs, using a vertex RDD (VertexRDD) and an edge RDD (EdgeRDD) to store the vertex set and the edge set. The vertex RDD distributes vertex data across the cluster in multiple partitions by hash-partitioning vertices by their IDs. The edge RDD is partitioned according to a specified partition strategy (PartitionStrategy), distributing edge data across the cluster in multiple partitions. In addition, the vertex RDD holds a routing table, which is the routing information from vertices to edge RDD partitions. The routing table lives in each partition of the vertex RDD and records the relation between its vertices and all edge RDD partitions. When an edge RDD partition needs vertex data, the vertex RDD sends the vertex data to that edge RDD partition according to the routing table. At this point, the graph data is stored as Spark RDDs.
At the Spark bottom layer, when an operator executes, a SparkContext is started. The SparkContext registers with the resource manager and applies to run Executor resources; the resource manager allocates the Executor resources and starts StandaloneExecutorBackend (task scheduling), and the Executors' running status is sent to the resource manager with the heartbeat. The SparkContext constructs a DAG graph, decomposes it into Stages, and sends the TaskSets to the TaskScheduler. The Executor applies to the SparkContext for Tasks, the TaskScheduler issues Tasks to the Executor to run, and the SparkContext issues the application code to the Executor; the Tasks run on the Executor, and all resources are released when the run finishes. In this way, efficient operations such as mapEdges, mapVertices, and aggregateMessages are achieved, responding quickly to data-analysis requirements.
In this embodiment, as shown in fig. 6, for data writing, a Client connects to the Monitor, obtains the cluster Map information, and requests the corresponding main OSD data node. The main OSD data node simultaneously writes the data to the other two replica nodes and waits for the write status of the main node and both replica nodes; after both report success, a completion signal is returned to the Client and the data write is finished. Data reading proceeds in the same way as data writing.
In this embodiment, as shown in fig. 7, for cluster capacity expansion, the Client connects to the Monitor to obtain the cluster Map information. Meanwhile, because the new main node OSD1 has no PG (Placement Group) data, it actively reports to the Monitor, and the OSD2 node is notified to take over temporarily as the main node. The temporary main node OSD2 synchronizes the full data to the new main node OSD1, while Client IO reads and writes connect directly to the temporary main node OSD2. The OSD2 node receives the read/write IO and simultaneously writes to the other two replica nodes; once all three copies are written successfully, a signal is returned to the Client and the Client IO read/write is finished. When the OSD1 node's data synchronization completes, the temporary main node OSD2 uploads a request to the Monitor to relinquish the main role; OSD1 becomes the main node and OSD2 becomes a replica node. Meanwhile, at the graph-data level, after capacity expansion the graph is cut; that is, the data is cut and stored on multiple machines. The first mode is the point cut, where the cutting line passes through a Vertex of the graph rather than an Edge: each edge is stored only once and appears on only one machine, while a vertex with many neighbors may be distributed across different machines. The second mode is the edge cut, where the cutting line passes only through the Edges connected to a vertex: each vertex is stored only once, and the cut edges may be stored on multiple machines. Cluster capacity expansion is then complete.
In this embodiment, an object storage management system of a Ceph-based billion-level node scale knowledge graph is also provided; the system includes a graph data storage module, a distributed computing module, an index module, and a metadata management module.
The graph data storage module is used for storing object data of the large-scale knowledge graph in a distributed mode and providing object storage, block device storage and file system services.
The distributed computing module is used for decomposing a large task into a plurality of subtasks through Spark RDD in-memory computing, deploying the subtasks to different machines for execution, and summarizing the results after completion, so as to provide efficient large-scale data processing capacity that supports OLAP requirements and provides knowledge-graph-based data analysis for users.
The index module is used for mapping the knowledge graph data into a fixed index data structure and providing graph index, vertex center index, and external index functions for users.
The metadata management module is used for backup of metadata, snapshot of the metadata, program recovery, generation of a time point report and offline work of the system.
In this embodiment, the whole scheme can be applied to an anti-fraud detection scenario. A heterogeneous network is constructed from user information, device information, and social relations, and the heterogeneous network graph is applied to user association analysis and anti-fraud detection. After data import, the number of nodes reaches the order of 1.1 billion and the relation data reaches the order of 50 billion, forming a complex heterogeneous network comprising 11 types of nodes and 13 types of edges. Suspicious users are screened by specific rules, and users having specific associations with them are checked; the network features and user features of the subnets formed by all users specifically associated with the suspicious users are examined; what associations can link a particular user together is analyzed; and data spanning up to 6 layers of association relations can be analyzed to complete a series of data-analysis tasks. In this 1.1-billion-node graph, the graph traversal and query response time of the scheme is 4 to 100 times faster than existing graph storage systems. The technical scheme is compared with existing graph storage solutions as follows:
TABLE 1: Data loading time
  The technical scheme   NEO4J-OFFLINE              NEO4J-CYPHER
  45,375 seconds         Not completed in 24 hours  Not completed in 24 hours
TABLE 2: Data storage size
  The technical scheme   NEO4J-OFFLINE   NEO4J-CYPHER
  609,375 MB             275,950 MB      1,276,175 MB
TABLE 3: Query performance
  The technical scheme   NEO4J-OFFLINE   NEO4J-CYPHER
  7.5 ms                 55.0 ms         34.1 ms
The foregoing shows and describes the basic principles and principal features of the present invention and its advantages. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the embodiments and the description only illustrate the principle of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. The object storage management method of the knowledge graph with billion-level node scale based on Ceph is characterized by comprising the following steps:
s1: the method comprises the steps of constructing a graph storage architecture, obtaining entity data of a plurality of entities corresponding to a target service, generating a knowledge graph corresponding to the target service according to the entity data, storing the knowledge graph, taking Ceph as a distributed resource storage, constructing a Ceph cluster by using a small cluster composed of a plurality of monitors by using a Client and Server architecture, and simultaneously storing graph data by using a plurality of OSD under a single Monitor small cluster;
s2, constructing an external index background, namely mapping the knowledge map data into a fixed index data structure, using an elastic search engine and a Solr retrieval engine as external index plug-ins to realize non-equivalent query, and meanwhile, combining a high-efficiency index mechanism to construct the external index background; the efficient indexing mechanism comprises a graph index and a vertex center index, wherein the graph index is a global index structure of the whole knowledge graph; the vertex center index is a local index structure established for each vertex;
s3, constructing an integrated distributed computing engine framework, constructing a distributed computing engine by using a Spark computing engine framework, converting the graph relation into a Spark operator by using a graph X library, storing the graph data on the nodes of the Ceph cluster in a distributed manner by using RDD (resource description language) by using the graph X library, and respectively and correspondingly storing a vertex set and an edge set by using the vertex RDD and the edge RDD;
s4, managing a graph storage architecture, and providing three layers of expanded line query, data write-in, data reading, cluster expansion, metadata backup, metadata snapshot, online object analysis and online analysis processing operations to realize management of graph data of the knowledge graph on the basis of the graph storage architecture, an external index background and a distributed computation engine; the three-layer wire expansion inquiring step comprises the following substeps:
S401, taking a user-given vertex set Vset as the base data of the first-layer expansion query, setting the first-layer query filter to a condition ConditionA on vertex labels and vertex attributes, and performing the first-layer vertex expansion query;
S402, taking the vertex set satisfying the first-layer filter as the base data of the second-layer expansion query, setting the second-layer query filter to a condition ConditionB on edge labels and edge attributes, and performing the second-layer edge expansion query;
and S403, taking the edge set satisfying the second-layer filter as the base data of the third-layer expansion query, setting an attribute query condition, performing the third-layer attribute expansion query, and outputting the result of the third-layer expansion query.
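The three-layer expansion query of steps S401 to S403 can be read as a filter pipeline: vertices, then incident edges, then edge attributes. The following is a minimal in-memory sketch; the data layout, function names, and the concrete ConditionA/ConditionB predicates are illustrative assumptions, not the patented implementation:

```python
# Hypothetical sketch of the three-layer expansion query:
# layer 1 filters vertices, layer 2 filters incident edges,
# layer 3 projects edge attributes.

def expand_query(graph, vset, vertex_cond, edge_cond, attr_keys):
    # Layer 1 (S401): filter the given vertex set Vset by a
    # vertex label/attribute condition (ConditionA).
    layer1 = {vid for vid in vset if vertex_cond(graph['vertices'][vid])}
    # Layer 2 (S402): expand along edges leaving the surviving vertices,
    # keeping only edges that satisfy the edge condition (ConditionB).
    layer2 = [e for e in graph['edges']
              if e['src'] in layer1 and edge_cond(e)]
    # Layer 3 (S403): project the requested attributes from those edges.
    return [{k: e['props'].get(k) for k in attr_keys} for e in layer2]

g = {'vertices': {1: {'label': 'Person', 'props': {'age': 30}},
                  2: {'label': 'City', 'props': {}}},
     'edges': [{'src': 1, 'dst': 2, 'label': 'livesIn',
                'props': {'since': 2015}}]}
result = expand_query(g, {1, 2},
                      lambda v: v['label'] == 'Person',   # ConditionA
                      lambda e: e['label'] == 'livesIn',  # ConditionB
                      ['since'])
# result == [{'since': 2015}]
```

Each layer narrows the candidate set before the next layer runs, which is what keeps multi-hop expansion tractable at billion-node scale.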
2. The object storage management method of a Ceph-based billion-node-scale knowledge graph as claimed in claim 1, wherein step S3 further comprises a partitioning operation, specifically comprising the following sub-steps:
S101, hash-partitioning the vertex RDD by vertex ID, and distributing the vertex data over the cluster in multiple partitions;
S102, partitioning the edge RDD according to a specified partition strategy, and distributing the edge data over the cluster in multiple partitions;
S103, storing, in the partitions of the vertex RDD, a routing table recording the relation between the vertices and all edge RDD partitions; when an edge RDD needs vertex data, the vertex RDD sends the vertex data to the corresponding edge RDD partitions according to the routing table.
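Sub-steps S101 to S103 can be approximated with plain dictionaries in place of Spark RDDs. The partition count, the edge-partition strategy, and all names below are assumptions made for illustration only:

```python
# Toy model of claim 2: hash-partition vertices by ID (S101), partition
# edges by a chosen strategy (S102), and build a routing table that maps
# each vertex to every edge partition that references it (S103).

NUM_PARTS = 4

def vertex_partition(vid):
    # S101: hash partitioning of the vertex RDD by vertex ID
    return hash(vid) % NUM_PARTS

def edge_partition(src, dst):
    # S102: one possible strategy - hash on both endpoints
    return (hash(src) + hash(dst)) % NUM_PARTS

def build_routing_table(edges):
    # S103: for each vertex, record every edge partition that needs it,
    # so vertex data can later be shipped only where it is referenced.
    table = {}
    for src, dst in edges:
        p = edge_partition(src, dst)
        table.setdefault(src, set()).add(p)
        table.setdefault(dst, set()).add(p)
    return table

edges = [(1, 2), (1, 3), (2, 3)]
routing = build_routing_table(edges)
# routing[1] lists exactly the edge partitions holding edges incident to 1
```

In GraphX proper, this routing table is what lets a `VertexRDD` ship attributes to `EdgeRDD` partitions during joins instead of broadcasting every vertex everywhere.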
3. The object storage management method of a Ceph-based billion-node-scale knowledge graph as claimed in claim 1, wherein the data writing step of step S4 comprises the following sub-steps:
S201, the client connects to a Monitor, obtains the cluster Map information, and requests the corresponding primary OSD data node;
S202, the primary OSD data node simultaneously writes the data to the other two replica nodes and waits for the primary node and the two replica nodes to complete the write; after the primary node and the replica nodes have written successfully, a completion signal is returned to the client and the data write is finished.
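The write path of claim 3 (primary OSD replicating to two replicas before acknowledging the client) can be simulated in a few lines. This is an illustrative toy model, not Ceph's real RADOS API:

```python
# Simulated Ceph-style replicated write: the client sends the write to the
# primary OSD, which replicates to two replicas and signals completion only
# after every copy has acknowledged (strong write consistency).

class OSD:
    """Toy stand-in for a Ceph object storage daemon."""
    def __init__(self, name):
        self.name, self.store = name, {}
    def write(self, key, value):
        self.store[key] = value   # durably store the object (simulated)
        return True               # acknowledge

def client_write(cluster_map, key, value):
    # S201: the client has already fetched cluster_map from a Monitor
    # and located the primary OSD for the object.
    primary, replicas = cluster_map['primary'], cluster_map['replicas']
    # S202: primary writes locally and forwards to both replicas, then
    # returns the completion signal only after every copy has acked.
    acks = [primary.write(key, value)]
    acks += [r.write(key, value) for r in replicas]
    return all(acks)

osds = [OSD('osd.0'), OSD('osd.1'), OSD('osd.2')]
cluster_map = {'primary': osds[0], 'replicas': osds[1:]}
ok = client_write(cluster_map, 'obj1', b'payload')
```

The design choice the claim encodes is that the acknowledgement gates on all three copies, trading write latency for the guarantee that any surviving replica can serve a consistent read.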
4. The object storage management method of a Ceph-based billion-node-scale knowledge graph as claimed in claim 1, wherein the cluster expansion step of step S4 comprises the following sub-steps:
S301, the client connects to the Monitor to obtain the cluster Map information; the new primary node OSD1 sends a request to the Monitor, and the OSD2 node takes over from OSD1 as a temporary primary node;
S302, the temporary primary node OSD2 synchronizes the full data to the new primary node OSD1, while client I/O reads and writes connect directly to the temporary primary node OSD2;
S303, the temporary primary node OSD2 receives the read/write I/O and simultaneously writes the data to the other two replica nodes; after the data on the temporary primary node OSD2 and on the two replica nodes are written successfully, a signal is returned to the client and the client I/O completes;
S304, once the data on node OSD1 are synchronized, the temporary primary node OSD2 sends a request to the Monitor to relinquish the primary role; node OSD1 becomes the primary node again, and node OSD2 becomes a replica node;
and S305, meanwhile, at the graph data level, after node expansion the graph data are partitioned according to a graph cutting mode and stored on multiple machines respectively.
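The role handover in steps S301 to S304 can be captured as a small state machine: a temporary primary serves I/O while the new primary backfills, then yields once the data are in sync. The class below is a speculative sketch of that flow under stated assumptions (dictionaries as object stores, string role labels), not Ceph's actual peering logic:

```python
# Toy model of claim 4's expansion flow: OSD2 acts as temporary primary
# while the newly added OSD1 backfills; once OSD1 holds the same data,
# OSD2 relinquishes the primary role back to OSD1.

class PGState:
    """Illustrative placement-group state during cluster expansion."""
    def __init__(self, existing_data):
        self.osd1 = {}                    # new primary, still empty
        self.osd2 = dict(existing_data)   # holds the current data
        self.acting_primary = 'osd2'      # S301: OSD2 is temporary primary
    def backfill(self):
        # S302: temporary primary streams the full data set to OSD1
        self.osd1.update(self.osd2)
    def handover(self):
        # S304: only when OSD1 is fully synchronized does OSD2 yield
        if self.osd1 == self.osd2:
            self.acting_primary = 'osd1'

pg = PGState({'obj': b'x'})
pg.handover()   # no effect: OSD1 is not yet synchronized
pg.backfill()
pg.handover()   # OSD1 resumes the primary role
```

The point of the guard in `handover` is that the primary role never moves to a node that could serve stale reads.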
5. The object storage management method of the Ceph-based billion-node-scale knowledge graph according to claim 4, wherein the graph cutting modes comprise two modes, vertex cut and edge cut; in vertex-cut mode the graph data are cut at the vertices: the cut line passes through the vertices of the graph, each edge is stored only once and appears on only one machine, and vertices with many neighbor vertices are replicated across several different machines for storage; in edge-cut mode the graph data are cut at the edges: the cut line passes only through the edges connecting the vertices, each vertex is stored only once, and the cut edges are distributed across several different machines for storage.
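The two cutting modes of claim 5 can be contrasted with a short sketch. The assignment functions below (round-robin for edges, hashing for vertices) are illustrative assumptions; only the invariants match the claim: under vertex cut every edge is stored exactly once, under edge cut every vertex is stored exactly once. GraphX itself, for reference, uses vertex cut.

```python
def vertex_cut(edges, n_machines):
    # Vertex cut: every edge is assigned to exactly one machine, so
    # high-degree vertices implicitly appear on several machines.
    placement = {m: [] for m in range(n_machines)}
    for i, edge in enumerate(edges):
        placement[i % n_machines].append(edge)  # illustrative round-robin
    return placement

def edge_cut(vertices, edges, n_machines):
    # Edge cut: every vertex is assigned to exactly one machine; an edge
    # whose endpoints land on different machines is "cut" across them.
    home = {v: hash(v) % n_machines for v in vertices}
    cut = [(u, v) for (u, v) in edges if home[u] != home[v]]
    return home, cut

edges = [('a', 'b'), ('a', 'c'), ('b', 'c')]
vc = vertex_cut(edges, 2)                        # each edge stored once
home, cut = edge_cut(['a', 'b', 'c'], edges, 2)  # each vertex stored once
```

Vertex cut tends to win on power-law graphs such as knowledge graphs, because replicating a few hub vertices is cheaper than shipping messages over many cut edges.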
6. The object storage management method of a Ceph-based billion-node-scale knowledge graph according to claim 1, wherein the metadata snapshot step of step S4 comprises: effectively recovering a previous data state from the metadata information, and restoring the program to a historical operating state of the system; saving the system data of a specific point in time, and generating a system report for that point in time; and exporting the snapshot data for offline work.
7. An object storage management system of a Ceph-based billion-node-scale knowledge graph, implementing the object storage management method of a Ceph-based billion-node-scale knowledge graph according to any one of claims 1 to 6, comprising:
a graph data storage module for storing the object data of the large-scale knowledge graph in a distributed manner and providing object storage, block device storage and file system services;
a distributed computing module for decomposing a large task into multiple subtasks through Spark RDD in-memory computing, deploying the subtasks to different machines for execution, and aggregating the results after the subtasks complete, so as to provide efficient large-scale data processing capacity supporting OLAP (online analytical processing) requirements and knowledge-graph-based data analysis for users;
an index module for mapping the knowledge data into a fixed index data structure and providing graph index, vertex-centric index and external index functions to users;
and a metadata management module for metadata backup, metadata snapshot, program recovery, point-in-time report generation and offline work of the system.
CN202010514803.4A 2020-06-08 2020-06-08 Object storage management method and system of billion-level node scale knowledge graph based on Ceph Active CN111639082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010514803.4A CN111639082B (en) 2020-06-08 2020-06-08 Object storage management method and system of billion-level node scale knowledge graph based on Ceph


Publications (2)

Publication Number Publication Date
CN111639082A CN111639082A (en) 2020-09-08
CN111639082B true CN111639082B (en) 2022-12-23

Family

ID=72329872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010514803.4A Active CN111639082B (en) 2020-06-08 2020-06-08 Object storage management method and system of billion-level node scale knowledge graph based on Ceph

Country Status (1)

Country Link
CN (1) CN111639082B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112363979B (en) * 2020-09-18 2023-08-04 杭州欧若数网科技有限公司 Distributed index method and system based on graph database
CN112199048B (en) * 2020-10-20 2021-07-27 重庆紫光华山智安科技有限公司 Data reading method, system, device and medium
CN112632293B (en) * 2020-12-24 2024-03-26 北京百度网讯科技有限公司 Industry map construction method and device, electronic equipment and storage medium
CN112637067A (en) * 2020-12-28 2021-04-09 北京明略软件系统有限公司 Graph parallel computing system and method based on analog network broadcast
CN113778990A (en) * 2021-09-01 2021-12-10 百融至信(北京)征信有限公司 Method and system for constructing distributed graph database
CN115309947B (en) * 2022-08-15 2023-03-21 北京欧拉认知智能科技有限公司 Method and system for realizing online analysis engine based on graph

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9244976B1 (en) * 2010-12-16 2016-01-26 The George Washington University and Board of Regents Just-in-time analytics on large file systems and hidden databases
CN106372127A (en) * 2016-08-24 2017-02-01 云南大学 Spark-based diversity graph sorting method for large-scale graph data
CN107092639A (en) * 2017-02-23 2017-08-25 武汉智寻天下科技有限公司 A kind of search engine system
CN107247738A (en) * 2017-05-10 2017-10-13 浙江大学 A kind of extensive knowledge mapping semantic query method based on spark
CN107330125A (en) * 2017-07-20 2017-11-07 云南电网有限责任公司电力科学研究院 The unstructured distribution data integrated approach of magnanimity of knowledge based graphical spectrum technology
CN107329982A (en) * 2017-06-01 2017-11-07 华南理工大学 A kind of big data parallel calculating method stored based on distributed column and system
CN109582660A (en) * 2018-12-06 2019-04-05 深圳前海微众银行股份有限公司 Data consanguinity analysis method, apparatus, equipment, system and readable storage medium storing program for executing
CN109657072A (en) * 2018-12-13 2019-04-19 北京百分点信息科技有限公司 A kind of intelligent search WEB system and method applied to government's aid decision
CN110263225A (en) * 2019-05-07 2019-09-20 南京智慧图谱信息技术有限公司 Data load, the management, searching system of a kind of hundred billion grades of knowledge picture libraries
CN110377757A (en) * 2019-07-16 2019-10-25 北京海致星图科技有限公司 A kind of real time knowledge map construction system
CN110659292A (en) * 2019-09-21 2020-01-07 北京海致星图科技有限公司 Spark and Ignite-based distributed real-time graph construction and query method and system
CN110795417A (en) * 2019-10-30 2020-02-14 北京明略软件系统有限公司 System and method for storing knowledge graph
CN110888888A (en) * 2019-12-11 2020-03-17 北京明略软件系统有限公司 Personnel relationship analysis method and device, electronic equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Big data and stream processing platforms for Industry 4.0 requirements mapping for a predictive maintenance use case; Radhya Sahal et al.; Journal of Manufacturing Systems; Jan. 2020; vol. 54; pp. 138-151 *
Optimization and Implementation of Clustering Algorithms Based on Spark; Zhao Yuming et al.; Modern Electronics Technique; Apr. 15, 2020; vol. 43, no. 8; pp. 52-55, 59 *
A Survey of Knowledge Graph Data Management; Wang Xin et al.; Journal of Software; Apr. 19, 2019; vol. 30, no. 7; pp. 2139-2174 *

Also Published As

Publication number Publication date
CN111639082A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN111639082B (en) Object storage management method and system of billion-level node scale knowledge graph based on Ceph
CN103116596B (en) System and method of performing snapshot isolation in distributed databases
JP4586019B2 (en) Parallel recovery with non-failing nodes
Ju et al. iGraph: an incremental data processing system for dynamic graph
CN107180113B (en) Big data retrieval platform
US20150032758A1 (en) High Performance Index Creation
CN111639114A (en) Distributed data fusion management system based on Internet of things platform
CN103793442A (en) Spatial data processing method and system
CN116662441A (en) Distributed data blood margin construction and display method
CN111930716A (en) Database capacity expansion method, device and system
CN117677943A (en) Data consistency mechanism for hybrid data processing
CN108228725A (en) GIS application systems based on distributed data base
CN111708894A (en) Knowledge graph creating method
CN114329096A (en) Method and system for processing native map database
Lwin et al. Non-redundant dynamic fragment allocation with horizontal partition in Distributed Database System
CN111177244A (en) Data association analysis method for multiple heterogeneous databases
CN111708895B (en) Knowledge graph system construction method and device
Yang From Google file system to omega: a decade of advancement in big data management at Google
CN114925075B (en) Real-time dynamic fusion method for multi-source time-space monitoring information
WO2010150750A1 (en) Database management device using key-value store with attributes, and key-value-store structure caching-device therefor
CN114443798A (en) Distributed management system and method for geographic information data
Saxena et al. Concepts of HBase archetypes in big data engineering
Vilaça et al. On the expressiveness and trade-offs of large scale tuple stores
CN112416944A (en) Method and equipment for synchronizing service data
CN113391916A (en) Organization architecture data processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant