CN106354729B - Graph data processing method, device and system - Google Patents

Graph data processing method, device and system

Info

Publication number
CN106354729B
CN106354729B (application CN201510419390.0A)
Authority
CN
China
Prior art keywords
task
graph
data
analysis
updating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510419390.0A
Other languages
Chinese (zh)
Other versions
CN106354729A (en)
Inventor
葛朋旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201510419390.0A priority Critical patent/CN106354729B/en
Publication of CN106354729A publication Critical patent/CN106354729A/en
Application granted granted Critical
Publication of CN106354729B publication Critical patent/CN106354729B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • G06F16/434Query formulation using image data, e.g. images, photos, pictures taken by a user

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a graph data processing method, device, and system. The method includes: writing each pending request into a graph update task queue or a graph analysis task queue according to its type, where the request types include graph update requests and graph analysis requests; determining the running order of the tasks according to a first characteristic of each task in the graph update task queue and the graph analysis task queue; and running the tasks in that order. With this scheme, graph update requests and graph analysis requests can be received separately, placed into the graph update task queue and the graph analysis task queue respectively, and scheduled into a running order, so that a single system can handle both graph updating and graph analysis, resolving the current separation of the two application scenarios.

Description

Graph data processing method, device and system
Technical Field
The present application relates to the field of graph processing technologies, and in particular, to a graph data processing method, device, and system.
Background
At present, the industry offers many products and solutions for graph computation, but most of them are limited to analyzing static graph data or to updating and processing individual graph records; a solution for real-time updating and real-time analysis of a complete graph is lacking.
In the traditional database field, OLTP (On-Line Transaction Processing) and OLAP (On-Line Analytical Processing) are usually separated. Because data accumulate slowly and analyzing them requires substantial resources, analysis results are typically delayed by a day or more. Moreover, under the design philosophy of relational databases, the relational data model only attends to dependencies among data at the analysis stage.
In the graph data field, data are processed with a dependency model; that is, the data naturally carry strong relationships. A graph G = (V, E) contains two basic elements, vertices (Vertex) and edges (Edge), where vertices are physically linked together by edges that represent relationships. For real-time updating and real-time analysis of graph data, a business scenario requires that a data update immediately affect the dependency relationships, and that the resulting effect immediately trigger the corresponding business analysis operations; this gives rise to the business requirement of real-time graph updating and real-time graph analysis.
To this end, real-time graph updating requires that when a relationship occurs, the related entities (vertices) in the natural graph domain complete the update of that relationship in time; the graph analysis tasks required by the business scenario can then be completed quickly.
Most graph database systems, graph stores, and graph computation frameworks in the industry today borrow the MapReduce or BSP computing model to build non-real-time graph analysis systems on distributed file systems such as GFS or HDFS. The supported application scenarios are narrow: the data depend on a daily full snapshot, multiple jobs (Job) are launched for parallel analysis, and the analysis results are often delayed by an hour or more.
In addition, some graph database systems merely transplant the design concepts of conventional relational databases and support only graph characteristics. Although such systems can support graph updating relatively quickly, similar to OLTP, they provide no corresponding support for OLAP-style characteristics, and it is even harder for them to offer a solution that fuses the two kinds of application characteristics.
The prior art has the following disadvantages:
Relational databases have evolved over many years and are well established for many application scenarios, but their OLTP and OLAP application scenarios tend to be separate, so many technical frameworks cannot directly combine the two. In recent years, new graph databases driven by the rise of NoSQL (non-relational databases) have attempted to break this situation, but because the field is young, no mature technical framework is yet fully compatible with both application scenarios.
Disclosure of Invention
The embodiments of the application provide a method, a device, and a system for processing graph data, which solve the technical problem that graph updating and graph analysis cannot be handled compatibly by a single system.
In one aspect, an embodiment of the present application provides a graph data processing method, including:
writing the request to be processed into a graph updating task queue or a graph analyzing task queue according to the type of the request to be processed, wherein the type of the request to be processed comprises a graph updating request and a graph analyzing request;
determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue;
and running the tasks according to the running sequence.
In another aspect, an embodiment of the present application provides a graph data processing apparatus, including:
the updating task queue is used for writing a graph updating task;
a graph analysis task queue for writing a graph analysis task;
and the scheduler is used for determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue and distributing the current task to be run to the corresponding computing resource for running.
In another aspect, an embodiment of the present application provides a graph data processing system, including:
the service interface layer comprises an updating interface and an analysis interface, wherein the updating interface is used for receiving a data updating task and writing the data updating task into an updating task queue; the analysis interface is used for receiving data analysis tasks and writing the data analysis tasks into an analysis task queue;
the task scheduling layer comprises a graph updating task queue, a graph analyzing task queue, a scheduler, a graph computing engine and a graph storing engine, wherein:
the graph updating task queue is used for writing a graph updating task;
a graph analysis task queue for writing a graph analysis task;
the scheduler is used for determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue and distributing the current task to be run to the corresponding computing resource for running;
the graph calculation engine is used for carrying out a graph updating operation and/or a graph analysis operation of the task;
and the graph storage engine is used for storing the graph.
The beneficial effects are as follows:
the embodiment of the application provides a graph data processing method, a device and a system, which can respectively receive a graph updating request and a graph analysis request, place the graph updating request and the graph analysis request into a graph updating task queue and a graph analysis task queue, manage each task, and determine an operation sequence for the tasks, so that a set of system compatible graph updating and graph analysis processing can be utilized, the problem that application scenes of the existing graph updating and graph analysis processing are separated is solved, the graph analysis can not have data dependence on daily full data any more, and the analysis content is often in the condition of time delay above an hour level.
Drawings
Specific embodiments of the present application will be described below with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart diagram illustrating a data processing method in an embodiment of the present application;
FIG. 2 is a flow chart illustrating a graph data processing method according to a first embodiment;
FIG. 3 is a flowchart showing a graph data processing method according to the second embodiment;
FIG. 4 is an exploded diagram of the internal abstract implementation of two interfaces in the second embodiment;
FIG. 5 is a flowchart showing graph storage in the third embodiment;
FIG. 6 is a schematic structural diagram of the data processing apparatus in the embodiment of the present application;
FIG. 7 is a schematic structural diagram of a graph data processing apparatus according to an example in the embodiment of the present application;
FIG. 8 is a schematic structural diagram of a graph data processing apparatus according to an example in the embodiment of the present application;
FIG. 9 is a block diagram of the data processing system of the embodiment of the present application;
FIG. 10 is a block diagram of a graph data processing system according to an example in an embodiment of the present application.
Detailed Description
To make the technical solutions and advantages of the present application clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. The described embodiments are only a part of the embodiments of the present application, not an exhaustive list. The embodiments and their features may be combined with each other provided there is no conflict.
The inventor finds that graph-related businesses currently have large data storage and computation demands: for example, an online shopping platform's transactions, purchases, and transfers exceed tens of thousands of records per second and billions of records per day. The data are written and updated very frequently in real time, and once written, must be reflected quickly in the graph data model. Graph-based business scenarios such as transaction risk identification and precise recommendation require that full analysis computations run quickly over incremental graph data and output results, that graph updates be written quickly into the graph storage engine so that graph analysis covers the latest data as much as possible, and that only second-level delay be tolerated between the data snapshot an analysis depends on and the latest data update. Based on these practical requirements, the embodiments of the present application provide a graph data processing method, apparatus, and system, described below.
Graph updating means that an external service application sends an instruction to update vertex attributes in the graph, add a new vertex, establish a new edge from vertex A to vertex B, modify the attributes of an edge, and so on.
Graph analysis refers to analyzing and computing specific subgraphs or the full graph under an analysis instruction from the business; the analysis process consists of read-only query operations that traverse, count, and filter the graph and the attributes of specific vertices and edges.
Fig. 1 illustrates a graph data processing method in an embodiment of the present application, and as shown in the figure, includes:
step 101, writing the request to be processed into a graph updating task queue or a graph analyzing task queue according to the type of the request to be processed, wherein the type of the request to be processed comprises a graph updating request and a graph analyzing request;
102, determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue;
and 103, running each task according to the running sequence.
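The three steps above can be sketched as a small dispatcher. The class, field, and request-key names are illustrative assumptions, not from the patent, and a timestamp is used as the first characteristic for ordering.

```python
from collections import deque

class GraphTaskDispatcher:
    """Illustrative sketch of steps 101-103: route requests into two queues,
    order the tasks by a first characteristic, and run them in that order."""

    def __init__(self):
        self.update_queue = deque()    # graph update task queue
        self.analysis_queue = deque()  # graph analysis task queue

    def submit(self, request):
        # Step 101: write the pending request into the matching queue by type.
        if request["type"] == "update":
            self.update_queue.append(request)
        elif request["type"] == "analysis":
            self.analysis_queue.append(request)
        else:
            raise ValueError("unknown request type: %r" % request["type"])

    def run_all(self):
        # Step 102: order tasks from both queues by a first characteristic
        # (here: timestamp, so the request that arrived first runs first).
        tasks = sorted(list(self.update_queue) + list(self.analysis_queue),
                       key=lambda t: t["timestamp"])
        # Step 103: run each task in that order (here: just report the order).
        return [t["name"] for t in tasks]
```

A later request with an earlier timestamp still runs first, since ordering is decided across both queues, not per queue.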
Beneficial effects: the graph data processing method in the embodiment of the application receives graph update requests and graph analysis requests separately, places them into a graph update task queue and a graph analysis task queue, manages the tasks, and determines their running order. A single system can thereby handle both graph updating and graph analysis, resolving the current separation of the two application scenarios; graph analysis no longer depends on daily full data, and analysis results are no longer delayed by an hour or more.
Further, in order to improve the processing efficiency, the following embodiment may be performed.
In implementation, after the running order of the tasks is determined, whether the first-in-order task is the current task to be run is determined according to the state of a read-write lock;
the state of the read-write lock is set to occupied when a task is running, and to unoccupied when a task finishes or is suspended.
Determining whether the first-in-order task is the current task to be run according to the state of the read-write lock may include any one or a combination of the following:
when the read-write lock is unoccupied, determining that the first-in-order task is the current task to be run;
when the read-write lock is occupied, if the currently running task is a read-only graph analysis task, determining that the first-in-order task is the current task to be run;
when the read-write lock is occupied, if the currently running task is a graph analysis task that is not read-only, or a graph update task, suspending the first-in-order task and re-checking the state of the read-write lock in the next cycle.
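The lock rules just listed can be condensed into one decision function. The function name and the task-kind labels are illustrative assumptions.

```python
def may_run_first_task(lock_occupied, running_task_kind):
    """Decide whether the first-in-order task may run now, following the
    read-write-lock rules above. running_task_kind is None when nothing is
    running, else one of 'read_only_analysis', 'analysis', 'update'
    (illustrative labels)."""
    if not lock_occupied:
        return True   # lock unoccupied: first-in-order task runs
    if running_task_kind == "read_only_analysis":
        return True   # read-only analysis does not conflict: run in parallel
    return False      # otherwise suspend and re-check next cycle
```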
Beneficial effects:
In this implementation, a read-write lock is added, and whether the first-in-order task is the current task to be run is determined from the lock's state. When the read-write lock is occupied but the currently running task is a read-only graph analysis task, the first-in-order task is still allowed to run, so such tasks can proceed in parallel and processing efficiency is improved.
In addition, after the running order of the tasks is determined, it may be judged whether the first-in-order task is a graph analysis task whose time and/or resource consumption exceeds a set threshold. If so, the first-in-order task is split into several sub-tasks that are run at intervals, and after all of them finish, their graph analysis results are merged to complete the first-in-order task.
Alternatively, after a task starts, whether it has timed out may be monitored; if so, the task is suspended and restarted in the next cycle.
Has the advantages that: by the two modes, the graph analysis task is split, whether the task runs overtime or not is monitored, if overtime is waiting, one task can be prevented from occupying too long time and/or too many resources, so that the task can be more reasonably carried out, and particularly, when the graph analysis task consumes more time and/or resources, the graph updating task can be more effectively carried out.
Further, after the graph updating task is executed, the memory mapping object can be stored in the cache region and the disk at the same time;
when a graph analysis task is executed, data are obtained from a cache region;
and if the data related to the graph analysis task is not in the cache region, acquiring the data from the disk.
Because disk data are cold data, reading from the cache region is more efficient; and because the cache region holds the most recently updated data, it reflects the latest state of the graph, so application scenarios that use the cached data gain considerable efficiency.
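The cache-first read path described above can be sketched as follows; the in-memory dicts standing in for the cache region and the disk, and all names, are illustrative simplifications.

```python
class GraphStore:
    """Illustrative cache-first store: graph updates are written to the cache
    region and to disk simultaneously; graph analysis reads the cache first
    and falls back to disk only for cold data."""

    def __init__(self):
        self.cache = {}  # cache region: hot, recently updated objects
        self.disk = {}   # disk: full persisted graph data (cold)

    def apply_update(self, key, value):
        # After a graph update task, store the memory-mapped object in the
        # cache region AND on disk at the same time.
        self.cache[key] = value
        self.disk[key] = value

    def read(self, key):
        # A graph analysis task obtains data from the cache region first...
        if key in self.cache:
            return self.cache[key]
        # ...and only falls back to disk when the data are not cached.
        return self.disk.get(key)
```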
Further, the graph storage may include:
determining the graph to be a sparse graph or a dense graph according to the data characteristics of the graph;
determining whether the graph is dominant at the vertex or dominant at the edge according to the calculation characteristics of the graph;
and determining a graph segmentation algorithm according to the data characteristics of the graph and the calculation characteristics, and segmenting and storing the data of the graph.
Because the segmentation algorithm of the graph is determined according to the data characteristics and the calculation characteristics of the graph, the adopted segmentation algorithm of the graph can be more reasonable, the reasonability of data storage is enhanced, and the whole scheme is more efficient.
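One way the algorithm choice could be wired up, under the assumption that sparsity is judged by average degree; the concrete mapping from characteristics to algorithms is an illustrative guess, since the patent does not fix it here.

```python
def choose_partitioning(avg_degree, vertex_dominant, dense_threshold=50):
    """Pick a graph-partitioning strategy from the graph's data characteristic
    (sparse vs. dense, approximated by average degree) and its computation
    characteristic (vertex-dominant vs. edge-dominant). The mapping below is
    illustrative, not the patent's actual rule."""
    dense = avg_degree >= dense_threshold
    if dense and not vertex_dominant:
        return "vertex-cut"        # dense, edge-dominant: cut vertices
    if not dense and vertex_dominant:
        return "edge-cut"          # sparse, vertex-dominant: cut edges
    return "greedy-vertex-cut"     # mixed cases: placement-aware cut
```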
To facilitate the practice of the present application, the following description is given by way of example.
The first embodiment is as follows:
as shown in fig. 2, the graph data processing method in the first embodiment includes:
step 201, monitoring whether a graph updating request or a graph analyzing request is received, if so, performing step 202; otherwise, returning to the step 201;
generally, when the system is started and the system initialization is completed, monitoring can be started to determine whether a graph update request or a graph analysis request is received, and the specific monitoring starting time is not limited in this step.
In the method of the embodiment, only the received map updating request or the received map analyzing request is subjected to subsequent processing.
Step 202, writing the request to be processed into a graph updating task queue or a graph analyzing task queue according to the type of the request to be processed;
that is, the graph update request is written to the graph update task queue, and the graph analysis request is written to the graph analysis task queue.
Step 203, determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue;
the first characteristic comprises any one or a combination of: timestamp, timeliness, priority, data dependency characteristics. For example, the running order of each task can be determined according to the time stamp of each task in two task queues independently, that is, the task which enters the two queues first is processed first; the running sequence of each task can be determined by integrating the timeliness, the priority and the data dependence characteristics of each task, and the specific first characteristic can be determined according to actual needs.
After the running order is determined in this step, it may further be judged whether the first-in-order task is a graph analysis task whose time and/or resource consumption exceeds a set threshold. If so, the first-in-order task is split into several sub-tasks that are run at intervals, and after they all finish, their graph analysis results are merged to complete the first-in-order task. This avoids a single task occupying too much time and/or too many resources, makes task execution more reasonable, and ensures that graph update tasks run more efficiently, especially when graph analysis tasks are time- and/or resource-consuming.
Step 204: determine, according to the state of the read-write lock, whether the first-in-order task is the current task to be run; if so, go to step 205; otherwise, suspend the first-in-order task, wait for the next cycle, and return to step 204.
the read-write lock is adopted according to the situation that the graph updating task and the graph analyzing task are processed in the same system, the read-write lock is introduced for ensuring the final consistency of the affairs of the updating task and the analyzing task, after the read-write lock is introduced, tasks which are not affected mutually, such as a pure read analyzing task and the updating task, can be completed in parallel, so that the efficiency is improved, during specific implementation, the read-write lock can be omitted, under the situation, the parallel task is not allowed for ensuring the final consistency of the affairs of the updating task and the analyzing task, and the next task is performed only after the task is determined to be completed. The state of the read-write lock is modified to be occupied when the task runs and is modified to be unoccupied when the task runs or is suspended. This step may be understood as determining whether the first cis task is allowed to run according to the status of the read/write lock.
The specific operation of this step may include any one or a combination of the following:
when the read-write lock is unoccupied, determining that the first-in-order task is the current task to be run;
when the read-write lock is occupied, if the currently running task is a read-only graph analysis task, determining that the first-in-order task is the current task to be run;
when the read-write lock is occupied, if the currently running task is a graph analysis task that is not read-only, or a graph update task, suspending the first-in-order task and re-checking the state of the read-write lock in the next cycle.
When the read-write lock is occupied but the currently running task is a read-only graph analysis task, the first-in-order task is still determined to be the current task to be run, so other tasks can run in parallel with a read-only graph analysis task and processing efficiency is improved.
In a practical implementation this behavior may be omitted: whenever the read-write lock is occupied, the first-in-order task is suspended and the lock state is checked again in the next cycle; with that scheme, parallel tasks are not allowed.
And step 205, distributing the current task to be run to the corresponding computing resource to run according to the distributed partition information of the data related to the current task to be run.
Graph storage and computation in the industry fall into single-node and distributed modes. In the single-node mode, the entire graph is stored on a single computer and graph computation is concentrated on a single computing node. The distributed mode targets graphs that are too large to fit physically on one machine: the graph is stored in a distributed manner across multiple machines, and graph computation is likewise distributed and executed in parallel on those machines. This embodiment uses the distributed mode as its example, so the current task to be run must be allocated to the corresponding computing resources according to the distributed partition information of the data involved in that task.
After a task starts, whether it has timed out may be monitored; if so, the task is suspended and restarted in the next cycle. This avoids a single task occupying too much time and/or too many resources, makes task execution more reasonable, and ensures that graph update tasks run more efficiently, especially when graph analysis tasks are time- and/or resource-consuming.
Example two:
the graph data processing method in the second embodiment, as shown in fig. 3, includes:
step 301, monitoring whether a graph updating request or a graph analyzing request is received, and if so, performing step 302; otherwise, returning to the step 301;
step 302, writing the request to be processed into a graph updating task queue or a graph analyzing task queue according to the type of the request to be processed;
step 303, determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue;
Step 304: determine, according to the state of the read-write lock, whether the first-in-order task is the current task to be run; if so, go to step 305; otherwise, suspend the first-in-order task, wait for the next cycle, and return to step 304.
Step 305: judge whether the first-in-order task is a graph update task; if so, go to step 306; otherwise, go to step 307.
Since this flow processes only graph update requests and graph analysis requests, determining in this step that the task is not a graph update indicates that it is a graph analysis task.
Step 306, running a graph updating task, and storing the memory mapping object to a cache region and a disk at the same time;
when the method is implemented, when the memory mapping object is stored in the cache region, the Delta (increment) updating object can be directly stored, and the memory mapping object can also be processed and divided into the Delta updating object and the hot object. Specific hot objects can be generated according to existing rules, for example, Delta update objects with operation times exceeding 100 times within one hour are considered as hot objects, specific generation of the hot objects is not specifically limited in the application, and after the hot objects are generated, graph analysis tasks can be performed on the hot objects.
Step 307: acquire data from the cache region and run the graph analysis task;
and step 308, if the data related to the graph analysis task is not in the cache region, acquiring the data from the disk.
Specifically, the graph computation engine in the graph database system includes two types of characteristic computation operations, update operations and analysis operations. Both draw on the asynchronous shared-memory model of BSP parallel tasks and, combined with the characteristics of operating on graph-structured data, abstract the following common computing interfaces:
graph update interface definition: UpdateResult updateGraph (GraphData)
Graph analysis interface definition: StatsResult statsGraph (Statsparam)
The internal abstraction implementation decomposition for the two interfaces is shown in fig. 4:
a) the internal steps of updateGraph are as follows:
a1) Query the vertex: gatherReadyUpdateVertex()
a2) Update the vertex information: applyUpdateGraph()
a3) Communicate the vertex update information to each adjacent vertex: scatterUpdateVertexs()
a4) Summarize the state of the updated vertex into the update-success queue of the buffer area; this step is handled by an asynchronous message mechanism and does not add to the processing time of the previous steps.
b) The internal steps of statsGraph are as follows:
b1) Collect the source vertex information to analyze: gatherReadyStatsSourceVertex()
This step may collect the source vertex information to be analyzed from the cached update-success queue.
b2) Perform the analysis task: applyStatsGraph()
b3) Merge the analysis/statistics task results: summaryStatsSourceVertexs()
The graph computation framework in this embodiment is divided into the following stages:
Gather (collect), Apply (apply), Scatter (scatter), Summary (summarize)
This embodiment adds the Summary step. For an update task, it collects the results of the update operation and returns them to the update queue context; for an analysis task, it summarizes the analysis results. The summarized information is written as standard rule data into the Context of the computing framework for use by the framework internally.
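The update pass can be sketched as below; only the step names follow the interface decomposition above, while the graph representation and method bodies are illustrative assumptions.

```python
class MiniUpdateEngine:
    """Illustrative Gather -> Apply -> Scatter -> Summary pass for an
    updateGraph call. graph maps vertex id -> {"value": ..., "neighbors": [ids]}."""

    def __init__(self, graph):
        self.graph = graph
        self.update_success_queue = []  # buffer-area queue fed by Summary (a4)

    def update_graph(self, vertex_id, new_value):
        vertex = self.graph[vertex_id]            # a1) gather the ready vertex
        vertex["value"] = new_value               # a2) apply the update
        for neighbor in vertex["neighbors"]:      # a3) scatter to adjacent
            self.graph[neighbor]["dirty"] = True  #     vertices (mark affected)
        # a4) Summary: record the successfully updated vertex so a later
        # analysis task can pick it up from this queue.
        self.update_success_queue.append(vertex_id)
        return vertex_id
```

A statsGraph-style analysis task would then start from `update_success_queue`, which mirrors how gatherReadyStatsSourceVertex reads from the cached update-success queue.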
One characteristic point of this embodiment: through the summarize-update-result operation summaryUpdateResult in the graph update interface updateGraph, the data of graph vertices updated in real time can, after the task completes and according to the rules, be written automatically into the updated-subgraph cache queue, and the corresponding analysis task automatically obtains the data from that cache queue, namely within gatherReadyStatsSourceVertex of the statsGraph interface.
EXAMPLE III
The graph storage in this application adopts distributed graph storage engine, and distributed graph storage engine is responsible for solving two main kinds of problems as supporting the bottom layer support engine that high-efficient picture was updated, the graph analysis:
First: effectively storing dense graphs and sparse graphs in a distributed manner, the core of which is graph partitioning (partitions);
for real world graph structures, there are roughly two categories; the first type: vertices of the graph (Vertex) have a small number of adjoining edges (Edge), i.e. sparse graph; the second type: a small number of vertices (Vertex) have a large number of contiguous edges, i.e. local dense graphs (referred to as dense graphs in this application).
Second: providing a simple and uniform access API (Application Programming Interface) for the upper-layer graph computation engine to call.
Regarding the first category of problems, there are three basic reference segmentation algorithms in the field of graph partitioning:
A1) Balanced edge cutting: a Hash is computed from the ID of each vertex, the vertex is uniquely assigned to a machine according to the number of machines, and vertices are then stored redundantly on different machines according to their edges. To maintain the efficiency of graph computation, this algorithm needs to redundantly store adjacent vertex and edge information on different machines, so any update to an edge or a vertex involves a relatively large number of network transport interactions.
A2) Balanced vertex cutting: a Hash is computed from the ID of each edge, the edge is uniquely assigned to a machine, and the vertices connected to the edge are stored redundantly on different machines. Due to the uniqueness of the edges, only updates to vertices involve more network transport interactions.
A3) Greedy vertex cutting: based on the A2 algorithm, for the two vertices V(a) and V(b) connected by any edge E, the sets of machines already storing the corresponding vertices are considered; for example, vertex a is assigned the machine set M(a) and vertex b the machine set M(b). The following conditions are then evaluated to determine the edge assignment rule:
If M(a) intersects M(b), E is assigned to a machine in the intersection.
If M(a) and M(b) do not intersect but both are non-empty (so their union is not empty), E is assigned to the machine in the union of M(a) and M(b) that has the fewest assigned edges.
If M(a) has been allocated but M(b) has not, E is allocated on a machine in M(a), and vice versa.
If neither M(a) nor M(b) has been allocated, E is assigned to the machine with the least load.
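The four greedy placement rules above can be sketched as a single assignment function. The bookkeeping structures (`placement`, `edge_load`, `machine_load`) are illustrative assumptions; only the rules themselves come from the algorithm A3 description:

```python
def assign_edge(edge, placement, edge_load, machine_load):
    """Pick a machine for edge = (a, b) under the greedy vertex-cut rules.
    placement:    vertex -> set of machines already holding it
    edge_load:    machine -> number of edges assigned so far
    machine_load: machine -> overall load (used when neither vertex is placed)
    """
    a, b = edge
    ma = placement.get(a, set())
    mb = placement.get(b, set())

    if ma & mb:                                   # rule 1: intersection exists
        target = min(ma & mb, key=lambda m: edge_load[m])
    elif ma and mb:                               # rule 2: disjoint, both placed
        target = min(ma | mb, key=lambda m: edge_load[m])
    elif ma or mb:                                # rule 3: only one side placed
        target = min(ma or mb, key=lambda m: edge_load[m])
    else:                                         # rule 4: neither placed
        target = min(machine_load, key=lambda m: machine_load[m])

    # the edge's endpoints become (redundantly) resident on the target machine
    placement.setdefault(a, set()).add(target)
    placement.setdefault(b, set()).add(target)
    edge_load[target] += 1
    return target

placement, edge_load, machine_load = {}, {0: 0, 1: 0}, {0: 0, 1: 0}
m = assign_edge(("a", "b"), placement, edge_load, machine_load)
m2 = assign_edge(("a", "c"), placement, edge_load, machine_load)
```

Note how the second call reuses the machine that already holds vertex "a" (rule 3), which is exactly the locality that makes A3 cheaper for subsequent computation.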
For algorithm A3, the design relatively favors careful placement when storing edges; because graph algorithms are relatively heavy, the storage part of the graph consumes some performance, but the corresponding performance of the subsequent graph computation part is significantly improved.
In this embodiment, for the optimization of graph storage in the present application, the above segmentation algorithms are combined according to the data characteristics and computation characteristics of the graph. Specifically, as shown in fig. 5, the method includes the following steps:
step 501, determining whether the graph is a sparse graph or a dense graph according to the data characteristics of the graph;
step 502, determining whether the graph is vertex-dominant or edge-dominant according to the computation characteristics of the graph;
and step 503, determining a graph segmentation algorithm according to the data characteristics and the computation characteristics of the graph, and storing the data of the graph in segments.
The specific algorithm may be edge cutting, vertex cutting, optimized vertex cutting, or the like.
Because the segmentation algorithm of the graph is determined according to the data characteristics and the computation characteristics of the graph, the adopted segmentation algorithm is more reasonable, the rationality of data storage is enhanced, and the whole scheme is more efficient.
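Steps 501-503 amount to a small decision function mapping the graph's characteristics to one of the algorithms A1-A3. The density threshold, boolean inputs, and returned labels below are illustrative assumptions; the patent does not specify the exact mapping:

```python
def choose_partitioning(avg_degree, vertex_dominant, density_threshold=10):
    """Map the data characteristic (sparse vs. dense) and the computation
    characteristic (vertex- vs. edge-dominant) to a segmentation algorithm."""
    dense = avg_degree >= density_threshold
    if not dense and vertex_dominant:
        return "balanced-edge-cut"        # A1: hash on vertex IDs
    if dense and not vertex_dominant:
        return "balanced-vertex-cut"      # A2: hash on edge IDs
    return "greedy-vertex-cut"            # A3: placement-aware assignment
```

For instance, a sparse graph whose workload mostly touches vertices would be hashed by vertex (A1), while a dense, edge-heavy graph would be hashed by edge (A2).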
Regarding the second kind of problem, the API of graph storage: according to the business scenarios of graph update and graph analysis, the corresponding interface APIs are uniformly packaged as follows:
Create vertex: createVertex(Key)
Create edge: createEdge(Key, sourceVertex, targetVertex)
Update vertex: Result updateVertex(Vertex, Property)
Update edge: Result updateEdge(Vertex, Property)
Find vertex: FindVertex(Key)
Find the edges of a vertex: FindEdgeOfVertex(Key)
Query the edges of a specified label: FindEdgeByLabel(Label)
Bulk create vertices: BulkCreateVertexs(List<Key>)
Bulk create edges: BulkCreateEdges(Key, List<sourceVertex>, List<targetVertex>)
Find adjacent vertices: FindAdjacentVertxs(Vertex)
Find adjacent edges: FindAdjacentedges(Edge)
Delete vertex: Boolean dropVertex(Vertex)
Delete edge: Boolean dropEdge(Edge)
Therefore, a simple and uniform access API is provided for the upper-layer graph calculation engine to call conveniently.
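A toy in-memory realization of part of the uniform API above can make the call surface concrete. The class name, snake_case method names, and the dict-based data layout are illustrative assumptions; only the set of operations (create, update, find, adjacency, drop) follows the list above:

```python
class InMemoryGraphStore:
    """Toy stand-in for the storage engine behind the uniform access API."""
    def __init__(self):
        self.vertices = {}    # key -> property dict
        self.adjacency = {}   # key -> set of neighbour keys
        self.edges = {}       # edge key -> (source, target)

    def create_vertex(self, key):
        self.vertices.setdefault(key, {})
        self.adjacency.setdefault(key, set())
        return key

    def create_edge(self, key, source, target):
        self.edges[key] = (source, target)
        self.adjacency.setdefault(source, set()).add(target)
        self.adjacency.setdefault(target, set()).add(source)
        return key

    def update_vertex(self, key, prop):
        if key not in self.vertices:
            return False                 # Result: failure
        self.vertices[key].update(prop)
        return True                      # Result: success

    def find_vertex(self, key):
        return self.vertices.get(key)

    def find_adjacent_vertexs(self, key):
        return sorted(self.adjacency.get(key, ()))

    def drop_vertex(self, key):          # Boolean, as in the API list
        if key not in self.vertices:
            return False
        del self.vertices[key]
        self.adjacency.pop(key, None)
        return True

g = InMemoryGraphStore()
g.create_vertex("a"); g.create_vertex("b")
g.create_edge("e1", "a", "b")
```

An upper-layer graph computation engine would program against this one surface regardless of how the data is physically partitioned underneath.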
Based on the same inventive concept, an embodiment of the present application further provides a graph data processing apparatus. Since the principle by which this apparatus solves the problem is similar to that of the graph data processing method, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 6, the apparatus may include:
a graph update task queue 601 for writing a graph update task;
a graph analysis task queue 602 for writing graph analysis tasks;
the scheduler 603 is configured to determine a running order of each task according to the first characteristic of each task in the graph update task queue and the graph analysis task queue, and allocate the current task to be run to the corresponding computing resource for running.
In a specific implementation, the graph update task queue 601 may be responsible for maintaining timeliness and power law control of update tasks; the graph analysis task queue 602 may maintain the priority, timeliness, and failure retry characteristics of the analysis tasks.
Further, when applied to a distributed system, as shown in fig. 7, the apparatus includes a partition identifier 701 for providing the scheduler 603 with the distributed partition information of the data related to the task currently to be run. When the system is not distributed, the partition identifier need not be included.
Further, as shown in fig. 8, the apparatus may further include a read-write lock module 801, configured to store the state of a read-write lock, where the state of the read-write lock is modified to occupied when a task runs and modified to unoccupied when the task finishes running or is suspended;
after determining the running order of the tasks, the scheduler 603 determines whether the first in-order task is the current task to be run according to the state of the read-write lock;
further, the scheduler 603 determining whether the first in-order task is the current task to be run according to the state of the read-write lock includes any one or a combination of the following:
when the read-write lock is in the unoccupied state, determining that the first in-order task is the current task to be run;
when the read-write lock is in the occupied state, if the currently running task is a pure-read graph analysis task, determining that the first in-order task is the current task to be run;
and when the read-write lock is in the occupied state, if the currently running task is a non-pure-read graph analysis task or a graph update task, suspending the first in-order task and judging the state of the read-write lock again in the next period.
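The three lock rules above reduce to a small decision function. This is a sketch of the scheduling logic only; the function name and boolean inputs are illustrative assumptions:

```python
def can_run_next(lock_occupied, running_task_is_pure_read):
    """Decide whether the first in-order task may run now (True)
    or must be suspended until the next period (False)."""
    if not lock_occupied:
        return True              # lock free: run immediately
    if running_task_is_pure_read:
        return True              # a pure-read analysis task does not conflict
    return False                 # a writer holds the lock: suspend and retry
```

The effect is that pure-read analysis tasks can overlap, while any task that mutates the graph serializes behind the write side of the lock.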
Further, after determining the running order of the tasks, the scheduler 603 may also determine whether the first in-order task is a graph analysis task whose time and/or resource consumption is greater than a set threshold; if so, the scheduler splits the first in-order task into a plurality of tasks, runs them at intervals, and after they finish running, merges the graph analysis results to complete the first in-order task.
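That split-run-merge pattern can be sketched as a generic helper. The chunking by item count, and running the chunks sequentially rather than at scheduled intervals, are simplifying assumptions for illustration:

```python
def run_large_analysis(task_items, chunk_size, run_chunk, merge):
    """Split an over-threshold analysis task into chunks, run each chunk
    (the embodiment runs them at intervals; here they run back to back),
    and merge the partial graph-analysis results into one final result."""
    partials = []
    for i in range(0, len(task_items), chunk_size):
        partials.append(run_chunk(task_items[i:i + chunk_size]))
    return merge(partials)

# e.g. a degree-sum analysis over 10 vertices, 3 vertices per sub-task
total = run_large_analysis(list(range(10)), 3, sum, sum)
```

Running the sub-tasks at intervals lets graph update tasks interleave between them, which is why the scheduler bothers to split long analyses at all.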
Further, after running a task, the scheduler 603 may also monitor whether the task times out; if so, the task is suspended and restarted in the next period.
The read/write lock module 801 and the partition identifier 701 may each be separately combined with the modules of fig. 6.
An embodiment of the present application further provides a graph data processing system, as shown in fig. 9, including:
the service interface layer comprises a graph updating interface and a graph analysis interface, wherein the graph updating interface is used for receiving a graph updating task and writing the graph updating task into a graph updating task queue; the graph analysis interface is used for receiving graph analysis tasks and writing the graph analysis tasks into a graph analysis task queue;
a task scheduling layer including the graph data processing apparatus;
the graph calculation engine is used for carrying out a graph updating operation and/or a graph analysis operation of the task;
and the graph storage engine is used for storing the graph.
In a specific implementation, the update interface in the service interface layer is an interface at the service level and carries the corresponding service semantics. Its basic design rule is: convert the non-graph data model of the business into a standard graph data model, and standardize all interface rules through Vertex, Edge, Relationship and Property.
The analysis interface in the service interface layer receives service-driven analysis tasks, timed tasks, or analysis tasks that depend on updated data objects. This type of interface generally accepts two kinds of rules: the analysis source and the analysis rule index.
Further, the graph storage engine comprises a cache region and a disk;
after the graph computation engine runs a graph update task, the memory-mapped object is stored into the cache region and the disk simultaneously; when the task is a graph analysis task, data is obtained from the cache region, and if the data related to the graph analysis task is not in the cache region, the data is obtained from the disk.
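The write-to-both, read-cache-first policy above can be sketched in a few lines. The class and its two dicts standing in for the cache region and the disk are illustrative assumptions:

```python
class TieredGraphStore:
    """Sketch of the cache-region + disk layout of the graph storage engine."""
    def __init__(self):
        self.cache = {}   # cache region: recently updated objects
        self.disk = {}    # disk: full persisted state

    def write(self, key, obj):
        # an update task stores the memory-mapped object to BOTH tiers
        self.cache[key] = obj
        self.disk[key] = obj

    def read(self, key):
        # an analysis task tries the cache region first, then falls back to disk
        if key in self.cache:
            return self.cache[key]
        return self.disk.get(key)

store = TieredGraphStore()
store.write("v1", {"degree": 3})   # hot: lands in cache and disk
store.disk["v2"] = {"degree": 7}   # cold: present on disk only
```

Because updates land in the cache first, an analysis task that follows an update sees the fresh data without touching the disk at all.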
Further, when the graph storage engine stores the memory-mapped object into the cache region, the object is processed and divided into an incremental (Delta) update object and a hotspot object.
Further, the graph storage engine includes:
a graph data characteristic analyzer, configured to determine whether the graph is a sparse graph or a dense graph according to the data characteristics of the graph;
a graph computation characteristic analyzer, configured to determine whether the graph is vertex-dominant or edge-dominant according to the computation characteristics of the graph;
and a graph storage and segmentation manager, configured to determine a segmentation algorithm of the graph according to the data characteristics and the computation characteristics of the graph, and to store the data of the graph in segments.
Further, as shown in fig. 10, the system may further include a monitoring core, configured to collect the resource load conditions of the graph computation engine and the graph storage engine in real time, convert the monitoring information in real time into measurable graph computation scheduling evaluation factors, and provide them to the scheduler 603 of the task scheduling layer;
the scheduler 603 also evaluates the assignment of scheduling tasks according to the graph computation scheduling evaluation factors.
In particular implementations, the graph computation scheduling evaluation factors may include:
the number of newly added graph update tasks and graph analysis tasks (within one minute);
the number of running graph update tasks and graph analysis tasks;
the number of update tasks and analysis tasks queued in the task queues;
the number of partitions of the graph and the physical cutting status;
the number of vertices and edges of the whole graph;
the current read-write lock status of the system;
the number of newly added edges and vertices cached in the storage engine;
the number of edges and vertices to be merged in the storage engine;
the number of deleted and modified edges and vertices in the storage engine;
the number of sub-blocks to be split in the storage engine;
the Memory and IO overhead of the computation engine;
the Memory Size, Cache Size and Disk File Size of the storage engine.
According to the graph computation scheduling evaluation factors, the scheduler 603 can distribute computation tasks more efficiently, maximizing the concurrency between parallel and multi-graph services within a single machine.
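One simple way to turn such monitored metrics into a single measurable factor is a weighted sum. The patent only states that the monitoring information is converted into a measurable evaluation factor; the weights, metric names, and linear form below are all illustrative assumptions:

```python
def scheduling_factor(metrics, weights=None):
    """Collapse a dict of monitored load metrics into one scalar the
    scheduler can compare across machines or graph services."""
    weights = weights or {
        "queued_tasks": 1.0,        # backlog in the two task queues
        "running_tasks": 2.0,       # currently executing tasks
        "engine_io_overhead": 3.0,  # computation-engine Memory/IO pressure
    }
    return sum(weights[k] * metrics.get(k, 0) for k in weights)

factor = scheduling_factor({"queued_tasks": 2, "running_tasks": 1})
```

A scheduler could then prefer the target with the lowest factor when assigning the next task.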
In addition, the monitoring core can also provide the calculation scheduling evaluation factor to a graph monitoring display system to display the running condition of the system.
For convenience of description, each part of the above-described apparatus is separately described as being functionally divided into various modules or units. Of course, the functionality of the various modules or units may be implemented in the same one or more pieces of software or hardware when implementing the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

Claims (17)

1. A method for processing data from a graph database, comprising:
writing the request to be processed into a graph updating task queue or a graph analyzing task queue according to the type of the request to be processed, wherein the type of the request to be processed comprises a graph updating request and a graph analyzing request;
determining the running order of the tasks according to first characteristics of the tasks in the graph update task queue and the graph analysis task queue; after the running order of the tasks is determined, determining whether the first in-order task is the current task to be run according to the state of a read-write lock; the determining whether the first in-order task is the current task to be run according to the state of the read-write lock includes any one or a combination of the following:
when the read-write lock is in an unoccupied state, determining that the first in-order task is the current task to be run;
when the read-write lock is in an occupied state, if the currently running task is a pure-read graph analysis task, determining that the first in-order task is the current task to be run;
when the read-write lock is in the occupied state, if the currently running task is a non-pure-read graph analysis task or a graph update task, suspending the first in-order task and judging the state of the read-write lock again in the next period;
wherein the state of the read-write lock is modified to occupied when a task runs and modified to unoccupied when the task finishes running or is suspended;
and running the tasks according to the running sequence.
2. The method of claim 1, wherein said executing the tasks according to the execution order comprises:
and distributing the current task to be operated to the corresponding computing resource to operate according to the distributed partition information of the data related to the current task to be operated.
3. The method of claim 1, wherein the first characteristic comprises any one or a combination of: timestamp, timeliness, priority, data dependency characteristics.
4. The method according to claim 1, wherein after the running order of the tasks is determined, it is determined whether the first in-order task is a graph analysis task whose time and/or resource consumption is greater than a set threshold; if so, the first in-order task is split into a plurality of tasks, the plurality of tasks are run at intervals, and after the plurality of tasks finish running, the graph analysis results are merged to complete the first in-order task.
5. The method of claim 1, wherein after running a task, monitoring whether the task runs overtime, if yes, suspending the task, and waiting for a next period, restarting the task.
6. The method of claim 1, wherein after executing the graph update task, storing the memory mapped object to both the cache and the disk;
when a graph analysis task is executed, data are obtained from the cache region;
and if the data related to the graph analysis task is not in the cache region, acquiring the data from the disk.
7. The method of claim 6, wherein storing the memory-mapped object to the cache memory area processes the memory-mapped object into an incremental update object and a hot object.
8. The method of claim 1, wherein performing graph storage comprises:
determining whether the graph is a sparse graph or a dense graph according to the data characteristics of the graph;
determining whether the graph is vertex-dominant or edge-dominant according to the computation characteristics of the graph;
and determining a segmentation algorithm of the graph according to the data characteristics and the computation characteristics of the graph, and storing the data of the graph in segments.
9. A graph database data processing apparatus, comprising:
the graph updating task queue is used for writing a graph updating task;
a graph analysis task queue for writing a graph analysis task;
the scheduler is used for determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue and distributing the current task to be run to the corresponding computing resource for running;
the read-write lock module is used for storing the state of the read-write lock, and the state of the read-write lock is modified to be occupied when the task runs and is modified to be unoccupied when the task runs and is finished or suspended;
after determining the running order of the tasks, the scheduler determines whether the first in-order task is the current task to be run according to the state of the read-write lock; the scheduler determining whether the first in-order task is the current task to be run according to the state of the read-write lock includes any one or a combination of the following:
when the state of the read-write lock is unoccupied, determining that the first in-order task is the current task to be run;
when the read-write lock is in an occupied state, if the currently running task is a pure-read graph analysis task, determining that the first in-order task is the current task to be run;
and when the state of the read-write lock is occupied, if the currently running task is a non-pure-read graph analysis task or a graph update task, suspending the first in-order task and judging the state of the read-write lock again in the next period.
10. The apparatus of claim 9, comprising a partition identifier to provide distributed partition information to the scheduler regarding data for a task currently to be run.
11. The apparatus according to claim 9, wherein the scheduler determines, after determining the running order of the tasks, whether a first in-order task is a graph analysis task whose time and/or resource consumption is greater than a set threshold, if so, splits the first in-order task into a plurality of tasks, runs the plurality of tasks at intervals, and merges graph analysis results after the plurality of tasks are finished running, thereby completing the first in-order task.
12. The apparatus of claim 9, wherein the scheduler monitors whether the task runs overtime after running the task, and if so, suspends the task and restarts the task for a next period.
13. A system for processing data from a graph database, comprising:
the service interface layer comprises an updating interface and an analysis interface, wherein the updating interface is used for receiving a data updating task and writing the data updating task into an updating task queue; the analysis interface is used for receiving data analysis tasks and writing the data analysis tasks into an analysis task queue;
a task scheduling layer comprising the apparatus of any one of claims 9 to 12;
the graph calculation engine is used for carrying out a graph updating operation and/or a graph analysis operation of the task;
and the graph storage engine is used for storing the graph.
14. The system of claim 13, wherein the graph storage engine comprises a cache and a disk;
the graph calculation engine stores the memory mapping object into a cache region and a disk simultaneously after running a graph updating task, acquires data from the cache region when the task is a graph analysis task, and acquires the data from the disk if the data related to the graph analysis task is not in the cache region.
15. The system of claim 14, wherein the graph store engine processes the memory-mapped object as it is stored in the cache memory into an incremental update object and a hotspot object.
16. The system of claim 13, wherein the graph storage engine comprises:
a graph data characteristic analyzer, configured to determine whether the graph is a sparse graph or a dense graph according to the data characteristics of the graph;
a graph computation characteristic analyzer, configured to determine whether the graph is vertex-dominant or edge-dominant according to the computation characteristics of the graph;
and a graph storage and segmentation manager, configured to determine a segmentation algorithm of the graph according to the data characteristics and the computation characteristics of the graph, and to store the data of the graph in segments.
17. The system of claim 13, further comprising a monitoring core for collecting resource load conditions of the graph computation engine and the graph storage engine in real time, converting monitoring information into a measurable graph computation scheduling evaluation factor in real time, and providing the measurable graph computation scheduling evaluation factor to a scheduler of the task scheduling layer;
and the scheduler also calculates a scheduling evaluation factor according to the graph to evaluate the scheduling task distribution.
CN201510419390.0A 2015-07-16 2015-07-16 Graph data processing method, device and system Active CN106354729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510419390.0A CN106354729B (en) 2015-07-16 2015-07-16 Graph data processing method, device and system


Publications (2)

Publication Number Publication Date
CN106354729A CN106354729A (en) 2017-01-25
CN106354729B true CN106354729B (en) 2020-01-07

Family

ID=57842658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510419390.0A Active CN106354729B (en) 2015-07-16 2015-07-16 Graph data processing method, device and system

Country Status (1)

Country Link
CN (1) CN106354729B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108132838B (en) 2016-11-30 2021-12-14 华为技术有限公司 Method, device and system for processing graph data
CN108595251B (en) * 2018-05-10 2022-11-22 腾讯科技(深圳)有限公司 Dynamic graph updating method, device, storage engine interface and program medium
CN108984281A (en) * 2018-05-30 2018-12-11 深圳市买买提信息科技有限公司 A kind of task processing method and server
CN109670089A (en) * 2018-12-29 2019-04-23 颖投信息科技(上海)有限公司 Knowledge mapping system and its figure server
CN113032112A (en) * 2019-12-25 2021-06-25 上海商汤智能科技有限公司 Resource scheduling method and device, electronic equipment and storage medium
CN111309750A (en) * 2020-03-31 2020-06-19 中国邮政储蓄银行股份有限公司 Data updating method and device for graph database
CN111291870B (en) * 2020-05-09 2020-08-21 支付宝(杭州)信息技术有限公司 Method and system for processing high-dimensional sparse features in deep learning of images
CN115470377B (en) * 2021-06-11 2024-07-16 清华大学 Stream graph data processing method and system
CN113239243A (en) * 2021-07-08 2021-08-10 湖南星汉数智科技有限公司 Graph data analysis method and device based on multiple computing platforms and computer equipment
CN113672636B (en) * 2021-10-21 2022-03-22 支付宝(杭州)信息技术有限公司 Graph data writing method and device
CN116821250B (en) * 2023-08-25 2023-12-08 支付宝(杭州)信息技术有限公司 Distributed graph data processing method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102906743A (en) * 2010-05-17 2013-01-30 慕尼黑技术大学 Hybrid OLTP and OLAP high performance database system
CN104504003A (en) * 2014-12-09 2015-04-08 北京航空航天大学 Graph data searching method and device
CN104615677A (en) * 2015-01-20 2015-05-13 同济大学 Graph data access method and system
CN104679764A (en) * 2013-11-28 2015-06-03 方正信息产业控股有限公司 Method and device for searching graph data

Also Published As

Publication number Publication date
CN106354729A (en) 2017-01-25

Similar Documents

Publication Publication Date Title
CN106354729B (en) Graph data processing method, device and system
US20200210412A1 (en) Using databases for both transactions and analysis
CN109726191B (en) Cross-cluster data processing method and system and storage medium
CN110245023B (en) Distributed scheduling method and device, electronic equipment and computer storage medium
US10031935B1 (en) Customer-requested partitioning of journal-based storage systems
CN106933669B (en) Apparatus and method for data processing
Ju et al. iGraph: an incremental data processing system for dynamic graph
CN108920153A (en) A kind of Docker container dynamic dispatching method based on load estimation
CN110971939B (en) Illegal picture identification method and related device
US10198346B1 (en) Test framework for applications using journal-based databases
CN110874271B (en) Method and system for rapidly calculating mass building pattern spot characteristics
CN109213752A (en) A kind of data cleansing conversion method based on CIM
US8938599B2 (en) Distributed graph storage system
CN112579692B (en) Data synchronization method, device, system, equipment and storage medium
CN115587118A (en) Task data dimension table association processing method and device and electronic equipment
US10235407B1 (en) Distributed storage system journal forking
CN110825526B (en) Distributed scheduling method and device based on ER relationship, equipment and storage medium
CN108563787A (en) A kind of data interaction management system and method for data center&#39;s total management system
CN116414801A (en) Data migration method, device, computer equipment and storage medium
CN113886111A (en) Workflow-based data analysis model calculation engine system and operation method
JP2009037369A (en) Resource assignment method to database server
KR20170033303A (en) Dynamic n-dimensional cubes for hosted analytics
CN115269519A (en) Log detection method and device and electronic equipment
CN110971928B (en) Picture identification method and related device
WO2021096346A1 (en) A computer-implemented system for management of container logs and its method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Patentee before: Alibaba Group Holding Ltd.

TR01 Transfer of patent right