Graph data processing method, device and system
Technical Field
The present application relates to the field of graph processing technologies, and in particular, to a graph data processing method, device, and system.
Background
At present, the industry offers many products and solutions for graph computation, but most of them are limited to analyzing static graph data or to updating and processing individual graph data; a solution that supports both real-time updating and real-time analysis of the complete graph is lacking.
In the traditional database field, OLTP (On-Line Transaction Processing) and OLAP (On-Line Analytical Processing) are usually separated. Because data accumulates slowly and analysis on the data requires considerable resources, in terms of processing delay an analysis result is often available only after a day or more. Moreover, following the design philosophy of relational databases, the relational data model can take the dependency relationships among data into account only at the analysis stage.
In the field of graph data, a dependency model is used to process the data; that is, the data naturally carries strong relationships. A graph G = (V, E) contains two basic models, Vertex and Edge, in which vertices are physically linked together by edges acting as relationships. For real-time updating and real-time analysis of graph data in a service scenario, a data update must immediately affect the dependency relationships, and the resulting effect must immediately trigger the corresponding service analysis operation; this gives rise to the service requirement of real-time updating and real-time analysis of the graph.
To that end, real-time updating of the graph requires that, when a relationship occurs, the related entities (vertices) in the natural graph domain can complete the update of that relationship in time; the analysis processing task of the graph can then be completed quickly according to the service scenario.
Most graph database systems, graph storage systems, and graph computation frameworks in the industry currently borrow the MapReduce or BSP computing framework to build non-real-time graph analysis systems on distributed file systems such as GFS or HDFS. The supported application scenarios are narrow: the data depends on daily full snapshots, multiple jobs are started for parallel analysis, and the analysis results are often delayed by an hour or more.
In addition, some graph database systems merely extend the design concepts of conventional relational databases with graph features. Although they can support graph updating fairly quickly, similar to the characteristics of OLTP, they provide no corresponding support for OLAP-style workloads, and it is even more difficult for them to provide a solution that fuses the two types of application characteristics.
The prior art has the following disadvantages:
Relational databases have evolved over many years and are well established for most application scenarios; however, the OLTP and OLAP application scenarios tend to be separated, so many technical frameworks cannot directly combine the two. In recent years, with the rise of the NoSQL (non-relational database) model, new graph databases have attempted to break through this situation, but because the field is new, no mature technical framework is yet fully compatible with both application scenarios.
Disclosure of Invention
The embodiments of the present application provide a graph data processing method, device, and system, which solve the technical problem that graph updating and graph analysis processing cannot be made compatible within a single system.
In one aspect, an embodiment of the present application provides a graph data processing method, including:
writing the request to be processed into a graph updating task queue or a graph analyzing task queue according to the type of the request to be processed, wherein the type of the request to be processed comprises a graph updating request and a graph analyzing request;
determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue;
and running the tasks according to the running sequence.
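The three steps above can be sketched as follows; the task record fields (`type`, `ts`, `id`) and the two queue structures are illustrative assumptions, not part of the claimed method:

```python
from collections import deque

# Hypothetical request records; "type" and "ts" (timestamp) are assumed fields.
UPDATE, ANALYSIS = "graph_update", "graph_analysis"

update_queue, analysis_queue = deque(), deque()

def enqueue(request):
    """Write the pending request into the matching task queue by its type."""
    if request["type"] == UPDATE:
        update_queue.append(request)
    elif request["type"] == ANALYSIS:
        analysis_queue.append(request)
    else:
        raise ValueError("unknown request type: %r" % request["type"])

def run_order():
    """Determine the running order of all queued tasks by a first
    characteristic (here simply the timestamp: earliest first)."""
    return sorted(list(update_queue) + list(analysis_queue),
                  key=lambda t: t["ts"])
```

Running the tasks then simply iterates over `run_order()`, dispatching each task to its computing resource.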
In another aspect, an embodiment of the present application provides a graph data processing apparatus, including:
a graph updating task queue for writing a graph updating task;
a graph analysis task queue for writing a graph analysis task;
and the scheduler is used for determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue and distributing the current task to be run to the corresponding computing resource for running.
In another aspect, an embodiment of the present application provides a graph data processing system, including:
the service interface layer comprises an updating interface and an analysis interface, wherein the updating interface is used for receiving a data updating task and writing the data updating task into an updating task queue; the analysis interface is used for receiving data analysis tasks and writing the data analysis tasks into an analysis task queue;
the task scheduling layer comprises a graph updating task queue, a graph analyzing task queue, a scheduler, a graph computing engine and a graph storing engine, wherein:
the graph updating task queue is used for writing a graph updating task;
a graph analysis task queue for writing a graph analysis task;
the scheduler is used for determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue and distributing the current task to be run to the corresponding computing resource for running;
the graph calculation engine is used for carrying out a graph updating operation and/or a graph analysis operation of the task;
and the graph storage engine is used for storing the graph.
The beneficial effects are as follows:
the embodiments of the present application provide a graph data processing method, device, and system that can receive graph update requests and graph analysis requests separately, place them into a graph update task queue and a graph analysis task queue respectively, manage each task, and determine a running order for the tasks. A single system can thus accommodate both graph updating and graph analysis processing, which solves the problem that the application scenarios of existing graph updating and graph analysis processing are separated, so that graph analysis no longer depends on daily full data and its results are no longer delayed by an hour or more.
Drawings
Specific embodiments of the present application will be described below with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart illustrating a graph data processing method in an embodiment of the present application;
FIG. 2 is a flow chart illustrating a graph data processing method according to a first embodiment;
FIG. 3 is a flowchart showing a graph data processing method according to the second embodiment;
FIG. 4 is an exploded diagram of the internal abstract implementation of two interfaces in the second embodiment;
FIG. 5 is a flowchart showing graph storage in the third embodiment;
FIG. 6 is a schematic structural diagram of the data processing apparatus in the embodiment of the present application;
FIG. 7 is a schematic structural diagram of a graph data processing apparatus according to an example in the embodiment of the present application;
FIG. 8 is a schematic structural diagram of a graph data processing apparatus according to an example in the embodiment of the present application;
FIG. 9 is a block diagram of the data processing system of the embodiment of the present application;
FIG. 10 is a block diagram of a graph data processing system according to an example in an embodiment of the present application.
Detailed Description
In order to make the technical solutions and advantages of the present application more apparent, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. It should be clear that the described embodiments are only some of the embodiments of the present application, not an exhaustive list of all embodiments. The embodiments, and the features of the embodiments, in this description may be combined with each other where no conflict arises.
The inventor has found that graph services currently impose heavy data storage and computation requirements. For example, the transactions, purchases, transfers, and similar events of an online shopping platform exceed the tens-of-thousands level per second, and the daily volume exceeds billions of records. The data is written and updated very frequently in real time, and after the data is written it must be incorporated into the graph data model quickly. Graph-based business scenarios such as transaction risk identification and precise recommendation require that full analysis computation can be performed quickly on incremental graph data and the computation results output, that graph updates and writes are quickly applied to the graph storage engine so that graph analysis computation covers the latest data as much as possible, and that only a second-level delay is tolerated between the data snapshot on which analysis depends and the latest data update. Based on these practical requirements, the embodiments of the present application provide a graph data processing method, apparatus, and system, which are described below.
Graph updating means that an external service application sends an instruction to update the attributes of a vertex in the graph, add a new vertex, establish a new edge from vertex A to vertex B, modify the attributes of an edge, and the like.
Graph analysis refers to analyzing and computing a specific subgraph or the full graph under an analysis instruction from the service; the analysis process consists of read-only query operations such as traversing, counting, and filtering over the graph and over the attributes of specific vertices and edges.
Fig. 1 illustrates a graph data processing method in an embodiment of the present application, and as shown in the figure, includes:
step 101, writing the request to be processed into a graph updating task queue or a graph analyzing task queue according to the type of the request to be processed, wherein the type of the request to be processed comprises a graph updating request and a graph analyzing request;
102, determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue;
and 103, running each task according to the running sequence.
Beneficial effects: the graph data processing method in the embodiment of the present application can receive graph update requests and graph analysis requests separately, place them into a graph update task queue and a graph analysis task queue respectively, manage each task, and determine the running order of the tasks. A single system can thus accommodate both graph updating and graph analysis processing, which solves the problem that the application scenarios of existing graph updating and graph analysis processing are separated, that graph analysis depends on daily full data, and that the analysis results are delayed by an hour or more.
Further, in order to improve the processing efficiency, the following embodiment may be performed.
In implementation, after the running order of the tasks is determined, whether the first-in-order task is the current task to be run is determined according to the state of a read-write lock;
the state of the read-write lock is modified to occupied when a task starts running, and modified to unoccupied when a task finishes running or is suspended.
Determining whether the first-in-order task is the current task to be run according to the state of the read-write lock may include any one or a combination of the following:
when the read-write lock is in the unoccupied state, determining that the first-in-order task is the current task to be run;
when the read-write lock is in the occupied state, if the currently running task is a pure-read graph analysis task, determining that the first-in-order task is the current task to be run;
and when the read-write lock is in the occupied state, if the currently running task is a non-pure-read graph analysis task or a graph update task, suspending the first-in-order task and judging the state of the read-write lock again in the next cycle.
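The three lock rules above can be sketched as a small decision function; the state names and the `pure_read_analysis` task-kind label are hypothetical identifiers introduced only for illustration:

```python
# Hypothetical states of the read-write lock.
UNOCCUPIED, OCCUPIED = "unoccupied", "occupied"

def may_run_now(lock_state, running_task_kind=None):
    """Decide whether the first-in-order task may run in this cycle.

    Mirrors the rules in the text: run when the lock is unoccupied; run in
    parallel when the currently running task is a pure-read graph analysis
    task; otherwise the caller suspends the task until the next cycle.
    """
    if lock_state == UNOCCUPIED:
        return True
    # Lock occupied: only a pure-read analysis task permits parallelism.
    return running_task_kind == "pure_read_analysis"
```

A scheduler would call this once per cycle and, on `False`, re-check in the next cycle as described above.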
Beneficial effects:
in this implementation a read-write lock is added, and whether the first-in-order task is the current task to be run is determined according to the state of the read-write lock. Hence, even when the read-write lock is occupied, if the currently running task is a pure-read graph analysis task, the first-in-order task is still determined to be the current task to be run; tasks can therefore run in parallel, and processing efficiency is improved.
In addition, in implementation, after the running order of the tasks is determined, it may be judged whether the first-in-order task is a graph analysis task whose time and/or resource consumption exceeds a set threshold. If so, the first-in-order task is split into a plurality of sub-tasks, the sub-tasks are run at intervals, and after they all finish, the graph analysis results are merged to complete the first-in-order task.
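A minimal sketch of this split-and-merge idea, under the assumptions that the analysis task can be expressed as a function over a list of vertices and that its partial results can be merged by summation:

```python
def split_task(vertices, max_chunk):
    """Split a heavy analysis task over `vertices` into smaller sub-tasks."""
    return [vertices[i:i + max_chunk] for i in range(0, len(vertices), max_chunk)]

def run_in_parts(vertices, analyze, max_chunk):
    """Run the sub-tasks one by one (in practice, at intervals so other
    tasks can interleave), then merge the partial analysis results."""
    partials = [analyze(chunk) for chunk in split_task(vertices, max_chunk)]
    return sum(partials)  # merge step; summation suits counting-style analyses
```

Between chunks, the scheduler could admit queued graph update tasks, which is what makes the interval-wise execution useful.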
Alternatively, after a task starts running, whether the task has timed out is monitored; if so, the task is suspended and restarted in the next cycle.
Beneficial effects: by splitting a graph analysis task, and by monitoring whether a running task has timed out and suspending it if so, a single task can be prevented from occupying too much time and/or too many resources. Tasks therefore proceed more reasonably, and in particular, when a graph analysis task consumes considerable time and/or resources, the graph update tasks can still be carried out effectively.
Further, after a graph update task is executed, the memory mapping object may be stored to the cache region and to the disk at the same time;
when a graph analysis task is executed, data is obtained from the cache region;
and if the data related to the graph analysis task is not in the cache region, the data is obtained from the disk.
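A minimal sketch of this cache-first read path, modeling the cache region and the disk as plain dictionaries (an assumption for illustration only):

```python
def apply_update(key, value, cache, disk):
    """After a graph update, store the object to cache and disk together."""
    cache[key] = value
    disk[key] = value

def read_for_analysis(key, cache, disk):
    """Fetch graph data for an analysis task: cache region first, with a
    fallback to the disk copy (cold data) on a cache miss."""
    if key in cache:
        return cache[key]
    value = disk[key]      # cold data: fall back to the disk copy
    cache[key] = value     # optionally warm the cache for later reads
    return value
```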
Because the disk data is cold data, obtaining data from the cache region is more efficient; and because the data in the cache region is the most recently updated data, it reflects the latest state of the graph. Efficiency can therefore be greatly improved for application scenarios that use the data in the cache region.
Further, the graph storage may include:
determining the graph to be a sparse graph or a dense graph according to the data characteristics of the graph;
determining whether the graph is dominant at the vertex or dominant at the edge according to the calculation characteristics of the graph;
and determining a graph segmentation algorithm according to the data characteristics of the graph and the calculation characteristics, and segmenting and storing the data of the graph.
Because the segmentation algorithm of the graph is determined according to the data characteristics and the calculation characteristics of the graph, the adopted segmentation algorithm of the graph can be more reasonable, the reasonability of data storage is enhanced, and the whole scheme is more efficient.
To facilitate the practice of the present application, the following description is given by way of example.
Embodiment one:
as shown in fig. 2, the graph data processing method in the first embodiment includes:
step 201, monitoring whether a graph updating request or a graph analyzing request is received, if so, performing step 202; otherwise, returning to the step 201;
generally, when the system is started and the system initialization is completed, monitoring can be started to determine whether a graph update request or a graph analysis request is received, and the specific monitoring starting time is not limited in this step.
In the method of this embodiment, subsequent processing is performed only on a received graph update request or graph analysis request.
Step 202, writing the request to be processed into a graph updating task queue or a graph analyzing task queue according to the type of the request to be processed;
that is, the graph update request is written to the graph update task queue, and the graph analysis request is written to the graph analysis task queue.
Step 203, determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue;
The first characteristic includes any one or a combination of: timestamp, timeliness, priority, and data dependency characteristics. For example, the running order of the tasks may be determined solely by the timestamps of the tasks in the two queues, i.e., the task that entered the queues first is processed first; or the running order may be determined by combining the timeliness, priority, and data dependency characteristics of each task. The specific first characteristic may be chosen according to actual needs.
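As one possible combination of first characteristics, the following sketch orders tasks by priority first and timestamp second; the field names `priority` and `ts` are assumptions for illustration:

```python
def run_order(tasks):
    """Order tasks by a combined first characteristic: higher "priority"
    first, and earlier "ts" (timestamp) first among equal priorities."""
    return sorted(tasks, key=lambda t: (-t["priority"], t["ts"]))
```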
After the running order of the tasks is determined in this step, it may further be judged whether the first-in-order task is a graph analysis task whose time and/or resource consumption exceeds a set threshold. If so, the first-in-order task is split into a plurality of sub-tasks, the sub-tasks are run at intervals, and after they all finish, the graph analysis results are merged to complete the first-in-order task. This avoids one task occupying too much time and/or too many resources, makes task scheduling more reasonable, and ensures that graph update tasks are carried out effectively even when a graph analysis task is time- and/or resource-consuming.
Step 204, determining, according to the state of the read-write lock, whether the first-in-order task is the current task to be run; if so, performing step 205; otherwise, suspending the first-in-order task, waiting for the next cycle, and returning to step 204;
The read-write lock is adopted because the graph update tasks and the graph analysis tasks are processed in the same system; it is introduced to guarantee the eventual consistency of the transactions of the update tasks and the analysis tasks. After the read-write lock is introduced, tasks that do not affect each other, such as a pure-read analysis task and an update task, can be completed in parallel, which improves efficiency. In a specific implementation the read-write lock can also be omitted; in that case, to guarantee the eventual consistency of the transactions of the update and analysis tasks, parallel tasks are not allowed, and the next task starts only after the current task is confirmed to be completed. The state of the read-write lock is modified to occupied when a task starts running, and modified to unoccupied when the task finishes running or is suspended. This step can be understood as determining, according to the state of the read-write lock, whether the first-in-order task is allowed to run.
The specific operation of this step may include any one or a combination of the following:
when the read-write lock is in the unoccupied state, determining that the first-in-order task is the current task to be run;
when the read-write lock is in the occupied state, if the currently running task is a pure-read graph analysis task, determining that the first-in-order task is the current task to be run;
and when the read-write lock is in the occupied state, if the currently running task is a non-pure-read graph analysis task or a graph update task, suspending the first-in-order task and judging the state of the read-write lock again in the next cycle.
When the read-write lock is in the occupied state, if the currently running task is a pure-read graph analysis task, the first-in-order task is still determined to be the current task to be run; other tasks can thus run in parallel with a pure-read graph analysis task, improving processing efficiency.
In a practical implementation, such processing may also be omitted: whenever the state of the read-write lock is occupied, the first-in-order task is suspended and the state of the read-write lock is judged again in the next cycle; with this scheme, parallel tasks are not allowed.
And step 205, allocating the current task to be run to the corresponding computing resources for running, according to the distributed partition information of the data related to that task.
In the industry, graph storage and computation are divided into a single-node mode and a distributed mode. In the single-node mode, the whole graph is stored on a single machine and its computation is concentrated on a single computing node. The distributed mode targets graphs so large that they physically cannot be stored on a single machine: the graph is stored in a distributed fashion across multiple machines, and its computation is likewise distributed and executed in parallel on those machines. This embodiment takes the distributed mode as an example, so the current task to be run must be allocated to the corresponding computing resources according to the distributed partition information of the data related to the task.
After a task starts running, whether the task has timed out can be monitored; if so, the task is suspended and restarted in the next cycle. This avoids one task occupying too much time and/or too many resources, makes task scheduling more reasonable, and ensures that graph update tasks are carried out effectively even when a graph analysis task is time- and/or resource-consuming.
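A sketch of the timeout monitoring just described, with an injectable clock so the behavior is deterministic; the task shape (a dict with a list of callable steps) and the step-wise execution model are assumptions:

```python
import itertools

def run_with_timeout(task, budget, clock):
    """Run the task's steps; if elapsed time exceeds the budget, suspend
    the task so the scheduler can restart it in the next cycle."""
    start = clock()
    for step in task["steps"]:
        if clock() - start > budget:
            task["suspended"] = True
            return "suspended"
        step()
    return "done"
```

In production the clock would be `time.monotonic`; the test below uses a counting clock where each call advances one "second".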
Embodiment two:
the graph data processing method in the second embodiment, as shown in fig. 3, includes:
step 301, monitoring whether a graph updating request or a graph analyzing request is received, and if so, performing step 302; otherwise, returning to the step 301;
step 302, writing the request to be processed into a graph updating task queue or a graph analyzing task queue according to the type of the request to be processed;
step 303, determining the running sequence of each task according to the first characteristics of each task in the graph updating task queue and the graph analyzing task queue;
step 304, determining, according to the state of the read-write lock, whether the first-in-order task is the current task to be run; if so, performing step 305; otherwise, suspending the first-in-order task, waiting for the next cycle, and returning to step 304;
step 305, judging whether the first-in-order task is a graph update task; if so, performing step 306; otherwise, performing step 307;
since this flow processes only graph update requests and graph analysis requests, determining in this step that the task is not a graph update task indicates that it is a graph analysis task.
Step 306, running the graph update task, and storing the memory mapping object to the cache region and the disk at the same time;
In implementation, when the memory mapping object is stored in the cache region, the Delta (incremental) update object may be stored directly, or the memory mapping object may be processed and divided into a Delta update object and a hot object. Hot objects can be generated according to existing rules; for example, a Delta update object operated on more than 100 times within one hour may be regarded as a hot object. The specific way hot objects are generated is not limited in this application. After hot objects are generated, graph analysis tasks can be performed on them.
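The example hot-object rule (more than a set number of operations within the window) might be sketched as follows; the shape of a Delta update record (a dict with a `key` field) is an assumption:

```python
from collections import Counter

def find_hot_objects(delta_updates, threshold=100):
    """Classify Delta update objects within one time window: objects
    touched more than `threshold` times count as hot objects (the text's
    example rule: more than 100 operations within one hour)."""
    counts = Counter(u["key"] for u in delta_updates)
    return {key for key, n in counts.items() if n > threshold}
```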
Step 307, obtaining data from the cache region to run the graph analysis task;
and step 308, if the data related to the graph analysis task is not in the cache region, obtaining the data from the disk.
Specifically, the graph computation engine in a graph database system includes two types of characteristic computation operations: update operations and analysis operations. Both types of operation take the asynchronous shared-memory model of BSP parallel tasks as technical reference, and the abstract common computation interfaces, combined with the characteristics of operating on graph-structured data, are as follows:
Graph update interface definition: UpdateResult updateGraph(GraphData)
Graph analysis interface definition: StatsResult statsGraph(StatsParam)
The internal abstract implementation of the two interfaces is decomposed as shown in fig. 4:
a) the internal steps of updateGraph are as follows:
a1) Query the vertices to update: gatherReadyUpdateVertex()
a2) Update the vertex information: applyUpdateGraph()
a3) Communicate the vertex update information to each adjacent vertex: scatterUpdateVertexs()
a4) Summarize the states of the updated vertices into the update-success queue of the cache region; this step is handled by an asynchronous message mechanism and does not affect the processing time of the previous steps.
b) The internal steps of statsGraph are as follows:
b1) Collect the source vertex information to analyze: gatherReadyStatsSourceVertex()
This step may collect the source vertex information to be analyzed from the update-success queue in the cache region.
b2) Perform the analysis task: applyStatsGraph()
b3) Merge the analysis and statistics task results: summaryStatsSourceVertexs()
The graph computation framework in this embodiment is divided into the following stages: Gather (collect), Apply (execute), Scatter (diverge), and Summary (summarize).
In this embodiment a Summary (summarization) stage is added. For an update task, it collects the results of the update operation and returns them to the update queue context; for an analysis task, it summarizes the results of the analysis task. The summarized information is written as standard rule data into the Context of the computing framework for use by the internal computing framework.
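The four stages might be sketched for a single-vertex update as follows; the graph and context representations (plain dicts) are illustrative assumptions, not the engine's actual data structures:

```python
def update_graph(graph, update, ctx):
    """Sketch of the Gather / Apply / Scatter / Summary pipeline for one
    vertex update, in the spirit of updateGraph above."""
    vertex = graph[update["vertex"]]              # Gather: read the vertex
    vertex["value"] = update["value"]             # Apply: update the vertex
    for nbr in vertex["neighbors"]:               # Scatter: notify neighbors
        graph[nbr]["inbox"].append(update["vertex"])
    # Summary: record the updated vertex in the update-success queue of
    # the context, for later pickup by an analysis task.
    ctx.setdefault("update_success_queue", []).append(update["vertex"])
```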
One characteristic of this embodiment: for the summarize-update-result operation summaryUpdateResult of the update interface updateGraph, after a task completes, the data of the graph vertices just updated in real time can, according to the rules, be automatically written into the updated-subgraph cache queue; the corresponding analysis task then automatically obtains that data from the cache queue, i.e., this is completed automatically in gatherReadyStatsSourceVertex of the statsGraph interface.
Embodiment three:
The graph storage in this application adopts a distributed graph storage engine. As the underlying support engine for efficient graph updating and graph analysis, the distributed graph storage engine is responsible for solving two main kinds of problems:
First: effectively storing dense graphs and sparse graphs in a distributed manner, the core of which is graph partitioning (Partition);
Real-world graph structures fall roughly into two categories. The first: the vertices (Vertex) of the graph each have a small number of adjacent edges (Edge), i.e., a sparse graph. The second: a small number of vertices have a large number of adjacent edges, i.e., a locally dense graph (referred to as a dense graph in this application).
Second: providing a simple and uniform access API (Application Programming Interface) for the upper-layer graph computation engine to call.
Regarding the first kind of problem, in the field of graph partition design there are three basic categories of reference segmentation algorithms:
A1) Balanced edge cutting: perform a hash calculation on the ID of each vertex and uniquely assign the vertex to one of the machines according to the number of machines; then redundantly store vertices on different machines according to their edges. To maintain the efficiency of graph computation, this algorithm must redundantly pair adjacent vertex and edge information on different machines, so any edge or vertex update involves a relatively large number of network transport interactions.
A2) Balanced vertex cutting: perform a hash calculation on the ID of each edge and uniquely assign the edge to one of the machines; the vertices connected to the edge are stored redundantly on different machines. Because each edge is stored uniquely, only updates to vertices involve more network transport interactions.
A3) Greedy vertex cutting: based on the A2 algorithm, for the two vertices V(a) and V(b) connected by any edge E, consider the sets of machines that already store the corresponding vertices. For example, suppose vertex a has been assigned the machine set M(a) and vertex b the machine set M(b); the following conditions are then evaluated to determine the edge assignment rule:
If M(a) and M(b) intersect, E is assigned to a machine in the intersection.
If M(a) and M(b) do not intersect but both are non-empty, E is assigned to the machine in the union of M(a) and M(b) that currently holds the fewest assigned edges.
If M(a) has been assigned but M(b) has not, E is assigned to a machine in M(a), and vice versa.
If neither M(a) nor M(b) has been assigned, E is assigned to the machine with the lightest load.
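The four greedy rules can be sketched as one placement function; here a per-machine `load` count stands in both for "the machine with the fewest assigned edges" and for "the machine with the lightest load", an assumption for illustration:

```python
def place_edge(machines_a, machines_b, load):
    """Greedy vertex-cut (A3): choose a machine for edge E given the
    machine sets M(a), M(b) already holding its endpoints and the
    per-machine load (number of edges already assigned)."""
    both = machines_a & machines_b
    if both:                                     # rule 1: intersection
        return min(both, key=load.get)
    if machines_a and machines_b:                # rule 2: both assigned, disjoint
        return min(machines_a | machines_b, key=load.get)
    if machines_a or machines_b:                 # rule 3: only one endpoint placed
        return min(machines_a or machines_b, key=load.get)
    return min(load, key=load.get)               # rule 4: least-loaded machine
```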
Algorithm A3 deliberately spends more effort on how edges are stored: because graph algorithms are relatively heavy, the storage part of the graph consumes some performance, but in return the performance of the subsequent graph computation part is significantly improved.
In this embodiment, to optimize graph storage in this application, the above algorithms are combined according to the data characteristics and the calculation characteristics of the graph. Specifically, as shown in fig. 5, the method includes the following steps:
step 501, determining the graph to be a sparse graph or a dense graph according to the data characteristics of the graph;
step 502, determining whether the graph is dominant at the vertex or dominant at the edge according to the calculation characteristics of the graph;
and step 503, determining a graph segmentation algorithm according to the data characteristics and the calculation characteristics of the graph, and segmenting and storing the data of the graph.
The specific algorithm may be edge cutting, vertex cutting, optimized vertex cutting, or the like.
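One possible, purely illustrative mapping from the two characteristics to a segmentation algorithm (the application does not prescribe this exact mapping; it is an assumption for the sketch):

```python
def choose_partitioning(is_dense, edge_dominant):
    """Pick a segmentation strategy from the data characteristic
    (sparse vs. dense) and the calculation characteristic
    (vertex-dominant vs. edge-dominant). Illustrative only."""
    if not is_dense and not edge_dominant:
        return "balanced_edge_cut"    # sparse graph, vertex-dominant computation
    if is_dense and edge_dominant:
        return "greedy_vertex_cut"    # dense graph, edge-heavy computation
    return "balanced_vertex_cut"      # mixed cases
```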
Because the segmentation algorithm of the graph is determined according to the data characteristics and the calculation characteristics of the graph, the adopted segmentation algorithm of the graph can be more reasonable, the reasonability of data storage is enhanced, and the whole scheme is more efficient.
Regarding the second kind of problem, the APIs of graph storage are uniformly packaged according to the business scenarios of graph update and graph analysis, as follows:
Create vertex: createVertex(Key)
Create edge: createEdge(Key, sourceVertex, targetVertex)
Update vertex: Result updateVertex(Vertex, Property)
Update edge: Result updateEdge(Edge, Property)
Find vertex: FindVertex(Key)
Find the edges of a vertex: FindEdgeOfVertex(Key)
Find the edges with a specified label: FindEdgeByLabel(Label)
Bulk create vertices: BulkCreateVertexs(List<Key>)
Bulk create edges: BulkCreateEdges(Key, List<sourceVertex>, List<targetVertex>)
Find adjacent vertices: FindAdjacentVertxs(Vertex)
Find adjacent edges: FindAdjacentEdges(Edge)
Delete vertex: Boolean dropVertex(Vertex)
Delete edge: Boolean dropEdge(Edge)
Therefore, a simple and uniform access API is provided for the upper-layer graph calculation engine to call conveniently.
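The interface list above can be packaged as a single access layer; the sketch below shows the shape of such a wrapper with a toy in-memory store standing in for the real engine. The method bodies and the dict-based storage are illustrative assumptions; only the operation names mirror the enumerated APIs.

```python
class GraphStore:
    """Toy unified access API over an in-memory vertex/edge store."""

    def __init__(self):
        self.vertices, self.edges = {}, {}

    def create_vertex(self, key):
        self.vertices[key] = {}
        return key

    def create_edge(self, key, source_vertex, target_vertex):
        self.edges[key] = {"src": source_vertex, "dst": target_vertex}
        return key

    def update_vertex(self, key, prop):
        self.vertices[key].update(prop)   # merge new properties in
        return True

    def find_vertex(self, key):
        return self.vertices.get(key)

    def find_adjacent_vertices(self, key):
        # scan edges and collect the opposite endpoint of each match
        out = set()
        for e in self.edges.values():
            if e["src"] == key:
                out.add(e["dst"])
            elif e["dst"] == key:
                out.add(e["src"])
        return out

    def drop_vertex(self, key):
        return self.vertices.pop(key, None) is not None
```

An upper-layer graph calculation engine would only ever see this surface, never the storage details behind it.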
Based on the same inventive concept, an embodiment of the present application further provides a graph data processing apparatus. Since the principle by which the apparatus solves the problem is similar to that of the graph data processing method, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 6, the apparatus may include:
a graph update task queue 601 for writing a graph update task;
a graph analysis task queue 602 for writing graph analysis tasks;
the scheduler 603 is configured to determine a running order of each task according to the first characteristic of each task in the graph update task queue and the graph analysis task queue, and allocate the current task to be run to the corresponding computing resource for running.
In a specific implementation, the graph update task queue 601 may be responsible for maintaining timeliness and power law control of update tasks; the graph analysis task queue 602 may maintain the priority, timeliness, and failure retry characteristics of the analysis tasks.
Further, when the apparatus is applied to a distributed system, as shown in fig. 7, it includes a partition identifier 701 for providing the scheduler 603 with the distributed partition information related to the data of the task currently to be run. When the system is not distributed, the partition identifier is not needed.
Further, as shown in fig. 8, the apparatus may further include a read-write lock module 801 configured to store the state of a read-write lock, where the state is set to occupied when a task starts running and set to unoccupied when the task finishes or is suspended;
after determining the running order of the tasks, the scheduler 603 determines whether the first-in-order task is the current task to be run according to the state of the read-write lock.
Further, the scheduler 603 determining whether the first-in-order task is the current task to be run according to the state of the read-write lock includes any one or a combination of the following:
when the read-write lock is unoccupied, determining that the first-in-order task is the current task to be run;
when the read-write lock is occupied, if the currently running task is a pure-read graph analysis task, determining that the first-in-order task is the current task to be run;
and when the read-write lock is occupied, if the currently running task is a non-pure-read graph analysis task or a graph update task, suspending the first-in-order task and checking the state of the read-write lock again in the next cycle.
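The three lock cases above reduce to one small decision function. This is a minimal sketch; the boolean task model is an illustrative assumption.

```python
def can_run_now(lock_occupied, running_task_is_pure_read_analysis):
    """Decide whether the first-in-order task may start this cycle."""
    if not lock_occupied:
        return True        # case 1: lock is free, run immediately
    if running_task_is_pure_read_analysis:
        return True        # case 2: pure reads can run concurrently
    return False           # case 3: writer active, wait for the next cycle
```

A pure-read analysis task never mutates the graph, which is why case 2 can safely admit a second task under an occupied lock.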
Further, after determining the running order of the tasks, the scheduler 603 may further determine whether the first-in-order task is a graph analysis task whose time and/or resource consumption exceeds a set threshold; if so, it splits the task into a plurality of sub-tasks, runs them at intervals, and after all sub-tasks finish, merges the graph analysis results to complete the first-in-order task.
Further, after starting a task, the scheduler 603 may also monitor whether the task runs overtime; if so, it suspends the task and restarts it in the next cycle.
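The split-run-merge step for oversized analysis tasks can be sketched as follows. Splitting by fixed-size vertex chunks and merging by dictionary union are assumed strategies chosen only for illustration.

```python
def run_large_analysis(vertices, analyze_fn, chunk_size):
    """Split a long analysis over `vertices` into chunks, run them one
    at a time, and merge the partial per-vertex results."""
    results = {}
    for i in range(0, len(vertices), chunk_size):
        chunk = vertices[i:i + chunk_size]
        # in the real scheduler the chunks would run at intervals,
        # yielding the lock between them; here they run back to back
        results.update(analyze_fn(chunk))
    return results
```

Running chunks at intervals lets update tasks interleave with a long analysis instead of being starved behind it.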
The read/write lock module 801 and the partition identifier 701 may each be separately combined with the modules of fig. 6.
An embodiment of the present application further provides a graph data processing system, as shown in fig. 9, including:
the service interface layer comprises a graph updating interface and a graph analysis interface, wherein the graph updating interface is used for receiving a graph updating task and writing the graph updating task into a graph updating task queue; the graph analysis interface is used for receiving graph analysis tasks and writing the graph analysis tasks into a graph analysis task queue;
a task scheduling layer including the graph data processing apparatus;
the graph calculation engine is used for carrying out a graph updating operation and/or a graph analysis operation of the task;
and the graph storage engine is used for storing the graph.
In a specific implementation, the update interface in the service interface layer is a business-level interface carrying the corresponding business semantics. Its basic design rule is to convert the non-graph data model of the business into a standard graph data model, and to normalize all interface rules through Vertex, Edge, Relationship and Property.
The analysis interface in the service interface layer receives business-driven analysis tasks, timed tasks, or analysis tasks that depend on updated data objects. This type of interface generally accepts two kinds of arguments: the analysis source and the analysis rule index.
Further, the graph storage engine comprises a cache region and a disk;
After the graph calculation engine runs a graph update task, the memory-mapped object is stored into the cache region and the disk simultaneously. When a graph analysis task runs, data is fetched from the cache region first; if the data required by the task is not in the cache region, it is fetched from the disk.
Further, when the graph storage engine stores the memory-mapped object into the cache region, the object is processed and divided into a Delta update object and a hot object.
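The write-through and read-path behaviour described above can be sketched with two dicts standing in for the cache region and the disk. The class and field names are illustrative assumptions; the Delta/hot split is omitted for brevity.

```python
class GraphStorageEngine:
    """Toy two-tier store: write-through cache plus disk fallback."""

    def __init__(self):
        self.cache, self.disk = {}, {}

    def write(self, key, obj):
        # graph update task: store the memory-mapped object in both tiers
        self.cache[key] = obj
        self.disk[key] = obj

    def read(self, key):
        # graph analysis task: try the cache, fall back to disk,
        # and repopulate the cache on a miss
        if key in self.cache:
            return self.cache[key]
        obj = self.disk.get(key)
        if obj is not None:
            self.cache[key] = obj
        return obj
```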
Further, the graph storage engine includes:
the graph data characteristic analyzer is used for determining the graph to be a sparse graph or a dense graph according to the data characteristics of the graph;
the graph calculation characteristic analyzer is used for determining whether the graph is dominant at the vertex or dominant at the edge according to the calculation characteristics of the graph;
and the graph storage and segmentation manager is used for determining a segmentation algorithm of the graph according to the data characteristics and the calculation characteristics of the graph and carrying out segmentation storage on the data of the graph.
Further, as shown in fig. 10, the system may further include a monitoring core configured to collect the resource load of the graph computation engine and the graph storage engine in real time, convert the monitoring information into measurable graph computation scheduling evaluation factors in real time, and provide them to the scheduler 603 of the task scheduling layer;
the scheduler 603 also evaluates task assignment according to the graph computation scheduling evaluation factors.
In a specific implementation, the graph computation scheduling evaluation factors may include:
The number of newly added/updated tasks and analysis tasks on the graph (within one minute)
The number of running graph update tasks and graph analysis tasks
The number of update tasks and analysis tasks queued in the task queues
The number of partitions of the graph and the physical cutting status
The number of vertices and edges of the whole graph
The current read-write lock status of the system
The number of newly added edges and vertices cached in the storage engine
The number of edges and vertices to be merged in the storage engine
The number of deleted and modified edges and vertices in the storage engine
The number of sub-blocks to be split in the storage engine
The memory and IO overhead of the compute engine
The Memory Size, Cache Size and Disk file Size of the storage engine
According to these factors, the scheduler 603 can distribute the computation tasks more efficiently, maximizing the concurrency of parallel and multi-graph services within a single machine.
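As an illustration, a few of the factors listed above could be collected into a record and folded into a single load score for the scheduler to compare across machines. The field selection and the weights below are arbitrary assumptions, not specified by the text.

```python
from dataclasses import dataclass

@dataclass
class ScheduleFactors:
    """A subset of the scheduling evaluation factors, as a record."""
    running_tasks: int      # running update + analysis tasks
    queued_tasks: int       # tasks waiting in the two queues
    cache_size_mb: float    # storage engine Cache Size
    disk_size_mb: float     # storage engine Disk file Size

    def load_score(self):
        # higher means busier; the weights here are illustrative only
        return (2.0 * self.running_tasks
                + 1.0 * self.queued_tasks
                + 0.01 * self.cache_size_mb)
```

The scheduler would then prefer the machine (or the moment in time) with the lowest score when assigning the next task.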
In addition, the monitoring core can also provide the scheduling evaluation factors to a graph monitoring display system to display the running condition of the system.
For convenience of description, each part of the above-described apparatus is separately described as being functionally divided into various modules or units. Of course, the functionality of the various modules or units may be implemented in the same one or more pieces of software or hardware when implementing the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.