CN111400555B

CN111400555B - Graph data query task processing method and device, computer equipment and storage medium

Info

Publication number: CN111400555B
Application number: CN202010147602.5A
Authority: CN
Inventors: 李肯立; 翁同峰; 周旭; 廖清; 彭鹏; 林培英; 罗文晟; 李克勤
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2020-03-05
Filing date: 2020-03-05
Publication date: 2023-09-26
Anticipated expiration: 2040-03-05
Also published as: CN111400555A

Abstract

The application relates to a graph data query task processing method, a graph data query task processing device, computer equipment and a storage medium. The method comprises the following steps: obtaining a graph data set and a graph data query task set, wherein the graph data set is a set of graph data structures which are stored in a plurality of machine nodes in a distributed mode, identifying graph data types of the graph data set, calculating similarity or difference between query tasks in the graph data query task set, dividing the graph data query task set into a plurality of graph data query task subsets based on the similarity or difference between query tasks, and inputting the graph data query task subsets into a preset distributed graph query system to obtain corresponding query results. The method and the device solve the problem of load balancing when batch tasks are processed in a distributed system, optimize the problem of low efficiency of serial execution of query tasks and the problem of low utilization rate of parallel resources, improve the execution efficiency of the query tasks, and relieve the problem of real-time query requirements.

Description

Graph data query task processing method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a graph data query task processing method, a device, a computer device, and a storage medium.

Background

With the rapid development of internet technology and electronic informatization, the volume of information in every field has grown explosively, and the demand for data transmission and data processing keeps increasing. As a data storage structure, a graph can effectively express many kinds of information. Many graph-related tasks, such as real-time public opinion detection and product recommendation, have increasingly strict real-time requirements, yet hard-disk-based computing modes have low I/O (Input/Output) speed, which degrades user experience. Optimization therefore needs to exploit the high-speed I/O of memory, but the memory of a single machine cannot hold large-scale graph data with a complex structure, which gave rise to the distributed graph computing mode.

However, current distributed graph computing systems are generally designed for large-scale graph computing tasks. Lightweight graph data query tasks (such as batch personalized graph query tasks) are typically also input into the system and executed in a pipelined manner, where execution is still affected by the differences between tasks, so the execution efficiency of graph data query tasks is low.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a graph data query task processing method, apparatus, computer device, and storage medium that can improve the execution efficiency of the graph data query task.

A graph data query task processing method, the method comprising:

acquiring a graph data set and a graph data query task set, wherein the graph data set is a set of graph data structures which are stored in a plurality of machine nodes in a distributed manner;

identifying a graph data type of the graph dataset;

according to the graph data type, calculating the similarity or difference between every two query tasks in the graph data query task set;

dividing the graph data query task set into a plurality of graph data query task subsets based on similarity or difference between every two query tasks;

and inputting the graph data query task subset into a preset distributed graph query system to obtain a corresponding query result.

In one embodiment, the graph data types include dense graphs and sparse graphs;

according to the graph data type, calculating the similarity or the difference between every two query tasks in the graph data query task set comprises:

when the graph data type is a dense graph, calculating the similarity between every two query tasks in the graph data query task set;

and when the graph data type is a sparse graph, calculating the difference degree between every two query tasks in the graph data query task set.

In one embodiment, calculating the similarity between query tasks in the graph data query task set includes:

acquiring the query points corresponding to every two query tasks in the graph data query task set and the adjacent nodes of the query points;

and calculating the similarity between every two query tasks based on the query points and the adjacent nodes of each query point.

In one embodiment, calculating the degree of difference between query tasks in the graph data query task set comprises:

acquiring the query points corresponding to every two query tasks in the graph data query task set;

selecting a query point embedding vector corresponding to the query point from a preset vertex embedding vector set;

and calculating the difference degree between every two query tasks based on the selected query point embedding vectors.

In one embodiment, before selecting the query point embedding vector corresponding to the query point from the preset vertex embedding vector set, the method further includes:

screening a preset number of global pivot points from the graph data set;

calculating the distance from each vertex to each global pivot point in the graph data set to obtain a vertex embedding vector;

and constructing a vertex embedding vector set based on the vertex embedding vectors.

In one embodiment, screening the preset number of global pivot points from the graph data set includes:

obtaining target vertexes stored in a graph data structure of each machine node to obtain a distributed vertex set, wherein the target vertexes are the vertexes within a preset number of top ranks when the vertexes in the graph data structure are ordered by degree;

screening the distributed vertex set to obtain a global vertex set;

calculating the shortest distance between every two vertexes in the global vertex set;

based on the shortest distance, a preset number of global pivot points are screened out.

In one embodiment, dividing the graph data query task set into a plurality of graph data query task subsets based on similarity or difference between query tasks comprises:

inputting the graph data query task set into a preset submodular model;

based on the similarity or the difference between every two query tasks, the graph data query task set is divided into a plurality of graph data query task subsets through a greedy algorithm.

In one embodiment, before inputting the subset of graph data query tasks into a preset distributed graph query system to obtain the corresponding query result, the method further includes:

constructing a distributed bottom layer communication platform;

and constructing a vertex-centric distributed graph query system based on the distributed bottom layer communication platform.

In one embodiment, obtaining the graph data query task set includes:

and when the graph query system is detected to be in an idle state, scanning the task file at the specified hard disk path to acquire the graph data query tasks.

A graph data query task processing device, the device comprising:

the data acquisition module is used for acquiring a graph data set and a graph data query task set, wherein the graph data set is a set of graph data structures which are stored in a plurality of machine nodes in a distributed manner;

a data identification module for identifying a graph data type of the graph dataset;

the data analysis module is used for calculating the similarity or the difference between every two query tasks in the graph data query task set according to the graph data type;

the data dividing module is used for dividing the graph data query task set into a plurality of graph data query task subsets based on the similarity or the difference between the query tasks;

and the data query module is used for inputting the graph data query task subset into a preset distributed graph query system to obtain a corresponding query result.

A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:

identifying a graph data type of the graph dataset;

A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:

identifying a graph data type of the graph dataset;

The graph data query task processing method, the device, the computer equipment and the storage medium define different relationship indexes for measuring tasks according to different graph data types of the distributed graph data set, divide the graph data query task set into a plurality of graph data query task subsets by taking similarity and difference among the tasks as indexes, and input the task subsets into the distributed graph query system so that the task subsets can be executed in parallel to obtain query results. According to the scheme, the difference and the similarity among the tasks are considered, the problem of load balancing when batch tasks are processed in the distributed system is solved in a dynamic combination optimization mode, meanwhile, the problem of low efficiency of serial execution of query tasks and the problem of low utilization rate of parallel resources are optimized, the execution efficiency of the query tasks is improved, and the problem of real-time query requirements is relieved.

Drawings

FIG. 1 is an application environment diagram of a graph data query task processing method in one embodiment;

FIG. 2 is a flow chart of a graph data query task processing method in one embodiment;

FIG. 3 is a detailed flow chart of the graph data query task processing method in one embodiment;

FIG. 4 is a flow chart of the step of calculating the difference degree between tasks in another embodiment;

FIG. 5 is a flow chart of the global pivot point screening step in another embodiment;

FIG. 6 is a block diagram of a graph data query task processing device in one embodiment;

FIG. 7 is a block diagram of a graph data query task processing device in one embodiment;

FIG. 8 is an internal block diagram of a computer device in one embodiment;

FIG. 9 is an architectural diagram of a distributed cluster in one embodiment;

FIG. 10 is a block diagram of a distributed graph query system.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

The graph data query task processing method provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 (which may be implemented by a distributed cluster of multiple servers) via a network. The user operates the terminal 102 to send a graph data query task processing instruction to the server 104. In response to the instruction, the server 104 acquires a graph data set and a graph data query task set, wherein the graph data set is a set of graph data structures stored in a distributed manner in a plurality of machine nodes; identifies the graph data type of the graph data set; calculates the similarity or difference degree between every two query tasks in the graph data query task set according to the graph data type; divides the graph data query task set into a plurality of graph data query task subsets based on the similarity or difference degree between every two query tasks; and inputs the graph data query task subsets into a preset distributed graph query system to obtain the corresponding query results. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device, and the server 104 may be implemented by a stand-alone server or a server cluster composed of a plurality of servers.

In one embodiment, as shown in fig. 2, a graph data query task processing method is provided, and the method is applied to the server 104 in fig. 1 for illustration, and includes the following steps:

step 100, obtaining a graph data set and a graph data query task set, wherein the graph data set is a set of graph data structures stored in a plurality of machine nodes in a distributed manner.

A graph dataset is a collection of graph data structures. A graph is a data structure that consists of vertices (also called nodes) and edges (also called links). A node may have zero or more adjacent elements, and the connection between two nodes is called an edge. For example, a graph may represent a social network in which each person is a vertex and people who know each other are connected by edges. In this embodiment, the graph data set is a set of graph data stored in a distributed manner across a plurality of machine nodes of a distributed cluster. In practical applications, the graph data set may also be called original graph data, and the graph data query task set may be a batch personalized graph query task set. The method may be implemented on a distributed cluster (such as a supercomputing center); the server is pre-configured with a graph query system and acquires the graph data set and the graph data query task set upon receiving a graph data query task processing instruction.

Step 200, a graph data type of a graph dataset is identified.

The graph data types may include sparse graphs and dense graphs. After the graph data set and the graph data query task set are acquired, the graph data query task set needs to be divided well so that it can be reasonably distributed to the distributed machine nodes. Specifically, the relationship between tasks can be defined by identifying the graph data type of the graph data set, and the graph data query task set is then reasonably distributed according to the relationship between tasks.

Step 300, according to the graph data type, calculating the similarity or difference between every two query tasks in the graph data query task set.

In a specific implementation, either the similarity or the difference degree between every two query tasks can be selected as the index for measuring the relationship between tasks, depending on the graph data type. In another embodiment, step 300 may include: step 320, when the graph data type is a dense graph, calculating the similarity between every two query tasks in the graph data query task set; step 340, when the graph data type is a sparse graph, calculating the difference degree between every two query tasks in the graph data query task set. Specifically, graphs are further divided into directed graphs and undirected graphs, and the degree of a vertex in a graph is the number of edges associated with the vertex. In a directed graph, degree is further divided into in-degree and out-degree: the in-degree of a vertex is the number of arcs ending at the vertex, and the out-degree of a vertex is the number of arcs starting from the vertex. Whether similarity or difference degree is used as the measure of the relationship between tasks may be determined from the graph data type as follows:

(1) If a graph $G = (V, E)$ has $n$ nodes and the out-degree of each node is a fraction $f$ of $n$, where $0 < f \le 1$, then a graph satisfying the condition $E = f \times V^2$ is called a dense graph. In a dense graph, the diameter of the graph is usually small and the distances between the query points corresponding to different tasks are small, so the similarity better measures the relationship between tasks;

(2) If a graph $G = (V, E)$ has $n$ nodes and the out-degree of each node is a fixed constant $k$, then $E = kV = O(V)$, and a graph satisfying the condition $E = O(V)$ is called a sparse graph. In a sparse graph, tasks are random, so the query points corresponding to the tasks are also random and in most cases have no common neighbor nodes; the difference degree is therefore used to measure the relationship between tasks, as it better quantifies how distinct different tasks are.
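The two conditions above can be illustrated with a minimal sketch that classifies a graph from its vertex and edge counts. The text does not state how the graph data type is actually identified, so the function name `classify_graph` and the cut-off `dense_fraction` are assumptions chosen only for illustration.

```python
def classify_graph(num_vertices, num_edges, dense_fraction=0.1):
    """Classify a graph as 'dense' or 'sparse' from the ratio E / V^2.

    dense_fraction is a hypothetical threshold on f in E = f * V^2;
    the source text does not specify a concrete value.
    """
    if num_vertices <= 0:
        raise ValueError("graph must have at least one vertex")
    fraction = num_edges / (num_vertices ** 2)
    return "dense" if fraction >= dense_fraction else "sparse"


# Each of 100 vertices has out-degree 50, so E = 0.5 * V^2.
dense_kind = classify_graph(num_vertices=100, num_edges=5000)
# Each of 100 vertices has constant out-degree k = 3, so E = 3 * V.
sparse_kind = classify_graph(num_vertices=100, num_edges=300)
```

In this sketch the result would then select the relationship index: similarity for a dense graph, difference degree for a sparse graph.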

Step 400, dividing the graph data query task set into a plurality of graph data query task subsets based on the similarity or the difference between every two query tasks.

When the graph data type of the graph data set is identified as a dense graph, the graph data query task set is divided into a plurality of graph data query task subsets using the similarity between tasks as the index for measuring the relationship between tasks; when the graph data type is identified as a sparse graph, the division uses the difference degree between tasks as the index. The graph data query task subsets are then distributed to the corresponding machine nodes. Specifically, the division of the graph data query task set can be completed by a greedy algorithm.

Step 500, inputting the graph data query task subset into a preset distributed graph query system to obtain a corresponding query result.

After obtaining the multiple graph data query task subsets, the graph data query task subsets may be input into a pre-built distributed query system (the graph query system is built on a distributed cluster), and corresponding query results are obtained through the distributed graph query system.

In the graph data query task processing method, different relationship indexes for measuring tasks are defined for different graph types of the distributed graph data set, the graph data query task set is divided into a plurality of graph data query task subsets by taking similarity and difference among the tasks as indexes, and then the task subsets are input into a distributed graph query system so that the task subsets can be executed in parallel to obtain query results. According to the scheme, the difference and the similarity among the tasks are considered, the problem of load balancing when batch tasks are processed in the distributed system is solved in a dynamic combination optimization mode, meanwhile, the problem of low efficiency of serial execution of query tasks and the problem of low utilization rate of parallel resources are optimized, the execution efficiency of the query tasks is improved, and the problem of real-time query requirements is relieved.

In one embodiment, as shown in FIG. 3, before step 500, the method further includes: step 050, constructing a distributed bottom layer communication platform, and constructing a vertex-centric distributed graph query system based on the distributed bottom layer communication platform.

In practical applications, the graph query system may be built on a distributed cluster architecture as shown in FIG. 9. The distributed cluster may be a local supercomputing center architecture, which has its own high-speed interconnect and provides an MPI (Message Passing Interface) programming environment; a user may also use other versions of MPI, or install and deploy a programming environment independently. A program compiled with a self-deployed MPI can likewise run tasks over the virtual Ethernet of the high-speed interconnect, but its performance is much lower than with the MPI supplied with the cluster. Therefore, in this embodiment, a programming language (e.g., C++) may be used to build an MPI-based distributed bottom layer communication platform that conforms to the high-speed interconnect communication architecture, and a vertex-centric distributed graph query system (e.g., with the structure shown in FIG. 10) may be built on that platform. In the vertex-centric computing mode, the vertices of the graph are the computation carriers and the edges are the message transmission carriers, and together they complete the computation and communication of a graph data query task (such as a personalized graph query task). After a vertex finishes its computation, it sends messages to its adjacent nodes, which completes one round of iteration; this process is called a "superstep". It will be appreciated that in other embodiments, the graph query system may be built on other distributed clusters, or the distributed bottom layer communication platform may be built using other programming languages, which is not limited herein.
In this embodiment, constructing a vertex-centric graph query system allows the computation and communication of graph data query tasks to be completed quickly, which improves task execution efficiency.
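The vertex-centric superstep loop described above can be sketched in a single process. This is only an illustration of the computing mode, not the MPI-based system of the embodiment: the function `run_supersteps`, the in-memory `inbox`/`outbox` dictionaries, and the choice of a hop-distance query are all assumptions made for the example.

```python
def run_supersteps(adjacency, source):
    """Compute hop distances from `source` in vertex-centric supersteps.

    Vertices are the computation carriers; messages travel along edges.
    Each pass of the while loop is one superstep.
    """
    distance = {vertex: None for vertex in adjacency}
    inbox = {source: [0]}  # messages delivered at the start of a superstep
    supersteps = 0
    while inbox:
        outbox = {}
        for vertex, messages in inbox.items():
            candidate = min(messages)
            # A vertex only computes and sends when it learns a shorter distance.
            if distance[vertex] is None or candidate < distance[vertex]:
                distance[vertex] = candidate
                for neighbor in adjacency[vertex]:
                    outbox.setdefault(neighbor, []).append(candidate + 1)
        inbox = outbox
        supersteps += 1
    return distance, supersteps


# Undirected path graph 0 - 1 - 2 - 3.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
hop_distance, superstep_count = run_supersteps(path_graph, source=0)
```

In a distributed deployment the `outbox` of each machine node would be exchanged over the communication platform between supersteps; here the exchange is a simple dictionary hand-over.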

In one embodiment, obtaining the graph data query task set includes: when the graph query system is detected to be in an idle state, scanning the task file at the specified hard disk path to acquire the graph data query tasks.

In a specific implementation, the graph data query task set may be acquired by detecting the state of the graph query system and, when the graph query system is in an idle state, scanning the task file at the specified hard disk path to acquire the graph data query tasks. In this embodiment, acquiring graph data query tasks only when the graph query system is idle effectively saves system resources.
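A minimal sketch of this acquisition step follows. The text does not describe the task file format or how the idle state is detected, so the `.txt` extension, the one-query-point-per-line layout, the boolean `system_is_idle` flag, and the function name `fetch_query_tasks` are all assumptions for illustration.

```python
import tempfile
from pathlib import Path


def fetch_query_tasks(task_dir, system_is_idle):
    """Return query tasks read from task files, or [] when the system is busy.

    Assumes (hypothetically) one query point per non-empty line in *.txt files.
    """
    if not system_is_idle:
        return []
    tasks = []
    for task_file in sorted(Path(task_dir).glob("*.txt")):
        for line in task_file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:
                tasks.append(line)
    return tasks


# Demonstration with a temporary directory standing in for the hard disk path.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "batch.txt").write_text("17\n42\n\n", encoding="utf-8")
    idle_tasks = fetch_query_tasks(tmp, system_is_idle=True)
    busy_tasks = fetch_query_tasks(tmp, system_is_idle=False)
```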

In one embodiment, step 320 includes: and acquiring query points corresponding to the query tasks in the graph data query task set and adjacent nodes of the query points, and calculating the similarity between the query tasks based on the query points and the adjacent nodes of the query points.

Generally, each query task corresponds to one query point, and each query point may have adjacent nodes (hereinafter referred to as neighbor nodes). When calculating the similarity of query tasks, the number of common neighbor nodes of the query points corresponding to the two query tasks and the number of neighbor nodes of each query point can be counted, and the similarity between the two query tasks is calculated from these counts. Specifically, it can be calculated in the following manner:

where $v_1$ and $v_2$ denote query point $v_1$ and query point $v_2$ respectively, $share(v_1, v_2)$ denotes the common neighbor nodes of $v_1$ and $v_2$, $Neighbor(v_1)$ denotes the neighbor nodes of $v_1$, $Neighbor(v_2)$ denotes the neighbor nodes of $v_2$, and $W$ denotes the weight coefficient. It should be understood that in other embodiments the similarity between tasks may be calculated in other ways, for example with a distance algorithm or a cosine similarity algorithm, which is not limited herein. In this embodiment, the similarity between every two tasks is calculated from the number of common neighbor nodes of the query points and the neighbor nodes of the query points, which suits the calculation of similarity between two vertices in a dense graph and is more representative.
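The similarity formula itself is not reproduced in the text, so the following is only a sketch under an assumed form: the weight coefficient multiplied by the number of common neighbor nodes, divided by the total number of neighbor nodes of the two query points. The function name `neighbor_similarity` and this particular ratio are assumptions, not the formula of the embodiment.

```python
def neighbor_similarity(adjacency, v1, v2, weight=1.0):
    """Similarity of two query points from their common neighbor nodes.

    Assumed form: weight * |share(v1, v2)| / (|Neighbor(v1)| + |Neighbor(v2)|).
    """
    neighbors_1 = adjacency.get(v1, set())
    neighbors_2 = adjacency.get(v2, set())
    total = len(neighbors_1) + len(neighbors_2)
    if total == 0:
        return 0.0  # isolated query points share nothing
    shared = len(neighbors_1 & neighbors_2)
    return weight * shared / total


# Query points "a" and "b" share the neighbor nodes "d" and "e".
adjacency = {
    "a": {"c", "d", "e"},
    "b": {"d", "e", "f"},
    "c": {"a"},
    "d": {"a", "b"},
    "e": {"a", "b"},
    "f": {"b"},
}
sim_ab = neighbor_similarity(adjacency, "a", "b")
sim_cf = neighbor_similarity(adjacency, "c", "f")
```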

In one embodiment, as shown in FIG. 4, step 340 includes:

step 342, obtaining query points corresponding to query tasks in the graph data query task set;

step 344, selecting a query point embedding vector corresponding to the query point from the preset vertex embedding vector set;

step 346, calculate the difference between every two inquiry tasks based on the selected inquiry point embedded vector.

A vertex embedding vector (vertex embedding) is, in this embodiment, the vector formed by the distances between a given vertex and a preset number of global pivot points. The query point embedding vector is the vertex embedding vector corresponding to the query point. Because tasks on a sparse graph are random, the corresponding query points are also random and most query points have no common neighbor nodes, so the difference degree can be used to measure the relationship between tasks. In a specific implementation, the difference degree can be calculated with a distance formula, with the distance between two points representing the difference. In this embodiment, the query points of every two query tasks may be obtained first, the query point embedding vectors corresponding to the query points are then selected from the pre-constructed vertex embedding vector set, and the difference degree between every two query tasks is calculated from the query point embedding vectors. Specifically, the difference degree Dis of the query point embedding vectors can be calculated with the Euclidean distance formula. The difference degree may be calculated as follows:

$v_1$: $embedding(v_{1\text{-}0}, v_{1\text{-}1}, \ldots, v_{1\text{-}9})$

$v_2$: $embedding(v_{2\text{-}0}, v_{2\text{-}1}, \ldots, v_{2\text{-}9})$

$$Dis(v_1, v_2) = \sqrt{\sum_{i=0}^{9} \left(v_{1\text{-}i} - v_{2\text{-}i}\right)^2}$$

where $v_1$ and $v_2$ denote query point embedding vector $v_1$ and query point embedding vector $v_2$ respectively. In this embodiment, calculating the difference degree with the Euclidean distance is better suited to distance calculation between two points of a sparse graph.
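The Euclidean difference degree between two query point embedding vectors can be sketched as follows. The function name `difference_degree` and the sample vectors are illustrative assumptions; only the use of the Euclidean distance comes from the text.

```python
import math


def difference_degree(embedding_1, embedding_2):
    """Euclidean distance between two query point embedding vectors."""
    if len(embedding_1) != len(embedding_2):
        raise ValueError("embedding vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(embedding_1, embedding_2)))


# Two length-10 embeddings (distances to 10 global pivot points) that
# differ only in their first two components, by 3 and 4 respectively.
v1 = [3, 0, 2, 5, 1, 4, 2, 2, 6, 1]
v2 = [0, 4, 2, 5, 1, 4, 2, 2, 6, 1]
dis = difference_degree(v1, v2)
```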

In another embodiment, prior to step 344, further comprising:

step 330, screening out a preset number of global pivot points from the graph data set;

step 332, calculating the distance from each vertex to each global pivot point in the graph data set to obtain a vertex embedding vector, and constructing a vertex embedding vector set based on the vertex embedding vector.

Global pivot points refer to representative vertices screened from a graph data structure (which may be collectively referred to as a graph dataset) stored by distributed machine nodes. In this embodiment, 10 global pivot points may be screened from the graph dataset. It will be appreciated that in other embodiments, the number of global pivot points may be 11, 15, etc., as the case may be, and is not limited herein. Each vertex embedding vector includes the distance from the vertex to each global pivot point, and takes 10 global pivot points (numbered from 0 to 9) as an example, the vertex embedding vector is a vector with the length of 10. And calculating the distance from each vertex to each global pivot point in the graph data set to obtain a vertex embedding vector, and then summarizing the vertex embedding vectors to construct a vertex embedding vector set. It will be appreciated that each query point can find a corresponding query point embedding vector in the vertex embedding vector set.
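Constructing the vertex embedding vector set can be sketched with one breadth-first search per global pivot point. This sketch assumes an unweighted, undirected graph held in a single process, so the distance from a pivot to a vertex equals the distance from the vertex to the pivot; the function names and the `unreachable` placeholder value are assumptions for illustration.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def build_vertex_embeddings(adjacency, pivots, unreachable=float("inf")):
    """Map each vertex to its vector of distances to the pivots, in pivot order."""
    per_pivot = [bfs_distances(adjacency, pivot) for pivot in pivots]
    return {
        vertex: [dist.get(vertex, unreachable) for dist in per_pivot]
        for vertex in adjacency
    }


# Undirected path graph 0 - 1 - 2 - 3 - 4 with two pivots at the ends.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
embeddings = build_vertex_embeddings(path_graph, pivots=[0, 4])
```

With 10 global pivot points, as in the embodiment, each entry of `embeddings` would be a vector of length 10, and the query point embedding vector is simply `embeddings[query_point]`.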

In another embodiment, as shown in FIG. 5, step 330 includes:

step 332, obtaining target vertices stored in the graph data structure of each machine node, to obtain a distributed vertex set, where the target vertices are vertices of preset ranks arranged according to the degree of the vertices in the graph data structure;

step 334, screening the distributed vertex set to obtain a global vertex set;

step 336, calculating the shortest distance between every two vertices in the global vertex set;

step 338, screening out a preset number of global pivot points based on the shortest distance.

In a specific implementation, the global pivot points may be selected as follows. For each distributed machine node, the target vertices stored in the graph data structure of that machine node are obtained; in this embodiment, the target vertices are the 20 vertices of highest degree among all vertices in the graph data structure, that is, the first 20 vertices when ordered by degree. The top 20 vertices from the graph data structure of each machine node are gathered into a distributed vertex set, and the 20 vertices of highest degree are then screened from the distributed vertex set in the same way to construct the global vertex set. Further, the shortest distance between every two of these 20 globally highest-degree vertices is calculated, and the 10 vertices with the largest shortest distances are selected as the global pivot points. It will be appreciated that in other embodiments the top 10 or the top 15 vertices may be selected instead, which is not limited herein. In a distributed environment a single machine node has no knowledge of the vertices on other machine nodes, which makes screening global pivot points inconvenient; the selection method of this embodiment is therefore better suited to screening vertices of a distributed graph data structure.

In one embodiment, as shown in FIG. 3, step 400 includes: step 420, inputting the graph data query task set into a preset submodular model, and dividing the graph data query task set into a plurality of graph data query task subsets through a greedy algorithm based on the similarity or the difference degree between every two query tasks.
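The pivot screening procedure can be sketched in a single process as below. The text does not say exactly how the vertices "with the largest shortest distance" are picked, so this sketch assumes a greedy farthest-point rule: start from the highest-ranked candidate and repeatedly add the candidate whose minimum distance to the pivots chosen so far is largest. The function names, the tie-breaking by vertex id, and the in-memory `partitions` list standing in for machine nodes are all assumptions.

```python
import heapq
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def top_by_degree(vertices, degree, k):
    """The k vertices of highest degree (ties broken by larger vertex id)."""
    return heapq.nlargest(k, vertices, key=lambda v: (degree[v], v))


def select_global_pivots(partitions, adjacency, top_k=20, num_pivots=10):
    degree = {v: len(neighbors) for v, neighbors in adjacency.items()}
    # Each machine node reports its local top-k vertices by degree.
    distributed = [v for part in partitions for v in top_by_degree(part, degree, top_k)]
    # The distributed vertex set is screened again to get the global vertex set.
    candidates = top_by_degree(distributed, degree, top_k)
    # Shortest distances between every two vertices of the global vertex set.
    dist = {c: bfs_distances(adjacency, c) for c in candidates}
    # Assumed rule: greedy farthest-point selection among the candidates.
    pivots = [candidates[0]]
    while len(pivots) < min(num_pivots, len(candidates)):
        remaining = [c for c in candidates if c not in pivots]
        best = max(
            remaining,
            key=lambda c: min(dist[p].get(c, float("inf")) for p in pivots),
        )
        pivots.append(best)
    return pivots


# Path graph 0 - 1 - ... - 5 split across two hypothetical machine nodes.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
pivots = select_global_pivots([[0, 1, 2], [3, 4, 5]], path_graph, top_k=4, num_pivots=2)
```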

In practical applications, the subset partitioning problem of the graph data query task set is modeled as a submodular problem. Submodular functions exhibit diminishing marginal returns: adding an element to a set that has few or no elements brings a large benefit, while adding a new element to a set that already has many elements brings little benefit. Specifically, based on the diminishing marginal returns of the submodular model and the relationship (difference degree or similarity) between tasks, a greedy algorithm can select, one at a time, the task best suited to the current subset as a member of that subset, and this continues until the graph data query task set has been fully allocated. Taking personalized queries as an example, the submodular modeling process for the graph data query tasks is as follows:

(a) Define the area S of a subgraph H of the graph dataset as the number of vertices in H, namely $S(H) = |H|$;

(b) Based on the locality of personalized query tasks, assume that for a single task q the region of the graph dataset initially involved is H';

(c) According to step (b), the area of the graph dataset involved in the query process of each personalized query task q is $S' = |H'|$;

(d) When a plurality of personalized query tasks are executed in parallel, as many machines as possible should participate in the computation within the same superstep, which alleviates computation load imbalance and communication load imbalance;

(e) To satisfy step (d), the regions of the graph dataset involved by tasks running in parallel at the same time should differ as much as possible. According to step (c), assume that the area of the graph dataset initially involved by a query task is S'. The greater the difference degree, or the smaller the similarity, between tasks, the smaller the common region of the graph dataset they involve during computation, that is, the lower the probability that the tasks in the set need the same nodes to participate in computation during execution. As a result, more machines participate in computation in each superstep of the parallel computation, communication is shared by more machines, and the load on each machine node is more balanced. The submodular model is constructed according to this modeling approach.

After the submodular model is constructed, given an initial personalized query task set Q = {q₁, q₂, ..., qₙ}, the steps for decomposing the task set Q into subsets according to the greedy algorithm may be:

1) Creating a subtask set C1, randomly selecting a task q from the set Q, adding the task q to the set C1, and deleting q from the set Q to obtain an updated set Q1;

2) Traversing the set Q1 and, for each task in Q1 (regarded as a point), calculating its correlation with every point in C1, summing the obtained correlations, and taking the sum as the increment of that task;

3) On the premise that the number of tasks in C1 does not exceed m (the number of CPU cores in the distributed cluster), if the correlation between tasks is the similarity, adding the task q with the smallest increment to C1; if the correlation between tasks is the difference, adding the task q with the largest increment to C1; and deleting q from Q1, so that the common area of the graph dataset involved by the tasks in the subset is as small as possible;

4) Returning to step 1) once C1 is saturated, and otherwise repeating steps 2) and 3), until no tasks remain in the set Q.
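Steps 1) to 4) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `greedy_partition`, the `correlation` callback, and the `seed` parameter are assumptions introduced here, and the subset is treated as saturated once it holds m tasks.

```python
import random


def greedy_partition(tasks, correlation, m, use_similarity, seed=0):
    """Split tasks into subsets of at most m tasks with a greedy rule.

    correlation(a, b) returns the similarity (use_similarity=True) or the
    difference (use_similarity=False) between two tasks.
    """
    rng = random.Random(seed)
    remaining = list(tasks)
    subsets = []
    while remaining:
        # Step 1: open a new subset with a randomly chosen task.
        current = [remaining.pop(rng.randrange(len(remaining)))]
        # Steps 2-3: grow the subset until it is saturated or no task is left.
        while remaining and len(current) < m:
            # Increment of a task = sum of its correlations with the subset.
            increments = [sum(correlation(q, c) for c in current) for q in remaining]
            pick = min if use_similarity else max
            best = pick(range(len(remaining)), key=increments.__getitem__)
            current.append(remaining.pop(best))
        # Step 4: the subset is complete; start over with the remaining tasks.
        subsets.append(current)
    return subsets
```

For example, with a hypothetical similarity that is 1.0 for tasks sharing a region and 0.0 otherwise, calling `greedy_partition(["a1", "a2", "b1", "b2"], sim, m=2, use_similarity=True)` pairs each "a" task with a "b" task, so the tasks in a subset touch different regions.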

According to the method, the graph data query task set is divided into a plurality of query task subsets, and the query task subsets are input into the graph query system one by one, so that tasks in each subset can be executed in parallel with high efficiency, and the overall efficiency of the system is improved correspondingly.

It should be understood that, although the steps in the flowcharts of fig. 2-5 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2-5 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 6, there is provided a graph data query task processing device, including: a data acquisition module 610, a data identification module 620, a data analysis module 630, a data partitioning module 640, and a data query module 650, wherein:

the data acquisition module 610 is configured to acquire a graph data set and a graph data query task set, where the graph data set is a set of graph data structures that are stored in a distributed manner in a plurality of machine nodes.

A data identification module 620 for identifying a graph data type of the graph dataset.

The data analysis module 630 is configured to calculate a similarity or a difference between every two query tasks in the graph data query task set according to the graph data type.

The data dividing module 640 is configured to divide the graph data query task set into a plurality of graph data query task subsets based on the similarity or the difference between the query tasks.

The data query module 650 is configured to input the graph data query task subsets into a preset distributed graph query system, so as to obtain corresponding query results.

In one embodiment, the data analysis module 630 is further configured to calculate a similarity between two query tasks in the graph data query task set when the graph data type is a dense graph, and calculate a difference between two query tasks in the graph data query task set when the graph data type is a sparse graph.

In one embodiment, the data analysis module 630 is further configured to obtain query points corresponding to the query tasks in the graph data query task set and neighboring nodes of the query points, and calculate the similarity between the query tasks based on the query points and the neighboring nodes of the query points.
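As an illustration of a similarity computed from query points and their neighboring nodes, the sketch below uses the Jaccard coefficient of the two closed neighborhoods. The choice of the Jaccard coefficient is an assumption made for this example and is not stated in this passage; `closed_neighborhood` and `task_similarity` are hypothetical names, and the graph is assumed to be given as an adjacency dictionary.

```python
def closed_neighborhood(adjacency, vertex):
    """Return the query point together with its neighboring nodes."""
    return {vertex} | set(adjacency.get(vertex, ()))


def task_similarity(adjacency, query_a, query_b):
    """Jaccard overlap of the closed neighborhoods of two query points."""
    region_a = closed_neighborhood(adjacency, query_a)
    region_b = closed_neighborhood(adjacency, query_b)
    return len(region_a & region_b) / len(region_a | region_b)


# Hypothetical graph: a triangle {1, 2, 3} and a separate edge {4, 5}.
example_graph = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5], 5: [4]}
same_region = task_similarity(example_graph, 1, 2)
disjoint_regions = task_similarity(example_graph, 1, 4)
```

A value near 1 means the two tasks start from nearly the same region, and a value of 0 means their starting regions do not overlap.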

In one embodiment, the data analysis module 630 is further configured to obtain query points corresponding to query tasks in the graph data query task set, select a query point embedding vector corresponding to the query point from the preset vertex embedding vector set, and calculate a degree of difference between the query tasks based on the selected query point embedding vector.
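One way to turn query point embedding vectors into a degree of difference is sketched below. It assumes that each embedding holds the distances from a vertex to a fixed list of pivot points and uses the largest coordinate-wise gap, which by the triangle inequality never exceeds the true shortest distance between the two query points. The metric and the name `task_difference` are assumptions for illustration, not the formula of the application.

```python
def task_difference(embeddings, query_a, query_b):
    """Largest coordinate-wise gap between two query point embedding vectors."""
    vector_a = embeddings[query_a]
    vector_b = embeddings[query_b]
    return max(abs(x - y) for x, y in zip(vector_a, vector_b))


# Hypothetical embeddings: distances of each vertex to two pivot points.
example_embeddings = {"u": (0, 3), "v": (2, 1), "w": (0, 3)}
far_apart = task_difference(example_embeddings, "u", "v")
identical = task_difference(example_embeddings, "u", "w")
```

Because only the precomputed vectors are read, the difference between two tasks can be estimated without running a shortest-path search at query time.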

As shown in FIG. 7, in one embodiment, the apparatus further comprises a graph query system construction module 660, configured to construct a distributed underlying communication platform and to construct a vertex-centric distributed graph query system based on the distributed underlying communication platform.

In one embodiment, as shown in fig. 7, the apparatus further includes a vertex embedding vector set construction module 670, configured to filter out a preset number of global pivot points from the graph data set, calculate distances from each vertex to each global pivot point in the graph data set, obtain vertex embedding vectors, and construct a vertex embedding vector set based on the vertex embedding vectors.
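The construction of the vertex embedding vector set can be sketched on a single machine as follows. The sketch assumes an unweighted graph stored as an adjacency dictionary and measures distance in hops with breadth-first search; the distributed execution, the function names, and the `unreachable` placeholder are assumptions introduced for this example.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distance from source to every reachable vertex."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


def build_vertex_embeddings(adjacency, pivots, unreachable=float("inf")):
    """Map each vertex to the tuple of its distances to the global pivot points."""
    per_pivot = [bfs_distances(adjacency, pivot) for pivot in pivots]
    return {
        vertex: tuple(dist.get(vertex, unreachable) for dist in per_pivot)
        for vertex in adjacency
    }


# Hypothetical path graph 1 - 2 - 3 - 4 with pivots at both ends.
path_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
path_embeddings = build_vertex_embeddings(path_graph, pivots=[1, 4])
```

Each vertex is stored with one coordinate per pivot, so the size of the embedding set grows with the number of vertices times the preset number of pivot points.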

In one embodiment, the vertex embedding vector set construction module 670 is further configured to obtain target vertices stored in the graph data structure of each machine node to obtain a distributed vertex set, wherein the target vertices are the vertices within a preset rank when the vertices in the graph data structure are ordered by degree, screen the distributed vertex set to obtain a global vertex set, calculate the shortest distance between every two vertices in the global vertex set, and screen out a preset number of global pivot points based on the shortest distances.
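The pivot screening flow can be sketched as below. Several details are assumptions because this passage does not fix them: the global vertex set is taken as the highest-degree vertices of the merged candidates, distances are hop counts in an unweighted graph, and the final pivots are chosen by a farthest-point rule so that they lie far from one another. All function and parameter names are hypothetical.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first hop distances from source."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


def by_degree(adjacency, vertices):
    """Sort vertices by descending degree, breaking ties by vertex id."""
    return sorted(vertices, key=lambda v: (-len(adjacency[v]), v))


def select_pivots(adjacency, partitions, per_machine, global_size, num_pivots):
    # Each machine node contributes its top-ranked vertices by degree.
    distributed = [v for part in partitions
                   for v in by_degree(adjacency, part)[:per_machine]]
    # Screen the distributed set down to a global vertex set (assumed rule).
    global_set = by_degree(adjacency, set(distributed))[:global_size]
    # Shortest distances between every two vertices of the global set.
    dist = {v: hop_distances(adjacency, v) for v in global_set}
    # Farthest-point selection (assumed rule) of the preset number of pivots.
    pivots = [global_set[0]]
    while len(pivots) < min(num_pivots, len(global_set)):
        candidates = [v for v in global_set if v not in pivots]
        pivots.append(max(
            candidates,
            key=lambda v: min(dist[p].get(v, float("inf")) for p in pivots)))
    return pivots


# Hypothetical graph: two hubs (0 and 5) joined by the path 0 - 3 - 4 - 5.
hub_graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3, 5],
             5: [4, 6, 7, 8], 6: [5], 7: [5], 8: [5]}
chosen = select_pivots(hub_graph, [[0, 1, 2, 3, 4], [5, 6, 7, 8]],
                       per_machine=2, global_size=3, num_pivots=2)
```

In this example the two hubs are selected, since they have the highest degrees and are the farthest apart among the candidates.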

In one embodiment, the data dividing module 640 is further configured to input the graph data query task set to a preset submodular model, and divide the graph data query task set into a plurality of graph data query task subsets by a greedy algorithm based on the similarity or the difference between every two query tasks.

In one embodiment, the data obtaining module 610 is further configured to scan the specified hard disk path file to obtain the graph data query task when the graph query system is detected to be in an idle state.
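The idle-time scan of a specified hard disk path can be sketched as follows. The file layout is an assumption: tasks are taken to be stored one per line in `.txt` files, and the idle check is passed in as a callable because the way the graph query system reports its state is not described here. The name `collect_pending_tasks` is hypothetical.

```python
import tempfile
from pathlib import Path


def collect_pending_tasks(task_dir, system_is_idle):
    """Read queued query tasks from task_dir, but only while the system is idle."""
    if not system_is_idle():
        return []
    tasks = []
    for path in sorted(Path(task_dir).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                tasks.append(line.strip())
    return tasks


# Demonstration with a temporary directory standing in for the specified path.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "batch1.txt").write_text("17\n42\n", encoding="utf-8")
    tasks_when_idle = collect_pending_tasks(demo_dir, lambda: True)
    tasks_when_busy = collect_pending_tasks(demo_dir, lambda: False)
```

Deferring the scan until the system is idle lets newly arrived tasks accumulate into a batch that can then be partitioned together.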

For specific limitations of the graph data query task processing device, reference may be made to the limitations of the graph data query task processing method above, which are not repeated here. The modules in the above graph data query task processing device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can call them and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing the graph dataset. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a graph data query task processing method.

It will be appreciated by those skilled in the art that the structure shown in FIG. 8 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program: obtaining a graph data set and a graph data query task set, wherein the graph data set is a set of graph data structures which are stored in a plurality of machine nodes in a distributed mode, identifying graph data types of the graph data set, calculating similarity or difference between query tasks in the graph data query task set, dividing the graph data query task set into a plurality of graph data query task subsets based on the similarity or difference between query tasks, and inputting the graph data query task subsets into a preset distributed graph query system to obtain corresponding query results.

In one embodiment, the processor when executing the computer program further performs the steps of: when the graph data type is a dense graph, calculating the similarity between every two query tasks in the graph data query task set; and when the graph data type is a sparse graph, calculating the difference degree between every two query tasks in the graph data query task set.

In one embodiment, the processor when executing the computer program further performs the steps of: acquiring query points corresponding to the query tasks in the graph data query task set and neighboring nodes of the query points, and calculating the similarity between every two query tasks based on the query points and the neighboring nodes of the query points.

In one embodiment, the processor when executing the computer program further performs the steps of: query points corresponding to the query tasks in the graph data query task set are obtained, query point embedded vectors corresponding to the query points are selected from a preset vertex embedded vector set, and the difference degree between the query tasks is calculated based on the selected query point embedded vectors.

In one embodiment, the processor when executing the computer program further performs the steps of: screening out a preset number of global pivot points from the graph data set, calculating the distance from each vertex in the graph data set to each global pivot point to obtain vertex embedding vectors, and constructing a vertex embedding vector set based on the vertex embedding vectors.

In one embodiment, the processor when executing the computer program further performs the steps of: obtaining target vertices stored in the graph data structure of each machine node to obtain a distributed vertex set, wherein the target vertices are the vertices within a preset rank when the vertices in the graph data structure are ordered by degree, screening the distributed vertex set to obtain a global vertex set, calculating the shortest distance between every two vertices in the global vertex set, and screening out the preset number of global pivot points based on the shortest distances.

In one embodiment, the processor when executing the computer program further performs the steps of: inputting the graph data query task set into a preset submodular model, and dividing the graph data query task set into a plurality of graph data query task subsets through a greedy algorithm based on the similarity or the difference between every two query tasks.

In one embodiment, the processor when executing the computer program further performs the steps of: constructing a distributed underlying communication platform, and constructing a vertex-centric distributed graph query system based on the distributed underlying communication platform.

In one embodiment, the processor when executing the computer program further performs the steps of: when the graph query system is detected to be in an idle state, scanning the specified hard disk path file to acquire graph data query tasks.

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of: obtaining a graph data set and a graph data query task set, wherein the graph data set is a set of graph data structures which are stored in a plurality of machine nodes in a distributed mode, identifying graph data types of the graph data set, calculating similarity or difference between query tasks in the graph data query task set, dividing the graph data query task set into a plurality of graph data query task subsets based on the similarity or difference between query tasks, and inputting the graph data query task subsets into a preset distributed graph query system to obtain corresponding query results.

In one embodiment, the computer program when executed by the processor further performs the steps of: when the graph data type is a dense graph, calculating the similarity between every two query tasks in the graph data query task set; and when the graph data type is a sparse graph, calculating the difference degree between every two query tasks in the graph data query task set.

In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring query points corresponding to the query tasks in the graph data query task set and neighboring nodes of the query points, and calculating the similarity between every two query tasks based on the query points and the neighboring nodes of the query points.

In one embodiment, the computer program when executed by the processor further performs the steps of: query points corresponding to the query tasks in the graph data query task set are obtained, query point embedded vectors corresponding to the query points are selected from a preset vertex embedded vector set, and the difference degree between the query tasks is calculated based on the selected query point embedded vectors.

In one embodiment, the computer program when executed by the processor further performs the steps of: screening out a preset number of global pivot points from the graph data set, calculating the distance from each vertex in the graph data set to each global pivot point to obtain vertex embedding vectors, and constructing a vertex embedding vector set based on the vertex embedding vectors.

In one embodiment, the computer program when executed by the processor further performs the steps of: obtaining target vertices stored in the graph data structure of each machine node to obtain a distributed vertex set, wherein the target vertices are the vertices within a preset rank when the vertices in the graph data structure are ordered by degree, screening the distributed vertex set to obtain a global vertex set, calculating the shortest distance between every two vertices in the global vertex set, and screening out the preset number of global pivot points based on the shortest distances.

In one embodiment, the computer program when executed by the processor further performs the steps of: inputting the graph data query task set into a preset submodular model, and dividing the graph data query task set into a plurality of graph data query task subsets through a greedy algorithm based on the similarity or the difference between every two query tasks.

In one embodiment, the computer program when executed by the processor further performs the steps of: constructing a distributed underlying communication platform, and constructing a vertex-centric distributed graph query system based on the distributed underlying communication platform.

In one embodiment, the computer program when executed by the processor further performs the steps of: when the graph query system is detected to be in an idle state, scanning the specified hard disk path file to acquire graph data query tasks.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. The volatile memory may include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered to fall within the scope of this description.

The above examples illustrate only a few embodiments of the application. They are described in specific detail, but are not thereby to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A graph data query task processing method, the method comprising:

acquiring a graph data set and a graph data query task set, wherein the graph data set is a set of graph data structures stored in a plurality of machine nodes in a distributed manner;

identifying a graph data type of the graph dataset, the graph data type comprising a dense graph or a sparse graph;

calculating the similarity or difference between every two query tasks in the graph data query task set according to the graph data type;

dividing the graph data query task set into a plurality of graph data query task subsets based on the similarity or the difference between every two query tasks;

inputting the graph data query task subsets into a preset distributed graph query system to obtain corresponding query results;

the dividing the graph data query task set into a plurality of graph data query task subsets based on the similarity or the difference between the query tasks comprises:

when the graph data type is dense graph, dividing the graph data query task set into a plurality of graph data query task subsets according to a division principle of dividing query tasks with minimum similarity into the same graph data query subset;

and when the graph data type is a sparse graph, dividing the graph data query task set into a plurality of graph data query task subsets according to a division principle of dividing the query task with the largest difference into the same graph data query subset.

2. The method of claim 1, wherein calculating a similarity or a difference between query tasks in the graph data query task set according to the graph data type comprises:

3. The method of claim 2, wherein computing the similarity between query tasks in the graph data query task set comprises:

acquiring the query points corresponding to every two query tasks in the graph data query task set and the neighboring nodes of the query points;

4. The method of claim 2, wherein calculating a degree of difference between query tasks in the set of graph data query tasks comprises:

acquiring the query points corresponding to every two query tasks in the graph data query task set;

selecting a query point embedded vector corresponding to the query point from a preset vertex embedded vector set;

5. The method of claim 4, further comprising, prior to selecting a query point embedding vector corresponding to the query point from a set of preset vertex embedding vectors:

screening out a preset number of global pivot points from the graph data set;

and constructing a vertex embedding vector set based on the vertex embedding vector.

6. The method of claim 5, wherein said screening out a predetermined number of global pivot points from the graph dataset comprises:

screening the distributed vertex set to obtain a global vertex set;

and screening out a preset number of global pivot points based on the shortest distance.

7. The method of any one of claims 1 to 6, wherein dividing the graph data query task set into a plurality of graph data query task subsets based on a similarity or a difference between the query tasks comprises:

inputting the graph data query task set to a preset submodular model;

and dividing the graph data query task set into a plurality of graph data query task subsets through a greedy algorithm based on the similarity or the difference between every two query tasks.

8. The method according to any one of claims 1 to 6, wherein before inputting the subset of graph data query tasks into a preset distributed graph query system to obtain a corresponding query result, the method further comprises:

constructing a distributed underlying communication platform;

and constructing a vertex-centric distributed graph query system based on the distributed underlying communication platform.

9. The method of any of claims 1 to 6, wherein obtaining a graph data query task set comprises:

10. A graph data query task processing device, the device comprising:

a data identification module for identifying a graph data type of the graph dataset, the graph data type comprising a dense graph or a sparse graph;

the data dividing module is used for dividing the graph data query task set into a plurality of graph data query task subsets according to the dividing principle of dividing query tasks with minimum similarity into the same graph data query subset when the graph data type is a dense graph, and dividing the graph data query task set into a plurality of graph data query task subsets according to the dividing principle of dividing query tasks with maximum difference into the same graph data query subset when the graph data type is a sparse graph;

11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 9 when the computer program is executed.

12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 9.