CN116304212A

CN116304212A - Data processing system, method, equipment and storage medium

Info

Publication number: CN116304212A
Application number: CN202310253283.XA
Authority: CN
Inventors: 张宇; 卢浩宇; 赵进; 余辉; 张湛
Original assignee: Huazhong University of Science and Technology; Zhejiang Lab
Current assignee: Huazhong University of Science and Technology; Zhejiang Lab
Priority date: 2023-03-10
Filing date: 2023-03-10
Publication date: 2023-06-23

Abstract

The specification discloses a data processing system, method, device and storage medium. Nodes whose degree is higher than a preset threshold are screened out, as hub nodes, from the nodes contained in original graph data. Each neighbor node of a hub node is then taken as a starting node, and other nodes having a connection relationship with that neighbor node are accessed as associated nodes of the neighbor node. The neighbor node and its associated nodes are divided into a graph data block of the original graph data, and node update processing is carried out on each graph data block. This avoids the redundant computation that arises when the feature matrices corresponding to all nodes contained in the original graph data are used directly, and thereby improves data processing efficiency.

Description

Data processing system, method, equipment and storage medium

Technical Field

The present disclosure relates to the field of graph computing technologies, and in particular, to a data processing system, a method, an apparatus, and a storage medium.

Background

At present, when graph data is processed by a graph convolutional neural network, the graph data to be processed is typically large and sparsely distributed. As a result, many invalid computing operations are performed when the feature matrix corresponding to the graph data is used for computation, and processing efficiency is low.

Therefore, how to improve the data processing efficiency is a problem to be solved.

Disclosure of Invention

The present disclosure provides a data processing method, apparatus, device, and storage medium, to partially solve the above-mentioned problems in the prior art.

The technical scheme adopted in the specification is as follows:

the present specification provides a data processing system, the data processing system comprising: a first processor and a second processor;

the first processor is configured to determine, for each node included in the original graph data, whether the degree of the node exceeds a preset threshold, if yes, determine that the node is a hub node, determine, for each neighboring node of the hub node, through multiple rounds of node query, other nodes in the original graph data that have a connection relationship with the neighboring node, as associated nodes of the neighboring node, and determine graph data blocks of the original graph data according to the neighboring node and the associated nodes of the neighboring node;

the second processor is configured to update, for each node included in the graph data block, the node characteristic representation of the node according to the node characteristic representations of neighboring nodes of the node in the graph data block, obtain an updated characteristic representation of the node, and update the original graph data according to the updated characteristic representations of the node and other nodes.

Optionally, the first processor is configured to determine, for each neighboring node of the hub node, an associated node of the neighboring node by using the neighboring node as a starting node and performing multiple rounds of node query; wherein:

for each round of node query, determining each target node in the round, judging whether the neighbor nodes of each target node include an unvisited node, and if so, taking the unvisited node as an associated node of the starting node, marking the unvisited node as a visited node, and taking it as a target node of the next round of node query, wherein each target node is obtained by iterating from the starting node up to the previous round;

and after determining that the preset first termination condition is met, obtaining each associated node of the starting node.

Optionally, the first processor is configured to determine, for each target node in each round of node query, whether the target node is included in a preset global visited node set, if not, determine that the target node is a visited neighboring node corresponding to the starting node, and add the target node to the global visited node set, if yes, determine that the starting node does not have an associated node, and remove the visited neighboring node corresponding to the starting node from the global visited node set.

Optionally, the second processor includes: at least one processing unit;

at least some of the processing units are used for aggregating, for each node contained in the graph data block, the node feature representations of at least some neighbor nodes of the node in the graph data block to obtain sub-aggregation feature representations of the node; and

the other processing units are used for updating the node feature representation of the node according to each sub-aggregation feature representation of the node to obtain an updated feature representation of the node.

Optionally, the data processing system further comprises: a task scheduler;

the task scheduler is used for generating a first data processing task for the graph data block and distributing the first data processing task to the processing units according to the load value of each processing unit, wherein the first data processing task is used for aggregating the node feature representations of some of the nodes; and

the task scheduler is used for generating a second data processing task for the graph data block and distributing the second data processing task to the processing units according to the load value of each processing unit, wherein the second data processing task is used for updating the node feature representation of at least one node in the graph data block.

Optionally, the task scheduler is configured to determine, for each processing unit, whether the load value of the processing unit exceeds a preset first load threshold; if yes, determine the processing units adjacent in position to the processing unit as candidate processing units, determine at least one candidate processing unit with the lowest load value among the candidate processing units as a target processing unit, and allocate the first data processing task or the second data processing task that is being processed by the processing unit to the target processing unit.

Optionally, the task scheduler is configured to determine, for each processing unit in the processing units, whether a difference between a load value of the processing unit and a load value of another processing unit exceeds a preset second load threshold, if yes, split a first data processing task or a second data processing task that is being processed by the processing unit, to obtain each sub data processing task, and allocate the sub data processing task to a processing unit with a smallest load value in the processing units.

Optionally, the second processor is configured to determine, for each node included in the graph data block, an updated feature representation of the node through multiple iterations; wherein:

for each iteration round, determining the node feature representation to be updated of the node, updating the node feature representation to be updated according to the aggregation feature representation corresponding to the node determined in the previous iteration round to obtain the updated feature representation corresponding to the node in the current iteration round, and taking the updated feature representation corresponding to the node in the current iteration round as the node feature representation to be updated in the next iteration round, wherein the aggregation feature representation is used for representing an aggregation result of the neighbor nodes of the node in the graph data block, and the node feature representation to be updated is obtained after the node feature representation of the node has been iterated up to the previous iteration round;

and when the preset second termination condition is determined to be met, obtaining updated characteristic representation of the node.

The present specification provides a data processing method, the data processing method is applied to a data processing system, the data processing system includes: a first processor, a second processor, the method comprising:

the first processor judges whether the degree of each node contained in the original graph data exceeds a preset threshold value or not according to each node;

if yes, determining the node as a hub node, and determining other nodes with connection relation with the neighbor node in the original graph data as associated nodes of the neighbor node by multi-round node query aiming at each neighbor node of the hub node;

determining, according to the neighboring node and the associated nodes of the neighboring node, a graph data block of the original graph data, so that the second processor updates, for each node contained in the graph data block, the node feature representation of the node according to the node feature representations of the neighboring nodes of the node in the graph data block to obtain an updated feature representation of the node, and updates the original graph data according to the updated feature representations of the node and other nodes.

Optionally, for each neighboring node of the hub node, determining, by multiple rounds of node query, other nodes in the original graph data that have a connection relationship with the neighboring node, where the other nodes are associated nodes of the neighboring node, where the determining includes:

for each neighbor node of the hub node, taking the neighbor node as a starting node, and determining the associated nodes of the neighbor node through multiple rounds of node query; wherein:

determining each target node in each round of node query, judging whether each target node has a neighbor node or not, if so, taking each target node as an associated node of the starting node, and taking the neighbor node of each target node as a target node of the next round of node query, wherein each target node is obtained by iterating the starting node to the previous round;

Optionally, determining whether each target node has a neighboring node specifically includes:

for each target node in each round of node query, judging whether the target node is contained in a preset global visited node set;

if yes, determining that the starting node does not have an associated node, and removing the visited neighbor nodes corresponding to the starting node from the global visited node set;

if not, determining the target node as a visited neighbor node corresponding to the starting node, adding the target node to the global visited node set, and judging whether the target node has a neighbor node.

Optionally, the second processor includes: at least one processing unit;

updating the node characteristic representation of the node according to the node characteristic representation of each neighbor node of the node in the graph data block to obtain an updated characteristic representation of the node, and specifically comprises the following steps:

aggregating node characteristic representations of at least part of neighbor nodes of the nodes in the graph data block aiming at each node contained in the graph data block through at least part of processing units in the processing units to obtain sub-aggregation characteristic representations of the nodes; and

And updating the node characteristic representation of the node according to each sub-aggregation characteristic representation of the node through other processing units to obtain an updated characteristic representation of the node.

Optionally, the data processing system further comprises: a task scheduler;

aggregating, by at least some of the processing units, node feature representations of at least some neighboring nodes of the node in the graph data block for each node included in the graph data block, to obtain sub-aggregate feature representations of the node, including:

generating a first data processing task aiming at the graph data block through the task scheduler, and distributing the first data processing task to the processing units according to the load values of the processing units so that the processing units aggregate node characteristic representations of at least part of neighbor nodes of the nodes in the graph data block aiming at each node contained in the graph data block to obtain sub-aggregate characteristic representations of the nodes;

updating the node characteristic representation of the node according to each sub-aggregation characteristic representation of the node through other processing units to obtain an updated characteristic representation of the node, wherein the method specifically comprises the following steps:

Generating a second data processing task aiming at the graph data block through the task scheduler, and distributing the second data processing task to the processing units according to the load values of the processing units so that the processing units update the node characteristic representation of the node according to the sub-aggregation characteristic representations of the node to obtain updated characteristic representations of the node.

Optionally, the method further comprises:

and judging whether the load value of each processing unit exceeds a preset first load threshold value by the task scheduler, if so, determining each processing unit adjacent to the processing unit in each processing unit, as each candidate processing unit, determining at least one candidate processing unit with the lowest load value in each candidate processing unit, as a target processing unit, and distributing the first data processing task or the second data processing task which is being processed by the processing unit to the target processing unit.

Optionally, the method further comprises:

and judging whether the difference value between the load value of each processing unit and the load value of other processing units exceeds a preset second load threshold value or not according to each processing unit in the processing units through the task scheduler, if so, splitting the first data processing task or the second data processing task which are being processed by the processing unit to obtain each sub data processing task, and distributing the sub data processing tasks to the processing unit with the smallest load value in the processing units.

Optionally, for each node included in the graph data block, updating the node characteristic representation of the node according to the node characteristic representation of each neighbor node of the node in the graph data block to obtain an updated characteristic representation of the node, and specifically including:

determining the updated feature representation of each node contained in the graph data block through multiple iterations.

The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above data processing method.

The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data processing method described above when executing the program.

The above-mentioned at least one technical scheme that this specification adopted can reach following beneficial effect:

In the data processing method provided by this specification, the first processor judges, for each node contained in the original graph data, whether the degree of the node exceeds a preset threshold, and if so, determines the node to be a hub node. For each neighbor node of the hub node, other nodes in the original graph data that have a connection relationship with the neighbor node are determined, through multiple rounds of node query, to be associated nodes of the neighbor node, and graph data blocks of the original graph data are determined according to the neighbor node and its associated nodes. The second processor then updates, for each node contained in a graph data block, the node feature representation of the node according to the node feature representations of its neighbor nodes in the graph data block to obtain an updated feature representation of the node, and the original graph data is updated according to the updated feature representations of the node and other nodes.

According to the method, nodes with degrees higher than a preset threshold value can be screened out from each node contained in the original graph data to serve as hub nodes for connecting communities, each neighbor node of the hub nodes can be further used as a starting node, other nodes with connection relations with the neighbor nodes are accessed to serve as associated nodes of the neighbor nodes, the associated nodes of the neighbor nodes and the neighbor nodes can be divided to serve as graph data blocks of the original graph data, node updating processing is carried out on each graph data block, and redundant calculation generated by directly using feature matrixes corresponding to all nodes contained in the original graph data to carry out calculation can be avoided, so that data processing efficiency is improved.

Drawings

The accompanying drawings are included to provide a further understanding of the specification. The exemplary embodiments of the present specification and their description serve to illustrate and explain the specification and are not intended to limit it unduly. In the drawings:

FIG. 1 is a schematic diagram of a data processing system provided in the present specification;

FIG. 2 is a schematic diagram of a task scheduler provided in the present specification;

FIG. 3 is a schematic diagram of a load balancing process provided in the present specification;

FIG. 4 is a schematic diagram of the effect of the load balancing process provided in the present specification;

FIG. 5 is a schematic diagram of a data processing method provided in the present specification;

FIG. 6 is a schematic diagram of an electronic device corresponding to FIG. 5 provided in the present specification.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.

The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.

In this specification there is provided a data processing system as shown in figure 1:

FIG. 1 is a schematic diagram of a data processing system provided herein.

As can be seen from FIG. 1, the data processing system comprises a first processor and a second processor.

The first processor is configured to determine, for each node included in the original graph data, whether the degree of the node exceeds a preset threshold, if yes, determine that the node is a hub node, determine, for each neighboring node of the hub node, through multiple rounds of node query, other nodes in the original graph data that have a connection relationship with the neighboring node, as associated nodes of the neighboring node, and determine graph data blocks of the original graph data according to the neighboring node and the associated nodes of the neighboring node.

The degree of a node may be the sum of the in-degree and the out-degree of the node, where the in-degree indicates the number of edges entering the node in the graph data, and the out-degree indicates the number of edges starting from the node.

The preset threshold may be set according to actual requirements and may be gradually reduced as execution proceeds. For example, assume that the preset threshold in the first round is 10, so that all nodes with a degree greater than 10 are used as hub nodes in that round. After all nodes of the original graph data have been screened with 10 as the preset threshold, a second round of graph data blocking can be started, in which the preset threshold is adjusted to 8 and the nodes with a degree greater than 8 are screened from the nodes of the original graph data to serve as hub nodes, and so on.
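The screening described above can be sketched as follows. This is a minimal illustration, not the implementation of the specification: the edge-list input format and the function names `node_degrees`, `select_hub_nodes` and `hub_nodes_per_round` are assumptions made for the example, and the degree is taken as in-degree plus out-degree.

```python
from collections import defaultdict


def node_degrees(edges):
    """Degree of each node (in-degree + out-degree) for a directed edge list."""
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1  # edge leaving src counts toward its out-degree
        degree[dst] += 1  # edge entering dst counts toward its in-degree
    return dict(degree)


def select_hub_nodes(edges, threshold):
    """Nodes whose degree is strictly greater than the threshold."""
    return {node for node, d in node_degrees(edges).items() if d > threshold}


def hub_nodes_per_round(edges, thresholds):
    """One hub-node set per round; thresholds would typically decrease."""
    return [select_hub_nodes(edges, t) for t in thresholds]


# Hypothetical toy graph: node 0 points at nodes 1..5, plus one edge 1 -> 2.
edges = [(0, i) for i in range(1, 6)] + [(1, 2)]
rounds = hub_nodes_per_round(edges, thresholds=[4, 1])
```

With the toy graph, the first round (threshold 4) selects only node 0, and the second round (threshold 1) also admits nodes 1 and 2.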

Specifically, for each neighboring node of the hub node, the first processor may take the neighboring node as a starting node and determine, through multiple rounds of node query, other nodes in the original graph data that have a connection relationship with the neighboring node as associated nodes of the neighboring node. In each round of node query, the first processor determines each target node of the round and judges whether the neighbor nodes of each target node include an unvisited node. If so, it takes each such unvisited node as an associated node of the starting node, marks it as a visited node, and uses it as a target node of the next round of node query. Each target node is obtained by iterating from the starting node up to the previous round. After determining that a preset first termination condition is met, the first processor obtains the associated nodes of the starting node.

The first termination condition may be set according to actual requirements, for example: each neighbor node of each target node in the round of node query is a visited node. For another example: the number of nodes that have been accessed exceeds a preset threshold.
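The multi-round node query reads like a level-by-level breadth-first traversal, and can be sketched as below under that reading. The adjacency-dictionary format, the function name `find_associated_nodes` and the `max_visited` parameter are assumptions for illustration. Pre-seeding `visited` with the hub node, so that the traversal does not pass through it, is likewise an assumption of this sketch.

```python
def find_associated_nodes(adjacency, start, visited=None, max_visited=None):
    """Collect the nodes reachable from `start`, one query round at a time."""
    visited = set() if visited is None else visited
    visited.add(start)
    associated = []
    targets = [start]  # target nodes of the current round
    while targets:
        next_targets = []
        for target in targets:
            for neighbor in adjacency.get(target, ()):
                if neighbor not in visited:
                    visited.add(neighbor)          # mark as visited
                    associated.append(neighbor)    # associated node of start
                    next_targets.append(neighbor)  # target of the next round
        # Second example termination condition: too many visited nodes.
        if max_visited is not None and len(visited) > max_visited:
            break
        # First example termination condition: no unvisited neighbor was
        # found, so next_targets is empty and the loop ends.
        targets = next_targets
    return associated


# Hypothetical graph: node 0 is a hub; node 1 leads to the chain 4 -> 5.
adjacency = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0], 4: [1, 5], 5: [4]}
block_nodes = find_associated_nodes(adjacency, start=1, visited={0})
```

Here `block_nodes` holds the associated nodes of starting node 1, which together with node 1 itself would form one graph data block.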

It should be noted that, the process of querying the neighbor node of each target node by the first processor may be executed in parallel by multiple threads.

In addition, as can be seen from the foregoing, while the first processor takes each neighboring node of the hub node as a starting node and determines its associated nodes through multiple rounds of node query, a node may be accessed repeatedly; that is, a node that has not been visited during the multi-round node query of the current starting node may already have been visited during the multi-round node query of another starting node.

Based on this, the first processor may determine, for each target node in each round of node query, whether the target node is included in a preset global visited node set, before determining whether the target node has a neighboring node, if not, determine that the target node is a visited neighboring node corresponding to the starting node, and add the target node to the global visited node set, if so, determine that the starting node has no associated node, and remove the visited neighboring node corresponding to the starting node from the global visited node set.
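One way to read this check is sketched below: each target node is tested against the global visited node set, and a hit causes the current starting node to give up and roll back its own additions. The function name `query_with_global_set`, the adjacency-dictionary format and the `skip` parameter (used here to keep hub nodes out of the traversal) are assumptions of the sketch, not details stated in the specification.

```python
def query_with_global_set(adjacency, start, global_visited, skip=()):
    """Multi-round query that shares one global visited set across starts."""
    seen = {start, *skip}  # nodes already queued by this start, plus skipped
    local_visited = []     # nodes this start has added to the global set
    targets = [start]
    while targets:
        next_targets = []
        for target in targets:
            if target in global_visited:
                # Another starting node already claimed this target: the
                # current start gets no associated nodes, and everything it
                # added to the global set is removed again.
                global_visited.difference_update(local_visited)
                return []
            global_visited.add(target)
            local_visited.append(target)
            for neighbor in adjacency.get(target, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_targets.append(neighbor)
        targets = next_targets
    return local_visited[1:]  # associated nodes, excluding the start itself


# Hypothetical graph: hub 0 with neighbors 1 and 2, both connected to node 3.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
global_visited = set()
first = query_with_global_set(adjacency, 1, global_visited, skip={0})
second = query_with_global_set(adjacency, 2, global_visited, skip={0})
```

In this example the query from node 1 claims nodes 3 and 2, so the later query from node 2 finds itself already in the global set and returns no associated nodes.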

It should be noted that the graph data block determined by the first processor includes, in addition to the data of the corresponding nodes, structure information of the graph data block, where the structure information includes: the number of nodes included in the graph data block, the indexes of the hub nodes connected with the graph data block, the number of hub nodes connected with the graph data block, and the like.
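The structure information listed above could be carried in a small record such as the following. The class name `GraphDataBlock`, its field names and the helper `build_block` are hypothetical and chosen only for this sketch; the specification does not prescribe a concrete layout.

```python
from dataclasses import dataclass, field


@dataclass
class GraphDataBlock:
    """A graph data block together with its structure information."""
    nodes: list                                       # nodes in the block
    hub_indices: list = field(default_factory=list)   # connected hub nodes

    @property
    def num_nodes(self):
        return len(self.nodes)

    @property
    def num_hubs(self):
        return len(self.hub_indices)


def build_block(adjacency, start, associated, hubs):
    """Assemble a block from a starting node and its associated nodes."""
    nodes = [start, *associated]
    # A hub is "connected with" the block if any block node has an edge to it.
    connected = sorted(
        h for h in hubs if any(h in adjacency.get(n, ()) for n in nodes)
    )
    return GraphDataBlock(nodes=nodes, hub_indices=connected)


# Hypothetical graph: hub 0; block made of starting node 1 and node 4.
adjacency = {0: [1, 2], 1: [0, 4], 2: [0], 4: [1]}
block = build_block(adjacency, start=1, associated=[4], hubs={0})
```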

Further, the second processor is configured to update, for each node included in the graph data block, the node characteristic representation of the node according to the node characteristic representations of neighboring nodes of the node in the graph data block, obtain an updated characteristic representation of the node, and update the original graph data according to the updated characteristic representations of the node and other nodes.

Further, the second processor includes at least one processing unit, at least some of the processing units are configured to aggregate, for each node included in the graph data block, node feature representations of at least some neighboring nodes of the node in the graph data block to obtain sub-aggregate feature representations of the node, and other processing units in each processing unit are configured to update the node feature representations of the node according to each sub-aggregate feature representation of the node to obtain updated feature representations of the node.
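The division of labor between processing units can be illustrated with the sketch below, in which each call to `sub_aggregate` stands for the work of one aggregating unit and `update_from_sub_aggregates` stands for the work of an updating unit. Summation as the aggregation operator, averaging over the neighbor count, and the fusion weight of 0.5 are all assumptions of this example, as are the function names.

```python
def sub_aggregate(features, neighbor_subset):
    """Sum the feature vectors of one subset of a node's neighbors."""
    dim = len(next(iter(features.values())))
    total = [0.0] * dim
    for neighbor in neighbor_subset:
        for i, value in enumerate(features[neighbor]):
            total[i] += value
    return total


def update_from_sub_aggregates(own, sub_aggregates, neighbor_count, weight=0.5):
    """Combine the sub-aggregates and fuse the result with the node's own features."""
    dim = len(own)
    summed = [sum(s[i] for s in sub_aggregates) for i in range(dim)]
    if neighbor_count:
        mean = [v / neighbor_count for v in summed]
    else:
        mean = [0.0] * dim
    return [weight * o + (1 - weight) * m for o, m in zip(own, mean)]


# Hypothetical features; node "a" has neighbors "b", "c" and "d".
features = {"a": [1.0, 1.0], "b": [2.0, 0.0], "c": [4.0, 2.0], "d": [0.0, 4.0]}
part1 = sub_aggregate(features, ["b", "c"])  # handled by one unit
part2 = sub_aggregate(features, ["d"])       # handled by another unit
updated_a = update_from_sub_aggregates(features["a"], [part1, part2], 3)
```

Because summation is associative, splitting the neighbors across units in any way yields the same combined result, which is what makes this kind of partial aggregation safe to parallelize.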

In addition, the data processing system further comprises a task scheduler, as shown in FIG. 2.

Fig. 2 is a schematic diagram of a task scheduler provided in the present specification.

As can be seen from FIG. 2, the task scheduler can generate a first data processing task for the graph data block and allocate it to the processing units according to the load value of each processing unit, where the first data processing task is used to aggregate the node feature representations of some of the nodes. The task scheduler can likewise generate a second data processing task for the graph data block and allocate it to the processing units according to the load value of each processing unit, where the second data processing task is used to update the node feature representation of at least one node in the graph data block.

Further, the task scheduler may perform load balancing processing on each processing unit by two different load balancing methods, which are respectively described in detail below.

The first load balancing method may be as follows: for each processing unit, the task scheduler determines whether the load value of the processing unit exceeds a preset first load threshold; if so, it determines the processing units adjacent in position to the processing unit as candidate processing units, determines at least one candidate processing unit with the lowest load value among the candidate processing units as a target processing unit, and assigns the first data processing task or the second data processing task being processed by the processing unit to the target processing unit.

It should be noted that, in the foregoing, the allocation of the first data processing task or the second data processing task being processed by the processing unit to the target processing unit does not directly transfer the first data processing task or the second data processing task to the target processing unit, but performs at least part of the computing tasks in the first data processing task or the second data processing task by using the computing resources of the target processing unit, and sends the execution result to the processing unit.

The reason why the target processing unit is selected from the processing units adjacent to the processing unit is that the data transmission efficiency between the adjacent processing units is higher, and the transfer of the first data processing task or the second data processing task is easier to be realized.
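A minimal sketch of the first load balancing method follows. The representation of loads as a dictionary of numbers, the `neighbors` map describing which units are physically adjacent, and the function names are assumptions for illustration.

```python
def pick_target_unit(loads, neighbors, unit, first_threshold):
    """Return the least-loaded adjacent unit if `unit` is overloaded, else None."""
    if loads[unit] <= first_threshold:
        return None  # not overloaded, nothing to offload
    candidates = neighbors.get(unit, [])
    if not candidates:
        return None
    return min(candidates, key=lambda u: loads[u])


def offload(loads, unit, target, amount):
    """Account for `amount` of work being computed on `target` for `unit`."""
    loads[unit] -= amount
    loads[target] += amount


# Hypothetical layout: unit 0 is adjacent to units 1 and 2, but not to unit 3.
loads = {0: 90, 1: 40, 2: 20, 3: 5}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
target = pick_target_unit(loads, neighbors, unit=0, first_threshold=80)
if target is not None:
    offload(loads, 0, target, amount=30)
```

Unit 3 has the lowest load overall, but it is not adjacent to unit 0, so the sketch selects unit 2; this mirrors the preference for adjacent units stated above.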

The second load balancing method may be that the task scheduler determines, for each processing unit in the processing units, whether a difference value between a load value of the processing unit and a load value of another processing unit exceeds a preset second load threshold, if so, splits a first data processing task or a second data processing task being processed by the processing unit to obtain each sub data processing task, and distributes each sub data processing task to a processing unit with a minimum load value selected from the processing units.
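The second load balancing method can be sketched as below. Several details are assumptions of this example rather than statements of the specification: the load gap is measured against the least-loaded other unit, half of the gap is moved, the moved work is split into `parts` equal sub-tasks, and the least-loaded unit is re-evaluated after each assignment.

```python
def split_and_distribute(loads, unit, second_threshold, parts=2):
    """Split part of `unit`'s work into sub-tasks for the least-loaded units."""
    others = [u for u in loads if u != unit]
    if not others:
        return []
    gap = loads[unit] - min(loads[u] for u in others)
    if gap <= second_threshold:
        return []  # load difference is within the allowed range
    chunk = gap / 2 / parts  # size of each sub data processing task
    assignments = []
    for _ in range(parts):
        # Re-select the least-loaded unit after every assignment.
        target = min(others, key=lambda u: loads[u])
        loads[unit] -= chunk
        loads[target] += chunk
        assignments.append((target, chunk))
    return assignments


# Hypothetical loads for three processing units.
loads = {0: 100.0, 1: 20.0, 2: 30.0}
moved = split_and_distribute(loads, unit=0, second_threshold=50)
```

Unlike the first method, the target here is chosen from all processing units rather than only adjacent ones, which trades higher transfer cost for a more even load.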

It should be noted that the two load balancing methods may be used separately or simultaneously, as shown in fig. 3.

Fig. 3 is a schematic diagram of a load balancing procedure provided in the present specification.

As can be seen from FIG. 3, the first load balancing method operates on adjacent processing units, so its data transmission is efficient, its cost is low, and it executes quickly. Preferably, therefore, the task scheduler first determines, for each processing unit, whether the load value of the processing unit exceeds the preset first load threshold; if so, it determines the processing units adjacent to the processing unit as target processing units and evenly allocates the first data processing task or the second data processing task being processed by the processing unit to the target processing units.

Further, in a practical application scenario, the amount of data to be processed by the first data processing task or the second data processing task may be large, so that even after the task has been evenly distributed to the adjacent target processing units, the load of the processing unit and of each target processing unit still far exceeds that of the other processing units. In this case, after the first load balancing method has been applied, the task scheduler may further determine, for each processing unit, whether the difference between the load value of the processing unit and the load value of another processing unit exceeds the preset second load threshold. If so, it splits the first data processing task or the second data processing task being processed by the processing unit to obtain sub data processing tasks and distributes each sub data processing task to the processing unit with the smallest load value. The task scheduler may repeat this check for each processing unit until the difference between the load value of the processing unit and the load values of the other processing units no longer exceeds the preset second load threshold.

In order to explain the above in more detail, the effect of the load balancing processing of each processing unit by two load balancing methods will be described below, as shown in fig. 4.

Fig. 4 is a schematic diagram of the effect of the load balancing process provided in the present specification.

As can be seen from fig. 4, due to the power law distribution of the graph data, a few processing units may be in a busy state (i.e. a higher workload percentage), while other partial processing units are in an idle state for a part of time, resulting in overall inefficiency.

However, after the load balancing processing is performed by the first load balancing method, the load among the processing units may still be unbalanced, and at this time, the load balancing processing may be performed by the second load balancing method continuously, so that the load among the processing units is more balanced, so as to improve the overall efficiency.

Further, the second processor updates the node feature representation of each node in the graph data block according to the node feature representations of that node's neighbor nodes in the graph data block. The updated feature representation of the node may be obtained by determining it, for each node contained in the graph data block, through multiple iterations.

Specifically, in each iteration round, the second processor may determine the node feature representation to be updated of the node and update it according to the aggregate feature representation of the node determined in the previous iteration round, thereby obtaining the updated feature representation of the node for this round. The updated feature representation obtained in this round is then used as the node feature representation to be updated in the next round. Here, the aggregate feature representation represents the result of aggregating the neighbor nodes of the node in the graph data block, and the node feature representation to be updated is the result of iterating the node feature representation of the node up to the previous round. When a preset second termination condition is determined to be met, the updated feature representation of the node is obtained.

The second termination condition may be set according to actual needs, for example, the number of iteration rounds reaching a preset number. To update the node feature representation to be updated according to the corresponding aggregate feature representation, the aggregate feature representation and the node feature representation to be updated may be fused by weighting, which yields the updated feature representation of the node for that round.

From the above, it can be seen that the node update for each node of the graph data block is mainly divided into two phases: an aggregation phase, in which the neighbor nodes of the node in the graph data block are aggregated to obtain the aggregate feature representation corresponding to the node, and an update phase, in which the node feature representation to be updated is updated according to that aggregate feature representation.

In an actual application scenario, some nodes in the graph data block may have a large number of neighbor nodes. If each iteration round used the aggregate feature representation computed in that same round, obtaining it would take a long time and could slow the update of the node. Each iteration round may therefore use the aggregate feature representation obtained in the previous round, so that the aggregation phase and the update phase can be executed simultaneously within each round, which improves the efficiency of updating the nodes contained in the graph data block.
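The interplay of the two phases can be illustrated with a minimal Python sketch. It is not the implementation of the specification: the mean aggregator, the fusion weight `alpha`, the fixed round count used as the second termination condition, and the function names `mean_aggregate` and `iterate_block` are all assumptions made for illustration. The point it shows is that the update in a round reads only the aggregate produced in the previous round, so the aggregation for the next round does not depend on it.

```python
def mean_aggregate(features, neighbors, node):
    """Element-wise mean of the neighbors' feature vectors (zeros if none)."""
    dim = len(features[node])
    nbrs = neighbors.get(node, [])
    if not nbrs:
        return [0.0] * dim
    return [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]


def iterate_block(features, neighbors, rounds=2, alpha=0.5):
    """Update every node of one graph data block for a fixed number of rounds.

    alpha is a hypothetical fusion weight; `rounds` stands in for the
    second termination condition.
    """
    current = {n: list(v) for n, v in features.items()}
    # Aggregates available before the first round, computed from the inputs.
    prev_agg = {n: mean_aggregate(current, neighbors, n) for n in current}
    for _ in range(rounds):
        # Aggregation phase: reads only `current`; its result is used next round.
        next_agg = {n: mean_aggregate(current, neighbors, n) for n in current}
        # Update phase: weighted fusion with the PREVIOUS round's aggregate.
        updated = {
            n: [alpha * x + (1 - alpha) * a
                for x, a in zip(current[n], prev_agg[n])]
            for n in current
        }
        current, prev_agg = updated, next_agg
    return current
```

Because `next_agg` and `updated` are both computed from values fixed at the start of the round, the two comprehensions could run on different processing units at the same time; here they run sequentially only for simplicity.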

Further, in each iteration round, the neighbor nodes to be aggregated in that round may be divided into neighbor node groups. Aggregation is performed on each neighbor node group to obtain a sub-aggregation feature representation, and the sub-aggregation feature representations are finally aggregated to obtain the aggregate feature representation of the neighbor nodes.
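A minimal sketch of group-wise aggregation follows, assuming a mean aggregator and a fixed group size; the names `chunk`, `sub_aggregate` and `aggregate_in_groups` are hypothetical and do not come from the specification. Each group yields a partial sum and a count, and the partial results are combined at the end, so the groups could be handled by different processing units.

```python
def chunk(items, group_size):
    """Split a list into consecutive groups of at most group_size items."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


def sub_aggregate(group_features):
    """Sub-aggregation for one neighbor group: (partial sum, member count)."""
    dim = len(group_features[0])
    total = [sum(vec[i] for vec in group_features) for i in range(dim)]
    return total, len(group_features)


def aggregate_in_groups(neighbor_features, group_size):
    """Combine per-group sub-aggregates into the mean over all neighbors.

    Assumes neighbor_features is a non-empty list of equal-length vectors.
    """
    partials = [sub_aggregate(g) for g in chunk(neighbor_features, group_size)]
    dim = len(neighbor_features[0])
    total = [sum(p[0][i] for p in partials) for i in range(dim)]
    count = sum(p[1] for p in partials)
    return [t / count for t in total]
```

Carrying the count alongside the partial sum is what makes the combined mean independent of how the neighbors were grouped.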

It should be noted that the hub node is a node that connects the graph data blocks and therefore does not belong to any graph data block. When the node feature representations of the nodes contained in a graph data block are updated, the node feature representation of the hub node is not updated. For this reason, the second processor may update the node feature representation of each hub node separately.

From the above, it can be seen that nodes whose degree is higher than a preset threshold can be screened out from the nodes contained in the original graph data and used as hub nodes that connect communities. Each neighbor node of a hub node can then be taken as a starting node, and other nodes that have a connection relationship with the neighbor node are accessed and taken as associated nodes of the neighbor node, so that the neighbor node and its associated nodes are divided into a graph data block of the original graph data. Node update processing is performed on each graph data block, which avoids the redundant computation incurred by computing directly with the feature matrices of all nodes contained in the original graph data and thereby improves data processing efficiency.

It should be noted that, because the original graph data is often large, it cannot be loaded into memory all at once when it is updated directly. Processing it directly therefore requires many swap-in and swap-out operations through the memory's swapping algorithm, and the processing involves many redundant operations, which greatly reduces the processing efficiency of the original graph data.

By the above method, the original graph data can be divided into blocks by exploiting the inherent correlation among some of its nodes, so that each update is performed on one graph data block of the original graph data. This reduces redundant computation, allows only part of the original graph data to be loaded into memory at a time, and improves the efficiency of updating the original graph data.

The inherent connection among some of the nodes may be, for example, as follows: in a social network graph, the interconnected nodes in a graph data block may correspond to people working in the same institute; in a paper citation network, the interconnected nodes in a graph data block may correspond to papers published in the same conference series.

In order to further describe the above data processing system in detail, the present disclosure also provides a method for performing data processing by using the above data processing system, as shown in fig. 5.

Fig. 5 is a schematic diagram of a data processing method provided in the present specification, including the following steps:

S501: the first processor determines, for each node contained in the original graph data, whether the degree of the node exceeds a preset threshold;

S502: if so, the node is determined to be a hub node, and, for each neighbor node of the hub node, other nodes in the original graph data that have a connection relationship with the neighbor node are determined, through multiple rounds of node queries, to be associated nodes of the neighbor node;

S503: a graph data block of the original graph data is determined according to the neighbor node and the associated nodes of the neighbor node, so that the second processor, for each node contained in the graph data block, updates the node feature representation of the node according to the node feature representations of its neighbor nodes in the graph data block to obtain an updated feature representation of the node, and updates the original graph data according to the updated feature representations of the node and of the other nodes.
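The degree test of step S501 and the selection of starting nodes for step S502 can be sketched in a few lines of Python. This is only an illustration: the adjacency-dictionary representation and the names `find_hub_nodes` and `starting_nodes` are assumptions, and skipping neighbors that are themselves hub nodes is a simplification not stated in the specification.

```python
def find_hub_nodes(adjacency, degree_threshold):
    """Nodes whose degree (neighbor count) exceeds the preset threshold."""
    return {n for n, nbrs in adjacency.items() if len(nbrs) > degree_threshold}


def starting_nodes(adjacency, hubs):
    """Neighbors of hub nodes, in a stable order, used as query start points.

    Assumption: a neighbor that is itself a hub is not used as a start point.
    """
    starts = []
    for hub in sorted(hubs):
        for nbr in adjacency[hub]:
            if nbr not in hubs and nbr not in starts:
                starts.append(nbr)
    return starts
```

For example, with `adjacency = {"h": ["a", "b", "c"], "a": ["h", "b"], "b": ["h", "a"], "c": ["h"]}` and a threshold of 2, only `"h"` has more than two neighbors, so it is the single hub node and `a`, `b`, `c` become the starting nodes.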

That is, the first processor determines, for each node contained in the original graph data, whether the degree of the node exceeds a preset threshold. If so, the node is determined to be a hub node, and, for each neighbor node of the hub node, other nodes in the original graph data that have a connection relationship with the neighbor node are determined to be associated nodes of the neighbor node. A graph data block of the original graph data is determined according to the neighbor node and its associated nodes. The second processor then, for each node contained in the graph data block, updates the node feature representation of the node according to the node feature representations of its neighbor nodes in the graph data block to obtain an updated feature representation of the node, and updates the original graph data according to the updated feature representations of the node and of the other nodes.

For each neighbor node of the hub node, the neighbor node is taken as a starting node, and the associated nodes of the neighbor node are determined through multiple rounds of node queries.

In each round of node queries, the target nodes are determined, and it is judged whether each target node has neighbor nodes. If so, each target node is taken as an associated node of the starting node, and the neighbor nodes of each target node are taken as the target nodes of the next round of node queries, where the target nodes of a round are obtained by iterating from the starting node up to the previous round. When a preset first termination condition is determined to be met, the associated nodes of the starting node are obtained.

For each target node in each round of node queries, it is judged whether the target node is contained in a preset global accessed node set. If so, it is determined that the starting node has no associated nodes, and the accessed neighbor nodes corresponding to the starting node are removed from the global accessed node set. If not, the target node is determined to be an accessed neighbor node corresponding to the starting node and is added to the global accessed node set, and it is then judged whether each target node has neighbor nodes.
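The multi-round node query with a global accessed node set can be sketched as a bounded breadth-first traversal. The following Python is an illustration under stated assumptions rather than the method of the specification: the first termination condition is modeled as a round limit (`max_rounds`) or running out of target nodes, hub nodes are skipped during traversal, and nodes already reached by the same traversal are tracked in a separate local set so that cycles inside one block are not mistaken for conflicts. The names `query_block` and `partition` are hypothetical.

```python
def query_block(start, adjacency, hubs, global_visited, max_rounds=3):
    """Return the set {start node + associated nodes}, or None on conflict."""
    local_visited = set()          # nodes accessed by THIS starting node
    targets = [start]
    for _ in range(max_rounds):    # assumed first termination condition
        next_targets = []
        for node in targets:
            if node in hubs or node in local_visited:
                continue           # assumption: skip hubs and own nodes
            if node in global_visited:
                # Reached a node accessed from another starting node:
                # roll back this traversal's entries and report no block.
                global_visited.difference_update(local_visited)
                return None
            local_visited.add(node)
            global_visited.add(node)
            next_targets.extend(adjacency.get(node, []))
        if not next_targets:
            break
        targets = next_targets
    return local_visited


def partition(adjacency, hubs, max_rounds=3):
    """Build graph data blocks by querying from every hub neighbor."""
    global_visited = set()
    blocks = []
    for hub in sorted(hubs):
        for start in adjacency[hub]:
            block = query_block(start, adjacency, hubs, global_visited, max_rounds)
            if block:
                blocks.append(block)
    return blocks
```

The rollback on conflict mirrors the removal of the starting node's accessed neighbor nodes from the global set described above, so an abandoned traversal does not block later ones.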

Through at least some of the processing units, for each node contained in the graph data block, the node feature representations of at least some of the neighbor nodes of the node in the graph data block are aggregated to obtain sub-aggregation feature representations of the node. Through the other processing units, the node feature representation of the node is updated according to the sub-aggregation feature representations of the node to obtain the updated feature representation of the node.

A task scheduler generates a first data processing task for the graph data block and distributes the first data processing task to the processing units according to the load value of each processing unit, so that the processing units, for each node contained in the graph data block, aggregate the node feature representations of at least some of the neighbor nodes of the node in the graph data block to obtain sub-aggregation feature representations of the node.

The task scheduler generates a second data processing task for the graph data block and distributes the second data processing task to the processing units according to the load value of each processing unit, so that the processing units update the node feature representation of the node according to the sub-aggregation feature representations of the node to obtain the updated feature representation of the node.

The task scheduler judges, for each processing unit, whether the load value of the processing unit exceeds a preset first load threshold. If so, the processing units adjacent in position to the processing unit are determined as candidate processing units, at least one candidate processing unit with the lowest load value among the candidate processing units is determined as a target processing unit, and the first data processing task or the second data processing task being processed by the processing unit is distributed to the target processing unit.
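The neighbor-based target selection of this first load balancing method can be sketched as follows. The layout of the processing units is not specified here, so the sketch assumes they form a one-dimensional array in which adjacency means index distance 1; the function name `pick_neighbor_target` and the list-of-loads representation are likewise hypothetical.

```python
def pick_neighbor_target(loads, unit, first_threshold):
    """Index of the adjacent unit to offload to, or None.

    loads: load value per processing unit, indexed by position.
    Returns None when `unit` is not overloaded or has no adjacent unit.
    """
    if loads[unit] <= first_threshold:
        return None
    # Candidate processing units: positional neighbors (1-D layout assumed).
    candidates = [i for i in (unit - 1, unit + 1) if 0 <= i < len(loads)]
    if not candidates:
        return None
    # Target processing unit: the candidate with the lowest load value.
    return min(candidates, key=lambda i: loads[i])
```

Restricting candidates to adjacent units keeps the choice local, which is why a second, global method is described for the case where the neighborhood as a whole remains overloaded.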

The task scheduler judges, for each processing unit, whether the difference between the load value of the processing unit and the load value of another processing unit exceeds a preset second load threshold. If so, the processing unit with the smallest load value among the processing units is selected as a target processing unit, and the first data processing task or the second data processing task being processed by the processing unit is distributed to the target processing unit.

The task scheduler judges, for each processing unit, whether the difference between the load value of the processing unit and the load value of another processing unit exceeds a preset second load threshold. If so, the first data processing task or the second data processing task being processed by the processing unit is split into sub data processing tasks, and each sub data processing task is distributed to the processing unit with the smallest load value among the processing units.
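A minimal sketch of the task-splitting variant is given below. Several details are assumptions made for illustration: the load difference is measured against the least-loaded unit, a task is described only by a numeric cost that is split into equal parts, and the names `rebalance_by_splitting` and `parts` are hypothetical.

```python
def rebalance_by_splitting(loads, task_cost, unit, second_threshold, parts=4):
    """Split the task running on `unit` if it is too far ahead of the others.

    loads: load value per processing unit; loads[unit] includes task_cost.
    Returns a new list of load values; the input list is left unchanged.
    """
    loads = list(loads)
    if loads[unit] - min(loads) <= second_threshold:
        return loads                      # balanced enough, nothing to do
    loads[unit] -= task_cost              # take the task off the busy unit
    piece = task_cost / parts             # equal-sized sub tasks (assumed)
    for _ in range(parts):
        # Each sub task goes to whichever unit is least loaded right now.
        target = min(range(len(loads)), key=lambda i: loads[i])
        loads[target] += piece
    return loads
```

Re-evaluating the minimum before each sub task is placed spreads the pieces across units instead of sending all of them to the unit that was least loaded at the start.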

For each node contained in the graph data block, an updated feature representation of the node is determined through multiple iterations.

In each iteration round, the node feature representation to be updated of the node is determined and is updated according to the aggregate feature representation of the node determined in the previous iteration round, thereby obtaining the updated feature representation of the node for this round. The updated feature representation obtained in this round is used as the node feature representation to be updated in the next round. Here, the aggregate feature representation represents the result of aggregating the neighbor nodes of the node in the graph data block, and the node feature representation to be updated is the result of iterating the node feature representation of the node up to the previous round. When a preset second termination condition is determined to be met, the updated feature representation of the node is obtained.

From the above, it can be seen that nodes whose degree is higher than a preset threshold can be screened out from the nodes contained in the original graph data and used as hub nodes that connect communities. Each neighbor node of a hub node can then be taken as a starting node, and other nodes that have a connection relationship with the neighbor node are accessed and taken as associated nodes of the neighbor node, so that the neighbor node and its associated nodes are divided into a graph data block of the original graph data. Node update processing is performed on each graph data block, which avoids the redundant computation incurred by computing directly with the feature matrices of all nodes contained in the original graph data and thereby improves data processing efficiency.

The present specification also provides a computer-readable storage medium storing a computer program, the computer program being operable to perform the data processing method provided in fig. 5 above.

The present specification also provides, in fig. 6, a schematic structural diagram of an electronic device corresponding to fig. 5. At the hardware level, as shown in fig. 6, the electronic device includes a processor, an internal bus, a network interface, a memory, and a nonvolatile storage, and may of course include hardware required by other services. The processor reads the corresponding computer program from the nonvolatile storage into the memory and then runs it to implement the data processing method of fig. 5 described above.

Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.

In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs a digital system to "integrate" it onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can be readily obtained simply by programming the method flow, with a little logic programming in one of the hardware description languages above, into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may be regarded as structures within the hardware component. Alternatively, the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units. Of course, when the present specification is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.

It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include, among computer-readable media, volatile memory such as random access memory (RAM), and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to across embodiments, and each embodiment mainly describes its differences from the other embodiments. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.

The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims

1. A data processing system, the data processing system comprising: a first processor and a second processor;

2. The data processing system of claim 1, wherein the first processor is configured to determine, for each neighboring node of the hub node, an associated node of the neighboring node by multiple rounds of node queries with the neighboring node as a starting node; wherein:

3. The data processing system of claim 2, wherein the first processor is configured to determine, for each target node in each round of node queries, whether the target node is included in a preset global set of accessed nodes, if not, determine that the target node is an accessed neighboring node corresponding to the starting node, and add the target node to the global set of accessed nodes, if so, determine that the starting node does not have an associated node, and remove the accessed neighboring node corresponding to the starting node from the global set of accessed nodes.

4. The data processing system of claim 1, wherein the second processor comprises: at least one processing unit;

5. The data processing system of claim 4, wherein the data processing system further comprises: a task scheduler;

6. The data processing system of claim 5, wherein the task scheduler is configured to determine, for each processing unit, whether a load value of the processing unit exceeds a preset first load threshold, and if so, determine each processing unit adjacent to the processing unit in each processing unit, as each candidate processing unit, determine at least one candidate processing unit with a lowest load value in each candidate processing unit, as a target processing unit, and assign, to the target processing unit, a first data processing task or a second data processing task being processed by the processing unit.

7. The data processing system of claim 5, wherein the task scheduler is configured to determine, for each of the processing units, whether a difference between a load value of the processing unit and a load value of another processing unit exceeds a preset second load threshold, and if so, split a first data processing task or a second data processing task that is being processed by the processing unit to obtain each sub-data processing task, and allocate the sub-data processing task to a processing unit with a smallest load value among the processing units.

8. The data processing system of claim 1, wherein the second processor is configured to determine, for each node included in the graph data block, an updated feature representation for that node through a plurality of iterations; wherein:

9. A data processing method, wherein the data processing method is applied to a data processing system, and the data processing system comprises: a first processor, a second processor, the method comprising:

10. The method of claim 9, wherein for each neighboring node of the hub node, determining, by multiple rounds of node queries, other nodes in the original graph data that have a connection relationship with the neighboring node as associated nodes of the neighboring node, specifically includes:

11. The method of claim 10, wherein determining whether each target node has a neighboring node comprises:

12. The method of claim 9, wherein the second processor comprises: at least one processing unit;

13. The method of claim 12, wherein the data processing system further comprises: a task scheduler;

14. The method of claim 13, wherein the method further comprises:

15. The method of claim 13, wherein the method further comprises:

16. The method according to claim 9, wherein for each node included in the graph data block, updating the node characteristic representation of the node according to the node characteristic representation of each neighboring node of the node in the graph data block to obtain an updated characteristic representation of the node, specifically comprising:

17. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 9-16.

18. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 9-16 when executing the program.