CN107943918B - Operation system based on hierarchical large-scale graph data

Info

Publication number: CN107943918B
Application number: CN201711160660.6A
Authority: CN
Inventors: 姚伟强; 周基初; 张宇; 郑凯
Original assignee: Hefei Yamooc Information Technology Co ltd
Current assignee: Hefei Yamooc Information Technology Co ltd
Priority date: 2017-11-20
Filing date: 2017-11-20
Publication date: 2021-09-07
Anticipated expiration: 2037-11-20
Also published as: CN107943918A

Abstract

The invention discloses an operation system based on hierarchical large-scale graph data, comprising a graph data acquisition unit, a graph data analysis unit and a graph data management unit. The graph data analysis unit comprises a graph data segmentation module, a statistical module and a graph data merging module; the graph data management unit comprises an operation module, a comparison module and a warning module. The graph data acquisition unit acquires large-scale graph data, filters noise from the graph data by median filtering, and transmits the processed graph data to the graph data analysis unit and the graph data management unit. The system preprocesses the graph data, segments it according to adjacent nodes, and then integrates the segmented graph data. Boundary points of the preprocessed graph data are collected to obtain an original boundary, and the original boundary is compared with the integrated data to judge the accuracy of the segmented data, thereby ensuring the accuracy of the graph data.

Description

Operation system based on hierarchical large-scale graph data

Technical Field

The invention belongs to the field of large-scale graph data processing, and relates to an operation system based on hierarchical large-scale graph data.

Background

In the era of big data mining, graphs can directly describe many real-world applications in computer science, chemistry and bioinformatics, such as social networks, web page graphs, chemical compounds and biological structures, and can also express various data mining algorithms, such as matrix decomposition or shortest path. A graph comprises a plurality of nodes and edges connecting the nodes; graph data comprises node data of the nodes and edge data of the edges, and the edge data of one edge comprises a source node, a destination node and a weight of the edge. On a stand-alone graph computation platform (that is, a platform that performs graph computation on a single computer), the capacity of the computer's local memory is limited, so when the amount of graph data to be computed exceeds the memory capacity, the edge data in the graph data must be divided into a plurality of edge data blocks, where one edge data block includes one or more pieces of edge data.
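The edge data layout described above can be illustrated with a minimal sketch. The names `Edge` and `split_into_blocks`, and the choice to measure block size as an edge count rather than in bytes, are assumptions made for illustration and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edge:
    source: int       # ID of the source node
    destination: int  # ID of the destination node
    weight: float     # weight of the edge


def split_into_blocks(edges: List[Edge], max_edges_per_block: int) -> List[List[Edge]]:
    """Split edge data into consecutive blocks of at most max_edges_per_block edges."""
    if max_edges_per_block < 1:
        raise ValueError("max_edges_per_block must be at least 1")
    return [edges[i:i + max_edges_per_block]
            for i in range(0, len(edges), max_edges_per_block)]


# Hypothetical data: five edges split into blocks that each "fit in memory".
edges = [Edge(0, 1, 1.0), Edge(1, 2, 0.5), Edge(2, 0, 2.0), Edge(0, 2, 1.5), Edge(3, 1, 0.7)]
blocks = split_into_blocks(edges, max_edges_per_block=2)
```

Each block holds one or more edges, and concatenating the blocks in order reproduces the original edge list.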

At present, edge data in graph data is processed by a fixed method, so that when a computer calculates the node data of a node in an edge data block and the edge data related to that node cannot be acquired directly, the required edge data can be obtained only by adjusting the order of the edge data within the block. For example, GraphChi (a stand-alone graph computation platform) uses a computation mode centred on the destination node: the computer divides the edge data in the graph data into a plurality of edge data blocks (called shards in GraphChi) in ascending order of destination node ID (identifier), and places all edge data corresponding to the same destination node in one edge data block. However, different division rules yield different edge data blocks, so the accuracy of the data obtained in the final merge is low.
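The destination-ordered division described above can be sketched as follows. This is a simplified illustration of the rule as stated in this paragraph, not GraphChi's actual implementation; the function name `shard_by_destination` and the edge-count limit are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]  # (source, destination, weight)


def shard_by_destination(edges: List[Edge], max_edges_per_shard: int) -> List[List[Edge]]:
    """Group edges by ascending destination ID without splitting a destination across shards."""
    by_dest: Dict[int, List[Edge]] = defaultdict(list)
    for edge in edges:
        by_dest[edge[1]].append(edge)

    shards: List[List[Edge]] = []
    current: List[Edge] = []
    for dest in sorted(by_dest):
        group = by_dest[dest]
        # Close the current shard if this destination's edges would overflow it.
        # A single destination with more edges than the limit still gets one shard.
        if current and len(current) + len(group) > max_edges_per_shard:
            shards.append(current)
            current = []
        current.extend(group)
    if current:
        shards.append(current)
    return shards


# Hypothetical data: six edges over destinations 0, 1 and 2.
edges = [(0, 2, 1.0), (1, 0, 0.5), (2, 1, 2.0), (3, 2, 1.5), (0, 1, 0.7), (1, 2, 0.3)]
shards = shard_by_destination(edges, max_edges_per_shard=3)
```

With this input, destinations 0 and 1 share the first shard and all three edges into node 2 land in the second. A different limit or ordering rule would produce different blocks, which is the sensitivity the paragraph points out.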

Disclosure of Invention

The invention aims to provide an operation system based on hierarchical large-scale graph data. The system preprocesses graph data, divides it according to adjacent nodes, and then integrates the divided graph data; boundary points of the preprocessed graph data are collected to obtain an original boundary, and the original boundary is compared with the integrated data to judge the accuracy of the divided data, so as to ensure the accuracy of the graph data.

The purpose of the invention can be realized by the following technical scheme:

an operation system based on hierarchical large-scale graph data comprises a graph data acquisition unit, a graph data analysis unit and a graph data management unit;

the graph data acquisition unit is used for acquiring large-scale graph data, performing noise filtering on the graph data through median filtering, and transmitting the processed graph data to the graph data analysis unit and the graph data management unit;
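The text does not state which values of the graph data the median filter is applied to. As a minimal sketch, assuming the filter runs over a one-dimensional sequence of edge weights with a sliding window, the noise filtering step might look like this (`median_filter` and the window size are illustrative assumptions):

```python
from statistics import median
from typing import List


def median_filter(values: List[float], window: int = 3) -> List[float]:
    """Replace each value with the median of a sliding window centred on it."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    filtered = []
    for i in range(len(values)):
        # Windows are truncated at the ends of the sequence rather than padded.
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        filtered.append(median(values[lo:hi]))
    return filtered


# Hypothetical edge weights with one outlier (9.0) acting as noise.
weights = [1.0, 1.1, 9.0, 1.2, 1.0]
smoothed = median_filter(weights)
```

The outlier is replaced by a neighbouring value, while the remaining weights stay close to their originals.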

the graph data analysis unit is used for dividing the preprocessed graph data into different sub-data according to a rule, distributing the sub-data to corresponding computing nodes, then counting the results computed by each computing node, merging the counted results, and transmitting the data computed by each computing node and the merged data to the graph data management unit;

the graph data management unit performs a calculation directly on the preprocessed graph data and compares the result with the result obtained after segmentation and merging in the graph data analysis unit to determine their similarity; if the similarity is greater than 80%, the merged data is transmitted to a user; if the similarity is less than 80%, a warning is sent directly to the graph data analysis unit, and the graph data analysis unit segments the graph data again until the similarity between the result obtained after segmentation and merging and the result obtained by directly calculating the preprocessed graph data is greater than 80%.
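The accept-or-resegment loop described above can be sketched as follows. The text does not define how similarity is measured, so the Jaccard index used in the usage example is an assumption, as are the names `segment_until_similar`, `segment_and_merge` and the `max_attempts` guard (the text itself simply repeats until the threshold is exceeded).

```python
from typing import Callable, Optional, TypeVar

R = TypeVar("R")


def segment_until_similar(
    direct_result: R,
    segment_and_merge: Callable[[int], R],
    similarity: Callable[[R, R], float],
    threshold: float = 0.80,
    max_attempts: int = 10,
) -> Optional[R]:
    """Repeat segmentation until the merged result is similar enough to the direct result."""
    for attempt in range(max_attempts):
        merged = segment_and_merge(attempt)
        if similarity(direct_result, merged) > threshold:
            return merged  # accepted: this result would be transmitted to the user
        # Otherwise a warning would be issued and the graph data segmented again.
    return None  # no acceptable segmentation found within max_attempts


def jaccard(a: set, b: set) -> float:
    """Assumed similarity measure: size of intersection over size of union."""
    return len(a & b) / len(a | b)


# Hypothetical results: the third segmentation attempt is close enough.
direct = {1, 2, 3, 4, 5}
candidates = [{1, 2, 9}, {1, 2, 3, 4, 6}, {1, 2, 3, 4, 5, 6}]
result = segment_until_similar(direct, lambda attempt: candidates[attempt], jaccard, max_attempts=3)
```

The first two candidates are rejected because their similarity to the direct result does not exceed 80%; the third is returned.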

Furthermore, the graph data analysis unit comprises a graph data segmentation module, a statistical module and a graph data merging module. The graph data segmentation module segments the graph nodes that are connected in pairs in the preprocessed graph data: starting from a boundary of the graph data, and according to the total number of pairwise nodes, it takes a certain number of adjacent pairwise nodes at a time as one piece of sub-graph data, wherein boundary nodes can be formed between the pairwise nodes of each piece of sub-graph data, and the boundary points are connected to form a super edge. The statistical module performs statistical random integration on the plurality of super edges obtained by segmentation; and the graph data merging module merges the nodes of the super edges of the randomly integrated sub-graph data to form a total super edge, so as to obtain the calculated data.
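The text leaves the construction of boundary nodes and super edges open. One possible reading, offered only as a sketch, treats a boundary node of a sub-graph as a node that also belongs to a node pair outside that sub-graph, and a super edge as the set of such nodes. The function names are hypothetical, and the random integration step of the statistical module is omitted.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

NodePair = Tuple[int, int]


def boundary_nodes(subgraph: Iterable[NodePair], all_pairs: Iterable[NodePair]) -> FrozenSet[int]:
    """Nodes of `subgraph` that are also endpoints of a pair outside it (assumed definition)."""
    inside_pairs = set(subgraph)
    inside_nodes = {n for pair in inside_pairs for n in pair}
    outside_nodes = {n for pair in all_pairs if pair not in inside_pairs for n in pair}
    return frozenset(inside_nodes & outside_nodes)


def merge_super_edges(super_edges: Iterable[FrozenSet[int]]) -> Set[int]:
    """Union the per-sub-graph super edges into one total super edge."""
    total: Set[int] = set()
    for edge in super_edges:
        total |= edge
    return total


# Hypothetical path graph 0-1-2-3-4-5-6 cut into three sub-graphs of two pairs each.
pairs: List[NodePair] = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
subgraphs = [pairs[0:2], pairs[2:4], pairs[4:6]]
super_edges = [boundary_nodes(sg, pairs) for sg in subgraphs]
total_super_edge = merge_super_edges(super_edges)
```

Here nodes 2 and 4 sit on the cuts between sub-graphs, so they make up the merged total super edge.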

Further, the number of pairwise nodes in each sub-graph data is 10%-20% of the total number of pairwise nodes.
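The sizing rule can be illustrated with a short sketch. It assumes that the pairwise nodes are held as an ordered list of node pairs and that sub-graphs are taken as consecutive runs of that list; `split_pairs` and the integer `percent` parameter are illustrative names.

```python
from typing import List, Tuple

NodePair = Tuple[int, int]


def split_pairs(pairs: List[NodePair], percent: int = 10) -> List[List[NodePair]]:
    """Split node pairs into consecutive sub-graphs of about `percent` of the total each."""
    if not 10 <= percent <= 20:
        raise ValueError("percent must lie in the 10-20 range")
    # Integer arithmetic avoids floating-point rounding; small inputs get at least one pair.
    size = max(1, len(pairs) * percent // 100)
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]


# Hypothetical graph with 40 connected node pairs.
pairs = [(i, i + 1) for i in range(40)]
subgraphs_20 = split_pairs(pairs, percent=20)
subgraphs_10 = split_pairs(pairs, percent=10)
```

At 20% the 40 pairs yield 5 sub-graphs of 8 pairs, and at 10% they yield 10 sub-graphs of 4 pairs, so the rule implies between 5 and 10 sub-graphs when the total divides evenly. When it does not, the last sub-graph in this sketch is smaller than the others.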

Furthermore, the graph data management unit comprises an operation module, a comparison module and a warning module. The operation module extracts the boundary graph nodes of the preprocessed graph data to obtain an original boundary. The comparison module compares the total super edges obtained by merging in the graph data merging module with the original boundary: when the coincidence rate between the boundary nodes of a total super edge and the original boundary nodes exceeds 80%, the merged calculation result is transmitted to a user; if the coincidence rate is less than 80%, the warning module sends a warning to the graph data segmentation module, and the graph data segmentation module reselects boundary points and segments the graph data again until the final comparison result exceeds 80%.
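The text does not give a formula for the coincidence rate. A minimal sketch, assuming it is the fraction of original boundary nodes that also appear among the boundary nodes of the total super edge, is shown below; `coincidence_rate` is a hypothetical name.

```python
from typing import Set


def coincidence_rate(original_boundary: Set[int], merged_boundary: Set[int]) -> float:
    """Fraction of original boundary nodes also present in the merged boundary (assumed definition)."""
    if not original_boundary:
        raise ValueError("original boundary must not be empty")
    return len(original_boundary & merged_boundary) / len(original_boundary)


# Hypothetical boundaries: nine of the ten original boundary nodes are recovered.
original = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
merged = {1, 2, 3, 4, 5, 6, 7, 8, 9, 42}
rate = coincidence_rate(original, merged)
accepted = rate > 0.80  # above the threshold, so the merged result would go to the user
```

A rate at or below the threshold would instead trigger the warning module and a new segmentation with different boundary points.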

The invention has the beneficial effects that:

the system preprocesses the graph data, segments it according to adjacent nodes, and then integrates the segmented graph data; boundary points of the preprocessed graph data are collected to obtain an original boundary, and the original boundary is compared with the integrated data to judge the accuracy of the segmented data, so as to ensure the accuracy of the graph data.

Drawings

In order to facilitate understanding for those skilled in the art, the present invention will be further described with reference to the accompanying drawings.

FIG. 1 is a diagram of a data calculation system according to the present invention.

Detailed Description

As shown in FIG. 1, an operation system based on hierarchical large-scale graph data comprises a graph data acquisition unit, a graph data analysis unit and a graph data management unit;

the graph data analysis unit is used for dividing the preprocessed graph data into different sub-data according to a rule, distributing the sub-data to corresponding computing nodes, then counting the results computed by each computing node, merging the counted results, and transmitting the data computed by each computing node and the merged data to the graph data management unit; the graph data analysis unit comprises a graph data segmentation module, a statistical module and a graph data merging module; the graph data segmentation module segments the graph nodes that are connected in pairs in the preprocessed graph data: starting from a boundary of the graph data, and according to the total number of pairwise nodes, it takes a certain number of adjacent pairwise nodes at a time as one piece of sub-graph data, wherein the number of pairwise nodes in each piece of sub-graph data is 10%-20% of the total number of pairwise nodes, boundary nodes can be formed between the pairwise nodes of each piece of sub-graph data, and the boundary points are connected to form a super edge; the statistical module performs statistical random integration on the plurality of super edges obtained by segmentation; the graph data merging module merges the nodes of the super edges of the randomly integrated sub-graph data to form a total super edge, so as to obtain the calculated data;

the graph data management unit performs a calculation directly on the preprocessed graph data and compares the result with the result obtained after segmentation and merging in the graph data analysis unit to determine their similarity; if the similarity is greater than 80%, the merged data is transmitted to a user; if the similarity is less than 80%, a warning is sent directly to the graph data analysis unit, and the graph data analysis unit segments the graph data again until the similarity between the result obtained after segmentation and merging and the result obtained by directly calculating the preprocessed graph data is greater than 80%; the graph data management unit comprises an operation module, a comparison module and a warning module, wherein the operation module extracts the boundary graph nodes of the preprocessed graph data to obtain an original boundary; the comparison module compares the total super edges obtained by merging in the graph data merging module with the original boundary: when the coincidence rate between the boundary nodes of a total super edge and the original boundary nodes exceeds 80%, the merged calculation result is transmitted to a user; if the coincidence rate is less than 80%, the warning module sends a warning to the graph data segmentation module, and the graph data segmentation module reselects boundary points and segments the graph data again until the final comparison result exceeds 80%.

The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. An operation system based on hierarchical large-scale graph data is characterized by comprising a graph data acquisition unit, a graph data analysis unit and a graph data management unit;

the graph data management unit calculates the preprocessed graph data, compares the calculation result with the calculation result obtained by dividing and combining in the graph data analysis unit to determine the similarity, if the similarity is greater than 80%, the combined data is transmitted to a user, if the similarity is less than 80%, a warning is directly sent to the graph data analysis unit, and the graph data analysis unit divides the graph data again until the similarity between the calculation result obtained by dividing and combining and the calculation result obtained by directly calculating the preprocessed graph data is greater than 80%;

the graph data analysis unit comprises a graph data segmentation module, a statistical module and a graph data merging module; the graph data segmentation module segments the graph nodes that are connected in pairs in the preprocessed graph data, and, starting from a boundary of the graph data and according to the total number of pairwise nodes, takes adjacent pairwise nodes one by one as sub-graph data, wherein boundary nodes can be formed between the pairwise nodes of each sub-graph data, and the boundary points are connected to form a super edge; the statistical module is used for performing statistical random integration on the plurality of super edges obtained by segmentation; the graph data merging module merges the nodes of the super edges of the randomly integrated sub-graph data to form a total super edge, so as to obtain the calculated data;

the number of pairwise nodes in each sub-graph data is 10%-20% of the total number of pairwise nodes.

2. The operation system based on hierarchical large-scale graph data according to claim 1, wherein the graph data management unit comprises an operation module, a comparison module and a warning module, wherein the operation module extracts the boundary graph nodes of the preprocessed graph data to obtain an original boundary; the comparison module compares the boundary nodes of the total super edge with the original boundary nodes, and when the coincidence rate between the boundary nodes of the total super edge and the original boundary nodes exceeds 80%, the merged calculation result is transmitted to a user; if the coincidence rate is less than 80%, the warning module sends a warning to the graph data segmentation module, and the graph data segmentation module reselects boundary nodes and segments the graph data again until the final comparison result exceeds 80%.