WO2017076296A1 - Method and apparatus for processing graph data - Google Patents

Method and apparatus for processing graph data

Info

Publication number
WO2017076296A1
Authority
WO
WIPO (PCT)
Prior art keywords
data, mapreduce, sub, graph, round
Application number
PCT/CN2016/104370
Other languages
English (en)
French (fr)
Inventor
林学练
郑金龙
马帅
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2017076296A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Definitions

  • Embodiments of the present invention relate to the field of data processing, and in particular to a method and apparatus for processing graph data.
  • A graph is an abstract data structure that can describe rich information and the dependencies between pieces of information. Many existing algorithms are based on graph data, such as the shortest-path algorithm, graph simulation algorithms, web page ranking algorithms, and breadth-first search.
  • Graph data and its related algorithms are applied ubiquitously, for example in social network analysis, semantic web analysis, bioinformatics, and traffic navigation.
  • The prior art mainly uses a MapReduce system to store and process graph data. Specifically, the MapReduce system generally uses a distributed file system (DFS) to store the graph data. When the graph data needs to be processed, the master node of the MapReduce system generally schedules the compute nodes of the entire system (Map compute nodes and Reduce compute nodes) to perform multiple rounds of MapReduce jobs on the graph data to obtain the processing result of the graph data.
  • Embodiments of the present invention provide a method and apparatus for processing graph data, to improve the processing efficiency of graph data.
  • In a first aspect, an embodiment of the present invention provides a method for processing graph data. The method includes: determining graph data to be processed, where the graph corresponding to the graph data is divided into a plurality of subgraphs; and scheduling compute nodes in a MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain a processing result of the graph data, where each Map compute node in a MapReduce job is configured to process the interconnected vertices within one subgraph of the plurality of subgraphs.
  • In a first possible implementation, the plurality of subgraphs includes m subgraphs, and the graph data is stored in a distributed file system (DFS). The DFS includes m first files in one-to-one correspondence with the m subgraphs, and m second files in one-to-one correspondence with the m subgraphs, where the m first files are respectively used to store the subgraph data corresponding to the m subgraphs, and the m second files are respectively used to store the message data corresponding to the processed vertices in the m subgraphs.
  • Scheduling the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data includes: allocating the subgraphs to be processed for each round of the multiple rounds of MapReduce jobs; selecting, according to the subgraphs to be processed, the input data of each round from the m first files and the m second files, where the input data includes the subgraph data corresponding to the subgraphs to be processed and the message data obtained by the previous round of MapReduce jobs; and performing each round of MapReduce jobs according to the input data.
  • In a second possible implementation, performing each round of MapReduce jobs according to the input data includes: allocating computing tasks to the Map compute nodes and the Reduce compute nodes of each round according to the input data; and controlling the Reduce compute nodes in each round to store the message data obtained by processing into the m second files.
  • In a further possible implementation, each Map compute node in the MapReduce job processes the interconnected vertices according to the breadth-first search (BFS) algorithm.
  • In a second aspect, an embodiment of the present invention provides an apparatus for processing graph data. The apparatus includes: a determining module, configured to determine graph data to be processed, where the graph corresponding to the graph data is divided into a plurality of subgraphs; and a scheduling module, configured to schedule compute nodes in a MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain a processing result of the graph data, where each Map compute node in a MapReduce job is used to process the interconnected vertices within one subgraph of the plurality of subgraphs.
  • In a first possible implementation, the plurality of subgraphs includes m subgraphs, and the graph data is stored in a distributed file system (DFS). The DFS includes m first files in one-to-one correspondence with the m subgraphs, and m second files in one-to-one correspondence with the m subgraphs, where the m first files are respectively used to store the subgraph data corresponding to the m subgraphs, and the m second files are respectively used to store the message data corresponding to the processed vertices in the m subgraphs.
  • The scheduling module is specifically configured to: allocate the subgraphs to be processed for each round of the multiple rounds of MapReduce jobs; select, according to the subgraphs to be processed, the input data of each round from the m first files and the m second files, where the input data includes the subgraph data corresponding to the subgraphs to be processed and the message data obtained by the previous round of MapReduce jobs; and perform each round of MapReduce jobs according to the input data.
  • In a second possible implementation, the scheduling module is specifically configured to: allocate computing tasks to the Map compute nodes and the Reduce compute nodes of each round according to the input data; and control the Reduce compute nodes in each round to store the message data obtained by processing into the m second files.
  • In a further possible implementation, each Map compute node in the MapReduce job processes the interconnected vertices according to the breadth-first search (BFS) algorithm.
  • In the embodiments of the present invention, the graph corresponding to the graph data to be processed is first divided into a plurality of subgraphs; then, in each round of MapReduce jobs, each Map compute node processes the connected vertices inside one of the plurality of subgraphs at a time, so that each round of MapReduce jobs handles as many vertices as possible. This reduces the number of rounds of MapReduce jobs and improves the processing efficiency of the graph data.
  • FIG. 1 is a schematic block diagram of a MapReduce system for a method of processing graph data according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a method of processing graph data according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of a MapReduce job according to another embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the division of a graph according to another embodiment of the present invention.
  • FIG. 5 is a schematic flowchart of a method of processing graph data according to another embodiment of the present invention.
  • FIG. 6 is a schematic flowchart of a method of processing graph data according to another embodiment of the present invention.
  • FIG. 7 is a schematic block diagram of an apparatus for processing graph data according to still another embodiment of the present invention.
  • FIG. 8 is a schematic block diagram of an apparatus for processing graph data according to still another embodiment of the present invention.
  • FIG. 1 is a schematic block diagram of a MapReduce system to which the method of processing graph data of an embodiment of the present invention can be applied. As shown in FIG. 1, the system can include a DFS, Map compute nodes, and Reduce compute nodes. Processing graph data generally requires traversing the graph in some way, so processing a complete set of graph data usually requires multiple rounds of MapReduce jobs.
  • The Map compute nodes include at least one Map compute node, corresponding to the Map phase; the Reduce compute nodes include at least one Reduce compute node, corresponding to the Reduce phase.
  • In the Map phase, the Map compute nodes process the input data to obtain intermediate calculation results or message data. In the Reduce phase, the Reduce compute nodes perform a reduce operation on the input data to obtain the reduced message data, which is saved in the DFS.
  • Between the Map phase and the Reduce phase there may be a shuffle phase: during the Shuffle process, the intermediate calculation results are taken from disk and, after being merged and sorted, are transmitted to the Reduce compute nodes as the input data of the Reduce phase.
  • The method of processing graph data may be performed by a master device. The master device is responsible for scheduling all working devices and allocating computing tasks during data processing. For example, the master device can schedule the Map compute nodes and the Reduce compute nodes and control their task allocation, control the Map compute nodes to read the required data from the DFS, or control the Reduce compute nodes to store the processed message data in the DFS.
  • The method and apparatus for processing graph data in the embodiments of the present invention may be applied to graph algorithms such as a shortest-path algorithm, a graph simulation algorithm, a strong simulation algorithm, a web page ranking algorithm, or breadth-first search (BFS), and are not limited thereto.
  • The distributed file system (DFS) in the embodiments of the present invention may be a Hadoop Distributed File System (HDFS), a Network File System (NFS), a Google File System (GFS), or any other distributed file system; the present invention is not limited in this respect.
  • FIG. 2 is a schematic flowchart of a method 200 of processing graph data in accordance with an embodiment of the present invention. As shown in FIG. 2, the method 200 includes:
  • S210: Determine graph data to be processed, where the graph corresponding to the graph data is divided into a plurality of subgraphs.
  • S220: Schedule the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain the processing result of the graph data, where each Map compute node in a MapReduce job is used to process the interconnected vertices within one subgraph of the plurality of subgraphs.
  • Optionally, the graph data to be processed may be located in the DFS, and determining the graph data to be processed may include determining the storage location of the graph data in the DFS, so that the compute nodes in the MapReduce system can be controlled to obtain the data needed for the MapReduce jobs from that storage location.
  • The graph corresponding to the graph data is divided into a plurality of subgraphs; each subgraph may correspond to one piece of subgraph data, and the graph data may include the subgraph data corresponding to each of the plurality of subgraphs. The subgraph data corresponding to each subgraph may include information about the vertices in the subgraph and information about their connection relationships, where the latter may include the connection relationships between vertices within the subgraph as well as the connection relationships from vertices in the subgraph to vertices in other subgraphs.
  • Each Map compute node in each round of MapReduce jobs processes the interconnected vertices within one subgraph; in other words, each Map compute node processes one or more groups of interconnected vertices within a subgraph. The input data of each Map compute node may include the subgraph data corresponding to one subgraph, and each Map compute node may be used to process the subgraph data corresponding to one subgraph.
  • In each round of MapReduce jobs, the connection relationships between the internal vertices of each subgraph can be exploited by placing connected vertices in the same Map compute node for processing. In the prior art, each Map compute node in each round of MapReduce jobs processes isolated vertices; the connection relationships between vertices are neither considered nor used, which is a vertex-centric processing method. In the embodiments of the present invention, by contrast, each Map compute node in each round of MapReduce jobs uses the connection relationships between the internal vertices of a subgraph and treats the connected vertices in the subgraph data as a single computation object, which is a subgraph-centric processing method.
  • For example, when a Map compute node processes a vertex in a subgraph and a vertex connected to it also belongs to that subgraph, the connected vertex can be processed in the same round of MapReduce jobs. Because the subgraph is the computation object, message data can be passed in multiple steps between vertices inside the same subgraph; therefore, in one round of MapReduce jobs, the interconnected vertices within the same subgraph can be processed simultaneously, which reduces the number of rounds needed to process the graph data. Expanding the computation granularity to the whole subgraph improves the computational speed and efficiency of data processing and reduces resource and time overhead.
  • Optionally, as an embodiment, the vertices of a subgraph may be divided into two types: internal vertices and boundary vertices. An internal vertex is a vertex all of whose connected vertices belong to the same subgraph. A boundary vertex is a vertex at least one of whose connected vertices does not belong to the subgraph in which the boundary vertex is located.
  • In the embodiments of the present invention, the graph corresponding to the graph data to be processed is divided into a plurality of subgraphs, and the Map compute nodes in the MapReduce jobs take a subgraph as the computation object, each time processing the connected vertices within one subgraph. This makes full use of the connection relationships of the vertices in the subgraph, so that each round of MapReduce jobs processes as many vertices as possible, thereby reducing the number of rounds of MapReduce jobs required to process the graph data and improving the processing efficiency of the graph data.
  • As shown in FIG. 1, between the Map phase and the Reduce phase there may be a shuffle phase, in which the intermediate calculation results are taken from disk and, after merging and sorting, transferred to the Reduce compute nodes. The graph data is invariant during the execution of a graph algorithm, that is, the graph data remains unchanged during each round of MapReduce jobs; its volume is usually relatively large, and it is used in every round. The message data, by contrast, is the changing data, and its volume is usually relatively small. The prior art does not distinguish the graph data from the message data during MapReduce jobs, so the graph data is repeatedly processed and shuffled in every round; this repeated reading, writing, and network transfer of the graph data causes great overhead and severely degrades the processing efficiency of the graph data.
  • Optionally, as an embodiment, the plurality of subgraphs may include m subgraphs, and the graph data is stored in a distributed file system (DFS), where the DFS includes m first files in one-to-one correspondence with the m subgraphs and m second files in one-to-one correspondence with the m subgraphs. The m first files are respectively used to store the subgraph data corresponding to the m subgraphs, and the m second files are respectively used to store the message data corresponding to the processed vertices in the m subgraphs.
  • Optionally, scheduling the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data includes: allocating the subgraphs to be processed for each round of the multiple rounds of MapReduce jobs; selecting, according to the subgraphs to be processed, the input data of each round from the m first files and the m second files, where the input data includes the subgraph data corresponding to the subgraphs to be processed and the message data obtained by the previous round of MapReduce jobs; and performing each round of MapReduce jobs according to the input data.
  • Because the graph data and the message data are processed separately, the graph data that remains unchanged throughout the MapReduce jobs is extracted and stored separately in the DFS, and the message data generated by each round of MapReduce jobs is also saved in the DFS at the location corresponding to the graph data. At the beginning of each round, the required graph data and message data are read from the DFS as the input data of the current round. During each round, a Map compute node does not need to transfer the graph data to other compute nodes after processing the subgraph data, so the graph data need not be transferred during the Shuffle process. This reduces the I/O overhead of the graph data during computation and the communication overhead during the Shuffle process, and speeds up the processing of the graph data.
  • For example, FIG. 3 shows a flowchart of a MapReduce job according to another embodiment of the present invention. The graph data is divided into a specified number of pieces of subgraph data and then saved in the DFS. The message data can be the result of each round of MapReduce jobs, and the message data and the subgraph data are in one-to-one correspondence. The message data is also stored in the DFS after being processed by the Reduce compute nodes. At the beginning of each round, the Map nodes read the required message data and graph data from the DFS and combine them as the input data of the Map compute nodes in the current round.
  • Specifically, the message data, also referred to as messages, can be the result of each round of MapReduce processing. The subgraphs and the message data have a one-to-one correspondence, and the message data can be combined with the subgraph data as the input data of each round of MapReduce jobs.
  • For example, in the DFS, the files storing the m pieces of subgraph data may be named Gi (i∈[1,m]), and the files storing the message data corresponding to the m subgraphs may be named Mi (i∈[1,m]); the Gi files are in one-to-one correspondence with the Mi files. According to the file naming rules of the DFS, the corresponding Gi and Mi can be given the same file name. When a compute node in a MapReduce job needs to read input data, it can override MapReduce's CombineFileInputFormat class so that the Gi and Mi with the same file name are logically merged into one file, which serves as the input of the Map compute node.
  • Optionally, performing each round of MapReduce jobs according to the input data includes: allocating computing tasks to the Map compute nodes and the Reduce compute nodes of each round according to the input data; and controlling the Reduce compute nodes in each round to store the message data obtained by processing into the m second files.
  • Optionally, in each round, after the Reduce compute nodes obtain the message data of the current round, the master device controls the Reduce compute nodes to store the message data in the corresponding m second files that are in one-to-one correspondence with the m subgraphs, so that the required input data can be read from the m second files at the start of the next round. The graph data and the message data can thus be processed separately, which reduces the I/O overhead of the graph data during computation and the communication overhead during the Shuffle process, and speeds up the processing of the graph data.
  • A specific embodiment of the method of processing graph data has been described above in connection with FIG. 1 to FIG. 3. A specific implementation of the method, taking BFS as an example, is described below with reference to FIG. 4 and FIG. 5.
  • As shown in FIG. 4, the graph G to be processed is first divided into three subgraphs G1, G2, and G3, where V1=[1,2], V2=[3,4], and V3=[5,6]. The vertices marked with dashed lines in each subgraph indicate vertices that do not belong to that subgraph but are connected to it by edges.
  • Taking vertex 3 as the source, the intermediate results of the computation are passed along the edges to the adjacent vertices in the form of messages until all reachable vertices have been processed. As can be seen from FIG. 4, in graph G, vertex 3 is the source, that is, the starting vertex; the outgoing messages of vertex 3 go to vertex 1 and vertex 4, those of vertex 1 go to vertex 2 and vertex 5, those of vertex 4 go to vertex 1 and vertex 5, that of vertex 2 goes to vertex 6, and those of vertex 5 go to vertex 2 and vertex 6.
  • Specifically, FIG. 5 illustrates the MapReduce job process in which graph G is processed according to the embodiment of the present invention, where the vertices marked with dashed lines represent the starting vertices of the next round of MapReduce jobs and the vertices marked in gray represent vertices that have already been processed. In the first round, the Map compute node takes subgraph G2 as the computation object; since vertex 3 and vertex 4 are connected and belong to the same subgraph, the data produced by processing vertex 3 is passed to vertex 4, and vertex 3 and vertex 4 in G2 can be processed in the same round to obtain the message data of the first round.
  • Meanwhile, because vertex 4 is connected to vertex 1 in subgraph G1 and to vertex 5 in subgraph G3, the message data of vertex 4 obtained in the current round is transmitted to vertex 1 and vertex 5 for the next round. In the second round, because vertex 2 is connected to vertex 1 in subgraph G1 and vertex 5 is connected to vertex 6 in subgraph G3, the Map compute nodes take subgraphs G1 and G3 as computation objects and process vertices 1 and 2 in G1 and vertices 5 and 6 in G3, obtaining the message data of the second round. All the vertices in graph G are processed after two rounds of MapReduce jobs.
  • In the prior art, the graph is not divided into subgraphs; instead, a vertex-centric processing method is adopted. FIG. 6 shows the MapReduce job process for processing graph G in the prior art, where the vertices marked with dashed lines represent the starting vertices of the next round and the vertices marked in gray represent vertices that have already been processed. As shown in FIG. 6, with vertices as the computation objects, four rounds of MapReduce jobs are needed to process all the vertices in graph G.
  • This specific embodiment shows that, compared with the prior art, the method of processing graph data in the embodiments of the present invention significantly reduces the number of MapReduce rounds and thereby improves the processing efficiency of the graph data.
  • Optionally, under the MapReduce programming framework, the subgraph-centric computation model can be implemented by rewriting the setup(), Map(), and clean() functions of the Mapper class in the Map stage. The setup() function performs initialization before the Map starts, while the clean() function does the finishing work after the Map computation is completed, and the setup() and clean() functions are executed only once during the Map phase. Therefore, first, the setup() function initializes a HashMap structure to hold the entire subgraph; after that, the Map() function reads the vertex data record by record and maps it into the HashMap structure; finally, in the clean() function, custom computations can be performed as needed on the entire subgraph saved in the HashMap. Taking BFS as an example, the key pseudocode of the Map stage can be as follows.
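        class Mapper
            method setup()
                new HashMap(nid n, node nd)
            method Map(nid n, node nd)
                HashMap.put(n, nd)
            method clean()
                bfs(HashMap)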
  • It should be understood that, when the graph corresponding to the graph data is divided into a plurality of subgraphs, a hash method may be used to divide the subgraphs during the implementation of the MapReduce jobs. However, the MapReduce distributed computing framework was not designed with the internal relationships of graph data in mind, so hash-based division does not take the connection relationships of the vertices inside a subgraph into account. If, while ensuring load balancing, the vertices connected by edges are allocated to the same subgraph as far as possible and the number of edges crossing subgraphs is minimized, more vertices of the same subgraph can be processed simultaneously in one round of MapReduce jobs, which reduces the number of rounds required and improves the processing efficiency of the graph data. In other words, the locality of the graph data can be fully considered, and the subgraphs can be divided according to the characteristics of the graph data in practical applications. For example, in the graph corresponding to a traffic network, the numbers of adjacent vertices differ only slightly; subgraphs can therefore be divided according to the order of the vertex numbers, such as 1 to 1000, 1001 to 2000, and so on, each range being stored in the subgraph data of the same subgraph.
  • Optionally, the graph may be divided into the plurality of subgraphs according to the formula gr = (nid * m)/N, where vertices with the same gr value are divided into the same subgraph, nid is the number of a vertex in the graph, m is the number of subgraphs, and N is the number of vertices in the graph. The key pseudocode implementing this in the MapReduce system can be as follows.
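        class Mapper
            method Map()
                gr ← (n * m) / N
                EMIT(nid n, gr)
        class Partitioner
            method getPartition(nid n, node nd)
                return gr
        class Reducer
            method Reduce(nid n, [gr])
                for all gr ∈ [gr]
                    EMIT(nid n, node nd)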
  • For a traffic network graph, the division can also be based on GIS location information; for example, the traffic network of a city or a province can be taken as one subgraph according to actual needs. In this case, computing the gr value in the Map function requires parsing the GIS data and extracting the location information. The key pseudocode implementing this in the MapReduce system can be as follows.
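        class Mapper
            method Map(nid n, node nd)
                gr ← nd.GIS.location
                EMIT(nid n, gr)
        class Partitioner
            method Partition(nid n, gr)
                return gr
        class Reducer
            method Reduce(nid n, [gr])
                for all gr ∈ [gr]
                    EMIT(nid n, node nd)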
  • A corresponding method of dividing subgraphs can also be used for social networks. The public information provided by users when registering on a social networking site, such as their city, employer, or school, can be used as the basis for dividing the subgraphs. When implemented on the MapReduce system architecture, the gr value in the Map function can simply be assigned as needed.
  • By dividing the vertices connected by edges into the same subgraph as far as possible while weakening the coupling between subgraphs, the number of MapReduce rounds required to process the graph data can be further reduced, and the processing speed and computational efficiency of the graph data can be improved.
  • In the embodiments of the present invention, on the one hand, the graph corresponding to the graph data to be processed is divided into a plurality of subgraphs, and the Map compute nodes in the MapReduce jobs take a subgraph as the computation object, each time processing the connected vertices within one subgraph. This makes full use of the connection relationships of the vertices in the subgraph, allowing each round of MapReduce jobs to process as many vertices as possible, thereby reducing the number of rounds required to process the graph data and improving its processing efficiency. On the other hand, because the graph data and the message data are processed separately, the graph data that remains unchanged throughout the MapReduce jobs is extracted and stored separately in the DFS, and the message data generated by each round is also stored in the DFS at the location corresponding to the graph data. At the beginning of each round, the required graph data and message data are read from the DFS as the input data of the current round; during each round, a Map compute node does not need to transfer the graph data to other compute nodes after processing the subgraph data, so the graph data need not be transferred in the Shuffle process, which reduces the I/O overhead during computation and the communication overhead during the Shuffle process and speeds up the processing of the graph data. Furthermore, the method of dividing subgraphs adopted by the embodiments of the present invention analyzes the characteristics of the graph data in practical applications and, under the premise of load balancing, divides the vertices connected by edges into the same subgraph as far as possible, which can further reduce the number of MapReduce rounds required and improve the processing speed and computational efficiency of the graph data.
  • Specific embodiments of the method for processing graph data according to an embodiment of the present invention have been described in detail above with reference to FIG. 1 to FIG. 6. An apparatus for processing graph data according to an embodiment of the present invention is described in detail below with reference to FIG. 7 and FIG. 8.
  • FIG. 7 is a schematic diagram of an apparatus 700 for processing graph data according to an embodiment of the present invention. It should be understood that the apparatus 700 may correspond to the master device in the method embodiments of the present invention, and the following and other operations and/or functions of its modules implement the corresponding procedures of the methods in FIG. 2 to FIG. 6; for brevity, details are not repeated here.
  • The apparatus 700 includes: a determining module 710, configured to determine the graph data to be processed, where the graph corresponding to the graph data is divided into a plurality of subgraphs; and a scheduling module 720, configured to schedule the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain the processing result of the graph data, where each Map compute node in a MapReduce job is used to process the interconnected vertices within one subgraph of the plurality of subgraphs.
  • The Map compute nodes in the MapReduce jobs take a subgraph as the computation object and each time process the connected vertices within one subgraph, making full use of the connection relationships of the vertices in the subgraph, so that each round of MapReduce jobs processes as many vertices as possible, thereby reducing the number of rounds of MapReduce jobs required to process the graph data and improving the processing efficiency of the graph data.
  • The graph corresponding to the graph data to be processed, determined by the determining module 710 of the apparatus 700, is divided into a plurality of subgraphs. Optionally, the plurality of subgraphs includes m subgraphs, and the graph data is stored in a distributed file system (DFS); the DFS includes m first files in one-to-one correspondence with the m subgraphs and m second files in one-to-one correspondence with the m subgraphs, where the m first files are respectively used to store the subgraph data corresponding to the m subgraphs and the m second files are respectively used to store the message data corresponding to the processed vertices in the m subgraphs.
  • Optionally, the scheduling module 720 of the apparatus 700 is specifically configured to: allocate the subgraphs to be processed for each round of the multiple rounds of MapReduce jobs; select, according to the subgraphs to be processed, the input data of each round from the m first files and the m second files, where the input data includes the subgraph data corresponding to the subgraphs to be processed and the message data obtained by the previous round of MapReduce jobs; and perform each round of MapReduce jobs according to the input data.
  • Because the graph data and the message data are processed separately, the graph data that remains unchanged throughout the MapReduce jobs is extracted and stored separately in the DFS, and the message data generated by each round is also saved in the DFS at the location corresponding to the graph data. At the beginning of each round, the required graph data and message data are read from the DFS as the input data of the current round; during each round, a Map compute node does not need to transfer the graph data to other compute nodes after processing the subgraph data, so the graph data need not be transferred in the Shuffle process, which reduces the I/O overhead during computation and the communication overhead during the Shuffle process and speeds up the processing of the graph data.
  • the scheduling module 720 is specifically configured to: according to the input data, allocate a computing task to the Map computing node and the Reduce computing node of each round of the MapReduce job; and control the processed by the Reduce computing node in each round of the MapReduce job.
  • the message data is stored in the second file of the m.
  • By dividing the vertices connected by edges into the same subgraph as far as possible while weakening the coupling between subgraphs, the number of MapReduce rounds required to process the graph data can be further reduced and the processing efficiency of the graph data improved.
  • FIG. 8 shows an apparatus 800 for processing graph data according to another embodiment of the present invention. The apparatus 800 includes a processor 810, a memory 820, and a bus system 830. The apparatus 800 is connected to the compute nodes in the MapReduce system through the bus system 830, and the processor 810 and the memory 820 are connected through the bus system 830. The memory 820 is used to store instructions, and the processor 810 is configured to execute the instructions stored in the memory 820, so that the processor 810 controls the MapReduce jobs performed by the compute nodes in the MapReduce system.
  • The processor 810 is configured to: determine the graph data to be processed, where the graph corresponding to the graph data is divided into a plurality of subgraphs; and schedule the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain the processing result of the graph data, where each Map compute node of the at least one Map compute node in a MapReduce job is configured to process the interconnected vertices within one subgraph of the plurality of subgraphs.
  • The graph corresponding to the graph data to be processed is divided into a plurality of subgraphs, and the Map compute nodes in the MapReduce jobs take a subgraph as the computation object, each time processing the connected vertices within one subgraph, making full use of the connection relationships of the vertices in the subgraph, so that each round of MapReduce jobs processes as many vertices as possible, thereby reducing the number of MapReduce rounds required to process the graph data and improving the processing efficiency of the graph data.
  • It should be understood that the processor 810 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • The memory 820 can include read-only memory and random access memory and provides instructions and data to the processor 810. A portion of the memory 820 may also include non-volatile random access memory; for example, the memory 820 can also store device-type information.
  • In addition to a data bus, the bus system 830 may include a power bus, a control bus, a status signal bus, and the like. The bus system 830 may also include an internal bus, a system bus, and an external bus. For clarity of description, the various buses are all labeled as the bus system 830 in the figure.
  • During implementation, each step of the foregoing method may be completed by an integrated logic circuit of the hardware in the processor 810 or by instructions in the form of software. The steps of the method disclosed in the embodiments of the present invention may be directly executed by a hardware processor, or executed by a combination of the hardware and software modules in the processor. The software module can be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 820, and the processor 810 reads the information in the memory 820 and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here.
  • The graph corresponding to the graph data processed by the processor 810 is divided into a plurality of subgraphs. Optionally, the plurality of subgraphs are m subgraphs, and m first files in one-to-one correspondence with the m subgraphs and m second files in one-to-one correspondence with the m subgraphs are stored in a distributed file system (DFS), where each of the first files is used to store the subgraph data corresponding to one of the m subgraphs and each of the second files is used to store the message data corresponding to that subgraph.
  • Optionally, the processor 810 scheduling the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data specifically includes: allocating the subgraphs to be processed for each round of the multiple rounds of MapReduce jobs; selecting, according to the subgraphs to be processed, the input data of each round from the m first files and the m second files, where the input data includes the subgraph data of the subgraphs to be processed and the message data obtained by the previous round of MapReduce jobs; and performing each round of MapReduce jobs according to the input data.
  • Because the graph data and the message data are processed separately, the graph data that remains unchanged throughout the MapReduce jobs is extracted and stored separately in the DFS, and the message data generated by each round is also saved in the DFS at the location corresponding to the graph data. At the beginning of each round, the required graph data and message data are read from the DFS as the input data of the current round; during each round, a Map compute node does not need to transfer the graph data to other compute nodes after processing the subgraph data, so the graph data need not be transferred in the Shuffle process, which reduces the I/O overhead during computation and the communication overhead during the Shuffle process and speeds up the processing of the graph data.
  • Optionally, the processor 810 performing each round of MapReduce jobs according to the input data may specifically include: allocating computing tasks to the Map compute nodes and the Reduce compute nodes of each round according to the input data; and controlling the Reduce compute nodes in each round to store the message data obtained by processing into the m second files. Optionally, the processor 810 is further configured to divide the graph into the plurality of subgraphs according to the formula gr = (nid * m)/N, where vertices with the same gr value are divided into the same subgraph, nid is the number of a vertex in the graph, m is the number of subgraphs, and N is the number of vertices in the graph.
  • Thus, the Map compute nodes in the MapReduce jobs take a subgraph as the computation object, each time processing the connected vertices within one subgraph, making full use of the connection relationships of the vertices in the subgraph, so that each round of MapReduce jobs processes as many vertices as possible, thereby reducing the number of MapReduce rounds required to process the graph data and improving the processing efficiency of the graph data.
  • It should be understood that the apparatus 800 may correspond to the master device in the method embodiments of the present invention, and the above and other operations and/or functions of the modules in the apparatus 800 implement the corresponding procedures of the methods in FIG. 2 to FIG. 6; for brevity, details are not repeated here.
  • The method of dividing subgraphs adopted by the apparatus for processing graph data in the embodiments of the present invention analyzes the characteristics of the graph data involved in practical applications and, under the premise of load balancing, divides the vertices connected by edges into the same subgraph as far as possible while weakening the coupling between subgraphs, which can further reduce the number of MapReduce rounds required to process the graph data and improve the processing efficiency of the graph data.
  • In addition, the terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
  • It should be understood that "B corresponding to A" means that B is associated with A and that B can be determined according to A. It should also be understood, however, that determining B according to A does not mean that B is determined only according to A; B can also be determined according to A and/or other information.
  • In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a division of logical functions, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
  • The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention.
  • In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

A method and apparatus for processing graph data. The method includes: determining graph data to be processed, where the graph corresponding to the graph data is divided into a plurality of subgraphs (S210); and scheduling compute nodes in a MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain a processing result of the graph data, where each Map compute node in a MapReduce job is used to process the interconnected vertices within one subgraph of the plurality of subgraphs (S220). The method can improve the processing efficiency of graph data.

Description

Method and apparatus for processing graph data
This application claims priority to Chinese Patent Application No. 201510737900.9, filed with the Chinese Patent Office on November 3, 2015 and entitled "Method and apparatus for processing graph data", which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of the present invention relate to the field of data processing, and in particular to a method and apparatus for processing graph data.
Background
A graph is an abstract data structure that can describe rich information and the dependencies between pieces of information. Many existing algorithms are based on graph data, such as the shortest-path algorithm, graph simulation algorithms, web page ranking algorithms, and breadth-first search. Graph data and its related algorithms are applied ubiquitously, for example in social network analysis, semantic web analysis, bioinformatics, and traffic navigation.
With the rapid development of these applications, the scale of the graph data they involve keeps growing, often reaching hundreds of millions of vertices and billions of edges. How to store and process large-scale graph data efficiently is therefore attracting more and more attention from both academia and industry.
The prior art mainly uses a MapReduce system to store and process graph data. Specifically, the MapReduce system generally uses a distributed file system (DFS) to store the graph data. When the graph data needs to be processed, the master node of the MapReduce system generally schedules the compute nodes of the entire system (Map compute nodes and Reduce compute nodes) to perform multiple rounds of MapReduce jobs on the graph data to obtain the processing result of the graph data.
In the prior art, when the MapReduce system processes graph data, the Map phase merely performs record-by-record computation on the input file, which in graph processing amounts to taking a single vertex as the computation object, where each vertex contains the information of itself and of its outgoing edges. In each round of MapReduce jobs, messages are restricted to a single step of propagation along the outgoing edges for the next round of MapReduce jobs. When the graph data is large, many rounds of MapReduce jobs are required, which makes the processing of the graph data inefficient.
Summary
The embodiments of the present invention provide a method and apparatus for processing graph data, to improve the processing efficiency of graph data.
In a first aspect, an embodiment of the present invention provides a method for processing graph data. The method includes: determining graph data to be processed, where the graph corresponding to the graph data is divided into a plurality of subgraphs; and scheduling compute nodes in a MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain a processing result of the graph data, where each Map compute node in a MapReduce job is used to process the interconnected vertices within one subgraph of the plurality of subgraphs.
With reference to the first aspect, in a first possible implementation of the first aspect, the plurality of subgraphs includes m subgraphs, and the graph data is stored in a distributed file system (DFS). The DFS includes m first files in one-to-one correspondence with the m subgraphs and m second files in one-to-one correspondence with the m subgraphs, where the m first files are respectively used to store the subgraph data corresponding to the m subgraphs and the m second files are respectively used to store the message data corresponding to the processed vertices in the m subgraphs. Scheduling the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data includes: allocating the subgraphs to be processed for each round of the multiple rounds of MapReduce jobs; selecting, according to the subgraphs to be processed, the input data of each round from the m first files and the m second files, where the input data includes the subgraph data corresponding to the subgraphs to be processed and the message data obtained by the previous round of MapReduce jobs; and performing each round of MapReduce jobs according to the input data.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, performing each round of MapReduce jobs according to the input data includes: allocating computing tasks to the Map compute nodes and the Reduce compute nodes of each round according to the input data; and controlling the Reduce compute nodes in each round to store the message data obtained by processing into the m second files.
With reference to the first aspect or the first or second possible implementation of the first aspect, in a third implementation of the first aspect, the method further includes: dividing the graph into the plurality of subgraphs according to the formula gr = (nid * m)/N, where vertices with the same gr value are divided into the same subgraph, nid is the number of a vertex in the graph, m is the number of subgraphs, and N is the number of vertices in the graph.
With reference to the first aspect or any one of the first to third possible implementations of the first aspect, in a fourth implementation of the first aspect, each Map compute node in the MapReduce job processes the interconnected vertices according to the breadth-first search (BFS) algorithm.
In a second aspect, an embodiment of the present invention provides an apparatus for processing graph data. The apparatus includes: a determining module, configured to determine graph data to be processed, where the graph corresponding to the graph data is divided into a plurality of subgraphs; and a scheduling module, configured to schedule compute nodes in a MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain a processing result of the graph data, where each Map compute node in a MapReduce job is used to process the interconnected vertices within one subgraph of the plurality of subgraphs.
With reference to the second aspect, in a first possible implementation of the second aspect, the plurality of subgraphs includes m subgraphs, and the graph data is stored in a distributed file system (DFS). The DFS includes m first files in one-to-one correspondence with the m subgraphs and m second files in one-to-one correspondence with the m subgraphs, where the m first files are respectively used to store the subgraph data corresponding to the m subgraphs and the m second files are respectively used to store the message data corresponding to the processed vertices in the m subgraphs. The scheduling module is specifically configured to: allocate the subgraphs to be processed for each round of the multiple rounds of MapReduce jobs; select, according to the subgraphs to be processed, the input data of each round from the m first files and the m second files, where the input data includes the subgraph data corresponding to the subgraphs to be processed and the message data obtained by the previous round of MapReduce jobs; and perform each round of MapReduce jobs according to the input data.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the scheduling module is specifically configured to: allocate computing tasks to the Map compute nodes and the Reduce compute nodes of each round according to the input data; and control the Reduce compute nodes in each round to store the message data obtained by processing into the m second files.
With reference to the second aspect or the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, the apparatus further includes: a dividing module, configured to divide the graph into the plurality of subgraphs according to the formula gr = (nid * m)/N, where vertices with the same gr value are divided into the same subgraph, nid is the number of a vertex in the graph, m is the number of subgraphs, and N is the number of vertices in the graph.
With reference to the second aspect or any one of the first to third possible implementations of the second aspect, in a fourth implementation of the second aspect, each Map compute node in the MapReduce job processes the interconnected vertices according to the breadth-first search (BFS) algorithm.
In the embodiments of the present invention, the graph corresponding to the graph data to be processed is first divided into a plurality of subgraphs; then, in each round of MapReduce jobs, each Map compute node processes the connected vertices inside one of the plurality of subgraphs at a time, so that each round of MapReduce jobs handles as many vertices as possible, which reduces the number of rounds of MapReduce jobs and improves the processing efficiency of the graph data.
Brief Description of the Drawings
FIG. 1 is a schematic block diagram of a MapReduce system for a method of processing graph data according to an embodiment of the present invention.
FIG. 2 is a schematic flowchart of a method of processing graph data according to an embodiment of the present invention.
FIG. 3 is a flowchart of a MapReduce job according to another embodiment of the present invention.
FIG. 4 is a schematic diagram of the division of a graph according to another embodiment of the present invention.
FIG. 5 is a schematic flowchart of a method of processing graph data according to another embodiment of the present invention.
FIG. 6 is a schematic flowchart of a method of processing graph data according to another embodiment of the present invention.
FIG. 7 is a schematic block diagram of an apparatus for processing graph data according to still another embodiment of the present invention.
FIG. 8 is a schematic block diagram of an apparatus for processing graph data according to still another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings.
FIG. 1 shows a schematic block diagram of a MapReduce system to which the method of processing graph data of an embodiment of the present invention can be applied. As shown in FIG. 1, the system can include a DFS, Map compute nodes, and Reduce compute nodes. Processing graph data generally requires traversing the graph in some way, so processing a complete set of graph data usually requires multiple rounds of MapReduce jobs. The Map compute nodes include at least one Map compute node, corresponding to the Map phase; the Reduce compute nodes include at least one Reduce compute node, corresponding to the Reduce phase. In the Map phase, the Map compute nodes process the input data to obtain intermediate calculation results or message data. In the Reduce phase, the Reduce compute nodes perform a reduce operation on the input data to obtain the reduced message data, which is saved in the DFS. Between the Map phase and the Reduce phase there may be a shuffle phase: during the Shuffle process, the intermediate calculation results are taken from disk and, after merging and sorting, are transmitted to the Reduce compute nodes as the input data of the Reduce phase.
It should be understood that, as shown in FIG. 1, the method of processing graph data of the embodiments of the present invention may be performed by a master device. The master device is responsible for scheduling all working devices and allocating computing tasks during graph data processing; for example, the master device can schedule the Map compute nodes and the Reduce compute nodes and control their task allocation, control the Map compute nodes to read the required data from the DFS, or control the Reduce compute nodes to store the processed message data in the DFS.
It should be understood that the method and apparatus for processing graph data of the embodiments of the present invention may be applied to graph algorithms such as a shortest-path algorithm, a graph simulation algorithm, a strong simulation algorithm, a web page ranking algorithm, or the breadth-first search (BFS) algorithm, and are not limited thereto; they may also be applied to other graph algorithms.
It should be understood that the distributed file system DFS in the embodiments of the present invention may be a Hadoop Distributed File System (HDFS), a Network File System (NFS), a Google File System (GFS), or any other distributed file system; the present invention is not limited in this respect.
FIG. 2 shows a schematic flowchart of a method 200 of processing graph data according to an embodiment of the present invention. As shown in FIG. 2, the method 200 includes:
S210: Determine graph data to be processed, where the graph corresponding to the graph data is divided into a plurality of subgraphs.
S220: Schedule the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain the processing result of the graph data, where each Map compute node in a MapReduce job is used to process the interconnected vertices within one subgraph of the plurality of subgraphs.
Optionally, the graph data to be processed may be located in the DFS, and determining the graph data to be processed may include determining the storage location of the graph data in the DFS, so that the compute nodes in the MapReduce system can be controlled to obtain the data needed for the MapReduce jobs from that storage location.
It should be understood that the graph corresponding to the graph data is divided into a plurality of subgraphs; each subgraph may correspond to one piece of subgraph data, and the graph data may include the subgraph data corresponding to each of the plurality of subgraphs. The subgraph data corresponding to each subgraph may contain information about the vertices in the subgraph and about their connection relationships, where the latter may include the connection relationships between vertices within the subgraph as well as the connection relationships from vertices in the subgraph to vertices in other subgraphs.
It should be understood that, after the graph data to be processed is determined, the compute nodes in the MapReduce system can be scheduled to perform MapReduce jobs on the graph data to obtain its processing result. The MapReduce jobs on the graph data may include multiple rounds. In each round, each Map compute node processes the interconnected vertices within one subgraph, or each Map compute node processes one or more groups of interconnected vertices within a subgraph. In other words, the input data of each Map compute node may include the subgraph data corresponding to one subgraph, and each Map compute node may be used to process the subgraph data corresponding to one subgraph. This can be understood as follows: in each round of MapReduce jobs, the connection relationships between the internal vertices of each subgraph can be exploited by placing connected vertices in the same Map compute node for processing. In the prior art, each Map compute node in each round processes isolated vertices, without considering or using the connection relationships between vertices, which is a vertex-centric processing method. In the embodiments of the present invention, each Map compute node in each round uses the connection relationships between the internal vertices of a subgraph and treats the connected vertices in the subgraph data as one computation object, which is a subgraph-centric processing method.
For example, when a Map compute node processes a vertex in a subgraph, if a vertex connected to it also belongs to that subgraph, the connected vertex can be processed in the same round of MapReduce jobs. More vertices can thus be processed in one round, reducing the number of rounds of MapReduce jobs, especially for large-scale graph data. In the prior art, because the vertex is the computation object, a message produced by processing a vertex can only be passed a single step along an outgoing edge. In the method of processing graph data of the embodiments of the present invention, because the subgraph is the computation object, message data can be passed in multiple steps between vertices inside the same subgraph; therefore, the connected vertices within the same subgraph can be processed simultaneously in one round, reducing the number of MapReduce rounds needed to process the graph data. By adopting a subgraph-centric computation model and using the connection relationships between vertices within the subgraph data to expand the computation granularity to the whole subgraph, the number of rounds of MapReduce jobs is reduced, which improves the computational speed and efficiency of graph data processing and reduces resource and time overhead.
Optionally, as an embodiment, the vertices of a subgraph may be divided into two types: internal vertices and boundary vertices. An internal vertex is a vertex all of whose connected vertices belong to the same subgraph; a boundary vertex is a vertex at least one of whose connected vertices does not belong to the subgraph in which the boundary vertex is located. Optionally, the graph data can be defined as a graph G = (V, E), where V and E denote the vertex set and the edge set respectively, and the edges in the edge set represent the connection relationships between vertices. The subgraph data can be defined as (G1[V1], ..., Gk[Vk]), denoting the k subgraphs obtained by dividing the graph data G by vertices, where V1∪V2∪...∪Vk = V and Vi∩Vj = ∅ for any i ≠ j (i, j∈[1,k]), that is, the vertex sets of the subgraphs are pairwise disjoint.
In addition, it can be defined that, in a subgraph Gi[Vi] (i∈[1,k]), if ν∈Vi satisfies the condition {μ|(ν,μ)∈E ∧ μ∈Vi}, then ν is an internal vertex; if ν does not satisfy this condition, ν is a boundary vertex. The subgraphs communicate with each other through the boundary vertices: the intermediate results and message data produced during each round of MapReduce jobs are passed in multiple steps between internal vertices to implement multi-step computation, and are then transmitted along the boundary vertices to the other associated subgraphs for the computation of the next round of MapReduce jobs.
In the embodiments of the present invention, by dividing the graph corresponding to the graph data to be processed into a plurality of subgraphs, the Map compute nodes in the MapReduce jobs take a subgraph as the computation object and each time process the connected vertices within one subgraph, making full use of the connection relationships of the vertices in the subgraph, so that each round of MapReduce jobs processes as many vertices as possible, thereby reducing the number of rounds of MapReduce jobs required to process the graph data and improving the processing efficiency of the graph data.
As shown in FIG. 1, between the Map phase and the Reduce phase there may be a shuffle phase, in which the intermediate calculation results are taken from disk and, after merging and sorting, transferred to the Reduce compute nodes as the input data of the Reduce phase. The graph data is invariant during the execution of a graph algorithm, that is, it remains unchanged during each round of MapReduce jobs; its volume is usually relatively large, and it is used in every round. The message data is the changing data, and its volume is usually relatively small. The prior art, however, does not distinguish the graph data from the message data when performing MapReduce jobs, so the graph data has to be processed repeatedly and shuffled in every round. This repeated reading, writing, and network transfer of the graph data causes great overhead and severely degrades the processing efficiency of the graph data.
Optionally, as an embodiment, the plurality of subgraphs may include m subgraphs, and the graph data is stored in a distributed file system (DFS), where the DFS includes m first files in one-to-one correspondence with the m subgraphs and m second files in one-to-one correspondence with the m subgraphs; the m first files are respectively used to store the subgraph data corresponding to the m subgraphs, and the m second files are respectively used to store the message data corresponding to the processed vertices in the m subgraphs.
Optionally, in S220, scheduling the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data includes: allocating the subgraphs to be processed for each round of the multiple rounds of MapReduce jobs; selecting, according to the subgraphs to be processed, the input data of each round from the m first files and the m second files, where the input data includes the subgraph data corresponding to the subgraphs to be processed and the message data obtained by the previous round of MapReduce jobs; and performing each round of MapReduce jobs according to the input data.
In the embodiments of the present invention, because the graph data and the message data are processed separately, the graph data that remains unchanged throughout the MapReduce jobs is extracted and stored separately in the DFS, and the message data generated by each round is also saved in the DFS at the location corresponding to the graph data. At the beginning of each round, the required graph data and message data are read from the DFS as the input data of the current round. During each round, a Map compute node does not need to transfer the graph data to other compute nodes after processing the subgraph data, so the graph data need not be transferred during the Shuffle process, which reduces the I/O overhead of the graph data during computation and the communication overhead during the Shuffle process, and speeds up the processing of the graph data.
For example, FIG. 3 shows a flowchart of a MapReduce job according to another embodiment of the present invention. As shown in FIG. 3, optionally, the graph data is divided into a specified number of pieces of subgraph data and then saved in the DFS. The message data can be the result of each round of MapReduce jobs, and the message data and the subgraph data are in one-to-one correspondence. The message data is also saved in the DFS after being processed by the Reduce compute nodes. At the beginning of each round, the Map nodes read the required message data and graph data from the DFS and combine them as the input data of the Map compute nodes in the current round.
Specifically, the message data, also referred to as messages, can be the result of each round of MapReduce processing. The subgraphs and the message data have a one-to-one correspondence, and the message data can be combined with the subgraph data as the input data of each round of MapReduce jobs. For example, in the DFS, the files storing the m pieces of subgraph data can be named Gi (i∈[1,m]) and the files storing the message data corresponding to the m subgraphs can be named Mi (i∈[1,m]); the Gi files are in one-to-one correspondence with the Mi files. According to the file naming rules of the DFS, the corresponding Gi and Mi can be given the same file name. When the compute nodes in a MapReduce job need to read the input data, MapReduce's CombineFileInputFormat class can be overridden so that the Gi and Mi with the same file name are logically merged into one file, which serves as the input of the Map compute node.
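For illustration only, a minimal Java sketch of such an input format is given below. This is a sketch under assumptions rather than the patent's own code: it assumes Hadoop's org.apache.hadoop.mapreduce API, line-oriented Gi and Mi files, and a per-file reader based on LineRecordReader; the class names SubgraphInputFormat and PerFileLineReader are hypothetical, and the logic that groups each Gi with its same-named Mi into one combined split is elided.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical input format: one combined split bundles a subgraph file Gi
    // with its message file Mi, so a single Map task reads both together.
    public class SubgraphInputFormat extends CombineFileInputFormat<LongWritable, Text> {
        @Override
        protected boolean isSplitable(JobContext ctx, Path file) {
            return false; // keep each Gi/Mi file whole so one Map task sees a complete subgraph
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx)
                throws IOException {
            return new CombineFileRecordReader<>((CombineFileSplit) split, ctx, PerFileLineReader.class);
        }

        // Wrapper that reads the index-th file of the combined split line by line.
        public static class PerFileLineReader extends RecordReader<LongWritable, Text> {
            private final LineRecordReader delegate = new LineRecordReader();
            private final CombineFileSplit split;
            private final int index;

            public PerFileLineReader(CombineFileSplit split, TaskAttemptContext ctx, Integer index) {
                this.split = split;
                this.index = index;
            }

            @Override
            public void initialize(InputSplit ignored, TaskAttemptContext ctx) throws IOException {
                // Re-expose one file of the combined split as an ordinary FileSplit.
                delegate.initialize(new FileSplit(split.getPath(index), split.getOffset(index),
                        split.getLength(index), split.getLocations()), ctx);
            }

            @Override
            public boolean nextKeyValue() throws IOException { return delegate.nextKeyValue(); }
            @Override
            public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }
            @Override
            public Text getCurrentValue() { return delegate.getCurrentValue(); }
            @Override
            public float getProgress() throws IOException { return delegate.getProgress(); }
            @Override
            public void close() throws IOException { delegate.close(); }
        }
    }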
Optionally, performing each round of MapReduce jobs according to the input data includes: allocating computing tasks to the Map compute nodes and the Reduce compute nodes of each round according to the input data; and controlling the Reduce compute nodes in each round to store the message data obtained by processing into the m second files.
Optionally, in each round of MapReduce jobs, after the Reduce compute nodes obtain the message data of the current round, the master device controls the Reduce compute nodes to store the message data in the corresponding m second files that are in one-to-one correspondence with the m subgraphs, so that the required input data can be read from the m second files at the beginning of the next round. The graph data and the message data can thus be processed separately, which reduces the I/O overhead of the graph data during computation and the communication overhead during the Shuffle process, and speeds up the processing of the graph data.
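A possible shape for such a Reduce side is sketched below, again only as an illustration under assumptions: it assumes Hadoop's MultipleOutputs helper, that the subgraph index gr of a destination vertex can be recovered as gr = (nid * m)/N, and that the configuration keys graph.num.vertices and graph.num.subgraphs are set by a hypothetical job driver.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Hypothetical Reducer: writes each round's message data into the second
    // file M-gr that matches the subgraph owning the destination vertex.
    public class MessageReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        private MultipleOutputs<LongWritable, Text> mos;
        private long numVertices; // N, the number of vertices in the graph
        private int numSubgraphs; // m, the number of subgraphs

        @Override
        protected void setup(Context ctx) {
            mos = new MultipleOutputs<>(ctx);
            numVertices = ctx.getConfiguration().getLong("graph.num.vertices", 1L);
            numSubgraphs = ctx.getConfiguration().getInt("graph.num.subgraphs", 1);
        }

        @Override
        protected void reduce(LongWritable nid, Iterable<Text> messages, Context ctx)
                throws IOException, InterruptedException {
            long gr = (nid.get() * numSubgraphs) / numVertices; // subgraph of this vertex
            for (Text message : messages) {
                mos.write(nid, message, "M-" + gr); // append to the matching message file
            }
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            mos.close();
        }
    }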
A specific implementation of the method of processing graph data has been described above with reference to FIG. 1 to FIG. 3. Taking BFS as an example, one specific implementation of the method of processing graph data of the embodiments of the present invention is described below with reference to FIG. 4 and FIG. 5.
As shown in FIG. 4, the graph G to be processed is first divided into three subgraphs G1, G2, and G3, where V1=[1,2], V2=[3,4], and V3=[5,6]; the vertices marked with dashed lines in each subgraph indicate vertices that do not belong to that subgraph but are connected to it by edges. Taking vertex 3 as the source, the intermediate results of the computation are passed along the edges to the adjacent vertices in the form of messages until all reachable vertices have been traversed. As can be seen from FIG. 4, in graph G, vertex 3 is the source, that is, the starting vertex; the outgoing messages of vertex 3 go to vertex 1 and vertex 4, those of vertex 1 go to vertex 2 and vertex 5, those of vertex 4 go to vertex 1 and vertex 5, that of vertex 2 goes to vertex 6, and those of vertex 5 go to vertex 2 and vertex 6. Specifically, FIG. 5 shows the MapReduce job process of processing graph G according to the embodiment of the present invention, where the vertices marked with dashed lines represent the starting vertices of the next round and the vertices marked in gray represent the vertices that have already been processed. As shown in FIG. 5, in the first round, the Map compute node takes subgraph G2 as the computation object; since vertex 3 and vertex 4 are connected and belong to the same subgraph, the data produced by processing vertex 3 is passed to vertex 4, and vertex 3 and vertex 4 in G2 can be processed in the same round to obtain the message data of the first round. Meanwhile, because vertex 4 is connected to vertex 1 in subgraph G1 and to vertex 5 in subgraph G3, the message data of vertex 4 obtained in the current round is passed to vertex 1 and vertex 5 for the next round. In the second round, because vertex 2 is connected to vertex 1 in subgraph G1 and vertex 5 is connected to vertex 6 in subgraph G3, the Map compute nodes take subgraphs G1 and G3 as computation objects and process vertices 1 and 2 in G1 and vertices 5 and 6 in G3 to obtain the message data of the second round. All the vertices in graph G can be processed in two rounds of MapReduce jobs.
The prior art does not divide the graph into subgraphs but adopts a vertex-centric processing method. FIG. 6 shows the MapReduce job process of processing graph G in the prior art, where the vertices marked with dashed lines represent the starting vertices of the next round and the vertices marked in gray represent the vertices that have already been processed. As shown in FIG. 6, with vertices as the computation objects, for graph G, vertex 3 is processed first in the first round, and after its message data is obtained it is passed to vertex 1 and vertex 4. In the second round, vertex 1 and vertex 4 are processed, the message data of vertex 1 is passed to vertex 2 and vertex 5, and the message data of vertex 4 is passed to vertex 1 and vertex 5. In the third round, vertex 2 and vertex 5 are processed, the message data of vertex 2 is passed to vertex 6, and the message data of vertex 5 is passed to vertex 2 and vertex 6. In the fourth round, vertex 6 is processed to obtain the message data of that round. Four rounds of MapReduce jobs are needed to process all the vertices in graph G.
This specific example shows that, compared with the prior art, the method of processing graph data of the embodiments of the present invention significantly reduces the number of MapReduce rounds and thus improves the processing efficiency of the graph data.
Optionally, under the MapReduce programming framework, to implement the subgraph-centric computation model, the three functions setup(), Map(), and clean() of the Mapper class can be rewritten in the Map stage. The setup() function performs initialization before the Map starts, while the clean() function does the finishing work after the Map computation is completed; the setup() and clean() functions are executed only once during the Map phase. Therefore, first, the setup() function is used to initialize a HashMap structure for holding the entire subgraph; after that, the Map() function reads the vertex data record by record and maps it into the HashMap structure; finally, in the clean() function, custom computations can be performed as needed on the entire subgraph saved in the HashMap. Taking BFS as an example, the key pseudocode of the Map stage implementing the method of processing graph data of the embodiments of the present invention can be as follows.
class Mapper
    method setup()
        new HashMap(nid n, node nd)
    method Map(nid n, node nd)
        HashMap.put(n, nd)
    method clean()
        bfs(HashMap)
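As a concrete, purely illustrative Java rendering of this pseudocode, a sketch is given below. It assumes Hadoop's org.apache.hadoop.mapreduce API and a record layout of "vertex id <TAB> vertex data"; note that in the actual Hadoop Mapper class the finishing hook is named cleanup() rather than clean(), and the doBfs helper is a placeholder for whatever subgraph-level computation is wanted.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: buffer the whole subgraph while map() runs, then do one
    // subgraph-level BFS in cleanup() once every vertex has been read.
    public class SubgraphBfsMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private Map<Long, String> subgraph; // vertex id -> serialized vertex data

        @Override
        protected void setup(Context ctx) {
            subgraph = new HashMap<>(); // holds the entire subgraph of this Map task
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx) {
            // Assumed record layout: "<vertex id>\t<adjacency and vertex state>"
            String[] parts = line.toString().split("\t", 2);
            subgraph.put(Long.parseLong(parts[0]), parts.length > 1 ? parts[1] : "");
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            // The whole subgraph is now in memory: traverse it with a multi-step BFS
            // and emit message data for the boundary vertices (traversal logic elided).
            doBfs(subgraph, ctx);
        }

        private void doBfs(Map<Long, String> graph, Context ctx)
                throws IOException, InterruptedException {
            // Placeholder for the subgraph-level BFS described in the text.
        }
    }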
It should be understood that, in the method of processing graph data provided by the embodiments of the present invention, when the graph corresponding to the graph data is divided into a plurality of subgraphs, a hash method may be used to divide the subgraphs during the implementation of the MapReduce jobs. However, the MapReduce distributed computing framework was not designed with the internal relationships of graph data in mind, so hash-based division does not consider the connection relationships of the vertices inside a subgraph. If, while ensuring load balancing, the vertices connected by edges are assigned to the same subgraph as far as possible and the number of edges crossing subgraphs is minimized, more vertices of the same subgraph can be processed simultaneously in one round of MapReduce jobs, which reduces the number of rounds of MapReduce jobs required to process the graph data and improves its processing efficiency. In other words, the locality of the graph data can be fully taken into account when dividing the subgraphs, according to the characteristics of the graph data in practical applications. For example, in the graph corresponding to a traffic network, the numbers of adjacent vertices differ only slightly; the subgraphs can therefore be divided by vertex number order, such as 1 to 1000, 1001 to 2000, and so on, with each range saved in the subgraph data corresponding to the same subgraph.
Optionally, the graph corresponding to the graph data may be divided into a plurality of subgraphs according to the formula gr = (nid * m)/N, where vertices with the same gr value are divided into the same subgraph, nid is the number of a vertex in the graph, m is the number of subgraphs, and N is the number of vertices in the graph.
For example, to divide a graph containing N vertices into m subgraphs, the division can follow the formula gr = (nid * m)/N; the key pseudocode implementing it in the MapReduce system can be as follows.
class Mapper
    method Map()
        gr ← (n * m) / N
        EMIT(nid n, gr)
class Partitioner
    method getPartition(nid n, node nd)
        return gr
class Reducer
    method Reduce(nid n, [gr])
        for all gr ∈ [gr]
            EMIT(nid n, node nd)
For example, for a traffic network graph, the division can also be based on GIS location information; for instance, the traffic network of a city or a province can be taken as one subgraph according to actual needs. In the implementation, computing the gr value in the Map function requires parsing the GIS data and extracting the location information. The key pseudocode implementing it in the MapReduce system can be as follows.
class Mapper
    method Map(nid n, node nd)
        gr ← nd.GIS.location
        EMIT(nid n, gr)
class Partitioner
    method Partition(nid n, gr)
        return gr
class Reducer
    method Reduce(nid n, [gr])
        for all gr ∈ [gr]
            EMIT(nid n, node nd)
In addition, a corresponding method of dividing subgraphs can also be used for social networks. The public information provided by users when registering on a social networking site, such as their city, employer, or school, can serve as the basis for dividing the subgraphs. When implementing this on the MapReduce system architecture, the gr value in the Map function can simply be assigned as needed; a small sketch of such an assignment is given below.
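As a purely illustrative sketch (the attribute and helper names are hypothetical, not part of the patent), gr could be derived from a public profile attribute such as the user's city:

    // Hypothetical helper: map a public profile attribute (e.g. the user's city)
    // to a subgraph index gr in [0, m-1], so that users from the same city land
    // in the same subgraph.
    public final class SocialPartition {
        private SocialPartition() {}

        public static int groupOf(String city, int m) {
            // Math.floorMod keeps the result non-negative even for negative hash codes.
            return Math.floorMod(city == null ? 0 : city.hashCode(), m);
        }
    }

In a real job, this value would play the role of gr in the Map function of the pseudocode above.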
In the embodiments of the present invention, by analyzing the characteristics of the graph data involved in practical applications and, under the premise of load balancing, dividing the vertices connected by edges into the same subgraph as far as possible while weakening the coupling between subgraphs, the number of MapReduce rounds required to process the graph data can be further reduced, and the processing speed and computational efficiency of the graph data can be improved.
In the embodiments of the present invention, on the one hand, by dividing the graph corresponding to the graph data to be processed into a plurality of subgraphs, the Map compute nodes in the MapReduce jobs take a subgraph as the computation object and each time process the connected vertices within one subgraph, making full use of the connection relationships of the vertices in the subgraph, so that each round of MapReduce jobs processes as many vertices as possible, thereby reducing the number of MapReduce rounds required to process the graph data and improving its processing efficiency. On the other hand, because the graph data and the message data are processed separately, the graph data that remains unchanged throughout the MapReduce jobs is extracted and stored separately in the DFS, and the message data generated by each round is also saved in the DFS at the location corresponding to the graph data. At the beginning of each round, the required graph data and message data are read from the DFS as the input data of the current round; during each round, a Map compute node does not need to transfer the graph data to other compute nodes after processing the subgraph data, so the graph data need not be transferred during the Shuffle process, which reduces the I/O overhead of the graph data during computation and the communication overhead during the Shuffle process and speeds up the processing of the graph data. Furthermore, the method of dividing subgraphs adopted in the embodiments of the present invention analyzes the characteristics of the graph data involved in practical applications and, under the premise of load balancing, divides the vertices connected by edges into the same subgraph as far as possible while weakening the coupling between subgraphs, which can further reduce the number of MapReduce rounds required to process the graph data and improve the processing speed and computational efficiency of the graph data.
Specific embodiments of the method of processing graph data of the embodiments of the present invention have been described in detail above with reference to FIG. 1 to FIG. 6. The apparatus for processing graph data of the embodiments of the present invention is described in detail below with reference to FIG. 7 and FIG. 8.
FIG. 7 shows a schematic diagram of an apparatus 700 for processing graph data according to an embodiment of the present invention. It should be understood that the apparatus 700 may correspond to the master device in the method embodiments of the present invention, and the following and other operations and/or functions of its modules implement the corresponding procedures of the methods in FIG. 2 to FIG. 6; for brevity, details are not repeated here. The apparatus 700 includes:
a determining module 710, configured to determine the graph data to be processed, where the graph corresponding to the graph data is divided into a plurality of subgraphs; and
a scheduling module 720, configured to schedule the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain the processing result of the graph data, where each Map compute node in a MapReduce job is used to process the interconnected vertices within one subgraph of the plurality of subgraphs.
In the embodiments of the present invention, by dividing the graph corresponding to the graph data to be processed into a plurality of subgraphs, the Map compute nodes in the MapReduce jobs take a subgraph as the computation object and each time process the connected vertices within one subgraph, making full use of the connection relationships of the vertices in the subgraph, so that each round of MapReduce jobs processes as many vertices as possible, thereby reducing the number of rounds of MapReduce jobs required to process the graph data and improving the processing efficiency of the graph data.
在本发明实施例中,装置700的确定模块710确定的待处理的图数据对应的图被划分成多个子图,可选地,在本发明实施例中,该多个子图包括m个子图,该图数据存储在分布式文件系统DFS中,该DFS包括与该m个子图一一对应的m个第一文件,以及与该m个子图一一对应的m个第二文件,其中,该m个第一文件分别用于存储该m个子图对应的子图数据,该m个第二文件分别用于存储该m个子图中的被处理过的顶点对应的消息数据。
可选地,本发明实施例的装置700的调度模块720具体用于:为该多轮MapReduce作业中的每轮MapReduce作业分配待处理的子图;根据该待处理的子图,从该m个第一文件中和该m个第二文件中选取该每轮MapReduce作业的输入数据,该输入数据包括该待处理的子图对应的子图数据,以及该每轮MapReduce作业的上一轮MapReduce作业处理得到的消息数据;根据该输入数据,进行该每轮MapReduce作业。
在本发明实施例中,因为采取了对图数据和消息数据分开处理的方式,把在整个MapReduce作业过程中始终保持不变的图数据抽离出来,单独保存在DFS中,并且每轮MapReduce作业产生的消息数据也被保存在DFS中与图数据对应的位置。在每轮MapReduce作业的开始,从DFS中读取需要的图数据和消息数据作为本轮MapReduce的输入数据。在每轮MapReduce作业过程中,Map计算节点在处理完子图数据后,无需向其它计算节点传输图数据,所以在Shuffle过程中也不需要传输图数据,从而能够减少图数据在计算过程中带来的I/O开销以及在Shuffle过程中的通信开销,进而加快了图数据的处理速度。
可选地,该调度模块720具体用于:根据该输入数据,为该每轮MapReduce作业的Map计算节点和Reduce计算节点分配计算任务;控制该每轮MapReduce作业中的Reduce计算节点将处理得到的消息数据存入该m第二个文件中。
Optionally, the apparatus 700 of this embodiment of the present invention further includes: a partitioning module 730, configured to divide the graph into the multiple subgraphs according to the formula gr = (nid*m)/N, where vertices with the same gr value are assigned to the same subgraph, nid is the ID of a vertex in the graph, m is the number of subgraphs, and N is the number of vertices in the graph.
In this embodiment of the present invention, by analyzing the characteristics of the graph data in the actual application and, under the premise of load balancing, assigning vertices connected by edges to the same subgraph as far as possible while weakening the coupling between subgraphs, the number of MapReduce rounds required to process the graph data can be further reduced, and the processing efficiency of the graph data can be improved.
FIG. 8 shows an apparatus 800 for processing graph data according to another embodiment of the present invention. As shown in FIG. 8, the apparatus 800 includes a processor 810, a memory 820, and a bus system 830. The apparatus 800 is connected to the compute nodes in the MapReduce system through the bus system 830; the processor 810 and the memory 820 are connected through the bus system 830; the memory 820 is configured to store instructions; and the processor 810 is configured to execute the instructions stored in the memory 820, so that the processor 810 controls the MapReduce jobs performed by the compute nodes in the MapReduce system.
The processor 810 is configured to: determine graph data to be processed, where the graph corresponding to the graph data is divided into multiple subgraphs; and schedule compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain a processing result of the graph data, where each Map compute node of at least one Map compute node in the MapReduce job is configured to process the vertices within one of the multiple subgraphs, and the vertices are mutually connected.
In this embodiment of the present invention, by dividing the graph corresponding to the graph data to be processed into multiple subgraphs, the Map compute nodes in the MapReduce job take a subgraph as the unit of computation and each time process the connected vertices within one subgraph, making full use of the connection relationships among the vertices within the subgraph, so that each round of the MapReduce job processes as many vertices as possible, which reduces the number of MapReduce rounds required to process the graph data and improves the processing efficiency of the graph data.
It should be understood that, in this embodiment of the present invention, the processor 810 may be a central processing unit ("CPU" for short), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
The memory 820 may include a read-only memory and a random access memory, and provides instructions and data to the processor 810. A part of the memory 820 may also include a non-volatile random access memory. For example, the memory 820 may also store device-type information.
In addition to a data bus, the bus system 830 may include a power bus, a control bus, a status signal bus, and the like. The bus system 830 may also include an internal bus, a system bus, and an external bus. For clarity of description, however, the various buses are all labeled as the bus system 830 in the figure.
During implementation, the steps of the foregoing methods may be completed by integrated logic circuits of hardware in the processor 810 or by instructions in the form of software. The steps of the methods disclosed in the embodiments of the present invention may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 820, and the processor 810 reads the information in the memory 820 and completes the steps of the foregoing methods in combination with its hardware. To avoid repetition, details are not described here again.
In this embodiment of the present invention, the graph corresponding to the graph data processed by the processor 810 is divided into multiple subgraphs. Optionally, the multiple subgraphs are m subgraphs, and m first files in one-to-one correspondence with the m subgraphs and m second files in one-to-one correspondence with the m subgraphs are stored in the distributed file system DFS, where each of the first files is used to store the subgraph data corresponding to one of the m subgraphs, and each of the second files is used to store the message data corresponding to that subgraph.
Optionally, that the processor 810 schedules the compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data specifically includes:
allocating a subgraph to be processed to each round of the multiple rounds of MapReduce jobs;
selecting, according to the subgraph to be processed, the input data of each round of the MapReduce job from the m first files and the m second files, where the input data includes the subgraph data of the subgraph to be processed and the message data obtained from the previous round of the MapReduce job; and
performing each round of the MapReduce job according to the input data.
In this embodiment of the present invention, because the graph data and the message data are processed separately, the graph data, which remains unchanged throughout the entire MapReduce job, is extracted and stored separately in the DFS, and the message data produced by each round of the MapReduce job is also stored in the DFS at the location corresponding to the graph data. At the beginning of each round of the MapReduce job, the required graph data and message data are read from the DFS as the input data of that round. During each round, a Map compute node does not need to transmit graph data to other compute nodes after processing its subgraph data, so no graph data needs to be transmitted during the Shuffle phase either, which reduces the I/O overhead of the graph data during computation and the communication overhead during Shuffle, thereby speeding up the processing of the graph data.
Optionally, in another embodiment of the present invention, that the processor 810 performs each round of the MapReduce job according to the input data may specifically include:
allocating, according to the input data, computation tasks to the Map compute nodes and the Reduce compute nodes of each round of the MapReduce job; and
controlling the Reduce compute nodes in each round of the MapReduce job to store the resulting message data into the m second files.
Optionally, in another embodiment of the present invention, the processor 810 is further configured to divide the graph into the multiple subgraphs according to the formula gr = (nid*m)/N, where vertices with the same gr value are assigned to the same subgraph, nid is the ID of a vertex in the graph, m is the number of subgraphs, and N is the number of vertices in the graph.
Therefore, in this embodiment of the present invention, by dividing the graph corresponding to the graph data to be processed into multiple subgraphs, the Map compute nodes in the MapReduce job take a subgraph as the unit of computation and each time process the connected vertices within one subgraph, making full use of the connection relationships among the vertices within the subgraph, so that each round of the MapReduce job processes as many vertices as possible, which reduces the number of MapReduce rounds required to process the graph data and improves the processing efficiency of the graph data.
It should be understood that the apparatus 800 for processing graph data according to this embodiment of the present invention may correspond to the master control device in the method embodiments of the present invention, and the foregoing and other operations and/or functions of the modules in the apparatus 800 are respectively intended to implement the corresponding procedures of the methods in FIG. 2 to FIG. 6; for brevity, details are not repeated here.
The subgraph partitioning method adopted by the apparatus for processing graph data of the embodiments of the present invention analyzes the characteristics of the graph data in the actual application and, under the premise of load balancing, assigns vertices connected by edges to the same subgraph as far as possible while weakening the coupling between subgraphs, which can further reduce the number of MapReduce rounds required to process the graph data and improve the processing efficiency of the graph data.
In addition, the terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following associated objects.
It should be understood that, in the embodiments of the present invention, "B corresponding to A" means that B is associated with A and B can be determined according to A. However, it should also be understood that determining B according to A does not mean determining B only according to A; B may also be determined according to A and/or other information.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present invention.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may also be electrical, mechanical, or other forms of connections.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
For conciseness and clarity of the application documents, it should be understood that the technical features and descriptions in one of the foregoing embodiments also apply to the other embodiments, and are not repeated one by one in the other embodiments.
The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present invention, and such modifications or replacements shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

  1. A method for processing graph data, comprising:
    determining graph data to be processed, wherein a graph corresponding to the graph data is divided into multiple subgraphs; and
    scheduling compute nodes in a MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain a processing result of the graph data, wherein each Map compute node in the MapReduce job is configured to process mutually connected vertices within one of the multiple subgraphs.
  2. The method according to claim 1, wherein the multiple subgraphs comprise m subgraphs, the graph data is stored in a distributed file system DFS, and the DFS comprises m first files in one-to-one correspondence with the m subgraphs and m second files in one-to-one correspondence with the m subgraphs, wherein the m first files are respectively used to store subgraph data corresponding to the m subgraphs, and the m second files are respectively used to store message data corresponding to processed vertices in the m subgraphs,
    and wherein the scheduling compute nodes in the MapReduce system to perform multiple rounds of MapReduce jobs on the graph data comprises:
    allocating a subgraph to be processed to each round of the multiple rounds of MapReduce jobs;
    selecting, according to the subgraph to be processed, input data of each round of the MapReduce jobs from the m first files and the m second files, wherein the input data comprises the subgraph data corresponding to the subgraph to be processed and the message data obtained from a previous round of the MapReduce job; and
    performing each round of the MapReduce job according to the input data.
  3. The method according to claim 2, wherein the performing each round of the MapReduce job according to the input data comprises:
    allocating, according to the input data, computation tasks to Map compute nodes and Reduce compute nodes of each round of the MapReduce job; and
    controlling the Reduce compute nodes in each round of the MapReduce job to store resulting message data into the m second files.
  4. The method according to any one of claims 1 to 3, further comprising:
    dividing the graph into the multiple subgraphs according to the formula gr=(nid*m)/N, wherein vertices with the same gr value are assigned to the same subgraph, nid is the ID of a vertex in the graph, m is the number of subgraphs, and N is the number of vertices in the graph.
  5. The method according to any one of claims 1 to 4, wherein each Map compute node in the MapReduce job processes the mutually connected vertices according to a breadth-first search (BFS) algorithm.
  6. An apparatus for processing graph data, comprising:
    a determining module, configured to determine graph data to be processed, wherein a graph corresponding to the graph data is divided into multiple subgraphs; and
    a scheduling module, configured to schedule compute nodes in a MapReduce system to perform multiple rounds of MapReduce jobs on the graph data to obtain a processing result of the graph data, wherein each Map compute node in the MapReduce job is configured to process mutually connected vertices within one of the multiple subgraphs.
  7. The apparatus according to claim 6, wherein the multiple subgraphs comprise m subgraphs, the graph data is stored in a distributed file system DFS, and the DFS comprises m first files in one-to-one correspondence with the m subgraphs and m second files in one-to-one correspondence with the m subgraphs, wherein the m first files are respectively used to store subgraph data corresponding to the m subgraphs, and the m second files are respectively used to store message data corresponding to processed vertices in the m subgraphs,
    and wherein the scheduling module is specifically configured to:
    allocate a subgraph to be processed to each round of the multiple rounds of MapReduce jobs;
    select, according to the subgraph to be processed, input data of each round of the MapReduce jobs from the m first files and the m second files, wherein the input data comprises the subgraph data corresponding to the subgraph to be processed and the message data obtained from a previous round of the MapReduce job; and
    perform each round of the MapReduce job according to the input data.
  8. The apparatus according to claim 7, wherein the scheduling module is specifically configured to:
    allocate, according to the input data, computation tasks to Map compute nodes and Reduce compute nodes of each round of the MapReduce job; and
    control the Reduce compute nodes in each round of the MapReduce job to store resulting message data into the m second files.
  9. The apparatus according to any one of claims 6 to 8, further comprising:
    a partitioning module, configured to divide the graph into the multiple subgraphs according to the formula gr=(nid*m)/N, wherein vertices with the same gr value are assigned to the same subgraph, nid is the ID of a vertex in the graph, m is the number of subgraphs, and N is the number of vertices in the graph.
  10. The apparatus according to any one of claims 6 to 9, wherein each Map compute node in the MapReduce job processes the mutually connected vertices according to a breadth-first search (BFS) algorithm.