WO2018006625A1 - Graph data calculation method, host and graph calculation system - Google Patents
Graph data calculation method, host and graph calculation system
- Publication number
- WO2018006625A1 (application PCT/CN2017/077859)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- vertex
- random walk
- instances
- calculation
- source point
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
Definitions
- the present invention relates to the field of computers, and in particular, to a method, a host, and a graph computing system for graph data calculation.
- Graph computing is the abstract representation of real-world "graph" structures based on graph theory, together with the computational model that operates on such data structures.
- The graph computing system runs every algorithm in multiple rounds of iteration until the algorithm converges.
- an update function on the graph is formed by writing a vertex program, wherein the update function is defined by the user.
- the update function is a user-defined calculation that can process a source path.
- An update function can modify the weight of a vertex and the edge connected to it.
- In each iteration, the system executes the update function on every vertex of the graph.
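- The vertex-program model described above can be sketched as follows. This is only an illustrative sketch: the function names, the dictionary-based graph, and the "take the smallest neighbouring weight" update rule are hypothetical and are not taken from the patent.

```python
from typing import Callable, Dict, List, Tuple

Graph = Dict[str, List[str]]   # adjacency list: vertex -> neighbouring vertices
State = Dict[str, int]         # one weight per vertex

def min_label_update(v: str, state: State, graph: Graph) -> int:
    """Hypothetical user-defined update: smallest weight among v and its neighbours."""
    return min([state[v]] + [state[u] for u in graph[v]])

def run_until_converged(
    graph: Graph,
    state: State,
    update: Callable[[str, State, Graph], int],
    max_rounds: int = 100,
) -> Tuple[State, int]:
    """Apply the update function to every vertex once per round until nothing changes."""
    rounds = 0
    while rounds < max_rounds:
        new_state = {v: update(v, state, graph) for v in graph}
        rounds += 1
        if new_state == state:      # convergence: a full round changed no vertex
            break
        state = new_state
    return state, rounds

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
final_state, rounds = run_until_converged(
    graph, {"a": 3, "b": 1, "c": 2, "d": 7}, min_label_update
)
```

- In this toy run the weights stop changing after the first round, and the second round only confirms convergence.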
- The graph being processed is usually larger than the memory of a single computer. Therefore, before the calculation starts, the graph is divided into equal parts according to the number of compute nodes in the cluster, and each part is loaded into the memory of one compute node. During the graph calculation, the hosts must communicate over the network to exchange the calculation state held in their memory so that the overall calculation can move forward.
- A graph calculation method based on random walks is generally employed. Suppose a graph includes N nodes. For each of the N nodes, a random walk path with that node as the source point must be searched for independently, so N calculations are performed on the same graph.
- Each random walk starts from a different source node in the graph and, at each step, advances to a randomly selected vertex adjacent to the current node.
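- A minimal sketch of such a random walk is given below, assuming an adjacency-list graph in which every vertex has at least one neighbour. The names `random_walk`, `graph` and `num_steps` are illustrative only.

```python
import random
from typing import Dict, List

Graph = Dict[str, List[str]]   # adjacency list: vertex -> adjacent vertices

def random_walk(graph: Graph, source: str, num_steps: int, rng: random.Random) -> List[str]:
    """Start at `source` and move num_steps times to a uniformly chosen adjacent vertex."""
    walk = [source]
    for _ in range(num_steps):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
rng = random.Random(0)          # seeded only to make the sketch repeatable
# One walk per source vertex, i.e. N independent samplings for N vertices.
walks = {v: random_walk(graph, v, 4, rng) for v in graph}
```

- Each walk with `num_steps` steps contains `num_steps + 1` vertices, so its path length in edges equals the number of steps taken.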
- Each step of a random walk to the next adjacent vertex requires one iteration of the graph calculation, so the number of iterations the system must process to complete a sampling equals the length of the random walk path to be sampled. N samplings therefore require N graph calculations, one per sampling, and the total calculation time is N times the time of one distributed sampling calculation. Alternatively, an existing stand-alone system can be extended so that the N samplings run simultaneously on different hosts; the overall calculation time is then N/M times the calculation time of a single sampling, where M is the number of hosts.
- N is a variable, whereas M is small and can be treated as a constant. According to the above analysis, even if an existing graph computing system were fast enough to complete one sampling calculation in one second, a typical graph of more than 10 million nodes would need more than 10^7 (10 to the power of 7) seconds, that is, about 115 days. Reducing the time of graph calculation is therefore an important challenge.
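- The estimate above can be checked with a few lines of arithmetic. The sketch below assumes one second per sampling, as in the text; the cluster size M = 10 is a hypothetical value chosen only to illustrate the N/M scaling.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def total_days(num_sources: int, seconds_per_sample: float, num_hosts: int = 1) -> float:
    """Wall-clock days for num_sources samplings spread evenly over num_hosts hosts."""
    return num_sources * seconds_per_sample / num_hosts / SECONDS_PER_DAY

single_host_days = total_days(10**7, 1.0)              # N = 10^7, one second per sampling
ten_host_days = total_days(10**7, 1.0, num_hosts=10)   # M = 10 is a hypothetical cluster size
print(f"{single_host_days:.1f} days on one host, {ten_host_days:.1f} days on ten hosts")
```

- Because M is a small constant, dividing by M shortens the run but does not change the linear dependence on N.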
- Embodiments of the present invention provide a method, a host, and a graph computing system for graph calculation, which are used to improve the efficiency of graph calculation and save time.
- a first aspect of the embodiments of the present invention provides a method for graph calculation, which is applied to a disk-based graph computing system, where the graph computing system includes M hosts, and each host saves graph data on a local disk.
- The graph data includes N vertices, and each host simultaneously runs path calculations of N/M different sources.
- There are currently R 1 random walk instances on each vertex, and the current path length of each random walk instance is L 1 .
- The method may include:
- The first host obtains a set of calculation results produced after a vertex set completes the Xth iteration calculation, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the X+1th iteration calculation; the first host performs the X+1th concurrent iteration calculation according to the graph data and the set of calculation results, after which the number of random walk instances on each vertex becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 +1; if R 1 /2 and 2L 1 +1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.
- the iterative calculation performed by the first host is equivalent to performing the disk iterative calculation and the network iterative calculation simultaneously, that is, performing two iteration calculations of disk input/output and network input/output concurrently, which may be referred to as hybrid iterative calculation.
- After one hybrid iterative calculation, the path length of each random walk instance grows by the largest amount, from the original L 1 to 2L 1 +1, so less time is needed to complete the calculation of the graph data. Over a complete graph calculation process, the efficiency of the graph calculation is improved correspondingly and time is saved.
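- The effect of repeated hybrid iterations on the instance count and the path length can be tabulated with a short sketch. The starting values (64 instances of length 1) and the stopping rule (stop once a target length is reached or fewer than two instances remain) are assumptions made for illustration; the text does not spell out the iteration completion condition.

```python
from typing import List, Tuple

def hybrid_schedule(r: int, l: int, target_length: int) -> List[Tuple[int, int]]:
    """Return (instances, path_length) after each hybrid iteration: R -> R/2, L -> 2L + 1."""
    steps = []
    while l < target_length and r >= 2:   # assumed completion condition
        r, l = r // 2, 2 * l + 1
        steps.append((r, l))
    return steps

steps = hybrid_schedule(r=64, l=1, target_length=31)
disk_only_iterations = 31 - 1             # one extra edge per disk iteration
print(steps, disk_only_iterations)
```

- Under these assumed values, four hybrid iterations reach a path length of 31, whereas extending the walks one edge at a time would take 30 further iterations.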
- The first host obtaining the set of calculation results after the vertex set completes the Xth iteration calculation may include: the first host obtains, through the network, a calculation result set T X (u i ) produced after a vertex set u i completes the Xth iteration calculation, where the vertices in the vertex set u i are the vertices on which the update function is executed when the first host performs the X+1th iteration calculation with a vertex v as the source point.
- The vertex v is one of the N/M vertices on the first host, the vertex set u i includes the vertex v, and X is an integer greater than 0.
- The first host performing the X+1th concurrent iteration calculation according to the graph data and the calculation result set, after which the number of random walk instances on each vertex becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 +1, may include: the first host performs the X+1th concurrent iteration calculation with the vertex v as the source point according to the graph data and the calculation result set T X (u i ), after which the number of random walk instances with the vertex v as the source point becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 +1.
- This implementation describes the network iterative calculation and the disk iterative calculation specifically with the vertex v as the source point, and thereby makes concrete how the network iteration and the disk iteration are executed concurrently.
- The number of random walk instances on each vertex changes from R 1 to R 1 /2, and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 +1. Compared with a separate disk iteration or a separate network iteration, a hybrid iteration therefore increases the path length of each random walk instance the most, and the graph calculation time is reduced correspondingly.
- The first host performing the X+1th concurrent iteration calculation with the vertex v as the source point according to the graph data and the calculation result set T X (u i ), after which the number of random walk instances with the vertex v as the source point becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 +1, may include:
- the first host performs the X+1th iteration calculation with the vertex v as the source point according to the graph data, after which the number of random walk instances with the vertex v as the source point is still R 1 and the path length of each of the R 1 random walk instances is increased by one; the first host performs the X+1th iteration calculation with the vertex v as the source point according to the calculation result set T X (u i ), after which the number of random walk instances with the vertex v as the source point becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 ; based on the one-step increase of each of the R 1 random walk instances and on the path length 2L 1 of each of the R 1 /2 random walk instances, the first host determines that the current path length of each of the R 1 /2 random walk instances is 2L 1 +1.
- The iterative calculation that the first host performs according to the graph data is called a disk iteration, and the iterative calculation performed according to the calculation result set T X (u i ) obtained through the network is called a network iteration. This implementation describes specifically how the network iteration and the disk iteration are calculated concurrently.
- When a network iteration is performed, the number of random walk instances on each vertex is halved, but the path length of each random walk instance changes from the previous L 1 to 2L 1 .
- When a disk iteration is performed, the number of random walk instances on each vertex is unchanged, but the path length of each instance is increased by 1. Combining the two, the path length of each random walk instance becomes 2L 1 +1.
- The vertex set u i is {v} ∪ Q 1 , where Q 1 ⊆ Q and the set Q is {u 1 , u 2 , ..., u n }.
- The set of calculation results T X (u i ) is {T X (v)} ∪ T X (Q 1 ), where T X (Q 1 ) ⊆ T X (Q) and the set T X (Q) is {T X (u 1 ), T X (u 2 ), ..., T X (u n )};
- The first host performing the X+1th iteration calculation with the vertex v as the source point according to the calculation result set T X (u i ), after which the number of random walk instances with the vertex v as the source point becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 , may include: if Q 1 is {u 1 , u 2 , u 3 , u 4 }, the calculation result set T X (u i ) is {T X (v), T X (u 1 ), T X (u 2 ), T X (u 3 ), T X (u 4 )}; the first host performs the X+1th iteration calculation with the vertex v as the source point according to the following formula, after which the number of random walk instances with the vertex v as the source point becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 .
- The network iteration is thus described concretely and may be expressed by a formula; it changes the path length of each random walk instance from L 1 to 2L 1 .
- The first host performing the X+1th iteration calculation with the vertex v as the source point according to the calculation result set T X (u i ), after which the number of random walk instances with the vertex v as the source point becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 , may include:
- the first host obtains the current paths, after the Xth iteration calculation, of the first R 1 /2 random walk instances starting from the vertex v as the source point, and obtains the current paths, after the Xth iteration calculation, of the R 1 /2 random walk instances starting from another source point u, where the other source point u is the vertex at which the first R 1 /2 random walk instances starting from the vertex v are located after X iterations; the first host splices the current paths of the first R 1 /2 random walk instances starting from the vertex v with the current paths of the R 1 /2 random walk instances starting from the other source point u, after which the number of random walk instances with the vertex v as the source point becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 .
- This provides a specific description of how the path length of a random walk instance changes from L 1 to 2L 1 , which increases the feasibility of the technical solution of the present invention.
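- The splicing performed in a network iteration can be sketched as follows. This is an illustrative sketch under stated assumptions: walks are stored as vertex lists per source vertex, every end vertex has its own list of walks available (in a real cluster these would be fetched over the network), and the rule for choosing which walk of the other source to append (`half + i`) is hypothetical.

```python
from typing import Dict, List

Walk = List[str]   # vertex sequence; path length = len(walk) - 1 edges

def network_iteration(walks: Dict[str, List[Walk]]) -> Dict[str, List[Walk]]:
    """Turn R walks of length L per source into R//2 walks of length 2L per source."""
    spliced: Dict[str, List[Walk]] = {}
    for v, instances in walks.items():
        half = len(instances) // 2
        result = []
        for i in range(half):
            head = instances[i]              # one of the first R/2 walks from v
            end = head[-1]                   # vertex u where that walk currently sits
            tail = walks[end][half + i]      # assumed pairing: a walk from the other half at u
            result.append(head + tail[1:])   # drop the duplicated junction vertex
        spliced[v] = result
    return spliced

# R = 2 walks of length L = 1 per source vertex.
walks = {
    "a": [["a", "b"], ["a", "c"]],
    "b": [["b", "c"], ["b", "a"]],
    "c": [["c", "a"], ["c", "b"]],
}
doubled = network_iteration(walks)
```

- Each source ends up with half as many walks, and each surviving walk has twice as many edges, which matches the change from R 1 and L 1 to R 1 /2 and 2L 1 described above.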
- With reference to the first aspect of the embodiments of the present invention, or any one of the first to the fourth possible implementations of the first aspect, in a fifth possible implementation of the first aspect, there are currently R 2 random walk instances on each vertex, and the path length of each random walk instance is L 2 .
- The method may further include: the first host performs the Z+1th iteration calculation with the vertex v as the source point according to the graph data, after which the number of random walk instances with the vertex v as the source point is still R 2 and the path length of each of the R 2 random walk instances is increased by one, where Z is an integer greater than or equal to 0.
- With reference to the first aspect of the embodiments of the present invention, or any one of the first to the fifth possible implementations of the first aspect, in a further possible implementation there are currently R 3 random walk instances on each vertex, and the current path length of each random walk instance is L 3 ; the method may further include:
- the first host obtains, through the network, a calculation result set T Y (h i ) produced after a vertex set h i completes the Yth iteration calculation, where the vertices in the vertex set h i are the vertices on which the update function is executed when the first host performs the Y+1th iteration calculation with the vertex v as the source point, the vertex set h i includes the vertex v, and Y is an integer greater than 0; the first host performs the Y+1th iteration calculation with the vertex v as the source point according to the calculation result set T Y (h i ), after which the number of random walk instances with the vertex v as the source point becomes R 3 /2 and the current path length of each of the R 3 /2 random walk instances becomes 2L 3 .
- In this implementation, each random walk instance gains more path length in a single iteration.
- The vertex set h i is {v} ∪ W 1 , where W 1 ⊆ W and the set W is {h 1 , h 2 , ..., h n }; the set of calculation results T Y (h i ) is {T Y (v)} ∪ T Y (W 1 ), where T Y (W 1 ) ⊆ T Y (W) and the set T Y (W) is {T Y (h 1 ), T Y (h 2 ), ..., T Y (h n )};
- The first host performing the Y+1th iteration calculation with the vertex v as the source point according to the calculation result set T Y (h i ), after which the number of random walk instances with the vertex v as the source point becomes R 3 /2 and the current path length of each of the R 3 /2 random walk instances becomes 2L 3 , may include: if W 1 is {h 1 , h 2 , h 3 , h 4 }, the calculation result set T Y (h i ) is {T Y (v), T Y (h 1 ), T Y (h 2 ), T Y (h 3 ), T Y (h 4 )};
- the first host performs the Y+1th iteration calculation with the vertex v as the source point according to the following formula, after which the number of random walk instances with the vertex v as the source point becomes R 3 /2 and the path length of each of the R 3 /2 random walk instances becomes 2L 3 ;
- T Y (v) represents the calculation result after the Yth iteration with the vertex v as the source point, and the operator in the formula indicates that the calculation results on its left and right sides are combined by the specified operation.
- The network iterative calculation is thus described concretely and may be expressed by a formula; it doubles the path length of each random walk instance.
- The first host performing the Y+1th iteration calculation with the vertex v as the source point according to the calculation result set T Y (h i ), after which the number of random walk instances with the vertex v as the source point becomes R 3 /2 and the current path length of each of the R 3 /2 random walk instances becomes 2L 3 , may include:
- the first host obtains the current paths, after the Yth iteration calculation, of the first R 3 /2 random walk instances starting from the vertex v as the source point, and obtains from another host the current paths, after the Yth iteration calculation, of the R 3 /2 random walk instances starting from another source point w, where the other source point w is the vertex at which the first R 3 /2 random walk instances starting from the vertex v are located after Y iterations; the first host splices the current paths of the first R 3 /2 random walk instances starting from the vertex v with the current paths of the R 3 /2 random walk instances starting from the other source point w, after which the number of random walk instances with the vertex v as the source point becomes R 3 /2 and the current path length of each of the R 3 /2 random walk instances becomes 2L 3 .
- This provides a specific description of how the path length of a random walk instance changes from L 3 to 2L 3 , which increases the feasibility of the technical solution of the present invention.
- A second aspect of the embodiments of the present invention provides a host, where the host is applied to a disk-based graph computing system, the graph computing system includes M hosts, each host saves graph data on a local disk, the graph data includes N vertices, and each host simultaneously runs path calculations of N/M different sources.
- Each vertex currently has R 1 random walk instances, and the current path length of each random walk instance is L 1 .
- the host includes:
- an obtaining unit, configured to obtain a set of calculation results produced after a vertex set completes the Xth iteration calculation, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the X+1th iteration calculation;
- a calculation unit, configured to perform the X+1th concurrent iteration calculation according to the graph data and the calculation result set, after which the number of random walk instances on each vertex becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 +1; and further configured to complete the calculation of the graph data if R 1 /2 and 2L 1 +1 satisfy the iteration completion condition.
- the host has the function of implementing a method corresponding to the graph calculation provided by the above first aspect.
- This function can be implemented by hardware, or by hardware executing corresponding software.
- the hardware or software includes one or more modules corresponding to the functions described above.
- A third aspect of the embodiments of the present invention provides a host, where the host is applied to a disk-based graph computing system, the graph computing system includes M hosts, each host saves graph data on a local disk, the graph data includes N vertices, and each host simultaneously runs path calculations of N/M different sources.
- Each vertex currently has R 1 random walk instances, and the current path length of each random walk instance is L 1 .
- the host can include a transceiver, a processor; the transceiver and the processor are connected by a bus;
- the transceiver is configured to obtain a set of calculation results produced after a vertex set completes the Xth iteration calculation, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the X+1th iteration calculation;
- the processor is configured to perform the X+1th concurrent iteration calculation according to the graph data and the calculation result set, after which the number of random walk instances on each vertex becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 +1; and is further configured to complete the calculation of the graph data if R 1 /2 and 2L 1 +1 satisfy the iteration completion condition.
- A fourth aspect of the embodiments of the present invention provides a graph computing system, where the graph computing system includes M hosts, each host stores graph data on a local disk, the graph data includes N vertices, and each host simultaneously runs path calculations of N/M different sources;
- a fifth aspect of the embodiments of the present invention provides a storage medium.
- The part of the technical solution of the present invention that essentially contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product.
- The computer software product is stored in a storage medium and is used to store the computer software instructions used by the foregoing apparatus, including a program designed for a host to execute the first aspect described above.
- The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.
- The first host obtains a set of calculation results produced after a vertex set completes the Xth iteration calculation, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the X+1th iteration calculation;
- the first host performs the X+1th concurrent iteration calculation according to the graph data and the calculation result set, after which the number of random walk instances on each vertex becomes R 1 /2 and the current path length of each of the R 1 /2 random walk instances becomes 2L 1 +1; if R 1 /2 and 2L 1 +1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.
- The iterative calculation performed by the first host is equivalent to performing the disk iterative calculation and the network iterative calculation simultaneously, that is, concurrently executing two iterative calculations, one over disk input/output (I/O) and one over network I/O, which can be called a hybrid iterative calculation. After one hybrid iterative calculation, the path length of each random walk instance grows by the largest amount, from the original L 1 to 2L 1 +1, so less time is needed to complete the calculation of the graph data. Over a complete graph calculation process, the efficiency of the graph calculation is improved correspondingly and time is saved.
- FIG. 1 is a system architecture diagram of a graph computing system in an embodiment of the present invention
- FIG. 2 is a schematic diagram of running a disk-based graph calculation on different M hosts in a cluster according to an embodiment of the present invention
- FIG. 3 is a schematic diagram of an embodiment of a method for calculating a graph in an embodiment of the present invention
- FIG. 5 is a schematic diagram of an embodiment of a host according to an embodiment of the present invention.
- FIG. 6 is a schematic diagram of another embodiment of a host according to an embodiment of the present invention.
- Embodiments of the present invention provide a method, a host, and a graph computing system for graph calculation, which are used to increase the rate of graph calculation and save time.
- A graph is a way of representing the relationships between objects and is the basic research object of graph theory.
- A graph consists of small dots (called vertices or nodes) and the lines or curves (called edges) that join the dots.
- Each web page can be regarded as a vertex. If a web page contains a link to another web page, an edge can be regarded as connecting the two web pages.
- Take a social network as an example: each user in the social network can be regarded as a vertex, and a friend relationship between users can be regarded as an edge.
- A graph computing algorithm must search for paths in a large amount of graph data, which is computationally intensive and highly complex.
- In-memory computing mainly divides the graph data into equal parts according to the number of computing nodes in the cluster, assigns the partitioned graph data to the memory of those computing nodes, and then starts the calculation.
- Abstractly, the graph data consists of N vertices (nodes), there are M hosts, and each host serially runs path calculations of N/M different sources.
- An algorithm based on random walks requires that, for each of the N nodes in the graph, a random walk path with that node as the source point be searched for independently. Each step of a random walk to the next adjacent vertex requires one iteration of the existing graph calculation, so the number of iterations the system must process to complete a sampling equals the length of the random walk path to be sampled. N samplings therefore require N graph calculations, one per sampling, and the total calculation time is N times the time of one sampling calculation. If an existing stand-alone system is extended so that the N samplings run simultaneously on multiple different hosts, each host must perform N/M samplings, and because the M hosts run at the same time, the total calculation time is N/M times the time of one sampling calculation. In the context of big data, N is usually very large, so completing the required sampling calculations takes a very long time. How to improve the efficiency of graph calculation is therefore a major problem.
- The technical solution of the present invention proposes a method that uses external storage, that is, a disk-based distributed graph calculation method. Each host has sufficient storage space to store the graph data on its local disk, so it is not necessary to partition the complete graph data across the memory of the hosts before calculating. The architecture and design of a traditional graph computing system usually retain the calculation state of only one graph calculation, whereas N calculation states must now be retained at the same time. The size of each calculation state is usually linear in a given step size L or in N, that is, O(L) or even O(N), so the size of all calculation states is O(L*N) or even O(N^2). For a large graph, data of this size cannot be held in memory.
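- A rough sizing sketch shows why these states do not fit in memory. The figures used here (N = 10^7 vertices, step size L = 100, 8 bytes per stored entry) are assumed values for illustration and do not come from the patent.

```python
def state_bytes(num_sources: int, entries_per_state: int, bytes_per_entry: int = 8) -> int:
    """Total bytes needed to keep one calculation state per source vertex."""
    return num_sources * entries_per_state * bytes_per_entry

n = 10**7                            # assumed number of vertices / source points
linear_in_l = state_bytes(n, 100)    # each state is O(L), with L = 100 assumed
linear_in_n = state_bytes(n, n)      # each state is O(N)
print(f"O(L*N): {linear_in_l / 1e9:.0f} GB, O(N^2): {linear_in_n / 1e12:.0f} TB")
```

- Even the O(L*N) case reaches several gigabytes under these assumptions, and the O(N^2) case reaches hundreds of terabytes, which is why the states are kept on disk.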
- As shown in FIG. 1, the graph computing system to which the present invention is applied includes M hosts, each of which stores graph data on a local disk. The number of hosts shown in FIG. 1 is only an example and is not limited in practical applications.
- the graph data includes N vertices, and each host runs a path computation of N/M different sources.
- The graph computing system is a distributed cluster architecture. As shown in FIG. 2, a disk-based graph calculation is performed on M different hosts in the cluster, and N/M path calculation tasks with different sources run simultaneously on each host. Graph calculation can be divided into three parts: offline preprocessing, loading of graph data, and online graph calculation. In the preprocessing stage, the system divides the input data into several pieces and saves them on the disk of each computer. Each time a piece of data is loaded, the path calculation task code for the N/M different sources is scheduled for execution.
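- One simple way to give each of the M hosts roughly N/M source vertices is a round-robin assignment, sketched below. The assignment rule and the name `assign_sources` are assumptions for illustration; the text does not state how sources are distributed.

```python
from typing import List, Sequence

def assign_sources(vertices: Sequence[str], num_hosts: int) -> List[List[str]]:
    """Round-robin assignment: host h receives vertices h, h + M, h + 2M, ..."""
    return [list(vertices[h::num_hosts]) for h in range(num_hosts)]

vertices = [f"v{i}" for i in range(10)]   # N = 10 source vertices
assignment = assign_sources(vertices, 3)  # M = 3 hosts
for host, sources in enumerate(assignment):
    print(host, sources)
```

- When N is not a multiple of M, the first hosts receive one extra source, so per-host loads differ by at most one.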
- The graph computing system runs every algorithm in multiple rounds of iteration until the algorithm converges.
- Graph data processing algorithms are generally abstracted with the "think like a vertex" idea, that is, an update function on the graph is formed by writing a vertex program; the update function is defined by the user.
- Conventionally, a user-defined update function can process the path calculation of only one source point in one iteration calculation, whereas in the technical solution of the present invention the user-defined update function can process the path calculations of multiple different sources concurrently.
- the path calculation of different sources mentioned here refers to the path calculation with different vertices as the source point.
- In each iteration calculation, the system executes the update function on every vertex of the graph data.
- The system can quickly read the target graph data stored on the local disk through sequential disk input/output (I/O) for calculation, and the hosts do not need to exchange graph data over the network.
- While sequential disk I/O reads the target graph data for iterative calculation, network resources are also used efficiently: the calculation result set T X (u i ) is obtained through the network between different hosts.
- The calculation result set T X (u i ) is the result of the previous iteration calculation completed by the vertex set u i on which the update function is to be executed. Macroscopically, the iterative calculation on data acquired from the disk and the iterative calculation on data acquired through the network can be regarded as being performed simultaneously, and the path length of the R 1 /2 random walk instances is increased to 2L 1 +1.
- If the target data is only obtained from the local disk and iteratively calculated, the path length of each random walk instance can only be increased by 1.
- If only the calculation result set T X (u i ) of the previous iteration is obtained from the network and used for this iterative calculation, the path length of the R 1 /2 random walk instances is increased to 2L 1 .
- For convenience of description, the iterative calculation on data acquired from the disk is referred to as a "disk iteration", the iterative calculation on data acquired from the network is referred to as a "network iteration", and performing the two concurrently is referred to as a "hybrid iteration". It should be understood that disk iteration, network iteration and hybrid iteration are not standard technical terms.
- As shown in FIG. 3, an embodiment of a method for graph calculation in an embodiment of the present invention is applied to a disk-based graph computing system.
- The system includes M hosts, each of which saves graph data on a local disk.
- The graph data includes N vertices, and each host simultaneously runs path calculations of N/M different sources. The method includes:
- The first host performs the Z+1th iteration calculation with the vertex v as the source point according to the graph data, after which the number of random walk instances with the vertex v as the source point is still R 2 and the path length of each of the R 2 random walk instances is increased by 1, where Z is an integer greater than or equal to 0;
- There are currently R 2 random walk instances on each vertex, and the path length of each random walk instance is L 2 .
- the graph data here is the graph data that each host saves on its local disk.
- for example, after the first iteration calculation with v as the source point is performed on the graph data saved on the disk, the number of random walk instances with the vertex v as the source point is unchanged and remains 40, but the path length of each of these 40 random walk instances increases by 1; such a path can be expressed as {v, w}.
- the first iteration calculation can only be a disk iteration. Subsequent iteration calculations are mainly hybrid iterations because they are more efficient, but individual disk iterations and network iterations may be interspersed according to actual conditions; this is not limited.
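A disk iteration as described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `disk_iteration`, the toy adjacency-list graph and the fixed random seed are hypothetical and do not come from the patent; path length is counted as the number of edges in a path.

```python
import random


def disk_iteration(graph, walks, rng):
    """Extend every walk by one edge to a uniformly chosen neighbour of its end vertex."""
    extended = []
    for path in walks:
        next_vertex = rng.choice(graph[path[-1]])
        extended.append(path + [next_vertex])
    return extended


# Hypothetical toy graph as adjacency lists; it is not taken from the patent.
graph = {"v": ["w", "u"], "w": ["u", "v"], "u": ["v", "w"]}
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 40 random walk instances with source vertex v, path length 0 before the first iteration.
walks = [["v"] for _ in range(40)]
walks = disk_iteration(graph, walks, rng)

print(len(walks))                         # the number of instances is unchanged
print({len(path) - 1 for path in walks})  # every path length has grown by 1
```

The count of instances stays the same and only the length changes, which is why a calculation that uses disk iterations alone needs one pass over the disk per unit of path length.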
- the first host acquires the calculation result set obtained after the Xth iteration calculation is performed on a vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation;
- specifically, it can be explained in step 302.
- the first host obtains, through the network, the calculation result set T_X(u_i) of the Xth iteration calculation completed on the vertex set u_i. The vertices included in the vertex set u_i are the vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation with the vertex v as the source point. The vertex v is one of the N/M vertices on the first host, the vertex set u_i includes the vertex v, and X is an integer greater than 0.
- there are currently R1 random walk instances on each vertex, and the current path length of each random walk instance is L1.
- for example, the host 1 obtains the calculation result set T_X(u_i) from the network. Assume that, for the second iteration calculation with the vertex v as the source point, the vertex set on which the update function is to be executed is {v, u1, u2, u3, u4}; that is, v, u1, u2, u3 and u4 are the vertices on which the update function is executed for the second time. The corresponding calculation results after the first iteration calculation are T1(v), T1(u1), T1(u2), T1(u3) and T1(u4), so the calculation result set is {T1(v), T1(u1), T1(u2), T1(u3), T1(u4)}. The number of vertices on which the update function is executed is not limited here.
- the first host performs the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1; specifically, it can be explained in step 303.
- the first host performs the (X+1)th concurrent iteration calculation with the vertex v as the source point according to the graph data and the calculation result set T_X(u_i), so that the number of random walk instances with the vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1.
- the method may include:
- 1) the first host performs the (X+1)th iteration calculation with the vertex v as the source point according to the graph data, so that the number of random walk instances with the vertex v as the source point remains R1 and each of the R1 random walk instances increases by 1 path length;
- 2) the first host performs the (X+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_X(u_i), so that the number of random walk instances with the vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1;
- 3) from the increase of 1 path length for each of the R1 random walk instances and the current path length 2L1 of each of the R1/2 random walk instances, the first host determines that the current path length of each of the R1/2 random walk instances becomes 2L1+1.
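The way the disk part and the network part of one hybrid iteration combine can be sketched as follows. This is a minimal sketch under assumptions: the names `disk_part`, `network_part` and `hybrid_iteration` are hypothetical, only the instance count and the path length are modelled (not the paths themselves), and two Python threads stand in for concurrent disk I/O and network I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def disk_part(count, length):
    """Disk iteration: every instance is kept and gains one edge."""
    return count, 1  # (instances kept, path length added)


def network_part(count, length):
    """Network iteration: instances are halved, spliced paths are twice as long."""
    return count // 2, 2 * length


def hybrid_iteration(count, length):
    """Run both parts concurrently and combine their results into (R/2, 2L+1)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        disk_future = pool.submit(disk_part, count, length)
        network_future = pool.submit(network_part, count, length)
        _, added = disk_future.result()
        new_count, spliced_length = network_future.result()
    return new_count, spliced_length + added


print(hybrid_iteration(40, 1))  # 40 instances of length 1 become 20 instances of length 3
```

Because neither part waits for the other, the wall-clock cost of the hybrid iteration in this sketch is governed by the slower of the two parts.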
- the step in which the first host performs the (X+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_X(u_i), so that the number of random walk instances with the vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1, may also include:
- the vertex set u_i is {v} ∪ Q1, where Q1 ⊆ Q and the set Q is {u1, u2, ..., un}; the calculation result set T_X(u_i) is {T_X(v)} ∪ T_X(Q1), where T_X(Q1) ⊆ T_X(Q) and the set T_X(Q) is {T_X(u1), T_X(u2), ..., T_X(un)};
- the first host performs the (X+1)th iteration calculation with the vertex v as the source point according to the following formula, so that the number of random walk instances with the vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1;
- in the formula, T_X(v) represents the calculation result after the Xth iteration calculation with the vertex v as the source point, and the operator indicates that the calculation results on its left and right sides are combined by the specified operation. It should be understood that the above set of vertices Q1 is only an illustrative example and does not limit the vertices included in Q1.
- that is, the first host splices the current paths of the first R1/2 random walk instances starting from the source point v with the current paths of the last R1/2 random walk instances starting from another source point w, so that the number of random walk instances with the vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1.
- for example, the host 1 performs the second iteration calculation as a hybrid iteration: it acquires the graph data and the calculation result set T_X(u_i) and performs the second iteration calculation with the vertex v as the source point.
- the disk iteration and the network iteration are performed concurrently, but they are described separately here for clarity.
- the host 1 performs the disk iteration according to the graph data. The number of random walk instances with the vertex v as the source point is unchanged and remains 40, but the current path length of each of the 40 random walk instances increases by 1; that is, the path length of each random walk instance is now 2.
- the host 1 performs the network iteration according to the calculation result set T_X(u_i).
- the corresponding formula is as follows:
- the host 1 performs the iteration calculation based on the formula, in which the operator indicates that the calculation results on its left and right sides are combined by the specified operation.
- the number of random walk instances with v as the source point is halved to 20, and the path of each of the 20 random walk instances is spliced with the path of a random walk instance from another source point. Therefore, the path length of 1 after the end of the first iteration is doubled to 2.
- in addition, each random walk instance also gains 1 path length from the disk iteration. The result is that the number of random walk instances with v as the source point becomes 20, and the path length of each of the 20 random walk instances becomes 3.
- specifically, in the second iteration calculation, the host 1 scans the results of the first iteration calculation and selects the paths of the first 20 random walk instances starting from each vertex.
- for each such path starting from the vertex v, the host finds another path from the first iteration calculation, namely one of the last 20 random walk paths starting from the vertex at which the path ends, and then merges the two paths into one path, so the path length of each of the 20 random walk instances becomes 2. For example, suppose there is a path {v, w} starting from v.
- the machine where the node v is located sends a request to the machine where the node w is located, gets a path {w, u} on w, and then merges the 2 paths into {v, w, u}. After the network-based random walk calculation has run, 20 random walk paths of double length are obtained on each node.
- together with the disk iteration, the path of each of the 20 random walk instances obtained now can be expressed as {v, w, u, x}, where {u, x} is the path length added by the disk iteration.
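The splicing just described can be sketched on actual paths. This is a sketch under assumptions, not the patented implementation: `network_iteration` and `walks_by_source` are hypothetical names, all walks live in one in-memory dictionary instead of being requested from other hosts, tails are picked round-robin (the text does not say how a tail is chosen), and the vertex x appended at the end is the hypothetical neighbour of u used in the example above.

```python
def network_iteration(walks_by_source):
    """Splice the first half of each vertex's walks with a walk taken from the
    last half of the walks kept at the vertex where each walk currently ends."""
    spliced = {}
    for source, walks in walks_by_source.items():
        heads = walks[: len(walks) // 2]
        merged = []
        for i, head in enumerate(heads):
            candidates = walks_by_source[head[-1]]
            tails = candidates[len(candidates) // 2:]
            tail = tails[i % len(tails)]    # assumption: tails are picked round-robin
            merged.append(head + tail[1:])  # drop the duplicated joint vertex
        spliced[source] = merged
    return spliced


# Two walks of path length 1 per vertex, standing in for the 40 per vertex in the text.
walks_by_source = {
    "v": [["v", "w"], ["v", "w"]],
    "w": [["w", "u"], ["w", "u"]],
    "u": [["u", "v"], ["u", "v"]],
}
spliced = network_iteration(walks_by_source)
print(spliced["v"])  # one walk remains at v, with path length 2

# The disk half of the hybrid iteration then adds one more edge, here u -> x.
after_disk = spliced["v"][0] + ["x"]
print(after_disk, len(after_disk) - 1)
```

Using the first half as heads and the last half as tails keeps the two halves of every spliced path from reusing the same sampled walk at a vertex.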
- the first host obtains, through the network, the calculation result set T_Y(h_i) of the Yth iteration calculation completed on the vertex set h_i. The vertices included in the vertex set h_i are all the vertices on which the update function is to be executed when the first host performs the (Y+1)th iteration calculation with the vertex v as the source point. The vertex set h_i includes the vertex v, and Y is an integer greater than 0. Each vertex currently has R3 random walk instances, and the current path length of each random walk instance is L3.
- for example, following step 303, there are currently 20 random walk instances on each vertex, and the current path length of each random walk instance is 3.
- the first host performs the (Y+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_Y(h_i), so that the number of random walk instances with the vertex v as the source point becomes R3/2 and the current path length of each of the R3/2 random walk instances becomes 2L3;
- that is, after the first host obtains the calculation result set T_Y(h_i) through the network, it performs the iteration calculation according to T_Y(h_i), so that the number of random walk instances with the vertex v as the source point becomes R3/2 and the current path length of each of the R3/2 random walk instances becomes 2L3.
- the step may specifically include:
- the vertex set h_i is {v} ∪ W1, where W1 ⊆ W and the set W is {h1, h2, ..., hn}; the calculation result set T_Y(h_i) is {T_Y(v)} ∪ T_Y(W1), where T_Y(W1) ⊆ T_Y(W) and the set T_Y(W) is {T_Y(h1), T_Y(h2), ..., T_Y(hn)};
- the first host performs the (Y+1)th iteration calculation with the vertex v as the source point according to the following formula, so that the number of random walk instances with the vertex v as the source point becomes R3/2 and the path length of each of the R3/2 random walk instances becomes 2L3;
- in the formula, T_Y(v) represents the calculation result after the Yth iteration calculation with the vertex v as the source point.
- for example, the network iteration is used on its own here.
- the host 1 performs the network iteration according to the calculation result set T_Y(h_i). If the calculation result set T_Y(h_i) is {T_Y(v), T_Y(h1), T_Y(h2), ..., T_Y(hn)}, the host 1 performs the iteration calculation according to the formula, so that the number of random walk instances with v as the source point is halved to 10 and the path length of each random walk instance increases from the previous 3 to 6.
- the path of each of the 10 random walk instances can be expressed as {v, w, u, x, y, z, a}.
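The numbers in this embodiment (40 instances, then 20, then 10) can be replayed with a small state trace. The sketch assumes a starting path length of 0 before the first iteration; `RULES` and `trace` are hypothetical names, and only the pair (instance count, path length) is tracked.

```python
# Update rules for (instance count R, path length L), as stated in the text.
RULES = {
    "disk": lambda r, l: (r, l + 1),               # R unchanged, L + 1
    "network": lambda r, l: (r // 2, 2 * l),       # R / 2, 2L
    "hybrid": lambda r, l: (r // 2, 2 * l + 1),    # R / 2, 2L + 1
}


def trace(count, length, schedule):
    """Return the (count, length) state before and after every iteration."""
    states = [(count, length)]
    for kind in schedule:
        count, length = RULES[kind](count, length)
        states.append((count, length))
    return states


# First iteration on disk, second as a hybrid iteration, third over the network.
states = trace(40, 0, ["disk", "hybrid", "network"])
for state in states:
    print(state)
```

The last state, 10 instances of path length 6, matches the path {v, w, u, x, y, z, a}, which has seven vertices and six edges.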
- steps 305 and 306 are optional. In actual applications, whether to execute them may be determined according to actual needs. Moreover, although hybrid iteration is relatively fast, apart from the first iteration calculation, which uses disk iteration, the other iteration calculations in the graph calculation process may use hybrid iteration, disk iteration or network iteration according to actual needs, without specific limitation.
- the first host completes the calculation of the graph data.
- for example, following step 305, R3/2 corresponds to 10 and 2L3 corresponds to {v, w, u, x, y, z, a}, that is, a path length of 6; the host 1 then completes the calculation of the graph data.
- when disk iteration is used, each random walk instance increases by one path length; when hybrid iteration is used, the current path length of each of the R1/2 random walk instances changes from the previous L1 to 2L1+1; when network iteration is used, the current path length of each of the R3/2 random walk instances changes from the previous L3 to 2L3. Therefore, if only disk iterations are used, the disk must be accessed many times and efficiency is limited.
- using network iterations can greatly reduce the number of disk scans. However, network iterations are limited by network resources and do not make full use of disk resources. Hybrid iterations use both the disk and the network to grow the random walk paths and therefore give the fastest growth rate.
- the graph computing system of the embodiment of the present invention can perform the two processes of disk iteration and network iteration concurrently, and is therefore much faster than existing distributed graph computing systems on the related algorithms.
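The difference in growth rate between the three kinds of iteration can be made concrete by counting the rounds needed to reach a target path length. This is a rough sketch under assumptions: the function names and the target length of 1000 are hypothetical, every walk starts at path length 1 (after the first disk iteration), and the halving of the number of instances is ignored, so enough instances are assumed to be available.

```python
def disk_step(length):
    return length + 1        # disk iteration: L -> L + 1


def network_step(length):
    return 2 * length        # network iteration: L -> 2L


def hybrid_step(length):
    return 2 * length + 1    # hybrid iteration: L -> 2L + 1


def iterations_to_reach(target_length, step, start_length=1):
    """Count how many iterations of `step` are needed to reach target_length."""
    length, rounds = start_length, 0
    while length < target_length:
        length = step(length)
        rounds += 1
    return rounds


for name, step in [("disk", disk_step), ("network", network_step), ("hybrid", hybrid_step)]:
    print(name, iterations_to_reach(1000, step))
```

Under these assumptions the disk-only schedule needs hundreds of rounds, while the network and hybrid schedules need about ten, with the hybrid schedule one round ahead.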
- the graph data here is an image representation, which is described above and will not be described here.
- the four vertices are: 1, 2, 3, and 4.
- the path calculation of the vertices 1 and 2 is concurrently executed on the host 1; the path calculation of the vertices 3 and 4 is concurrently executed on the host 2.
- there are two random walk instances starting from vertex 1, two starting from vertex 2, two starting from vertex 3, and two starting from vertex 4.
- in the second iteration calculation, a hybrid iteration, a network iteration or a disk iteration can be performed.
- the technical solution of the present invention mainly performs hybrid iteration, so hybrid iteration is used for the explanation here. Although the result of a network iteration is only one path length shorter than that of a hybrid iteration, in the context of big data each additional path length represents a large amount of processed data.
- in the disk iteration, the path length of each random walk instance is increased by one.
- in the network iteration, the paths 4→2, 2→3, 1→4 and 3→1 are spliced on the corresponding vertices, so the path of the one random walk instance on each vertex is doubled from the length of 1 after the first iteration to a length of 2. Adding the disk iteration of the hybrid iteration, the path length of the one random walk instance on each vertex increases to 3. That is, after the second iteration calculation is completed, the number of random walk instances on each vertex becomes one, and the path length of each random walk instance increases to 3.
- the disk iteration and the network iteration of the hybrid iteration are written separately here mainly so that the process can be understood more clearly.
- the duration of a hybrid iteration is the longer of the durations of its disk iteration and its network iteration.
- for the process of a standalone network iteration, refer to the process of the network iteration in the hybrid iteration described above; details are not described here again.
- because the number of iterations in the above hybrid iteration example is small, it may not show clearly that a network iteration doubles the path length of a random walk instance.
- the following simple example illustrates the network iteration.
- after the second iteration calculation with a as the source point, the current path length is 3 and the path is {a, b, c, d}; after the second iteration calculation with d as the source point, the current path length is also 3 and the path is {d, e, f, g}.
- when the third iteration calculation, a network iteration, is performed with the vertex a as the source point, {a, b, c, d} and {d, e, f, g} can be merged, so the current path with the vertex a as the source point becomes {a, b, c, d, e, f, g}, and its length changes from the previous 3 to 6. Therefore, one network iteration calculation doubles the path length.
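The merge in this example is short enough to check directly. The helper name `merge_paths` is hypothetical, and the check that the second path starts where the first one ends is an assumption added for safety.

```python
def merge_paths(head, tail):
    """Join two random walk paths that share a joint vertex."""
    if head[-1] != tail[0]:
        raise ValueError("tail must start at the vertex where head ends")
    return head + tail[1:]  # keep the joint vertex only once


head = ["a", "b", "c", "d"]  # path length 3, source point a
tail = ["d", "e", "f", "g"]  # path length 3, source point d
merged = merge_paths(head, tail)
print(merged, len(merged) - 1)
```

Counting edges rather than vertices is what makes the result exactly double: two paths of three edges give one path of six edges, even though the merged path has seven vertices.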
- the graph computing system includes M hosts, each of which stores graph data on a local disk; the graph data includes N vertices, and each host simultaneously runs path calculations for N/M different source points; there are currently R1 random walk instances on each vertex, and the current path length of each random walk instance is L1. The host includes:
- the obtaining unit 501 is configured to obtain the calculation result set obtained after the Xth iteration calculation is performed on the vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation;
- the calculating unit 502 is configured to perform the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1; and is further configured to complete the calculation of the graph data if R1/2 and 2L1+1 satisfy the iteration completion condition.
- the obtaining unit 501 is specifically configured to perform steps 302 and 304 shown in FIG. 3 above.
- the calculating unit 502 is specifically configured to perform steps 301, 303, 305, and 306 shown in FIG. 3 above.
- FIG. 6 is a schematic diagram of another embodiment of a host in an embodiment of the present invention, including:
- the host may vary considerably depending on configuration or performance, and may include a transceiver 601, one or more central processing units (CPU) 602 (e.g., one or more processors), a memory 603, and one or more storage media 604 (e.g., one or more mass storage devices) that store an application 6041 or data 6042.
- the memory 603 and the storage medium 604 may be short-term storage or persistent storage.
- the program stored on storage medium 604 may include one or more modules (not shown in Figure 6), each of which may include a series of instruction operations in a wireless network controller.
- central processor 602 can be configured to communicate with storage medium 604 to perform a series of instruction operations in storage medium 604 on the wireless network controller.
- the transceiver 601 is configured to obtain the calculation result set obtained after the Xth iteration calculation is performed on the vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation.
- the processor 602 is configured to perform the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1; if R1/2 and 2L1+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.
- the transceiver 601 is specifically configured to perform the steps 302 and 304 shown in FIG. 3 above.
- the processor 602 is specifically configured to perform steps 301, 303, 305, and 306 shown in FIG. 3 above.
- the embodiment of the invention further provides a computer program product for data processing, comprising a computer readable storage medium storing program code, the program code comprising instructions for executing the method flow of any one of the foregoing method embodiments.
- a computer readable storage medium includes: a USB flash drive, a mobile hard disk, a magnetic disk, an optical disk, a random access memory (RAM), a solid state disk (SSD), or other non-volatile storage media.
- the disclosed system, apparatus, and method may be implemented in other manners.
- the device embodiments described above are merely illustrative.
- the division into units is only a division by logical function.
- in actual implementation there may be another manner of division; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
- the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
- each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
- the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
- the technical solution of the present invention which is essential or contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
- a number of instructions are included to cause a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention.
- the foregoing storage medium includes: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A graph calculation method for increasing the speed of graph calculation and thereby saving time. The method comprises: a first host acquiring a calculation result set obtained after the Xth iterative calculation is carried out on a vertex set, the vertex set being the set of vertices on which an update function is to be executed when the first host carries out the (X+1)th iterative calculation; the first host carrying out the (X+1)th concurrent iterative calculation according to graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1; and, if R1/2 and 2L1+1 satisfy the iteration completion condition, the first host completing the calculation of the graph data.
Description
This application claims priority to Chinese Patent Application No. 201610527136.7, filed with the Chinese Patent Office on July 6, 2016 and entitled "Method for graph data calculation, host, and graph computing system", which is incorporated herein by reference in its entirety.
The present invention relates to the field of computers, and in particular, to a method for graph data calculation, a host, and a graph computing system.
With advances in the ability to collect and generate data, we have entered the era of big data: every day, large amounts of data are collected from all kinds of sensors, devices and the Internet. To find new business value and build new business models, this big data must be processed, analyzed, stored and understood. Driven by the need for large-scale graph data analysis, many distributed or single-machine parallel graph computing systems have emerged in recent years. Common examples include the large-scale distributed graph computing framework Pregel, the memory-based distributed graph computing system GraphLab, and the disk-based single-machine graph computing system GraphChi.
"Graph computing" is an abstract representation, based on graph theory, of "graph" structures in the real world, together with a computational model on this data structure. A graph computing system runs every algorithm in multiple rounds of iteration until the algorithm converges. Data processing algorithms are generally abstracted with the "think like a vertex" approach: an update function on the graph is formed by writing a vertex program, where the update function is defined by the user. In the prior art, the update function is a user-defined calculation that can process one source path. An update function can modify the weights of a vertex and of the edges connected to it. Among the multiple iterations in which the graph computation completes an algorithm, each iteration consists of the system executing the update function once on every vertex of the graph.
However, in the context of big data analysis, the graph to be processed is usually larger than the memory of one computer. Therefore, for graph computing, the graph is divided into equal parts according to the number of computing nodes in the cluster, and the parts are loaded into the memory of those computing nodes before the calculation starts. During graph computing, the hosts must continuously communicate with one another over the network to report the calculation state in their own memory so that the overall calculation can proceed. A graph calculation method based on random walks is generally used. A graph includes N nodes, and for the N nodes a random walk path with that node as the source point must be searched independently from each node, so N graph calculations are needed for the same graph. Each random walk starts from a different starting node in the graph, and at each step advances to a randomly selected neighboring vertex of the current node. Each step of a random walk toward the next neighboring vertex requires one iteration of the graph calculation, so the system processes as many iterations as the length of the random walk path to be sampled in order to complete that sample. N samples therefore run N graph calculations, each graph calculation corresponding to one sampling calculation, and the total calculation time is N times the time of one distributed sampling calculation. Alternatively, an existing single-machine system can be extended to run the N samples simultaneously on multiple different hosts, so that the overall calculation time is N/M times the calculation time of one single-machine sample, where M is the number of hosts.
However, for the graph data that a graph computing system has to process, N is very large and is a variable, whereas M is small and is a constant. Therefore, according to the above analysis, even if an existing graph computing system were fast enough to complete one sampling calculation per second, a common graph of more than ten million nodes would already need more than 10^7 seconds, that is, about 115 days. How to reduce the time of graph calculation is therefore an important challenge.
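The estimate above is plain arithmetic and can be checked with a short sketch. The function name `total_days` and the host count of 10 are hypothetical; the one-second sampling time and the ten million vertices are the figures used in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def total_days(num_vertices, seconds_per_sample, num_hosts=1):
    """Total time in days for N samples run N/M at a time across M hosts."""
    return num_vertices * seconds_per_sample / num_hosts / SECONDS_PER_DAY


print(round(total_days(10**7, 1.0), 1))                # one host
print(round(total_days(10**7, 1.0, num_hosts=10), 1))  # ten hosts, still on the order of days
```

Because M is a small constant while N grows with the data, adding hosts divides the total by a fixed factor but does not change how the cost scales with the size of the graph.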
Summary of the invention
Embodiments of the present invention provide a graph calculation method, a host, and a graph computing system, which are used to improve the efficiency of graph calculation and save time.
A first aspect of the embodiments of the present invention provides a graph calculation method. The method is applied to a disk-based graph computing system. The graph computing system includes M hosts, each host saves graph data on a local disk, the graph data includes N vertices, and each host simultaneously runs path calculations for N/M different source points. There are currently R1 random walk instances on each vertex, and the current path length of each random walk instance is L1. The method may include:
A first host obtains the calculation result set obtained after the Xth iteration calculation is performed on a vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation. The first host performs the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1. If R1/2 and 2L1+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.
In the embodiments of the present invention, the iteration calculation performed by the first host is equivalent to performing a disk iteration calculation and a network iteration calculation at the same time; that is, the two iteration calculations based on disk input/output and network input/output are executed concurrently, which may be called a hybrid iteration calculation. After one hybrid iteration calculation, the path length of each random walk instance grows the most, from the original L1 to 2L1+1, so completing the calculation of the graph data takes correspondingly less time. Over one complete graph calculation, the efficiency of the graph calculation is therefore improved and time is saved.
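The effect of repeating the hybrid iteration can be written in closed form. The derivation below is a sketch that assumes every iteration after the starting point is a hybrid iteration; the symbols \ell_k and r_k are introduced here for illustration and denote the path length and the number of instances per vertex after k hybrid iterations, with \ell_0 = L1 and r_0 = R1.

```latex
\begin{align*}
\ell_{k+1} &= 2\,\ell_k + 1, & r_{k+1} &= r_k / 2 \\
\ell_{k+1} + 1 &= 2\,(\ell_k + 1) \\
\ell_k &= 2^{k}\,(\ell_0 + 1) - 1, & r_k &= r_0 / 2^{k}
\end{align*}
```

For instance, \ell_0 = 1 and r_0 = 40 give \ell_1 = 3 and r_1 = 20 after one hybrid iteration. The path length grows exponentially in the number of iterations, while the number of instances available for splicing shrinks at the same rate, so the initial number of instances bounds how many halving iterations can be run.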
With reference to the first aspect, in a first possible implementation of the first aspect, the obtaining, by the first host, of the calculation result set produced by the X-th iteration calculation on the vertex set may include: the first host obtains, through the network, a calculation result set T_X(u_i) produced by the X-th iteration calculation on a vertex set u_i, where the vertices in the vertex set u_i are the vertices on which the first host is to execute the update function during the (X+1)-th iteration calculation with vertex v as the source point, the vertex v is one of the N/M vertices on the first host, the vertex set u_i includes the vertex v, and X is an integer greater than 0. The performing, by the first host according to the graph data and the calculation result set, of the (X+1)-th concurrent iteration calculation, after which the number of random walk instances on each vertex becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1, may include: the first host performs, according to the graph data and the calculation result set T_X(u_i), the (X+1)-th concurrent iteration calculation with vertex v as the source point, after which the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1.
This embodiment describes the network iteration calculation and the disk iteration calculation concretely for a single source vertex v, and thereby shows in more detail how the network iteration and the disk iteration are computed concurrently. After one hybrid iteration, the number of random walk instances on each vertex changes from R_1 to R_1/2, and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1. Compared with a disk iteration alone or a network iteration alone, one hybrid iteration increases the path length of each random walk instance the most, so the graph calculation time is reduced accordingly.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the performing, by the first host according to the graph data and the calculation result set T_X(u_i), of the (X+1)-th concurrent iteration calculation with vertex v as the source point, after which the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1, may include:
The first host performs, according to the graph data, the (X+1)-th iteration calculation with vertex v as the source point, after which the number of random walk instances with vertex v as the source point remains R_1 and the path length of each of the R_1 random walk instances increases by 1. The first host performs, according to the calculation result set T_X(u_i), the (X+1)-th iteration calculation with vertex v as the source point, after which the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1. From the increase of 1 in the path length of each of the R_1 random walk instances and the path length 2L_1 of each of the R_1/2 random walk instances, the first host determines that the current path length of each of the R_1/2 random walk instances becomes 2L_1+1.
In this embodiment, the iteration calculation that the first host performs according to the graph data is called a disk iteration, and the iteration calculation that it performs according to the calculation result set T_X(u_i) obtained through the network is called a network iteration. This implementation describes concretely how the two are computed concurrently. One network iteration halves the number of random walk instances on each vertex but changes the path length of each instance from L_1 to 2L_1. One disk iteration leaves the number of random walk instances on each vertex unchanged and increases each path length by 1. The final length of each random walk instance is therefore 2L_1+1.
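As a worked check of these lengths, the recurrence can be solved in closed form. The superscript notation R^{(k)}, L^{(k)} for the values after k hybrid iterations is introduced here only for illustration, and the derivation assumes that every round is a hybrid iteration:

```latex
\begin{aligned}
R^{(k+1)} &= \tfrac{1}{2} R^{(k)}, \qquad L^{(k+1)} = 2L^{(k)} + 1 \\
L^{(k+1)} + 1 &= 2\bigl(L^{(k)} + 1\bigr) \\
L^{(k)} &= 2^{k}\bigl(L^{(0)} + 1\bigr) - 1, \qquad R^{(k)} = R^{(0)} / 2^{k}
\end{aligned}
```

Under that assumption the path length grows exponentially in the number of rounds, while the number of walks per vertex shrinks by the same factor.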
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the vertex set u_i is {v} ∪ Q_1, where Q_1 ⊆ Q and the set Q is {u_1, u_2, ..., u_n}; the calculation result set T_X(u_i) is {T_X(v)} ∪ T_X(Q_1), where T_X(Q_1) ⊆ T_X(Q) and the set T_X(Q) is {T_X(u_1), T_X(u_2), ..., T_X(u_n)};
The performing, by the first host according to the calculation result set T_X(u_i), of the (X+1)-th iteration calculation with vertex v as the source point, after which the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1, may include: if Q_1 is {u_1, u_2, u_3, u_4}, the calculation result set T_X(u_i) is {T_X(v), T_X(u_1), T_X(u_2), T_X(u_3), T_X(u_4)}; the first host performs the (X+1)-th iteration calculation with vertex v as the source point according to the following formula (the formula is not reproduced in this text), after which the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1;
Here T_X(v) denotes the calculation result of the X-th iteration calculation with vertex v as the source point, and the operator in the formula denotes merging the calculation results on its left and right sides by a specified operation.
In this embodiment, the network iteration can be illustrated concretely by a formula: the path length of each random walk instance changes from L_1 to 2L_1.
With reference to the second possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the performing, by the first host according to the calculation result set T_X(u_i), of the (X+1)-th iteration calculation with vertex v as the source point, after which the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1, may include:
After the X-th iteration calculation, the first host obtains the current path length of each of the first R_1/2 random walk instances that start from vertex v as the source point, and obtains the current path length of each of the last R_1/2 random walk instances that start from another source point u, where the other source point u is the vertex at which the first R_1/2 random walk instances starting from vertex v are located after the X-th iteration calculation. The first host splices the current path lengths of the first R_1/2 random walk instances starting from vertex v with the current path lengths of the last R_1/2 random walk instances starting from the other source point u, after which the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1.
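The splicing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: each walk is stored as a list of vertices, `walks[x]` stands in for the result T_X(x) fetched over the network, and the i-th walk of the first half is paired with the (R/2 + i)-th walk of its end vertex. That pairing rule and the name `network_iteration` are hypothetical; the text only says that the first half from v is spliced with the last half from u.

```python
def network_iteration(walks, v):
    """Splice the first R/2 walks from v with walks from their end vertices.

    walks: dict mapping each vertex to its list of R paths (lists of
           vertices) produced by the previous iteration.
    Returns R/2 paths from v, each twice as long (in edges) as before.
    """
    own = walks[v]
    half = len(own) // 2
    spliced = []
    for i in range(half):
        head = own[i]                # one of the first R/2 walks from v
        u = head[-1]                 # vertex where that walk currently sits
        tail = walks[u][half + i]    # one of the last R/2 walks from u
        spliced.append(head + tail[1:])  # drop the duplicated vertex u
    return spliced


# R = 4 walks of length L = 1 per vertex on a three-vertex graph.
walks = {
    "a": [["a", "b"], ["a", "c"], ["a", "b"], ["a", "c"]],
    "b": [["b", "a"], ["b", "c"], ["b", "a"], ["b", "c"]],
    "c": [["c", "a"], ["c", "b"], ["c", "a"], ["c", "b"]],
}
print(network_iteration(walks, "a"))  # [['a', 'b', 'a'], ['a', 'c', 'b']]
```

Taking the tail from the last half of the end vertex's walks, rather than from its first half, keeps the two spliced segments from being reused by that vertex's own splice in the same round.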
This embodiment describes concretely how one network iteration changes the length of a random walk instance from L_1 to 2L_1, which increases the feasibility of the technical solution of the present invention.
With reference to the first aspect or any one of the first to fourth possible implementations of the first aspect, in a fifth possible implementation of the first aspect, each vertex currently has R_2 random walk instances and the path length of each random walk instance is L_2, and the method may further include: the first host performs, according to the graph data, the (Z+1)-th iteration calculation with vertex v as the source point, after which the number of random walk instances with vertex v as the source point remains R_2 and the path length of each of the R_2 random walk instances increases by 1, where Z is an integer greater than or equal to 0.
This embodiment describes a disk iteration performed on its own: after one disk iteration, the number of random walk instances on each vertex is unchanged and the path length of each random walk instance increases by 1.
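A disk iteration on its own can be sketched as one random step per walk. This is a minimal illustration: the in-memory `adjacency` dict is an assumption standing in for the graph data that a host would read from its local disk, and the name `disk_iteration` is hypothetical.

```python
import random


def disk_iteration(paths, adjacency, rng):
    """Advance every walk by one step to a random neighbor of its last vertex.

    paths:     list of walks, each a list of vertices.
    adjacency: dict mapping a vertex to its list of neighbors
               (assumed here to have been loaded from local disk).
    rng:       a random.Random instance, passed in for reproducibility.
    """
    return [path + [rng.choice(adjacency[path[-1]])] for path in paths]


# A directed ring a -> b -> c -> a, so each step is deterministic.
adjacency = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(disk_iteration([["a"], ["a", "b"]], adjacency, random.Random(0)))
# [['a', 'b'], ['a', 'b', 'c']]
```

The number of walks is unchanged and each walk is exactly one step longer, matching the R_2 and L_2+1 bookkeeping described above.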
With reference to the first aspect or any one of the first to fifth possible implementations of the first aspect, in a sixth possible implementation of the first aspect, each vertex currently has R_3 random walk instances and the current path length of each random walk instance is L_3, and the method may further include:
The first host obtains, through the network, a calculation result set T_Y(h_i) produced by the Y-th iteration calculation on a vertex set h_i, where the vertices in the vertex set h_i are the vertices on which the first host is to execute the update function during the (Y+1)-th iteration calculation with vertex v as the source point, the vertex set h_i includes the vertex v, and Y is an integer greater than 0. The first host performs, according to the calculation result set T_Y(h_i), the (Y+1)-th iteration calculation with vertex v as the source point, after which the number of random walk instances with vertex v as the source point becomes R_3/2 and the current path length of each of the R_3/2 random walk instances becomes 2L_3.
This embodiment describes a network iteration performed on its own: the number of random walk instances on each vertex is halved, while the path length of each random walk instance changes from L_3 to 2L_3. Compared with a disk iteration, each random walk instance gains more path length per iteration.
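The difference between the three kinds of iteration can be made concrete by counting the rounds each needs to reach a given path length. This is a rough sketch: the target of 1000 steps and the starting length of 1 are arbitrary illustrative values, the name `iterations_to_reach` is hypothetical, and the halving of the number of walks under network and hybrid iterations is deliberately ignored.

```python
def iterations_to_reach(target, step, start=1):
    """Count how many applications of `step` take a length from start to target."""
    length, rounds = start, 0
    while length < target:
        length = step(length)
        rounds += 1
    return rounds


disk = iterations_to_reach(1000, lambda l: l + 1)        # L -> L + 1
network = iterations_to_reach(1000, lambda l: 2 * l)     # L -> 2L
hybrid = iterations_to_reach(1000, lambda l: 2 * l + 1)  # L -> 2L + 1

print(disk, network, hybrid)  # 999 10 9
```

Doubling dominates the count; the extra step contributed by the concurrent disk iteration saves a further round in this example.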
With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, the vertex set h_i is {v} ∪ W_1, where W_1 ⊆ W and the set W is {h_1, h_2, ..., h_n}; the calculation result set T_Y(h_i) is {T_Y(v)} ∪ T_Y(W_1), where T_Y(W_1) ⊆ T_Y(W) and the set T_Y(W) is {T_Y(h_1), T_Y(h_2), ..., T_Y(h_n)};
The performing, by the first host according to the calculation result set T_Y(h_i), of the (Y+1)-th iteration calculation with vertex v as the source point, after which the number of random walk instances with vertex v as the source point becomes R_3/2 and the current path length of each of the R_3/2 random walk instances becomes 2L_3, may include: if W_1 is {h_1, h_2, h_3, h_4}, the calculation result set T_Y(h_i) is {T_Y(v), T_Y(h_1), T_Y(h_2), T_Y(h_3), T_Y(h_4)};
The first host performs the (Y+1)-th iteration calculation with vertex v as the source point according to the following formula (the formula is not reproduced in this text), after which the number of random walk instances with vertex v as the source point becomes R_3/2 and the path length of each of the R_3/2 random walk instances becomes 2L_3. Here T_Y(v) denotes the calculation result of the Y-th iteration calculation with vertex v as the source point, and the operator in the formula denotes merging the calculation results on its left and right sides by a specified operation.
In this embodiment, the network iteration calculation can be illustrated concretely by a formula, which expresses the doubling of the path length of each random walk instance.
With reference to the sixth possible implementation of the first aspect, in an eighth possible implementation of the first aspect, the performing, by the first host according to the calculation result set T_Y(h_i), of the (Y+1)-th iteration calculation with vertex v as the source point, after which the number of random walk instances with vertex v as the source point becomes R_3/2 and the current path length of each of the R_3/2 random walk instances becomes 2L_3, may include:
After the Y-th iteration calculation, the first host obtains the current path length of each of the first R_3/2 random walk instances that start from vertex v as the source point, and obtains the current path length of each of the last R_3/2 random walk instances that start from another source point w, where the other source point w is the vertex at which the first R_3/2 random walk instances starting from vertex v are located after the Y-th iteration calculation. The first host splices the current path lengths of the first R_3/2 random walk instances starting from vertex v with the current path lengths of the last R_3/2 random walk instances starting from the other source point w, after which the number of random walk instances with vertex v as the source point becomes R_3/2 and the current path length of each of the R_3/2 random walk instances becomes 2L_3.
This embodiment describes concretely how a network iteration calculation changes the length of a random walk instance from L_3 to 2L_3, which increases the feasibility of the technical solution of the present invention.
A second aspect of the embodiments of the present invention provides a host. The host is applied to a disk-based graph computing system that includes M hosts, each of which stores graph data on a local disk. The graph data includes N vertices, and each host simultaneously runs path calculations for N/M different sources. Each vertex currently has R_1 random walk instances, and the current path length of each random walk instance is L_1. The host includes:
an obtaining unit, configured to obtain a calculation result set produced by the X-th iteration calculation on a vertex set, where the vertex set is the set of vertices on which the first host is to execute the update function during the (X+1)-th iteration calculation; and
a calculation unit, configured to perform the (X+1)-th concurrent iteration calculation according to the graph data and the calculation result set, after which the number of random walk instances on each vertex becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1, and further configured so that, if R_1/2 and 2L_1+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.
The host has the function of implementing the graph calculation method provided in the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function.
A third aspect of the embodiments of the present invention provides a host. The host is applied to a disk-based graph computing system that includes M hosts, each of which stores graph data on a local disk. The graph data includes N vertices, and each host simultaneously runs path calculations for N/M different sources. Each vertex currently has R_1 random walk instances, and the current path length of each random walk instance is L_1. The host may include a transceiver and a processor, which are connected by a bus;
the transceiver is configured to obtain a calculation result set produced by the X-th iteration calculation on a vertex set, where the vertex set is the set of vertices on which the first host is to execute the update function during the (X+1)-th iteration calculation; and
the processor is configured to perform the (X+1)-th concurrent iteration calculation according to the graph data and the calculation result set, after which the number of random walk instances on each vertex becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1, and is further configured so that, if R_1/2 and 2L_1+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.
A fourth aspect of the embodiments of the present invention provides a graph computing system. The graph computing system includes M hosts, each of which stores graph data on a local disk; the graph data includes N vertices, and each host simultaneously runs path calculations for N/M different sources;
each host in the graph computing system performs the functions of the graph calculation method provided in the first aspect.
A fifth aspect of the embodiments of the present invention provides a storage medium. It should be noted that the technical solution of the present invention essentially, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and is used to store the computer software instructions used by the foregoing device, including a program designed for the host to execute the first aspect. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It can be seen from the foregoing technical solutions that the embodiments of the present invention have the following advantages:
In the embodiments of the present invention, a calculation result set produced by the X-th iteration calculation on a vertex set is obtained, where the vertex set is the set of vertices on which the first host is to execute the update function during the (X+1)-th iteration calculation. The first host performs the (X+1)-th concurrent iteration calculation according to the graph data and the calculation result set, after which the number of random walk instances on each vertex becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1. If R_1/2 and 2L_1+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data. The iteration calculation performed by the first host is equivalent to a disk iteration calculation and a network iteration calculation carried out at the same time; that is, the two iteration calculations, driven by disk input/output (I/O) and network input/output (I/O) respectively, are executed concurrently. This may be called a hybrid iteration calculation. After one hybrid iteration calculation, the path length of each random walk instance increases the most, from L_1 to 2L_1+1, so less time is needed to complete the calculation of the graph data. Over a complete graph calculation, efficiency improves accordingly and time is saved.
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments and the prior art. The accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a system architecture diagram of a graph computing system in an embodiment of the present invention;
FIG. 2 is a schematic diagram of running disk-based graph calculation on M different hosts in a cluster in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a graph calculation method in an embodiment of the present invention;
FIG. 4 is a schematic diagram of different sources in an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a host in an embodiment of the present invention;
FIG. 6 is a schematic diagram of another embodiment of a host in an embodiment of the present invention.
The embodiments of the present invention provide a graph calculation method, a host, and a graph computing system, which are used to increase the speed of graph calculation and save time.
To help a person skilled in the art better understand the solutions of the present invention, the following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. The described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
In mathematics, a graph is a way of representing relationships between objects and is the basic object of study in graph theory. A graph consists of small dots (called vertices or nodes) and the straight lines or curves (called edges) that connect them. Taking Internet web pages as an example, each web page can be regarded as a vertex, and if one web page contains a link to another, an edge can be regarded as connecting the two. Taking a social network as an example, each user can be regarded as a vertex and a friend relationship between users as an edge.
As graph computing becomes more widely used, its applications pose new challenges for graph computing systems. Besides the large volume of data to be processed, graph algorithms need to search for paths in large amounts of graph data, which is computationally heavy and highly complex. Existing graph computing systems mainly compute in memory: the graph data is divided into equal parts according to the number of computing nodes in the cluster, the parts are distributed to the memory of those nodes, and only then does calculation begin. Consider abstract graph data that includes N vertices (nodes), and M available hosts, each of which serially runs path calculations for N/M different sources. A random-walk-based algorithm requires that, for each of the N nodes in the graph, a random walk path with that node as the source point be searched independently. Each step of a random walk to a neighboring vertex requires one iteration of an existing graph calculation, so sampling a random walk path takes as many iterations as the path has steps. N samplings therefore require N graph calculations, one per sampling, and the total calculation time is N times the time of one sampling. If an existing single-machine system is extended so that the N samplings run on M different hosts at the same time, each host runs N/M samplings and the total calculation time is N/M times the time of one sampling. With big data, N is usually very large, so the required sampling takes a very long time. How to improve the efficiency of graph calculation is therefore a major problem.
The technical solution of the present invention proposes a distributed graph calculation method that uses external storage, that is, a disk-based method. Each host has relatively ample storage space and can store the graph data on its local disk, so a complete copy of the graph data does not need to be divided into equal parts according to the number of hosts and held in memory before calculation. The architecture and design of a conventional graph computing system usually retain the calculation state of only one graph calculation, whereas N calculation states now need to be retained at the same time. The size of each calculation state is usually linear in a given number L (the step count) or in N, that is, O(L) or even O(N), so the total size of all calculation states is O(L*N) or even O(N^2). For a large graph, data of this size cannot be held in memory. The graph computing system to which the present invention applies includes M hosts, each of which stores graph data on a local disk, as shown in FIG. 1; the number of hosts shown in FIG. 1 is not a limitation in practical applications. The graph data includes N vertices, and each host runs path calculations for N/M different sources.
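A back-of-the-envelope estimate shows why the O(L*N) state does not fit in memory. The figures below (one billion vertices, walks of 100 steps, 8 bytes per stored vertex identifier) are hypothetical values chosen for illustration and do not come from the text; `total_state_bytes` is likewise a hypothetical helper.

```python
def total_state_bytes(num_vertices, walk_length, bytes_per_entry=8):
    """Total size of N calculation states when each state is an O(L) path."""
    return num_vertices * walk_length * bytes_per_entry


n = 1_000_000_000  # hypothetical number of vertices N
l = 100            # hypothetical walk length L in steps
print(total_state_bytes(n, l) / 1e9, "GB")  # 800.0 GB
```

Even with only one walk per source, the state in this example is far larger than the main memory of a typical host, which motivates keeping it on disk.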
The graph computing system has a distributed cluster architecture. As shown in FIG. 2, disk-based graph calculation runs on M different hosts in the cluster, and each host simultaneously runs path calculation tasks for N/M different sources. Graph calculation can be divided into three parts: offline preprocessing, loading of graph data, and online graph calculation. In the preprocessing stage, the system divides the input data into several data shards and saves them on the disk of each computer. Each time a data shard is loaded, the task code of the path calculations for the N/M different sources is scheduled for execution.
The path calculations for N/M different sources run by each host are explained here. The graph computing system runs every algorithm in multiple rounds of iteration until the algorithm converges. Data-processing algorithms are generally abstracted with the "think like a vertex" approach: a vertex program is written to form an update function on the graph, and this update function is defined by the user. In the prior art, one iteration of a user-defined update function can process the path calculation of only one source point. In the technical solution of the present invention, one iteration of the user-defined update function can concurrently process the path calculations of multiple different sources, where path calculations of different sources are path calculations that take different vertices as their source points. Among the multiple iterations in which the graph calculation completes an algorithm, each iteration is one pass in which the system executes the update function on every vertex of the graph data.
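The idea of one pass over the vertices serving several sources at once can be sketched as follows. This is a minimal illustration, not the update function of the invention: the hop-count update `hop_update`, the in-memory `in_edges` dictionary standing in for on-disk graph data, and all other names are assumptions made for the example.

```python
import math
from typing import Callable, Dict, List

InEdges = Dict[str, List[str]]  # vertex -> in-neighbours
Values = Dict[str, float]       # vertex -> value, for one source


def hop_update(v: str, in_neighbours: List[str], prev: Values) -> float:
    """Example user-defined update: fewest hops from the source to v."""
    best = min((prev[u] + 1 for u in in_neighbours), default=math.inf)
    return min(prev[v], best)


def run_iteration(
    in_edges: InEdges,
    states: Dict[str, Values],
    update: Callable[[str, List[str], Values], float],
) -> Dict[str, Values]:
    """One iteration: a single scan of the vertices updates every source."""
    new_states: Dict[str, Values] = {src: {} for src in states}
    for v, neighbours in in_edges.items():  # one scan of the graph data
        for src, prev in states.items():    # all sources handled on that scan
            new_states[src][v] = update(v, neighbours, prev)
    return new_states


# Chain a -> b -> c, with sources a and b processed concurrently.
in_edges = {"a": [], "b": ["a"], "c": ["b"]}
states = {
    src: {v: (0.0 if v == src else math.inf) for v in in_edges}
    for src in ("a", "b")
}
after_one = run_iteration(in_edges, states, hop_update)
after_two = run_iteration(in_edges, after_one, hop_update)
```

The outer loop over vertices corresponds to reading the graph data once; keeping the loop over sources inside it is what lets one read serve all N/M path calculations on a host.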
In the graph calculation process of the present invention, the system can quickly read the target graph data stored on the local disk through sequential disk input/output (I/O) and calculate on it, without network communication of graph data between hosts. When the system performs an iterative calculation with vertex v as the source point, sequential disk I/O reads the target graph data for the iterative calculation while network resources are also used efficiently: the calculation result set T_X(u_i) needed for the current iteration is obtained over the network from the other hosts. T_X(u_i) is the result of the previous iterative calculation on the vertex set u_i, the set of vertices on which the update function is to be executed. Macroscopically, the iterative calculation on data read from disk and the iterative calculation on data obtained over the network can be regarded as proceeding simultaneously, which yields R_1/2 random walk instances whose path length grows to 2L_1+1. By contrast, iterating only on target graph data read from the local disk increases the path length of each random walk instance by just one, and iterating only on the previous iteration's result set T_X(u_i) obtained over the network yields R_1/2 random walk instances whose path length grows to 2L_1.
In summary, performing disk-based and network-based iterative calculation at the same time increases the path length of the random walk instances more quickly, which raises the speed of graph calculation and saves time accordingly. In the following description, iterative calculation on data obtained from disk is called "disk iteration", iterative calculation on data obtained from the network is called "network iteration", and the concurrent execution of both is called "hybrid iteration". These names are used only for convenience of description; disk iteration, network iteration, and hybrid iteration are not established technical terms.
The technical solution of the present invention is described in detail below by way of embodiments. FIG. 3 shows an embodiment of the graph calculation method in the embodiments of the present invention. The method is applied to a disk-based graph computing system that includes M hosts; each host stores the graph data on its local disk, the graph data includes N vertices, and each host simultaneously runs path calculations for N/M different sources. The method includes:
301. The first host performs, according to the graph data, the (Z+1)-th iterative calculation with vertex v as the source point, obtaining R_2 random walk instances with vertex v as the source point, the path length of each of which increases by one, where Z is an integer greater than or equal to 0;
In this embodiment of the present invention, each vertex currently has R_2 random walk instances, and the path length of each random walk instance is L_2. The first host performs, according to the graph data, the (Z+1)-th iterative calculation with vertex v as the source point, obtaining R_2 random walk instances with vertex v as the source point, the path length of each of which increases by one, where Z is an integer greater than or equal to 0. The graph data here is the graph data that each host stores on its local disk.
For example, assume that the final requirement of this embodiment, that is, the iteration completion condition, is that the number of random walk instances on each node reaches a first preset threshold R = 10 and the path length of each random walk instance reaches a second preset threshold L = 6. Each vertex currently has R_2 = 40 random walk instances, each with path length L_2 = 0. Host 1 performs the 1st iterative calculation with v as the source point according to the graph data stored on its local disk. The number of random walk instances with vertex v as the source point is unchanged at 40, but the path length of each of these 40 instances increases by one; such a path can be written as {v, w}.
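A disk iteration of this kind can be sketched as extending every walk by one hop. The sketch below makes several assumptions that the text does not state: the next vertex is chosen uniformly among out-neighbours, every vertex has at least one out-neighbour, and an in-memory dictionary stands in for the graph data read from the local disk. The function and variable names are hypothetical.

```python
import random
from typing import Dict, List

Graph = Dict[str, List[str]]        # vertex -> out-neighbours
Walks = Dict[str, List[List[str]]]  # source vertex -> list of paths


def disk_iteration(graph: Graph, walks: Walks, rng: random.Random) -> Walks:
    """Extend every walk by one hop; the number of walks is unchanged."""
    return {
        source: [path + [rng.choice(graph[path[-1]])] for path in paths]
        for source, paths in walks.items()
    }


graph = {"v": ["w", "u"], "w": ["v", "u"], "u": ["v", "w"]}
# R_2 = 40 walks per vertex, each of path length L_2 = 0 (only the source).
walks = {source: [[source] for _ in range(40)] for source in graph}
after = disk_iteration(graph, walks, random.Random(0))
```

After the call, each vertex still has 40 walks and each path holds two vertices, that is, one path length, as in the {v, w} example.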
It should be noted that, in the graph computing system, the first iterative calculation can usually only be a disk iteration. Subsequent iterative calculations are mainly hybrid iterations because of their high efficiency, but separate disk iterations and network iterations may be interspersed according to the actual situation; this is not specifically limited.
In this embodiment of the present invention, the first host acquires the calculation result set produced by the X-th iterative calculation on a vertex set, where the vertex set is the set of vertices on which the first host is to execute the update function in the (X+1)-th iterative calculation. This is described in step 302.
302. The first host acquires, over the network, the calculation result set T_X(u_i) produced by the X-th iterative calculation on the vertex set u_i, where the vertices in u_i are the vertices on which the update function is to be executed when the first host performs the (X+1)-th iterative calculation with vertex v as the source point, vertex v is one of the N/M vertices on the first host, u_i includes v, and X is an integer greater than 0;
In this embodiment of the present invention, the first host acquires, over the network, the calculation result set T_X(u_i) produced by the X-th iterative calculation on the vertex set u_i. The vertices in u_i are the vertices on which the update function is to be executed when the first host performs the (X+1)-th iterative calculation with vertex v as the source point, vertex v is one of the N/M vertices on the first host, u_i includes v, and X is an integer greater than 0. Each vertex currently has R_1 random walk instances, and the current path length of each random walk instance is L_1.
For example, because this step follows step 301, what is performed here is the 2nd iterative calculation with vertex v as the source point. Each vertex currently has R_1 = 40 random walk instances, and the current path length of these 40 instances is L_1 = 1, that is, one path length. Host 1 acquires the calculation result set T_X(u_i) over the network. Assume that, for the 2nd iterative calculation with vertex v as the source point, the set of vertices on which the update function is to be executed is {v, u_1, u_2, u_3, u_4}. The corresponding results after the 1st iterative calculation are T_1(v), T_1(u_1), T_1(u_2), T_1(u_3), and T_1(u_4), so the calculation result set is {T_1(v), T_1(u_1), T_1(u_2), T_1(u_3), T_1(u_4)}. The number of vertices on which the update function is executed is not limited here.
In this embodiment of the present invention, the first host performs the (X+1)-th concurrent iterative calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1. This is described in step 303.
303. The first host performs, according to the graph data and the calculation result set T_X(u_i), the (X+1)-th concurrent iterative calculation with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1;
In this embodiment of the present invention, the first host performs, according to the graph data and the calculation result set T_X(u_i), the (X+1)-th concurrent iterative calculation with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1.
Specifically, this step may include: 1) the first host performs, according to the graph data, the (X+1)-th iterative calculation with vertex v as the source point, obtaining R_1 random walk instances with vertex v as the source point, each of which gains one path length;
2) the first host performs, according to the calculation result set T_X(u_i), the (X+1)-th iterative calculation with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1;
3) the first host determines, from the R_1 random walk instances each having gained one path length and from the R_1/2 random walk instances whose current path length has become 2L_1, that the current path length of each of the R_1/2 random walk instances becomes 2L_1+1.
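The three sub-steps above can be summarized as maps on the pair (number of random walk instances, path length). The block below only restates the quantities already given in the text in LaTeX notation; the arrow notation is introduced here for illustration.

```latex
\begin{aligned}
\text{disk iteration:}    \quad & (R_1,\ L_1) \longmapsto (R_1,\ L_1 + 1) \\
\text{network iteration:} \quad & (R_1,\ L_1) \longmapsto (R_1/2,\ 2L_1) \\
\text{hybrid iteration:}  \quad & (R_1,\ L_1) \longmapsto (R_1/2,\ 2L_1 + 1)
\end{aligned}
```

With R_1 = 40 and L_1 = 1, as in the example of step 302, the hybrid iteration gives (20, 3).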
The first host performing, according to the calculation result set T_X(u_i), the (X+1)-th iterative calculation with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1, may further include:
(1) The vertex set u_i is {v} ∪ Q_1, where Q_1 ⊆ Q and the set Q is {u_1, u_2, ..., u_n}; the calculation result set T_X(u_i) is {T_X(v)} ∪ T_X(Q_1), where T_X(Q_1) ⊆ T_X(Q) and the set T_X(Q) is {T_X(u_1), T_X(u_2), ..., T_X(u_n)};
If Q_1 is {u_1, u_2, u_3, u_4}, the calculation result set T_X(u_i) is {T_X(v), T_X(u_1), T_X(u_2), T_X(u_3), T_X(u_4)};
The first host performs the (X+1)-th iterative calculation with vertex v as the source point according to the following formula, so that the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1;
[formula not reproduced in the source text]
where T_X(v) denotes the result of the X-th iterative calculation with vertex v as the source point, and the operator in the formula denotes merging the calculation results on its left and right sides by a specified operation. It should be understood that the vertex set Q_1 above is only an example and does not limit the vertices included in Q_1.
(2) The first host acquires the current path of each of the first R_1/2 random walk instances that start from vertex v as the source point after the X-th iterative calculation, and acquires the current path of each of the last R_1/2 random walk instances that start from another source point w after the X-th iterative calculation, where the other source point w is the vertex at which one of the first R_1/2 random walk instances starting from vertex v is located after the X-th iterative calculation;
The first host concatenates the current paths of the first R_1/2 random walk instances starting from vertex v with the current paths of the last R_1/2 random walk instances starting from the other source point w, so that the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1.
For example, host 1 uses hybrid iteration for the 2nd iterative calculation: it acquires the graph data and the calculation result set T_X(u_i) and performs the 2nd iterative calculation with vertex v as the source point. The disk iteration and the network iteration run concurrently here, but they are described separately for clarity.
Host 1 performs the disk iteration according to the graph data. The number of random walk instances with vertex v as the source point is unchanged at 40, but the current path length of each of these 40 instances increases by one more, so each random walk instance now has a path length of 2.
Host 1 performs the network iteration according to the calculation result set T_X(u_i). Iterating according to the result set actually means iterating according to the elements of the set. If the calculation result set is T_X(u_i) = {T_X(v), T_X(u_1), T_X(u_2), T_X(u_3), T_X(u_4)}, with FIG. 4 showing a schematic of several different sources, the formula is as follows:
[formula not reproduced in the source text]
Host 1 performs the iterative calculation according to this formula, in which the operator denotes merging the calculation results on its left and right sides by a specified operation. The number of random walk instances with v as the source point is halved to 20. Because each of these 20 instances has the path of a random walk instance from another source point concatenated to it, its path length doubles from the one path length it had at the end of the 1st iteration to two path lengths.
Adding the one path length that the disk iteration contributes to each random walk instance, the result is that the number of random walk instances starting from v as the source point becomes 20 and the path length of each of these 20 instances becomes 3.
The way paths grow in the network iteration part of a hybrid iteration can be pictured as follows:
In the 2nd iterative calculation, host 1 scans in turn the results of the 1st iterative calculation and selects the paths of the first 20 random walk instances starting from each vertex. Taking vertex v as an example, for each such path host 1 looks for another path produced by the 1st iterative calculation: an unused path among the last 20 random walk paths that start from another vertex. It then merges the two paths into one, so the path length of these 20 random walk instances becomes 2. For example, suppose there is a path {v, w} starting from v. At run time, the machine hosting node v sends a request to the machine hosting node w, obtains a path {w, u} on w, and merges the two paths into {v, w, u}. After one run of this network-based random walk calculation, each node holds 20 random walk paths of doubled length.
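The path-merging procedure just described can be sketched in a single process as follows. This is a simplified illustration under stated assumptions: the request to the machine hosting the end vertex is replaced by a dictionary lookup, and raising an error when a vertex has no unused path left is a choice made for this sketch, since the passage above does not say how that case is handled. All names are hypothetical.

```python
from typing import Dict, List

Walks = Dict[str, List[List[str]]]  # source vertex -> list of paths


def network_iteration(walks: Walks) -> Walks:
    """Merge each first-half path with an unused second-half path that
    starts at the vertex where the first-half path ends."""
    # Second-half paths of every vertex; each may be consumed at most once.
    unused = {src: list(paths[len(paths) // 2:]) for src, paths in walks.items()}
    merged: Walks = {}
    for src, paths in walks.items():
        merged[src] = []
        for head in paths[: len(paths) // 2]:
            end = head[-1]
            if not unused[end]:
                # Assumed policy for this sketch only.
                raise LookupError(f"no unused path left at vertex {end!r}")
            tail = unused[end].pop()
            merged[src].append(head + tail[1:])  # drop the repeated vertex
    return merged


# Two walks of one path length per vertex, after a first disk iteration.
walks = {
    "v": [["v", "w"], ["v", "u"]],
    "w": [["w", "u"], ["w", "u"]],
    "u": [["u", "v"], ["u", "w"]],
}
doubled = network_iteration(walks)
```

Here the path {v, w} is merged with the path {w, u} taken from w, giving {v, w, u}; every vertex ends with half as many walks, each twice as long.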
Combining the disk iteration and the network iteration of the hybrid iteration above, the path of each of the 20 random walk instances now obtained can be written as {v, w, u, x}, where {u, x} is the path length added by the disk iteration.
304. The first host acquires, over the network, the calculation result set T_Y(h_i) produced by the Y-th iterative calculation on the vertex set h_i, where the vertices in h_i are all the vertices on which the update function is to be executed when the first host performs the (Y+1)-th iterative calculation with vertex v as the source point, h_i includes vertex v, and Y is an integer greater than 0;
In this embodiment of the present invention, the first host acquires, over the network, the calculation result set T_Y(h_i) produced by the Y-th iterative calculation on the vertex set h_i. The vertices in h_i are all the vertices on which the update function is to be executed when the first host performs the (Y+1)-th iterative calculation with vertex v as the source point, h_i includes vertex v, and Y is an integer greater than 0. Each vertex currently has R_3 random walk instances, and the current path length of each random walk instance is L_3.
For example, following step 303, each vertex currently has 20 random walk instances, and the current path length of each random walk instance is 3.
305. The first host performs, according to the calculation result set T_Y(h_i), the (Y+1)-th iterative calculation with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R_3/2 and the current path length of each of the R_3/2 random walk instances becomes 2L_3;
In this embodiment of the present invention, after the first host acquires the calculation result set T_Y(h_i) over the network, it performs the iterative calculation according to T_Y(h_i), so that the number of random walk instances with vertex v as the source point becomes R_3/2 and the current path length of each of the R_3/2 random walk instances becomes 2L_3.
This step may specifically include:
(1) The vertex set h_i is {v} ∪ W_1, where W_1 ⊆ W and the set W is {h_1, h_2, ..., h_n}; the calculation result set T_Y(h_i) is {T_Y(v)} ∪ T_Y(W_1), where T_Y(W_1) ⊆ T_Y(W) and the set T_Y(W) is {T_Y(h_1), T_Y(h_2), ..., T_Y(h_n)};
If W_1 is {h_1, h_2, h_3, h_4}, the calculation result set T_Y(h_i) is {T_Y(v), T_Y(h_1), T_Y(h_2), T_Y(h_3), T_Y(h_4)};
The first host performs the (Y+1)-th iterative calculation with vertex v as the source point according to the following formula, so that the number of random walk instances with vertex v as the source point becomes R_3/2 and the path length of each of the R_3/2 random walk instances becomes 2L_3;
[formula not reproduced in the source text]
where T_Y(v) denotes the result of the Y-th iterative calculation with vertex v as the source point, and the operator in the formula denotes merging the calculation results on its left and right sides by a specified operation. It should be understood that the vertex set W_1 above is only an example and does not limit the vertices included in W_1.
(2) The first host acquires the current path of each of the first R_3/2 random walk instances that start from vertex v as the source point after the Y-th iterative calculation, and acquires the current path of each of the last R_3/2 random walk instances that start from another source point w after the Y-th iterative calculation, where the other source point w is the vertex at which one of the first R_3/2 random walk instances starting from vertex v is located after the Y-th iterative calculation;
The first host concatenates the current paths of the first R_3/2 random walk instances starting from vertex v with the current paths of the last R_3/2 random walk instances starting from the other source point w, so that the number of random walk instances with vertex v as the source point becomes R_3/2 and the current path length of each of the R_3/2 random walk instances becomes 2L_3.
For example, host 1 uses network iteration alone for the 3rd iterative calculation. Continuing from the steps above, host 1 performs the network iteration according to the calculation result set T_Y(h_i). If T_Y(h_i) is {T_Y(v), T_Y(h_1), T_Y(h_2), ..., T_Y(h_n)}, host 1 performs the iterative calculation according to the formula [formula not reproduced in the source text]. The number of random walk instances with v as the source point is halved to 10, and the path length of each random walk instance grows from the previous 3 to 6.
Because the path of a random walk instance was previously written as {v, w, u, x}, after the network iteration the path of each of these 10 random walk instances can be written as {v, w, u, x, y, z, a}.
It should be noted that steps 305 and 306 are optional; in practical applications, whether to perform them can be decided according to actual needs. Moreover, although hybrid iteration is comparatively the fastest, every iterative calculation other than the 1st, which uses disk iteration, may use hybrid iteration, disk iteration, or network iteration according to actual needs; this is not specifically limited.
306. If R_1/2 and 2L_1+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.
In this embodiment of the present invention, if R_1/2 and 2L_1+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data. When, after an iteration completes, the number of random walk instances on each vertex and the path length of the random walk instances satisfy the iteration completion condition, the calculation of the graph data is complete.
For example, following step 305, R_3/2 corresponds to 10 and 2L_3 corresponds to the path {v, w, u, x, y, z, a}, that is, 6 path lengths. This matches the requirement stated for step 301: the number of random walk instances on each node reaches the first preset threshold R = 10, and the path length of each random walk instance reaches the second preset threshold L = 6. Host 1 can therefore be considered to have completed the calculation of the graph data.
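The running example of steps 301 to 306 can be traced on the pair (number of random walk instances, path length). The sketch below is illustrative only: the names `STEPS`, `is_complete`, and `run_schedule` are hypothetical, and reading the completion condition as "at least R instances, each of at least length L" is an assumption about how the thresholds are meant to be checked.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]  # (walks per vertex, path length of each walk)

# Effect of each kind of iteration, as stated in the embodiment.
STEPS: Dict[str, Callable[[int, int], State]] = {
    "disk": lambda r, l: (r, l + 1),
    "network": lambda r, l: (r // 2, 2 * l),
    "hybrid": lambda r, l: (r // 2, 2 * l + 1),
}


def is_complete(state: State, r_min: int, l_min: int) -> bool:
    """Assumed reading of the iteration completion condition."""
    r, l = state
    return r >= r_min and l >= l_min


def run_schedule(start: State, schedule: List[str], r_min: int, l_min: int) -> List[State]:
    """Apply the scheduled iterations until the condition is met."""
    trace = [start]
    for kind in schedule:
        if is_complete(trace[-1], r_min, l_min):
            break
        trace.append(STEPS[kind](*trace[-1]))
    return trace


# R_2 = 40, L_2 = 0; thresholds R = 10 and L = 6, as in the example.
trace = run_schedule((40, 0), ["disk", "hybrid", "network"], r_min=10, l_min=6)
```

The trace passes through (40, 1) after the disk iteration, (20, 3) after the hybrid iteration, and (10, 6) after the network iteration, at which point the condition holds.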
In this embodiment of the present invention, when the first host uses disk iteration, each random walk instance gains one path length. When it uses hybrid iteration, the current path length of each of R_1/2 random walk instances changes from L_1 to 2L_1+1. When it uses network iteration, the current path length of each of R_3/2 random walk instances changes from L_3 to 2L_3. Using disk iteration alone therefore requires many accesses to the disk and is of limited efficiency. Network iteration can greatly reduce the number of disk scans, but it is constrained by network resources and does not make full use of disk resources. Hybrid iteration uses the disk and the network at the same time to lengthen the random walk paths and grows them fastest. The graph computing system of this embodiment can execute disk iteration and network iteration concurrently, and is much faster than existing distributed graph computing systems on the relevant algorithms.
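The difference in growth rate can be illustrated by counting how many iterations each strategy needs before the path length reaches a target. This is a rough sketch under assumptions: the first iteration is always a disk iteration (length 0 to 1), the same kind of iteration is then repeated, the target is at least 1, and reaching "at least" the target is good enough. The helper name `iterations_to_reach` is hypothetical.

```python
from typing import Callable


def iterations_to_reach(target_len: int, grow: Callable[[int], int]) -> int:
    """Iterations needed for the path length to reach target_len (>= 1)."""
    length, count = 1, 1  # the 1st iteration is a disk iteration: 0 -> 1
    while length < target_len:
        length = grow(length)
        count += 1
    return count


TARGET = 6  # second preset threshold L from the example

disk_only = iterations_to_reach(TARGET, lambda l: l + 1)
network_after_first = iterations_to_reach(TARGET, lambda l: 2 * l)
hybrid_after_first = iterations_to_reach(TARGET, lambda l: 2 * l + 1)
```

Note that each network or hybrid iteration also halves the number of random walk instances, so the initial number of instances must be large enough to leave the required R at the end; the sketch does not model that.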
The embodiment above is described mainly with vertex v as the source point. The overall calculation in a graph calculation is briefly illustrated below by way of example to show the process more concretely.
假设:现有2台主机,每台主机上在本地磁盘上都有保存一个“图数据”,当然,这里的图数据是一个形象的表示,前面有说明,此处不再赘述。图数据包括4个顶点/节点,那么,每个主机并发运行4/2=2个不同源的路径计算,每个顶点上现有2个随机游走实例,每个随机游走实例的路径长度初始值都为0。4个顶点分别为:①,②,③和④。主机1上并发运行顶点①和②的路径计算;主机2上并发运行顶点③和④的路径计算。Assume that there are two existing hosts, each of which has a “map data” stored on the local disk. Of course, the graph data here is an image representation, which is described above and will not be described here. The graph data includes 4 vertices/nodes. Then, each host concurrently runs 4/2=2 different source path calculations, and there are 2 random walk instances on each vertex, and the path length of each random walk instance The initial values are all 0. The four vertices are: 1, 2, 3, and 4. The path calculation of the vertices 1 and 2 is concurrently executed on the host 1; the path calculation of the vertices 3 and 4 is concurrently executed on the host 2.
The random walk instances of vertex ① are denoted by two ①s, those of vertex ② by two ②s, those of vertex ③ by two ③s, and those of vertex ④ by two ④s.
(1) In the first iteration, only disk iteration is possible. Host 1 and host 2 each read the graph data from their local disk and perform the iterative calculation for their own source points. Each random walk instance is represented by its corresponding vertex. After the first iteration, the number of random walk instances and their path lengths on the two hosts can be expressed as follows:
Host 1:
①: ①→②, ①→④
②: ②→④, ②→③
Host 2:
③: ③→②, ③→①
④: ④→①, ④→②
It should be understood that ①→② here means that one random walk instance on vertex ① has jumped to vertex ② after one iteration; the other entries are read the same way. The number of random walk instances on each vertex is therefore unchanged, and each random walk instance has grown by one path length.
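The disk iteration of step (1) can be sketched as follows. This is only an illustration under stated assumptions: the adjacency lists in `graph` are hypothetical (chosen so that every hop appearing in this example is an edge), the uniform choice of the next vertex is an assumption, and `disk_iteration` is not a name used by the described system.

```python
import random


def disk_iteration(graph, walks, rng):
    """Extend every walk by one hop read from the locally stored graph.

    graph: dict mapping a vertex to its list of out-neighbours (assumed layout).
    walks: dict mapping a source vertex to its list of paths (lists of vertices).
    """
    return {
        source: [path + [rng.choice(graph[path[-1]])] for path in paths]
        for source, paths in walks.items()
    }


# Hypothetical edges for the 4-vertex example.
graph = {1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2], 4: [1, 2]}
# Two random walk instances per vertex, each with path length 0.
walks = {v: [[v], [v]] for v in graph}
after_first = disk_iteration(graph, walks, random.Random(0))
```

After the call, each vertex still has 2 walk instances and each path has exactly one hop, matching the statement that the count is unchanged and every instance grows by one path length.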
(2) In the second iteration, hybrid iteration, network iteration, or disk iteration can be performed. The technical solution of the present invention mainly performs hybrid iteration, so hybrid iteration is used for the explanation here. The result of network iteration is only one path length shorter than that of hybrid iteration, but in a big-data setting one additional path length can represent a large amount of processed data.
Because hybrid iteration is performed, disk iteration and network iteration are executed concurrently. Assume that the disk iteration produces:
Host 1:
①: ①→②→④, ①→④
②: ②→④→②, ②→③
Host 2:
③: ③→②→①, ③→①
④: ④→①→③, ④→②
Here the path length of each random walk instance increases by one path length.
When the network iteration is performed:
Host 1:
①: ①→②→④→②
②: ②→④→②→③
Host 2:
③: ③→②→①→④
④: ④→①→③→①
In this network iteration, the paths ④→②, ②→③, ①→④ and ③→① are spliced onto the corresponding vertices. One random walk instance on each vertex therefore doubles from the 1 path length it had after the first iteration to 2 path lengths; adding the disk iteration that is part of the hybrid iteration brings that instance to 3 path lengths. That is, after the second iteration, the number of random walk instances on each vertex becomes 1, and the path length of each random walk instance increases to 3.
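The second-round listings above can be reproduced with a short sketch. It is a simplification, not the described implementation: the disk part and the network part run one after the other here, whereas the description runs them concurrently, and the disk hops are fixed to the values shown in the listing instead of being drawn at random. The names `hybrid_round` and `disk_hops` are hypothetical.

```python
def hybrid_round(walks, disk_hops):
    """One hybrid round for 2 walks per vertex.

    walks: dict source -> [first_walk, second_walk], each a list of vertices.
    disk_hops: dict source -> next vertex appended to the first walk (disk part).
    """
    result = {}
    for source, (first, second) in walks.items():
        extended = first + [disk_hops[source]]  # disk part: one extra hop
        tail = walks[extended[-1]][1]           # second walk of the end vertex
        result[source] = extended + tail[1:]    # network part: splice, dropping the shared vertex
    return result


# State after the first iteration, as listed in the example.
walks = {1: [[1, 2], [1, 4]], 2: [[2, 4], [2, 3]],
         3: [[3, 2], [3, 1]], 4: [[4, 1], [4, 2]]}
# Disk hops taken from the listing: 1→2→4, 2→4→2, 3→2→1, 4→1→3.
disk_hops = {1: 4, 2: 2, 3: 1, 4: 3}
after_second = hybrid_round(walks, disk_hops)
```

With these inputs each vertex ends up with a single walk of 3 hops, for example [1, 2, 4, 2] for vertex ①, which corresponds to 2L1+1 with L1 = 1.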
In general, on each host, when the (X+1)-th iteration is performed, the first R/2 paths starting from each node after the X-th iteration are scanned in turn. For each such walk path, another existing walk path is found for it: a not-yet-used path among the last R/2 random walk paths starting from another vertex. The two paths of length L are then merged into one path of length 2L.
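A minimal sketch of this general merge step follows, assuming each vertex holds an even number R of paths stored as lists of vertices. The name `network_iteration` is hypothetical, and the behaviour when no unused path remains at the end vertex is an assumption of this sketch (it raises an error); the description does not say how that case is handled.

```python
def network_iteration(walks):
    """Merge the first R/2 paths of every vertex with unused paths from the
    last R/2 paths of the vertex where each of them ends.

    walks: dict vertex -> list of R paths, all with the same path length L.
    Returns dict vertex -> list of R/2 paths with path length 2L.
    """
    # Unused paths: the last R/2 paths of every vertex.
    pool = {v: list(paths[len(paths) // 2:]) for v, paths in walks.items()}
    merged = {}
    for v, paths in walks.items():
        merged[v] = []
        for head in paths[: len(paths) // 2]:
            end = head[-1]
            if not pool[end]:
                # Assumption: the source does not specify this case.
                raise LookupError("no unused path left at vertex %r" % (end,))
            tail = pool[end].pop(0)            # mark this path as used
            merged[v].append(head + tail[1:])  # drop the shared vertex
    return merged


# Illustrative data: 3 vertices, R = 4 paths each, path length L = 1.
example = {
    "a": [["a", "b"], ["a", "c"], ["a", "b"], ["a", "c"]],
    "b": [["b", "c"], ["b", "a"], ["b", "a"], ["b", "c"]],
    "c": [["c", "a"], ["c", "b"], ["c", "b"], ["c", "a"]],
}
doubled = network_iteration(example)
```

In this illustrative data every vertex is the end point of exactly two head paths, so the pools are never exhausted; each vertex ends with R/2 = 2 paths of length 2L = 2.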
Note that the hybrid iteration is written out here as separate parts only to make the process easier to follow. The duration of a hybrid iteration is usually that of whichever of the disk iteration and the network iteration takes longer. It should be understood that a standalone network iteration can follow the network iteration process described within the hybrid iteration above, so it is not described again here.
Because the network iteration within the hybrid iteration above runs for only a few iterations, it may not clearly show that the path length of a random walk instance doubles. A simple example is used below to explain network iteration further.
Assume that after the second iteration with vertex a as the source point, the current path length is 3 and the path is {a,b,c,d}, and that after the second iteration with vertex d as the source point, the current path length is also 3 and the path is {d,e,f,g}. If the third iteration, a network iteration, is now performed with vertex a as the source point, {a,b,c,d} and {d,e,f,g} can be merged, so the current path with vertex a as the source point becomes {a,b,c,d,e,f,g}, growing from the previous 3 path lengths to 6. One network iteration therefore doubles the path length.
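The splice in this example amounts to joining two vertex sequences at their shared vertex. A minimal sketch, with hypothetical helper names `splice` and `path_length`, counting path length as the number of hops:

```python
def path_length(path):
    """Path length counted in hops: a path over k vertices has k - 1 hops."""
    return len(path) - 1


def splice(head, tail):
    """Join two paths where the first ends at the vertex the second starts from."""
    if head[-1] != tail[0]:
        raise ValueError("paths do not meet at a common vertex")
    return head + tail[1:]  # keep the shared vertex only once


joined = splice(["a", "b", "c", "d"], ["d", "e", "f", "g"])
```

Here `joined` is the path a, b, c, d, e, f, g, and its length is 6, twice the length 3 of each input path.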
The graph calculation method in the embodiments of the present invention is described above; the host in the embodiments of the present invention is described below. FIG. 5 is a schematic diagram of one embodiment of the host. The host is applied to a disk-based graph computing system that includes M hosts. Each host stores graph data on its local disk, the graph data includes N vertices, and each host simultaneously runs N/M path calculations with different source points. Each vertex currently has R1 random walk instances, and the current path length of each random walk instance is L1. The host includes:
The obtaining unit 501 is configured to obtain a calculation result set produced after the X-th iteration is performed on a vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)-th iteration;
The calculating unit 502 is configured to perform the (X+1)-th concurrent iteration according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1; and is further configured so that, if R1/2 and 2L1+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.
Optionally, in some embodiments of the present invention:
The obtaining unit 501 is specifically configured to perform steps 302 and 304 shown in FIG. 3 above.
The calculating unit 502 is specifically configured to perform steps 301, 303, 305, and 306 shown in FIG. 3 above.
FIG. 6 is a schematic diagram of another embodiment of the host in the embodiments of the present invention, which includes the following:
The host may vary considerably depending on its configuration or performance. It may include a transceiver 601, one or more central processing units (CPU) 602 (for example, one or more processors), a memory 603, and one or more storage media 604 (for example, one or more mass storage devices) that store an application program 6041 or data 6042. The memory 603 and the storage medium 604 may provide transient or persistent storage. The program stored in the storage medium 604 may include one or more modules (not shown in FIG. 6), and each module may include a series of instruction operations for the wireless network controller. Further, the central processing unit 602 may be configured to communicate with the storage medium 604 and to execute, on the wireless network controller, the series of instruction operations in the storage medium 604.
Correspondingly, in this embodiment of the present invention, the transceiver 601 is configured to obtain a calculation result set produced after the X-th iteration is performed on a vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)-th iteration;
The processor 602 is configured to perform the (X+1)-th concurrent iteration according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1; if R1/2 and 2L1+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.
Optionally, in some embodiments of the present invention:
The transceiver 601 is specifically configured to perform steps 302 and 304 shown in FIG. 3 above.
The processor 602 is specifically configured to perform steps 301, 303, 305, and 306 shown in FIG. 3 above.
An embodiment of the present invention further provides a computer program product for data processing, including a computer-readable storage medium that stores program code, where the program code includes instructions for executing the method procedure of any one of the foregoing method embodiments. A person of ordinary skill in the art will understand that the foregoing storage medium includes various non-transitory machine-readable media capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a random-access memory (RAM), a solid state disk (SSD), or other non-volatile memory.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments, and are not described again here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (19)
- A graph calculation method, characterized in that the method is applied to a disk-based graph computing system, the graph computing system includes M hosts, each host stores graph data on a local disk, the graph data includes N vertices, each host simultaneously runs N/M path calculations with different source points, each vertex currently has R1 random walk instances, and the current path length of each random walk instance is L1, the method comprising:
  obtaining, by a first host, a calculation result set produced after the X-th iteration is performed on a vertex set, where the vertex set is the set of vertices on which an update function is to be executed when the first host performs the (X+1)-th iteration;
  performing, by the first host, the (X+1)-th concurrent iteration according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1; and
  if R1/2 and 2L1+1 satisfy an iteration completion condition, completing, by the first host, the calculation of the graph data.
- The method according to claim 1, characterized in that the obtaining, by the first host, of the calculation result set produced after the X-th iteration is performed on the vertex set comprises:
  obtaining, by the first host through a network, a calculation result set TX(ui) produced after the X-th iteration completed with a vertex set ui, where the vertices included in the vertex set ui are the vertices on which the update function is to be executed when the first host performs the (X+1)-th iteration with vertex v as the source point, the vertex v is one of the N/M vertices on the first host, the vertex set ui includes the vertex v, and X is an integer greater than 0;
  and the performing, by the first host, of the (X+1)-th concurrent iteration with vertex v as the source point according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1, comprises:
  performing, by the first host, the (X+1)-th concurrent iteration with vertex v as the source point according to the graph data and the calculation result TX(ui), so that the number of random walk instances with vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1.
- The method according to claim 2, characterized in that the performing, by the first host, of the (X+1)-th concurrent iteration with vertex v as the source point according to the graph data and the calculation result set TX(ui), so that the number of random walk instances with vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1, comprises:
  performing, by the first host according to the graph data, the (X+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point is R1 and each of the R1 random walk instances grows by one path length;
  performing, by the first host according to the calculation result set TX(ui), the (X+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1; and
  determining, by the first host, from the R1 random walk instances each growing by one path length and the current path length of each of the R1/2 random walk instances becoming 2L1, that the current path length of each of the R1/2 random walk instances becomes 2L1+1.
- The method according to claim 3, characterized in that the vertex set ui is {v}∪Q1, where Q1∈Q and the set Q is {u1,u2,......,un}; the calculation result set TX(ui) is {TX(v)}∪TX(Q1), where TX(Q1)∈TX(Q) and the set TX(Q) is {TX(u1),TX(u2),......,TX(un)};
  and the performing, by the first host according to the calculation result set TX(ui), of the (X+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1, comprises:
  if Q1 is {u1,u2,u3,u4}, the calculation result set TX(ui) is {TX(v),TX(u1),TX(u2),TX(u3),TX(u4)}; and
  performing, by the first host according to the following formula, the (X+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1;
- The method according to claim 3, characterized in that the performing, by the first host according to the calculation result set TX(ui), of the (X+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1, comprises:
  obtaining, by the first host, the current path length of each of the first R1/2 random walk instances that start from vertex v as the source point after the X-th iteration, and obtaining the current path length of each of the last R1/2 random walk instances that start from another source point u after the X-th iteration, where the other source point u is the vertex at which the first R1/2 random walk instances starting from vertex v are located after the X-th iteration; and
  splicing, by the first host, the current path length of each of the first R1/2 random walk instances starting from vertex v with the current path length of each of the last R1/2 random walk instances starting from the other source point u, so that the number of random walk instances with vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1.
- The method according to any one of claims 1 to 5, characterized in that each vertex currently has R2 random walk instances, the path length of each random walk instance is L2, and the method further comprises:
  performing, by the first host according to the graph data, the (Z+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point is R2 and the path length of each of the R2 random walk instances increases by one path length, where Z is an integer greater than or equal to 0.
- The method according to any one of claims 1 to 6, characterized in that each vertex currently has R3 random walk instances, the current path length of each random walk instance is L3, and the method further comprises:
  obtaining, by the first host through the network, a calculation result set TY(hi) produced after the Y-th iteration completed with a vertex set hi, where the vertices included in the vertex set hi are the vertices on which the update function is to be executed when the first host performs the (Y+1)-th iteration with vertex v as the source point, the vertex set hi includes the vertex v, and Y is an integer greater than 0; and
  performing, by the first host according to the calculation result set TY(hi), the (Y+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R3/2 and the current path length of each of the R3/2 random walk instances becomes 2L3.
- The method according to claim 7, characterized in that the vertex set hi is {v}∪W1, where W1∈W and the set W is {h1,h2,......,hn}; the calculation result set TY(hi) is {TY(v)}∪TY(W1), where TY(W1)∈TY(W) and the set TY(W) is {TY(h1),TY(h2),......,TY(hn)};
  and the performing, by the first host according to the calculation result set TY(hi), of the (Y+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R3/2 and the current path length of each of the R3/2 random walk instances becomes 2L3, comprises:
  if W1 is {h1,h2,h3,h4}, the calculation result set TY(hi) is {TY(v),TY(h1),TY(h2),TY(h3),TY(h4)}; and
  performing, by the first host according to the following formula, the (Y+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R3/2 and the path length of each of the R3/2 random walk instances becomes 2L3;
- The method according to claim 7, characterized in that the performing, by the first host according to the calculation result set TY(hi), of the (Y+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R3/2 and the current path length of each of the R3/2 random walk instances becomes 2L3, comprises:
  obtaining, by the first host, the current path length of each of the first R3/2 random walk instances that start from vertex v as the source point after the Y-th iteration, and obtaining the current path length of each of the last R3/2 random walk instances that start from another source point w after the Y-th iteration, where the other source point w is the vertex at which the first R3/2 random walk instances starting from vertex v are located after the Y-th iteration; and
  splicing, by the first host, the current path length of each of the first R3/2 random walk instances starting from vertex v with the current path length of each of the last R3/2 random walk instances starting from the other source point w, so that the number of random walk instances with vertex v as the source point becomes R3/2 and the current path length of each of the R3/2 random walk instances becomes 2L3.
- A host, characterized in that the host is applied to a disk-based graph computing system, the graph computing system includes M hosts, each host stores graph data on a local disk, the graph data includes N vertices, each host simultaneously runs N/M path calculations with different source points, each vertex currently has R1 random walk instances, and the current path length of each random walk instance is L1, the host comprising:
  an obtaining unit, configured to obtain a calculation result set produced after the X-th iteration is performed on a vertex set, where the vertex set is the set of vertices on which an update function is to be executed when the first host performs the (X+1)-th iteration; and
  a calculating unit, configured to perform the (X+1)-th concurrent iteration according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1; and further configured so that, if R1/2 and 2L1+1 satisfy an iteration completion condition, the first host completes the calculation of the graph data.
- The host according to claim 10, characterized in that:
  the obtaining unit is specifically configured to obtain, through a network, a calculation result set TX(ui) produced after the X-th iteration completed with a vertex set ui, where the vertices included in the vertex set ui are the vertices on which the update function is to be executed when the first host performs the (X+1)-th iteration with vertex v as the source point, the vertex v is one of the N/M vertices on the first host, the vertex set ui includes the vertex v, and X is an integer greater than 0; and
  the calculating unit is specifically configured to perform the (X+1)-th concurrent iteration with vertex v as the source point according to the graph data and the calculation result TX(ui), so that the number of random walk instances with vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1+1.
- The host according to claim 11, characterized in that:
  the calculating unit is specifically configured to: perform, according to the graph data, the (X+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point is R1 and each of the R1 random walk instances grows by one path length;
  perform, according to the calculation result set TX(ui), the (X+1)-th iteration with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R1/2 and the current path length of each of the R1/2 random walk instances becomes 2L1; and
  determine, from the R1 random walk instances each growing by one path length and the current path length of each of the R1/2 random walk instances becoming 2L1, that the current path length of each of the R1/2 random walk instances becomes 2L1+1.
- The host according to claim 12, wherein the vertex set u_i is {v} ∪ Q_1, where Q_1 ∈ Q and the set Q is {u_1, u_2, ..., u_n}; the calculation result set T_X(u_i) is {T_X(v)} ∪ T_X(Q_1), where T_X(Q_1) ∈ T_X(Q) and the set T_X(Q) is {T_X(u_1), T_X(u_2), ..., T_X(u_n)}; and the calculating unit is further specifically configured to: if Q_1 is {u_1, u_2, u_3, u_4}, so that the calculation result set T_X(u_i) is {T_X(v), T_X(u_1), T_X(u_2), T_X(u_3), T_X(u_4)}, perform the (X+1)th iteration calculation with vertex v as the source point according to the following formula, so that the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1; [formula not reproduced in this text] where T_X(v) denotes the calculation result after the Xth iteration calculation with vertex v as the source point, and the operator in the formula indicates that the calculation results on its left and right sides are merged by the specified operation.
- The host according to claim 12, wherein the obtaining unit is further specifically configured to obtain, after the Xth iteration calculation, the current path length of each of the first R_1/2 random walk instances starting from vertex v as the source point, and to obtain, after the Xth iteration calculation, the current path length of each of the last R_1/2 random walk instances starting from another source point u, where the other source point u is the vertex at which the first R_1/2 random walk instances starting from vertex v are located after the Xth iteration calculation; and the calculating unit is further specifically configured to splice the current path length of each of the first R_1/2 random walk instances starting from vertex v with the current path length of each of the last R_1/2 random walk instances starting from the other source point u, so that the number of random walk instances with vertex v as the source point becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1.
- The host according to any one of claims 9 to 14, wherein each vertex currently has R_2 random walk instances and the path length of each random walk instance is L_2; and the calculating unit is further specifically configured to perform, according to the graph data, the (Z+1)th iteration calculation with vertex v as the source point, so that the number of random walk instances with vertex v as the source point is R_2 and the path length of each of the R_2 random walk instances increases by one path length, where Z is an integer greater than or equal to 0.
- The host according to any one of claims 9 to 15, wherein each vertex currently has R_3 random walk instances and the current path length of each random walk instance is L_3; the obtaining unit is further specifically configured to obtain, over a network, the calculation result set T_Y(h_i) produced after the Yth iteration calculation has been completed for the vertex set h_i, where the vertices in the vertex set h_i are the vertices on which the first host is to execute the update function when performing the (Y+1)th iteration calculation with vertex v as the source point, the vertex set h_i includes the vertex v, and Y is an integer greater than 0; and the calculating unit is further specifically configured to perform, according to the calculation result set T_Y(h_i), the (Y+1)th iteration calculation with vertex v as the source point, so that the number of random walk instances with vertex v as the source point becomes R_3/2 and the current path length of each of the R_3/2 random walk instances becomes 2L_3.
- The host according to claim 16, wherein the vertex set h_i is {v} ∪ W_1, where W_1 ∈ W and the set W is {h_1, h_2, ..., h_n}; the calculation result set T_Y(h_i) is {T_Y(v)} ∪ T_Y(W_1), where T_Y(W_1) ∈ T_Y(W) and the set T_Y(W) is {T_Y(h_1), T_Y(h_2), ..., T_Y(h_n)}; and the calculating unit is further specifically configured to: if W_1 is {h_1, h_2, h_3, h_4}, so that the calculation result set T_Y(h_i) is {T_Y(v), T_Y(h_1), T_Y(h_2), T_Y(h_3), T_Y(h_4)}, perform the (Y+1)th iteration calculation with vertex v as the source point according to the following formula, so that the number of random walk instances with vertex v as the source point becomes R_3/2 and the path length of each of the R_3/2 random walk instances becomes 2L_3; [formula not reproduced in this text] where T_Y(v) denotes the calculation result after the Yth iteration calculation with vertex v as the source point, and the operator in the formula indicates that the calculation results on its left and right sides are merged by the specified operation.
- The host according to claim 16, wherein the obtaining unit is further specifically configured to obtain, after the Yth iteration calculation, the current path length of each of the first R_3/2 random walk instances starting from vertex v as the source point, and to obtain, after the Yth iteration calculation, the current path length of each of the last R_3/2 random walk instances starting from another source point w, where the other source point w is the vertex at which the first R_3/2 random walk instances starting from vertex v are located after the Yth iteration calculation; and the calculating unit is further specifically configured to splice the current path length of each of the first R_3/2 random walk instances starting from vertex v with the current path length of each of the last R_3/2 random walk instances starting from the other source point w, so that the number of random walk instances with vertex v as the source point becomes R_3/2 and the current path length of each of the R_3/2 random walk instances becomes 2L_3.
- A host, wherein the host is applied to a disk-based graph computing system, the graph computing system includes M hosts, each host stores graph data on a local disk, the graph data includes N vertices, each host simultaneously runs path calculations for N/M different sources, each vertex currently has R_1 random walk instances, and the current path length of each random walk instance is L_1, the host comprising: a transceiver and a processor, the transceiver and the processor being connected by a bus; the transceiver is configured to obtain the calculation result set produced after the Xth iteration calculation for a vertex set, where the vertex set is the set of vertices on which the first host is to execute the update function when performing the (X+1)th iteration calculation; and the processor is configured to perform the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R_1/2 and the current path length of each of the R_1/2 random walk instances becomes 2L_1+1, and is further configured such that, if R_1/2 and 2L_1+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.
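
The bookkeeping in the claims above (R walks of length L per vertex becoming R/2 walks of length 2L+1) can be illustrated with a minimal single-process Python sketch. This is not the claimed implementation: the function names, the in-memory `adj` and `walks` dictionaries, the rule that pairs the i-th first-half walk with the i-th second-half walk, and the order of the two steps (splice first, then extend by one edge) are all assumptions made for illustration, and the multi-host and disk aspects are left out entirely. Path length is counted in edges.

```python
import random

def extend_by_one(adj, walks, rng):
    """Append one random out-neighbour to every walk (path length L -> L + 1)."""
    return {v: [w + [rng.choice(adj[w[-1]])] for w in ws]
            for v, ws in walks.items()}

def splice_halves(walks):
    """Join each first-half walk from v with a second-half walk that starts at
    the vertex u where it ends (R walks of length L -> R/2 walks of length 2L).
    Assumes every vertex holds the same even number R of walks."""
    spliced = {}
    for v, ws in walks.items():
        half = len(ws) // 2
        joined = []
        for i, first in enumerate(ws[:half]):
            u = first[-1]                    # endpoint of the walk from v
            second = walks[u][half + i]      # hypothetical pairing rule
            joined.append(first + second[1:])  # drop the duplicated vertex u
        spliced[v] = joined
    return spliced

def path_length(walk):
    """Number of edges in a walk stored as a list of vertices."""
    return len(walk) - 1

def doubling_round(adj, walks, rng):
    """One round: splice (R -> R/2, L -> 2L), then extend by one edge (2L + 1)."""
    return extend_by_one(adj, splice_halves(walks), rng)

# Tiny directed graph in which every vertex has at least one out-neighbour.
adj = {0: [1, 2], 1: [2], 2: [0, 1]}
rng = random.Random(0)

# Start with R = 4 walks of length L = 1 on every vertex.
start = extend_by_one(adj, {v: [[v] for _ in range(4)] for v in adj}, rng)
after = doubling_round(adj, start, rng)
```

With R = 4 and L = 1 at the start, one round leaves R/2 = 2 walks of length 2L+1 = 3 on each vertex, which matches the counts stated in the claims. Repeating `doubling_round` keeps halving the walk count while roughly doubling the length, until an iteration completion condition on the two values is met.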
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610527136.7 | 2016-07-06 | ||
CN201610527136.7A CN107590769B (en) | 2016-07-06 | 2016-07-06 | Graph data calculation method, host and graph calculation system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018006625A1 (en) | 2018-01-11 |
Family
ID=60901384
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2017/077859 WO2018006625A1 (en) | 2016-07-06 | 2017-03-23 | Graph data calculation method, host and graph calculation system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107590769B (en) |
WO (1) | WO2018006625A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110166344B (en) * | 2018-04-25 | 2021-08-24 | 腾讯科技(深圳)有限公司 | Identity identification method, device and related equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104063507A (en) * | 2014-07-09 | 2014-09-24 | 时趣互动(北京)科技有限公司 | Graph computation method and engine |
US20150186427A1 (en) * | 2013-12-26 | 2015-07-02 | Telefonica Digital Espana, S.L.U. | Method and system of analyzing dynamic graphs |
CN105224528A (en) * | 2014-05-27 | 2016-01-06 | 华为技术有限公司 | The large data processing method calculated based on figure and device |
CN105653204A (en) * | 2015-12-24 | 2016-06-08 | 华中科技大学 | Distributed graph calculation method based on disk |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8645339B2 (en) * | 2011-11-11 | 2014-02-04 | International Business Machines Corporation | Method and system for managing and querying large graphs |
2016
- 2016-07-06 CN CN201610527136.7A patent/CN107590769B/en active Active
2017
- 2017-03-23 WO PCT/CN2017/077859 patent/WO2018006625A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN107590769B (en) | 2021-02-09 |
CN107590769A (en) | 2018-01-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022048280A1 (en) | Distributed quantum computing simulation method and device | |
WO2018077181A1 (en) | Method and device for graph centrality calculation, and storage medium | |
EP3223169A1 (en) | Search method and apparatus for graph data | |
JP2019501451A (en) | Neural network execution order determination | |
WO2017137000A1 (en) | Method, device and apparatus for combining different instances describing same entity | |
WO2017076296A1 (en) | Method and device for processing graph data | |
JP5950285B2 (en) | A method for searching a tree using an instruction that operates on data having a plurality of predetermined bit widths, a computer for searching a tree using the instruction, and a computer thereof program | |
JP6387399B2 (en) | Management of memory and storage space for data manipulation | |
CN112667860A (en) | Sub-graph matching method, device, equipment and storage medium | |
CN108880846B (en) | Method and device for determining vector representation form for nodes in network | |
CN103701469A (en) | Compression and storage method for large-scale image data | |
Xie et al. | Graphiler: Optimizing graph neural networks with message passing data flow graph | |
JP2019513245A (en) | METHOD, DEVICE, SERVER AND STORAGE MEDIUM FOR SEARCHING GROUPS BASED ON SOCIAL NETWORK | |
CN105978711A (en) | Best switching edge searching method based on minimum spanning tree | |
US8392393B2 (en) | Graph searching | |
Sarkar et al. | Flowgnn: A dataflow architecture for universal graph neural network inference via multi-queue streaming | |
WO2018006625A1 (en) | Graph data calculation method, host and graph calculation system | |
CN114567634B (en) | Method, system, storage medium and electronic device for calculating E-level map facing backward | |
JP5964781B2 (en) | SEARCH DEVICE, SEARCH METHOD, AND SEARCH PROGRAM | |
CN113222125A (en) | Convolution operation method and chip | |
US11675766B1 (en) | Scalable hierarchical clustering | |
CN110751284B (en) | Heterogeneous information network embedding method and device, electronic equipment and storage medium | |
Ying et al. | Towards fault tolerance optimization based on checkpoints of in-memory framework spark | |
CN113568987B (en) | Training method and device for knowledge graph embedded model and computer equipment | |
CN116974249A (en) | Flexible job shop scheduling method and flexible job shop scheduling device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17823438; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 17823438; Country of ref document: EP; Kind code of ref document: A1 |