WO2018006625A1

WO2018006625A1 - Graph data calculation method, host and graph calculation system

Info

Publication number: WO2018006625A1
Application number: PCT/CN2017/077859
Authority: WO
Inventors: 成杰峰; 李震国; 刘勤
Original assignee: 华为技术有限公司
Priority date: 2016-07-06
Filing date: 2017-03-23
Publication date: 2018-01-11
Also published as: CN107590769B; CN107590769A

Abstract

A graph calculation method for increasing the speed of graph calculation and thereby saving time. The method comprises: a first host acquiring a calculation result set obtained after the Xth iterative calculation is performed on a vertex set, the vertex set being the set of vertices on which the first host is to execute an update function when performing the (X+1)th iterative calculation; the first host performing the (X+1)th concurrent iterative calculation according to graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1; and, if R₁/2 and 2L₁+1 satisfy the iteration completion condition, the first host completing the calculation of the graph data.

Description

Method, host and graph computing system for graph data calculation

This application claims priority to Chinese Patent Application No. 201610527136.7, filed with the Chinese Patent Office on July 6, 2016 and entitled "a method for computing data, a host computer, and a graph computing system", which is incorporated herein by reference in its entirety.

Technical field

The present invention relates to the field of computers, and in particular, to a method, a host, and a graph computing system for graph data calculation.

Background

With advances in the ability to collect and generate data, we have entered the era of big data: every day, large amounts of data are collected from sensors, devices and the Internet. To find new business value and build new business models, this data must be processed, analyzed, stored and understood. Driven by the need for large-scale graph data analysis, many distributed and stand-alone graph computing systems have emerged in recent years. Common examples include the large-scale distributed graph computing framework Pregel, the memory-based distributed graph computing system GraphLab, and the disk-based stand-alone graph computing system GraphChi.

"Graph computing" refers to an abstract representation of real-world structures as a "graph", based on graph theory, together with a computational model on that data structure. A graph computing system runs an algorithm in multiple rounds of iteration until the algorithm converges. Generally, the data processing algorithm is abstracted using the "think like a vertex" idea: the user writes a vertex program that forms an update function on the graph, so the update function is user-defined. In the prior art, the update function is a user-defined calculation that can process the path calculation of one source. An update function can modify the weight of a vertex and of the edges connected to it. In each iteration of an algorithm, the system executes the update function on every vertex of the graph.

In the context of big data analytics, however, the graph to be processed is usually larger than the memory of a single computer. Therefore, before the calculation starts, the graph is divided into equal parts according to the number of compute nodes in the cluster and allocated to the memory of those nodes. During the calculation, the hosts must communicate over the network to exchange the calculation state held in memory so that the overall calculation can advance. A graph calculation method based on random walks is generally employed. Suppose a graph includes N nodes. For each of the N nodes, a random walk path with that node as the source point must be searched independently, which requires N calculations on the same graph. Each random walk starts from a different starting node in the graph, and at each step randomly selects an adjacent vertex of the current node to move to. Each step of a random walk to the next adjacent vertex requires one iteration of the graph calculation, so the number of iterations the system must process to complete a sample equals the length of the random walk path to be sampled. Consequently, N samples require N graph calculations, each corresponding to one sampling calculation, and the total calculation time is N times the time of one distributed sampling calculation. Alternatively, an existing stand-alone system can be extended so that the N samples run simultaneously on different hosts; the overall calculation time is then N/M times the calculation time of a single sample, where M is the number of hosts.

However, for the graph data processed by a graph computing system, N is large and variable, whereas M is a small constant. By the above analysis, even if an existing graph computing system were fast enough to complete one sampling calculation per second, a common graph with more than 10 million nodes would take more than 10^7 seconds, that is, about 115 days. How to reduce the time of graph calculation is therefore an important challenge.
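The estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name and the choice of M = 10 hosts are illustrative assumptions, not values from the source.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def total_sampling_days(n_sources: int, seconds_per_sample: float, n_hosts: int = 1) -> float:
    """Days needed when every host runs its N/M single-source samples one after another."""
    total_seconds = n_sources / n_hosts * seconds_per_sample
    return total_seconds / SECONDS_PER_DAY


# 10 million sources at one second per sampling calculation, as in the text.
single_host_days = total_sampling_days(10_000_000, 1.0)
# Hypothetical cluster of M = 10 hosts: the total shrinks only by the constant factor M.
ten_host_days = total_sampling_days(10_000_000, 1.0, n_hosts=10)
print(f"{single_host_days:.1f} days on one host, {ten_host_days:.1f} days on ten hosts")
```

Because M is a small constant, dividing by it does not change the fact that the total time grows linearly with N.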

Summary of the invention

Embodiments of the present invention provide a method, a host, and a graph computing system for graph calculation, which are used to improve the efficiency of graph calculation and save time.

A first aspect of the embodiments of the present invention provides a graph calculation method applied to a disk-based graph computing system. The graph computing system includes M hosts, and each host saves graph data on a local disk. The graph data includes N vertices, and each host simultaneously runs path calculations for N/M different sources. Currently, there are R₁ random walk instances on each vertex, and the current path length of each random walk instance is L₁. The method may include:

The first host obtains a calculation result set produced by the Xth iterative calculation on a vertex set, where the vertex set is the set of vertices on which the first host is to execute the update function when performing the (X+1)th iterative calculation; the first host performs the (X+1)th concurrent iterative calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1; and if R₁/2 and 2L₁+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data.

In the embodiments of the present invention, the iterative calculation performed by the first host is equivalent to performing a disk iterative calculation and a network iterative calculation simultaneously, that is, concurrently performing two iterative calculations, one based on disk input/output and one based on network input/output; this may be referred to as a hybrid iterative calculation. After one hybrid iterative calculation, the path length of each random walk instance grows by the largest amount, from the original L₁ to 2L₁+1, so fewer iterations, and correspondingly less time, are needed to complete the calculation of the graph data. Over a complete graph calculation process, the efficiency of the graph calculation is therefore improved and time is saved.
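As a worked illustration of this growth, assume (this is an assumption for illustration, not a statement from the source) that k hybrid iterations are applied in a row, starting from R^(0) instances of path length L^(0) on each vertex. The per-iteration rule R → R/2, L → 2L+1 then unrolls as follows:

```latex
\begin{aligned}
R^{(k+1)} &= \tfrac{1}{2}\,R^{(k)}, & L^{(k+1)} &= 2L^{(k)} + 1,\\
R^{(k)} &= \frac{R^{(0)}}{2^{k}}, & L^{(k)} &= 2^{k}\bigl(L^{(0)} + 1\bigr) - 1 .
\end{aligned}
```

The closed form for the path length follows because L^(k+1) + 1 = 2(L^(k) + 1). For example, with L^(0) = 1, three hybrid iterations give a path length of 15, whereas three iterations that each add a single step give a path length of 4.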

With reference to the first aspect of the embodiments of the present invention, in a first possible implementation of the first aspect, the first host obtaining the calculation result set produced by the Xth iterative calculation on the vertex set may include: the first host obtaining, through the network, a calculation result set T_X(u_i) produced by the Xth iterative calculation on a vertex set u_i, where the vertices in the vertex set u_i are the vertices on which the first host is to execute the update function in the (X+1)th iterative calculation with a vertex v as the source point, the vertex v is one of the N/M vertices on the first host, the vertex set u_i includes the vertex v, and X is greater than 0. The first host performing the (X+1)th concurrent iterative calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1, may include: the first host performing, according to the graph data and the calculation result set T_X(u_i), the (X+1)th concurrent iterative calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1.

In the embodiments of the present invention, the network iterative calculation and the disk iterative calculation are described specifically with the vertex v as the source point, which further details the process of running the network iteration and the disk iteration concurrently. After one hybrid iteration, the number of random walk instances on each vertex changes from R₁ to R₁/2, and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1. Compared with a separate disk iteration or a separate network iteration, a hybrid iteration therefore increases the path length of each random walk instance the most, and the graph calculation time is reduced correspondingly.

With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the first host performing, according to the graph data and the calculation result set T_X(u_i), the (X+1)th concurrent iterative calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1, may include:

The first host performs, according to the graph data, the (X+1)th iterative calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point remains R₁ and each of the R₁ random walk instances is extended by one path length; the first host performs, according to the calculation result set T_X(u_i), the (X+1)th iterative calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁; and the first host determines, from the one-path-length extension of each of the R₁ random walk instances and the path length 2L₁ of each of the R₁/2 random walk instances, that the current path length of each of the R₁/2 random walk instances is 2L₁+1.

In the embodiments of the present invention, the iterative calculation that the first host performs according to the graph data is called a disk iteration, and the iterative calculation that it performs according to the calculation result set T_X(u_i) obtained through the network is called a network iteration; the concurrent calculation of the network iteration and the disk iteration is described specifically here. When a network iteration is performed, the number of random walk instances on each vertex is halved, but the path length of each random walk instance changes from the previous L₁ to 2L₁. When a disk iteration is performed, the number of random walk instances on each vertex is unchanged, but each path is extended by one path length. Performing both therefore makes the path length of each random walk instance 2L₁+1.
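The three kinds of iteration can be summarized as transitions on the pair (R, L). The sketch below tracks only these two numbers, not the walks themselves; the function names, the starting state (16, 1), and the stopping rule "until one instance remains" are illustrative assumptions, and R is assumed to be a power of two so that halving is exact.

```python
from typing import List, Tuple

State = Tuple[int, int]  # (random walk instances per vertex R, current path length L)


def disk_step(state: State) -> State:
    """Disk iteration: the instance count is unchanged, each path grows by one."""
    r, length = state
    return r, length + 1


def network_step(state: State) -> State:
    """Network iteration: the instance count is halved, each path length doubles."""
    r, length = state
    return r // 2, 2 * length


def hybrid_step(state: State) -> State:
    """Hybrid iteration: both effects at once, giving (R/2, 2L + 1)."""
    r, length = state
    return r // 2, 2 * length + 1


# Hypothetical starting point: 16 instances of path length 1 on each vertex.
state: State = (16, 1)
history: List[State] = [state]
while state[0] > 1:  # assumed stopping rule, for illustration only
    state = hybrid_step(state)
    history.append(state)
print(history)
```

Each hybrid step trades half of the remaining instances for slightly more than a doubling of the path length.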

With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the vertex set u_i is {v}∪Q₁, where Q₁∈Q and the set Q is {u₁, u₂, ..., u_n}; the calculation result set T_X(u_i) is {T_X(v)}∪T_X(Q₁), where T_X(Q₁)∈T_X(Q),

and the set T_X(Q) is {T_X(u₁), T_X(u₂), ..., T_X(u_n)};

The first host performing, according to the calculation result set T_X(u_i), the (X+1)th iterative calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁, may include: if Q₁ is {u₁, u₂, u₃, u₄}, the calculation result set T_X(u_i) is {T_X(v), T_X(u₁), T_X(u₂), T_X(u₃), T_X(u₄)}; the first host performs the (X+1)th iterative calculation with the vertex v as the source point according to the following formula, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁;

where T_X(v) represents the calculation result after the Xth iteration with the vertex v as the source point.

The operator in the formula indicates that the calculation results on its left and right sides are combined by the specified operation.

In the embodiments of the present invention, the network iteration may be described specifically by a formula, in which the path length of each random walk instance changes from L₁ to 2L₁.

With reference to the second possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the first host performing, according to the calculation result set T_X(u_i), the (X+1)th iterative calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁, may include:

The first host obtains the current path of each of the first R₁/2 random walk instances that start from the vertex v as the source point after the Xth iterative calculation, and obtains the current path of each of R₁/2 random walk instances that start from another source point u after the Xth iterative calculation, where the other source point u is a vertex at which the first R₁/2 random walk instances starting from the vertex v are located after X iterations; the first host splices the current paths of the first R₁/2 random walk instances starting from the vertex v with the current paths of the R₁/2 random walk instances starting from the other source point u, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁.

In the embodiments of the present invention, a specific description is given of how the length of a random walk instance changes from L₁ to 2L₁ when a network iteration is performed, which increases the feasibility of the technical solution of the present invention.
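The splicing step can be sketched on explicit walks. This is a minimal single-process illustration, not the patented implementation: the toy graph, the helper names, and in particular the rule for choosing which walk from u is appended (a walk from the second half of u's instances) are assumptions made for the example. Every vertex is assumed to have at least one outgoing edge.

```python
import random
from typing import Dict, List

Graph = Dict[int, List[int]]
Walk = List[int]  # vertex sequence; path length = len(walk) - 1


def sample_walks(graph: Graph, r: int, length: int, rng: random.Random) -> Dict[int, List[Walk]]:
    """Start r random walks of the given path length from every vertex."""
    walks: Dict[int, List[Walk]] = {}
    for v in graph:
        walks[v] = []
        for _ in range(r):
            walk = [v]
            for _ in range(length):
                walk.append(rng.choice(graph[walk[-1]]))
            walks[v].append(walk)
    return walks


def splice_walks(walks: Dict[int, List[Walk]]) -> Dict[int, List[Walk]]:
    """Halve the walks per source and double their path length by splicing."""
    spliced: Dict[int, List[Walk]] = {}
    for v, own in walks.items():
        half = len(own) // 2
        spliced[v] = []
        for i in range(half):
            head = own[i]              # one of the first R/2 walks from v
            u = head[-1]               # vertex where that walk currently sits
            tail = walks[u][half + i]  # assumption: take a walk from u's second half
            spliced[v].append(head + tail[1:])  # drop the duplicated vertex u
    return spliced


graph: Graph = {0: [1, 2], 1: [2], 2: [0]}  # hypothetical toy graph
walks = sample_walks(graph, r=4, length=2, rng=random.Random(0))
spliced = splice_walks(walks)
print(spliced[0])
```

In a distributed setting the walks starting from u would typically reside on another host, which is why the text obtains them through the network.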

With reference to the first aspect of the embodiments of the present invention, or any one of the first to fourth possible implementations of the first aspect, in a fifth possible implementation of the first aspect, there are currently R₂ random walk instances on each vertex, and the path length of each random walk instance is L₂. The method may further include: the first host performs, according to the graph data, the (Z+1)th iterative calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point remains R₂ and the path length of each of the R₂ random walk instances is increased by one path length, where Z is an integer greater than or equal to 0.

In the embodiments of the present invention, a separate disk iteration is described. When a disk iteration is performed, the number of random walk instances on each vertex is unchanged, and each random walk instance is extended by one path length.
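A disk iteration on explicit walks can be sketched as follows. The in-memory adjacency dictionary stands in for graph data read from the local disk, and the names used are illustrative assumptions.

```python
import random
from typing import Dict, List

Graph = Dict[int, List[int]]
Walk = List[int]  # vertex sequence; path length = len(walk) - 1


def extend_walks(graph: Graph, walks: Dict[int, List[Walk]], rng: random.Random) -> Dict[int, List[Walk]]:
    """Extend every walk of every source by one randomly chosen adjacent vertex."""
    return {
        source: [walk + [rng.choice(graph[walk[-1]])] for walk in instances]
        for source, instances in walks.items()
    }


graph: Graph = {0: [1, 2], 1: [2], 2: [0]}  # hypothetical toy graph
# R = 3 instances per source, each with path length 0 (only the source vertex).
walks = {v: [[v] for _ in range(3)] for v in graph}
walks = extend_walks(graph, walks, random.Random(0))
print(walks)
```

The instance count per source stays at R while every path length grows by exactly one.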

With reference to the first aspect of the embodiments of the present invention, or any one of the first to fifth possible implementations of the first aspect, in a sixth possible implementation of the first aspect, there are currently R₃ random walk instances on each vertex, and the current path length of each random walk instance is L₃. The method may further include:

The first host obtains, through the network, a calculation result set T_Y(h_i) produced by the Yth iterative calculation on a vertex set h_i, where the vertices in the vertex set h_i are the vertices on which the first host is to execute the update function in the (Y+1)th iterative calculation with the vertex v as the source point, the vertex set h_i includes the vertex v, and Y is an integer greater than 0; the first host performs, according to the calculation result set T_Y(h_i), the (Y+1)th iterative calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃.

In the embodiments of the present invention, a separate network iteration is described: the number of random walk instances on each vertex is halved, but the path length of each random walk instance changes from the previous L₃ to 2L₃. Compared with a disk iteration, each random walk instance gains more path length.
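To make the comparison concrete, the sketch below counts how many iterations each rule needs before the path length reaches a target. The target of 100 and the starting length of 1 are illustrative assumptions, and the sketch ignores the requirement that enough instances remain to be halved.

```python
from typing import Callable


def iterations_to_reach(target_length: int, step: Callable[[int], int], start_length: int = 1) -> int:
    """Count how many applications of `step` bring the path length to the target or beyond."""
    length, count = start_length, 0
    while length < target_length:
        length = step(length)
        count += 1
    return count


disk_only = iterations_to_reach(100, lambda length: length + 1)        # L -> L + 1
network_only = iterations_to_reach(100, lambda length: 2 * length)     # L -> 2L
hybrid = iterations_to_reach(100, lambda length: 2 * length + 1)       # L -> 2L + 1
print(disk_only, network_only, hybrid)
```

The disk-only count grows linearly with the target length, while the network and hybrid counts grow logarithmically.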

With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, the vertex set h_i is {v}∪W₁, where W₁∈W and the set W is {h₁, h₂, ..., h_n}; the calculation result set T_Y(h_i) is {T_Y(v)}∪T_Y(W₁), where T_Y(W₁)∈T_Y(W),

and the set T_Y(W) is {T_Y(h₁), T_Y(h₂), ..., T_Y(h_n)};

The first host performing, according to the calculation result set T_Y(h_i), the (Y+1)th iterative calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃, may include: if W₁ is {h₁, h₂, h₃, h₄}, the calculation result set T_Y(h_i) is {T_Y(v), T_Y(h₁), T_Y(h₂), T_Y(h₃), T_Y(h₄)};

The first host performs the (Y+1)th iterative calculation with the vertex v as the source point according to the following formula, so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the path length of each of the R₃/2 random walk instances becomes 2L₃;

where T_Y(v) represents the calculation result after the Yth iteration with the vertex v as the source point.

In the embodiments of the present invention, the network iterative calculation may be described specifically by a formula, in which the operator represents a doubling of the path length of each random walk instance.

With reference to the sixth possible implementation of the first aspect, in an eighth possible implementation of the first aspect, the first host performing, according to the calculation result set T_Y(h_i), the (Y+1)th iterative calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃, may include:

The first host obtains the current path of each of the first R₃/2 random walk instances that start from the vertex v as the source point after the Yth iterative calculation, and obtains the current path of each of R₃/2 random walk instances that start from another source point w after the Yth iterative calculation, where the other source point w is a vertex at which the first R₃/2 random walk instances starting from the vertex v are located after Y iterations; the first host splices the current paths of the first R₃/2 random walk instances starting from the vertex v with the current paths of the R₃/2 random walk instances starting from the other source point w, so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃.

In the embodiments of the present invention, a specific description is given of how the length of a random walk instance changes from L₁ to 2L₁ when the network iterative calculation is performed, which increases the feasibility of the technical solution of the present invention.

A second aspect of the embodiments of the present invention provides a host applied to a disk-based graph computing system. The graph computing system includes M hosts, and each host saves graph data on a local disk. The graph data includes N vertices, and each host simultaneously runs path calculations for N/M different sources. Currently, there are R₁ random walk instances on each vertex, and the current path length of each random walk instance is L₁. The host includes:

an obtaining unit, configured to obtain a calculation result set produced by the Xth iterative calculation on a vertex set, where the vertex set is the set of vertices on which the first host is to execute the update function when performing the (X+1)th iterative calculation; and

a calculation unit, configured to perform the (X+1)th concurrent iterative calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1, and further configured to complete the calculation of the graph data if R₁/2 and 2L₁+1 satisfy the iteration completion condition.

The host has the function of implementing the graph calculation method provided by the first aspect. This function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function described above.

A third aspect of the embodiments of the present invention provides a host applied to a disk-based graph computing system. The graph computing system includes M hosts, and each host saves graph data on a local disk. The graph data includes N vertices, and each host simultaneously runs path calculations for N/M different sources. Currently, there are R₁ random walk instances on each vertex, and the current path length of each random walk instance is L₁. The host may include a transceiver and a processor, which are connected by a bus;

The transceiver is configured to obtain a calculation result set produced by the Xth iterative calculation on a vertex set, where the vertex set is the set of vertices on which the first host is to execute the update function when performing the (X+1)th iterative calculation;

The processor is configured to perform the (X+1)th concurrent iterative calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1, and is further configured to complete the calculation of the graph data if R₁/2 and 2L₁+1 satisfy the iteration completion condition.

A fourth aspect of the embodiments of the present invention provides a graph computing system. The graph computing system includes M hosts, each of which stores graph data on a local disk; the graph data includes N vertices, and each host simultaneously runs path calculations for N/M different sources;

Each host in the graph computing system has the function of performing the graph calculation method provided by the first aspect above.

A fifth aspect of the embodiments of the present invention provides a storage medium. It should be noted that the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium for storing computer software instructions used by the above host, including a program designed for the host to execute the first aspect described above. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

It can be seen from the above technical solutions that the embodiments of the present invention have the following advantages:

In the embodiments of the present invention, the first host obtains a calculation result set produced by the Xth iterative calculation on a vertex set, where the vertex set is the set of vertices on which the first host is to execute the update function when performing the (X+1)th iterative calculation; the first host performs the (X+1)th concurrent iterative calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1; and if R₁/2 and 2L₁+1 satisfy the iteration completion condition, the first host completes the calculation of the graph data. The iterative calculation performed by the first host is equivalent to performing a disk iterative calculation and a network iterative calculation simultaneously, that is, concurrently performing two iterative calculations, one based on disk input/output (I/O) and one based on network input/output (I/O); this may be referred to as a hybrid iterative calculation. After one hybrid iterative calculation, the path length of each random walk instance grows by the largest amount, from the original L₁ to 2L₁+1, so correspondingly less time is needed to complete the calculation of the graph data. Over a complete graph calculation process, the efficiency of the graph calculation is therefore improved and time is saved.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments and of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art may derive other drawings from them without inventive effort.

FIG. 1 is a system architecture diagram of a graph computing system in an embodiment of the present invention;

FIG. 2 is a schematic diagram of running a disk-based graph calculation on M different hosts in a cluster according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an embodiment of a graph calculation method in an embodiment of the present invention;

FIG. 4 is a schematic diagram of different sources in an embodiment of the present invention;

FIG. 5 is a schematic diagram of an embodiment of a host according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of another embodiment of a host according to an embodiment of the present invention.

Detailed description

Embodiments of the present invention provide a method, a host, and a graph computing system for graph calculation, which are used to increase the rate of graph calculation and save time.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the scope of the present invention.

In mathematics, a graph is a structure that represents relationships between objects, and is the basic research object of graph theory. A graph consists of small dots (called vertices or nodes) and the lines or curves (called edges) that join them. Taking Internet web pages as an example, each web page can be regarded as a vertex, and if a web page contains a link to another web page, there can be considered to be an edge between the two pages. Taking a social network as an example, each user in the social network can be regarded as a vertex, and a friend relationship between users can be regarded as an edge.
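The web page example can be written down as an adjacency list, one common in-memory representation of a graph. The page names and the helper function below are hypothetical and serve only to illustrate the vertex and edge vocabulary.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_graph(links: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build an adjacency list: each page maps to the pages it links to."""
    adjacency: Dict[str, List[str]] = defaultdict(list)
    for source_page, target_page in links:
        adjacency[source_page].append(target_page)
        adjacency.setdefault(target_page, [])  # pages without outgoing links are vertices too
    return dict(adjacency)


# Hypothetical pages: "a" links to "b" and "c", and "b" links to "c".
web_graph = build_graph([("a", "b"), ("a", "c"), ("b", "c")])
print(web_graph)
```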

As graph computing becomes more widespread, the application requirements of graph computing systems present new challenges. In addition to the large amount of data to be processed, graph computing algorithms need to search for paths in a large amount of graph data, which is computationally intensive and highly complex. Existing graph computing systems are mainly memory-based: the graph data is divided into equal parts according to the number of computing nodes in the cluster, the divided graph data is assigned to the memory of those computing nodes, and the calculation then starts. Consider abstract graph data consisting of N vertices (nodes) and M existing hosts, where each host serially runs path calculations for N/M different sources. A random-walk-based algorithm requires that, for each of the N nodes in the graph, a random walk path with that node as the source point be searched independently. Each step of a random walk to the next adjacent vertex requires one iteration of the existing graph calculation, so the number of iterations the system must process to complete a sample equals the length of the random walk path to be sampled. N samples therefore require N graph calculations, each corresponding to one sampling calculation, and the total calculation time is N times the time of one sampling calculation. If an existing stand-alone system is extended so that the N samples run simultaneously on multiple hosts, each host must perform N/M samples, and because the M hosts run at the same time, the total calculation time is N/M times the time of one sampling calculation. In the context of big data, N is usually very large, so completing the required sampling calculations takes a very long time. How to improve the efficiency of graph calculation is therefore a major problem.

The technical solution of the present invention proposes a method that makes use of external storage, that is, a disk-based distributed graph calculation method. Each host has sufficient storage space to store the graph data on a local disk, so it is not necessary to partition the complete graph data across the memory of the hosts before calculating. The architecture and design of a traditional graph computing system usually retain the computation state of only one graph calculation, whereas N computation states now need to be retained at the same time. The size of each computation state is usually linear in a given number L (the step size) or in N, that is, O(L) or even O(N), so the size of all the computation states is O(L·N) or even O(N²). For a large graph, data of this size cannot be held in memory. The graph computing system to which the present invention is applied includes M hosts, each of which stores graph data on a local disk, as shown in FIG. 1; the number of hosts shown in FIG. 1 is not limiting in practical applications. The graph data includes N vertices, and each host runs path calculations for N/M different sources.
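A rough size estimate shows why these states do not fit in memory. The figures below are illustrative assumptions only: N = 10^7 sources (the graph size mentioned in the background), a hypothetical step size L = 100, and 8 bytes per stored entry.

```python
def total_state_bytes(n_sources: int, entries_per_state: int, bytes_per_entry: int = 8) -> int:
    """Total size of N computation states that each hold `entries_per_state` entries."""
    return n_sources * entries_per_state * bytes_per_entry


N = 10**7  # number of sources
L = 100    # hypothetical step size

linear_in_l = total_state_bytes(N, L)  # states of size O(L): O(L * N) overall
linear_in_n = total_state_bytes(N, N)  # states of size O(N): O(N^2) overall
print(f"O(L*N): {linear_in_l / 1e9:.0f} GB, O(N^2): {linear_in_n / 1e12:.0f} TB")
```

Under these assumptions the O(L·N) case already needs several gigabytes, and the O(N²) case is far beyond the memory of any single host.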

The graph computing system is a distributed cluster architecture. As shown in FIG. 2, disk-based graph calculation is performed on M different hosts in the cluster, and N/M path computation tasks with different sources run simultaneously on each host. Graph calculation can be divided into three parts: offline preprocessing, loading of graph data, and online graph calculation. In the preprocessing stage, the system divides the input data into several pieces and saves them on the disk of each computer. Each time a piece of data is loaded, the path calculation task code of the N/M different sources is scheduled for execution.

The path calculation of N/M different sources run by each host is explained here. The graph calculation system runs every algorithm in multiple rounds of iteration until the algorithm converges. A graph data processing algorithm is generally abstracted using the idea of "think like a vertex", that is, by writing a vertex program that forms an update function on the graph; the update function is defined by the user. In the prior art, the user-defined update function can process the path calculation of only one source point in one iteration calculation, whereas in the technical solution of the present invention the user-defined update function can process the path calculations of multiple different sources concurrently. The path calculations of different sources mentioned here are path calculations with different vertices as the source point. In the iterative calculation by which the graph calculation completes an algorithm, each iteration calculation consists of the system executing the update function once on each vertex of the graph data.

In the graph calculation process of the present invention, the system can quickly read the target graph data stored on the local disk through sequential disk input/output (I/O) for calculation, and different hosts do not need to exchange the graph data over the network. When the system performs the iterative calculation with the vertex v as the source point, it reads the target graph data through sequential disk I/O for iterative calculation, and network resources are also used efficiently: the calculation result set T_X(u_i) used for the iterative calculation is obtained over the network between different hosts. The calculation result set T_X(u_i) is the result of the previous iteration calculation for the vertex set u_i on which the update function is to be executed. Macroscopically, the iterative calculation using the data acquired from the disk and the iterative calculation using the data acquired over the network can be considered to be performed simultaneously, so that the path length of each of the R₁/2 random walk instances increases to 2L₁+1. When only the target data obtained from the local disk is used for the iterative calculation, the path length of each random walk instance can increase by only one path length. When only the calculation result set T_X(u_i) of the previous iteration, obtained over the network, is used for this iterative calculation, the path length of each of the R₁/2 random walk instances increases to 2L₁.

Combining the above, using both disk iterative calculation and network iterative calculation increases the path length of the random walk instances more quickly, thereby increasing the rate of graph calculation and saving time. In the following description, for convenience, the iterative calculation using data acquired from the disk is referred to as "disk iteration", the iterative calculation using data acquired from the network is referred to as "network iteration", and the concurrent execution of both is referred to as "hybrid iteration". It should be understood that disk iteration, network iteration and hybrid iteration are not established technical terms.
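As a summary of the three kinds of iteration, the effect of one iteration on a vertex can be written as a map on the pair (R, L), where R is the number of random walk instances on the vertex and L is their current path length. This is a restatement of the text above, not an additional formula from the original.

```latex
\begin{aligned}
\text{disk iteration:}    &\quad (R,\ L) \mapsto (R,\ L + 1) \\
\text{network iteration:} &\quad (R,\ L) \mapsto (R/2,\ 2L) \\
\text{hybrid iteration:}  &\quad (R,\ L) \mapsto (R/2,\ 2L + 1)
\end{aligned}
```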

The technical solution of the present invention is specifically described in the following embodiments. FIG. 3 shows an embodiment of the graph calculation method in an embodiment of the present invention. The method is applied to a disk-based graph computing system. The system includes M hosts, each of which saves the graph data on a local disk. The graph data includes N vertices, and each host simultaneously runs the path calculations of N/M different sources. The method includes:

301. The first host performs the (Z+1)th iteration calculation with the vertex v as the source point according to the graph data, and obtains that the number of random walk instances with the vertex v as the source point is R₂ and that the path length of each of the R₂ random walk instances is increased by 1 path length, where Z is an integer greater than or equal to 0;

In the embodiment of the present invention, there are currently R₂ random walk instances on each vertex, and the path length of each random walk instance is L₂. The first host performs the (Z+1)th iteration calculation with the vertex v as the source point according to the graph data, and obtains that the number of random walk instances with the vertex v as the source point is R₂ and that the path length of each of the R₂ random walk instances is increased by 1 path length, where Z is an integer greater than or equal to 0. The graph data here is the graph data saved on the local disk of each host.

Exemplarily, assume that the final requirement of the embodiment of the present invention, that is, the iterative completion condition, is that the number of random walk instances on each node reaches a first preset threshold R=10 and the path length of each random walk instance reaches a second preset threshold L=6. There are currently R₂=40 random walk instances on each vertex, and the path length of each random walk instance is L₂=0. Host 1 performs the first iteration calculation with v as the source point on the graph data saved on its local disk. The number of random walk instances with the vertex v as the source point is unchanged at 40, but the path length of each of these 40 random walk instances is increased by 1 path length, which can be expressed as {v, w}.
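A disk iteration of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory dictionary `graph` stands in for the graph data read from the local disk, its edges are hypothetical (the example does not give them), and the function name `disk_iteration` is introduced here.

```python
import random

def disk_iteration(graph, walks, rng):
    """Extend every random walk by one hop taken from the local graph data.

    graph: dict mapping a vertex to the list of its out-neighbours
           (stand-in for graph data read sequentially from local disk).
    walks: dict mapping a source vertex to its list of walks; each walk is
           a list of vertices that starts at the source.
    """
    extended = {}
    for source, paths in walks.items():
        # Each walk moves from its current end vertex to a random neighbour.
        extended[source] = [path + [rng.choice(graph[path[-1]])] for path in paths]
    return extended

# Hypothetical toy graph; every vertex needs at least one out-neighbour.
graph = {"v": ["w", "u"], "w": ["u", "v"], "u": ["v", "w"]}
R2 = 40                                                # instances per vertex
walks = {s: [[s] for _ in range(R2)] for s in graph}   # path length L2 = 0
walks = disk_iteration(graph, walks, random.Random(0))
```

After the call, each vertex still has 40 walks and every walk has one hop, matching the state (R₂=40, path length 1) described in the example.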

It should be noted that, in the graph computing process, the first iteration calculation can generally only be a disk iteration, and subsequent iterative calculations are mainly hybrid iterations because of their high efficiency; individual disk iterations and network iterations may also be interspersed according to actual conditions, which is not limited here.

In the embodiment of the present invention, the first host acquires the calculation result set obtained after the Xth iteration calculation is performed on a vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation. This is described in step 302.

302. The first host obtains, through the network, the calculation result set T_X(u_i) of the Xth iteration calculation completed for the vertex set u_i, where the vertices included in the vertex set u_i are the vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation with the vertex v as the source point, the vertex v is one of the N/M vertices on the first host, the vertex set u_i includes v, and X is an integer greater than 0;

In the embodiment of the present invention, the first host obtains, through the network, the calculation result set T_X(u_i) of the Xth iteration calculation completed for the vertex set u_i. The vertices included in the vertex set u_i are the vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation with the vertex v as the source point, the vertex v is one of the N/M vertices on the first host, the vertex set u_i includes v, and X is an integer greater than 0. There are currently R₁ random walk instances on each vertex, and the current path length of each random walk instance is L₁.

Exemplarily, continuing from step 301, the second iteration calculation with the vertex v as the source point is performed here. There are currently R₁=40 random walk instances on each vertex, and the current path length of each of these 40 random walk instances is L₁=1, that is, 1 path length. Host 1 obtains the calculation result set T_X(u_i) from the network. Assume that, for the second iteration calculation with the vertex v as the source point, the vertex set on which the update function is to be executed is {v, u₁, u₂, u₃, u₄}, that is, v, u₁, u₂, u₃ and u₄ are the vertices on which the update function is executed for the second time, and the corresponding calculation results after the first iteration calculation are T₁(v), T₁(u₁), T₁(u₂), T₁(u₃) and T₁(u₄). The calculation result set is then {T₁(v), T₁(u₁), T₁(u₂), T₁(u₃), T₁(u₄)}. The number of vertices on which the update function is executed is not limited here.

In the embodiment of the present invention, the first host performs the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, and obtains that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1. This is described in step 303.

303. The first host performs the (X+1)th concurrent iteration calculation with the vertex v as the source point according to the graph data and the calculation result set T_X(u_i), and obtains that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1;

In the embodiment of the present invention, the first host performs the (X+1)th concurrent iteration calculation with the vertex v as the source point according to the graph data and the calculation result set T_X(u_i), and obtains that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1.

Specifically, the method includes: 1) the first host performs the (X+1)th iteration calculation with the vertex v as the source point according to the graph data, and obtains that the number of random walk instances with the vertex v as the source point is R₁ and that each of the R₁ random walk instances gains 1 path length;

2) The first host performs the (X+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_X(u_i), and obtains that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁;

3) Based on the R₁ random walk instances each gaining 1 path length and on the current path length of each of the R₁/2 random walk instances becoming 2L₁, the first host determines that the current path length of each of the R₁/2 random walk instances becomes 2L₁+1.

The step in which the first host performs the (X+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_X(u_i), and obtains that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁, may include:

(1) The vertex set u_i is {v}∪Q₁, where Q₁⊆Q and the set Q is {u₁, u₂, ..., u_n}; the calculation result set T_X(u_i) is {T_X(v)}∪T_X(Q₁), where T_X(Q₁)⊆T_X(Q) and the set T_X(Q) is {T_X(u₁), T_X(u₂), ..., T_X(u_n)};

If Q₁ is {u₁, u₂, u₃, u₄}, the calculation result set T_X(u_i) is {T_X(v), T_X(u₁), T_X(u₂), T_X(u₃), T_X(u₄)};

The first host performs the (X+1)th iteration calculation with the vertex v as the source point according to the formula, and obtains that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁;

In the formula, the operator indicates that the calculation results at its left and right ends are combined by the specified operation. It should be understood that the above vertex set Q₁ is only an illustrative example and does not limit the vertices included in Q₁.

(2) The first host acquires the current path of each of the first R₁/2 random walk instances starting from the vertex v as the source point after the Xth iteration calculation, and acquires the current path of each of the last R₁/2 random walk instances starting from another source point w after the Xth iteration calculation, where the other source point w is the vertex at which one of the first R₁/2 random walk instances starting from the vertex v is located after the Xth iteration calculation;

The first host splices the current path of each of the first R₁/2 random walk instances starting from the vertex v as the source point with the current path of one of the last R₁/2 random walk instances starting from the other source point w, and obtains that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁.
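The splicing described in (2) can be sketched as below. This is a simplified single-process illustration: the dictionary `walks` stands in for the calculation result sets that would be fetched over the network, the function name `network_iteration` is introduced here, and the round-robin reuse of tail walks when the unused ones run out is an assumption of this sketch that the text does not specify.

```python
def network_iteration(walks):
    """Splice walks pairwise so that each surviving walk doubles in length.

    walks: dict mapping a source vertex to a list of R walks (lists of
           vertices, all with the same number of hops L); R must be even
           and every vertex reached by a walk must itself be a key.
    Returns a dict with R/2 walks of 2L hops per source vertex.
    """
    result = {}
    # Per-vertex cursor into its last R/2 walks, so unused ones go first.
    cursor = {v: 0 for v in walks}
    for source, paths in walks.items():
        half = len(paths) // 2
        spliced = []
        for head in paths[:half]:                     # first R/2 walks of the source
            end = head[-1]                            # the other source point w
            tails = walks[end][len(walks[end]) // 2:] # last R/2 walks starting at w
            # Assumption: reuse tails round-robin once all have been used.
            tail = tails[cursor[end] % len(tails)]
            cursor[end] += 1
            spliced.append(head + tail[1:])           # drop the duplicated vertex w
        result[source] = spliced
    return result

# Hypothetical state after one disk iteration with R = 2 and L = 1.
walks = {
    "v": [["v", "w"], ["v", "u"]],
    "w": [["w", "v"], ["w", "u"]],
    "u": [["u", "w"], ["u", "v"]],
}
spliced = network_iteration(walks)
```

For the source v, the walk {v, w} is joined with the walk {w, u} taken from w, giving {v, w, u}: one instance remains and its length has doubled from 1 to 2.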

Exemplarily, host 1 performs the second iteration calculation as a hybrid iteration: it acquires the graph data and the calculation result set T_X(u_i), and performs the second iteration calculation with the vertex v as the source point. The disk iteration and the network iteration are performed concurrently, but they are described separately here for clarity.

Host 1 performs the disk iteration according to the graph data. The number of random walk instances with the vertex v as the source point is unchanged at 40, but the current path length of each of the 40 random walk instances is increased by 1 path length, that is, the path length of each random walk instance is now 2 path lengths.

Host 1 performs the network iteration according to the calculation result set T_X(u_i). Iterative calculation according to the calculation result set here means iterative calculation according to the elements of the calculation result set. FIG. 4 is a schematic diagram of several different sources. If the calculation result set is T_X(u_i)={T_X(v), T_X(u₁), T_X(u₂), T_X(u₃), T_X(u₄)}, the host performs the iterative calculation according to the formula, in which the operator indicates that the calculation results at its left and right ends are combined by the specified operation. The number of random walk instances with v as the source point is halved to 20, and the path of each of the 20 random walk instances is spliced with the path of a random walk instance from another source point, so its length doubles from one path length at the end of the first iteration to two path lengths.

In addition, the disk iteration increases the path length of each random walk instance by 1 path length. The result is that the number of random walk instances starting from v as the source point becomes 20, and the path length of each of these 20 random walk instances becomes 3 path lengths.

The way paths grow in the network iteration of a hybrid iteration can be described as follows:

When host 1 scans the results of the first iteration calculation during the second iteration calculation, it selects the paths of the first 20 random walk instances starting from each vertex. Taking the vertex v as an example, for each such path, host 1 finds another path produced by the first iteration calculation, namely one of the last 20 random walk paths starting from another vertex, and merges the two paths into one path, so that the path length of each of the 20 random walk instances becomes 2 path lengths. For example, suppose there is a path {v, w} starting from v. At runtime, the machine where node v is located sends a request to the machine where node w is located, obtains a path {w, u} on w, and merges the two paths into {v, w, u}. After this network-based random walk calculation has run, 20 random walk paths of double length are obtained on each node.

Combining the disk iteration and the network iteration of the above hybrid iteration, the path of each of the 20 random walk instances now obtained can be expressed as {v, w, u, x}, where {u, x} is the path length added by the disk iteration.

304. The first host obtains, through the network, the calculation result set T_Y(h_i) of the Yth iteration calculation completed for the vertex set h_i, where the vertices included in the vertex set h_i are all the vertices on which the update function is to be executed when the first host performs the (Y+1)th iteration calculation with the vertex v as the source point, the vertex set h_i includes the vertex v, and Y is an integer greater than 0;

In the embodiment of the present invention, the first host obtains, through the network, the calculation result set T_Y(h_i) of the Yth iteration calculation completed for the vertex set h_i. The vertices included in the vertex set h_i are all the vertices on which the update function is to be executed when the first host performs the (Y+1)th iteration calculation with the vertex v as the source point, the vertex set h_i includes the vertex v, and Y is an integer greater than 0. There are currently R₃ random walk instances on each vertex, and the current path length of each random walk instance is L₃.

Exemplarily, following step 303, there are currently 20 random walk instances on each vertex, and the current path length of each random walk instance is 3 path lengths.

305. The first host performs the (Y+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_Y(h_i), and obtains that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃;

In the embodiment of the present invention, after the first host obtains the calculation result set T_Y(h_i) through the network, the first host performs the iterative calculation according to the calculation result set T_Y(h_i), and obtains that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃.

The step may specifically include:

(1) The vertex set h_i is {v}∪W₁, where W₁⊆W and the set W is {h₁, h₂, ..., h_n}; the calculation result set T_Y(h_i) is {T_Y(v)}∪T_Y(W₁), where T_Y(W₁)⊆T_Y(W) and the set T_Y(W) is {T_Y(h₁), T_Y(h₂), ..., T_Y(h_n)};

If W₁ is {h₁, h₂, h₃, h₄}, the calculation result set T_Y(h_i) is {T_Y(v), T_Y(h₁), T_Y(h₂), T_Y(h₃), T_Y(h₄)};

The first host performs the (Y+1)th iteration calculation with the vertex v as the source point according to the formula, and obtains that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the path length of each of the R₃/2 random walk instances becomes 2L₃;

In the formula, T_Y(v) represents the calculation result after the Yth iteration calculated with the vertex v as the source point, and the operator indicates that the calculation results at its left and right ends are combined by the specified operation. It should be understood that the above vertex set W₁ is merely an illustrative example and does not limit the vertices included in W₁.

(2) The first host acquires the current path of each of the first R₃/2 random walk instances starting from the vertex v as the source point after the Yth iteration calculation, and acquires the current path of each of the last R₃/2 random walk instances starting from another source point w after the Yth iteration calculation, where the other source point w is the vertex at which one of the first R₃/2 random walk instances starting from the vertex v is located after the Yth iteration calculation;

The first host splices the current path of each of the first R₃/2 random walk instances starting from the vertex v as the source point with the current path of one of the last R₃/2 random walk instances starting from the other source point w, and obtains that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃.

Exemplarily, host 1 performs the third iteration calculation as a network iteration alone. Continuing from the above steps, host 1 performs the network iteration according to the calculation result set T_Y(h_i). If the calculation result set T_Y(h_i) is {T_Y(v), T_Y(h₁), T_Y(h₂), ..., T_Y(h_n)}, host 1 performs the iterative calculation according to the formula and obtains that the number of random walk instances with v as the source point is halved to 10 and the length of each random walk instance is increased from the previous three path lengths to six path lengths.

Because the path of each random walk instance was previously expressed as {v, w, u, x}, after the network iteration the path of each of the 10 random walk instances can be expressed as {v, w, u, x, y, z, a}.

It should be noted that steps 304 and 305 are optional steps. In actual applications, whether to execute them may be determined according to actual needs. Moreover, although hybrid iteration is relatively fast, in the process of graph calculation, apart from the first iteration calculation, which uses disk iteration, the other iteration calculations may use hybrid iteration, disk iteration, or network iteration according to actual needs, without specific limitation.

306. If R₁/2 and 2L₁+1 satisfy the iterative completion condition, the first host completes the calculation of the graph data.

In the embodiment of the present invention, if R₁/2 and 2L₁+1 satisfy the iterative completion condition, the first host completes the calculation of the graph data. That is, when the number of random walk instances on each vertex and the path length of the random walk instances after an iteration satisfy the iterative completion condition, the calculation of the graph data is completed.

Exemplarily, following step 305, R₃/2 corresponds to 10 and 2L₃ corresponds to {v, w, u, x, y, z, a}, that is, 6 path lengths. This matches the requirement stated in step 301: the number of random walk instances on each node reaches the first preset threshold R=10, and the path length of each random walk instance reaches the second preset threshold L=6. Host 1 can therefore be considered to have completed the calculation of the graph data.
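The running example of steps 301 to 306 can be replayed at the level of counts. The sketch below tracks only the pair (number of instances, path length) per vertex; the function names are introduced here for illustration, and the schedule of one disk, one hybrid and one network iteration is the one used in the example.

```python
def disk(r, l):
    """Disk iteration: instance count unchanged, one more path length."""
    return r, l + 1

def network(r, l):
    """Network iteration: half the instances, twice the path length."""
    return r // 2, 2 * l

def hybrid(r, l):
    """Hybrid iteration: disk and network iteration performed concurrently."""
    return r // 2, 2 * l + 1

R_TARGET, L_TARGET = 10, 6        # iterative completion condition of the example

state = (40, 0)                   # R2 = 40 instances, path length L2 = 0
trace = [state]
for step in (disk, hybrid, network):   # steps 301, 303 and 305
    state = step(*state)
    trace.append(state)

done = state == (R_TARGET, L_TARGET)   # step 306
print(trace, done)
```

The trace is (40, 0), (40, 1), (20, 3), (10, 6), so the completion condition R=10, L=6 holds after the third iteration.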

In the embodiment of the present invention, when the first host uses disk iteration, each random walk instance gains one path length; when it uses hybrid iteration, the current path length of each of the R₁/2 random walk instances changes from L₁ to 2L₁+1; and when it uses network iteration, the current path length of each of the R₃/2 random walk instances changes from L₃ to 2L₃. Using only disk iterations requires accessing the disk many times, with limited efficiency. Using network iterations greatly reduces the number of disk scans, but network iterations are limited by network resources and do not make full use of disk resources. Hybrid iterations use both the disk and the network to grow the random walk paths and therefore give the fastest growth rate. The graph computing system of the embodiment of the present invention can perform disk iteration and network iteration concurrently, which is much faster than existing distributed graph computing systems on the related algorithms.
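To make the comparison concrete, the following sketch counts how many iterations each mode needs before the path length reaches at least a target value, and how many times the instance count is halved along the way. It assumes, as stated earlier, that the first iteration is always a disk iteration; the function name `iterations_needed` and the target of 1000 are illustrative and not from the original.

```python
def iterations_needed(target_len, mode):
    """Return (iterations, halvings) until the path length is >= target_len.

    mode is 'disk', 'network' or 'hybrid' and selects the kind of every
    iteration after the first; the first iteration is always a disk iteration.
    """
    length, iterations, halvings = 0, 0, 0
    while length < target_len:
        if iterations == 0 or mode == "disk":
            length += 1                 # disk: one more path length
        elif mode == "network":
            length *= 2                 # network: length doubles
            halvings += 1
        elif mode == "hybrid":
            length = 2 * length + 1     # hybrid: doubles and adds one
            halvings += 1
        else:
            raise ValueError(f"unknown mode: {mode}")
        iterations += 1
    return iterations, halvings

for mode in ("disk", "network", "hybrid"):
    print(mode, iterations_needed(1000, mode))
```

Each halving means the starting number of instances must be twice as large to end with the same final count, which is why the running example starts from 40 instances to finish with 10 after two halvings.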

In the above embodiment, the description is mainly made by using the vertex v as a source point. The overall calculation in the graph calculation can be briefly described by way of example, and the process of graph calculation is further visualized.

Assume that there are two hosts, each of which stores the "graph data" on its local disk. The graph data here is an abstract representation, as described above, and is not described again here. The graph data includes 4 vertices/nodes, so each host concurrently runs 4/2=2 path calculations with different sources. There are 2 random walk instances on each vertex, and the initial path length of each random walk instance is 0. The four vertices are 1, 2, 3, and 4. The path calculations of vertices 1 and 2 are executed concurrently on host 1; the path calculations of vertices 3 and 4 are executed concurrently on host 2.

The two random walk instances of vertex 1 are each denoted 1, the two random walk instances of vertex 2 are each denoted 2, the two random walk instances of vertex 3 are each denoted 3, and the two random walk instances of vertex 4 are each denoted 4.

(1) In the first iteration calculation, only disk iteration can be performed. Then, both host 1 and host 2 respectively obtain graph data from the local disk, and perform iterative calculation of different source points. The random walk instance is represented by the corresponding vertex. After the first iteration is completed, the number of random walk instances and the path length on the two hosts can be expressed as follows:

Host 1:

1:1→2,1→4

2:2→4,2→3

Host 2:

3:3→2,3→1

4:4→1,4→2

It should be understood that 1→2 here means that a random walk instance on vertex 1 jumps to vertex 2 through one iterative calculation, and similarly for the others. The number of random walk instances on each node is therefore unchanged, and each random walk instance gains 1 path length.

(2) In the second iteration calculation, a hybrid iteration, a network iteration or a disk iteration can be performed. The technical solution of the present invention mainly performs hybrid iteration, so hybrid iteration is used for the explanation here. Although the result of a network iteration is only one path length shorter than that of a hybrid iteration, in the context of big data one additional path length corresponds to a large amount of processed data.

Because disk iteration and network iteration are performed concurrently in a hybrid iteration, assume the following for the disk iteration:

Host 1:

1:1→2→4,1→4

2:2→4→2,2→3

Host 2:

3:3→2→1,3→1

4:4→1→3,4→2

The path length of each random walk instance here is increased by one path length.

When doing network iterations:

Host 1:

1:1→2→4→2,

2:2→4→2→3,

Host 2:

3:3→2→1→4,

4:4→1→3→1.

Here, when the network iteration is performed, the paths 4→2, 2→3, 1→4 and 3→1 are spliced onto the walks of the corresponding vertices. Network iteration alone would double the path length of the one remaining random walk instance on each vertex from one path length after the first iteration to two path lengths; together with the disk iteration of the hybrid iteration, the path length of that random walk instance increases to 3 path lengths. That is, after the second iteration calculation is completed, the number of random walk instances on each vertex becomes one, and the path length of each random walk instance is increased to three path lengths.
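The second iteration of this four-vertex example can be replayed directly from the listings above. In this sketch both hosts are simulated in one process, the dictionary `disk_hop` hard-codes the next hops shown in the disk iteration listing (2→4, 4→2, 2→1, 1→3) instead of drawing them at random, and the function name `hybrid_iteration` is introduced here.

```python
# State after the first (disk) iteration, copied from the listing above.
walks = {
    1: [[1, 2], [1, 4]],
    2: [[2, 4], [2, 3]],
    3: [[3, 2], [3, 1]],
    4: [[4, 1], [4, 2]],
}
# Next hop taken by the first walk of each vertex in the disk iteration.
disk_hop = {1: 4, 2: 2, 3: 1, 4: 3}

def hybrid_iteration(walks, disk_hop):
    """Extend the first R/2 walks by one hop, then splice on a walk from the
    last R/2 walks of the vertex where each extended walk ends."""
    result = {}
    for source, paths in walks.items():
        half = len(paths) // 2
        spliced = []
        for head in paths[:half]:
            head = head + [disk_hop[source]]   # disk part: one more path length
            # Network part: with R = 2 the last R/2 walks are just the last walk.
            tail = walks[head[-1]][-1]
            spliced.append(head + tail[1:])
        result[source] = spliced
    return result

result = hybrid_iteration(walks, disk_hop)
for source, paths in result.items():
    print(source, ["→".join(map(str, p)) for p in paths])
```

The printed walks are 1→2→4→2, 2→4→2→3, 3→2→1→4 and 4→1→3→1, the same as in the network iteration listing: one instance per vertex, each with three path lengths.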

In general, on each host, when the (X+1)th iteration calculation is performed, the first R/2 paths starting from each node after the Xth iteration calculation are scanned sequentially. For each such walk path, another existing walk path is found for it, namely an unused path among the last R/2 random walk paths starting from another vertex, and the two paths of length L are then merged into one path of length 2L.

It should be noted that the hybrid iteration is written out step by step here mainly to make the process clearer. The duration of a hybrid iteration is usually the longer of the durations of its disk iteration and its network iteration. For the process of a standalone network iteration, refer to the process of network iteration in the hybrid iteration described above; details are not repeated here.

Because the number of iterations in the above hybrid iteration example is small, it may not clearly show that the path length of a random walk instance is doubled by network iteration. The following simple example illustrates network iteration.

Suppose that after the second iteration calculation with the vertex a as the source point, the current path has 3 path lengths and is {a, b, c, d}, and that after the second iteration calculation with d as the source point, the current path also has 3 path lengths and is {d, e, f, g}.

Then, when the third iteration calculation, a network iteration, is performed with the vertex a as the source point, {a, b, c, d} and {d, e, f, g} can be merged, so the current path with the vertex a as the source point becomes {a, b, c, d, e, f, g}, growing from the previous three path lengths to six path lengths. One network iteration therefore doubles the path length.
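The merge in this example is small enough to check by hand. The sketch below counts path length in hops (one less than the number of vertices listed), which is how the text arrives at three and six path lengths; the helper names `splice` and `hops` are introduced here.

```python
def splice(head, tail):
    """Join two walks that meet at a shared vertex."""
    if head[-1] != tail[0]:
        raise ValueError("the second walk must start where the first one ends")
    return head + tail[1:]        # the shared vertex is kept only once

def hops(path):
    """Path length counted in hops between consecutive vertices."""
    return len(path) - 1

first = ["a", "b", "c", "d"]      # walk from source a after the second iteration
second = ["d", "e", "f", "g"]     # walk from source d after the second iteration
merged = splice(first, second)
print(merged, hops(first), hops(second), hops(merged))
```

The merged walk has seven vertices and six hops, twice the three hops of each input walk.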

The method for graph calculation in the embodiment of the present invention is described above; the host in the embodiment of the present invention is described below. FIG. 5 is a schematic diagram of an embodiment of a host in the embodiment of the present invention. The host is applied to a graph computing system that includes M hosts, each of which stores graph data on a local disk; the graph data includes N vertices, each host simultaneously runs the path calculations of N/M different sources, there are currently R₁ random walk instances on each vertex, and the current path length of each random walk instance is L₁. The host includes:

The obtaining unit 501 is configured to obtain the calculation result set obtained after the Xth iteration calculation is performed on a vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation;

The calculating unit 502 is configured to perform the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, and obtain that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1; and is further configured to complete the calculation of the graph data if R₁/2 and 2L₁+1 satisfy the iterative completion condition.

Optionally, in some embodiments of the invention,

The obtaining unit 501 is specifically configured to perform steps 302 and 304 shown in FIG. 3 above.

The calculating unit 502 is specifically configured to perform steps 301, 303, 305, and 306 shown in FIG. 3 above.

FIG. 6 is a schematic diagram of another embodiment of a host in an embodiment of the present invention, including:

The host may vary considerably depending on configuration or performance, and may include a transceiver 601, one or more central processing units (CPU) 602 (e.g., one or more processors), a memory 603, and one or more storage media 604 that store an application 6041 or data 6042 (e.g., one or more mass storage devices). The memory 603 and the storage medium 604 may be short-term storage or persistent storage. The program stored on the storage medium 604 may include one or more modules (not shown in FIG. 6), each of which may include a series of instruction operations in a wireless network controller. Still further, the central processor 602 can be configured to communicate with the storage medium 604 and to perform, on the wireless network controller, a series of instruction operations in the storage medium 604.

Correspondingly, in the embodiment of the present invention, the transceiver 601 is configured to obtain a calculation result set produced by performing the Xth iteration calculation on a vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation;

The processor 602 is configured to perform the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1; and, if R₁/2 and 2L₁+1 satisfy the iteration completion condition, to complete the calculation of the graph data.
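The doubling performed by the processor 602 can be pictured with the following minimal Python sketch, offered as an illustration under stated assumptions rather than as the implementation of the embodiment. It follows the splicing recited in the claims: the first half of the walks from a source v are joined with second-half walks from the vertex u at which they currently stand, giving R₁/2 walks of length 2L₁, and one further step gives 2L₁+1. The function names `step`, `splice`, and `iterate`, the in-memory dictionary layout, the index pairing (walk i joined with walk half + i), the order of splicing before stepping, and the toy graph are all hypothetical choices made for this sketch.

```python
import random


def step(graph, walks, rng):
    """Extend every walk by one randomly chosen out-edge (path length + 1)."""
    return {
        v: [path + [rng.choice(graph[path[-1]])] for path in paths]
        for v, paths in walks.items()
    }


def splice(walks):
    """Halve the number of walks per source and double their path length.

    Walk i in the first half of source v ends at some vertex u; it is joined
    with walk (half + i) in the second half of source u (assumed pairing).
    Every source is assumed to hold the same even number of walks.
    """
    spliced = {}
    for v, paths in walks.items():
        half = len(paths) // 2
        joined = []
        for i in range(half):
            head = paths[i]
            u = head[-1]                    # vertex where the head walk stands
            tail = walks[u][half + i]       # a second-half walk sourced at u
            joined.append(head + tail[1:])  # drop the duplicated vertex u
        spliced[v] = joined
    return spliced


def iterate(graph, walks, rng):
    """One doubling iteration: R walks of length L become R/2 of length 2L+1."""
    return step(graph, splice(walks), rng)


# Toy graph (assumed): every vertex has at least one out-edge.
graph = {0: [1, 2], 1: [2], 2: [0]}
rng = random.Random(0)
# R = 4 walks of path length L = 1 per source vertex.
walks = step(graph, {v: [[v] for _ in range(4)] for v in graph}, rng)
walks = iterate(graph, walks, rng)  # R = 2 walks of path length 3 per source
```

Pairing head walk i with tail walk half + i means that, for a given source, no tail walk is used twice even when several head walks stand on the same vertex u; this is one simple way to keep the spliced walks from sharing segments, and other pairings are possible.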

Optionally, in some embodiments of the invention,

The transceiver 601 is specifically configured to perform steps 302 and 304 shown in FIG. 3 above.

The processor 602 is specifically configured to perform steps 301, 303, 305, and 306 shown in FIG. 3 above.

The embodiment of the invention further provides a computer program product for data processing, comprising a computer readable storage medium storing program code, the program code comprising instructions for executing the method flow of any one of the foregoing method embodiments. A person skilled in the art can understand that the foregoing storage medium includes: a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a random access memory (RAM), a solid state disk (SSD), or another non-transitory machine readable medium that can store program code, such as a non-volatile memory.

A person skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the system, the device and the unit described above can refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function; in an actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

The above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to be limiting. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

A graph calculation method, characterized in that the method is applied to a disk-based graph computing system, the graph computing system comprising M hosts, each host storing graph data on a local disk, the graph data including N vertices, each host simultaneously running path calculations for N/M different source points, there currently being R₁ random walk instances on each vertex, and the current path length of each random walk instance being L₁, the method comprising:

acquiring, by a first host, a calculation result set produced by performing the Xth iteration calculation on a vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation;

performing, by the first host, the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1; and

if R₁/2 and 2L₁+1 satisfy the iteration completion condition, completing, by the first host, the calculation of the graph data.
The method according to claim 1, wherein the acquiring, by the first host, of the calculation result set produced by performing the Xth iteration calculation on the vertex set comprises:

acquiring, by the first host over a network, the calculation result set T_X(uᵢ) produced by completing the Xth iteration calculation on the vertex set uᵢ, where the vertices in the vertex set uᵢ are the vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation with a vertex v as the source point, the vertex v is one of the N/M vertices on the first host, the vertex set uᵢ includes the vertex v, and X is an integer greater than 0;

and the performing, by the first host, of the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1, comprises:

performing, by the first host, the (X+1)th concurrent iteration calculation with the vertex v as the source point according to the graph data and the calculation result set T_X(uᵢ), so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1.
The method according to claim 2, wherein the performing, by the first host, of the (X+1)th concurrent iteration calculation with the vertex v as the source point according to the graph data and the calculation result set T_X(uᵢ), so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1, comprises:

performing, by the first host, the (X+1)th iteration calculation with the vertex v as the source point according to the graph data, so that the number of random walk instances with the vertex v as the source point is R₁ and the path length of each of the R₁ random walk instances increases by 1;

performing, by the first host, the (X+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_X(uᵢ), so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁; and

determining, by the first host, from the increase of 1 in the path length of each of the R₁ random walk instances and from the current path length of each of the R₁/2 random walk instances becoming 2L₁, that the current path length of each of the R₁/2 random walk instances becomes 2L₁+1.
The method according to claim 3, wherein the vertex set uᵢ is {v} ∪ Q₁, Q₁ ∈ Q, Q is the set {u₁, u₂, ..., uₙ}, the calculation result set T_X(uᵢ) is {T_X(v)} ∪ T_X(Q₁), T_X(Q₁) ∈ T_X(Q), and the set T_X(Q) is {T_X(u₁), T_X(u₂), ..., T_X(uₙ)};

and the performing, by the first host, of the (X+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_X(uᵢ), so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁, comprises:

if Q₁ is {u₁, u₂, u₃, u₄}, the calculation result set T_X(uᵢ) is {T_X(v), T_X(u₁), T_X(u₂), T_X(u₃), T_X(u₄)};

performing, by the first host, the (X+1)th iteration calculation with the vertex v as the source point according to the following formula, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁;

where T_X(v) represents the calculation result after the Xth iteration calculation with the vertex v as the source point, and the operator in the formula indicates that the calculation results at its left and right ends are combined by the specified operation.
The method according to claim 3, wherein the performing, by the first host, of the (X+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_X(uᵢ), so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁, comprises:

acquiring, by the first host, the current path length of each of the first R₁/2 random walk instances that start from the vertex v as the source point after the Xth iteration calculation, and acquiring the current path length of each of the last R₁/2 random walk instances that start from another source point u after the Xth iteration calculation, where the other source point u is the vertex at which the first R₁/2 random walk instances starting from the vertex v as the source point are located after the Xth iteration calculation; and

splicing, by the first host, the current path length of each of the first R₁/2 random walk instances starting from the vertex v as the source point with the current path length of each of the last R₁/2 random walk instances starting from the other source point u, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁.
The method according to any one of claims 1-5, wherein there are currently R₂ random walk instances on each vertex and the path length of each random walk instance is L₂, the method further comprising:

performing, by the first host, the (Z+1)th iteration calculation with the vertex v as the source point according to the graph data, so that the number of random walk instances with the vertex v as the source point is R₂ and the path length of each of the R₂ random walk instances increases by 1, where Z is an integer greater than or equal to 0.
The method according to any one of claims 1-6, wherein there are currently R₃ random walk instances on each vertex and the current path length of each random walk instance is L₃, the method further comprising:

acquiring, by the first host over a network, the calculation result set T_Y(hᵢ) produced by completing the Yth iteration calculation on the vertex set hᵢ, where the vertices in the vertex set hᵢ are the vertices on which the update function is to be executed when the first host performs the (Y+1)th iteration calculation with the vertex v as the source point, the vertex set hᵢ includes the vertex v, and Y is an integer greater than 0; and

performing, by the first host, the (Y+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_Y(hᵢ), so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃.
The method according to claim 7, wherein the vertex set hᵢ is {v} ∪ W₁, W₁ ∈ W, W is the set {h₁, h₂, ..., hₙ}, the calculation result set T_Y(hᵢ) is {T_Y(v)} ∪ T_Y(W₁), T_Y(W₁) ∈ T_Y(W), and the set T_Y(W) is {T_Y(h₁), T_Y(h₂), ..., T_Y(hₙ)};

and the performing, by the first host, of the (Y+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_Y(hᵢ), so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃, comprises:

if W₁ is {h₁, h₂, h₃, h₄}, the calculation result set T_Y(hᵢ) is {T_Y(v), T_Y(h₁), T_Y(h₂), T_Y(h₃), T_Y(h₄)};

performing, by the first host, the (Y+1)th iteration calculation with the vertex v as the source point according to the following formula, so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the path length of each of the R₃/2 random walk instances becomes 2L₃;

where T_Y(v) represents the calculation result after the Yth iteration calculation with the vertex v as the source point, and the operator in the formula indicates that the calculation results at its left and right ends are combined by the specified operation.
The method according to claim 7, wherein the performing, by the first host, of the (Y+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_Y(hᵢ), so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃, comprises:

acquiring, by the first host, the current path length of each of the first R₃/2 random walk instances that start from the vertex v as the source point after the Yth iteration calculation, and acquiring the current path length of each of the last R₃/2 random walk instances that start from another source point w after the Yth iteration calculation, where the other source point w is the vertex at which the first R₃/2 random walk instances starting from the vertex v as the source point are located after the Yth iteration calculation; and

splicing, by the first host, the current path length of each of the first R₃/2 random walk instances starting from the vertex v as the source point with the current path length of each of the last R₃/2 random walk instances starting from the other source point w, so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃.
A host, wherein the host is applied to a disk-based graph computing system, the graph computing system includes M hosts, each host stores graph data on a local disk, the graph data includes N vertices, each host simultaneously runs path calculations for N/M different source points, there are currently R₁ random walk instances on each vertex, and the current path length of each random walk instance is L₁, the host comprising:

an obtaining unit, configured to obtain a calculation result set produced by performing the Xth iteration calculation on a vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation; and

a calculating unit, configured to perform the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1; and further configured to complete the calculation of the graph data if R₁/2 and 2L₁+1 satisfy the iteration completion condition.
The host according to claim 10, characterized in that:

the obtaining unit is specifically configured to obtain, over a network, the calculation result set T_X(uᵢ) produced by completing the Xth iteration calculation on the vertex set uᵢ, where the vertices in the vertex set uᵢ are the vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation with a vertex v as the source point, the vertex v is one of the N/M vertices on the first host, the vertex set uᵢ includes the vertex v, and X is an integer greater than 0; and

the calculating unit is specifically configured to perform, according to the graph data and the calculation result set T_X(uᵢ), the (X+1)th concurrent iteration calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1.
The host according to claim 11, wherein:

the calculating unit is specifically configured to perform the (X+1)th iteration calculation with the vertex v as the source point according to the graph data, so that the number of random walk instances with the vertex v as the source point is R₁ and the path length of each of the R₁ random walk instances increases by 1;

to perform the (X+1)th iteration calculation with the vertex v as the source point according to the calculation result set T_X(uᵢ), so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁; and

to determine, from the increase of 1 in the path length of each of the R₁ random walk instances and from the current path length of each of the R₁/2 random walk instances becoming 2L₁, that the current path length of each of the R₁/2 random walk instances becomes 2L₁+1.
The host according to claim 12, wherein the vertex set uᵢ is {v} ∪ Q₁, Q₁ ∈ Q, Q is the set {u₁, u₂, ..., uₙ}, the calculation result set T_X(uᵢ) is {T_X(v)} ∪ T_X(Q₁), T_X(Q₁) ∈ T_X(Q), and the set T_X(Q) is {T_X(u₁), T_X(u₂), ..., T_X(uₙ)};

the calculating unit is further specifically configured to: if Q₁ is {u₁, u₂, u₃, u₄}, so that the calculation result set T_X(uᵢ) is {T_X(v), T_X(u₁), T_X(u₂), T_X(u₃), T_X(u₄)}, perform the (X+1)th iteration calculation with the vertex v as the source point according to the following formula, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁;

where T_X(v) represents the calculation result after the Xth iteration calculation with the vertex v as the source point, and the operator in the formula indicates that the calculation results at its left and right ends are combined by the specified operation.
The host according to claim 12, wherein:

the obtaining unit is further specifically configured to obtain the current path length of each of the first R₁/2 random walk instances that start from the vertex v as the source point after the Xth iteration calculation, and to obtain the current path length of each of the last R₁/2 random walk instances that start from another source point u after the Xth iteration calculation, where the other source point u is the vertex at which the first R₁/2 random walk instances starting from the vertex v as the source point are located after the Xth iteration calculation; and

the calculating unit is further specifically configured to splice the current path length of each of the first R₁/2 random walk instances starting from the vertex v as the source point with the current path length of each of the last R₁/2 random walk instances starting from the other source point u, so that the number of random walk instances with the vertex v as the source point becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁.
The host according to any one of claims 9-14, characterized in that there are currently R₂ random walk instances on each vertex, and the path length of each random walk instance is L₂;

the calculating unit is further configured to perform, according to the graph data, the (Z+1)th iteration calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point is R₂ and the path length of each of the R₂ random walk instances increases by 1, where Z is an integer greater than or equal to 0.
The host according to any one of claims 9-15, characterized in that there are currently R₃ random walk instances on each vertex, and the current path length of each random walk instance is L₃;

the obtaining unit is further specifically configured to obtain, over a network, the calculation result set T_Y(hᵢ) produced by completing the Yth iteration calculation on the vertex set hᵢ, where the vertices in the vertex set hᵢ are the vertices on which the update function is to be executed when the first host performs the (Y+1)th iteration calculation with the vertex v as the source point, the vertex set hᵢ includes the vertex v, and Y is an integer greater than 0; and

the calculating unit is further configured to perform, according to the calculation result set T_Y(hᵢ), the (Y+1)th iteration calculation with the vertex v as the source point, so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃.
The host according to claim 16, wherein the vertex set hᵢ is {v} ∪ W₁, W₁ ∈ W, W is the set {h₁, h₂, ..., hₙ}, the calculation result set T_Y(hᵢ) is {T_Y(v)} ∪ T_Y(W₁), T_Y(W₁) ∈ T_Y(W), and the set T_Y(W) is {T_Y(h₁), T_Y(h₂), ..., T_Y(hₙ)};

the calculating unit is further specifically configured to: if W₁ is {h₁, h₂, h₃, h₄}, so that the calculation result set T_Y(hᵢ) is {T_Y(v), T_Y(h₁), T_Y(h₂), T_Y(h₃), T_Y(h₄)}, perform the (Y+1)th iteration calculation with the vertex v as the source point according to the following formula, so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the path length of each of the R₃/2 random walk instances becomes 2L₃;

where T_Y(v) represents the calculation result after the Yth iteration calculation with the vertex v as the source point, and the operator in the formula indicates that the calculation results at its left and right ends are combined by the specified operation.
The host according to claim 16, wherein:

the obtaining unit is further specifically configured to obtain the current path length of each of the first R₃/2 random walk instances that start from the vertex v as the source point after the Yth iteration calculation, and to obtain the current path length of each of the last R₃/2 random walk instances that start from another source point w after the Yth iteration calculation, where the other source point w is the vertex at which the first R₃/2 random walk instances starting from the vertex v as the source point are located after the Yth iteration calculation; and

the calculating unit is further specifically configured to splice the current path length of each of the first R₃/2 random walk instances starting from the vertex v as the source point with the current path length of each of the last R₃/2 random walk instances starting from the other source point w, so that the number of random walk instances with the vertex v as the source point becomes R₃/2 and the current path length of each of the R₃/2 random walk instances becomes 2L₃.
A host, wherein the host is applied to a disk-based graph computing system, the graph computing system includes M hosts, each host stores graph data on a local disk, the graph data includes N vertices, each host simultaneously runs path calculations for N/M different source points, there are currently R₁ random walk instances on each vertex, and the current path length of each random walk instance is L₁, the host comprising:

a transceiver and a processor;

the transceiver and the processor are connected by a bus;

the transceiver is configured to obtain a calculation result set produced by performing the Xth iteration calculation on a vertex set, where the vertex set is the set of vertices on which the update function is to be executed when the first host performs the (X+1)th iteration calculation; and

the processor is configured to perform the (X+1)th concurrent iteration calculation according to the graph data and the calculation result set, so that the number of random walk instances on each vertex becomes R₁/2 and the current path length of each of the R₁/2 random walk instances becomes 2L₁+1; and is further configured to complete the calculation of the graph data if R₁/2 and 2L₁+1 satisfy the iteration completion condition.