CN110688055B - Data access method and system in large graph calculation - Google Patents


Info

Publication number
CN110688055B
Authority
CN
China
Prior art keywords
vertex
edge data
bitmap
useful
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810725214.3A
Other languages
Chinese (zh)
Other versions
CN110688055A (en)
Inventor
张广艳
郑纬民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201810725214.3A priority Critical patent/CN110688055B/en
Publication of CN110688055A publication Critical patent/CN110688055A/en
Application granted granted Critical
Publication of CN110688055B publication Critical patent/CN110688055B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601: Interfaces specially adapted for storage systems
    • G06F 3/0602: Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
        • G06F 3/061: Improving I/O performance
        • G06F 3/0626: Reducing size or complexity of storage systems
    • G06F 3/0628: Interfaces specially adapted for storage systems making use of a particular technique
        • G06F 3/0655: Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
            • G06F 3/0659: Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G06F 3/0668: Interfaces specially adapted for storage systems adopting a particular infrastructure
        • G06F 3/067: Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Generation (AREA)

Abstract

The invention provides a data access method and system for large graph computation. A target graph data file is preprocessed into a corresponding compact graph data file; the active vertices of the target graph data file in each iteration step are recorded in an index bitmap; the useful and useless edge data of each iteration step are determined from those active vertices; all useful edge data blocks corresponding to the iteration step are determined from the useful and useless edge data; and an I/O request is generated from the starting position and size of each useful edge data block, so that when the request is processed, every piece of edge data in the corresponding block can be accessed directly from the compact graph data file by its starting position and size. By jointly weighing the addressing overhead and the I/O overhead of the external storage device, the method and system reduce the external storage I/O spent on useless edge data, preserve the sequentiality of edge data access, and effectively improve the overall performance of large graph computation.

Description

Data access method and system in large graph calculation
Technical Field
The invention relates to the technical field of graph computation, and in particular to a data access method and system in large graph computation.
Background
With the rapid development of social networks, biological information networks, and information technology in general, the volume of graph data describing such information keeps growing. Because a computer's memory capacity is limited, large graph computing systems tend to expand the problem scale they can handle by using relatively inexpensive external storage devices; for such external-storage-based systems, external storage I/O then tends to become the performance bottleneck. Moreover, graph algorithms are iterative by nature, and not all edge data has to be used in every iteration, so effectively reducing the external storage I/O spent on useless edge data, and thereby the extra overhead, is a technical challenge.
Currently, two types of approaches address this problem. The first is a selective scheduling strategy over static partitions: in the graph data preprocessing stage, all data is statically partitioned by some partitioning method, and during computation any partition that contains no useful edge data is skipped. The second is graph data repartitioning: after each iteration finishes, the original graph data is repartitioned, and data that cannot be used in subsequent computation is removed and does not appear in the new partitions. This greatly reduces the external storage I/O and related computation for useless data, but each repartitioning pass introduces extra external storage I/O of its own.
Disclosure of Invention
The invention provides a data access method and system in large graph computation, aiming to overcome the problem in the prior art that excessive external storage I/O overhead during graph computation lowers the overall performance of the computation.
In one aspect, the present invention provides a data access method in large graph computing, including:
calculating the out-degree information of each vertex in the target graph data file, dividing all the vertices in order into a plurality of vertex sets according to the out-degree information of all the vertices, and, for any vertex set, writing the edge data corresponding to all the vertices in the vertex set into a corresponding partition file, sorting the edge data corresponding to different vertices in the partition file, and writing the sorted partition file into a compact graph data file;
calling an index bitmap corresponding to the iterative algorithm of the target graph data file in the current iteration step, and sequentially acquiring all useful edge data blocks corresponding to the current iteration step according to the index bitmap, wherein each useful edge data block comprises a plurality of pieces of edge data;
for any useful edge data block, taking the position of the first edge data in the useful edge data block in the compact graph data file as the starting position of the useful edge data block, determining the target size of the useful edge data block according to the quantity of all edge data in the useful edge data block, generating an I/O request according to the starting position and the target size, and adding the I/O request into an I/O request queue;
and sequentially taking out the I/O requests from the I/O request queue, and accessing the edge data in the compact graph data file according to the starting position and the target size in each I/O request.
Preferably, the writing of the edge data corresponding to all the vertices in the vertex set into the corresponding partition file specifically includes:
for any vertex set, acquiring all the vertices in the vertex set;
and for any vertex, taking the vertex as a source vertex, acquiring a target vertex corresponding to the source vertex, and taking the combination of the vertex and all the target vertices as edge data corresponding to the vertex.
Preferably, the sorting the edge data corresponding to different vertices in the partition file specifically includes:
initializing an offset value corresponding to each vertex according to the ID information and the out-degree information of all the vertices in the partition file;
and for any vertex, determining a target position corresponding to the vertex according to the offset value corresponding to the vertex, and storing the edge data corresponding to the vertex in the target position.
Preferably, before the calling of the index bitmap corresponding to the iterative algorithm of the target graph data file in the current iteration step, the method further includes:
constructing the index bitmap corresponding to the current iteration step according to the iterative operation of the iterative algorithm of the target graph data file in the previous iteration step.
Preferably, the constructing of the index bitmap corresponding to the current iteration step according to the iterative operation in the previous iteration step specifically includes:
for any vertex in the target graph data file, judging, according to the iterative operation on the vertex in the previous iteration step, whether the vertex is an active vertex in the current iteration step; if the vertex is an active vertex in the current iteration step, setting the bitmap bit corresponding to the vertex to a first numerical value, and if the vertex is an inactive vertex in the current iteration step, setting the bitmap bit corresponding to the vertex to a second numerical value;
and arranging the bitmap bits of all the vertexes in sequence according to the ID information of all the vertexes, setting a corresponding index bit for a preset number of bitmap bits, and obtaining an index bitmap corresponding to the current iteration step.
Preferably, the setting of a corresponding index bit for a preset number of bitmap bits specifically includes:
if at least one bitmap bit in the preset number of bitmap bits is a first numerical value, setting the corresponding index bit as the first numerical value; and if all the bitmap bits in the preset number of bitmap bits are the second numerical value, setting the corresponding index bits as the second numerical value.
Preferably, the sequentially obtaining all the useful edge data blocks corresponding to the current iteration step according to the index bitmap specifically includes:
scanning all index bits in the index bitmap in sequence, and if any index bit is a second numerical value, ignoring the index bit; if the index bit is the first numerical value, all bitmap bits corresponding to the index bit are scanned in sequence;
for any bitmap bit, if the bitmap bit is a second value, determining that a vertex corresponding to the bitmap bit is an inactive vertex, obtaining useless edge data corresponding to the bitmap bit according to the out-degree information of the inactive vertex, if the bitmap bit is a first value, determining that the vertex corresponding to the bitmap bit is an active vertex, and obtaining useful edge data corresponding to the bitmap bit according to the out-degree information of the active vertex;
judging whether the size of continuous useless edge data between useful edge data corresponding to any two bitmap bits exceeds a preset threshold value, if not, combining the useful edge data corresponding to the two bitmap bits and the continuous useless edge data into a useful edge data block, and if so, respectively taking the useful edge data corresponding to the two bitmap bits as independent useful edge data blocks.
In one aspect, the present invention provides a data access system in large graph computing, including:
the preprocessing module is used for calculating the out-degree information of each vertex in the target graph data file, dividing all the vertices in order into a plurality of vertex sets according to the out-degree information of all the vertices, and, for any vertex set, writing the edge data corresponding to all the vertices in the vertex set into a corresponding partition file, sorting the edge data corresponding to different vertices in the partition file, and writing the sorted partition file into the compact graph data file;
the edge data acquisition module is used for calling an index bitmap corresponding to the iterative algorithm of the target graph data file in the current iteration step, and sequentially acquiring all useful edge data blocks corresponding to the current iteration step according to the index bitmap, wherein each useful edge data block comprises a plurality of pieces of edge data;
a request generating module, configured to, for any useful edge data block, use a position of a first edge data in the useful edge data block in the compact graph data file as a starting position of the useful edge data block, determine a target size of the useful edge data block according to the number of all edge data in the useful edge data block, generate an I/O request according to the starting position and the target size, and add the I/O request to an I/O request queue;
and the data access module is used for sequentially taking out the I/O requests from the I/O request queue and accessing the edge data in the compact graph data file according to the initial position and the target size in the I/O requests.
In one aspect, the present invention provides an electronic device comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor is capable of calling the program instructions to perform any of the methods described above.
In one aspect, the invention provides a non-transitory computer readable storage medium storing computer instructions that cause a computer to perform any of the methods described above.
The invention provides a data access method and system in large graph computation. The target graph data file is preprocessed: the storage space occupied by each piece of edge data in the target graph data file is compressed to a certain degree, and the edge data corresponding to each vertex is stored sequentially, yielding a compact graph data file corresponding to the target graph data file. Meanwhile, the active vertices of the iterative algorithm of the target graph data file in each iteration step are recorded in an index bitmap, from which the useful and useless edge data of each iteration step are determined. On the basis of jointly weighing the addressing overhead and the I/O overhead of the external storage device, all useful edge data blocks corresponding to each iteration step are determined from that step's useful and useless edge data, and an I/O request is generated from the starting position and size of each useful edge data block, so that when the I/O request is processed, every piece of edge data in the corresponding useful edge data block can be accessed directly from the compact graph data file by its starting position and size. When the method and system are used for large graph computation, only the useful edge data blocks corresponding to each iteration step need to be accessed, which effectively reduces the external storage I/O of useless edge data and thus the external storage I/O overhead of the target graph data file over the whole computation, improving the overall performance of large graph computation to a certain extent.
Drawings
FIG. 1 is a schematic overall flowchart of a data access method in large graph computing according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a data format of edge data corresponding to each vertex according to an embodiment of the invention;
FIG. 3 is a diagram illustrating a merged simulation of a block of useful edge data according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the overall structure of a data access system in large graph computing according to an embodiment of the present invention;
fig. 5 is a schematic structural framework diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
It should be noted that graph computation here means running a graph algorithm, i.e., an algorithm that obtains an answer by operating on a graph of vertices and edges. Many commonly used graph algorithms apply to undirected graphs, directed graphs, and networks, including various traversal algorithms, shortest-path algorithms, algorithms for finding lowest-cost paths in a network, and so on. Graph algorithms apply to a variety of scenarios, for example optimizing pipeline networks, routing tables, express delivery services, and communication networks.
Generally, a graph algorithm involves a huge amount of graph data; in particular, for large graphs with hundreds of millions of edges, current computer memory cannot meet the capacity requirement of large graph computation, so the problem scale often has to be expanded by using relatively cheap external storage devices. However, for such external-storage-based large graph computation, external storage I/O tends to be the performance bottleneck. In view of this, the present invention provides a data access method and system that mainly target large graph computation, so as to reduce its external storage I/O and thereby improve its overall performance. For the specific implementation process, see the following method embodiments.
Fig. 1 is a schematic overall flow chart of a data access method in large graph computing according to an embodiment of the present invention, and as shown in fig. 1, the present invention provides a data access method in large graph computing, including:
s1, calculating the out-degree information of each vertex in the target graph data file, orderly dividing all the vertices into a plurality of vertex sets according to the out-degree information of all the vertices, writing the edge data corresponding to all the vertices in the vertex set into corresponding partition files for any vertex set, sequencing the edge data corresponding to different vertices in the partition files, and writing the sequenced partition files into the compact graph data file;
Specifically, when large graph computation needs to be performed on the target graph data file, in order to effectively reduce the external storage I/O involved, this embodiment first preprocesses the target graph data file to obtain a corresponding compact graph data file, whose data volume is compressed to a certain extent relative to the target graph data file. The specific implementation process is as follows:
First, the target graph data file is scanned and the out-degree information of each vertex is acquired. The out-degree information of a vertex is the total number of outgoing edges that take the vertex as their source; for example, if there are 10 outgoing edges with vertex A as the source vertex, the out-degree information of vertex A is 10. It should be noted that, because edges in an undirected graph have no direction, for vertices in an undirected graph the direction of each edge may first be fixed according to the specific graph algorithm, after which the out-degree information of each vertex is determined. In addition, to speed up the computation of out-degree information, in this embodiment different portions of the target graph data file are scanned concurrently by multiple threads, so that the out-degree information of the vertices is computed by several threads at once. For example, if the target graph data file is 60 GB and there are three concurrent threads, the first thread can scan the first 20 GB of the file, the second thread the middle 20 GB, and the third thread the last 20 GB, so that the out-degree information of the vertices in each portion is computed concurrently by the three threads.
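As an illustrative sketch only (the function names and in-memory edge-list layout are hypothetical, not the patent's implementation), the concurrent out-degree counting described above can be modeled by splitting an edge list into chunks, counting each chunk in its own thread, and merging the partial counts:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def out_degrees(edges, num_threads=3):
    """Count, per source vertex, the number of outgoing edges.

    `edges` is a list of (source, target) pairs; each thread scans one
    contiguous chunk, and the partial counts are merged at the end.
    """
    chunk = (len(edges) + num_threads - 1) // num_threads
    parts = [edges[i:i + chunk] for i in range(0, len(edges), chunk)]

    def count(part):
        return Counter(src for src, _dst in part)

    total = Counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for partial in pool.map(count, parts):
            total += partial
    return total

# vertex 'A' is the source of two edges below, 'B' of one
degs = out_degrees([('A', 'B'), ('A', 'C'), ('B', 'C')])
```

In the patent's setting each "chunk" would be a byte range of the on-disk file rather than a slice of an in-memory list, but the merge step is the same.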
After the out-degree information of each vertex in the target graph data file is obtained, all the vertices are divided in order into a plurality of vertex sets according to that information, such that the total out-degree of the vertices in each set is approximately equal, while also ensuring that the computer's memory can simultaneously hold the edges corresponding to all the vertices in one set. Specifically, a bound on the total out-degree of one vertex set may be preset; the vertices are then scanned in ID order and assigned to vertex sets in turn. For example, suppose the target graph data file has vertices A, B, C, D, E, and F, whose out-degrees are 2, 4, 3, 3, 5, and 1 respectively. If the total out-degree of one vertex set is bounded by 6, then A and B may be placed in one vertex set, C and D in another, and E and F in a third.
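To make the grouping rule concrete, here is a minimal greedy sketch (hypothetical code; it assumes the per-set out-degree total is capped at a preset bound and that vertices arrive in ID order, and the degree values are chosen so each set totals 6):

```python
def partition_vertices(degree_by_vertex, budget):
    """Greedily split vertices (already in ID order) into sets whose
    total out-degree stays within `budget`; a new set is opened when
    adding the next vertex would exceed the budget."""
    sets, current, total = [], [], 0
    for v, d in degree_by_vertex:
        if current and total + d > budget:
            sets.append(current)
            current, total = [], 0
        current.append(v)
        total += d
    if current:
        sets.append(current)
    return sets

parts = partition_vertices(
    [('A', 2), ('B', 4), ('C', 3), ('D', 3), ('E', 5), ('F', 1)],
    budget=6)
# → [['A', 'B'], ['C', 'D'], ['E', 'F']]
```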
After the plurality of vertex sets is obtained, for any vertex set, all the vertices in the set are fetched along with their corresponding edge data. Taking vertex A as an example, the edge data corresponding to vertex A is the information of all outgoing edges that have A as their source vertex. The edge data corresponding to all the vertices is then written into the partition file corresponding to the vertex set. In this way, the edge data of every vertex set is written into that set's partition file.
Through the above steps, the target graph data file is divided into a plurality of partition files, each small enough that the computer's memory can hold all the edge data of one partition file at once. On this basis, for any partition file, all its edge data is loaded into memory and sorted there. Specifically, the edge data may be sorted by the ID order of the vertices in the partition file, which orders the edge data of the different vertices. Once sorted, the partition file is written into a preset compact graph data file. Each partition file is sorted and written out in this way in turn, so that eventually all the edge data of all the partition files, that is, all the edge data of the target graph data file, resides in the compact graph data file; in other words, the target graph data file has been converted into the compact graph data file.
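The offset-based placement claimed above (each vertex's offset is initialized from the ID and out-degree information of the vertices in the partition file) can be sketched as follows; the dict-based in-memory representation is a hypothetical simplification of the partition file:

```python
def pack_partition(edges_by_vertex):
    """Place each vertex's edge data at a precomputed offset: the offset
    of vertex v is the sum of the out-degrees of all vertices preceding
    v in ID order (a prefix sum), so the packed array comes out sorted
    by source vertex with no gaps."""
    order = sorted(edges_by_vertex)          # vertex IDs in order
    offsets, running = {}, 0
    for v in order:
        offsets[v] = running
        running += len(edges_by_vertex[v])
    packed = [None] * running
    for v, targets in edges_by_vertex.items():
        for i, t in enumerate(targets):
            packed[offsets[v] + i] = (v, t)  # one edge = (source, target)
    return packed

# input order does not matter; output is grouped and ordered by source
packed = pack_partition({'B': ['C'], 'A': ['B', 'C']})
```

Because every vertex's slot range is known in advance, this placement needs only one pass over the unsorted edge data, which is why it suits sorting a partition file that has just been loaded into memory.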
S2, calling an index bitmap corresponding to the iterative algorithm of the target graph data file in the current iteration step, and sequentially acquiring all useful edge data blocks corresponding to the current iteration step according to the index bitmap, wherein each useful edge data block comprises a plurality of pieces of edge data;
Specifically, building on the above, the graph algorithm is iterative in nature, and not all edge data has to be used in every iteration. In view of this, in order to reduce the computation over useless edge data in large graph computation, that is, to reduce the external storage I/O of useless edge data, this embodiment uses an index bitmap to record the vertices that are active in each iteration step. If a vertex is active in a given iteration step, the edge data corresponding to that vertex is the useful edge data of that step. In other words, each iteration step corresponds to one index bitmap, and for any iteration step the useful edge data can be determined from the corresponding index bitmap.
On this basis, in the iterative algorithm over the target graph data file, the index bitmap corresponding to the current iteration step has already been constructed during the iterative operation of the previous iteration step, so it can be called directly. Because the index bitmap records the vertices active in the current iteration step, the useful edge data of the current step can be determined from those active vertices; at the same time, all other vertices can be determined to be inactive, and the edge data corresponding to the inactive vertices can be determined to be useless edge data.
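A minimal sketch of the two-level index bitmap from the preferred embodiments (one bitmap bit per vertex, plus one index bit per fixed-size group of bitmap bits that is set whenever any bit in its group is set, so whole inactive groups can be skipped); the group size of 8 is an arbitrary example, not a value from the patent:

```python
def build_index_bitmap(active, num_vertices, group=8):
    """Return (bitmap bits, index bits): bits[v] is 1 iff vertex v is
    active in the current iteration step; index[g] is 1 iff at least
    one bitmap bit in group g is 1."""
    bits = [1 if v in active else 0 for v in range(num_vertices)]
    index = [int(any(bits[i:i + group]))
             for i in range(0, num_vertices, group)]
    return bits, index

bits, index = build_index_bitmap({1, 9}, num_vertices=16, group=8)
# both groups contain an active vertex, so both index bits are set
```

A scan then inspects bitmap bits only for groups whose index bit is the first numerical value, which mirrors the skipping behavior described below.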
After the useful and useless edge data are determined, note that the edge data of the vertices is stored sequentially in the compact graph data file, with useful and useless edge data frequently interleaved. When accessing the useful edge data, if the useless edge data is skipped, the computer must perform many external storage seeks; if it is not skipped, the external storage I/O of useless edge data silently grows. In view of this, to balance the impact of external storage addressing and external storage I/O on performance, this embodiment merges contiguous useful edge data (possibly containing a small amount of useless edge data) into a useful edge data block. In this way all the useful edge data blocks corresponding to the current iteration step are obtained; one useful edge data block contains multiple pieces of useful edge data and may contain a small amount of useless edge data. That is, a useful edge data block usually contains multiple pieces of edge data.
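The merging rule (adjacent useful edge data absorbed into one block whenever the useless gap between them does not exceed a preset threshold) can be sketched as follows; the spans and the threshold use hypothetical units, and the (start, size, useful) representation is an assumption for illustration:

```python
def merge_useful_blocks(runs, threshold):
    """`runs` is a list of (start, size, useful) spans in file order.
    Useful spans separated only by a useless gap no larger than
    `threshold` are merged into one (start, size) block; a larger gap
    would cost a seek anyway, so the blocks are kept separate."""
    blocks = []
    for start, size, useful in runs:
        if useful:
            # gap = this span's start minus the previous block's end
            if blocks and start - sum(blocks[-1]) <= threshold:
                prev_start, _prev_size = blocks[-1]
                blocks[-1] = (prev_start, start + size - prev_start)
            else:
                blocks.append((start, size))
    return blocks

blocks = merge_useful_blocks(
    [(0, 4, True), (4, 2, False), (6, 3, True),
     (9, 10, False), (19, 5, True)],
    threshold=3)
# the 2-unit useless gap is absorbed; the 10-unit gap is not
```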
S3, regarding any useful edge data block, taking the position of the first edge data in the useful edge data block in the compact graph data file as the starting position of the useful edge data block, determining the target size of the useful edge data block according to the number of all edge data in the useful edge data block, generating an I/O request according to the starting position and the target size, and adding the I/O request into an I/O request queue;
Specifically, on the basis of the above technical solution, after all the useful edge data blocks corresponding to the current iteration step are obtained, consider any one of them: since its multiple pieces of edge data are stored contiguously in the compact graph data file, the position of the first piece of edge data of the block in the compact graph data file can be used as the starting position of the block, and the target size of the block is determined from the number of pieces of edge data it contains. On this basis, the I/O request corresponding to the useful edge data block can be generated from the starting position and the target size and added to an I/O request queue. That is, one useful edge data block corresponds to one I/O request, and the corresponding block can be accessed from the compact graph data file according to that request.
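Assuming fixed-size edge records (an assumption for illustration; the actual record format of FIG. 2 may differ), turning useful blocks into (starting position, target size) I/O requests in a queue might look like:

```python
from collections import deque

def enqueue_requests(useful_blocks, edge_size):
    """One I/O request per useful block: the byte offset of the block's
    first edge, and a byte size derived from its edge count."""
    queue = deque()
    for first_edge_index, edge_count in useful_blocks:
        start = first_edge_index * edge_size
        target_size = edge_count * edge_size
        queue.append((start, target_size))
    return queue

# two blocks of 9 and 5 edges, with hypothetical 8-byte edge records
q = enqueue_requests([(0, 9), (19, 5)], edge_size=8)
```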
S4, sequentially taking out the I/O requests from the I/O request queue, and accessing the edge data in the compact graph data file according to the starting position and the target size in each I/O request.
Specifically, on the basis of the above technical solution, all the useful edge data blocks corresponding to the current iteration step may be sequentially added to the I/O request queue. On this basis, the I/O requests can be taken out of the queue in turn; each I/O request carries its starting position and target size, so the edge data in the compact graph data file can be accessed accordingly: the position corresponding to the starting position is located in the compact graph data file, and edge data of the target size is read from that position onward. The iterative operation of the current iteration step can thus be completed. Moreover, every iteration step of the iterative algorithm corresponding to the target graph data file can perform its iterative operation according to the above method steps, thereby completing the computation over the target graph data file.
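Serving the queue then amounts to one seek plus one contiguous read per request; a toy sketch against an in-memory stand-in for the compact graph data file (hypothetical code, not the patent's I/O layer):

```python
import io

def serve_requests(graph_file, requests):
    """Pop requests in order; each is served with one seek to its
    starting position followed by one contiguous read of target_size
    bytes, preserving sequential access within a block."""
    out = []
    requests = list(requests)
    while requests:
        start, target_size = requests.pop(0)
        graph_file.seek(start)
        out.append(graph_file.read(target_size))
    return out

f = io.BytesIO(bytes(range(32)))  # stand-in for the compact graph file
chunks = serve_requests(f, [(0, 4), (16, 4)])
```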
The invention provides a data access method in large graph computation. The target graph data file is preprocessed: the storage space occupied by each edge datum in the target graph data file is compressed to a certain degree, and the edge data corresponding to each vertex are stored sequentially, so as to obtain the compact graph data file corresponding to the target graph data file. Meanwhile, the active vertices of the iterative algorithm corresponding to the target graph data file in each iteration step are recorded through an index bitmap; the useful edge data and useless edge data in each iteration step are then determined according to the active vertices; on the basis of comprehensively considering the addressing overhead and I/O overhead of the external storage device, all the useful edge data blocks corresponding to each iteration step are determined according to the useful and useless edge data of that step, and an I/O request is generated according to the starting position and size of each useful edge data block, so that when the I/O request is processed, each edge datum in the corresponding useful edge data block can be accessed directly from the compact graph data file according to the starting position and size. When the method is used for large graph computation, only the useful edge data blocks corresponding to each iteration step need to be accessed, which effectively reduces the external storage device I/O for useless edge data, thereby effectively reducing the external storage device I/O overhead of the target graph data file over the whole computation process and improving the overall performance of large graph computation to a certain extent.
Based on any of the above embodiments, a data access method in large graph computing is provided, where, before the edge data corresponding to all the vertices in a vertex set are written into the corresponding partition file, the method further includes: for any vertex set, acquiring all the vertices in the vertex set; for any vertex, taking the vertex as a source vertex, acquiring the target vertices corresponding to the source vertex, and taking the combination of the vertex and all its target vertices as the edge data corresponding to the vertex.
Specifically, in this embodiment, after all the vertices in the target graph data file are divided into a plurality of vertex sets, for any vertex set, before the edge data corresponding to all the vertices in the vertex set are written into the corresponding partition file, the storage space occupied by the edge data of the vertex set can be compressed to a certain extent, because multiple edge data in the set often share one source vertex. The specific implementation process is as follows:
For any vertex set, all the vertices in the set are first acquired. For any vertex, the target graph data file is scanned to obtain all the outgoing edges that take the vertex as their source vertex; the other endpoint of each outgoing edge is taken as a target vertex. In this way, all the target vertices corresponding to the vertex are obtained, and the combination of the vertex and all its target vertices can be used as the edge data corresponding to the vertex. That is, for multiple edge data sharing one source vertex, only the shared source vertex and each target vertex need to be stored; the source vertex does not need to be stored repeatedly, which reduces the storage space occupied by the edge data of the target graph data file to a certain extent. For ease of understanding, an example follows:
As shown in fig. 2, V0, V1, ..., Vn denote the vertices. For vertex V0, when the edge data corresponding to V0 are stored, the vertex is taken as the source vertex and needs to be stored only once, represented as src0 in the figure; the edge data taking this vertex as their source vertex can then be represented directly by the corresponding target vertices, represented as dst0, dst1, ... in the figure. It can be seen that src0 and dst0 form one edge datum, and src0 and dst1 form another; that is, if vertex V0 is the source vertex of m outgoing edges, there are m corresponding target vertices. In addition, if the edge data in the target graph data file carry weights, each weight can be stored in association with the corresponding target vertex, where w0 and w1 in the figure represent the corresponding weights.
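The shared-source layout of fig. 2 can be illustrated with a minimal Python sketch (the function name and the plain `(src, dst, weight)` input format are assumptions, not the patent's on-disk format): each source vertex is kept once, followed only by its target vertices and weights.

```python
from collections import defaultdict


def compact_edges(edge_list):
    """Group edges by shared source vertex so each source is stored once.

    Input: [(src, dst, weight), ...]; output: {src: [(dst, weight), ...]}.
    For m outgoing edges of one source, only the m target vertices (and
    weights) are kept, instead of repeating the source m times."""
    compact = defaultdict(list)
    for src, dst, w in edge_list:
        compact[src].append((dst, w))
    return dict(compact)


edges = [(0, 3, 1.0), (0, 7, 0.5), (0, 9, 2.0), (4, 1, 1.5)]
c = compact_edges(edges)
# vertex 0 (src0) appears once, with targets dst0=3, dst1=7, dst2=9
```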
The invention provides a data access method in large graph calculation, which comprises the steps of dividing all vertexes in a target graph data file into a plurality of vertex sets, acquiring all vertexes in the vertex set for any vertex set before writing edge data corresponding to all vertexes in the vertex set into a corresponding partition file, acquiring a target vertex corresponding to a source vertex by taking the vertex as the source vertex for any vertex, and taking the combination of the vertex and all target vertexes as the edge data corresponding to the vertex. Therefore, when the edge data corresponding to each vertex is written into the partition file, only one source vertex and a plurality of corresponding target vertices are needed to be stored for the edge data sharing the same source vertex, and the storage space occupation of the edge data in the target graph data file is reduced to a certain extent.
Based on any of the above embodiments, a data access method in large graph computing is provided, where the edge data corresponding to different vertices in a partition file are sorted, specifically: initializing the offset value corresponding to each vertex according to the ID information and the out-degree information of all the vertices in the partition file; and for any vertex, determining the target position corresponding to the vertex according to its offset value, and storing the edge data corresponding to the vertex at the target position.
Specifically, in this embodiment, a specific implementation process of sorting edge data corresponding to different vertices in each partition file is as follows:
For any partition file, all the vertices in the file are first acquired, together with the ID information and out-degree information of each vertex, and the vertices are sorted by ID. After the sorting is finished, for any vertex, the offset value corresponding to the vertex is initialized to the sum of the out-degrees of all the vertices arranged before it (i.e., a prefix sum of out-degrees in ID order). After the offset values are obtained, for any vertex, the target position corresponding to the vertex is determined according to its offset value; the target position is where the edge data corresponding to the vertex are to be stored, and the edge data are finally stored at that position in the partition file. Storing the edge data of each vertex according to these steps realizes the sequential storage of the edge data corresponding to each vertex in the partition file.
For example, suppose a partition file has 3 vertices A, B and C, and sorting them by their ID information gives the order B, A, C. Meanwhile, the out-degrees corresponding to A, B and C are 2, 4 and 6 respectively. On this basis, the offset value corresponding to vertex B is initialized to 0, that of vertex A to 4, and that of vertex C to 6. Finally, the 4 edge data corresponding to vertex B are stored at positions 0 to 3 of the partition file, the 2 edge data corresponding to vertex A at positions 4 to 5, and the 6 edge data corresponding to vertex C at positions 6 to 11. In this way the edge data corresponding to each vertex in the partition file are stored sequentially.
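The offset initialization above amounts to a prefix sum of out-degrees in vertex-ID order. A minimal Python sketch reproducing the B, A, C example (function name assumed):

```python
def init_offsets(sorted_vertices):
    """Initialize each vertex's offset value.

    `sorted_vertices` is [(vertex_name, out_degree), ...] already in ID
    order; each offset is the sum of the out-degrees of all the vertices
    arranged before it (a running prefix sum)."""
    offsets, running = {}, 0
    for name, out_deg in sorted_vertices:
        offsets[name] = running  # slots [running, running + out_deg) belong to this vertex
        running += out_deg
    return offsets


# The example from the text: ID order is B, A, C with out-degrees 4, 2, 6
offs = init_offsets([("B", 4), ("A", 2), ("C", 6)])
# → {'B': 0, 'A': 4, 'C': 6}: B's 4 edges fill slots 0-3, A's 4-5, C's 6-11
```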
The invention provides a data access method in large graph calculation, which initializes the offset value corresponding to each vertex according to the ID information and the out-degree information of all the vertices in a partition file; and for any vertex, determining a target position corresponding to the vertex according to the offset value corresponding to the vertex, and storing the edge data corresponding to the vertex in the target position. According to the method, the edge data corresponding to each vertex is sequentially stored according to the ID information and the out-degree information of each vertex in each partition file, and then the sorted partition files are written into the same compact graph data file, so that each edge data can be accessed according to the position of each edge data in the compact graph data file in the subsequent iterative computation, and the data access time in the iterative computation can be saved.
Based on any of the above embodiments, a data access method in large graph computation is provided, where, before the index bitmap corresponding to the current iteration step of the iterative algorithm corresponding to the target graph data file is called, the method further includes: constructing the index bitmap corresponding to the current iteration step during the iterative operation of the iterative algorithm corresponding to the target graph data file in the last iteration step.
Specifically, in the iterative computation of the target graph data file, for the current iteration step, if the current iteration step is the first iteration step, an index bitmap corresponding to the first iteration step needs to be constructed before the large graph computation is performed; and if the current iteration step is not the first iteration step, the index bitmap corresponding to the current iteration step is constructed in the iteration operation process of the last iteration step. For example, for the traversal algorithm, vertices to be traversed for the first time can be determined before the traversal algorithm is performed, the vertices can be determined as active vertices corresponding to the first traversal step, and an index bitmap corresponding to the first traversal step can be constructed according to the determined active vertices; for other traversal steps, such as the second traversal step, in the process of the first traversal, the vertices required to be traversed by the second traversal can be determined, the vertices are the active vertices in the second traversal step, and the index bitmap corresponding to the second traversal step can be determined according to the active vertices.
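For the traversal example, the idea of building the next step's active-vertex record during the current step can be sketched as a level-synchronous traversal in Python (plain lists stand in for the index bitmap; the function name and adjacency-dict input are assumptions):

```python
def bfs_levels(adj, roots, num_vertices):
    """Level-synchronous traversal: while the bitmap of the current step
    is processed, the bitmap of the next step is filled in, so each step
    can use its bitmap directly instead of recomputing it."""
    current = [0] * num_vertices
    for r in roots:
        current[r] = 1                     # active-vertex bitmap of step 1
    visited = set(roots)
    levels = []
    while any(current):
        levels.append(current)
        nxt = [0] * num_vertices           # bitmap of the next step,
        for v, bit in enumerate(current):  # built during this step's work
            if bit:
                for u in adj.get(v, []):
                    if u not in visited:
                        visited.add(u)
                        nxt[u] = 1         # u is active in the next step
        current = nxt
    return levels
```

Each entry of `levels` is the active-vertex bitmap of one traversal step, ready before that step begins.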
The invention provides a data access method in large graph computation, in which, before the index bitmap corresponding to the current iteration step is called, that index bitmap is constructed during the iterative operation of the iterative algorithm corresponding to the target graph data file in the last iteration step. Because the index bitmap corresponding to the current iteration step is constructed during the iterative operation of the last iteration step, the corresponding index bitmap can be called directly in the current iteration step and the active vertices of the current iteration step determined from it, which saves the scheduling overhead in the computation process to a certain extent.
Based on any of the above embodiments, a data access method in large graph computation is provided, where the index bitmap corresponding to the current iteration step is constructed during the last iteration step of the iterative algorithm corresponding to the target graph data file, specifically: for any vertex in the target graph data file, judging whether the vertex is an active vertex in the current iteration step according to the iterative operation on the vertex in the last iteration step; if the vertex is an active vertex in the current iteration step, setting the bitmap bit corresponding to the vertex to a first numerical value, and if the vertex is an inactive vertex in the current iteration step, setting the bitmap bit corresponding to the vertex to a second numerical value; and arranging the bitmap bits of all the vertices in order of the ID information of the vertices, and setting one corresponding index bit for every preset number of bitmap bits, so as to obtain the index bitmap corresponding to the current iteration step.
Specifically, in this embodiment, the index bitmap corresponding to the current iteration step is constructed during the iterative operation of the iterative algorithm corresponding to the target graph data file in the last iteration step. The specific implementation process is as follows:
Firstly, all the vertices in the target graph data file are obtained, and for any vertex, whether the vertex is an active vertex in the current iteration step can be judged according to the iterative operation on the vertex in the last iteration step. For example, in a traversal algorithm, the vertices to be traversed in the second traversal step are determined during the first traversal step, and these vertices are the active vertices of the second traversal step. For other iterative algorithms, whether a vertex is an active vertex in the current iteration step may be determined according to the specific properties of the algorithm, which is not specifically limited here.
If the vertex is an active vertex in the current iteration step, the bitmap bit corresponding to the vertex in the index bitmap is set to the first numerical value; if the vertex is an inactive vertex in the current iteration step, the bitmap bit is set to the second numerical value. On this basis, the bitmap bit corresponding to each vertex of the target graph data file can be determined; the ID information of each vertex is acquired at the same time, and the bitmap bits of all the vertices are arranged in order of their ID information. In addition, in order to reduce the overhead of traversing the index bitmap, in this embodiment, after all the sequentially arranged bitmap bits are obtained, one corresponding index bit is set for every preset number of bitmap bits, so as to obtain the index bitmap corresponding to the current iteration step. In the index bitmap, it can be determined from an index bit whether the bitmap bits corresponding to it contain the first numerical value and/or the second numerical value. In this embodiment, the first numerical value may be set to 1 and the second numerical value to 0; in other embodiments, the first and second numerical values may be set according to actual requirements and are not specifically limited here. Likewise, the preset number can be set according to actual requirements and is not specifically limited here.
The invention provides a data access method in large graph calculation, which is characterized in that for any vertex in a target graph data file, whether the vertex is an active vertex in the current iteration step is judged according to the iteration operation of the vertex in the last iteration step; if the vertex is an active vertex in the current iteration step, setting the bitmap bit corresponding to the vertex as a first numerical value, and if the vertex is an inactive vertex in the current iteration step, setting the bitmap bit corresponding to the vertex as a second numerical value; and arranging the bitmap bits of all the vertexes in sequence according to the ID information of all the vertexes, setting a corresponding index bit for a preset number of bitmap bits, and obtaining an index bitmap corresponding to the current iteration step. According to the method, the active vertexes in the current iteration step are obtained in the iteration operation process of the last iteration step, and then all the active vertexes are recorded in the index bitmap corresponding to the current iteration step, so that the active vertexes in the current iteration step can be determined according to the index bitmap, and the scheduling overhead in the calculation process is saved to a certain extent.
Based on any of the above embodiments, a data access method in large graph computation is provided, where a preset number of bitmap bits are provided with a corresponding index bit, and the specific steps are as follows: if at least one bitmap bit in the preset number of bitmap bits is a first numerical value, setting the corresponding index bit as the first numerical value; and if all the bitmap bits in the preset number of bitmap bits are the second numerical value, setting the corresponding index bits as the second numerical value.
Specifically, after the bitmap bit corresponding to each vertex in the target graph data file is obtained and the bitmap bits are arranged in order, one corresponding index bit is set for every preset number of bitmap bits. The specific setting process is as follows:
firstly, setting a preset number, and dividing all the sequenced bitmap bits according to the preset number to obtain a plurality of bitmap bit combinations, wherein each bitmap bit combination comprises the preset number of bitmap bits. On the basis, if at least one bitmap bit in a certain bitmap bit combination is a first numerical value, the index bit corresponding to the bitmap bit combination is set as the first numerical value, and if all bitmap bits in the certain bitmap bit combination are second numerical values, the index bit corresponding to the bitmap bit combination is set as the second numerical value. For example, assuming that the first value is 1 and the second value is 0, if at least one bitmap bit in a certain bitmap bit combination is 1, the index bit corresponding to the bitmap bit combination is set to 1, and if all bitmap bits in the certain bitmap bit combination are 0, the index bit corresponding to the bitmap bit combination is set to 0. The first value, the second value and the preset number may be set according to actual requirements, and are not specifically limited herein.
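A minimal Python sketch of the two-level structure described above, assuming the first numerical value is 1, the second numerical value is 0, and a preset number (`group`) of 8 bitmap bits per index bit:

```python
def build_index_bitmap(active, num_vertices, group=8):
    """Build the bitmap bits (1 = active vertex, 0 = inactive, in ID
    order), plus one index bit per `group` bitmap bits: the index bit is
    1 if any bit in its combination is 1, and 0 if all of them are 0,
    so whole all-inactive combinations can be skipped when scanning."""
    bits = [1 if v in active else 0 for v in range(num_vertices)]
    index = [
        1 if any(bits[i:i + group]) else 0
        for i in range(0, num_vertices, group)
    ]
    return bits, index


bits, index = build_index_bitmap(active={2, 13}, num_vertices=24, group=8)
# combinations bits[0:8], bits[8:16], bits[16:24] → index bits [1, 1, 0]
```

A scan then visits the 8 bitmap bits of a combination only when its index bit is 1, which is what reduces the traversal overhead of the index bitmap.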
The invention provides a data access method in large graph calculation, if at least one bitmap bit in a preset number of bitmap bits is a first numerical value, setting a corresponding index bit as the first numerical value; and if all the bitmap bits in the preset number of bitmap bits are the second numerical value, setting the corresponding index bits as the second numerical value. The method is favorable for reducing the traversal overhead of the index bitmap by setting corresponding index bits for the preset number of bitmap bits.
Based on any of the above embodiments, a data access method in large graph computation is provided, where all the useful edge data blocks corresponding to the current iteration step are sequentially obtained according to the index bitmap, and the specific steps are as follows: all the index bits in the index bitmap are scanned in sequence; for any index bit, if the index bit is the second numerical value, the index bit is ignored; if the index bit is the first numerical value, all the bitmap bits corresponding to the index bit are scanned in sequence. For any bitmap bit, if the bitmap bit is the second numerical value, the vertex corresponding to the bitmap bit is determined to be an inactive vertex, and the useless edge data corresponding to the bitmap bit are obtained according to the out-degree information of the inactive vertex; if the bitmap bit is the first numerical value, the vertex corresponding to the bitmap bit is determined to be an active vertex, and the useful edge data corresponding to the bitmap bit are obtained according to the out-degree information of the active vertex. It is then judged whether the size of the continuous useless edge data between the useful edge data corresponding to any two bitmap bits exceeds a preset threshold; if not, the useful edge data corresponding to the two bitmap bits and the continuous useless edge data are merged into one useful edge data block; if so, the useful edge data corresponding to the two bitmap bits are taken as independent useful edge data blocks respectively.
Specifically, after obtaining the index bitmap corresponding to the current iteration step, sequentially scanning all index bits in the index bitmap, and for any index bit, if the index bit is a second value (for example, 0), determining that all bitmap bits corresponding to the index bit are the second value, that is, all vertices corresponding to the index bit are inactive vertices, and at this time, directly ignoring the index bit; if the index bit is a first value (e.g., 1), it is determined that at least one bitmap bit exists in the bitmap bits corresponding to the index bit as the first value, i.e., at least one vertex of all vertices corresponding to the index bit is an active vertex, and all bitmap bits corresponding to the index bit are sequentially scanned.
In the process of sequentially scanning all the bitmap bits corresponding to the index bit, if a bitmap bit is the second numerical value, the vertex corresponding to the bitmap bit is determined to be an inactive vertex, the out-degree information of the inactive vertex is acquired, and the useless edge data corresponding to the bitmap bit are obtained according to that out-degree information. If the bitmap bit is the first numerical value, the vertex corresponding to the bitmap bit is determined to be an active vertex, the out-degree information of the active vertex is acquired, and the useful edge data corresponding to the bitmap bit are obtained according to that out-degree information. In this way, the useful or useless edge data corresponding to each bitmap bit can be obtained.
For any two bitmap bits whose corresponding edge data are both useful and are separated by continuous useless edge data, it is judged whether the size of the continuous useless edge data between them exceeds a preset threshold; if not, the useful edge data corresponding to the two bitmap bits and the continuous useless edge data are merged into one useful edge data block, and if so, the useful edge data corresponding to the two bitmap bits are taken as independent useful edge data blocks respectively. In this embodiment, the preset threshold is set as l = b_seq × t_seek, where b_seq is the sequential bandwidth of the external storage device and t_seek is its addressing (seek) time. In other embodiments, the preset threshold may be set according to actual requirements and is not specifically limited here. For ease of understanding, an example follows:
as shown in fig. 3, the gray blocks in the graph represent useful edge data corresponding to a certain vertex, and the white blocks represent useless edge data corresponding to a certain vertex. The size of the useless edge data corresponding to the first white block is larger than a preset threshold value l, and the useless edge data represented by the first white block is directly skipped (Skip); for the first gray color block and the second gray color block, if the useless edge data represented by the white block between the two gray color blocks (i.e. the second white color block) does not exceed the preset threshold value l, the first gray color block and the second gray color block and the white block between the two gray color blocks are combined to form a useful edge data block (Chunk1), and the useful edge data block needs to be Read (Read). For the second gray color block and the third gray color block, the useless edge data represented by the white block between the two gray color blocks (i.e. the third white color block) exceeds a preset threshold value l, the useless edge data represented by the third white color block is directly skipped (Skip), and the third gray color block is taken as a useful edge data block (Chunk2) alone, and the useful edge data block needs to be Read (Read). For the third gray color block and the fourth gray color block, if the useless edge data represented by the white block between the two gray color blocks (i.e. the fourth white color block) exceeds the preset threshold value l, the useless edge data represented by the fourth white color block is directly skipped (Skip), and the fourth gray color block is taken as a useful edge data block (Chunk3) alone, and the useful edge data block needs to be Read (Read).
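The chunk formation of fig. 3 can be sketched as a single pass over alternating useful/useless runs (a minimal Python sketch; the `(kind, size)` run representation and function name are assumptions): a useless gap no larger than the threshold l is read together with its neighbors, while a larger one triggers a skip and starts a new chunk.

```python
def plan_chunks(runs, threshold):
    """Merge useful edge data runs into useful edge data blocks (chunks).

    `runs` is an ordered list of (kind, size) with kind 'useful' or
    'useless', sizes in bytes at consecutive file positions. A useless
    run no larger than `threshold` (l = b_seq * t_seek) between two
    useful runs is read along with them; a larger one is skipped."""
    chunks, pos, cur = [], 0, None  # cur = [start, end) of chunk being built
    for kind, size in runs:
        if kind == "useful":
            if cur is not None and pos - cur[1] <= threshold:
                cur[1] = pos + size      # small gap: extend the current chunk
            else:
                if cur is not None:
                    chunks.append(tuple(cur))
                cur = [pos, pos + size]  # large gap (or first run): new chunk
        pos += size                      # useless runs only advance the position
    if cur is not None:
        chunks.append(tuple(cur))
    return chunks
```

With the fig. 3 pattern (a 5-byte gap under the threshold, the other gaps over it), the pass yields one merged chunk followed by two standalone chunks, matching Chunk1-Chunk3 in the figure.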
According to the data access method in large graph computation, all the useful edge data blocks corresponding to the current iteration step are obtained according to the index bitmap corresponding to the current iteration step, so that when the iterative operation of the current iteration step is carried out, only the corresponding useful edge data blocks need to be accessed from the compact graph data file. This effectively reduces the external storage device I/O for useless edge data, thereby effectively reducing the external storage device I/O overhead of the target graph data file over the whole computation process and improving the overall performance of large graph computation to a certain extent.
Fig. 4 is a schematic diagram of an overall structure of a data access system in large graph computing according to an embodiment of the present invention, and as shown in fig. 4, based on any of the embodiments, a data access system in large graph computing is provided, including:
the preprocessing module 1 is used for calculating the out-degree information of each vertex in the target graph data file, orderly dividing all the vertices into a plurality of vertex sets according to the out-degree information of all the vertices, writing, for any vertex set, the edge data corresponding to all the vertices in the vertex set into the corresponding partition file, sorting the edge data corresponding to different vertices in the partition file, and writing the sorted partition files into the compact graph data file;
the edge data acquisition module 2 is used for calling the index bitmap corresponding to the current iteration step of the iterative algorithm corresponding to the target graph data file, and sequentially acquiring all the useful edge data blocks corresponding to the current iteration step according to the index bitmap, wherein each useful edge data block comprises a plurality of edge data;
the request generating module 3 is configured to, for any one useful edge data block, use a position of a first edge data in the useful edge data block in the compact graph data file as a start position of the useful edge data block, determine a target size of the useful edge data block according to the number of all edge data in the useful edge data block, generate an I/O request according to the start position and the target size, and add the I/O request to an I/O request queue;
and the data access module 4 is used for sequentially taking the I/O requests out of the I/O request queue and accessing the edge data in the compact graph data file according to the starting position and the target size in each I/O request.
Specifically, the present invention provides a data access system in large graph computing, including a preprocessing module 1, an edge data obtaining module 2, a request generating module 3, and a data access module 4, where the method in any one of the above method embodiments is implemented through cooperation between the modules, and for a specific implementation process, reference is made to the above method embodiment, which is not described herein again.
According to the data access system in large graph computation, the target graph data file is preprocessed: the storage space occupied by each edge datum in the target graph data file is compressed to a certain degree, and the edge data corresponding to each vertex are stored sequentially, so as to obtain the compact graph data file corresponding to the target graph data file. Meanwhile, the active vertices of the iterative algorithm corresponding to the target graph data file in each iteration step are recorded through an index bitmap; the useful edge data and useless edge data in each iteration step are then determined according to the active vertices; on the basis of comprehensively considering the addressing overhead and I/O overhead of the external storage device, all the useful edge data blocks corresponding to each iteration step are determined according to the useful and useless edge data of that step, and an I/O request is generated according to the starting position and size of each useful edge data block, so that when the I/O request is processed, each edge datum in the corresponding useful edge data block can be accessed directly from the compact graph data file according to the starting position and size. When the system is used for large graph computation, only the useful edge data blocks corresponding to each iteration step need to be accessed, which effectively reduces the external storage device I/O for useless edge data, thereby effectively reducing the external storage device I/O overhead of the target graph data file over the whole computation process and improving the overall performance of large graph computation to a certain extent.
Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present invention. Referring to fig. 5, the electronic device includes: a processor (processor) 51, a memory (memory) 52, and a bus 53, where the processor 51 and the memory 52 communicate with each other through the bus 53. The processor 51 is configured to call program instructions in the memory 52 to perform the methods provided by the above method embodiments, including: calculating the out-degree information of each vertex in the target graph data file, orderly dividing all the vertices into a plurality of vertex sets according to the out-degree information of all the vertices, writing, for any vertex set, the edge data corresponding to all the vertices in the vertex set into the corresponding partition file, sorting the edge data corresponding to different vertices in the partition file, and writing the sorted partition files into the compact graph data file; calling the index bitmap corresponding to the current iteration step of the iterative algorithm corresponding to the target graph data file, and sequentially acquiring all the useful edge data blocks corresponding to the current iteration step according to the index bitmap, wherein each useful edge data block comprises a plurality of edge data; for any useful edge data block, taking the position of the first edge datum of the block in the compact graph data file as the starting position of the block, determining the target size of the block according to the number of all the edge data in the block, generating an I/O request according to the starting position and the target size, and adding the I/O request into an I/O request queue; and sequentially taking the I/O requests out of the I/O request queue, and accessing the edge data in the compact graph data file according to the starting position and the target size in each I/O request.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above method embodiments, for example including: calculating the out-degree information of each vertex in the target graph data file; dividing all the vertices in order into a plurality of vertex sets according to the out-degree information of all the vertices; for any vertex set, writing the edge data corresponding to all the vertices in the vertex set into a corresponding partition file, sorting the edge data corresponding to different vertices in the partition file, and writing the sorted partition file into the compact graph data file; calling the index bitmap corresponding, in the current iteration step, to the iterative algorithm corresponding to the target graph data file, and sequentially acquiring all useful edge data blocks corresponding to the current iteration step according to the index bitmap, wherein each useful edge data block comprises a plurality of edge data; for any useful edge data block, taking the position in the compact graph data file of the first edge datum in the useful edge data block as the starting position of the useful edge data block, determining the target size of the useful edge data block according to the number of all edge data in the useful edge data block, generating an I/O request according to the starting position and the target size, and adding the I/O request to an I/O request queue; and sequentially taking I/O requests out of the I/O request queue, and accessing the edge data in the compact graph data file according to the starting position and the target size in each I/O request.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example including: calculating the out-degree information of each vertex in the target graph data file; dividing all the vertices in order into a plurality of vertex sets according to the out-degree information of all the vertices; for any vertex set, writing the edge data corresponding to all the vertices in the vertex set into a corresponding partition file, sorting the edge data corresponding to different vertices in the partition file, and writing the sorted partition file into the compact graph data file; calling the index bitmap corresponding, in the current iteration step, to the iterative algorithm corresponding to the target graph data file, and sequentially acquiring all useful edge data blocks corresponding to the current iteration step according to the index bitmap, wherein each useful edge data block comprises a plurality of edge data; for any useful edge data block, taking the position in the compact graph data file of the first edge datum in the useful edge data block as the starting position of the useful edge data block, determining the target size of the useful edge data block according to the number of all edge data in the useful edge data block, generating an I/O request according to the starting position and the target size, and adding the I/O request to an I/O request queue; and sequentially taking I/O requests out of the I/O request queue, and accessing the edge data in the compact graph data file according to the starting position and the target size in each I/O request.
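The merging of useful edge data into useful edge data blocks, and the generation of (starting position, target size) I/O requests from them, can be sketched as follows. This is a minimal Python illustration: the threshold, the per-edge size, and all names (`build_io_requests`, `GAP_THRESHOLD`, `EDGE_SIZE`) are assumptions for the sketch, not values from the patent. The idea is that a small gap of useless edge data between two useful runs is cheaper to read through than to seek over, so such runs are merged into one block.

```python
# Minimal sketch of forming useful edge data blocks: consecutive useful runs
# separated by a gap of useless edge data no larger than a preset threshold
# are merged into one block, so one larger sequential read replaces several
# small reads. Names and constants are illustrative assumptions.

EDGE_SIZE = 8          # assumed bytes per edge datum in the compact file
GAP_THRESHOLD = 2      # assumed merge threshold, measured in edge data

def build_io_requests(out_degrees, bitmap_bits):
    """Return (start_offset, size) byte-addressed requests over the file.

    out_degrees[v] is vertex v's out-degree; the compact file stores the
    edge data of vertex 0, then vertex 1, and so on, contiguously.
    """
    requests = []
    offset = 0                         # current position, in edge data units
    cur = None                         # (start, end) of the open block
    for v, degree in enumerate(out_degrees):
        if bitmap_bits[v] and degree:  # active vertex: useful edge data
            if cur is not None and offset - cur[1] <= GAP_THRESHOLD:
                cur = (cur[0], offset + degree)   # merge across a small gap
            else:
                if cur is not None:
                    requests.append(cur)          # gap too large: close block
                cur = (offset, offset + degree)   # open a new block
        offset += degree               # useless edge data only advance offset
    if cur is not None:
        requests.append(cur)
    # Convert edge-data spans into byte-addressed (start, size) requests.
    return [(s * EDGE_SIZE, (e - s) * EDGE_SIZE) for s, e in requests]
```

For example, with out-degrees `[3, 2, 1, 4]` and bitmap `[1, 0, 1, 0]`, the two useful runs are separated by only two useless edge data, so they merge into a single request; a larger gap would instead produce two separate requests.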
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by program instructions directing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, and magnetic or optical disks.
The above-described embodiments of the electronic device and the like are merely illustrative. Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the above technical solutions may be embodied in the form of a software product stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are merely preferred embodiments of the present application and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A data access method in large graph computation is characterized by comprising the following steps:
calculating the out-degree information of each vertex in a target graph data file, dividing all the vertices in order into a plurality of vertex sets according to the out-degree information of all the vertices, and, for any vertex set, writing the edge data corresponding to all the vertices in the vertex set into a corresponding partition file, sorting the edge data corresponding to different vertices in the partition file, and writing the sorted partition file into a compact graph data file;
calling an index bitmap corresponding, in the current iteration step, to an iterative algorithm corresponding to the target graph data file, and sequentially acquiring all useful edge data blocks corresponding to the current iteration step according to the index bitmap, wherein each useful edge data block comprises a plurality of edge data;
for any useful edge data block, taking the position in the compact graph data file of the first edge datum in the useful edge data block as the starting position of the useful edge data block, determining a target size of the useful edge data block according to the number of all edge data in the useful edge data block, generating an I/O request according to the starting position and the target size, and adding the I/O request to an I/O request queue;
and sequentially taking I/O requests out of the I/O request queue, and accessing the edge data in the compact graph data file according to the starting position and the target size in each I/O request.
2. The method according to claim 1, wherein before the writing of the edge data corresponding to all the vertices in the vertex set into the corresponding partition file, the method further comprises:
for any vertex set, acquiring all the vertices in the vertex set;
and for any vertex, taking the vertex as a source vertex, acquiring the target vertices corresponding to the source vertex, and taking the combinations of the vertex with each of its target vertices as the edge data corresponding to the vertex.
3. The method according to claim 1, wherein the sorting of the edge data corresponding to different vertices in the partition file specifically comprises:
initializing the offset corresponding to each vertex according to the ID information and the out-degree information of all the vertices in the partition file;
and for any vertex, determining a target position corresponding to the vertex according to the offset corresponding to the vertex, and storing the edge data corresponding to the vertex in the target position.
4. The method according to claim 1, wherein before the calling of the index bitmap corresponding, in the current iteration step, to the iterative algorithm corresponding to the target graph data file, the method further comprises:
constructing the index bitmap corresponding to the current iteration step according to the iterative operation, in the last iteration step, of the iterative algorithm corresponding to the target graph data file.
5. The method according to claim 4, wherein the constructing of the index bitmap corresponding to the current iteration step according to the iterative operation, in the last iteration step, of the iterative algorithm corresponding to the target graph data file specifically comprises:
for any vertex in the target graph data file, judging, according to the iterative operation on the vertex in the last iteration step, whether the vertex is an active vertex in the current iteration step; if the vertex is an active vertex in the current iteration step, setting the bitmap bit corresponding to the vertex to a first numerical value, and if the vertex is an inactive vertex in the current iteration step, setting the bitmap bit corresponding to the vertex to a second numerical value;
and arranging the bitmap bits of all the vertices in order according to the ID information of all the vertices, and setting a corresponding index bit for every preset number of bitmap bits, to obtain the index bitmap corresponding to the current iteration step.
6. The method according to claim 5, wherein the setting of a corresponding index bit for every preset number of bitmap bits specifically comprises:
if at least one bitmap bit among the preset number of bitmap bits is the first numerical value, setting the corresponding index bit to the first numerical value; and if all the bitmap bits among the preset number of bitmap bits are the second numerical value, setting the corresponding index bit to the second numerical value.
7. The method according to claim 1, wherein the sequentially obtaining all the useful edge data blocks corresponding to the current iteration step according to the index bitmap specifically comprises:
scanning all the index bits in the index bitmap in order, and for any index bit, if the index bit is the second numerical value, ignoring the index bit, and if the index bit is the first numerical value, scanning in order all the bitmap bits corresponding to the index bit;
for any bitmap bit, if the bitmap bit is the second numerical value, determining that the vertex corresponding to the bitmap bit is an inactive vertex and obtaining the useless edge data corresponding to the bitmap bit according to the out-degree information of the inactive vertex; if the bitmap bit is the first numerical value, determining that the vertex corresponding to the bitmap bit is an active vertex and obtaining the useful edge data corresponding to the bitmap bit according to the out-degree information of the active vertex;
and judging whether the size of the continuous useless edge data between the useful edge data corresponding to any two bitmap bits exceeds a preset threshold; if not, combining the useful edge data corresponding to the two bitmap bits and the continuous useless edge data between them into one useful edge data block, and if so, taking the useful edge data corresponding to the two bitmap bits as separate useful edge data blocks.
8. A system for data access in large graph computing, comprising:
the preprocessing module is used for calculating the out-degree information of each vertex in a target graph data file, dividing all the vertices in order into a plurality of vertex sets according to the out-degree information of all the vertices, and, for any vertex set, writing the edge data corresponding to all the vertices in the vertex set into a corresponding partition file, sorting the edge data corresponding to different vertices in the partition file, and writing the sorted partition file into a compact graph data file;
the edge data acquisition module is used for calling an index bitmap corresponding, in the current iteration step, to an iterative algorithm corresponding to the target graph data file, and sequentially acquiring all useful edge data blocks corresponding to the current iteration step according to the index bitmap, wherein each useful edge data block comprises a plurality of edge data;
the request generating module is configured to, for any useful edge data block, take the position in the compact graph data file of the first edge datum in the useful edge data block as the starting position of the useful edge data block, determine a target size of the useful edge data block according to the number of all edge data in the useful edge data block, generate an I/O request according to the starting position and the target size, and add the I/O request to an I/O request queue;
and the data access module is used for sequentially taking I/O requests out of the I/O request queue and accessing the edge data in the compact graph data file according to the starting position and the target size in each I/O request.
9. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 7.
CN201810725214.3A 2018-07-04 2018-07-04 Data access method and system in large graph calculation Active CN110688055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810725214.3A CN110688055B (en) 2018-07-04 2018-07-04 Data access method and system in large graph calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810725214.3A CN110688055B (en) 2018-07-04 2018-07-04 Data access method and system in large graph calculation

Publications (2)

Publication Number Publication Date
CN110688055A CN110688055A (en) 2020-01-14
CN110688055B true CN110688055B (en) 2020-09-04

Family

ID=69106432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810725214.3A Active CN110688055B (en) 2018-07-04 2018-07-04 Data access method and system in large graph calculation

Country Status (1)

Country Link
CN (1) CN110688055B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111400521B (en) * 2020-02-28 2022-06-07 苏州浪潮智能科技有限公司 Graph data processing method, device, equipment and medium
CN112287182B (en) * 2020-10-30 2023-09-19 杭州海康威视数字技术股份有限公司 Graph data storage and processing method and device and computer storage medium
CN112463065A (en) * 2020-12-10 2021-03-09 恩亿科(北京)数据科技有限公司 Account number getting-through calculation method and system

Citations (2)

Publication number Priority date Publication date Assignee Title
CN104063507A (en) * 2014-07-09 2014-09-24 时趣互动(北京)科技有限公司 Graph computation method and engine
CN107193896A (en) * 2017-05-09 2017-09-22 华中科技大学 A kind of diagram data division methods based on cluster

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20170090807A1 (en) * 2015-09-26 2017-03-30 Vishakha Gupta Technologies for managing connected data on persistent memory-based systems

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN104063507A (en) * 2014-07-09 2014-09-24 时趣互动(北京)科技有限公司 Graph computation method and engine
CN107193896A (en) * 2017-05-09 2017-09-22 华中科技大学 A kind of diagram data division methods based on cluster

Non-Patent Citations (2)

Title
Examining big graph computing from a system perspective; 吴城文; 张广艳; 郑纬民; <<Big Data (大数据)>>; Posts & Telecom Press; 2015-09-30; Vol. 1, No. 03; full text *
Graph data processing technology for big data; 罗征 et al.; <<情报工程>>; 2015-12-31; Vol. 1, No. 6; full text *

Also Published As

Publication number Publication date
CN110688055A (en) 2020-01-14

Similar Documents

Publication Publication Date Title
Busato et al. Hornet: An efficient data structure for dynamic sparse graphs and matrices on gpus
CN109559234B (en) Block chain state data storage method, equipment and storage medium
US11580441B2 (en) Model training method and apparatus
CN110688055B (en) Data access method and system in large graph calculation
CN111414389B (en) Data processing method and device, electronic equipment and storage medium
US8380737B2 (en) Computing intersection of sets of numbers
EP3079077A1 (en) Graph data query method and device
CN110413776B (en) High-performance calculation method for LDA (text-based extension) of text topic model based on CPU-GPU (Central processing Unit-graphics processing Unit) collaborative parallel
CN106250061A (en) File download processing method and processing device
CN112579595A (en) Data processing method and device, electronic equipment and readable storage medium
CN110223216A (en) A kind of data processing method based on parallel PLB, device and computer storage medium
US11093862B2 (en) Locality aware data loading for machine learning
CN117369731B (en) Data reduction processing method, device, equipment and medium
CN112860412B (en) Service data processing method and device, electronic equipment and storage medium
CN116225314A (en) Data writing method, device, computer equipment and storage medium
US9507794B2 (en) Method and apparatus for distributed processing of file
US20210149746A1 (en) Method, System, Computer Readable Medium, and Device for Scheduling Computational Operation Based on Graph Data
CN114998158A (en) Image processing method, terminal device and storage medium
CN112988064B (en) Concurrent multitask-oriented disk graph processing method
CN114237903A (en) Memory allocation optimization method, memory allocation optimization device, electronic equipment, memory allocation optimization medium and program product
CN113628099A (en) Feature map data conversion method and device, electronic equipment and storage medium
CN112948330A (en) Data merging method, device, electronic equipment, storage medium and program product
CN109240600B (en) Disk map processing method based on mixed updating strategy
CN116804915B (en) Data interaction method, processor, device and medium based on memory
CN116303135B (en) Task data loading method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant