CN109522428B - External memory access method of graph computing system based on index positioning - Google Patents

External memory access method of graph computing system based on index positioning

Info

Publication number
CN109522428B
CN109522428B · CN201811082365.8A · CN201811082365A
Authority
CN
China
Prior art keywords
edge
data
file
vertex
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811082365.8A
Other languages
Chinese (zh)
Other versions
CN109522428A (en
Inventor
王芳
冯丹
陈静
蒋子威
王子毅
刘上
杨蕾
杨文鑫
陈硕
曹孟媛
戴凯航
施展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201811082365.8A
Publication of CN109522428A
Application granted
Publication of CN109522428B

Abstract

The invention discloses an external memory access method for a graph computing system based on index positioning, which comprises the following steps: dividing the complete graph data into a plurality of subgraphs; sorting the edges of each subgraph by source vertex number and by target vertex number; writing the sorted subgraphs into external memory files and building indexes on the source vertex numbers and target vertex numbers respectively; selecting the optimal loading mode between index-positioned loading and complete-data loading; and loading each subgraph from external memory into memory using the optimal mode. The invention redesigns the external memory data structure and improves the data loading procedure, so that the system can identify the valid data in external memory before loading, significantly reducing the I/O data volume and the number of random accesses. By analyzing the time overhead of the complete-data access mode and the index-positioned mode, the system dynamically selects the optimal data loading mode, reducing the time overhead of data loading.

Description

External memory access method of graph computing system based on index positioning
Technical Field
The invention belongs to the field of external-memory-based graph computation, and particularly relates to an external memory access method for a graph computing system based on index positioning.
Background
Graph computation proceeds by iteratively executing an update function. A common approach in external-memory-based graph computing systems is to organize the graph data into multiple subgraph data files on disk, such that each subgraph file fits in memory. Each subgraph contains the vertex information needed to compute updates, and one complete iteration processes all subgraph data. The key is how to manage the computation state of all subgraphs to guarantee correct results: graph data is loaded from external memory into memory, and intermediate results are written back to external memory so that subsequent computation sees the updated values. Each iteration therefore accesses a large amount of data, which generates heavy I/O overhead and becomes the bottleneck of external-memory-based approaches.
During preprocessing, the GraphChi system divides the vertices of the graph into disjoint intervals and partitions the edges into multiple data shards, one shard per vertex interval; the target vertex of every edge in a shard falls in the corresponding vertex interval. Using a vertex-centric processing model, it collects data from neighboring vertices, executes an update function on each vertex, and computes and updates vertex values, applying a parallel sliding window technique to reduce random external memory I/O. Keval Vora et al. propose ADS, a general optimized access method for external-memory graph computing systems that reads only active data: at the end of each iteration, new sub-partitions are regenerated from the data that will be active in the next iteration, and only these newly generated sub-partitions are read in the next iteration. Meanwhile, a DELAY_BUFFER is kept in memory to hold vertices that need updating but have not yet been processed; every other iteration the original subgraphs are read in full to apply the updates held in the DELAY_BUFFER.
In summary, most current external memory data access methods in graph computing systems load complete subgraph partitions. Whether GraphChi with its parallel sliding window or ADS with its dynamically created sub-partitions, neither divides the external memory data finely enough to precisely locate and access only the data required by the computation. Meanwhile, if a naive positioning method is used to load external memory data, the data loading volume is reduced and resource utilization improves, but the original sequential access is split into many random accesses, incurring extra time overhead.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to solve the technical problems of the large I/O data volume, the large number of random accesses, and the high time overhead of external memory access in the prior art.
In order to achieve the above object, in a first aspect, an embodiment of the present invention provides an external memory access method for a graph computing system based on index positioning, where the method includes the following steps:
S0. dividing the complete graph data into multiple subgraphs that the memory can hold;
S1. sorting the edges of each subgraph by source vertex number and by target vertex number;
S2. writing the sorted subgraphs into external memory files, and building indexes on the source vertex numbers and target vertex numbers respectively;
S3. selecting the optimal loading mode between index-positioned loading and complete-data loading;
S4. loading each subgraph in external memory into memory using the optimal loading mode.
Specifically, writing the sorted subgraphs into external memory files comprises the following steps:
When each edge-sorted subgraph is written to an external memory file, edges with the same source vertex, and edges with the same target vertex, are stored in contiguous external memory data blocks. The data of an edge comprises the edge itself and its edge value, stored in external memory as two separate files: an edge file and an edge-value file. The edge file stores the topology of the graph in adjacency-list format; the order of the edge values in the edge-value file corresponds to the order of the edges in the edge file.
Specifically, the edge file comprises an out-edge file organized by source vertex number and an in-edge file organized by target vertex number. The out-edge file stores, contiguously and in sequence, each source vertex number, its out-degree, and the target vertex numbers of its out-edges; the in-edge file stores, contiguously and in sequence, each target vertex number, its in-degree, and the source vertex numbers of its in-edges.
Specifically, the index records the offset address, in the external memory file, of the edges corresponding to each vertex.
Specifically, the index-positioned loading mode is as follows:
(1) locating, according to the index, the data blocks in the out-edge file that need to be loaded into memory;
(2) according to the offset addresses of the out-edges in those data blocks within the out-edge file, finding the edge-value data of the out-edges in the edge-value file and loading it into memory;
(3) constructing the out-edge topology of the vertices in memory;
(4) when the application needs in-edges, locating, according to the index, the in-edge data blocks in the in-edge file that need to be loaded into memory;
(5) according to the offset addresses of the in-edges in those data blocks within the in-edge file, finding in the edge-value offset file the offset addresses of the real edge values of the in-edges, and loading the corresponding edge-value data into memory;
(6) constructing the in-edge topology of the vertices in memory.
Specifically, step S3 is as follows:
S30. determining the active vertices of each subgraph and judging whether the proportion of active vertices exceeds a threshold; if so, proceeding to step S31; otherwise, taking complete-data loading as the optimal loading mode and ending step S3;
S31. recording the numbers of the data blocks corresponding to the active vertex data;
S32. calculating, based on the numbers of the data blocks corresponding to the active vertex data, the overhead cost(index) of index-positioned loading and the overhead cost(all) of complete-data loading;
S33. judging whether cost(all) is greater than cost(index); if so, index-positioned loading is the optimal loading mode; otherwise, complete-data loading is the optimal loading mode.
Specifically, the value range of the threshold is 20% -30%.
Specifically, the overhead of complete-data loading is

cost(all) = D·|E| / B

and the overhead of index-positioned loading is

cost(index) = K · (b/B + r)

where E is the set of all edges in the graph; D is the storage space occupied by a single edge; B is the data block size of a single I/O; b is the size of the data block pointed to by each index entry; K is the number of index data blocks corresponding to all active vertices in the current system; and r is the random access overhead.
In a second aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the external memory access method of the first aspect.
Generally, compared with the prior art, the above technical solution contemplated by the present invention has the following beneficial effects:
1. The invention redesigns the organization of the external memory data around the access characteristics of graph applications, so that the data of the same vertex is stored in contiguous external memory space for convenient reading, and builds an index on the offset addresses of the data blocks corresponding to each vertex in the file for fast access to those blocks. This improves the system's data loading procedure, allowing the system to analyze which external memory data is valid before the loading stage, so that only the vertex data required by the computation is loaded, significantly reducing the I/O data volume and the number of random accesses.
2. The invention analyzes the time overhead of the original complete-data access mode and of the index-positioned mode, and designs a decision function that dynamically selects the optimal data loading mode in each iteration, effectively reducing the time overhead of data loading.
Drawings
FIG. 1 is a schematic diagram of the correspondence between the edge files and the edge-value offset address file according to the present invention;
FIG. 2 is a schematic diagram of an out-edge index file and an out-edge file according to an embodiment of the present invention;
FIG. 3 is a flowchart of step S3 according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical means and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an external memory access method of a graph computing system based on index positioning, which comprises the following steps:
S0. dividing the complete graph data into multiple subgraphs that the memory can hold;
S1. sorting the edges of each subgraph by source vertex number and by target vertex number;
S2. writing the sorted subgraphs into external memory files, and building indexes on the source vertex numbers and target vertex numbers respectively;
S3. selecting the optimal loading mode between index-positioned loading and complete-data loading;
S4. loading each subgraph in external memory into memory using the optimal loading mode.
The graph computing system comprises two main stages: a preprocessing stage and a computation execution stage. The preprocessing stage reads the original graph data into memory, processes it into the format required for computation, and writes it to external memory files. The computation execution stage loads external memory data into memory, constructs subgraphs in memory, executes the update computation, and writes the results back to external memory. To realize index-based selective loading of external memory data, subgraph data sorting and index construction are added to the preprocessing stage, decision judgment is added to the computation execution stage, and the edge data and edge-value data loading procedures are modified accordingly.
The preprocessing stage of the present invention includes steps S0-S2.
Step S0. Divide the complete graph data into multiple subgraphs that the memory can hold.
According to the available memory in the computing resources, the complete original graph data is divided into a plurality of subgraphs whose data volume does not exceed the available memory capacity, ensuring that memory can hold all data of a single subgraph.
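As a minimal illustration of this step (the bucketing strategy and all names below are assumptions for the sketch, not taken from the patent), splitting an edge list into memory-sized subgraphs over consecutive vertex intervals might look like:

```python
def partition_graph(edges, num_vertices, mem_budget, edge_bytes=8):
    """Split edges into subgraphs (vertex intervals) small enough for memory.

    edges: list of (src, dst) pairs; vertices are numbered 0..num_vertices-1.
    mem_budget: bytes available for one subgraph's edge data (assumed).
    Returns a list of ((first_vertex, last_vertex), edge_list) pairs.
    """
    max_edges = max(1, mem_budget // edge_bytes)  # edges that fit in memory
    # Bucket edges by target vertex, then greedily grow vertex intervals
    # until adding the next vertex's edges would overflow the budget.
    by_target = [[] for _ in range(num_vertices)]
    for src, dst in edges:
        by_target[dst].append((src, dst))
    subgraphs, current, start = [], [], 0
    for v in range(num_vertices):
        if current and len(current) + len(by_target[v]) > max_edges:
            subgraphs.append(((start, v - 1), current))
            current, start = [], v
        current.extend(by_target[v])
    subgraphs.append(((start, num_vertices - 1), current))
    return subgraphs
```

Each resulting interval plays the role of one subgraph whose data can be fully resident in memory during an iteration.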
Step S1. Sort the edges of each subgraph by source vertex number and by target vertex number.
Regarding the vertices as numbered consecutively from 0 to |V|-1, the subgraph data is sorted once by source vertex number and once by target vertex number. The vertex set V is the set of all vertices in the graph; a vertex consists of a vertex number and the vertex's own value.
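This double sort can be sketched as follows (names assumed): each subgraph's edge list is produced in two orders, one feeding the out-edge file and one feeding the in-edge file:

```python
def sort_subgraph(edges):
    """Return the two edge orders written to external memory.

    edges: list of (src, dst, value) triples.
    by_src feeds the out-edge file; by_dst feeds the in-edge file.
    A secondary key keeps each vertex's neighbor list ordered as well.
    """
    by_src = sorted(edges, key=lambda e: (e[0], e[1]))
    by_dst = sorted(edges, key=lambda e: (e[1], e[0]))
    return by_src, by_dst
```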
Step S2. Write the sorted subgraphs into external memory files and build indexes on the source vertex numbers and the target vertex numbers respectively; the indexes record the offset addresses, in the external memory files, of the edges corresponding to each vertex.
When an edge-sorted subgraph is written to an external memory file, edges with the same source vertex, and edges with the same target vertex, are stored in contiguous external memory data blocks. The data of an edge comprises the edge itself and its edge value, stored in external memory as two separate files: an edge file and an edge-value file. The edge file stores the topology of the graph in adjacency-list format, and comprises an out-edge file organized by source vertex number and an in-edge file organized by target vertex number. The out-edge file stores, contiguously and in sequence, each source vertex number, its out-degree, and the target vertex numbers of its out-edges; the in-edge file stores, contiguously and in sequence, each target vertex number, its in-degree, and the source vertex numbers of its in-edges. The order of the edge values in the edge-value file corresponds to the order of the edges in the edge file; because the two files store information in the same order, data in the edge-value file can conveniently be located from the edge file.
Since saving two copies of the edge data would inevitably double the data to synchronize and write back, the write-back process is optimized. Because the two copies of edge values differ only in storage order, an improved edge-value file structure is designed that stores only one copy of the edge-value data plus one copy of the offset addresses of those edge values in the file. FIG. 1 is a schematic diagram of the correspondence between the edge files and the edge-value offset address file according to the present invention. As shown in FIG. 1, the out-edge data file keeps the original edge-value order, the edge values corresponding one-to-one to the edges in the out-edge file, for compatibility with the system's original data loading mode; for the in-edge data file, a companion "edge value" file of the same format is constructed, whose entries are not real edge values but file offset addresses pointing to the positions of the real edge-value data in the file.
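A sketch of this shared edge-value layout, under the assumptions above (one real value array in out-edge order, and an offset array in in-edge order pointing back into it; all names hypothetical):

```python
def build_edge_value_files(by_src, by_dst):
    """Build one real edge-value array (out-edge order) plus an offset
    array (in-edge order) that points into it, instead of storing the
    edge values twice.

    by_src / by_dst: the same (src, dst, value) triples in the two orders.
    """
    values = [v for _, _, v in by_src]              # real edge values
    pos = {(s, d): i for i, (s, d, _) in enumerate(by_src)}
    offsets = [pos[(s, d)] for s, d, _ in by_dst]   # in-edge -> value slot
    return values, offsets

def in_edge_value(values, offsets, i):
    """Fetch the real edge value of the i-th edge in in-edge order."""
    return values[offsets[i]]
```

Only `values` and `offsets` need to be written back, roughly halving the edge-value data compared with storing both orders in full.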
The index records the offset address of each vertex's edges in the external memory file. The invention builds sparse indexes on the source vertex numbers and the target vertex numbers respectively, and uses redundant storage, saving each edge once as an out-edge and once as an in-edge, trading doubled external memory cost for improved time performance. The details are as follows:
When building an index, an index file is created over the vertex numbers of the edge file; the index points to the offset address of each vertex's out-edges/in-edges in the file. Each row of the index file corresponds to a pair: a vertex number plus the offset address of its out-edge/in-edge information in the file. The edge file is divided into blocks at a given interval, and each index entry points to the first record of its block, comprising that record's vertex number and file offset address. In this way a large amount of vertex data can be skipped. When loading data, although reading a whole data block at a time inevitably brings in some useless data, each block read may contain the data of several required vertices, so the number of I/Os is reduced and efficiency improves.
FIG. 2 is a schematic diagram of an out-edge index file and an out-edge file according to an embodiment of the present invention. As shown in FIG. 2, each line of the out-edge file represents the out-edge information of one source vertex: source vertex number + out-degree + target vertex numbers. The given interval is 3, so when the index is built the edge file is divided into blocks of 3 vertices each: vertices 1-3 form one block and vertices 4-6 the next. Each index entry points to the first vertex of its block; for example, the entry for vertex V1 points to the first address 0 of the first block, and the entry for vertex 4 points to the first address 32 of the second block (assuming each value occupies 4 bytes).
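A minimal sketch of such a sparse index over an out-edge file (the record layout, interval, and 4-byte value size are assumptions taken from the example above; names are hypothetical):

```python
import bisect

VALUE_BYTES = 4  # assumed size of each stored number

def build_sparse_index(out_edge_records, interval=3):
    """Index every `interval`-th vertex record of the out-edge file.

    out_edge_records: list of (src, out_degree, [targets]) in file order.
    Returns [(vertex_number, byte_offset_of_its_block), ...].
    """
    index, offset = [], 0
    for i, (src, deg, targets) in enumerate(out_edge_records):
        if i % interval == 0:            # first record of a new block
            index.append((src, offset))
        offset += VALUE_BYTES * (2 + len(targets))  # src + degree + targets
    return index

def locate_block(index, vertex):
    """Binary-search the sparse index for the block holding `vertex`."""
    firsts = [v for v, _ in index]
    i = max(bisect.bisect_right(firsts, vertex) - 1, 0)
    return index[i][1]
```

Because only every third vertex appears in the index, a lookup lands on the block's first address and the block is then scanned sequentially for the wanted vertex.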
The topology of the graph directly determines the organization of the external memory data and hence how that data is located. Without changing the topology of the graph, the invention builds indexes over the external memory data to locate it, which reduces implementation complexity while also reducing the overhead of extra external memory accesses.
The calculation execution phase of the present invention includes steps S3-S4.
Step S3. Select the optimal loading mode between index-positioned loading and complete-data loading. FIG. 3 is a flowchart of step S3 according to an embodiment of the present invention. As shown in FIG. 3, step S3 is as follows:
S30. Determine the active vertices of each subgraph and judge whether the proportion of active vertices exceeds a threshold; if so, proceed to step S31; otherwise, complete-data loading is the optimal loading mode and step S3 ends.
Vertices participating in the update computation are called active vertices; vertices not participating in the computation are called inactive vertices.
Iterations with a low proportion of active vertices gain performance from index-positioned loading, while iterations with a high proportion perform better with the original complete-data access; combining the two loading modes yields an overall optimal scheme, i.e., the decision module selects the data loading mode that gives the greatest performance improvement. The threshold ranges from 20% to 30%, preferably 30%.
S31. Record the numbers of the data blocks corresponding to the active vertex data.
One data block usually contains the data of several vertices, so the data blocks corresponding to all active vertex data are counted. This captures the distribution of active vertices across data blocks: determine which data blocks contain active vertices, and count those blocks.
S32. Based on the numbers of the data blocks corresponding to the active vertex data, calculate the overhead cost(index) of index-positioned loading and the overhead cost(all) of complete-data loading.
There are two ways to load data from external memory into memory, and they load different content. Complete-data access loads the complete subgraph data from the source-vertex-ordered data file and does not load the target-vertex-ordered data file; index-positioned loading of valid data loads a vertex's out-edges from the source-vertex-ordered out-edge file and its in-edges from the target-vertex-ordered in-edge file.
The complete-subgraph loading mode is typically used when the application needs a large amount of data: little invalid data would be skipped, so the benefit of selective loading is low, and the high performance of sequential external memory access yields better access efficiency. Note that preprocessing the original graph data produces two external memory data files, each containing the complete subgraph information, so the system only needs to access one of them when loading. When constructing the subgraph, each edge in the data file is processed in sequence, judging whether its source vertex and target vertex are data required by the application: if the source vertex participates in the computation, the edge is added to that vertex's out-edge sequence; if the target vertex participates, the edge is added to that vertex's in-edge sequence.
The complete-subgraph data loading mode is as follows:
(1) sequentially load the source-vertex-ordered data file;
(2) process each edge in the data file in sequence, judging whether its source vertex and target vertex are data required by the application; if the source vertex participates in the computation, add the edge to that vertex's out-edge sequence; if the target vertex participates, add the edge to that vertex's in-edge sequence.
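The two complete-loading steps above can be sketched as one sequential pass (the data layout and names are assumptions for illustration):

```python
def load_complete(by_src_edges, needed):
    """Build per-vertex out-edge and in-edge sequences from one sequential
    pass over the source-vertex-ordered file.

    by_src_edges: iterable of (src, dst) pairs in file order.
    needed: set of vertex numbers the application must update.
    """
    out_edges, in_edges = {}, {}
    for src, dst in by_src_edges:
        if src in needed:                       # source participates
            out_edges.setdefault(src, []).append(dst)
        if dst in needed:                       # target participates
            in_edges.setdefault(dst, []).append(src)
    return out_edges, in_edges
```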
The index-positioned loading mode is typically used when the application needs little data: much invalid data can be skipped, so the optimization is significant, and using the index to access only the data the application needs reduces the I/O volume and yields higher access efficiency. Before accessing external memory, the system obtains the vertex state information, records the numbers of the data blocks corresponding to the active vertex data (one data block usually contains several vertices' data, so the blocks for all active vertices are collected), and locates the corresponding file positions via the index to access those blocks. Ideally only the data of the active vertices would be loaded, but since one external memory access per vertex would be inefficient, the data of several vertices is read at once by accessing the data block containing them; although a block inevitably contains some useless data, this greatly reduces the number of I/Os and improves I/O efficiency. Note that the loaded data comes from two separate files: a vertex's out-edges are loaded from the source-vertex-ordered file and its in-edges from the target-vertex-ordered file. When constructing the subgraph, out-edges are appended to the out-edge sequence of their source vertex and in-edges to the in-edge sequence of their target vertex, so the amount of data to process is reduced compared with complete-data access.
The index-positioned data loading mode is as follows:
(1) locate, according to the index, the data blocks in the out-edge file that need to be loaded into memory;
(2) according to the offset addresses of the out-edges in those data blocks within the out-edge file, find the edge-value data of the out-edges in the edge-value file and load it into memory;
(3) construct the out-edge topology of the vertices in memory;
(4) when the application needs in-edges, locate, according to the index, the in-edge data blocks in the in-edge file that need to be loaded into memory;
(5) according to the offset addresses of the in-edges in those data blocks within the in-edge file, find in the edge-value offset file the offset addresses of the real edge values of the in-edges, and load the corresponding edge-value data into memory;
(6) construct the in-edge topology of the vertices in memory.
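Under the same assumptions as the sparse-index sketch, the index-positioned path might look like the following (all names are hypothetical; `read_block` stands in for a disk read at a byte offset):

```python
import bisect

def load_by_index(index, read_block, active):
    """Load only the data blocks that contain active vertices.

    index: sparse index [(first_vertex, byte_offset), ...].
    read_block: function(byte_offset) -> list of (vertex, neighbors)
                records of that block (stands in for a disk read).
    active: set of active vertex numbers.
    Returns {vertex: neighbors} for the active vertices found.
    """
    # Collect the distinct blocks covering the active vertices.
    firsts = [v for v, _ in index]
    needed_offsets = set()
    for v in active:
        i = max(bisect.bisect_right(firsts, v) - 1, 0)
        needed_offsets.add(index[i][1])
    # One random access per needed block, then filter to active vertices.
    topology = {}
    for off in sorted(needed_offsets):
        for vertex, neighbors in read_block(off):
            if vertex in active:
                topology[vertex] = neighbors
    return topology
```

Each needed block costs one random access, which is exactly the trade-off the decision function of step S3 weighs against a full sequential scan.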
When data is loaded, a vertex's position in the file can be located quickly from its number, improving the efficiency of data lookup and access at run time.
The overhead of complete-data loading is

cost(all) = D·|E| / B

and the overhead of index-positioned loading is

cost(index) = K · (b/B + r)

where E is the set of all edges in the graph; D is the storage space occupied by a single edge; B is the data block size of a single I/O; b is the size of the data block pointed to by each index entry; K is the number of index data blocks corresponding to all active vertices in the current system; and r is the random access overhead. The constants D, |E|, B, b and r are obtained and assigned before the computation runs, where D·|E| is the space occupied by the data file.
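The decision of steps S32-S33 then reduces to comparing the two cost estimates. A sketch under these definitions (the exact form of the cost formulas is reconstructed from the variable definitions above, so treat it as an assumption):

```python
def choose_loading_mode(num_edges, edge_bytes, io_block, index_block,
                        active_blocks, random_cost):
    """Return 'index' if index-positioned loading is estimated cheaper,
    else 'all' (complete-data loading).

    cost(all)   = D*|E| / B     -- one sequential scan of the whole file
    cost(index) = K * (b/B + r) -- K random block reads
    """
    cost_all = edge_bytes * num_edges / io_block
    cost_index = active_blocks * (index_block / io_block + random_cost)
    return 'index' if cost_all > cost_index else 'all'
```

With few active blocks on a large graph the function picks index positioning; on a small graph, the sequential scan wins despite reading everything.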
S33. Judge whether cost(all) is greater than cost(index); if so, index-positioned loading is the optimal loading mode; otherwise, complete-data loading is the optimal loading mode.
Comparing the two data loading modes shows that index-positioned loading suits the case of few active vertices: the computation then needs little data, and accessing the complete data would read much invalid data and waste CPU resources processing it during subgraph construction. Complete-subgraph loading, in contrast, fully exploits the sequential access performance of the external memory disk; when the proportion of active vertices is large, loading all data at once performs better. Using the two loading modes together lets the system always load data in the optimal mode, greatly improving overall efficiency.
Step S4. Load each subgraph in external memory into memory using the optimal loading mode.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. An external memory access method for a graph computing system based on index positioning, the method comprising the steps of:
s0. dividing the complete graph data into multiple sub-graphs that the memory can hold;
s1, sequencing edges of each subgraph according to a source vertex number and a target vertex number;
s2, writing the sequenced subgraphs into an external storage file, and respectively establishing indexes for a source vertex number and a target vertex number;
s3, selecting an optimal loading mode from the loading modes of index positioning and the data loading modes of accessing the complete subgraph, wherein the step S3 is as follows:
s30, determining active vertexes of all sub-graphs, judging whether the proportion of the active vertexes exceeds a threshold value, if so, entering a step S31, otherwise, accessing the data loading mode of the complete sub-graph to be an optimal loading mode, and ending the step S3;
s31, recording a data block number corresponding to the active vertex data;
s32, calculating the overhead cost (index) of a loading mode for index positioning and the overhead cost (all) of a data loading mode for accessing a complete subgraph based on the number of a data block corresponding to the active vertex data;
s33, judging whether cost (all) is greater than cost (index), if so, determining the loading mode of index positioning as the optimal loading mode, otherwise, determining the data loading mode of accessing the complete subgraph as the optimal loading mode;
the loading mode of the index positioning is as follows: (1) finding out data blocks needing to be loaded into the memory in the edge file according to the index positioning; (2) according to the offset address of the data block output edge which needs to be loaded in the output edge file, finding out the edge value data of the output edge in the edge value file, and loading the edge value data into the memory; (3) constructing an edge-out topological structure of a vertex in a memory; (4) when the application needs to enter the edge, locating an edge entering data block which needs to be loaded into the memory in the edge entering file according to the index; (5) according to the edge entering offset address of the data block to be loaded in the edge entering file, finding the edge value offset address of the edge entering in the edge value index file, and loading the edge value data offset address to the memory; (6) constructing an edge-entering topological structure of a vertex in a memory;
the data loading mode for accessing the complete subgraph is as follows: (1) sequentially loading data files with ordered source vertexes; (2) sequentially processing each edge in the data file, respectively judging whether a source vertex and a target vertex of the edge are data required by application, and adding a current edge into an edge outlet sequence of the vertex if the source vertex needs to participate in calculation; if the target vertex needs to participate in calculation, adding the current edge into the edge entering sequence of the vertex;
S4, loading each subgraph in the external memory into the internal memory in the optimal loading mode.
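The complete-subgraph loading mode above can be sketched as a single sequential pass. This is an illustrative assumption, not the claimed implementation: edges are modeled as in-memory `(src, dst)` tuples and the out-edge and in-edge sequences as plain dictionaries.

```python
def load_full_subgraph(edges, active):
    """Sequentially scan all (src, dst) edges of a subgraph and build
    out-edge and in-edge adjacency lists, keeping only the edges whose
    endpoints the application needs (the 'active' vertex set)."""
    out_edges, in_edges = {}, {}
    for src, dst in edges:                    # one sequential pass
        if src in active:                     # source participates: out-edge
            out_edges.setdefault(src, []).append(dst)
        if dst in active:                     # target participates: in-edge
            in_edges.setdefault(dst, []).append(src)
    return out_edges, in_edges
```

Note that every edge is touched even when few vertices are active, which is exactly the waste the index positioning mode avoids.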
2. The method according to claim 1, wherein the writing of the sorted subgraphs into the external memory file is as follows:
when each edge-sorted subgraph is written into the external memory file, edges sharing the same source vertex, and edges sharing the same target vertex, are stored in continuous external memory data blocks; the data of an edge comprises the edge itself and its edge value, which are respectively stored as two files in the external memory: an edge file and an edge value file; the edge file stores the topological structure of the graph in adjacency list format; the order of the edge values in the edge value file corresponds to the order of the edges in the edge file.
3. The external memory access method of claim 2, wherein the edge file comprises an out-edge file organized by source vertex number and an in-edge file organized by target vertex number; the out-edge file sequentially and continuously stores, for each source vertex, the source vertex number, the out-degree, and the target vertex numbers of its out-edges; the in-edge file sequentially and continuously stores, for each target vertex, the target vertex number, the in-degree, and the source vertex numbers of its in-edges.
4. The method as claimed in claim 2, wherein, when creating the index, an edge value index file is created using the vertex numbers of the edge file, and each row of the edge value index file corresponds to one tuple: a vertex number of the edge file and the offset address of the corresponding out-edge or in-edge in the external memory file.
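Claims 2-4 describe an out-edge file of consecutive (source vertex, out-degree, target vertices...) records plus an index of (vertex, offset) tuples. A minimal sketch of producing both over an in-memory byte layout follows; the fixed 4-byte little-endian integer encoding and the function name are assumptions for illustration only.

```python
import struct

def build_out_edge_file(adjacency):
    """Serialize {src: [dst, ...]} as consecutive records of the form
    (src, out_degree, dst, dst, ...), each field a 4-byte little-endian
    unsigned int, and return (edge_file_bytes, index) where index maps
    each source vertex number to its record's byte offset."""
    blob, index, offset = bytearray(), {}, 0
    for src in sorted(adjacency):             # edges sorted by source vertex
        dsts = adjacency[src]
        record = struct.pack("<II", src, len(dsts)) + struct.pack(
            "<%dI" % len(dsts), *dsts)
        index[src] = offset                   # one (vertex, offset) tuple
        blob += record
        offset += len(record)
    return bytes(blob), index
```

With the index in hand, loading the out-edges of one active vertex is a single seek to `index[src]` followed by one small read, instead of a scan of the whole file.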
5. The method according to claim 1, wherein the threshold value ranges from 20% to 30%.
6. The method of claim 1, wherein the overhead of the data loading mode of accessing the complete subgraph is

cost(all) = |E| · D / B

and the overhead of the loading mode of index positioning is

cost(index) = K · (b / B + R)

wherein E is the set of all edges in the graph and |E| its size; D is the storage space occupied by a single edge; B is the data block size of one I/O; b is the size of the data block pointed to by each index entry; K is the number of indexed data blocks corresponding to all active vertices in the current system; and R is the random access overhead.
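A numeric sketch of the two overhead estimates, assuming (as the symbol definitions suggest) that a full scan pays |E|·D/B sequential block reads while index positioning pays, for each of the K indexed blocks, a random-access penalty R plus a transfer of b/B; all concrete values below are illustrative, not from the patent.

```python
def cost_all(num_edges, edge_bytes, io_block_bytes):
    # Sequential scan: total edge data volume divided by the I/O block size.
    return num_edges * edge_bytes / io_block_bytes

def cost_index(k_blocks, indexed_block_bytes, io_block_bytes, random_overhead):
    # K indexed blocks: each pays a random-access penalty plus its transfer.
    return k_blocks * (indexed_block_bytes / io_block_bytes + random_overhead)

# Illustrative values: 10**6 edges of 8 bytes, 4 KiB I/O blocks,
# 4 KiB indexed blocks, random access costing 20 block-transfer times.
full = cost_all(10**6, 8, 4096)        # 1953.125
few = cost_index(50, 4096, 4096, 20)   # 1050.0  -> index positioning wins
many = cost_index(500, 4096, 4096, 20) # 10500.0 -> full scan wins
```

The crossover depends on K: below roughly cost_all/(b/B + R) indexed blocks, index positioning is cheaper, which matches the claim that it suits a small active-vertex ratio.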
7. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements a method of external memory access according to any one of claims 1 to 6.
CN201811082365.8A 2018-09-17 2018-09-17 External memory access method of graph computing system based on index positioning Active CN109522428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811082365.8A CN109522428B (en) 2018-09-17 2018-09-17 External memory access method of graph computing system based on index positioning


Publications (2)

Publication Number Publication Date
CN109522428A CN109522428A (en) 2019-03-26
CN109522428B true CN109522428B (en) 2020-11-24

Family

ID=65771279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811082365.8A Active CN109522428B (en) 2018-09-17 2018-09-17 External memory access method of graph computing system based on index positioning

Country Status (1)

Country Link
CN (1) CN109522428B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523000B (en) * 2020-04-23 2023-06-23 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for importing data
CN112287182B (en) * 2020-10-30 2023-09-19 杭州海康威视数字技术股份有限公司 Graph data storage and processing method and device and computer storage medium
CN112799845A (en) * 2021-02-02 2021-05-14 深圳计算科学研究院 Graph algorithm parallel acceleration method and device based on GRAPE framework
CN112988064B (en) * 2021-02-09 2022-11-08 华中科技大学 Concurrent multitask-oriented disk graph processing method
CN113448964B (en) * 2021-06-29 2022-10-21 四川蜀天梦图数据科技有限公司 Hybrid storage method and device based on graph-KV
CN114282073B (en) * 2022-03-02 2022-07-15 支付宝(杭州)信息技术有限公司 Data storage method and device and data reading method and device
CN114756483A (en) * 2022-03-31 2022-07-15 深圳清华大学研究院 Subgraph segmentation optimization method based on inter-core storage access and application
CN115391341A (en) * 2022-08-23 2022-11-25 抖音视界有限公司 Distributed graph data processing system, method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122248A (en) * 2017-05-02 2017-09-01 华中科技大学 A kind of distributed figure processing method of storage optimization
CN107491495A (en) * 2017-07-25 2017-12-19 南京师范大学 Storage method of the preferential space-time trajectory data file of space attribute in auxiliary storage device
CN107957962A (en) * 2017-12-19 2018-04-24 重庆大学 It is a kind of to calculate efficient figure division methods and system towards big figure

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122248A (en) * 2017-05-02 2017-09-01 华中科技大学 A kind of distributed figure processing method of storage optimization
CN107491495A (en) * 2017-07-25 2017-12-19 南京师范大学 Storage method of the preferential space-time trajectory data file of space attribute in auxiliary storage device
CN107957962A (en) * 2017-12-19 2018-04-24 重庆大学 It is a kind of to calculate efficient figure division methods and system towards big figure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Large-scale graph data processing based on Multi-GPU platforms; Zhang Heng et al.; Journal of Computer Research and Development (计算机研究与发展); 2018-01-15; Vol. 55, No. 2, pp. 273-288 *

Also Published As

Publication number Publication date
CN109522428A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109522428B (en) External memory access method of graph computing system based on index positioning
CN106777351B (en) Computing system and its method are stored based on ART tree distributed system figure
US8140585B2 (en) Method and apparatus for partitioning and sorting a data set on a multi-processor system
CN107122248B (en) Storage optimization distributed graph processing method
CN112287182A (en) Graph data storage and processing method and device and computer storage medium
CN107015868B (en) Distributed parallel construction method of universal suffix tree
US8990492B1 (en) Increasing capacity in router forwarding tables
CN106599091B (en) RDF graph structure storage and index method based on key value storage
US11210343B2 (en) Graph data processing method and apparatus thereof
Jaiyeoba et al. Graphtinker: A high performance data structure for dynamic graph processing
CN104778077A (en) High-speed extranuclear graph processing method and system based on random and continuous disk access
KR20100004605A (en) Method for selecting node in network system and system thereof
US20230281157A1 (en) Post-exascale graph computing method, system, storage medium and electronic device thereof
CN110688055B (en) Data access method and system in large graph calculation
CN112699134A (en) Distributed graph database storage and query method based on graph subdivision
CN110222055B (en) Single-round kernel value maintenance method for multilateral updating under dynamic graph
US9507794B2 (en) Method and apparatus for distributed processing of file
Falchi et al. Nearest neighbor search in metric spaces through content-addressable networks
CN112988064B (en) Concurrent multitask-oriented disk graph processing method
CN112817982B (en) Dynamic power law graph storage method based on LSM tree
CN115391341A (en) Distributed graph data processing system, method, device, equipment and storage medium
CN110851178B (en) Inter-process program static analysis method based on distributed graph reachable computation
CN109240600B (en) Disk map processing method based on mixed updating strategy
CN110377601B (en) B-tree data structure-based MapReduce calculation process optimization method
CN111737347B (en) Method and device for sequentially segmenting data on Spark platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant