CN113726342B - Segmented difference compression and inert decompression method for large-scale graph iterative computation - Google Patents

Segmented difference compression and inert decompression method for large-scale graph iterative computation

Info

Publication number
CN113726342B
CN113726342B
Authority
CN
China
Prior art keywords
vertex
decompression
value
message
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111046999.XA
Other languages
Chinese (zh)
Other versions
CN113726342A (en)
Inventor
王志刚
尹怀胜
殷波
王宁
聂捷
魏志强
宋德海
田浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202111046999.XA priority Critical patent/CN113726342B/en
Publication of CN113726342A publication Critical patent/CN113726342A/en
Application granted granted Critical
Publication of CN113726342B publication Critical patent/CN113726342B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a segmented difference compression and inert decompression method for large-scale graph iterative computation, and relates to the technical field of large-scale graph data compression in high-frequency iterative computation. The method comprises a segmented difference compression process based on cluster distribution characteristics and an on-demand decompression process based on an inert decompression mechanism. Based on the characteristic that the outgoing edges of the graph have a clustered distribution, the outgoing-edge sequences of the adjacency list are segmented by cluster and compressed with segmented differences, and an inert decompression technique for on-demand decompression is designed on top of this compression, so that the decompression problem can be handled flexibly. For message transmission of a specific vertex, only the dictionary value of the segment and the corresponding difference value need to be found to complete decompression, without fully decompressing the edge table; when facing dynamically changing graph data, the method can directly update the changed vertex data without reordering and recompressing the whole graph.

Description

Segmented difference compression and inert decompression method for large-scale graph iterative computation
Technical Field
The invention relates to the technical field of large-scale graph data compression in high-frequency iterative computation, in particular to a segmented difference compression and inert decompression method for large-scale graph iterative computation.
Background
The graph is the most common data structure in computer science and is particularly suitable for expressing the different entities (vertices) and association relations (edges) of the real world. The complex network topologies derived from graphs mean that graph-related queries typically require iterative computation, i.e., repeated or recursive processing of vertices along edges until a convergence condition is met. Graph iterative analysis and computation has been widely applied to many fields of national life, such as military positioning and city planning (shortest-path computation, diameter estimation, BFS traversal), social network analysis (connected-component discovery), e-commerce transactions (maximum matching, recommendation), and the like.
In the big data age, with the rapid development of computer software and hardware, the scale of graph data (especially of the association relations, i.e., edges) grows ever larger, and the huge data volume far exceeds the memory capacity of a single machine, so that many graph-based algorithms cannot run efficiently, posing unprecedented challenges to the analysis and processing of large graphs. Against this background, how to effectively manage and analyze graph data has important research and practical value.
The three currently mainstream solutions for large-graph storage management are disk storage, distributed storage, and graph data compression. Disk storage places the data on non-memory media such as magnetic disks or solid-state disks, solving the problem that all data cannot be loaded into memory at once during large-scale graph processing; however, the access speed of disk storage is far lower than that of memory, graph analysis algorithms generally need to access vertices along edges with extremely poor spatio-temporal locality, so the I/O efficiency of graph algorithms under disk storage is low and the overall processing performance is unsatisfactory. Distributed storage, on the other hand, ensures storage scalability by combining the (memory, or internal and external memory) storage capacity of multiple physical machines, but a large graph must be divided into multiple subgraphs (i.e., graph partitioning); the free connections among vertices lead to strong coupling, making graph partitioning a classic NP-hard problem with no effective solution, and graph vertices spanning physical machines generate a large number of network communication operations during iterative computation, increasing query latency. Unlike the first two methods, graph data compression reduces space complexity by transforming the organizational structure of the graph, thereby reducing the storage-space requirement; while shrinking the data scale it can effectively avoid or mitigate the drawbacks of the other two schemes (for example, reducing the proportion of data stored on disk to optimize random I/O overhead, and reducing the number of physical machines required by distributed storage to optimize network communication overhead). Since the scale of graph data depends mainly on the edge data, graph data compression is likewise mainly oriented toward edge compression. However, existing graph data compression techniques do not fully consider the diverse requirements of graph iterative algorithms and the dynamically changing nature of real graphs, and are difficult to use with mainstream iterative computing system frameworks, so the applicability of compressed graphs in actual system processing is poor.
Local clustering characteristics of the graph
Recent studies have found that the real graphs used by computing systems such as Pregel are typically crawled via Breadth-First Search (BFS), and are therefore known as BFS-generated graphs. As a result, the distribution of vertices and edges of the crawled real graph exhibits a certain local clustering.
As shown in FIG. 1, a BFS search is performed with vertex No. 1 as the starting vertex, and the storage order of the vertices in the adjacency list is the BFS topological ordering of the vertices in the original graph. In particular, a given vertex exhibits a clustering feature because the neighbor vertices it follows (i.e., the destination vertices its edges point to) reflect its personal preferences. Taking FIG. 1 as an example, user No. 1 follows a certain entertainment star, say user No. 2; that star belongs to a particular category (for example, young and active stars), and user No. 1 also follows other stars of the same category, such as users No. 3 and No. 4. In the BFS topological storage mode, the storage positions of the vertices corresponding to users No. 2, No. 3 and No. 4 are adjacent, reflecting the locality of the vertex distribution. On the other hand, in the real world the outgoing edges of vertices typically have similarity, i.e., they point to the same vertices. Taking a social entertainment network as an example, this similarity manifests in the largely overlapping interests of different users, say they follow certain stars in common, have common friends, and so on. In FIG. 1, user No. 5 and user No. 1 both follow users No. 2, No. 3 and No. 4; because the storage positions of users No. 2, No. 3 and No. 4 are contiguous, the destination vertex numbers in the outgoing edges of vertex No. 5 show an obvious local clustering characteristic (namely, 2, 3 and 4 form one cluster, and 6 and 7 form another).
FIG. 2 shows a specific example of the local clustering characteristic of outgoing-edge data in the adjacency-list storage mode. The outgoing edges of vertex No. 1 present two clusters: neighbor vertices around 1000000 and neighbor vertices around 10000. Vertex No. 3 has two clusters, neighbors around vertex 1000 and neighbors around vertex 2000, whereas vertex No. 2 has only one cluster, neighbors around vertex 10000.
Compression coding technology based on continuous difference value calculation
Compression coding based on continuous difference calculation compresses the destination vertex ids in the outgoing edges by combining two techniques, difference compression and variable-length id encoding, thereby compressing the scale of the outgoing-edge data. The main idea is to store, for the ids of adjacent vertices in the outgoing edges, only their differences, and then to encode the vertex id values and differences with variable-length integer encoding so that each given id uses the minimum number of bytes.
The neighbor vertices of many vertices tend to have similar vertex ids. For such similar ids, storing only the incremental changes (i.e., differences) between them saves significant space compared with storing the ids themselves. The difference-based compression method therefore compresses the adjacency list using the differences of the vertex ids. Unlike the traditional adjacency-list storage mode, which stores the destination vertex ids of all outgoing edges, the difference method stores only the vertex id of the first adjacent point and, for the remaining adjacent points, stores only the difference from the preceding neighbor. For example, consider the adjacency list of vertex v1, adj(v1) = {v2, v3, v4}. Assuming the ids of v2, v3, v4 are 2, 5, 300 respectively, the adjacency list of vertex v1 will be stored as 2, 3, 295.
Variable-length id encoding. Current systems always store ids as integers of 4 or 8 bytes in length. However, if the id value is small, this wastes storage space. The technique instead uses variable-length integers [13] to encode each vertex id (including the differences), so that a given integer is encoded with the minimum number of bytes. The most significant bit of each packed byte is used to distinguish different ids, and the remaining 7 bits store the value. Following the example above, as shown in FIG. 3, the adjacency list of v1 is stored as "00000010 10000011 00000010 00100111". The first byte represents the id 2 of v2. The most significant bit of the second byte differs from that of the first byte; removing the most significant bit gives the increment value 3 between 2 and 5. The most significant bits of the third and fourth bytes are identical, meaning they encode the same id; removing the most significant bits gives 00000100100111, which represents the increment value 295 between 5 and 300.
In general, given a source vertex, compression coding based on continuous difference calculation re-encodes the destination vertices its outgoing edges point to. The basic steps are as follows, with a code sketch after the steps:
Step 1: store the vertex id value of the first neighbor point as-is; for the remaining neighbor points, store only the difference from the preceding neighbor.
Step 2: encode the first vertex id and the subsequent differences in the edge table using variable-length integers: a flag bit is added to the significant bits of the binary code every 7 bits, from right to left, to distinguish different vertex ids or differences.
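For illustration, the following is a minimal runnable sketch of Step 1 and Step 2 in Java. The class name VarintDeltaSketch is an illustrative assumption, the byte layout follows the example of FIG. 3, and the sketch presumes a pre-sorted edge table so that all differences are non-negative (precisely the limitation discussed below).

```java
import java.io.ByteArrayOutputStream;

// A sketch of continuous-difference compression with variable-length id
// encoding, assuming a pre-sorted edge table (all differences non-negative).
public class VarintDeltaSketch {

    // Encode a neighbor list as first-id + differences, packed into 7-bit
    // groups whose most significant (flag) bit flips for each new value.
    static byte[] encode(int[] neighbors) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int flag = 0, prev = 0;
        for (int i = 0; i < neighbors.length; i++) {
            int value = (i == 0) ? neighbors[i] : neighbors[i] - prev; // Step 1
            prev = neighbors[i];
            // Step 2: minimal number of 7-bit groups, most significant first
            int groups = Math.max(1, (38 - Integer.numberOfLeadingZeros(value)) / 7);
            for (int g = groups - 1; g >= 0; g--)
                out.write((flag << 7) | ((value >>> (7 * g)) & 0x7F));
            flag ^= 1; // the flag bit flips whenever a new id/difference starts
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // adj(v1) = {2, 5, 300} is stored as 2, 3, 295 and packed as in FIG. 3
        for (byte b : encode(new int[]{2, 5, 300}))
            System.out.print(String.format("%8s", Integer.toBinaryString(b & 0xFF))
                                   .replace(' ', '0') + " ");
        // prints: 00000010 10000011 00000010 00100111
    }
}
```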
There has been a great deal of research on graph data compression, which can be broadly divided into compression for specific types of graph data, such as web graphs and social network graphs, and compression for conventional graph storage structures such as adjacency matrices and adjacency lists. The related techniques achieve graph data compression under specific requirements, but their consideration of the structural characteristics of graphs is still insufficient, and their generality and compression efficiency remain to be improved. The latest graph compression algorithm described above still has the following two disadvantages:
First, the continuous-difference technique requires pre-sorting the outgoing edges, which is time-consuming. The core of the method is to compress the edge table of the graph data: by storing the increment values between vertex ids it reduces the size of the values to be stored, so that the variable-length encoding stage can compress the memory occupied by the data as much as possible. However, the neighbor vertices in the edge table of a vertex in a real graph are not arranged in increasing or decreasing order, so the differences can be both positive and negative. As can be seen from the variable-length integer encoding described above, the most significant bit of each byte is used as an identification bit to distinguish different vertex ids, and cannot also distinguish the sign of a difference. To ensure correctness, the edge table must be pre-sorted to eliminate negative differences. However, sorting the edge table of a large graph is very time-consuming, and because the graph data is updated dynamically, one cannot sort and compress once and then use the result many times to amortize the pre-sorting overhead.
Second, the decompression operation of continuous difference compression is inflexible and difficult to adapt to the diverse requirements of different graph algorithms and to the data operation flow of mainstream iterative computation frameworks. The final purpose of data compression is to facilitate processing and analysis, which involves the problem of data decompression. After continuous-difference compression, if the destination vertex id of a certain outgoing edge is used during computation, then because only increment values between adjacent vertex ids are stored, obtaining the real vertex id requires finding the first vertex value of the edge table containing the destination vertex and decompressing the subsequent vertices in the edge table one by one until the required destination vertex id is obtained. The worst-case complexity is proportional to the length of the edge table, i.e., all destination vertices in the edge table need to be decompressed. However, not all algorithms broadcast messages to all destination vertices of the outgoing edges. Taking the minimum-edge selection in minimum spanning forest as an example, messages are sent only to specific destination vertices selected according to edge weights; only that part of the edge table, perhaps even a single outgoing-edge vertex, needs to be decompressed. The continuous-difference algorithm, constrained by its compression mechanism, must decompress all outgoing edges, which wastes memory and degrades algorithm performance. Furthermore, current iterative systems, especially Pregel-like [14] distributed processing systems, typically provide a vertex-centric API, i.e., a compute(remsgs) programming interface for vertex updating in each iteration step. The user implements the update logic of the vertex value according to the messages (remsgs) sent to the vertex in the previous iteration step, and the updated vertex value is sent along the outgoing edges (edges) to the neighbor vertices in the form of messages, so that the neighbor vertices complete their updates in the next iteration step. Such systems, however, suffer from invalid decompression when iterating over compressed data. Since the parameter list of compute() contains the vertex's edge table edges, the compressed edge table is decompressed every time the method is called. Yet not every vertex that receives a message necessarily updates its value after compute() is invoked, i.e., becomes an active vertex. Since the value of an inactive vertex is unchanged, it generally does not need to send messages to its neighbors and thus does not need to access its edge table, so the decompression of its edge table is invalid. For example, the single-source shortest path algorithm updates a vertex's value only when a received message value is smaller than the current value; the vertex then becomes active and sends messages informing its outgoing neighbors to continue updating their shortest distances. Otherwise, the vertex does not need to access its outgoing edges, and the edge data edges decompressed when compute() was called is useless, causing invalid decompression and wasting computing resources.
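To make the first of these costs concrete, the sketch below (continuing the hypothetical VarintDeltaSketch above) recovers the id at position pos of a continuous-difference edge table; it must decode and sum every value stored before pos, so the cost grows with the edge-table prefix rather than being constant.

```java
// Continues the VarintDeltaSketch above: recovering one destination id
// requires decoding the whole prefix of the packed edge table.
public class VarintDeltaDecode {

    // Return the real id at position pos; cost is O(pos), not O(1).
    static int idAt(byte[] packed, int pos) {
        int id = 0, index = -1, value = 0, prevFlag = -1;
        for (byte b : packed) {
            int flag = (b >> 7) & 1;
            if (flag != prevFlag) {                   // a flag flip starts a new value
                if (index >= 0) {                     // finalize the previous value
                    id = (index == 0) ? value : id + value;
                    if (index == pos) return id;      // whole prefix already decoded
                }
                index++;
                value = 0;
                prevFlag = flag;
            }
            value = (value << 7) | (b & 0x7F);        // append the next 7-bit group
        }
        return (index == 0) ? value : id + value;     // finalize the last value
    }

    public static void main(String[] args) {
        byte[] packed = VarintDeltaSketch.encode(new int[]{2, 5, 300});
        System.out.println(idAt(packed, 2));          // prints 300 after decoding 2 and 5
    }
}
```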
Disclosure of Invention
Aiming at the above defects, the invention provides a segmented difference compression and inert decompression method for large-scale graph iterative computation that handles decompression flexibly: based on the fact that the outgoing edges of a graph have a clustered distribution, the outgoing-edge sequences of the adjacency list are segmented by cluster and compressed with segmented differences.
The invention adopts the following technical scheme:
the method for segmenting difference compression and inert decompression for large-scale graph iterative computation comprises a segmenting difference compression process based on clustering distribution characteristics and an on-demand decompression process based on an inert decompression mechanism;
the segmented difference compression process based on the cluster distribution characteristics comprises the following steps:
step1: given a difference threshold/cluster range denoted α, a dictionary Dic is maintained for each vertex's edge table i And differential sequence Dif i
Step2: vertex v i The 1 st adjacent vertex v in the edge table of (1) i,0 Writing dictionary Dic i Note temp=v i,0 And in a differential sequence Dif i A first element 0 is written in;
step3: computing v sequentially for subsequent adjacent vertices in the edge table i,j Temp, if |v i,j -temp| < α, then the calculated difference is written to the difference sequence Dif i In (3) continuing to calculateThe next adjacent vertex; otherwise, the adjacent vertex v i,j Writing dictionary Dic i Note temp=v i,j And in a differential sequence Dif i Writing a 0, and repeating the Step3;
step4: step2 and Step3 are executed on each vertex until the whole adjacency list is traversed;
the iterative process of the on-demand decompression process based on the inert decompression mechanism comprises an initial stage and a convergence stage: the number of active vertices increases rapidly in the initial stage and reaches a peak within a few iteration steps, after which the convergence stage is entered, in which the graph algorithm converges slowly over the subsequent iteration steps until it ends;
in the convergence stage, in each iteration step of the iterative computation process, the compute() algorithm is executed after the message is received, and after the updated value is calculated, whether the iteration is finished is judged; if so, the whole flow ends; otherwise, whether the vertex is activated as an active vertex is judged, and if so, the edge table or destination vertex of the vertex is decompressed according to the getEdges() algorithm, and then a new message is sent for the next iteration step.
Preferably, one iteration step comprises the following process:
the vertex receiving the message calls the compute() method to update its value, and then sends messages to its neighbor vertices according to the updated value;
the rewritten compute() method no longer requires the vertex's edge table in its parameter list, so the method can be called directly without decompressing the edge table of the vertex in advance;
after being called, the compute() method updates the value of the vertex according to the value-update logic provided by the user; not every vertex has its value updated. If the value of the vertex is updated, the vertex is activated as an active vertex and needs to send messages to its neighbor vertices; at this point the edge-table decompression method getEdges() must be called to decompress the edge table or destination vertex of the vertex, and after decompression is completed, messages are sent to the neighbor vertices in preparation for the next iteration;
the specific flow is as follows:
1. The current iteration step starts: the vertex v receives a message;
2. The compute() method is called directly, without decompressing the edge table;
3. The compute() method updates the vertex value according to the algorithm logic provided by the user;
4. Whether the value was updated is judged: if so, the vertex is activated as an active vertex; otherwise it is not activated;
5. The edge-table decompression method getEdges() is called for the active vertex, decompressing the edge table or the destination vertex;
6. Messages are sent along the decompressed edge table or to the destination vertex.
The invention has the following beneficial effects:
the method segments the edge sequence of the adjacent table according to clusters based on the characteristic that the edge of the graph has cluster distribution, takes a dictionary value for each segment and stores sequence difference values according to the dictionary value, thereby reducing the byte length required by data storage. As the positive and negative values of the increment are not required to be distinguished, the sorting operation of the side table is not required, and the preprocessing time is greatly saved.
On top of the segmented difference compression, an inert decompression technique for on-demand decompression is designed, which solves the decompression problem flexibly. For message transmission of a specific vertex, only the dictionary value of the segment and the corresponding difference value need to be found to complete decompression, without fully decompressing the edge table; when facing dynamically changing graph data, the changed vertex data can be updated directly without reordering and recompressing the whole graph.
For the problem of invalid decompression in iterative systems, the decompression time is allowed to be delayed during graph iterative computation: decompression is performed on demand only when a vertex is actually updated and its edge table needs to be accessed to send messages to destination vertices.
Drawings
FIG. 1 is a diagram of an example of the clustering characteristic of the graph;
FIG. 2 is a schematic diagram of the clustering characteristic of outgoing-edge numbers in the adjacency-list storage mode;
FIG. 3 is a schematic diagram of edge compression encoding based on continuous difference calculation;
FIG. 4 is a graph of the change in activated-vertex scale during iteration for different algorithms;
FIG. 5 is a flow chart of the segmented difference compression and inert decompression method for large-scale graph iterative computation.
Detailed Description
The following description of the embodiments of the invention will be given with reference to the accompanying drawings and examples:
the segmented difference compression and inert decompression method for the large-scale graph iterative computation is characterized by comprising a segmented difference compression process based on cluster distribution characteristics and an on-demand decompression process based on an inert decompression mechanism;
as shown in fig. 2, the output edges of the vertices in the real graph have a clustered distribution. The key point of the technology is that the characteristic of clustering distribution is utilized to adopt the concept of segmented compression, and the storage of the data is changed into the storage of difference values, so that the compression effect is achieved. The segment compression technique is illustrated by the following example:
given that graph g= (V, E), using adjacency table to represent G, the process of compressing the graph is the process of compressing the edge tables of all vertices. The table sequence corresponding to a vertex in the graph G (i.e., the destination vertex ID sequence) is assumed to be represented by four bytes Int data, and the specific contents are as follows:
1000000,1000100,1000099,999999,10100,10000,10050,10150,10200,20000,20100
Observing this partial adjacency-list sequence, it can simply be divided into three clusters (1000000, 1000100, 1000099, 999999; 10100, 10000, 10050, 10150, 10200; 20000, 20100). For each cluster of destination vertices, we use the first value as the reference, denoted dic, stored as an int type; for the remaining vertices we store the difference between the vertex and dic as a char type (if the difference exceeds the representable range of the char type, the vertex is assigned to another cluster). For the first cluster, the compressed storage content is <1000000; 100, 99, -1>, which compresses data that would otherwise require 16 bytes into 7 bytes. Following this example, for the edge table corresponding to each vertex, the rule of segmented difference-sequence compression can be summarized as follows (a code sketch of these steps is given after Step 4):
step1: given a difference threshold/cluster range denoted α=127 (self-set, this example set as the absolute value of the representation range of the char-type data), a dictionary Dic is maintained for each vertex's edge table i And differential sequence Dif i
Step2: vertex v i The 1 st adjacent vertex v in the edge table of (1) i,0 Writing dictionary Dic i Note temp=v i,0 And in a differential sequence Dif i The first element 0 is written.
Step3: computing v sequentially for subsequent adjacent vertices in the edge table i,j Temp, if |v i,j -temp| < α, then the calculated difference is written to the difference sequence Dif i Continuing to calculate the next adjacent vertex; otherwise, the adjacent vertex v i,j Writing dictionary Dic i Note temp=v i,j And in a differential sequence Dif i A0 is written in, and Step3 is repeated.
Step4: step2, step3 are performed for each vertex until the complete adjacency list is traversed.
By the above rule, the adjacency list can be represented by two structures, the dictionary set Dic and the difference sequence set Dif, i.e., <Dic, Dif>. Table 1.B gives the dictionary set Dic and the difference sequence set Dif corresponding to part of the sequences in the adjacency list of Table 1.A. It can be seen that the edge-table sequence, which originally required 144 bytes of storage space, requires only 64 bytes after compression.
TABLE 1.A
TABLE 1.B
On-demand decompression technology based on inert decompression mechanism
Graph data compression significantly reduces the storage-space requirement, but because the actual iterative analysis needs the real vertex ids, a decompression process for the compressed data is involved, and the efficiency of that process directly affects the efficiency of iterative processing on the compressed graph. For minimum-spanning-forest-type graph iterative algorithms, which only need to send messages to specific vertices in the edge table, the segmented difference compression technique allows the storage position of the required destination vertex to be filtered out according to the screening condition (for example, the minimum spanning forest algorithm screens by edge weight); the segment containing it is then located, and the required destination vertex id is obtained from the dictionary value and difference value of that segment, without decompressing all destination vertices one after another.
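A minimal sketch of this single-vertex lookup follows, assuming the <Dic, Dif> layout of the SegmentCompressor sketch above and that a 0 in Dif appears only as a segment marker; a production implementation would additionally keep per-segment offsets so the segment could be located without scanning Dif.

```java
import java.util.List;

// Continues the SegmentCompressor sketch: recover a single destination id
// without reconstructing any other id of the edge table.
public class OnDemandLookup {

    // The id stored at position pos = dictionary value of its segment + difference.
    static int decompressAt(List<Integer> dic, List<Byte> dif, int pos) {
        int segment = -1;
        for (int k = 0; k <= pos; k++)
            if (dif.get(k) == 0) segment++;     // each 0 in Dif opens the next segment
        return dic.get(segment) + dif.get(pos); // one dictionary value + one difference
    }
}
```

Applied to the demo data above, decompressAt(c.dic, c.dif, 7) finds two segment markers before position 7 and returns 10100 + 50 = 10150; nothing else in the edge table is decompressed.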
For the Pregel-type graph iterative computing systems widely used at present, when a shortest-path-type algorithm is processed, the invocation of the compute() function is not equivalent to sending new messages: a vertex receives the messages of the previous iteration step and tries to update its value, but may in the end not need to update it (not activated), so the edge-table decompression before the compute() call becomes invalid decompression. Conversely, when the PageRank algorithm is processed, a message is necessarily broadcast along the outgoing edges after compute() is called, i.e., all vertices are in the activated state in all iteration steps, and the decompression is then effective. This reflects the different decompression requirements of different algorithms.
As shown in fig. 4, graph algorithms can be divided into two classes according to how the scale of activated vertices changes during iteration: the straight-line type, in which the update of a vertex always depends on the values of all its neighbor vertices, so all vertices are active in every iteration step and the activated-vertex scale does not change as the iteration steps increase (such algorithms include PageRank, community discovery based on label propagation, and the like); and the parabolic type, in which the time at which a vertex participates in computation is closely tied to the traversal process of the graph, so the activated-vertex scale first grows and then shrinks, as in breadth-first search, single-source shortest path, and connected-component computation.
For the parabolic graph algorithms, the iteration process divides into an initial stage and a convergence stage: the number of active vertices grows rapidly in the initial stage, reaching a peak within a few iteration steps, and then the convergence stage is entered, in which the graph algorithm converges slowly over the subsequent iteration steps until it ends. For such algorithms, e.g., the single-source shortest path algorithm, the vertex values have substantially stabilized during the convergence phase, and only a small portion of vertex values are still updated. If, at this time, a vertex sends messages along all its outgoing edges and the system is designed according to the traditional graph-iteration framework, all its outgoing neighbors will receive the messages and call the compute() algorithm, which means the edge tables of all those neighbors must be decompressed in advance for compute() to use. However, messages newly received in the convergence stage hardly ever cause updates at the destination vertices, so only a very small number of the vertex's outgoing neighbors may be activated as active vertices, and the decompression of the edge tables of the other neighbor vertices is entirely unnecessary, greatly wasting graph-processing performance. The core idea of the inert decompression technique is that, during graph iterative computation, the decompression operation is delayed when a vertex invokes the compute() method; only after the vertex's value is updated and the vertex is activated as an active vertex is the edge table decompressed to send new messages in preparation for the next iteration.
The iterative process of the on-demand decompression process based on the inert decompression mechanism comprises an initial stage and a convergence stage: the number of active vertices increases rapidly in the initial stage and reaches a peak within a few iteration steps, after which the convergence stage is entered, in which the graph algorithm converges slowly over the subsequent iteration steps until it ends.
In each iteration step of the iterative calculation process, the rewritten compute() algorithm is executed directly after the vertex receives a message, and after the updated value is calculated, whether the iteration is finished is judged; if so, the whole flow ends. Otherwise, whether the vertex is activated as an active vertex is judged, and if so, the edge table or destination vertex of the vertex is decompressed according to the getEdges() algorithm, and then a new message is sent for the next iteration step.
The inert decompression mechanism rewrites the compute() method and adds a getEdges() method for decompressing the edge table, thereby avoiding decompression of the edge table before the call; correspondingly, only after the value has been calculated and updated and the vertex is judged to actually be activated as an active vertex is the getEdges() method called to decompress the vertex's edge table and obtain the edge-table information, after which messages are sent along it. Clearly, under this design the corresponding data is decompressed only when the edge-table information is really needed, effectively avoiding invalid decompression. The design is also compatible with straight-line-type algorithms; the only difference is that the decompression time changes from "before calling the compute() function" to "when calling the getEdges() function", and programmers applying such algorithms do not need to change their computation logic.
An iteration step comprises the following steps:
the vertex receiving the message calls the compute() method to update its value, and then sends messages to its neighbor vertices according to the updated value;
the rewritten compute() method no longer requires the vertex's edge table in its parameter list, so the method can be called directly without decompressing the edge table of the vertex in advance;
after being called, the compute() method updates the value of the vertex according to the value-update logic provided by the user; not every vertex has its value updated. If the value of the vertex is updated, the vertex is activated as an active vertex and needs to send messages to its neighbor vertices; at this point the edge-table decompression method getEdges() must be called to decompress the edge table or destination vertex of the vertex, and after decompression is completed, messages are sent to the neighbor vertices in preparation for the next iteration;
The specific flow is as follows (a code sketch of one iteration step follows the list):
1. The current iteration step starts: the vertex v receives a message;
2. The compute() method is called directly, without decompressing the edge table;
3. The compute() method updates the vertex value according to the algorithm logic provided by the user;
4. Whether the value was updated is judged: if so, the vertex is activated as an active vertex; otherwise it is not activated;
5. The edge-table decompression method getEdges() is called for the active vertex, decompressing the edge table or the destination vertex;
6. Messages are sent along the decompressed edge table or to the destination vertex.
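The six-step flow can be sketched as follows. The LazyVertexSketch class, its single-source-shortest-path-style update rule, and the stubbed getEdges() body are illustrative assumptions rather than the patent's concrete implementation; the point is only that compute() carries no edge table and that getEdges() runs solely for activated vertices.

```java
import java.util.Arrays;
import java.util.List;

// A sketch of one iteration step under inert (lazy) decompression.
public class LazyVertexSketch {
    int value = Integer.MAX_VALUE;       // e.g. the current shortest distance
    boolean active = false;
    int decompressCalls = 0;             // counts real decompressions for the demo

    // Rewritten compute(): the parameter list carries no edge table, so calling
    // it does not force the compressed edge table to be decompressed.
    void compute(List<Integer> messages) {
        for (int msg : messages)
            if (msg < value) { value = msg; active = true; } // user update logic
    }

    // getEdges(): decompress the edge table only when it is actually needed
    // (real decompression would read the <Dic, Dif> structures; stubbed here).
    List<Long> getEdges() {
        decompressCalls++;
        return Arrays.asList(2L, 3L, 4L);
    }

    // One iteration step: steps 1-6 of the flow above.
    void step(List<Integer> messages) {
        compute(messages);               // steps 1-4: receive messages, try to update
        if (active) {                    // step 5: only an active vertex decompresses
            for (long target : getEdges())
                System.out.println("msg " + (value + 1) + " -> v" + target); // step 6
            active = false;
        }                                // inactive vertices never decompress
    }

    public static void main(String[] args) {
        LazyVertexSketch v = new LazyVertexSketch();
        v.step(Arrays.asList(7));        // value updates -> edge table decompressed
        v.step(Arrays.asList(9));        // no update -> no decompression at all
        System.out.println("decompressions: " + v.decompressCalls); // prints 1
    }
}
```

Running main() shows one decompression for the activating message and none for the non-activating one, which is exactly the invalid decompression the mechanism avoids.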
Based on the techniques presented above, FIG. 5 shows the overall implementation framework of the segmented difference compression and inert decompression techniques for large-scale graph iterative computation. Before the iterative computation starts, the graph data is loaded and compressed by the segmented difference compression technique. In each iteration step of the iterative computation process, the compute() algorithm is executed after a message is received, and after the updated value is calculated, whether the iteration is finished is judged; if so, the whole flow ends. Otherwise, whether the vertex is activated as an active vertex is judged; if activated, the edge table or destination vertex of the vertex is decompressed according to the corresponding algorithm type, and then a new message is sent for the next iteration step.
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed, and that the invention is intended to cover modifications, adaptations, additions and alternatives falling within its spirit and scope.

Claims (1)

1. A segmented difference compression and inert decompression method for large-scale graph iterative computation, characterized by comprising a segmented difference compression process based on cluster distribution characteristics and an on-demand decompression process based on an inert decompression mechanism;
the segmented difference compression process based on the cluster distribution characteristics comprises the following steps:
step1: given a difference threshold/cluster range denoted α, a dictionary Dic is maintained for each vertex's edge table i And differential sequence Dif i
Step2: vertex v i The 1 st adjacent vertex v in the edge table of (1) i,0 Writing dictionary Dic i Note temp=v i,0 And in a differential sequence Dif i A first element 0 is written in;
step3: computing v sequentially for subsequent adjacent vertices in the edge table i,j -temp, if |v i,j -temp| < α, then the calculated difference is written to the difference sequence Dif i Continuing to calculate the next adjacent vertex; otherwise, the adjacent vertex v i,j Writing dictionary Dic i Note temp=v i,j And in a differential sequence Dif i Writing a 0, and repeating the Step3;
step4: step2 and Step3 are executed on each vertex until the whole adjacency list is traversed;
the iterative process of the on-demand decompression process based on the inert decompression mechanism comprises an initial stage and a convergence stage: the number of active vertices increases rapidly in the initial stage and reaches a peak within a few iteration steps, after which the convergence stage is entered, in which the graph algorithm converges slowly over the subsequent iteration steps until it ends;
in the convergence stage, in each iteration step of the iterative computation process, the rewritten compute() algorithm is executed after the vertex receives a message, and after the updated value is calculated, whether the iteration is finished is judged; if so, the whole flow ends; otherwise, whether the vertex is activated as an active vertex is judged, and if so, the edge table or destination vertex of the vertex is decompressed according to the getEdges() algorithm, a new message is then sent, and the next iteration step is carried out;
an iteration step comprises the following steps:
the vertex receiving the message calls the compute() method to update its value, and then sends messages to its neighbor vertices according to the updated value;
the rewritten compute() method no longer requires the vertex's edge table in its parameter list, so the method can be called directly without decompressing the edge table of the vertex in advance;
after being called, the compute() method updates the value of the vertex according to the value-update logic provided by the user; not every vertex has its value updated. If the value of the vertex is updated, the vertex is activated as an active vertex and needs to send messages to its neighbor vertices; at this point the edge-table decompression method getEdges() must be called to decompress the edge table or destination vertex of the vertex, and after decompression is completed, messages are sent to the neighbor vertices in preparation for the next iteration;
the specific flow is as follows:
1. The current iteration step starts: the vertex v receives a message;
2. The compute() method is called directly, without decompressing the edge table;
3. The compute() method updates the vertex value according to the algorithm logic provided by the user;
4. Whether the value was updated is judged: if so, the vertex is activated as an active vertex; otherwise it is not activated;
5. The edge-table decompression method getEdges() is called for the active vertex, decompressing the edge table or the destination vertex;
6. Messages are sent along the decompressed edge table or to the destination vertex.
CN202111046999.XA 2021-09-08 2021-09-08 Segmented difference compression and inert decompression method for large-scale graph iterative computation Active CN113726342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111046999.XA CN113726342B (en) 2021-09-08 2021-09-08 Segmented difference compression and inert decompression method for large-scale graph iterative computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111046999.XA CN113726342B (en) 2021-09-08 2021-09-08 Segmented difference compression and inert decompression method for large-scale graph iterative computation

Publications (2)

Publication Number Publication Date
CN113726342A CN113726342A (en) 2021-11-30
CN113726342B true CN113726342B (en) 2023-11-07

Family

ID=78682368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111046999.XA Active CN113726342B (en) 2021-09-08 2021-09-08 Segmented difference compression and inert decompression method for large-scale graph iterative computation

Country Status (1)

Country Link
CN (1) CN113726342B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914556A (en) * 2014-04-15 2014-07-09 西北工业大学 Large-scale graph data processing method
CN110737804A (en) * 2019-09-20 2020-01-31 华中科技大学 graph processing memory access optimization method and system based on activity level layout
CN111309976A (en) * 2020-02-24 2020-06-19 北京工业大学 GraphX data caching method for convergence graph application

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539534B (en) * 2020-05-27 2023-03-21 深圳大学 General distributed graph processing method and system based on reinforcement learning


Also Published As

Publication number Publication date
CN113726342A (en) 2021-11-30

Similar Documents

Publication Publication Date Title
CN109254733B (en) Method, device and system for storing data
US9235651B2 (en) Data retrieval apparatus, data storage method and data retrieval method
CN110134714B (en) Distributed computing framework cache index method suitable for big data iterative computation
CN109033303B (en) Large-scale knowledge graph fusion method based on reduction anchor points
CN103997346B (en) Data matching method and device based on assembly line
CN104283567A (en) Method for compressing or decompressing name data, and equipment thereof
CN110719106B (en) Social network graph compression method and system based on node classification and sorting
JP2022020070A (en) Information processing, information recommendation method and apparatus, electronic device and storage media
CN112667860A (en) Sub-graph matching method, device, equipment and storage medium
CN111898698A (en) Object processing method and device, storage medium and electronic equipment
CN111083933A (en) Data storage and acquisition method and device
CN109302449B (en) Data writing method, data reading device and server
CN108334532B (en) Spark-based Eclat parallelization method, system and device
US20200104425A1 (en) Techniques for lossless and lossy large-scale graph summarization
CN113726342B (en) Segmented difference compression and inert decompression method for large-scale graph iterative computation
CN117472959A (en) Gskip list-based block chain efficient query system and dynamic construction method
CN115905168B (en) Self-adaptive compression method and device based on database, equipment and storage medium
CN103761298A (en) Distributed-architecture-based entity matching method
CN112950451B (en) GPU-based maximum k-tress discovery algorithm
CN114647764A (en) Graph structure query method and device and storage medium
Rasel et al. Summarized bit batch-based triangle listing in massive graphs
CN112100446A (en) Search method, readable storage medium and electronic device
WO2017186049A1 (en) Method and device for information processing
US20240152334A1 (en) Method and system for implementing binary arrays
CN114095036B (en) Code length generating device for dynamic Huffman coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant