CN108959653B

CN108959653B - Graph data representation method based on dense grid recombination and K²-tree

Info

Publication number: CN108959653B
Application number: CN201810884864.2A
Authority: CN
Inventors: 李凤英; 张琪; 常亮; 古天龙
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2018-08-06
Filing date: 2018-08-06
Publication date: 2021-06-01
Anticipated expiration: 2038-08-06
Also published as: CN108959653A

Abstract

The invention discloses a graph data representation method based on dense grid recombination and K²-tree. The method first abstracts the graph data into an adjacency matrix, then divides the adjacency matrix into grids and recombines them, which greatly increases the density of 1 values in each matrix and improves storage efficiency. Compared with the original adjacency matrix, the matrix corresponding to each cluster is much smaller, which shortens the time needed to traverse the K²-tree from the top down to the leaf level and thereby improves query efficiency.

Description

Graph data representation method based on dense grid recombination and K²-tree

Technical Field

The invention relates to the technical field of graph data processing, in particular to a graph data representation method based on dense grid recombination and K²-tree.

Background

The World Wide Web embodies a binary relation between pages that can be represented efficiently by a directed graph. Typically, each web page corresponds to a graph node and each link corresponds to a graph edge. Such directed graphs are called Web graphs. By operating on the Web graph, one can determine the connectivity between two pages, find all pages that a given page links to, and find which pages point to a given page. With the rapid growth of the Internet and the World Wide Web, the manner and speed at which Web graphs generate and accumulate large amounts of data, and the question of how to analyze and use that data, have become both a key opportunity and a serious challenge in many fields.

According to the 37th Statistical Report on Internet Development in China issued by the China Internet Network Information Center (CNNIC), the number of Chinese web pages reached 212.3 billion in 2015, and the number of links in the Web graph exceeds 1018; if an adjacency matrix is used to store the Web graph, approximately 40 TB of storage is required. The explosive growth in the size of Web graphs has left traditional storage structures unable to meet storage and computing requirements. When a traditional structure such as an adjacency matrix or an adjacency list is used for representation and storage, the Web graph cannot be held entirely in main memory for computation, which severely limits the efficiency of analyzing and computing on it.

A K²-tree is a compact data structure that represents the binary relation of a Web graph by exploiting the large empty regions of its adjacency matrix. The main idea is to represent the adjacency matrix with a K²-ary tree. More precisely, during construction the adjacency matrix is divided into K² submatrices of equal size, and a child node of the root is generated for each of them. If the elements of a submatrix are all 0, the corresponding child node of the K²-tree is 0; if the submatrix contains at least one 1, the child node is 1. Submatrices that contain a 1 are recursively divided into K² equal parts until the submatrix size is 1 × 1. Throughout this partitioning, child nodes are ordered from top to bottom and from left to right. A K²-tree is stored as two bit vectors T and L: T stores all nodes except those of the bottom level, in top-to-bottom, left-to-right order, and L stores the leaf nodes of the bottom level. A top-down traversal of the K²-tree efficiently supports forward and backward navigation, as well as checking whether a single link exists and range queries.
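As a minimal sketch of the construction just described, the following Python function builds the bit vectors T and L level by level. The function name `build_k2tree`, the list-of-lists matrix format, and the requirement that the matrix side already be a power of K are illustrative assumptions, not part of the patent text.

```python
def build_k2tree(matrix, k=2):
    """Return (T, L) for a square 0/1 matrix whose side is k**h with h >= 1, k >= 2."""
    T, L = [], []
    current = [(0, 0, len(matrix))]  # (top row, left column, side) of blocks to split
    while current:
        size = current[0][2] // k    # side of the sub-blocks produced at this level
        bits, nxt = [], []
        for r0, c0, _ in current:
            for i in range(k):       # top to bottom
                for j in range(k):   # left to right
                    r, c = r0 + i * size, c0 + j * size
                    has_one = any(matrix[a][b]
                                  for a in range(r, r + size)
                                  for b in range(c, c + size))
                    bits.append(1 if has_one else 0)
                    if has_one and size > 1:
                        nxt.append((r, c, size))  # only non-empty blocks are split further
        # 1 x 1 blocks are the bottom level and go to L; every other level goes to T
        (L if size == 1 else T).extend(bits)
        current = nxt
    return T, L


# Hypothetical 4 x 4 adjacency matrix used only for illustration
demo = [[0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 1]]
T, L = build_k2tree(demo)
print(T, L)
```

For this matrix only the top-left and bottom-right quadrants contain a 1, so only those two quadrants contribute leaves to L; the two empty quadrants cost one bit each in T.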

Although the K²-tree is an excellent data structure that represents graph data compactly, it still has some drawbacks: tightly coupled nodes may be separated, which increases the time cost of operations on the graph; large regions of the matrix filled with 0 may be split across different submatrices and therefore not fully compressed; and the height of the K²-tree strongly affects the time cost of operations on the graph. Thus, when representing large-scale graph data, the K²-tree does not adequately consider the internal structural characteristics of the graph data or query efficiency, and leaves considerable room for improvement in both compactness and query efficiency.

Disclosure of Invention

To address the problem that the K²-tree, when representing large-scale graph data, does not adequately consider the internal structural characteristics of the graph data or query efficiency, and therefore leaves considerable room for improvement in compactness and query efficiency, the invention provides a graph data representation method based on dense grid recombination and K²-tree.

To solve the above problem, the invention is realized by the following technical solution:

The graph data representation method based on dense grid recombination and K²-tree specifically comprises the following steps:

Step 1, abstracting the graph data into an adjacency matrix, and using a grid clustering algorithm to divide the adjacency matrix with two different grid parameters, one large and one small;

Step 2, extracting all the small grids contained in each large grid and storing them as a new recombination matrix;

Step 3, representing each recombination matrix from step 2 as a K²-tree;

Step 4, querying the node of interest in each of the K²-trees built in step 3, and merging the query results to obtain the final result;

where K is a positive integer.

The specific process of the step 1 is as follows:

Step 1.1, abstracting the graph data into an adjacency matrix;

Step 1.2, setting the parameter K, which serves as both the K²-tree parameter and the small grid parameter, and the large grid parameter Size;

Step 1.3, dividing the adjacency matrix into matrices of size K × K according to the small grid parameter K, extracting the resulting matrices that contain at least one 1, and storing them in the small grid matrix set SET1;

Step 1.4, dividing the adjacency matrix into matrices of size Size × Size according to the large grid parameter Size, extracting the resulting matrices that contain at least one 1, and storing them in the large grid matrix set SET2.
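Steps 1.3 and 1.4 can be sketched with a single helper that scans the adjacency matrix in grid cells and keeps the non-empty ones. This is only an illustration: the helper name `nonempty_cells`, the 8-node example graph, and the choice to record 1-based position codes (row block, column block) instead of copying the sub-matrices are assumptions made for the sketch.

```python
def nonempty_cells(matrix, cell):
    """Return 1-based (row block, column block) codes of cell x cell blocks containing a 1."""
    n = len(matrix)
    codes = []
    for r0 in range(0, n, cell):          # top to bottom
        for c0 in range(0, n, cell):      # left to right
            block = (matrix[r][c]
                     for r in range(r0, min(r0 + cell, n))
                     for c in range(c0, min(c0 + cell, n)))
            if any(block):
                codes.append((r0 // cell + 1, c0 // cell + 1))
    return codes


# Hypothetical directed graph with 8 nodes and 3 edges
edges = [(0, 1), (1, 7), (5, 2)]
adj = [[0] * 8 for _ in range(8)]
for u, v in edges:
    adj[u][v] = 1

SET1 = nonempty_cells(adj, 2)  # small grid parameter K = 2
SET2 = nonempty_cells(adj, 4)  # large grid parameter Size = 4
print(SET1)
print(SET2)
```

Only 3 of the 16 small cells and 3 of the 4 large cells are kept, which is what makes the later recombination worthwhile on sparse matrices.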

The specific process of the step 2 is as follows:

Step 2.1, selecting any matrix in the large grid matrix set SET2, extracting in order the matrices of the small grid matrix set SET1 that it contains, and storing them in a temporary matrix set SET3;

Step 2.2, arranging the matrices in the temporary matrix set SET3, from left to right and from top to bottom, into a multi-row matrix queue with x matrices per row, filling the positions that hold no matrix with 0, and thereby recombining them into a recombination matrix with Kx rows and Kx columns, where x = ⌈√a⌉ and a is the number of matrices in the temporary matrix set SET3;

Step 2.3, storing the recombination matrix in the recombination matrix set SET4 and emptying the temporary matrix set SET3;

Step 2.4, repeating steps 2.1-2.3 until all the matrices in the large grid matrix set SET2 have been recombined and stored in the recombination matrix set SET4.
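The recombination of step 2.2 can be sketched as follows. The sketch takes x as the smallest integer with x · x ≥ a, an assumption that agrees with the worked example later in the description (3 small matrices with K = 2 give a 4 × 4 result). The function name `recombine` and the example graph are hypothetical, and the single large grid is assumed to cover the whole 8 × 8 matrix.

```python
import math


def recombine(matrix, codes, k):
    """Pack the k x k blocks named by 1-based `codes` into a dense (k*x) x (k*x) matrix."""
    a = len(codes)
    x = math.isqrt(a - 1) + 1 if a else 0   # smallest x with x * x >= a
    side = k * x
    out = [[0] * side for _ in range(side)]  # unused positions stay 0
    for idx, (m, n) in enumerate(codes):
        block_row, block_col = divmod(idx, x)  # left to right, then top to bottom
        for i in range(k):
            for j in range(k):
                out[block_row * k + i][block_col * k + j] = \
                    matrix[(m - 1) * k + i][(n - 1) * k + j]
    return out


# Hypothetical 8-node graph whose non-empty 2 x 2 cells are (1,1), (1,4) and (3,2)
adj = [[0] * 8 for _ in range(8)]
for u, v in [(0, 1), (1, 7), (5, 2)]:
    adj[u][v] = 1

packed = recombine(adj, [(1, 1), (1, 4), (3, 2)], k=2)
for row in packed:
    print(row)
```

The three edges that were spread over a 64-cell matrix now sit in a 16-cell matrix, so the K²-tree built on it is one level shorter.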

The specific process of the step 3 is as follows:

Step 3.1, selecting any recombination matrix in the recombination matrix set SET4; when its number of rows and columns is not a power of K, padding 0 on its right and bottom sides to expand its scale to a power of K; then dividing it into K² first-level submatrices of equal size, each with the same number of rows and columns;

Step 3.2, for the K²-tree corresponding to each recombination matrix, taking the matrix before division, i.e. the recombination matrix, as the root node and the K² matrices after division as the child nodes of the root node; if the elements of a divided matrix are all 0, the corresponding child node of the K²-tree is 0; if a divided matrix contains at least one 1, the corresponding child node of the K²-tree is 1; in this division process, all child nodes of the K²-tree are ordered from top to bottom and from left to right;

Step 3.3, further dividing the matrices obtained by the previous division into K² matrices of equal size, each with the same number of rows and columns;

Step 3.4, taking the matrices divided in step 3.3 as the child nodes of the matrix before division; if the elements of a divided matrix are all 0, the corresponding child node of the K²-tree is 0; if a divided matrix contains at least one 1, the corresponding child node of the K²-tree is 1; in this division process, all child nodes of the K²-tree are ordered from top to bottom and from left to right;

Step 3.5, repeating steps 3.3-3.4, recursively dividing the matrices obtained by the previous division until the size of the divided matrices is 1 × 1;

Step 3.6, once all recombination matrices in the recombination matrix set SET4 have been represented as K²-trees, for each K²-tree using a bit vector T to store, in top-to-bottom and left-to-right order, the values of all nodes except those of the bottom level, and using a bit vector L to store the values of the bottom-level leaf nodes, which yields a set of vector pairs each comprising a bit vector T and a bit vector L.
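The padding in step 3.1 matters because a recombination matrix of side Kx is not always a power of K (for example, K = 2 and x = 3 give a 6 × 6 matrix). A minimal sketch of that padding, with the hypothetical helper name `pad_to_power` and the assumption K ≥ 2:

```python
def pad_to_power(matrix, k):
    """Pad a square 0/1 matrix with zeros on the right and bottom up to the next power of k."""
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(matrix)
    size = 1
    while size < n:                 # smallest power of k that is >= n
        size *= k
    padded = [row + [0] * (size - len(row)) for row in matrix]   # pad on the right
    padded += [[0] * size for _ in range(size - n)]              # pad at the bottom
    return padded


small = [[1, 0, 1],
         [0, 0, 0],
         [0, 1, 0]]
for row in pad_to_power(small, 2):
    print(row)
```

Because the padding is all zeros, it adds only 0 bits to the upper levels of the tree and never adds leaves.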

The specific process of the step 4 is as follows:

Step 4.1, given a node q, selecting the recombination matrices in the recombination matrix set SET4 that contain the node and storing them in the query matrix set SET5;

Step 4.2, selecting any query matrix in the query matrix set SET5, finding all matrices in it that belong to the small grid matrix set SET1 and contain the node q, and calculating, from the positions of these matrices within the recombination matrix, the new row number q' of the node q in the query matrix, the K column numbers to be queried, and the column offset t;

Step 4.3, for one of the new row numbers q' calculated in step 4.2, performing a top-down traversal according to the traversal rule of the K²-tree and querying the values at the K column positions to be queried for the new row number q':

if the value at a column number is 1 and the column number is even, assigning the column offset t corresponding to the new row number to the real neighbor node column number p, and storing p in the neighbor set SET6;

if the value at a column number is 1 and the column number is odd, adding 1 to the column offset t corresponding to the new row number, assigning the result to the real neighbor node column number p, and storing p in the neighbor set SET6;

if the value at the column number is 0, no processing is performed;

Step 4.4, repeating step 4.3 until all neighbor nodes of the node q within the query matrix have been found;

Step 4.5, repeating steps 4.2-4.4 until all neighbor nodes of the node q have been found; the nodes stored in the neighbor set SET6 are then all the neighbor nodes of the node q.
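Step 4.3 can be sketched as a cell lookup on the bit vectors followed by the offset rule. The lookup below uses the standard K²-tree navigation rule (the children of the node at position pos in T start at rank1(T, pos) · K² in the concatenation of T and L); the patent text itself only says "the traversal rule of the K²-tree". Writing the even/odd rule as `t + col % k` is an assumption that matches the text for K = 2. The function names and the example vectors are hypothetical.

```python
def k2_get(T, L, n, k, row, col):
    """Value of cell (row, col) of an n x n matrix stored as K2-tree bit vectors T and L."""
    size = n // k
    pos = (row // size) * k + (col // size)        # position among the root's children
    while True:
        bit = T[pos] if pos < len(T) else L[pos - len(T)]
        if bit == 0:
            return 0                                # whole block is empty
        if size == 1:
            return 1                                # reached a leaf holding a 1
        row, col = row % size, col % size           # coordinates inside the block
        size //= k
        # rank1 is computed naively here; real implementations precompute it
        pos = sum(T[:pos + 1]) * k * k + (row // size) * k + (col // size)


def neighbors_in_matrix(T, L, n, k, q_new, cols, t):
    """Real neighbor numbers found in row q_new at the given columns (step 4.3)."""
    return [t + col % k for col in cols if k2_get(T, L, n, k, q_new, col)]


# Bit vectors of the hypothetical 4 x 4 matrix
# [[0,1,0,0],[0,0,0,0],[0,0,1,1],[0,0,0,1]] with K = 2
T = [1, 0, 0, 1]
L = [0, 1, 0, 0, 1, 1, 0, 1]
print(neighbors_in_matrix(T, L, 4, 2, q_new=2, cols=[2, 3], t=14))
print(neighbors_in_matrix(T, L, 4, 2, q_new=3, cols=[2, 3], t=6))
```

Each lookup visits one node per tree level, so a smaller recombination matrix directly shortens the query path.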

Compared with the prior art, the method abstracts the graph data into an adjacency matrix and then divides and recombines the matrix in grids, which greatly increases the density of 1 values in each matrix and improves storage efficiency. Compared with the original adjacency matrix, the matrix corresponding to each cluster is much smaller, which shortens the time needed to traverse the K²-tree from the top down to the leaf level and thereby improves query efficiency.

Drawings

FIG. 1 shows the divided adjacency matrix of a graph with 16 nodes.

FIG. 2 is the K²-tree of the adjacency matrix.

FIG. 3 shows the recombination matrices obtained by dividing and recombining the adjacency matrix, together with their K²-tree representations; (a-1) and (a-2) are recombination matrix No. 0 and its K²-tree, respectively; (b-1) and (b-2) are recombination matrix No. 1 and its K²-tree, respectively; (c-1) and (c-2) are recombination matrix No. 2 and its K²-tree, respectively; (d-1) and (d-2) are recombination matrix No. 3 and its K²-tree, respectively.

FIG. 4 shows the recombination matrices and corresponding K²-tree representations that need to be accessed when querying the neighbor nodes of the given node 2; (a-1) and (a-2) are recombination matrix No. 0 and its K²-tree, respectively, where the two gray elements in (a-1) are the elements of the adjacency matrix accessed during the query, and the dotted lines and corresponding nodes in (a-2) are the nodes and steps accessed in the K²-tree during the query; (b-1) and (b-2) are recombination matrix No. 1 and its K²-tree, respectively, where the two gray elements in (b-1) are the elements of the adjacency matrix accessed during the query, and the dotted lines and corresponding nodes in (b-2) are the nodes and steps accessed in the K²-tree during the query.

Detailed Description

In order to further mine and utilize the inherent structural characteristics of graph data and to further improve the query efficiency of the K²-tree when representing large-scale graph data, the invention provides a graph data representation method based on dense grid recombination and K²-tree. Based on the adjacency matrix representation of a graph, the method represents and compresses graph data efficiently and compactly, and provides a direct-neighbor query operation for the nodes in the graph data. The method specifically comprises the following steps:

Step 1, using a grid clustering algorithm, dividing the adjacency matrix with two different parameters, one large and one small, each division forming grids of equal size.

Step 1.1, the graph data is first abstracted into an adjacency matrix.

Step 1.2, setting the small grid parameter to K and dividing the adjacency matrix into K × K matrices, then marking each matrix that contains at least one 1. A matrix is marked with a code giving its position in the original matrix: if it is the m-th matrix counting from top to bottom and the n-th matrix counting from left to right, its code is (m, n). The codes are stored in a set named SET1.

Step 1.3, setting the large grid parameter to Size and dividing the adjacency matrix into Size × Size matrices, then marking each matrix that contains at least one 1. A matrix is marked with a code giving its position in the original matrix: if it is the M-th matrix counting from top to bottom and the N-th matrix counting from left to right, its code is (M, N). The codes are stored in a set named SET2.

Step 2.1, selecting any matrix R in SET2, extracting in order the matrices of SET1 contained in R, and storing them in a set named SET3.

Step 2.2, letting a be the number of matrices in SET3 and x = ⌈√a⌉, arranging the matrices in SET3 from left to right and from top to bottom into a multi-row matrix queue with x matrices per row, filling the positions that hold no matrix with 0, and recombining them into a matrix R_S of size 2x × 2x.

Step 2.3, storing R_S in the set named SET4 and emptying SET3.

Step 2.4, repeating steps 2.1, 2.2 and 2.3 until all the matrices in SET2 have been recombined into new matrices.

Step 3.1, setting the parameter K of the K²-tree.

Step 3.2, selecting any matrix R_S in SET4. When the number of rows and columns of R_S is not a power of K, 0 is padded on the right and bottom sides of the matrix to expand R_S to a power of K. R_S is then divided into K² matrices of equal size, each with the same number of rows and columns.

Step 3.3, for the K²-tree corresponding to each matrix, the matrix before division is the root node and the K² matrices after division are its child nodes. If the elements of a divided matrix are all 0, the corresponding child node of the K²-tree is 0; if the matrix contains at least one 1, the child node is 1.

Step 3.4, the matrices containing a 1 are recursively divided into K² matrices of equal size, each with the same number of rows and columns, until the matrix size is 1 × 1. In this partitioning process, all child nodes are ordered from top to bottom and from left to right.

Step 3.5, repeating steps 3.2, 3.3 and 3.4 until all matrices in SET4 are represented as K²-trees.

Step 3.6, for each K²-tree constructed in the above steps, a bit vector T stores, in top-to-bottom and left-to-right order, the values of all nodes except those of the bottom level, and a bit vector L stores the values of the bottom-level leaf nodes.

Step 4, querying the node of interest in each of the K²-trees built in step 3, and merging the query results to obtain the final result.

Step 4.1, given a node q, the matrices in SET4 that contain the node are selected and stored in SET5.

Step 4.2, selecting any matrix R_q in SET5, finding the matrices of SET1 contained in R_q that include the node q, and using their position information to calculate the new row number q' of the node q in R_q, the column numbers p1 and p2 to be queried, and the column offset t.

Step 4.3, for one of the new row numbers q' calculated in step 4.2, a top-down traversal is performed according to the traversal rule of the K²-tree, and the values at the column numbers p1 and p2 to be queried for the new row number are read. If the value at a column number is 1 and the column number is even, the column offset t is assigned to the real neighbor node column number p and p is stored in SET6; if the value at a column number is 1 and the column number is odd, the column offset t plus 1 is assigned to the real neighbor node column number p and p is stored in SET6.

Step 4.4, repeating step 4.3 until all neighbor nodes of the node q within the matrix R_q have been found.

Step 4.5, repeating steps 4.2, 4.3 and 4.4 until all neighbor nodes of the node q have been found. SET6 then holds all the neighbor nodes of the node q.

The invention is described in further detail below with reference to fig. 1-4 and a specific example:

1. Partitioning and recombining the adjacency matrix

With the small grid parameter set to 2, the adjacency matrix is divided into 64 matrices of size 2 × 2. The matrices containing at least one 1 are marked with their positions in the original matrix as codes, and the codes are stored in a set named SET1: SET1 = {(1,1), (1,2), (1,8), (2,1), (2,8), (3,8), (4,8), (5,8), (6,7), (7,2), (8,7)}.

With the large grid parameter set to 8, the adjacency matrix is divided into four 8 × 8 matrices. The matrices containing at least one 1 are marked with their positions in the original matrix as codes, and the codes are stored in a set named SET2: SET2 = {(1,1), (1,2), (2,1), (2,2)}.

The small matrices contained in each of the four large matrices are extracted and stored in SET3; the sets for the four large matrices are SET3₀ = {(1,1), (1,2), (2,1)}, SET3₁ = {(1,8), (2,8), (3,8), (4,8)}, SET3₂ = {(7,2)} and SET3₃ = {(5,8), (6,7), (8,7)}. The matrices in each SET3 are then recombined into a new matrix. SET3₀ contains 3 matrices, so each row of the corresponding recombination matrix should hold ⌈√3⌉ = 2 matrices, and the recombination matrix should be of size 4 × 4. The matrices in SET3₀ are arranged from left to right and from top to bottom into a multi-row matrix queue, and the positions that hold no matrix are filled with 0. SET3₁, SET3₂ and SET3₃ are recombined in the same way, and the 4 recombination matrices are stored in SET4. FIG. 1 shows the divided adjacency matrix of a graph with 16 nodes.
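The grouping in this example can be checked mechanically. The sketch below starts from the SET1 codes listed above, assigns each small grid to the 8 × 8 large grid that contains it, and derives the side of each recombination matrix. The per-row count is taken as the smallest x with x · x ≥ a, an assumption chosen because it reproduces the 4 × 4 size stated for SET3₀; the variable names are illustrative.

```python
import math

K, SIZE = 2, 8
SET1 = [(1, 1), (1, 2), (1, 8), (2, 1), (2, 8), (3, 8),
        (4, 8), (5, 8), (6, 7), (7, 2), (8, 7)]

per_side = SIZE // K            # small grids per side of one large grid
groups = {}
for m, n in SET1:
    big = ((m - 1) // per_side + 1, (n - 1) // per_side + 1)  # code of the enclosing large grid
    groups.setdefault(big, []).append((m, n))

SET2 = sorted(groups)           # large grids that contain at least one 1
sides = []
for number, big in enumerate(SET2):
    a = len(groups[big])
    x = math.isqrt(a - 1) + 1   # smallest x with x * x >= a
    sides.append(K * x)
    print(f"matrix {number}: large grid {big}, small grids {groups[big]}, "
          f"size {K * x} x {K * x}")
```

Under these assumptions the four recombination matrices have sides 4, 4, 2 and 4, for a total of 52 cells instead of the 256 cells of the original 16 × 16 adjacency matrix.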

2. K²-tree representation of the recombination matrices

The recombination matrices obtained in the previous step are represented as K²-trees, as shown in FIG. 2, yielding the corresponding bit vectors T and L. Taking recombination matrix No. 0 as an example, the value of K is first set to 2 and the matrix is divided into 4 matrices of equal size. If a divided matrix contains a 1, the value at the corresponding node of the K²-tree is 1; if all its elements are 0, the value at the corresponding node is 0. Once the K²-tree is obtained, all its nodes except those of the last level are stored in the bit vector T from top to bottom and from left to right, and the leaf nodes of the last level are stored in the bit vector L. In this way the four recombination matrices are represented as K²-trees and the corresponding set of bit vectors is obtained. FIG. 3 shows the recombination matrices obtained by dividing and recombining the adjacency matrix, together with their K²-trees.

3. Querying neighbor nodes of a given node

Node 2 is designated as the node to be queried. The recombination matrices in SET4 that contain node 2 are selected first; in this example, recombination matrices No. 0 and No. 1 contain this node. For recombination matrix No. 0, the new row number of node 2 is calculated as q' = 2, the column numbers to be queried as p1 = 0 and p2 = 1, and the column offset as t = 0. The values at the corresponding positions in the corresponding K²-tree are then accessed. In row q', the value at column p1 is 1 and the value at column p2 is 0, so the column offset itself is the real neighbor node number, 0.

For recombination matrix No. 1, the new row number of node 2 in this matrix is calculated as q' = 0, the column numbers to be queried as p1 = 2 and p2 = 3, and the column offset as t = 14. The values at the corresponding positions in the corresponding K²-tree are then accessed. In row q', the value at column p1 is 0 and the value at column p2 is 1, so 1 is added to the offset to obtain the real neighbor node number, 15. Finally, all neighbor nodes of node 2, {0, 15}, are obtained. FIG. 4 shows the recombination matrices and corresponding K²-trees that need to be accessed when querying the neighbor nodes of the given node 2.
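The position calculation of step 4.2 is not spelled out as a formula in the text, but the numbers in this example are reproduced by the following sketch. It assumes 0-based node numbers, 1-based grid codes, a row-major layout with x small grids per row (x being the smallest integer with x · x ≥ a), and the hypothetical helper name `locate`.

```python
import math


def locate(q, cells, k):
    """For node q, return (new row q', columns to query, column offset t) for each
    small grid in `cells` (1-based codes, in recombination order) that covers row q."""
    x = math.isqrt(len(cells) - 1) + 1          # small grids per row of the recombination matrix
    hits = []
    for idx, (m, n) in enumerate(cells):
        if m - 1 == q // k:                      # this small grid covers original row q
            block_row, block_col = divmod(idx, x)
            q_new = block_row * k + q % k        # row of q inside the recombination matrix
            cols = [block_col * k + j for j in range(k)]
            t = (n - 1) * k                      # original column of the grid's first column
            hits.append((q_new, cols, t))
    return hits


SET3_0 = [(1, 1), (1, 2), (2, 1)]               # recombination matrix No. 0
SET3_1 = [(1, 8), (2, 8), (3, 8), (4, 8)]       # recombination matrix No. 1
print(locate(2, SET3_0, 2))
print(locate(2, SET3_1, 2))
```

For node 2 this yields row 2 with columns 0 and 1 and offset 0 in matrix No. 0, and row 0 with columns 2 and 3 and offset 14 in matrix No. 1, matching the values used in the example.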

It should be noted that the above embodiments are illustrative only, and the invention is not limited to them. Other embodiments obtained by those skilled in the art in light of the teachings of the invention, without departing from its principles, are considered to be within the scope of protection of the invention.

Claims

1. A graph data representation method based on dense grid recombination and K²-tree, applied to a Web graph in which each web page corresponds to a graph node and each link corresponds to a graph edge, characterized in that it comprises the following steps:

Step 2, extracting all the small grids contained in each large grid and storing them as a new recombination matrix; the specific steps are as follows:

a is the number of matrices in the temporary matrix set SET3;

Step 2.4, repeating steps 2.1-2.3 until all matrices in the large grid matrix set SET2 have been recombined and stored in the recombination matrix set SET4;

Step 3, representing each recombination matrix from step 2 as a K²-tree;

Step 4, querying the node of interest in each of the K²-trees built in step 3, and merging the query results to obtain the final result;

where K is a positive integer.

2. The graph data representation method based on dense grid recombination and K²-tree according to claim 1, characterized in that the specific process of step 1 is as follows:

Step 1.1, abstracting the graph data into an adjacency matrix;

3. The graph data representation method based on dense grid recombination and K²-tree according to claim 1, characterized in that the specific process of step 3 is as follows:

Step 3.2, for the K²-tree corresponding to each recombination matrix, taking the matrix before division, i.e. the recombination matrix, as the root node and the K² matrices after division as the child nodes of the root node; if the elements of a divided matrix are all 0, the corresponding child node of the K²-tree is 0; if a divided matrix contains at least one 1, the corresponding child node of the K²-tree is 1; in this division process, all child nodes of the K²-tree are ordered from top to bottom and from left to right;

4. The graph data representation method based on dense grid recombination and K²-tree according to claim 1, characterized in that the specific process of step 4 is as follows:

if the value at the column number is 0, no processing is performed;