CN113051106A

CN113051106A - Graph redundancy strategy of novel storage system

Info

Publication number: CN113051106A
Application number: CN202110307688.8A
Authority: CN
Inventors: 陈仁海; 李太俊; 冯志勇; 刘琤
Original assignee: ELECTRONIC INFORMATION VOCATIONAL TECHNOLOGY COLLEGE; Tianjin University
Current assignee: ELECTRONIC INFORMATION VOCATIONAL TECHNOLOGY COLLEGE; Tianjin University
Priority date: 2021-03-23
Filing date: 2021-03-23
Publication date: 2021-06-29

Abstract

The invention discloses an RDF graph redundancy cutting storage method for a storage system. When the graph is cut and partitioned, a copy is made of the vertex associated with a graph-cut vertex, and the associated vertex is stored redundantly in two flash pages, so that the data of the associated vertex can be read from the flash page in which the graph-cut vertex is located. In this method, a vertex with a high degree of association with other vertices is copied during cutting and placed into two pages for redundant storage, so that when the vertex is reached through another vertex it does not need to be looked up in a different page; that is, two pages do not need to be read, and reading efficiency is improved.

Description

Graph redundancy strategy of novel storage system

Technical Field

The invention relates to the technical field of graph storage, in particular to a graph redundancy strategy for a novel storage system, and more particularly to an RDF graph redundancy cutting storage method for the storage system.

Background

Graph storage technology studies the layout, partitioning, and replication of graph data in an SSD environment, and is the premise and cornerstone of graph data management. The way a graph is stored directly determines the access efficiency of graph data and the efficiency of graph queries. A conventional graph storage system is a non-relational database in which data nodes and the relationships between nodes are stored together. When a graph is partitioned, a vertex-centric graph storage system is unorganized: it merely assigns vertices to different partitions at coarse granularity, does not analyze in depth the relationships between vertices or the efficiency of reading vertex data, and cannot make effective use of the access characteristics of the SSD.

Current graph partitioning methods try to use the storage of each whole SSD page as fully as possible to avoid wasting storage resources. However, when a whole graph is partitioned, two associated vertices are inevitably cut apart and stored in different flash pages. Because the minimum unit of an SSD read is a page, reading adjacent vertices stored in two different pages requires reading both whole pages, which multiplies the reading time. The more times a large graph is cut, the more such vertices there are, and if a program reads the contents of these vertices repeatedly, reading efficiency falls further.

Disclosure of Invention

The invention aims to provide an RDF graph redundancy cutting storage method for a storage system, in view of the technical deficiencies of the prior art.

The technical scheme adopted for realizing the purpose of the invention is as follows:

When the graph is cut and partitioned, a copy is made of the determined vertex associated with a graph-cut vertex, and the associated vertex is stored redundantly in two flash pages, so that the data of the associated vertex can be read from the flash page in which the graph-cut vertex is located.

The associated vertices are selected as follows:

A redundancy threshold r' is set. When the sum r of the out-degree and in-degree of the vertex to be cut is less than or equal to r', the vertex is taken as an associated vertex and stored redundantly; when r is greater than r', the vertex is stored only once and is not taken as an associated vertex.

According to the RDF graph redundancy cutting storage method, the vertex with higher relevance degree with other vertices is copied during cutting and placed into two pages for redundancy storage, so that when the vertex is read through other vertices, the vertex does not need to be searched in other pages, namely, the data of the two pages does not need to be read, and the reading efficiency is improved.

Drawings

FIG. 1 is a flow chart of an RDF graph redundancy cutting storage method of the storage system of the present invention;

FIG. 2 is an RDF example diagram;

FIGS. 3a-3d are schematic diagrams of a conventional RDF graph cutting method;

FIGS. 4a-4d are schematic diagrams of a redundant RDF graph cutting method.

Detailed Description

The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the prior art, a graph is cut into small graphs as required, and graph cutting does not consider the efficiency of reading those small graphs back after they are stored in the SSD. In practice, if a cut vertex is strongly associated with other vertices, a program will often read the vertices adjacent to it, so the data in both pages holding the two vertices must be read; the reading time is multiplied and reading efficiency is reduced.

It should be noted that an RDF data set is not itself a complete graph but a sparse graph, and most of its content consists of small graphs. For example, DBpedia extracts entries from Wikipedia; a part of the edge graphs is extracted first, and the large graph is then partitioned. The conventional graph partitioning strategy is described next using FIG. 2. Attribute values of vertices are omitted in FIG. 2 and only the relationships between vertices are retained; the graph contains vertices A-K.

As shown in FIGS. 3a-3d, the conventional RDF graph cutting method cuts at vertices directly as required. On reaching the cut of FIG. 3c, vertices H, I and G are easily reached through vertex F, so they are placed together to keep associated vertices in one page. If J and K were also placed with them, the storage space of the page might be exceeded, so the remaining vertices J and K are stored separately in page P4, shown in FIG. 3d, while vertices F, H, I and G are stored in P3. After cutting, each resulting small graph is stored in a page of the SSD. For clarity only a few vertices are drawn; in a real environment each small graph contains more vertices, enough to fill a page. With the cuts of FIGS. 3a and 3b, vertices C and E fall in different partitions: vertex C is in page P2 and vertex E is in page P1. If vertex E is looked up through vertex C, two different pages must be read. Because the minimum unit of an SSD read is a page, the whole page holding a vertex must be read to obtain its contents, which multiplies the reading time.
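The cost of a cross-page lookup described above can be illustrated with a minimal Python sketch. It assumes a simple vertex-to-page dictionary; the name `pages_read` is hypothetical, and only the four placements stated in the text (C in P2, E in P1, F and H in P3) are used.

```python
# Placement taken from the description; all other vertices are omitted.
page_of = {"C": "P2", "E": "P1", "F": "P3", "H": "P3"}


def pages_read(src, dst, page_of):
    """Return the set of flash pages that must be read to visit src, then dst.

    A page is the smallest readable unit, so each distinct page counts once.
    """
    return {page_of[src], page_of[dst]}


cross = pages_read("C", "E", page_of)  # C and E were cut into different pages
local = pages_read("F", "H", page_of)  # F and H share page P3
print(len(cross), len(local))  # prints "2 1"
```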

Therefore, the invention provides a novel graph cutting storage method that, during cutting, copies a vertex with a high degree of association with other vertices into two pages for redundant storage, so that when the vertex is read through another vertex it does not need to be looked up in a different page; that is, two pages do not need to be read, which improves reading efficiency. Before the copy is made for redundant storage, the associated vertex is determined from the adjacency or association relationship between the other vertices and the vertex to be stored redundantly.
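The effect of the redundant copy can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the patented implementation: each page is modeled as a set of vertex names, the helper `pages_needed` is hypothetical, and only vertices C and E are shown.

```python
def pages_needed(src, dst, pages):
    """Number of page reads needed to visit src and then dst.

    One read suffices if some page holds both vertices (original or copy);
    otherwise the two vertices live in different pages and two reads are needed.
    """
    for members in pages.values():
        if src in members and dst in members:
            return 1
    return 2


# Simplified layout: E in P1, C in P2 (other vertices omitted).
pages = {"P1": {"E"}, "P2": {"C"}}
before = pages_needed("C", "E", pages)

# Redundant storage: a copy of E is also placed in the page that holds C.
pages["P2"].add("E")
after = pages_needed("C", "E", pages)

print(before, after)  # prints "2 1"
```

The original of E stays in P1, so the vertex is stored in two pages, at the price of one extra vertex slot in P2.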

Because a graph differs from other block-type stored data (which have little association with one another and can be cached), the association between the vertices of a graph is strong, and the cut small graphs cannot be cached. Therefore, each time the graph is partitioned at a cut vertex, the vertex is copied and also assigned to the new partition, so that vertices can be read from the same flash page as far as possible.

However, not every vertex to be cut can be copied, because that would result in significant memory and storage overhead. It is therefore necessary to consider which vertices need redundancy and which do not.

Therefore, the invention introduces a new variable: let r be the sum of the out-degree and in-degree of the vertex to be cut, and set a redundancy threshold r'. When r ≤ r', the vertex is stored redundantly; when r > r', the vertex is stored only once.

With this variable, vertices with a high degree of association with other vertices can be stored redundantly, improving reading efficiency without wasting too much storage space.
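The threshold rule can be sketched in Python as below. The edge list is a hypothetical toy graph, not the graph of FIG. 2, and the function names `degree_sum` and `redundant_vertices` are assumptions introduced for illustration.

```python
from collections import Counter


def degree_sum(edges):
    """Compute r (in-degree plus out-degree) for every vertex of a directed graph."""
    r = Counter()
    for src, dst in edges:
        r[src] += 1  # edge contributes to the out-degree of src
        r[dst] += 1  # edge contributes to the in-degree of dst
    return r


def redundant_vertices(edges, cut_vertices, threshold):
    """Return the cut vertices whose degree sum r satisfies r <= threshold (r')."""
    r = degree_sum(edges)
    return {v for v in cut_vertices if r[v] <= threshold}


# Hypothetical directed edges; b has r = 4, c and d have r = 2.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d"), ("e", "b")]
cut = {"b", "c", "d"}

for threshold in (1, 2, 4):
    print(threshold, sorted(redundant_vertices(edges, cut, threshold)))
```

In this toy graph no cut vertex qualifies at r' = 1, c and d qualify at r' = 2, and all three qualify at r' = 4.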

The value of r' must be determined experimentally: data reading efficiency and storage space usage are measured for different values of r', and the value at which the two are balanced is selected as the optimal value.
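One way such an experiment might be tabulated is sketched below in Python. The cost model is an assumption made for illustration: for each cross-page edge (u, v), a copy of v is placed in the page of u when r(v) ≤ r', and the remaining cross-page edges stand in for reading cost. The function `sweep`, the toy edges and the page layout are all hypothetical; real measurements on an SSD would replace these proxies.

```python
from collections import Counter


def sweep(edges, page_of, thresholds):
    """For each candidate r', return (r', replica count, cross-page edges left)."""
    r = Counter()
    for u, v in edges:
        r[u] += 1
        r[v] += 1
    # Edges whose endpoints were cut into different pages.
    cut_edges = [(u, v) for u, v in edges if page_of[u] != page_of[v]]
    rows = []
    for t in thresholds:
        # One replica per (vertex, target page) pair: the storage overhead.
        replicas = {(v, page_of[u]) for u, v in cut_edges if r[v] <= t}
        # Edges that still need two page reads: the reading-cost proxy.
        remaining = sum(1 for u, v in cut_edges if r[v] > t)
        rows.append((t, len(replicas), remaining))
    return rows


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d"), ("e", "b")]
page_of = {"a": "P1", "b": "P1", "e": "P1", "c": "P2", "d": "P2"}
rows = sweep(edges, page_of, [1, 2])
for row in rows:
    print(row)
```

The table makes the trade-off visible: a larger r' removes more cross-page reads but adds more replicas, and the experimenter picks the row where the two are acceptably balanced.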

For a simple graph, r' can be obtained in a few steps of calculation. Taking FIGS. 3a-3d as an example, suppose vertices E, D, H and I are to be made redundant. If r' = 1, no vertex satisfies the condition; if r' = 2 or r' = 3, vertices E, H and I satisfy the condition; if r' = 4, all of E, D, H and I satisfy the condition, so r' = 4 is the ideal value. The graph is cut on this basis, and the result is shown in FIGS. 4a-4d: page P2 includes the associated vertex E of page P1, page P3 includes the associated vertex D of page P2, and page P4 includes the associated vertex H of page P3.

Through the above analysis, it can be seen that, in the RDF graph redundancy cutting storage method of the present invention, a vertex with a higher degree of association with other vertices is copied during cutting and placed into two pages for redundancy storage, so that when the vertex is read through other vertices, the vertex does not need to be searched in other pages, that is, the data of the two pages does not need to be read, thereby improving the reading efficiency.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. An RDF graph redundancy cutting storage method for a storage system, characterized in that, when the graph is cut and partitioned, a copy is made of the determined vertex associated with a graph-cut vertex and the associated vertex is stored redundantly in two flash pages, so that the data of the associated vertex can be read from the flash page in which the graph-cut vertex is located.

2. The RDF graph redundancy cutting storage method of the storage system according to claim 1, wherein the selecting step of the associated points is as follows: