CN113051106A - Graph redundancy strategy of novel storage system - Google Patents
- Publication number
- CN113051106A
- Authority
- CN
- China
- Prior art keywords
- vertex
- graph
- cutting
- redundancy
- storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/51—Indexing; Data structures therefor; Storage structures
Abstract
The invention discloses an RDF graph redundancy cutting storage method for a storage system. When a graph is cut and partitioned, a copy is made of each associated vertex of a graph cut vertex, so that the associated vertex is stored redundantly in two flash pages and its data can be read from the flash page in which the graph cut vertex is located. With this method, a vertex that is closely associated with other vertices is copied during cutting and stored redundantly in two pages, so that when the vertex is read by way of another vertex it does not have to be looked up in a different page. That is, two pages do not have to be read, which improves read efficiency.
Description
Technical Field
The invention relates to the technical field of graph storage, in particular to a graph redundancy strategy for a novel storage system, and specifically to an RDF graph redundancy cutting storage method for the storage system.
Background
Graph storage technology studies the layout, partitioning and replication of graph data in an SSD environment, and is the premise and cornerstone of graph data management. The way a graph is stored directly determines the access efficiency of the graph data and the efficiency of graph queries. A conventional graph storage system is a non-relational database in which the data nodes and the relationships between them are stored together. When a graph is partitioned, a vertex-centric graph storage system is unorganized: it merely assigns vertices to different partitions at coarse granularity, without analyzing in depth the relationships between vertices or the efficiency of reading vertex data, and therefore cannot make effective use of the access characteristics of the SSD.
Current graph partitioning methods try to use the storage of each whole SSD page as fully as possible to avoid wasting storage resources. When a graph is partitioned, however, two associated vertices are inevitably cut apart and stored in different flash pages. Because the minimum unit of an SSD read is a page, reading two adjacent vertices stored in different pages requires reading both whole pages, which multiplies the read time. The more times a large graph is cut, the more such vertices are cut apart, and if a program reads the contents of these vertices repeatedly, read efficiency becomes lower and lower.
Disclosure of Invention
The invention aims to provide an RDF graph redundancy cutting storage method for a storage system that addresses the technical defects of the prior art.
The technical scheme adopted for realizing the purpose of the invention is as follows:
When the graph is cut and partitioned, a copy is made of each determined associated vertex of a graph cut vertex, so that the associated vertex is stored redundantly in two flash pages and its data can be read from the flash page in which the graph cut vertex is located.
The associated vertices are selected as follows:
A redundancy threshold r' is set. When the sum r of the out-degree and the in-degree of a vertex to be cut is less than or equal to r', the vertex is taken as an associated vertex and stored redundantly; when r is greater than r', the vertex is stored only once and is not taken as an associated vertex.
With the RDF graph redundancy cutting storage method, a vertex that is closely associated with other vertices is copied during cutting and stored redundantly in two pages, so that when the vertex is read by way of another vertex it does not have to be looked up in a different page. That is, two pages do not have to be read, which improves read efficiency.
Drawings
FIG. 1 is a flow chart of an RDF graph redundancy cutting storage method of the storage system of the present invention;
FIG. 2 is an RDF example diagram;
FIGS. 3a-3d are schematic diagrams of a conventional RDF graph cutting method;
FIGS. 4a-4d are schematic diagrams of a redundant RDF graph cutting method.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the prior art the graph is cut into small subgraphs as required, without considering the efficiency problem that arises when those subgraphs are read back after being stored on the SSD. In practice, if a cut vertex is closely associated with other vertices, a program will often read the vertices adjacent to the cut vertex, so the data in both pages holding the two vertices must be read. This multiplies the read time and reduces read efficiency.
It should be noted that an RDF data set is not a complete graph but a sparse graph, and most of its content consists of small graphs. DBpedia, for example, extracts entries from Wikipedia: a portion of the edges is extracted first, and the large graph is then partitioned. The conventional graph partitioning strategy is described next with reference to FIG. 2, which contains vertices A through K; the attribute values of the vertices are omitted, and only the relationships between vertices are retained.
As shown in FIGS. 3a-3d, the conventional RDF graph cutting method cuts the graph directly at vertices as required. When the cutting reaches FIG. 3c, vertices H, I and G can all be reached very easily from vertex F, so they are placed together (the aim being to keep associated vertices in the same page). If J and K were also placed with them, however, the storage space of the page might be exceeded, so the remaining vertices J and K can only be stored separately in page P4, as shown in FIG. 3d, while vertices F, H, I and G are stored in page P3. After cutting, each resulting subgraph is stored in its own SSD page. For clarity only a few vertices are drawn; in a real environment each subgraph contains many more vertices, enough to fill a page. With the cuts of FIGS. 3a and 3b, vertices C and E fall into different partitions: vertex C is in page P2 and vertex E is in page P1. If vertex E is looked up from vertex C, two different pages must be read. Because the minimum unit of an SSD read is a page, the whole page containing a vertex must be read to obtain that vertex's content, which multiplies the read time.
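The page-granularity cost of the conventional layout can be illustrated with a minimal Python sketch. The vertex-to-page assignments are the ones named in the text; the helper name `pages_to_read` is hypothetical, and the model assumes, as a simplification, that every distinct page costs one full page read.

```python
# Vertex -> flash page for the conventional cut (FIGS. 3a-3d).
# Only vertices whose pages are named in the text are listed.
page_of = {
    "E": "P1",
    "C": "P2",
    "F": "P3", "G": "P3", "H": "P3", "I": "P3",
    "J": "P4", "K": "P4",
}

def pages_to_read(vertices, page_of):
    """Return the set of flash pages that must be read to obtain all vertices.

    Assumption: an SSD read fetches a whole page, so the cost is the number
    of distinct pages touched.
    """
    return {page_of[v] for v in vertices}

# F, H, I and G share page P3, so a single page read serves all four.
print(sorted(pages_to_read(["F", "H", "I", "G"], page_of)))
# C and E sit in different pages, so following C -> E touches two pages.
print(sorted(pages_to_read(["C", "E"], page_of)))
```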
The invention therefore provides a novel graph cutting storage method in which a vertex that is closely associated with other vertices is copied during cutting and stored redundantly in two pages. When the vertex is read by way of another vertex, it does not have to be looked up in a different page, i.e. two pages do not have to be read, which improves read efficiency. Before a vertex is copied for redundant storage, the associated vertex is first determined from the adjacency or association between the other vertices and the vertex to be stored redundantly.
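A minimal sketch of the redundant placement follows, assuming a hypothetical in-memory model in which each flash page is a set of vertex labels. The function names `add_replica` and `pages_for_pair` are illustrative and do not come from the patent; the page contents list only vertices C and E from the example.

```python
# Page -> vertices stored in it (conventional layout, partial).
pages = {
    "P1": {"E"},
    "P2": {"C"},
}

def add_replica(pages, vertex, target_page):
    """Store an extra copy of `vertex` in `target_page` (redundant storage)."""
    pages[target_page].add(vertex)

def pages_for_pair(pages, u, v):
    """Number of page reads needed to obtain both u and v.

    One read suffices if some page holds both; otherwise two pages are read.
    Assumes u and v are each stored in at least one page.
    """
    if any(u in members and v in members for members in pages.values()):
        return 1
    return 2

before = pages_for_pair(pages, "C", "E")  # C and E are in different pages
add_replica(pages, "E", "P2")             # copy E into the page that holds C
after = pages_for_pair(pages, "C", "E")   # E can now be read from P2
print(before, after)
```

The original copy of E stays in P1, so lookups that start in P1 are unaffected; the cost of the scheme is the extra space the copy occupies in P2.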
Graph data differs from other block-stored data. Other block-stored data items have little association with one another and can be cached, whereas the vertices of a graph are strongly associated and the cut subgraphs cannot be cached. Therefore, each time a cut passes through a vertex, the vertex is copied into the new partition as well, so that associated vertices can be read from the same flash page as far as possible.
However, not every vertex to be cut should be stored redundantly, since that would incur significant memory and storage overhead. It is therefore necessary to decide which vertices need redundancy and which do not.
The invention therefore introduces a new variable: let r be the sum of the out-degree and the in-degree of the vertex to be cut, and let r' be a redundancy threshold. When r ≤ r', the vertex is stored redundantly; when r > r', the vertex is stored only once.
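The selection rule can be sketched in Python as follows. The edge list is a made-up example, not the graph of FIG. 2, and the names `degree_sums` and `is_redundant` are hypothetical; the comparison r ≤ r' is the one stated above.

```python
from collections import Counter

def degree_sums(edges):
    """Return r(v) = out-degree + in-degree for every vertex of a directed edge list."""
    r = Counter()
    for src, dst in edges:
        r[src] += 1  # contributes to the out-degree of src
        r[dst] += 1  # contributes to the in-degree of dst
    return r

def is_redundant(r, threshold):
    """A cut vertex is stored redundantly when r <= r' (the redundancy threshold)."""
    return r <= threshold

# Hypothetical directed edges (subject -> object), for illustration only.
edges = [("a", "b"), ("a", "c"), ("c", "b"), ("d", "a")]
r = degree_sums(edges)
selected = sorted(v for v in r if is_redundant(r[v], threshold=2))
print(dict(r), selected)
```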
By setting this variable, vertices that are closely associated with other vertices can be stored redundantly, which improves read efficiency without wasting too much storage space.
The value of r' must be determined experimentally: the data read efficiency and the storage space occupation are measured for different values of r', and the value at which the two are balanced is selected as the optimum.
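One way such an experiment could be organized is sketched below. Everything in it is an assumption made for illustration: the degree sums, the access counts, the weight `copy_weight` and the linear cost model are all hypothetical, since the text only says that read efficiency and space occupation are measured and balanced.

```python
# Hypothetical inputs: r (in-degree + out-degree) of each cut vertex, and how
# often a program reaches each cut vertex from a neighbour in another page.
cut_degree = {"v1": 2, "v2": 2, "v3": 4, "v4": 6}
cross_page_accesses = {"v1": 5, "v2": 3, "v3": 20, "v4": 1}

def evaluate(threshold, cut_degree, cross_page_accesses):
    """Return (extra_copies, extra_page_reads) for a redundancy threshold r'."""
    replicated = {v for v, r in cut_degree.items() if r <= threshold}
    extra_copies = len(replicated)  # storage overhead: one extra copy per vertex
    extra_page_reads = sum(         # accesses that still need a second page
        n for v, n in cross_page_accesses.items() if v not in replicated
    )
    return extra_copies, extra_page_reads

def pick_threshold(candidates, copy_weight):
    """Pick the r' minimising reads + copy_weight * copies (assumed cost model)."""
    def cost(t):
        copies, reads = evaluate(t, cut_degree, cross_page_accesses)
        return reads + copy_weight * copies
    return min(candidates, key=cost)

for t in (1, 2, 4, 6):
    print(t, evaluate(t, cut_degree, cross_page_accesses))
best = pick_threshold((1, 2, 4, 6), copy_weight=2)
print("selected r':", best)
```

In a real experiment the two numbers would come from measurements on the SSD rather than from a toy model, and the way they are traded off would depend on the workload.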
For a simple graph, r' can be obtained in a few calculation steps. Taking FIGS. 3a-3d as an example, suppose vertices E, D, H and I are to be stored redundantly. With r' = 1, no vertex meets the condition; with r' = 2 or r' = 3, vertices E, H and I meet it; with r' = 4, all of E, D, H and I meet it, so 4 is the ideal value. Cutting the example graph on this basis gives the result shown in FIGS. 4a-4d: page P2 includes associated vertex E from page P1, page P3 includes associated vertex D from page P2, and page P4 includes associated vertex H from page P3.
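The thresholds in this example pin down the degree sums: E, H and I fail at r' = 1 and pass at r' = 2, while D fails at r' = 3 and passes at r' = 4, so, since degrees are integers, r(E) = r(H) = r(I) = 2 and r(D) = 4. Writing S(r') for the set of vertices selected for redundancy (a notation introduced here only for illustration), the example can be summarized as:

```latex
\begin{align*}
S(r') &= \{\, v \in \{E, D, H, I\} : r(v) \le r' \,\}, \\
r(E) &= r(H) = r(I) = 2, \qquad r(D) = 4, \\
S(1) &= \varnothing, \qquad S(2) = S(3) = \{E, H, I\}, \qquad S(4) = \{E, D, H, I\}.
\end{align*}
```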
The above analysis shows that, with the RDF graph redundancy cutting storage method of the invention, a vertex that is closely associated with other vertices is copied during cutting and stored redundantly in two pages, so that when the vertex is read by way of another vertex it does not have to be looked up in a different page. That is, two pages do not have to be read, which improves read efficiency.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the invention, and such modifications and refinements shall also fall within the protection scope of the invention.
Claims (2)
1. An RDF graph redundancy cutting storage method for a storage system, characterized in that, when the graph is cut and partitioned, a copy is made of each determined associated vertex of a graph cut vertex, so that the associated vertex is stored redundantly in two flash pages and the data of the associated vertex can be read from the flash page in which the graph cut vertex is located.
2. The RDF graph redundancy cutting storage method for a storage system according to claim 1, wherein the associated vertices are selected as follows:
a redundancy threshold r' is set; when the sum r of the out-degree and the in-degree of a vertex to be cut is less than or equal to r', the vertex is taken as an associated vertex and stored redundantly; and when r is greater than r', the vertex is stored only once and is not taken as an associated vertex.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110307688.8A CN113051106A (en) | 2021-03-23 | 2021-03-23 | Graph redundancy strategy of novel storage system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110307688.8A CN113051106A (en) | 2021-03-23 | 2021-03-23 | Graph redundancy strategy of novel storage system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113051106A true CN113051106A (en) | 2021-06-29 |
Family
ID=76514345
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110307688.8A Pending CN113051106A (en) | 2021-03-23 | 2021-03-23 | Graph redundancy strategy of novel storage system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113051106A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104809249A (en) * | 2015-05-18 | 2015-07-29 | 北京嘀嘀无限科技发展有限公司 | Processing method and system of data structure |
CN106202175A (en) * | 2016-06-24 | 2016-12-07 | 四川大学 | Distributed dynamic figure management system towards big figure segmentation |
CN109710774A (en) * | 2018-12-21 | 2019-05-03 | 福州大学 | It is divided and distributed storage algorithm in conjunction with the diagram data of equilibrium strategy |
US20190325075A1 (en) * | 2018-04-18 | 2019-10-24 | Oracle International Corporation | Efficient, in-memory, relational representation for heterogeneous graphs |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110276002B (en) | Search application data processing method and device, computer equipment and storage medium | |
Yang et al. | Towards effective partition management for large graphs | |
CN1811757B (en) | System and method for locating pages on the world wide web and for locating documents from a network of computers | |
US20080228783A1 (en) | Data Partitioning Systems | |
CN103077197A (en) | Data storing method and device | |
US10241963B2 (en) | Hash-based synchronization of geospatial vector features | |
CN110399096B (en) | Method, device and equipment for deleting metadata cache of distributed file system again | |
CN108733306A (en) | A kind of Piece file mergence method and device | |
US8180838B2 (en) | Efficiently managing modular data storage systems | |
CN105469001B (en) | Disk data protection method and device | |
Ramamohanarao et al. | Recursive linear hashing | |
CN112579595A (en) | Data processing method and device, electronic equipment and readable storage medium | |
KR100806115B1 (en) | Design method of query classification component in multi-level dbms | |
KR100907477B1 (en) | Apparatus and method for managing index of data stored in flash memory | |
US9471437B1 (en) | Common backup format and log based virtual full construction | |
CN103530067B (en) | A kind of method and apparatus of data manipulation | |
CN113051106A (en) | Graph redundancy strategy of novel storage system | |
CN109815303B (en) | Mobile data storage system based on position | |
Hjaltason et al. | Improved bulk-loading algorithms for quadtrees | |
CN115114294A (en) | Self-adaption method and device of database storage mode and computer equipment | |
CN115114239A (en) | Distributed system data processing method, device, equipment and medium | |
CN114741388A (en) | Novel construction method for integrated circuit layout data index | |
CN114168588A (en) | Vector database storage and retrieval method | |
US9678979B1 (en) | Common backup format and log based virtual full construction | |
CN102831240A (en) | Storage method and storage structure of extensible metadata documents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210629 |