CN113051106A - Graph redundancy strategy of novel storage system - Google Patents

Info

Publication number
CN113051106A
Authority
CN
China
Prior art keywords
vertex
graph
cutting
redundancy
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110307688.8A
Other languages
Chinese (zh)
Inventor
陈仁海
李太俊
冯志勇
刘琤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ELECTRONIC INFORMATION VOCATIONAL TECHNOLOGY COLLEGE
Tianjin University
Original Assignee
ELECTRONIC INFORMATION VOCATIONAL TECHNOLOGY COLLEGE
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ELECTRONIC INFORMATION VOCATIONAL TECHNOLOGY COLLEGE, Tianjin University filed Critical ELECTRONIC INFORMATION VOCATIONAL TECHNOLOGY COLLEGE
Priority to CN202110307688.8A priority Critical patent/CN113051106A/en
Publication of CN113051106A publication Critical patent/CN113051106A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures

Abstract

The invention discloses an RDF graph redundancy cutting storage method for a storage system. When a graph is cut and partitioned, a copy is made of the associated vertex of a cut vertex, so that the associated vertex is stored redundantly in two flash pages and its data can be read from the flash page where the cut vertex is located. Because a vertex that is highly associated with other vertices is copied during cutting and stored redundantly in two pages, reading that vertex by way of another vertex does not require searching a second page; that is, two pages of data no longer need to be read, which improves read efficiency.

Description

Graph redundancy strategy of novel storage system
Technical Field
The invention relates to the technical field of graph storage, in particular to a graph redundancy strategy for a novel storage system, and more specifically to an RDF graph redundancy cutting storage method for the storage system.
Background
Graph storage technology studies the layout, partitioning, and replication of graph data in an SSD environment, and is the premise and cornerstone of graph data management. The way a graph is stored directly determines the efficiency of accessing graph data and of graph queries. A conventional graph storage system is a non-relational database in which data nodes and the relationships between them are stored together. When a graph is partitioned, a vertex-centric graph storage system is unorganized: it merely assigns vertices to different partitions at coarse granularity, without analyzing in depth the relationships between vertices or considering the efficiency of reading vertex data, so it cannot make effective use of the access characteristics of the SSD.
Current graph partitioning methods try to use the storage space of each whole page in the SSD as fully as possible to avoid wasting storage. When a complete graph is partitioned, however, some pairs of associated vertices are inevitably cut apart and stored in different flash pages. Because the page is the minimum unit of an SSD read, reading two adjacent vertices stored in different pages requires reading both whole pages, which multiplies the read time. The more times a large graph is cut, the more such vertices there are, and if a program reads them repeatedly, read efficiency falls further.
Disclosure of Invention
The invention aims to provide an RDF graph redundancy cutting storage method for a storage system that addresses the technical defects of the prior art.
The technical scheme adopted to achieve this aim is as follows:
When the graph is cut and partitioned, a copy is made of the determined associated vertex of a cut vertex, so that the associated vertex is stored redundantly in two flash pages and its data can be read from the flash page where the cut vertex is located.
The associated vertices are selected as follows:
A redundancy threshold r' is set. When the sum r of the out-degree and in-degree of a vertex to be cut is less than or equal to r', the vertex is taken as an associated vertex and stored redundantly; when r is greater than r', the vertex is stored only once and is not taken as an associated vertex.
With this RDF graph redundancy cutting storage method, a vertex that is highly associated with other vertices is copied during cutting and stored redundantly in two pages, so that reading the vertex by way of another vertex does not require searching a second page; that is, two pages of data need not be read, which improves read efficiency.
Drawings
FIG. 1 is a flow chart of an RDF graph redundancy cutting storage method of the storage system of the present invention;
FIG. 2 is an RDF example diagram;
FIGS. 3a-3d are schematic diagrams of a conventional RDF graph cutting method;
FIGS. 4a-4d are schematic diagrams of a redundant RDF graph cutting method.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The prior art cuts a graph into smaller graphs as required, but does not consider the efficiency of reading those small graphs back after they have been stored in the SSD. In practice, if a cut vertex is highly associated with other vertices, a program will often read the vertices adjacent to it, so the data in both pages holding the two vertices must be read; the read time is multiplied and read efficiency falls.
It should be noted that an RDF data set is itself a sparse graph rather than a complete graph, and most of its content consists of small graphs. For example, DBpedia extracts entries from Wikipedia; a portion of the graph is extracted first, and the large graph is then partitioned. The conventional graph partitioning strategy is described next with reference to FIG. 2. FIG. 2 omits the attribute values of the vertices and retains only their relationships to other vertices; it contains vertices A through K.
As shown in FIGS. 3a-3d, the conventional RDF graph cutting method cuts vertices directly as required. At the step shown in FIG. 3c, vertices H, I, and G are easily reached through vertex F, so they are placed together to keep associated vertices in one page. If J and K were also placed with them, the storage space of the page might be exceeded, so the remaining vertices J and K are stored separately in page P4, as shown in FIG. 3d, while vertices F, H, I, and G are stored in P3. After cutting, each resulting small graph is stored in a page of the SSD. For ease of illustration only a few vertices are drawn; in a real environment each small graph contains additional vertices, enough to fill a page. With the cuts of FIGS. 3a and 3b, vertices C and E fall in different partitions: C is in page P2 and E is in page P1. If vertex E is looked up through vertex C, different pages must be read. Because the page is the minimum unit of an SSD read, the whole page containing a vertex must be read to obtain that vertex's content, which multiplies the read time.
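The page-granularity cost described above can be illustrated with a minimal Python sketch. It models only what the text states (E in P1, C in P2, F, G, H, I in P3, J and K in P4); the placement of A, B, and D, and the names `CONVENTIONAL_LAYOUT`, `page_of`, and `pages_read`, are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set

# Page layout after conventional cutting (FIGS. 3a-3d).
# Stated in the text: E in P1, C in P2, F/G/H/I in P3, J/K in P4.
# Assumed for illustration: A and B in P1, D in P2.
CONVENTIONAL_LAYOUT: Dict[str, Set[str]] = {
    "P1": {"A", "B", "E"},
    "P2": {"C", "D"},
    "P3": {"F", "G", "H", "I"},
    "P4": {"J", "K"},
}


def page_of(layout: Dict[str, Set[str]], vertex: str) -> str:
    """Return the single page that stores `vertex` (no redundancy here)."""
    for page, vertices in layout.items():
        if vertex in vertices:
            return page
    raise KeyError(vertex)


def pages_read(layout: Dict[str, Set[str]], vertices: Iterable[str]) -> Set[str]:
    """Whole pages that must be read to obtain all of `vertices`."""
    return {page_of(layout, v) for v in vertices}


if __name__ == "__main__":
    # Looking up E through C touches two pages.
    print(sorted(pages_read(CONVENTIONAL_LAYOUT, ["C", "E"])))
    # F and its neighbours H, I, G share one page.
    print(sorted(pages_read(CONVENTIONAL_LAYOUT, ["F", "H", "I", "G"])))
```

Since each page read returns the whole page, the size of the returned set is a rough proxy for read time in this model.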
Therefore, the invention provides a novel graph cutting storage method in which a vertex that is highly associated with other vertices is copied during cutting and stored redundantly in two pages, so that reading the vertex by way of another vertex does not require searching a second page; that is, two pages of data need not be read, which improves read efficiency. Before the redundant copy is made, the associated vertex is determined from the adjacency or association between the other vertices and the vertex to be stored redundantly.
Graph data differs from other block-stored data, whose blocks have little association with one another and can be cached. The vertices of a graph are strongly associated, and the small graphs produced by cutting cannot be cached. Therefore, each time partitioning cuts a vertex, that vertex is copied into the new partition as well, so that associated vertices can be read from the same flash page as far as possible.
However, not every cut vertex should be stored redundantly, since that would incur considerable memory and storage overhead. It must therefore be decided which vertices need redundancy and which do not.
To this end, the invention introduces a new variable: let r be the sum of the out-degree and in-degree of a vertex to be cut, and let r' be a redundancy threshold. When r ≤ r', the vertex is stored redundantly; when r > r', the vertex is stored only once.
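The selection rule can be sketched in a few lines of Python, reading the comparison as r ≤ r'. The function names `degree_sums` and `is_redundant` and the sample edge list are hypothetical; the source does not give an implementation or the edges of its example graph.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # directed edge (source, destination)


def degree_sums(edges: List[Edge]) -> Dict[str, int]:
    """Compute r = out-degree + in-degree for every vertex in `edges`."""
    r: Dict[str, int] = {}
    for src, dst in edges:
        r[src] = r.get(src, 0) + 1  # contributes to the out-degree of src
        r[dst] = r.get(dst, 0) + 1  # contributes to the in-degree of dst
    return r


def is_redundant(r: int, r_prime: int) -> bool:
    """A cut vertex is stored redundantly when r <= r', otherwise once."""
    return r <= r_prime


if __name__ == "__main__":
    # Hypothetical edges, not taken from FIG. 2.
    edges = [("X", "Y"), ("Y", "Z"), ("W", "Y")]
    r = degree_sums(edges)
    for vertex in sorted(r):
        print(vertex, r[vertex], is_redundant(r[vertex], r_prime=2))
```

In this sample, Y has r = 3 and is stored once under r' = 2, while X, Z, and W each have r = 1 and would be copied if they were cut.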
With these variables, vertices that are highly associated with other vertices can be made redundant, improving read efficiency without wasting too much storage space.
The value of r' must be determined experimentally: data read efficiency and storage space usage are measured for different values of r', and the value at which the two are balanced is chosen as the optimum.
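One way such an experiment might be organized is sketched below in Python. It is a simplified model, not the inventors' procedure: the graph, the page assignment, and the two proxy metrics (extra stored copies for space, cut edges still needing two page reads for efficiency) are all assumptions, as is the name `sweep_thresholds`. How the two metrics are weighed against each other is not specified in the source, so the sketch only tabulates them.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (u, v): vertex v is read by way of vertex u


def sweep_thresholds(
    edges: List[Edge],
    home_page: Dict[str, int],
    thresholds: List[int],
) -> List[Tuple[int, int, int]]:
    """For each r', return (r', extra_copies, two_page_reads)."""
    r: Dict[str, int] = {}
    for u, v in edges:
        r[u] = r.get(u, 0) + 1
        r[v] = r.get(v, 0) + 1
    # A cut edge has its endpoints stored in different pages.
    cut_edges = [(u, v) for u, v in edges if home_page[u] != home_page[v]]
    rows = []
    for r_prime in thresholds:
        # v is copied into u's page when v qualifies under the threshold.
        copies: Set[Tuple[int, str]] = {
            (home_page[u], v) for u, v in cut_edges if r[v] <= r_prime
        }
        two_page_reads = sum(1 for _, v in cut_edges if r[v] > r_prime)
        rows.append((r_prime, len(copies), two_page_reads))
    return rows


if __name__ == "__main__":
    # Hypothetical graph and page assignment.
    edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c"), ("d", "e")]
    home_page = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
    for row in sweep_thresholds(edges, home_page, [0, 1, 2, 3]):
        print(row)
```

In a real experiment the proxies would be replaced by measured read latency and measured flash usage.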
For a simple graph, the value can be obtained in a few steps of calculation. Taking FIGS. 3a-3d as an example, suppose vertices E, D, H, and I are to be made redundant. With r' = 1, no vertex satisfies the condition; with r' = 2 or r' = 3, vertices E, H, and I satisfy it; with r' = 4, all of E, D, H, and I satisfy it, so r' = 4 is the ideal value. FIG. 2 is cut accordingly, and the result is shown in FIGS. 4a-4d: page P2 includes the associated vertex E of page P1, page P3 includes the associated vertex D of page P2, and page P4 includes the associated vertex H of page P3.
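The worked example can be checked with a short Python sketch. The degree sums are inferred from the thresholds at which each vertex first qualifies (E, H, and I at r' = 2, D at r' = 4) and are not stated in the source. The page contents follow the three copies named in the text; the placement of A and B and all identifiers are assumptions.

```python
from typing import Dict, List, Set

# Degree sums r inferred from the worked example (not stated explicitly).
R: Dict[str, int] = {"E": 2, "H": 2, "I": 2, "D": 4}


def selected(r_prime: int) -> List[str]:
    """Vertices among E, D, H, I that satisfy r <= r'."""
    return sorted(v for v, r in R.items() if r <= r_prime)


# Home page of each vertex; A and B in P1 are assumptions.
HOME: Dict[str, str] = {
    "A": "P1", "B": "P1", "E": "P1",
    "C": "P2", "D": "P2",
    "F": "P3", "G": "P3", "H": "P3", "I": "P3",
    "J": "P4", "K": "P4",
}

# Pages after redundant cutting: P2 also holds E, P3 also holds D,
# and P4 also holds H, as described for FIGS. 4a-4d.
REDUNDANT_LAYOUT: Dict[str, Set[str]] = {
    "P1": {"A", "B", "E"},
    "P2": {"C", "D", "E"},
    "P3": {"F", "G", "H", "I", "D"},
    "P4": {"J", "K", "H"},
}


def pages_needed(src: str, dst: str) -> int:
    """Pages read to reach dst from src: one if dst is on src's home page."""
    return 1 if dst in REDUNDANT_LAYOUT[HOME[src]] else 2


if __name__ == "__main__":
    for r_prime in (1, 2, 3, 4):
        print(r_prime, selected(r_prime))
    print(pages_needed("C", "E"))  # E is copied into P2 alongside C
```

The copies are one-directional in this model: reading E by way of C needs only P2, whereas reading C by way of E would still touch both P1 and P2.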
The above analysis shows that, in the RDF graph redundancy cutting storage method of the invention, a vertex that is highly associated with other vertices is copied during cutting and stored redundantly in two pages, so that reading the vertex by way of another vertex does not require searching a second page; that is, two pages of data need not be read, which improves read efficiency.
The foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the invention, and these modifications and refinements should also be regarded as falling within the protection scope of the invention.

Claims (2)

1. An RDF graph redundancy cutting storage method for a storage system, characterized in that, when a graph is cut and partitioned, a copy is made of the determined associated vertex of a cut vertex, so that the associated vertex is stored redundantly in two flash pages and its data can be read from the flash page where the cut vertex is located.
2. The RDF graph redundancy cutting storage method for a storage system according to claim 1, wherein the associated vertices are selected as follows:
a redundancy threshold r' is set; when the sum r of the out-degree and in-degree of a vertex to be cut is less than or equal to r', the vertex is taken as an associated vertex and stored redundantly; when r is greater than r', the vertex is stored only once and is not taken as an associated vertex.
CN202110307688.8A 2021-03-23 2021-03-23 Graph redundancy strategy of novel storage system Pending CN113051106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110307688.8A CN113051106A (en) 2021-03-23 2021-03-23 Graph redundancy strategy of novel storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110307688.8A CN113051106A (en) 2021-03-23 2021-03-23 Graph redundancy strategy of novel storage system

Publications (1)

Publication Number Publication Date
CN113051106A true CN113051106A (en) 2021-06-29

Family

ID=76514345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110307688.8A Pending CN113051106A (en) 2021-03-23 2021-03-23 Graph redundancy strategy of novel storage system

Country Status (1)

Country Link
CN (1) CN113051106A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809249A (en) * 2015-05-18 2015-07-29 北京嘀嘀无限科技发展有限公司 Processing method and system of data structure
CN106202175A (en) * 2016-06-24 2016-12-07 四川大学 Distributed dynamic figure management system towards big figure segmentation
CN109710774A (en) * 2018-12-21 2019-05-03 福州大学 It is divided and distributed storage algorithm in conjunction with the diagram data of equilibrium strategy
US20190325075A1 (en) * 2018-04-18 2019-10-24 Oracle International Corporation Efficient, in-memory, relational representation for heterogeneous graphs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210629