CN110825738A - Data storage and query method and device based on distributed RDF - Google Patents

Info

Publication number
CN110825738A
Authority
CN
China
Prior art keywords
query
data
star
rdf
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911006105.7A
Other languages
Chinese (zh)
Other versions
CN110825738B (en)
Inventor
宋佳明
张小旺
冯志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201911006105.7A priority Critical patent/CN110825738B/en
Publication of CN110825738A publication Critical patent/CN110825738A/en
Application granted granted Critical
Publication of CN110825738B publication Critical patent/CN110825738B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/248Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data storage and query method and device based on distributed RDF. The method comprises the following steps: generating a candidate data pattern set from the original RDF data, and counting the frequency of predicates and the frequency of data patterns; mining star patterns from the candidate data pattern set by set cover; establishing an index of the star patterns; constructing dynamic storage from the RDF data according to the star patterns to form star pattern tables; parsing and optimizing the SPARQL query statement based on breadth-first analysis and reverse consciousness optimization, using the index and the statistical data of the star pattern tables, to obtain an optimal query plan; and converting the query plan into a physical execution process, which is executed on the star pattern table storage. The device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; the processor implements the method steps when executing the program. Because the stored data is represented as star pattern tables, the invention saves a large amount of storage space compared with an attribute table and reduces the number of joins at query time compared with vertical partition tables.

Description

Data storage and query method and device based on distributed RDF
Technical Field
The invention relates to the field of storage and query of RDF (Resource Description Framework) data, and in particular to a data storage and query method and device based on distributed RDF.
Background
The Resource Description Framework (RDF) is a language specification recommended by the World Wide Web Consortium (W3C) for describing resources and the relationships between them. An RDF dataset can also be described as a directed labeled graph, in which each triple represents an edge, its subject and object represent the two vertices, and its predicate represents the label of the edge. For querying RDF data, the W3C proposes the SPARQL Protocol and RDF Query Language (SPARQL for short) as the standard query language. As the volume of RDF data grows rapidly and SPARQL statements become more complex with increasingly diverse user demands, it is becoming difficult to store data efficiently.
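As a minimal illustration of the graph view described above, the following Python sketch turns triples into adjacency lists whose edges carry the predicate as a label. The sample triples and the helper name `to_labeled_graph` are hypothetical and are not taken from the invention.

```python
from collections import defaultdict

# Hypothetical sample triples in (subject, predicate, object) form.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "name", '"Alice"'),
    ("bob", "worksFor", "acme"),
]

def to_labeled_graph(triples):
    """Return adjacency lists: subject -> list of (edge label, object)."""
    graph = defaultdict(list)
    for s, p, o in triples:
        # Each triple is one directed edge s -> o labeled with predicate p.
        graph[s].append((p, o))
    return dict(graph)

graph = to_labeled_graph(triples)
```

Vertices that appear only as objects, such as `acme` here, have no outgoing edges and therefore no entry in the adjacency map.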
Currently, RDF data processing methods fall into three main categories: relation-based methods (RDF data is stored in a relational database and queried in SQL), partition-based methods (each node of the cluster stores a different part of the RDF data and queries only its own part), and graph-based methods (queries are executed by graph search). Comprehensive qualitative analysis and experimental comparison show that relation-based and partition-based methods are superior to graph-based methods in query efficiency and flexibility. Moreover, relation-based methods can draw on the theoretical research and practical experience of the past 40 years, so it is necessary to build the RDF processing engine on these results.
Because SQL algebra and basic SPARQL algebra are similar, a large number of relational storage schemes have been proposed for storing RDF. The simplest scheme stores the data in a single three-column table (subject, predicate, object) without any prior knowledge, which does little to accelerate queries. The second scheme is the attribute table, which creates one table for each class of entities in the RDF data; apart from the subject column, the column names are the predicates connected to the subject. The third scheme splits the data into one table per predicate, also called vertical partitioning, where the two columns of each predicate table are the subject and the object. However, none of these schemes can simultaneously accommodate the relationship sparsity, association complexity and structural imbalance of RDF data, and none copes effectively with large-scale data and complex queries.
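The three relational layouts described above can be contrasted on a tiny example. The sketch below is illustrative only: the sample data and variable names are hypothetical, and plain Python containers stand in for relational tables.

```python
from collections import defaultdict

# Hypothetical sample data.
triples = [
    ("alice", "name", "Alice"),
    ("alice", "age", "30"),
    ("bob", "name", "Bob"),
]

# 1. Triple table: a single three-column table, no prior knowledge required.
triple_table = list(triples)

# 2. Attribute table: one row per subject, one column per predicate.
#    Missing values become None (NULL), which is where sparse data wastes space.
#    This simplified version keeps only one value per (subject, predicate).
predicates = sorted({p for _, p, _ in triples})
rows = defaultdict(dict)
for s, p, o in triples:
    rows[s][p] = o
attribute_table = {s: [vals.get(p) for p in predicates] for s, vals in rows.items()}

# 3. Vertical partitioning: one two-column (subject, object) table per predicate.
#    Reading several predicates of one subject then requires joins.
vertical_partitions = defaultdict(list)
for s, p, o in triples:
    vertical_partitions[p].append((s, o))
```

In this example `bob` has no `age`, so the attribute table stores a NULL for him, while the vertical partitions store nothing extra but need a join on the subject to return both `name` and `age`.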
Against this background, the need for an adaptive RDF processing method is pressing. On the storage side, adaptive methods are mainly data-aware or workload-aware, exploiting the characteristics of the data and of the existing query history respectively. Data-aware methods generally partition triples by common subject or common object, which works well for star queries, but for complex queries the communication cost between nodes may become excessive. There are also methods that cluster the graph data into partitions: traditional data clustering cannot capture the topology of the RDF graph, while the more sophisticated graph-based clustering is very expensive in time. Community detection clustering based on label propagation performs well in construction time and data balance, but it does not provide a query execution plan suited to the resulting storage. Workload-aware methods train on triple patterns extracted from a query set, either training the data partitioning directly or first partitioning with a simple method and then replicating frequently accessed data among nodes. On the query side, existing adaptive query planning adjusts the number of nodes and the query plan according to the shape of the query, which avoids unnecessary communication cost and node workload, but scheduling cost must be considered at the same time.
The prior art has the following disadvantages:
A distributed RDF processing method aims to answer most queries within milliseconds and to aggregate and return the query results, on the premise that large-scale RDF data is stored and partitioned effectively. In practice, large-scale RDF data exhibits relationship sparsity, association complexity and structural imbalance, so distributed RDF processing places high demands on scalability, fault tolerance, message passing, load balancing and data throughput, and the key technologies face great challenges in overall architecture, data transmission, application interfaces and message passing. Existing distributed RDF processing methods were not designed from the outset to balance storage, reduce the number of joins at query time and reduce intermediate results together, so they cannot effectively keep up with the rapid growth of RDF data volume. An adaptive RDF query processing method is therefore needed that can store large amounts of RDF data efficiently and process queries quickly.
Disclosure of Invention
The invention provides a data storage and query method and device based on distributed RDF (Resource Description Framework). The stored data is represented as star pattern tables, which saves a large amount of storage space compared with an attribute table and reduces the number of joins at query time compared with vertical partition tables. The invention generates the query plan based on breadth-first analysis and reverse consciousness, which significantly improves query efficiency, as described in detail below:
a data storage and query method based on distributed RDF, comprising the following steps:
generating a candidate data pattern set from the original RDF data, and counting the frequency of predicates and the frequency of data patterns;
mining star patterns from the candidate data pattern set by set cover; establishing an index of the star patterns;
constructing dynamic storage from the RDF data according to the star patterns to form star pattern tables;
parsing and optimizing the SPARQL query statement based on breadth-first analysis and reverse consciousness optimization, using the index and the statistical data of the star pattern tables, to obtain an optimal query plan;
converting the query plan into a physical execution process, which is executed on the star pattern table storage.
The data pattern set has the following properties: different data patterns may contain the same predicate; the RDF subgraphs corresponding to different data patterns in the RDF data set D do not overlap one another; the length of a data pattern lies in the range [1, maximum out-degree of a subject in D]; a data pattern consists of all predicates connected to the same subject.
Further, a star pattern is defined as: all predicates connected to the same subject.
Preferably, the breadth-first analysis specifically comprises:
decomposing the query to obtain a star pattern query sequence: given a basic graph pattern, initializing the vertex queue of the query graph, and dequeuing vertices in turn to obtain the root of each sub-query;
extracting the star pattern by means of the index pToSP and constructing the star pattern query, then placing the unvisited adjacent vertices of the query graph into the vertex queue, saving the variables contained in the query for obtaining the join variables of the next sub-query, and repeating this process until the vertex queue is empty.
Preferably, the reverse consciousness optimization specifically comprises:
first, constructing a subject supernode graph, in which an edge between a local node of a parent node and a child node indicates that a connection relationship exists between them; each supernode maintains an SP list internally, and the initial cardinalities of the supernodes and local nodes are calculated;
updating the cardinality of each local node from bottom to top, recursively sorting the local nodes inside each supernode, and taking the smallest local node cardinality as the cardinality of the supernode;
selecting the supernodes on the same layer in turn from top to bottom, merging each selected supernode with its parent node, and appending its local node sequence to the local node sequence of the parent node.
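The three steps above can be sketched in Python as follows. This is a simplified reading and not the patented algorithm: the text does not state how a local node's cardinality is updated, so the sketch assumes it is lowered to the smallest cardinality of the child supernodes attached to it. The classes `LocalNode` and `SuperNode` and both function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class LocalNode:
    name: str                  # label of one star pattern sub-query
    cardinality: float         # initial cardinality estimate
    children: list = field(default_factory=list)  # child SuperNodes joined via this node

@dataclass
class SuperNode:
    subject: str
    local_nodes: list          # the SP list of this supernode (assumed non-empty)
    cardinality: float = 0.0

def update_bottom_up(node):
    """Propagate cardinalities from deep nodes to shallow nodes."""
    for local in node.local_nodes:
        for child in local.children:
            update_bottom_up(child)
            # Assumption: a selective child bounds the result of the local node it joins.
            local.cardinality = min(local.cardinality, child.cardinality)
    node.local_nodes.sort(key=lambda l: l.cardinality)
    # The supernode is represented by its smallest local node cardinality.
    node.cardinality = node.local_nodes[0].cardinality
    return node

def flatten_top_down(root):
    """Merge supernodes layer by layer, appending child sequences after the parent's."""
    order, layer = [], [root]
    while layer:
        next_layer = []
        for sn in sorted(layer, key=lambda s: s.cardinality):
            for local in sn.local_nodes:
                order.append(local.name)
                next_layer.extend(local.children)
        layer = next_layer
    return order

# Hypothetical query: ?x has sub-queries A (100 rows) and B (50 rows);
# A leads to ?y, whose sub-query C is very selective (10 rows).
y = SuperNode("?y", [LocalNode("C", 10), LocalNode("D", 500)])
x = SuperNode("?x", [LocalNode("A", 100, [y]), LocalNode("B", 50)])
order = flatten_top_down(update_bottom_up(x))
```

Looking only at the first layer, B (50) would be ordered before A (100). After the bottom-up update, A inherits the selectivity of C and is executed first, which is the effect of deep nodes on shallow nodes that the optimization is meant to capture.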
A data storage and query apparatus based on distributed RDF, the apparatus comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method steps when executing the program.
The technical scheme provided by the invention has the beneficial effects that:
1. The invention establishes a storage mechanism and a query strategy with comprehensive consideration of scalability, system fault tolerance, message passing, load balancing and data throughput, covering the whole process of RDF data storage and query: data pattern division, storage, query planning and query execution.
2. The invention provides storage based on heuristic set cover. Data patterns are extracted from the RDF data set, their lengths and frequencies are counted, and the predicates are divided using the set cover idea to obtain the divided star patterns. To limit the influence of the heuristic starting point on the result, an expected data pattern length E(k) is first obtained using a normal distribution model, and predicate division starts from data patterns of length E(k). A strategy of separating attributes from relations is added to this process, which prevents the data pattern containing a given relation from carrying a large amount of data irrelevant to the query.
3. The invention designs a breadth-first SPARQL query plan generation algorithm and a query optimization method based on reverse consciousness. Starting from the subject-centered definition of star pattern queries, the breadth-first query plan generation algorithm combines intermediate result size and join cost to define a cost model for star pattern queries and orders the star pattern sub-queries. The reverse consciousness optimization method takes into account the influence of deep nodes of the query tree on shallow nodes and avoids over-pruning of the query plan.
4. The invention evaluates the efficiency of the proposed storage method and the performance of the query algorithm through a number of comparative experiments on the standard benchmark data set WatDiv and the real data set YAGO. The experimental results show that the adaptive RDF processing method of the invention supports large-scale RDF data storage well in terms of storage compression and storage time, and matches or exceeds existing Hadoop/Spark-based distributed RDF processing engines on both simple and complex queries.
Drawings
FIG. 1 is a statistical plot of the number of data pattern lengths for the actual data sets YAGO and DBpedia;
FIG. 2 is an overall framework of the adaptive RDF process of the present invention;
FIG. 3 is a storage construction flow diagram of the present invention;
FIG. 4 is a flowchart illustration of a query plan;
FIG. 5 is a schematic diagram of the query response time of the present invention on a synthetic dataset WatDiv;
FIG. 6 is a graph illustrating the query response time of the present invention on a real data set YAGO.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
The embodiment of the invention provides a data storage and query method based on distributed RDF. The terms used by the invention are first defined:
Referring to FIG. 1, the invention analyzes the data patterns of real RDF data. A data pattern is defined as follows: let $D$ be an RDF dataset consisting of triples of the form (subject, predicate, object), i.e. $(s, p, o)$. A data pattern is $DP = \{p_i \mid (s, p_i, o_i) \in D\}$, i.e. all predicates connected to the same subject. The data pattern set is the set of all data patterns formed from $D$. The data pattern set has the following properties: (1) different data patterns may contain the same predicate; (2) the RDF subgraphs corresponding to different data patterns in $D$ do not overlap one another; (3) the length of a data pattern lies in the range [1, maximum out-degree of a subject in $D$].
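The definition above translates directly into a grouping of predicates by subject. The following Python sketch extracts the data patterns of a small dataset and counts predicate and data pattern frequencies; the function name `extract_data_patterns` and the sample triples are hypothetical.

```python
from collections import Counter, defaultdict

def extract_data_patterns(triples):
    """Return (predicate frequency, data pattern frequency) for a list of triples."""
    by_subject = defaultdict(set)
    predicate_freq = Counter()
    for s, p, _o in triples:
        by_subject[s].add(p)      # DP of s = all predicates connected to s
        predicate_freq[p] += 1
    # Subjects with the same predicate set share one data pattern.
    pattern_freq = Counter(frozenset(ps) for ps in by_subject.values())
    return predicate_freq, pattern_freq

# Hypothetical sample data.
triples = [
    ("alice", "name", "Alice"), ("alice", "age", "30"),
    ("bob", "name", "Bob"), ("bob", "age", "25"),
    ("carol", "name", "Carol"),
]
predicate_freq, pattern_freq = extract_data_patterns(triples)
```

Here the pattern {name, age} occurs twice and {name} once. The two patterns share the predicate `name`, which illustrates property (1), and every subject belongs to exactly one pattern, which illustrates property (2).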
The star pattern mentioned in the invention is defined as follows: let $D$ be an RDF dataset; a star pattern is $SP = \{y_i \mid (x, y_i, z_i) \in D\}$, i.e. predicates connected to the same subject. The star pattern set is the set of all star patterns formed from $D$: $SPS = \{SP_1, \dots, SP_n\}$, where $SP_1 \cup SP_2 \cup \dots \cup SP_n = P$, $P$ is the set of predicates of $D$, and $SP_i \cap SP_j = \emptyset$ for $i \neq j$.
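The two conditions on the star pattern set (the patterns cover the predicate set and are pairwise disjoint) can be checked mechanically. A minimal sketch, with the hypothetical helper name `is_star_pattern_set`:

```python
def is_star_pattern_set(star_patterns, predicates):
    """True if the star patterns are pairwise disjoint and their union is `predicates`."""
    covered = set()
    for sp in star_patterns:
        if covered & set(sp):
            return False          # SP_i and SP_j share a predicate
        covered |= set(sp)
    return covered == set(predicates)

P = {"name", "age", "knows"}
valid = is_star_pattern_set([{"name", "age"}, {"knows"}], P)
overlapping = is_star_pattern_set([{"name", "age"}, {"name", "knows"}], P)
incomplete = is_star_pattern_set([{"name", "age"}], P)
```

Unlike data patterns, which may share predicates, a valid star pattern set is a partition of the predicate set, so every predicate is stored in exactly one star pattern table.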
Referring to FIGS. 1 to 4, the method includes the following steps:
Step one: generating a candidate data pattern set from the original RDF data, and counting the frequency of predicates and the frequency of data patterns;
Step two: mining star patterns from the candidate data pattern set by set cover;
Referring to FIGS. 1 to 3, data patterns are first extracted from the original RDF data set, and the expected data pattern length E(k) is calculated using a calculation model based on the normal distribution together with a benefit model built from the predicate frequency and the data pattern frequency. A data pattern of length E(k) is then taken as the starting point of the heuristic algorithm, and in each round the predicates are divided according to the selected data pattern using the set cover idea, yielding the star pattern set.
Step three: establishing an index of the star patterns obtained in step two;
Referring to FIG. 3, an index of the star patterns is established, storage tables are constructed from the star pattern set produced by the data pattern division module, and the data is compressed and stored on HDFS in the Parquet file format (known to those skilled in the art and not described further in this embodiment).
Step four: constructing dynamic storage, namely the star pattern tables, from the RDF data according to the star patterns, and persisting it in the Hadoop distributed file system together with the statistical data;
To answer a SPARQL query efficiently, the query statement must be optimized according to the statistical data of the star pattern tables and the query algorithm to obtain an optimal query plan.
Step five: after receiving a query request from a user, parsing and optimizing the SPARQL query statement based on breadth-first analysis and reverse consciousness, using the star pattern index from step three and the statistical data of the star pattern tables from step four, to obtain an optimal query plan;
The method defines the conversion from a SPARQL query statement to star pattern sub-queries according to the query plan. The optimized query plan is taken as input and the final query result is produced. The aim is to reduce the size of all intermediate results generated during the query. As shown in FIG. 4, the method divides the query into supernodes, i.e. sub-queries, by subject, and then generates a query plan based on breadth-first analysis and reverse consciousness.
Step six: converting the query plan of step five into a physical execution process, and executing it on the star pattern table storage obtained in step four.
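The storage side of steps three and four can be sketched in memory as follows. This is an illustration under stated assumptions, not the implementation of the invention: reading the index pToSP as a map from predicate to star pattern is an assumption, the function name `build_star_tables` is hypothetical, and nested dictionaries stand in for the tables that the invention persists on HDFS.

```python
from collections import defaultdict

def build_star_tables(triples, star_patterns):
    """Build the predicate-to-star-pattern index and one table per star pattern."""
    # Assumed reading of pToSP: predicate -> id of the star pattern containing it.
    p_to_sp = {p: i for i, sp in enumerate(star_patterns) for p in sp}
    tables = defaultdict(lambda: defaultdict(dict))
    for s, p, o in triples:
        # Row key is the subject; each predicate column holds a list of objects.
        tables[p_to_sp[p]][s].setdefault(p, []).append(o)
    return p_to_sp, {i: dict(rows) for i, rows in tables.items()}

# Hypothetical sample data and star pattern set.
triples = [
    ("alice", "name", "Alice"), ("alice", "age", "30"),
    ("alice", "knows", "bob"), ("bob", "name", "Bob"),
]
star_patterns = [frozenset({"name", "age"}), frozenset({"knows"})]
p_to_sp, tables = build_star_tables(triples, star_patterns)
```

A query on `name` and `age` of the same subject reads a single table without a join, whereas the relation `knows` lives in its own table and is only loaded when a query needs it.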
Example 2
The method counts the occurrences of data patterns in the YAGO and DBpedia data sets by data pattern length and plots the distribution. As shown in FIG. 1, the data pattern lengths follow a skewed distribution overall. The length ranges from 1 to the maximum out-degree of a vertex in the RDF graph.
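The distribution plotted in FIG. 1 can be derived from the data pattern frequencies. The sketch below uses hypothetical frequencies, not the YAGO or DBpedia statistics, and the helper names are illustrative.

```python
from collections import Counter

def length_distribution(pattern_freq):
    """Map data pattern length -> number of subjects whose pattern has that length."""
    dist = Counter()
    for pattern, freq in pattern_freq.items():
        dist[len(pattern)] += freq
    return dict(sorted(dist.items()))

def peak_length(dist):
    """Length with the highest frequency; ties go to the shorter length."""
    return max(dist, key=lambda k: (dist[k], -k))

# Hypothetical data pattern frequencies.
pattern_freq = {
    frozenset({"name"}): 4,
    frozenset({"name", "age"}): 7,
    frozenset({"name", "email"}): 2,
    frozenset({"name", "age", "knows"}): 3,
}
dist = length_distribution(pattern_freq)
k1 = peak_length(dist)
```

The peak length computed here corresponds to the value called k1 in the following derivation.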
The area under the curve of FIG. 1 along the x-axis describes the distribution of the data, so the distribution function $F(x)$ of the curve can be defined as the ratio of the amount of data whose data pattern length is less than or equal to $x$ to the total amount of data, expressed as follows:
[Formula (1): image BDA0002242826360000061 in the original publication]
s.t. $G(k_1) \le 0.5$
where
[Formula for $G$: image BDA0002242826360000062 in the original publication]
denotes the discrete integration of the left half of the curve up to $k_1$, the data pattern length with the highest frequency of occurrence, and $x_i$ and $y_i$ are the abscissa and ordinate of each coordinate point of the distribution plot.
The mean $\mu$ of the normal distribution curve fitted to the right half can be determined by the formula $\mu = \arg G(0.5)$, and the standard deviation $\sigma$ can then be calculated from $\mu$ and the data in the right half of the curve.
After the parameters $\mu$ and $\sigma$ are obtained, the expected value of $k$ is obtained from the distribution function:
$E(k) = \arg F(\varepsilon)$ (2)
where $\varepsilon$ is defined as the ratio of the amount of data the cache can hold to the size of the data set. When memory is sufficient, $\varepsilon$ is set to 0.8413, in which case $k = \mu + \sigma$.
All of the above results are obtained under the assumption that the area to the left of the peak $k_1$ is at most 0.5 of the total. If the area to the left of the peak $k_1$ is greater than 0.5, the $\mu$ of the curve to the right of $k_1$ is smaller than the peak $k_1$, and the approximate value of $E(k)$ is then obtained using the formula $E(k) = \arg G(\varepsilon)$.
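The choice $\varepsilon = 0.8413$ can be checked numerically: for a normal distribution, the cumulative probability at one standard deviation above the mean is approximately 0.8413, so inverting the distribution function at that value returns $\mu + \sigma$. The sketch below assumes that $F$ is the cumulative distribution function of a normal distribution with parameters $\mu$ and $\sigma$; the function name and the sample values of $\mu$ and $\sigma$ are hypothetical.

```python
from statistics import NormalDist

def expected_pattern_length(mu, sigma, epsilon=0.8413):
    """E(k) = arg F(epsilon), assuming F is the CDF of Normal(mu, sigma)."""
    return NormalDist(mu, sigma).inv_cdf(epsilon)

# Hypothetical parameters fitted to a length distribution.
mu, sigma = 5.0, 2.0
e_k = expected_pattern_length(mu, sigma)            # close to mu + sigma
median = expected_pattern_length(mu, sigma, 0.5)    # epsilon = 0.5 gives mu
```

A smaller cache corresponds to a smaller $\varepsilon$ and therefore to a shorter starting pattern length. In practice the result would be rounded to an integer length before it is used as the starting point of the heuristic.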
Referring to FIG. 2, the overall framework of the adaptive RDF processing of the invention comprises data pattern division, storage construction, query planning and query execution. The storage construction part preprocesses and stores the RDF data. After a client submits a query, the query planning part decomposes and optimizes it; the required tables are then selected from HDFS and loaded into memory, and the final result is returned to the client after query execution.
Referring to FIG. 3, the RDF data storage flow of the storage method of the invention is as follows: all data patterns are extracted from the RDF data set, the predicates are then grouped to obtain non-overlapping star patterns, and the original data set is divided into several tables for storage according to the star pattern set.
First, the predicates in the data must be divided into non-overlapping subsets. This is a set cover problem, which is NP-hard, so a heuristic division strategy is proposed, and the divided data patterns are obtained through this strategy. The benefit of a data pattern is defined as follows:
[Formula: image BDA0002242826360000071 in the original publication]
where $dp_i$ is the data pattern whose benefit is being computed and $f(p_j)$ is the frequency with which predicate $p_j$ of the data pattern occurs in the data set.
In addition, in a knowledge graph, entities can be filtered by specifying attribute values, and entities are connected to one another through relation predicates. An entity and its attributes form a "star cluster", and an entity together with a relation and the entity reached through that relation forms a "chain". If predicates are divided without considering the difference between attributes and relations, a relation may be bound together with certain attributes. Locating an entity through several attributes is very common in queries, so not distinguishing attributes from relations does not affect multi-attribute star queries; however, when only a few relations (columns) of a star pattern are queried, the data of the entire table corresponding to that star pattern must be traversed once.
The heuristic predicate division method based on set cover is as follows:
(1) Starting from the data patterns of length k (initially k = E(k)), select the data pattern with the maximum benefit in the current state. Calculate the benefit of this data pattern, the benefit of the attribute pattern it contains (attribute pattern size > 1), and the benefit of each pattern that combines the attribute pattern with any relation predicate, and add the pattern with the maximum benefit to the star pattern set.
Experiments show that, for the attribute pattern of a given data pattern, data patterns formed by combining it with only one relation predicate are rare in data sets; on the other hand, comparing all combinations of the attribute pattern with relation predicates is too costly. A compromise is therefore adopted: in the selected data pattern dp1, relation predicates are deleted one at a time in ascending order of frequency to obtain an updated data pattern dp2, and the benefits of dp1 and dp2 are compared; this is repeated until the pattern with the largest benefit contained in dp1 is found as the final result.
If the benefits are equal, the pattern with the greatest length is added to the star pattern set. If the number of uncovered predicates equals 1, a pattern is created directly for that predicate; otherwise, step (2) is executed;
(2) If all predicates are covered, return the obtained star pattern set. Otherwise, remove the covered predicates from the remaining data patterns of length k, recalculate the benefits, and execute step (3);
(3) Set the expected length to k-1 and repeat steps (1) to (3) until k equals 0, then execute step (4);
(4) If the predicates are not yet fully covered, continue selecting patterns from the remaining data patterns according to the attribute/relation separation strategy of step (1) and the benefit, update the benefits, and repeat step (4) until all predicates are covered.
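The overall shape of steps (1) to (4) can be sketched as a greedy cover. The sketch below is deliberately simplified and is not the method of the invention: it omits the separation of attributes from relations, and because the benefit formula is not available in this text it assumes, purely for illustration, that the benefit of a pattern is the sum of the frequencies of its still-uncovered predicates. All names and sample values are hypothetical.

```python
def greedy_star_patterns(patterns, predicate_freq, start_len):
    """Greedily pick disjoint star patterns until every predicate is covered.

    patterns: non-empty list of frozensets of predicates (the data patterns).
    """
    uncovered = set().union(*patterns)
    star_patterns = []

    def benefit(candidate):
        # Assumed benefit model, NOT the formula of the invention.
        return sum(predicate_freq[p] for p in candidate)

    max_len = max(len(dp) for dp in patterns)
    # Visit lengths start_len, start_len-1, ..., 1, then the remaining longer lengths.
    lengths = list(range(min(start_len, max_len), 0, -1))
    lengths += list(range(start_len + 1, max_len + 1))
    for k in lengths:
        while uncovered:
            # Data patterns of original length k with covered predicates removed.
            candidates = {frozenset(dp & uncovered) for dp in patterns if len(dp) == k}
            candidates.discard(frozenset())
            if not candidates:
                break
            # Highest benefit first; ties go to the longer pattern, then by name.
            best = max(candidates, key=lambda c: (benefit(c), len(c), sorted(c)))
            star_patterns.append(best)
            uncovered -= best
    return star_patterns

patterns = [frozenset({"name", "age"}), frozenset({"name", "email"}), frozenset({"knows"})]
predicate_freq = {"name": 3, "age": 2, "email": 1, "knows": 1}
result = greedy_star_patterns(patterns, predicate_freq, start_len=2)
```

With these inputs the pattern {name, age} is chosen first, after which `name` is already covered and the second data pattern contributes only {email}. The output is a partition of the predicate set, as required of a star pattern set.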
FIG. 4 shows an example of a query plan of the invention. In the query decomposition stage, redundancy removal and rewriting at the query language level must be completed, and the SPARQL query is equivalently converted into a matching and lookup form suited to the storage, so that a semantically correct, preliminarily optimized execution sequence of data pattern queries is output. From the perspective of redundancy elimination, two points must be guaranteed in the query decomposition stage: (1) for a connected query graph, adjacent sub-queries must share variables, otherwise the join produces a Cartesian product; (2) the edges of the same subject are queried only once. Based on these two considerations, the query graph must be decomposed by traversing its nodes in order.
The most common graph traversal methods are depth-first traversal and breadth-first traversal. Decomposing the query by depth-first traversal also achieves the two goals above. However, when an optimal root node has been selected, the execution (join) order produced by depth-first traversal cannot prune intermediate results as effectively as the order produced by breadth-first traversal.
Because relations are separated from attributes, any branch with child nodes must be a relation. At each join, the crucial cardinality estimate is that of the joined star pattern SP and its child nodes. When joining on the same subject, the number of join results is bounded by the branch with the fewest records. For more accurate cardinality estimation, the following parameter estimation model is therefore constructed, namely:
where $|c|$ denotes the number of constants in the triples connected to $s$; the more constants there are, the more $s$ is restricted and the fewer results are obtained;
[Symbol: image BDA0002242826360000082 in the original publication]
denotes the size of the SP table, which is the least upper bound on the number of results of the star pattern sub-query SPQ to be joined; likewise, the smaller this value, the stronger the restriction on $s$. Thus, the smaller the cardinality of $s$ calculated from the formula, the fewer results the sub-query composed of the triples connected to $s$ will yield.
The query is decomposed in breadth-first order to obtain a sequence of star pattern queries. Given a basic graph pattern (BGP), the vertex queue of the query graph is initialized first, and vertices are then dequeued in turn to obtain the root of each sub-query. For the root of each sub-query, the star pattern is extracted by means of the index pToSP and the star pattern query is constructed; the unvisited adjacent vertices in the query graph are then placed in the vertex queue, and the variables contained in the query are saved for obtaining the join variables of the next sub-query. This process is repeated until the vertex queue is empty.
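The breadth-first decomposition described above can be sketched as follows. The sketch makes several assumptions that are not stated in the text: pToSP is read as a map from predicate to star pattern id, all predicates in the BGP are constants, adjacency follows outgoing edges only, and variables are strings starting with `?`. The function names are hypothetical.

```python
from collections import deque

def variables(tps):
    """Variables (terms starting with '?') used in a list of triple patterns."""
    return {t for tp in tps for t in tp if t.startswith("?")}

def bfs_decompose(bgp, root, p_to_sp):
    """Return sub-queries as (root, star pattern id, triple patterns, join variables)."""
    by_subject = {}
    for tp in bgp:
        by_subject.setdefault(tp[0], []).append(tp)
    queue, visited = deque([root]), {root}
    plan, seen_vars = [], set()
    while queue:
        v = queue.popleft()
        groups = {}
        for tp in by_subject.get(v, []):
            # Triple patterns of one subject are grouped by the star pattern of their predicate.
            groups.setdefault(p_to_sp[tp[1]], []).append(tp)
            if tp[2] not in visited:
                visited.add(tp[2])
                queue.append(tp[2])     # enqueue unvisited adjacent vertex
        for sp_id, tps in groups.items():
            vs = variables(tps)
            # Join variables are those shared with the sub-queries planned so far.
            plan.append((v, sp_id, tps, sorted(vs & seen_vars)))
            seen_vars |= vs
    return plan

bgp = [("?x", "name", "?n"), ("?x", "knows", "?y"),
       ("?y", "name", "?m"), ("?y", "age", "?a")]
plan = bfs_decompose(bgp, "?x", {"name": 0, "age": 0, "knows": 1})
summary = [(root, sp, joins) for root, sp, _tps, joins in plan]
```

Every sub-query after the first shares at least one variable with an earlier one, so no Cartesian product arises, and the edges of each subject are visited exactly once.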
To reduce the occurrence of Cartesian products during query processing, that is, to convert the query graph into a single tree rather than several trees wherever possible, vertices without incoming edges should be preferred as roots: the vertices that appear only as subjects in the query graph are found and sorted by benefit. If no vertex in the graph appears only as a subject, the graph must contain a ring, so the subject with the smallest cardinality can be selected as the starting point, i.e., the root of the tree. The cardinality of each vertex is defined as the selectivity of its corresponding data pattern sub-query, as shown in Equation 5.
Figure BDA0002242826360000091
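The root selection rule can be sketched as follows. Because Equation 5 is not reproduced in this text, the sketch takes the per-vertex cardinalities as a ready-made input dictionary; that dictionary, the function name `choose_root`, and the use of the smallest cardinality to rank subject-only vertices are assumptions.

```python
def choose_root(bgp, cardinality):
    """Pick the root vertex of the query tree.

    bgp         -- list of (subject, predicate, object) triple patterns
    cardinality -- hypothetical precomputed cardinality per subject vertex
    """
    subjects = {s for s, _, _ in bgp}
    objects = {o for _, _, o in bgp}
    # Vertices that appear only as subjects have no incoming edges.
    no_incoming = subjects - objects
    # If there are none, fall back to all subjects.
    candidates = no_incoming if no_incoming else subjects
    # sorted() makes tie-breaking deterministic.
    return min(sorted(candidates), key=lambda v: cardinality[v])


tree_bgp = [("?x", "knows", "?y"), ("?y", "worksAt", "?c")]
ring_bgp = [("?a", "p", "?b"), ("?b", "q", "?c"), ("?c", "r", "?a")]

tree_root = choose_root(tree_bgp, {"?x": 40, "?y": 7})
ring_root = choose_root(ring_bgp, {"?a": 50, "?b": 5, "?c": 20})
```

For the tree-shaped query the only subject-only vertex `?x` is chosen even though `?y` has a smaller cardinality; for the ring query the subject with the smallest cardinality, `?b`, is chosen.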
The sub-queries are acquired as follows. Given the root node root of a sub-query SPQ, the triple patterns TPs that have root as their subject are obtained from the query graph, and the corresponding star patterns are acquired for these TPs in turn. Triple patterns already covered by the acquired star patterns SPs are removed from TPs, and this continues until TPs is empty. A sequence of one or more SPQs rooted at root can then be constructed from the acquired SPs and the TPs corresponding to each SP.
During query decomposition, considering only the cardinality of the next sub-query and of the current intermediate result is not accurate enough; the choice of a sub-query should take into account everything that can follow it. Therefore, when the branches on layer n are ordered under the hierarchical decomposition method, a reverse-aware algorithm is proposed that incorporates the statistics of layer n+1 through the last layer.
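The acquisition loop for a single root can be sketched in Python. The text does not say how a star pattern is chosen when several contain the same predicate, so the greedy rule used here (take the candidate that covers the most remaining triple patterns) is an assumption, as are the names `acquire_subqueries`, `p_to_sps` and `star_patterns`.

```python
def acquire_subqueries(root, bgp, p_to_sps, star_patterns):
    """Cover the triple patterns rooted at `root` with star patterns.

    p_to_sps      -- hypothetical index: predicate -> list of star pattern ids
    star_patterns -- hypothetical catalogue: star pattern id -> set of predicates
    Returns a list of (star pattern id, covered triple patterns).
    """
    tps = [t for t in bgp if t[0] == root]
    subqueries = []
    while tps:
        candidates = p_to_sps[tps[0][1]]
        # Assumption: prefer the star pattern covering most remaining triples.
        best = max(
            candidates,
            key=lambda sp: sum(1 for t in tps if t[1] in star_patterns[sp]),
        )
        covered = [t for t in tps if t[1] in star_patterns[best]]
        subqueries.append((best, covered))
        # Remove the covered triple patterns; repeat until none are left.
        tps = [t for t in tps if t not in covered]
    return subqueries


catalogue = {"SP1": {"name", "age"}, "SP2": {"knows"}, "SP3": {"name"}}
index = {"name": ["SP1", "SP3"], "age": ["SP1"], "knows": ["SP2"]}
query = [
    ("?x", "name", "?n"),
    ("?x", "age", "?a"),
    ("?x", "knows", "?y"),
    ("?y", "name", "?m"),
]
spqs = acquire_subqueries("?x", query, index, catalogue)
```

Here the three triple patterns rooted at `?x` are covered by two sub-queries, one over SP1 (name and age) and one over SP2 (knows).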
Referring to fig. 4(b), a subject supernode graph is first constructed, in which an edge between a local node of a parent node and a child node indicates that a connection relationship exists between them. Each supernode internally maintains a list of SPs.
Referring to fig. 4(c) and (d), the cardinality of each local node is calculated from bottom to top, the local nodes inside each supernode are ordered recursively, and the cardinality of a supernode is represented by the smallest cardinality among its local nodes.
Referring to fig. 4(e) to (g), after the computation is completed, the supernodes on the same layer are selected in turn from top to bottom. Each selected supernode is merged with its parent node, and its local node sequence is appended to the end of the parent node's local node sequence. Because this process is equivalent to reducing the height of the tree, every selected supernode is merged with its parent.
The query optimization comprises the following specific steps:
(1) converting the query graph into a supergraph composed of supernodes, decomposing each supernode in the process to obtain its local SP nodes, and calculating the cardinality of each local node and the initial cardinality of each supernode;
(2) converting the supergraph into a tree, initializing the root according to the initial cardinalities of the supernodes, and linking each local node of a supernode with the supernodes of the next layer;
(3) ordering the local nodes from bottom to top, and updating the cardinalities of the local nodes and the supernodes;
(4) constructing the SPQs from top to bottom, merging the supernodes along their edges.
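The bottom-up and top-down passes can be sketched as below. This is a simplified illustration rather than the patented algorithm: the classes `LocalNode` and `SuperNode`, the rule that a local node's cardinality is capped by the smallest cardinality among its child supernodes, and the layer-by-layer flattening used in place of explicit merging are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class LocalNode:
    name: str
    card: float
    children: list = field(default_factory=list)  # child SuperNode objects


@dataclass
class SuperNode:
    locals_: list
    card: float = 0.0


def update_bottom_up(sn):
    """Propagate cardinalities from the leaves up and order local nodes."""
    for ln in sn.locals_:
        for child in ln.children:
            update_bottom_up(child)
            # Assumption: a join is bounded by its smallest branch.
            ln.card = min(ln.card, child.card)
    sn.locals_.sort(key=lambda ln: ln.card)
    sn.card = sn.locals_[0].card  # supernode takes its smallest local cardinality
    return sn


def flatten_top_down(root):
    """Visit supernodes layer by layer, smallest cardinality first."""
    order = []
    layer = [root]
    while layer:
        layer.sort(key=lambda sn: sn.card)
        next_layer = []
        for sn in layer:
            for ln in sn.locals_:
                order.append(ln.name)  # appended after the parent's sequence
                next_layer.extend(ln.children)
        layer = next_layer
    return order


s1 = SuperNode([LocalNode("C", 5)])
s2 = SuperNode([LocalNode("D", 50)])
tree = SuperNode([LocalNode("A", 100, [s1]), LocalNode("B", 10, [s2])])
update_bottom_up(tree)
join_order = flatten_top_down(tree)
```

In the example, local node A starts with a larger cardinality than B (100 versus 10) but is ordered first, because the statistics of the layer below show that its branch shrinks to 5; the resulting order is A, B, C, D.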
Example 3
The experimental environment for FIGS. 5 and 6 is as follows. The experiments run on a cluster of 6 nodes: 4 machines have 128 GB of memory and an Intel(R) Xeon(R) CPU E5-4603 0 @ 2.00GHz; the other 2 have 64 GB of memory, with an Intel(R) Xeon(R) CPU E5-4607 v2 @ 2.60GHz and an Intel(R) Xeon(R) CPU E5-4607 0 @ 2.20GHz respectively. All experiments run in YARN mode with 20 executors, 2 cores per executor, 15 GB of driver memory, 8 GB of memory per executor, and a parallelism of 120. The method is implemented in Scala and Java.
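For readers who want to reproduce a comparable setup, the reported resource settings can be written down as configuration properties. The sketch below assumes that the engine runs on Apache Spark with YARN as the cluster manager, which the executor and driver vocabulary suggests but the text does not state; the property names are standard Spark settings, while the mapping from the reported values to them and the helper `to_submit_args` are assumptions.

```python
# Assumption: the text reports only the values; the Spark property names
# below are the standard settings that appear to correspond to them.
SPARK_CONF = {
    "spark.master": "yarn",
    "spark.executor.instances": "20",
    "spark.executor.cores": "2",
    "spark.driver.memory": "15g",
    "spark.executor.memory": "8g",
    "spark.default.parallelism": "120",
}


def to_submit_args(conf):
    """Render the settings as a list of spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(SPARK_CONF)
```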
Among the compared systems, S2RDF uses extended vertical partitioning, Sempala uses a unified property table, and FlexStore uses adaptive storage that lies between vertical partitioning and the property table, so it is expected to perform better than both engines. S2RDF and PigSPARQL both translate SPARQL into a relational query language, whereas FlexStore provides richer query decomposition and query optimization. StarMR/optStarMR matches by dividing both the data and the queries into stars, which eliminates NULLs but requires all neighbors of a star root to be matched each time. The experiments use the above engines for comparison.
Referring to fig. 5, the query response times of the present invention and the other methods are compared on the synthetic dataset WatDiv at different scales. The WatDiv queries are grouped into four types for comparison: snowflake (F), linear (L), star (S) and complex (C). The version that does not use the normal distribution, does not separate attributes from relationships, and does not perform query optimization is called uFlexStore, and the version that implements these functions is called FlexStore. The experimental results show that, on synthetic datasets of different scales, the query response time of the present invention is significantly lower than that of the other methods for the different query types.
Referring to fig. 6, the present invention and the other methods are compared on query processing over real RDF data; the results are presented as the speedup in query response time of FlexStore relative to the other methods. As can be seen, the query response time of the present invention is significantly lower than that of the other methods.
FlexStore's speedup over PigSPARQL is most pronounced on q2 and q4. Both q2 and q4 contain a ring, and PigSPARQL has no storage component; it only translates SPARQL into Pig Latin for execution. It follows that FlexStore's query plans handle ring queries better. Among these engines, S2RDF has the shortest response time because the result sets of all four queries are small (<10): even with a large number of joins, its intermediate results become very small after 1-2 joins, so S2RDF has the advantage over the other engines on such highly selective queries. As in the comparison with PigSPARQL, however, FlexStore has the advantage on ring queries. The improvement of FlexStore E (k) over uFlexStore is more pronounced on star queries such as q1 and q4. In summary, FlexStore shows a significant speedup, reaching a speedup of 48.10 over PigSPARQL on q4.
Example 4
A distributed RDF-based data storage and query apparatus, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the above method steps when executing the program.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. A data storage and query method based on distributed RDF, characterized by comprising the following steps:
generating a candidate data pattern set from the original RDF data, and counting the frequency of each predicate and of each data pattern;
mining star patterns by covering, based on the candidate data pattern set; building an index of the star patterns;
dynamically building the storage from the RDF data according to the star patterns to form star pattern tables;
decomposing the SPARQL query statement breadth-first and optimizing it in a reverse-aware manner, using the index and the statistics of the star pattern tables, to obtain an optimal query plan;
converting the query plan into a physical execution process, which is executed on the star pattern table storage.
2. The distributed RDF-based data storage and query method of claim 1, wherein the data pattern set has the following properties:
predicates are repeated across different data patterns;
the RDF subgraphs corresponding to different data patterns in the RDF dataset D do not overlap one another;
the length of a data pattern lies in the range [1, maximum out-degree of a subject in D]; a data pattern is the set of all predicates connected to the same subject.
3. The distributed RDF-based data storage and query method according to claim 1, wherein the star pattern is defined as: all predicates connected to the same subject.
4. The distributed RDF-based data storage and query method according to claim 1, wherein the breadth-first decomposition specifically comprises:
decomposing the query to obtain a sequence of star pattern queries: given a basic graph pattern, initializing the vertex queue of the query graph, and dequeuing vertices in turn to obtain the roots of sub-queries;
extracting the star pattern by means of the index pToSP, constructing the star pattern query, placing the unvisited adjacent vertices in the query graph into the vertex queue, storing the variables contained in the query for obtaining the join variables of the next sub-query, and repeating this process until the vertex queue is empty.
5. The distributed RDF-based data storage and query method according to claim 1, wherein the reverse-aware optimization specifically comprises:
first constructing a subject supernode graph, in which an edge between a local node of a parent node and a child node indicates a connection relationship, each supernode internally maintains an SP list, and the initial cardinalities of the supernodes and local nodes are calculated;
updating the cardinality of each local node from bottom to top, recursively ordering the local nodes inside each supernode, and representing the cardinality of a supernode by its smallest local node cardinality;
selecting the supernodes on the same layer in turn from top to bottom, merging each selected supernode with its parent node, and appending its local node sequence to the end of the parent node's local node sequence.
6. A distributed RDF-based data storage and query apparatus, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method steps of claim 1 when executing the program.
CN201911006105.7A 2019-10-22 2019-10-22 Data storage and query method and device based on distributed RDF Active CN110825738B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911006105.7A CN110825738B (en) 2019-10-22 2019-10-22 Data storage and query method and device based on distributed RDF

Publications (2)

Publication Number Publication Date
CN110825738A true CN110825738A (en) 2020-02-21
CN110825738B CN110825738B (en) 2023-04-25

Family

ID=69550209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911006105.7A Active CN110825738B (en) 2019-10-22 2019-10-22 Data storage and query method and device based on distributed RDF

Country Status (1)

Country Link
CN (1) CN110825738B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722569A (en) * 2012-05-31 2012-10-10 浙江理工大学 Knowledge discovery device based on path migration of RDF (Resource Description Framework) picture and method
US20170098009A1 (en) * 2015-10-02 2017-04-06 Oracle International Corporation Method for faceted visualization of a sparql query result set
CN110059073A (en) * 2019-03-18 2019-07-26 浙江工业大学 Web data automatic visual method based on Subgraph Isomorphism
CN110321999A (en) * 2018-03-30 2019-10-11 北京深鉴智能科技有限公司 Neural computing figure optimization method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOWANG ZHANG, MINGYUE ZHANG,等: "A Scalable Sparse Matrix-Based Join for SPARQL Query Processing" *
冷泳林: "RDF数据分割与索引方法研究" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487015A (en) * 2020-11-30 2021-03-12 河海大学 Distributed RDF system based on incremental repartitioning and query optimization method thereof
CN112487015B (en) * 2020-11-30 2022-10-14 河海大学 Distributed RDF system based on incremental repartitioning and query optimization method thereof
CN113946600A (en) * 2021-10-21 2022-01-18 北京人大金仓信息技术股份有限公司 Data query method, data query device, computer equipment and medium
EP4198763A1 (en) * 2021-12-17 2023-06-21 Dassault Systèmes Optimizing sparql queries in a distributed graph database

Also Published As

Publication number Publication date
CN110825738B (en) 2023-04-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant