CN105630881B

CN105630881B - A data storage method and query method for RDF

Info

Publication number: CN105630881B
Application number: CN201510955821.5A
Authority: CN
Inventors: 袁柳; 张鸿洋; 翟梅
Original assignee: Shaanxi Normal University
Current assignee: Shaanxi Normal University
Priority date: 2015-12-18
Filing date: 2015-12-18
Publication date: 2019-04-09
Anticipated expiration: 2035-12-18
Also published as: CN105630881A

Abstract

The present invention relates to a data storage method and a query method for RDF. A storage structure and a storage mapping for entity-oriented RDF data are designed; the URIs and literals of the RDF data are converted into 64-bit binary data and stored according to the designed storage structure. In the query method, the SPARQL query statement is parsed and converted; the cost of each individual query triple in the SPARQL statement is estimated from the analysis results for the entire data set and the join relationships between the queries, and a minimum-cost query plan is finally generated. The invention greatly increases the speed of comparisons between data and reduces storage space and, compared with the traditional approach of directly converting SPARQL into SQL, significantly improves query efficiency. It can be used in fields such as Web data management and Web semantic retrieval.

Description

A data storage method and query method for RDF

Technical field

The invention belongs to the technical field of Web data management, and in particular relates to an RDF data storage method and query method that reduce the storage space of RDF data and improve the efficiency of SPARQL queries.

Background art

RDF (Resource Description Framework) is a framework proposed by the World Wide Web Consortium (W3C) for describing information on the World Wide Web; it provides an information description standard for the various applications on the Web. RDF describes resources on the Web in the form of triples consisting of a subject S (Subject), a predicate P (Predicate), and an object O (Object). The subject generally uses a Uniform Resource Identifier (URI) to denote an information entity (or concept) on the Web, the predicate describes an attribute of the entity, and the object is the corresponding attribute value. This form of expression allows RDF to represent any identified information on the Web and to exchange it between applications without losing semantic information. RDF has therefore become the standard for describing semantic data and is widely used in metadata description, ontologies, and the Semantic Web. As the volume of Semantic Web data keeps growing, building systems that store and query this data efficiently has become a very important prerequisite for the widespread adoption of Semantic Web applications. Because RDF is the basis for describing Semantic Web data, the efficient storage and querying of RDF data has become a research hotspot for the Semantic Web. At present there are three main approaches to storing and optimizing RDF data.

First, storage based on a relational database

Since RDF data can be regarded as a set of <Subject, Predicate, Object> triples, the most natural approach is to store the data directly in a triple table. Many RDF storage systems based on relational databases therefore use a relational database directly and design a triple table or a similar schema to store the RDF data. The method comprises the following steps: (1) parse the RDF data into triples; (2) hash-encode the URIs in the triples with MD5 (Message Digest Algorithm 5) and take the first 64 bits of the MD5 hash as the resource identifier; (3) store the data in the relational database in a three-column table and build the corresponding indexes. However, when a SPARQL query is executed, this method has to convert the SPARQL query into Structured Query Language (SQL), which requires multiple layers of conversion. Because RDF data and relational data differ greatly, storing RDF data in relational tables also requires mapping operations between tables. Both space utilization and query efficiency are therefore reduced.

Second, storage based on local binary files

RDF documents can be stored in files in a given format; on the Semantic Web, a large number of RDF documents exist in RDF/XML form. RDF data differs considerably from relational data in structure, and its description syntax is much more complex than that of a relational database, but describing resources with RDF offers greater flexibility. Storing RDF documents in files on disk can achieve better storage efficiency while still answering queries quickly. Several disk-based storage designs already exist, and they usually rely on techniques common in databases, such as B-trees, B+ trees, and hash tables. However, file-based storage has a relatively high development cost, and because RDF is the basic description layer for Semantic Web data, supporting query-time reasoning on top of the basic storage structure requires a great deal of additional work.

Third, memory-based storage

With the continuous development of hardware, memory has become cheaper and larger, and building memory-based RDF storage systems has become a research hotspot in recent years. Memory provides very fast access, so data can be operated on in real time and disk I/O overhead is avoided; an RDF storage system with a well-designed in-memory storage structure can further improve the efficiency of querying and analysis. However, this approach is not suitable for storing large-scale RDF data, and existing solutions such as BRAHMS and BitMat do not support direct SPARQL queries. Memory-based RDF storage structures are thus still at the stage of ongoing research and improvement.

Summary of the invention

The object of the invention is to overcome the shortcomings of the prior art described above and to provide, for RDF education resources, an RDF data storage method that speeds up comparisons between data and reduces storage space.

The invention also provides an RDF data query method that matches the above storage method and enables fast querying, so as to improve the retrieval efficiency of RDF education resources.

To achieve the above objects, the technical solution adopted by the invention is as follows:

The RDF data storage method of the invention comprises the following steps:

(1) Design an entity-oriented storage structure for the RDF data

(1.1) In an entity-oriented manner, the data is stored in a relational database table of n rows and k columns, where k is the average number of predicates over all subjects in the RDF data and n is the total number of rows, line, required by all subjects. When the number of predicates sum of a single subject satisfies sum ≤ k, the required number of rows is line = 1; when sum > k, multiple rows are used, and the required number of rows is line = (sum/k) + 1;

(1.2) Once k is determined, each predicate is converted into a column index according to the predicate mapping algorithm, yielding a table structure of n rows and k columns;

The specific method for converting a predicate into a column index in step (1.2) is:

(1.2.1) Compute the column index with the predicate mapping algorithm, whose formula is:

where h₁, h₂, …, h_j are the j hash functions and i is the column index;

(1.2.2) If no free index has been found after all j hash functions have been evaluated, a new row is opened and the data is stored at the index computed by h₁.

(2) Design the storage mapping for the RDF data

A hash algorithm converts the URIs and literals of the RDF data into 64-bit binary data: a URI takes the high 64 bits of the hash value and a literal takes the low 64 bits. The converted binary data is stored in a hash index table whose rows are sorted in ascending order, so that lookups can be mapped and converted quickly with a binary search algorithm;

(3) Store the RDF data

After the RDF data has been mapped and converted according to step (2), it is stored for the first time in the table structure of step (1). The data stored in the table structure is then analyzed and an analysis table S is created, which records the number of triples contained in each Subject and Object, together with the 20 most frequent URIs and the 20 most frequent literals and their frequencies. Then, again following the table structure of step (1) but with Object as the storage entity, the stored data is mapped and converted according to step (2) and stored a second time, which completes the storage of the RDF data.

An RDF data query method matching the above RDF data storage method comprises the following steps:

(a.1) Extraction and conversion of variables

Decompose the triple basic graph patterns in the SPARQL query statement and determine the number of variables in the query statement, count. Convert the URIs and literals in the query statement into 64-bit binary data using the mapping of step (2) of the storage method, and assign the values -1 to -count to the variables;

(a.2) Conversion of basic query graph patterns

Based on the decomposition result of the triple basic graph patterns in step (a.1), convert each basic graph pattern into a triple query node structure, where the triple query node structure is:

Triple query node structure

{

Id of the node;

Id of the subject;

Id of the predicate;

Id of the object;

Storage-mode flag;

}

The storage-mode flag selects either the first storage or the second storage of step (3) of the RDF data storage method;

For URIs and literals, the Ids of the subject, predicate, and object are the respective 64-bit binary data; for variables, the Ids of the subject, predicate, and object are the assigned values;

(a.3) Representation of query join operations

Compare the triples decomposed from the basic graph patterns in step (a.1) with one another. For triples that share a variable, establish a join relationship using the node Id of the structure in step (a.2) as the unique identifier, and convert the join relationship into a join operation edge structure, where the join operation edge structure is:

Join operation edge structure

{

Id of the node of the starting triple,

Id of the node of the ending triple,

Id of the shared variable

};

(a.4) Compute the query cost of each query

Based on the triple query node structures obtained in step (a.2), perform a cost analysis of each join operation edge structure obtained in step (a.3) according to the cost algorithm, obtaining a cost value c for the join operation edge structure. The formula of the cost algorithm is:

TMC(t,m,S)→c

where t is the triple to be queried; m is the first storage or the second storage of step (3) of the RDF data storage method; S is the analysis table;

(a.5) Generation of the query plan

Sort the cost values c of all join operation edge structures obtained in step (a.4) in ascending order to obtain a node sequence ordered by cost value. Select the node with the smallest c in the sequence as the start node, then select the next nodes in the sequence in turn; if the variables in a node have not yet been queried, perform a join query, until the variables in all nodes have been queried, which completes the query of the statement.

After step (a.5), the method further comprises step (a.6), establishing a caching mechanism. Specifically: a hash operation is performed on the set of triple query node structures obtained in step (a.2) for the query statement entered by the user, yielding a hash result value. If the value exists in the cache list, the cached result is taken out directly and returned to the user; otherwise, steps (a.3) to (a.5) are executed, the result is stored on hard disk, and the corresponding address identifier and the hash result value are stored in the cache list.

The RDF data storage method and query method of the invention optimize the storage structure of the data and, on top of that structure, optimize SPARQL queries, providing a way to retrieve and query RDF-based education resources quickly. Compared with the prior art, the invention has the following advantages:

(1) 64-bit binary data replaces the original URIs and literals in storage, which greatly increases the speed of comparisons between data and reduces storage space. URIs and literals take the high 64 bits and the low 64 bits of the hash value respectively, so that a URI and a literal consisting of the same character string can be distinguished. The records of the hash index are stored in sorted order, so that a required record can be located quickly by binary search.

(2) The storage structure of the RDF data is entity-oriented, and the data is stored in two ways at the same time: with the subject (Subject) as the entity and with the object (Object) as the entity. The former enables efficient queries from a subject (Subject) to its predicates (Predicate), avoiding the large number of join operations that conventional storage requires at query time; the latter enables efficient queries from a predicate (Predicate) to the subject (Subject).

(3) The SPARQL query statement is parsed and converted; the cost of each individual query triple in the SPARQL statement is estimated from the analysis results for the entire data set and the join relationships between the queries, and a minimum-cost query plan is finally generated. Compared with the traditional approach of directly converting SPARQL into SQL, query efficiency is significantly improved.

(4) A caching mechanism is added to the query process to cache frequently queried data sets. A cache list is kept in memory, and each row of the cache list contains the hash result value and an address identifier, which improves query efficiency.

(5) The data storage model and query optimization scheme proposed by the invention can be extended to fields such as Web data management and Web semantic retrieval, and even to the storage and retrieval of other RDF resource data.

Brief description of the drawings

Fig. 1 is a schematic diagram of the parsing and conversion of SPARQL in step (a.2) of the embodiment.

Fig. 2 illustrates the generation of the query tree from SPARQL in step (a.3) of the embodiment.

Fig. 3 is a schematic diagram of the cache model in step (a.6) of the embodiment.

Specific embodiment

The invention is further described below with reference to the drawings and an embodiment.

In this embodiment, the RDF data storage method is implemented by the following steps:

(1) Design the storage structure for the RDF data

The storage structure of the RDF data is entity-oriented: the data is stored in a relational database table of n rows and k columns, where k is the average number of predicates over all subjects in the RDF data and n is the total number of rows, line, required by all subjects.

(1.1) Determine the number of columns k and the required number of rows n of the table structure

When the number of predicates (Predicate) sum of a single subject (Subject) satisfies sum ≤ k, the required number of rows is line = 1; when sum > k, multiple tuples (rows) are needed, and the required number of rows is line = (sum/k) + 1;

Take the following data as an example:

(Charles Flint,born,1850)

(Charles Flint,died,1934)

(Charles Flint,founder,IBM)

(Larry Page,born,1973)

(Larry Page,founder,Google)

(Larry Page,board,Google)

(Larry Page,home,Palo Alto)

(Android,developer,Google)

(Android,version,4.1)

(Android,kernel,Linux)

(Android,preceded,4.0)

(Android,graphics,OpenGL)

The storage form is shown in Table 1:

Table 1: Storage table with Subject as the entity
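As a rough illustration of step (1.1), the following Python sketch computes k and the per-subject row counts for the example triples above. The function name `table_shape`, the use of integer division for the average and for sum/k, and counting distinct predicates per subject are assumptions made for this sketch; the source only gives the rule line = (sum/k) + 1.

```python
from collections import defaultdict

# Example triples from the text above (subject, predicate, object).
TRIPLES = [
    ("Charles Flint", "born", "1850"),
    ("Charles Flint", "died", "1934"),
    ("Charles Flint", "founder", "IBM"),
    ("Larry Page", "born", "1973"),
    ("Larry Page", "founder", "Google"),
    ("Larry Page", "board", "Google"),
    ("Larry Page", "home", "Palo Alto"),
    ("Android", "developer", "Google"),
    ("Android", "version", "4.1"),
    ("Android", "kernel", "Linux"),
    ("Android", "preceded", "4.0"),
    ("Android", "graphics", "OpenGL"),
]

def table_shape(triples):
    """Return (k, rows_per_subject, n) following the rule of step (1.1)."""
    predicates = defaultdict(set)
    for subject, predicate, _ in triples:
        predicates[subject].add(predicate)
    counts = {s: len(ps) for s, ps in predicates.items()}
    # k: average predicate count per subject (integer division is assumed).
    k = sum(counts.values()) // len(counts)
    # line = 1 if sum <= k, otherwise (sum / k) + 1.
    rows = {s: 1 if c <= k else c // k + 1 for s, c in counts.items()}
    return k, rows, sum(rows.values())

k, rows, n = table_shape(TRIPLES)
```

For this data the subjects have 3, 4, and 5 predicates, so k is 4 and only Android needs a second row.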

(1.2) Determine the index i at which each predicate (Predicate) is stored

Once k is determined, each predicate is converted into a column index according to the predicate mapping algorithm. When several predicates of the same subject are mapped to the same index, this is called a conflict; several hash functions are defined so as to use the columns as fully as possible and avoid conflicts. If a conflict still exists after all the hash functions have been evaluated, an additional tuple (row) is added for that Subject. The predicate mapping function is:

where h₁, h₂, …, h_j are the j hash functions and i is the column index.

If no free index has been found after all j hash functions have been evaluated, a new row is opened and the data is stored at the index computed by h₁.

With reference to Table 1, consider the triples whose Subject is Android, and assume they are inserted into the database one by one. With j set to 2 there are two hash functions, h₁ and h₂, and the indexes of the predicates are computed as shown in Table 2:

Table 2: Computing the predicate indexes

developer: h₁ gives index 1; index 1 is empty, so the predicate is placed there directly.

version: placed at index 2 in the same way.

kernel: h₁ gives index 1, which is already occupied, i.e. a conflict occurs; h₂ then gives index 3, where the predicate is placed.

preceded: h₁ gives index k, where the predicate is placed.

graphics: h₁ and h₂ give indexes 3 and 2, both of which conflict; a new row is created and the predicate is placed in pred₃.
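The walk-through above can be replayed with a small Python sketch. The hash functions here are stand-in lookup tables chosen to reproduce the indexes named in the text (with k = 4); the h₂ values for developer, version, and preceded are not given in the source and are invented for illustration. The sketch also assumes that only the most recent row of a subject is probed before a new row is opened.

```python
K = 4  # number of columns; "index k" for preceded is therefore 4 (assumed)

# Hypothetical hash functions as lookup tables (illustration only).
H1 = {"developer": 1, "version": 2, "kernel": 1, "preceded": K, "graphics": 3}
H2 = {"developer": 2, "version": 3, "kernel": 3, "preceded": 1, "graphics": 2}
HASHES = [H1.__getitem__, H2.__getitem__]

def place(rows, predicate, value, hashes=HASHES):
    """Place one predicate of a subject; return (row number, column index)."""
    current = rows[-1]
    for h in hashes:
        i = h(predicate)
        if i not in current:          # free column found
            current[i] = (predicate, value)
            return len(rows) - 1, i
    # All j hash functions conflicted: open a new row, use the h1 index.
    i = hashes[0](predicate)
    rows.append({i: (predicate, value)})
    return len(rows) - 1, i

android_rows = [{}]
positions = [
    place(android_rows, p, o)
    for p, o in [("developer", "Google"), ("version", "4.1"),
                 ("kernel", "Linux"), ("preceded", "4.0"),
                 ("graphics", "OpenGL")]
]
```

The first four predicates fill columns 1 to 4 of the first row, and graphics lands in column 3 of a second row, matching the description of pred₃.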

(2) Design the storage mapping for the RDF data

The terms in RDF triples generally fall into two classes: URIs and literals.

A hash algorithm converts URIs and literals into 64-bit binary data: a URI takes the high 64 bits of the hash value and a literal takes the low 64 bits, so that a URI and a literal with the same character string can be distinguished. The converted binary data is stored in a hash index table whose rows are sorted in ascending order, so that lookups can be mapped and converted quickly with a binary search algorithm;
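A minimal Python sketch of this mapping follows. The source does not name the hash algorithm; MD5 is used here only because it yields a 128-bit value that splits naturally into a high and a low 64-bit half. The names `term_id`, `build_index`, and `lookup` are hypothetical.

```python
import hashlib
from bisect import bisect_left

def term_id(term, is_literal):
    """Map a term to a 64-bit id: high half for URIs, low half for literals."""
    digest = hashlib.md5(term.encode("utf-8")).digest()  # 16 bytes (assumed hash)
    half = digest[8:] if is_literal else digest[:8]
    return int.from_bytes(half, "big")

def build_index(terms):
    """Hash index table: (id, term) rows sorted by id in ascending order."""
    return sorted({(term_id(t, lit), t) for t, lit in terms})

def lookup(index, tid):
    """Binary search for an id; return the original term or None."""
    pos = bisect_left(index, (tid,))
    if pos < len(index) and index[pos][0] == tid:
        return index[pos][1]
    return None

index = build_index([("Google", False), ("Google", True), ("IBM", False)])
uri_id = term_id("Google", False)
literal_id = term_id("Google", True)
```

Because the two halves of the digest differ, the URI Google and the literal "Google" receive different ids even though their strings are identical.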

(3) Store the RDF data

After the RDF data has been mapped and converted according to step (2), it is stored for the first time in the table structure of step (1). The data stored in the table structure is then analyzed and an analysis table S is created, which records the number of triples contained in each Subject and Object, together with the 20 most frequent URIs and the 20 most frequent literals and their frequencies. Then, again following the table structure of step (1) but with Object as the storage entity, the stored data is mapped and converted according to step (2) and stored a second time, which completes the storage of the RDF data.
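One possible shape for the analysis table S is sketched below in Python. The dictionary layout, the function name `build_analysis_table`, and the fourth tuple field marking whether an object is a literal are assumptions for illustration; the source only states what S records.

```python
from collections import Counter

def build_analysis_table(triples, top_n=20):
    """Build S from (subject, predicate, object, object_is_literal) tuples."""
    subject_count, object_count = Counter(), Counter()
    uri_freq, literal_freq = Counter(), Counter()
    for s, p, o, o_is_literal in triples:
        subject_count[s] += 1          # triples contained in each Subject
        object_count[o] += 1           # triples contained in each Object
        uri_freq.update([s, p])        # subjects and predicates are URIs
        (literal_freq if o_is_literal else uri_freq)[o] += 1
    return {
        "subject_count": dict(subject_count),
        "object_count": dict(object_count),
        "top_uris": uri_freq.most_common(top_n),
        "top_literals": literal_freq.most_common(top_n),
    }

S = build_analysis_table([
    ("Larry Page", "founder", "Google", False),
    ("Android", "developer", "Google", False),
    ("Larry Page", "born", "1973", True),
])
```

Keeping the per-Subject and per-Object counts as plain lookups makes the later cost estimate a constant-time read.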

For the data in Table 1, the storage form is shown in Table 3:

Table 3: Storage form of the data in Table 1 with Object as the entity

An efficient, fast query method for RDF data stored by the above method is implemented by the following steps:

Take a SPARQL statement containing 6 triple basic graph patterns (Basic Graph Pattern, BGP) as an example. The SPARQL query statement first has to be converted so that the underlying storage can be operated on conveniently; after conversion, the query cost of each triple is estimated, and a lowest-cost execution plan is finally formed. The specific steps are:

(a.1) Extraction and conversion of variables

Decompose the triple basic graph patterns (Basic Graph Pattern, BGP) of the SPARQL query statement and determine the number of variables in the query statement, count. Convert the URIs and literals in the query statement into 64-bit binary data using the mapping and conversion method of step (2) of the above RDF data storage method, and assign the values -1 to -count to the variables contained in the query statement;

Take the following query as an example:

SELECT ?x ?y WHERE {

?x home "Palo Alto" . //q1

?y founder "IBM" . //q2

?z founder "Google" . //q3

?x memberOf ?z . //q4

?z revenue ?y . //q5

?x developer ?y . //q6

}

Parsing the above query statement yields three variables, ?x, ?y, and ?z, which are encoded with the ids -1, -2, and -3; the other URIs and literals are looked up directly in the index table of step (2).
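The variable encoding of step (a.1) can be sketched in Python as follows. The patterns are written out by hand rather than parsed from SPARQL text, and numbering variables in order of first appearance is an assumption that happens to reproduce the ids -1, -2, -3 given above.

```python
# The six triple patterns q1..q6 of the example query, written as tuples.
PATTERNS = [
    ("?x", "home", '"Palo Alto"'),
    ("?y", "founder", '"IBM"'),
    ("?z", "founder", '"Google"'),
    ("?x", "memberOf", "?z"),
    ("?z", "revenue", "?y"),
    ("?x", "developer", "?y"),
]

def encode_variables(patterns):
    """Assign -1 .. -count to the variables in order of first appearance."""
    ids = {}
    for triple in patterns:
        for term in triple:
            if term.startswith("?") and term not in ids:
                ids[term] = -(len(ids) + 1)
    return ids

variable_ids = encode_variables(PATTERNS)
count = len(variable_ids)
```

Negative ids can never collide with the unsigned 64-bit ids of URIs and literals, so a sign check is enough to tell variables from constants.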

(a.2) Conversion of basic query graph patterns

Referring to Fig. 1, based on the decomposition result of the triple basic graph patterns (Basic Graph Pattern, BGP) in step (a.1), convert each basic graph pattern into a triple query node structure, where the triple query node structure is:

Triple query node structure

{

Id of the node;

Id of the subject;

Id of the predicate;

Id of the object;

Storage-mode flag;

}

For URIs and literals, the Ids of the subject, predicate, and object are the respective 64-bit binary data; for variables, the Ids of the subject, predicate, and object are the assigned values;

The storage-mode flag selects either the first storage (access-by-Subject) or the second storage (access-by-Object) of step (3) of the above RDF data storage method. The first storage enables efficient queries from a subject (Subject) to its predicates (Predicate), avoiding the large number of join operations that conventional storage requires at query time; when the subject is unknown, the second storage mode can be chosen for the query.

Before a single triple is queried, the number of variables and the number of constants in each triple, as well as the relationships between the variables and constants of the triples, must first be determined; the query order can be decided from these relationships.
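The triple query node structure could be represented in Python roughly as below. The class and field names are hypothetical, and the rule for choosing the storage mode (access-by-Subject when the subject is a constant, access-by-Object otherwise) is one reading of the text rather than a rule the source states in full.

```python
from dataclasses import dataclass

BY_SUBJECT = "access-by-Subject"  # first storage of step (3)
BY_OBJECT = "access-by-Object"    # second storage of step (3)

@dataclass(frozen=True)
class TripleNode:
    node_id: int
    subject_id: int    # 64-bit id for a constant, negative id for a variable
    predicate_id: int
    object_id: int
    mode: str          # storage-mode flag

def make_node(node_id, subject_id, predicate_id, object_id):
    """Build a node; a negative subject id means the subject is unknown."""
    mode = BY_SUBJECT if subject_id >= 0 else BY_OBJECT
    return TripleNode(node_id, subject_id, predicate_id, object_id, mode)

# q1 (?x home "Palo Alto") with made-up constant ids 11 and 22.
q1 = make_node(1, -1, 11, 22)
# A pattern with a known subject (constant id 5) and an unknown object.
known_subject = make_node(2, 5, 11, -1)
```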

(a.3) Representation of query join operations

Compare all the triples decomposed from the triple basic graph patterns in step (a.1) with one another. For triples that share a variable, establish a join relationship using the node Id of the structure in step (a.2) as the unique identifier, and convert the join relationship into a join operation edge structure, where the join operation edge structure is:

Join operation edge structure

{

Id of the node of the starting triple,

Id of the node of the ending triple,

Id of the shared variable

}

This finally forms the join operation structure shown in Fig. 2.

Through the above conversion and processing of the query statement, the variables are encoded and collected, the basic graph patterns are represented as triples, and the join operations of the query are represented.
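A Python sketch of step (a.3) for the six example patterns is given below. The constant ids (101, 201, and so on) are placeholders rather than real hash values, and representing an edge as a (start node, end node, variable id) tuple is an assumption for illustration.

```python
from itertools import combinations

# node id -> (subject id, predicate id, object id); negative ids are variables
# (?x = -1, ?y = -2, ?z = -3), positive ids are placeholder constants.
NODES = {
    1: (-1, 101, 201),   # ?x home "Palo Alto"
    2: (-2, 102, 202),   # ?y founder "IBM"
    3: (-3, 102, 203),   # ?z founder "Google"
    4: (-1, 103, -3),    # ?x memberOf ?z
    5: (-3, 104, -2),    # ?z revenue ?y
    6: (-1, 105, -2),    # ?x developer ?y
}

def join_edges(nodes):
    """Return one (start node, end node, shared variable) edge per shared variable."""
    edges = []
    for (a, ta), (b, tb) in combinations(sorted(nodes.items()), 2):
        shared = {t for t in ta if t < 0} & {t for t in tb if t < 0}
        for variable in sorted(shared, reverse=True):
            edges.append((a, b, variable))
    return edges

edges = join_edges(NODES)
```

For this query every pair of nodes shares at most one variable, which gives nine edges in total.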

(a.4) Compute the query cost of each query

Based on the triple query node structures obtained in step (a.2), perform a cost analysis of each join operation edge structure obtained in step (a.3) according to a conventional cost algorithm, obtaining a cost value c for the join operation edge structure. The formula of the cost algorithm is:

TMC(t,m,S)→c

where t is the triple to be queried; m is the first storage or the second storage of step (3) of the RDF data storage method; S is the analysis table;

For example:

(?x founder Google)

If access-by-Object is used for this triple, the result of the TMC function is the number of triples recorded in the analysis table S for that Object.
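A minimal Python sketch of TMC(t, m, S) under these definitions is shown below. Only the access-by-Object case is described in the source; the symmetric access-by-Subject branch, the dictionary layout of S, and the sample counts are assumptions.

```python
def tmc(t, m, S):
    """Estimate the cost c of triple t under storage mode m using table S."""
    subject, _predicate, obj = t
    if m == "access-by-Object":
        # Number of triples that S records for this Object.
        return S["object_count"].get(obj, 0)
    # Assumed symmetric case: number of triples recorded for the Subject.
    return S["subject_count"].get(subject, 0)

# Hypothetical analysis table contents.
S = {"subject_count": {"Android": 5}, "object_count": {"Google": 3}}
c = tmc(("?x", "founder", "Google"), "access-by-Object", S)
```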

(a.5) Generation of the query plan

Sort the cost values c of all join operation edge structures obtained in step (a.4) in ascending order to obtain a node sequence ordered by cost value. Select the node with the smallest c in the sequence as the start node, then select the next nodes in the sequence in turn; if the variables in a node have not yet been queried, perform a join query, until the variables in all nodes have been queried, which completes the query of the statement.

With reference to Fig. 2, the query plan first selects the first triple query node as the starting point and then selects the fourth query node in the plan. According to the information provided by the query plan, a join is performed on the variable ?x, giving an intermediate result set over the two variables <?x ?z>. This intermediate result set is then joined with the fifth triple query node on the variable ?z, giving an intermediate table over the three variables <?z ?x ?y>. Proceeding in the same way until all query statements have been executed gives the intermediate table <?z ?x ?y>. Finally, a SELECT operation is applied to the query result to take out the values of the variables ?x and ?y.
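The ordering step can be sketched in Python as follows. The cost values are invented so that the resulting order starts with nodes 1, 4, and 5 as in the walk-through; attaching one cost to each node and executing every node in ascending cost order, joining on whatever variables are already bound, is a simplifying assumption of this sketch.

```python
def build_plan(costs, variables):
    """Return [(node id, variables to join on)] in ascending cost order."""
    bound = set()
    plan = []
    for node in sorted(costs, key=costs.get):
        join_on = sorted(variables[node] & bound)  # empty for the start node
        plan.append((node, join_on))
        bound |= variables[node]
    return plan

# Hypothetical cost per query node and the variables each node contains.
COSTS = {1: 1, 2: 4, 3: 5, 4: 2, 5: 3, 6: 6}
VARIABLES = {
    1: {"?x"}, 2: {"?y"}, 3: {"?z"},
    4: {"?x", "?z"}, 5: {"?z", "?y"}, 6: {"?x", "?y"},
}

plan = build_plan(COSTS, VARIABLES)
```

A purely cost-ordered sequence can place a node before any of its variables are bound; a fuller planner would then prefer the cheapest node that is connected to the current intermediate result.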

(a.6) Establish a caching mechanism

During data querying, a caching mechanism is established to cache query results and so improve query efficiency (see Fig. 3). The specific operation is:

A hash operation is performed on the set of triple query node structures obtained in step (a.2) for the query statement entered by the user, yielding a hash result value. If the value exists in the cache list, the cached result is taken out directly and returned to the user; otherwise, steps (a.3) to (a.5) are executed, the result is stored on hard disk, and the corresponding address identifier and the hash result value are stored in the cache list. When the cache exceeds its configured capacity, the entry with the lowest query frequency is deleted.
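The caching step might look like the following Python sketch. The class name `QueryCache`, the use of SHA-1 over a canonical text form of the node set, and the string addresses are assumptions; the source specifies only that the cache list holds a hash result value and an address identifier and that the least frequently queried entry is removed when capacity is exceeded.

```python
import hashlib
from collections import Counter

class QueryCache:
    """Cache list mapping a hash of the query node set to a result address."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}        # hash result value -> address identifier
        self.frequency = Counter()

    @staticmethod
    def key(nodes):
        # Sort the (subject, predicate, object) ids so order does not matter.
        canonical = repr(sorted(nodes)).encode("utf-8")
        return hashlib.sha1(canonical).hexdigest()

    def get(self, nodes):
        k = self.key(nodes)
        if k in self.entries:
            self.frequency[k] += 1
            return self.entries[k]
        return None              # caller runs steps (a.3) to (a.5) instead

    def put(self, nodes, address):
        k = self.key(nodes)
        if k not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda e: self.frequency[e])
            del self.entries[victim]
            del self.frequency[victim]
        self.entries[k] = address

cache = QueryCache(capacity=2)
query_a = [(-1, 101, 201)]
query_b = [(-2, 102, 202)]
query_c = [(-3, 102, 203)]
cache.put(query_a, "/results/a")
cache.put(query_b, "/results/b")
hit = cache.get(query_a)         # raises the frequency of query_a
cache.put(query_c, "/results/c") # evicts query_b, the least frequently used
```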

Claims

1. An RDF data storage method, characterized by comprising the following steps:

(1) Design an entity-oriented storage structure for the RDF data

(1.1) In an entity-oriented manner, the data is stored in a relational database table of n rows and k columns, where k is the average number of predicates over all subjects in the RDF data and n is the total number of rows, line, required by all subjects; when the number of predicates sum of a single subject satisfies sum ≤ k, the required number of rows is line = 1; when sum > k, multiple rows are used, and the required number of rows is line = (sum/k) + 1;

(1.2) Once k is determined, each predicate is converted into a column index according to the predicate mapping algorithm, yielding a table structure of n rows and k columns, the method for converting a predicate into a column index being:

(1.2.2) If no free index has been found after all j hash functions have been evaluated, a new row is opened and the data is stored at the index computed by h₁;

(2) Design the storage mapping for the RDF data

A hash algorithm converts the URIs and literals of the RDF data into 64-bit binary data: a URI takes the high 64 bits of the hash value and a literal takes the low 64 bits; the converted binary data is stored in a hash index table whose rows are sorted in ascending order, so that lookups can be mapped and converted quickly with a binary search algorithm;

(3) Store the RDF data

After the RDF data has been mapped and converted according to step (2), it is stored for the first time in the table structure of step (1); the data stored in the table structure is analyzed and an analysis table S is created, which records the number of triples contained in each Subject and Object, together with the 20 most frequent URIs and the 20 most frequent literals and their frequencies; then, again following the table structure of step (1) but with Object as the storage entity, the stored data is mapped and converted according to step (2) and stored a second time, which completes the storage of the RDF data.

2. An RDF data query method matching the RDF data storage method of claim 1, characterized by comprising the following steps:

(a.1) Extraction and conversion of variables

Decompose the triple basic graph patterns in the SPARQL query statement and determine the number of variables in the query statement, count; convert the URIs and literals in the query statement into 64-bit binary data using the mapping of step (2) of the storage method, and assign the values -1 to -count to the variables;

(a.2) Conversion of basic query graph patterns

Based on the decomposition result of the triple basic graph patterns in step (a.1), convert each basic graph pattern into a triple query node structure, where the triple query node structure is:

Triple query node structure

For URIs and literals, the Ids of the subject, predicate, and object are the respective 64-bit binary data; for variables, the Ids of the subject, predicate, and object are the assigned values;

(a.3) Representation of query join operations

Compare the triples decomposed from the basic graph patterns in step (a.1) with one another; for triples that share a variable, establish a join relationship using the node Id of the structure in step (a.2) as the unique identifier, and convert the join relationship into a join operation edge structure, where the join operation edge structure is:

Join operation edge structure

(a.4) Compute the query cost of each query

Based on the triple query node structures obtained in step (a.2), perform a cost analysis of each join operation edge structure obtained in step (a.3) according to the cost algorithm, obtaining a cost value c for the join operation edge structure; the formula of the cost algorithm is:

TMC(t,m,S)→c

where t is the triple to be queried; m is the first storage or the second storage of step (3) of the RDF data storage method; S is the analysis table;

(a.5) Generation of the query plan

Sort the cost values c of all join operation edge structures obtained in step (a.4) in ascending order to obtain a node sequence ordered by cost value; select the node with the smallest c in the sequence as the start node, then select the next nodes in the sequence in turn; if the variables in a node have not yet been queried, perform a join query, until the variables in all nodes have been queried, which completes the query of the statement.

3. The RDF data query method according to claim 2, characterized in that step (a.5) is followed by a step (a.6) of establishing a caching mechanism, specifically:

A hash operation is performed on the set of triple query node structures obtained in step (a.2) for the query statement entered by the user, yielding a hash result value; if the value exists in the cache list, the cached result is taken out directly and returned to the user; otherwise, steps (a.3) to (a.5) are executed, the result is stored on hard disk, and the corresponding address identifier and the hash result value are stored in the cache list.