CN104462610A - Distributed RDF storage and query optimization method combined with ontology - Google Patents

Distributed RDF storage and query optimization method combined with ontology

Info

Publication number
CN104462610A
CN104462610A (application CN201510003243.5A)
Authority
CN
China
Prior art keywords
class
file
triple
attribute
triple block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510003243.5A
Other languages
Chinese (zh)
Other versions
CN104462610B (en)
Inventor
汪璟玢
方知立
郑翠春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201510003243.5A priority Critical patent/CN104462610B/en
Publication of CN104462610A publication Critical patent/CN104462610A/en
Application granted granted Critical
Publication of CN104462610B publication Critical patent/CN104462610B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/835 Query processing
    • G06F16/8365 Query optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a distributed RDF storage and query optimization method combined with an ontology, comprising the following steps: S1, segmenting and storing an RDF data file using the IOMSQ algorithm; S2, preprocessing queries over the segmented data files; S3, performing distributed queries on the segmented data files; S4, updating the data, including adding data, deleting data, and modifying data. The IOMSQ algorithm avoids oversized index files and overly frequent Job launches while guaranteeing query efficiency.

Description

Distributed RDF storage and query optimization method combined with ontology
Technical field
The invention belongs to the technical field of massive RDF data management, and specifically relates to a distributed RDF storage and query optimization method combined with an ontology.
Background technology
At present much work addresses the distributed storage and querying of massive RDF data. Distributed RDF storage mainly comprises two schemes, HDFS and HBase, and researchers have proposed distributed RDF query strategies tailored to the characteristics of each storage mode. For example: (1) SHARD hashes triples onto the nodes of a cluster, so that triples that are logically close are distributed to different physical nodes; executing a complex SPARQL query then transmits large amounts of data over the network, which necessarily degrades query performance to some extent. (2) Hadoop is combined with RDF-3X [5]: the RDF data is scattered across the RDF-3X databases in the cluster, and closely related triples are stored on the same node, reducing network transmission latency at query time. (3) An RDF storage model on HBase uses the join operations of MapReduce to process SPARQL queries, writing the query data to HDFS as the data source of the MapReduce jobs. (4) Query methods using hybrid partitioning reduce the number of MapReduce rounds and thereby reduce frequent I/O communication. (5) SPARQL queries are optimized on the basis of existing views. (6) Data is preprocessed according to Predicate Lead, and a Job Partitioner algorithm generates multiple MapReduce tasks to perform the join processing of the query and obtain the result. (7) Multiple join filters substitute for traditional SQL joins, and different join plans are then merged in a workflow. (8) Greedy MapReduce task generation algorithms let multiple MapReduce tasks iteratively process SPARQL BGP join operations.
None of the above distributed RDF query algorithms takes the ontology file of the RDF data into account; at the same time, most existing distributed RDF query optimizations must launch one MapReduce Job per query statement and then iterate over the query results. Although approach (7) reduces the number of query statements through multiple join filters, it is inapplicable when the query statements are all multi-variable. Since starting a Job is time-consuming, the IOMSQ algorithm is proposed to address oversized index files and excessive Job launches while guaranteeing query efficiency.
Summary of the invention
In view of this, the object of the invention is to provide a distributed RDF storage and query optimization method combined with an ontology. The IOMSQ algorithm is adopted to address oversized index files and excessive Job launches while guaranteeing query efficiency.
The invention is implemented by the following scheme: a distributed RDF storage and query optimization method combined with an ontology, comprising the following steps:
Step S1: segment and store the RDF data file using the IOMSQ algorithm;
Step S2: perform query preprocessing on the segmented data files;
Step S3: perform distributed queries on the segmented data files;
Step S4: update the data, where the data update comprises adding data, deleting data, and modifying data.
Further, the segmentation and storage phase in step S1 comprises the following steps:
Step S11: obtain all classes in the RDF ontology file, and create one file named after each class;
Step S12: obtain all subjects in the RDF data file, and for each subject obtain the triple containing its type property and the class to which the subject belongs;
Step S13: store all triples of the subject in the file of the class to which the subject belongs, and mark the end of a subject with a special marker;
Step S14: obtain the size of each class-named file, and store each filename with its corresponding file size;
Step S15: obtain all properties in the RDF data file and the domain of each property, and build the two-dimensional relational model of properties and classes.
Further, the query preprocessing phase in step S2 operates on SPARQL query statements and specifically comprises the following steps:
Step S21: divide the query statements into N query statement blocks according to their subjects;
Step S22: determine the set of class-named files that each query statement block must query;
Step S23: determine the query order of the statement blocks.
Further, the distributed query phase in step S3 comprises one Map phase and one Reduce phase;
The Map phase queries the multiple statements of one statement block, places all triples of the same subject on the same split for querying, and checks whether the retrieved data matches the triple patterns; if so, the Reduce phase is entered;
The Reduce phase takes the output of all Map phases as input and writes it to a file named after the key of the Reduce phase.
Further, adding data in step S4 specifically comprises the following steps:
Step S411: divide all triples to be added into multiple triple blocks according to their subjects;
Step S412: for each distinct triple block, first check whether the block contains a triple whose property is type; if so, the block must be added to the class file set Class1 to Classt corresponding to the object of that triple, and the method proceeds to step S414; if no triple with property type is contained, proceed to step S413;
Step S413: look up each property P1 to Pn contained in the triple block in the two-dimensional relational model of properties and classes, and determine the class file set Class1 to Classt corresponding to the properties;
Step S414: search the class files; if a triple in the class file has a subject consistent with the subject of this triple block, append the block to the segment of that subject; if no consistent subject exists, append the triple block at the end of the class file, separated by the special marker.
Further, deleting data in step S4 specifically comprises the following steps:
Step S421: divide all triples to be deleted into multiple triple blocks according to their subjects;
Step S422: for each distinct triple block, first check whether the block contains a triple whose property is type; if so, the triples in the class file set Class1 to Classt corresponding to the object of that triple must be modified, and the method proceeds to step S424; if no triple with property type is contained, proceed to step S423;
Step S423: look up each property P1 to Pn contained in the triple block in the two-dimensional relational model of properties and classes, and determine the class file set Class1 to Classt corresponding to the properties;
Step S424: search the class files, and delete the triple block identical to this triple block from the class files.
Further, modifying data in step S4 specifically comprises the following steps:
Step S431: divide all pre-modification triples and post-modification triples into multiple triple blocks according to their subjects;
Step S432: for each distinct triple block of the pre-modification triples, first check whether the block contains a triple whose property is type; if so, the triples in the class file set Class1 to Classt corresponding to the object of that triple must be modified, and the method proceeds to step S434; if no triple with property type is contained, proceed to step S433;
Step S433: look up each property P1 to Pn contained in the triple block in the two-dimensional relational model of properties and classes, and determine the class file set Class1 to Classt corresponding to the properties;
Step S434: search the class files, and delete the triple block identical to this triple block from the class files;
Step S435: apply the add-data operation to the post-modification triples.
Compared with the prior art, the invention has three outstanding advantages: 1. it reduces the number of MapReduce tasks launched; 2. it reduces data redundancy; 3. it improves query efficiency.
Brief description of the drawings
Fig. 1 is the step flowchart of the invention.
Fig. 2 is the framework diagram of the IOMSQ algorithm of the invention.
Fig. 3 is the flowchart of the data segmentation and storage phase of the IOMSQ algorithm of the invention.
Fig. 4 shows a data segmentation result of the invention.
Fig. 5 is the two-dimensional relational model of properties and classes of the invention.
Fig. 6 shows the concrete query processing flow of the IOMSQ algorithm of the invention.
Fig. 7 is the LUBM Q7 query statement of the invention.
Fig. 8 is the LUBM Q8 query statement of the invention.
Detailed description
The invention is further described below in conjunction with the drawings and embodiments.
The present embodiment provides a distributed RDF storage and query optimization method combined with an ontology which, as shown in Fig. 1, comprises the following steps:
Step S1: segment and store the RDF data file using the IOMSQ algorithm;
Step S2: perform query preprocessing on the segmented data files;
Step S3: perform distributed queries on the segmented data files;
Step S4: update the data, where the data update comprises adding data, deleting data, and modifying data.
In the present embodiment, the framework of the IOMSQ algorithm is shown in Fig. 2. The IOMSQ algorithm combines the ontology file to segment and store a large data file by class. Building on this storage scheme, the query algorithm improves query efficiency by reducing the number of Jobs launched per query and by narrowing the query scope of each Job; likewise, the algorithm supports regular data updates well. Two helper functions are defined: Write(FileName, Data) writes Data into the HDFS file named FileName; Write(FileName, Data, Data2) writes Data and Data2 into the HDFS file named FileName.
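As a minimal sketch, the Write helpers could be realized on top of the Hadoop FileSystem API as follows; everything beyond the Hadoop calls (the class name, the newline/tab framing) is an assumption for illustration, not the patent's actual implementation:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriter {
    private final FileSystem fs;

    public HdfsWriter(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
    }

    // Write(FileName, Data): append Data to the HDFS file named fileName,
    // creating the file on first use.
    public void write(String fileName, String data) throws IOException {
        Path path = new Path(fileName);
        try (FSDataOutputStream out =
                 fs.exists(path) ? fs.append(path) : fs.create(path)) {
            out.write((data + "\n").getBytes(StandardCharsets.UTF_8));
        }
    }

    // Write(FileName, Data, Data2): write both values into the same file.
    public void write(String fileName, String data, String data2) throws IOException {
        write(fileName, data + "\t" + data2);
    }
}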
Preferably, to reduce the number of query Jobs, the algorithm analyzes the data in conjunction with its ontology file, segments and stores the RDF data by the class of the subject, and derives the two-dimensional relational model of properties and classes from the ontology file. The flowchart of the data segmentation and storage phase is shown in Fig. 3.
In particular, the segmentation and storage phase in step S1 comprises the following steps:
Step S11: obtain all classes in the RDF ontology file, and create one file named after each class;
Step S12: obtain all subjects in the RDF data file, and for each subject obtain the triple containing its type property and the class to which the subject belongs;
Step S13: store all triples of the subject in the file of the class to which the subject belongs, and mark the end of a subject with a special marker;
Step S14: obtain the size of each class-named file, and store each filename with its corresponding file size;
Step S15: obtain all properties in the RDF data file and the domain of each property, and build the two-dimensional relational model of properties and classes.
To better illustrate the segmentation and storage steps, the present embodiment walks through the segmentation and storage of a data set.
First, the IOMSQ data segmentation algorithm is applied. Input: the ontology file and the raw data file of the data set; output: the data files segmented by class. The algorithm is as follows:
Begin
1. Read ontology by jena // read the ontology file with Jena
2. List ClassList = model.listClasses() // obtain all classes defined in the ontology
3. For each class(i) in ClassList do
4.     FileSystem.create(class(i).toString) // create the HDFS file named after the class
5. List sub_list = model.listSubjects() // obtain all subjects in the data file
6. For each sub(i) in sub_list do
7.     statement(i) = sub(i).listProperties(RDF.type) // obtain the triple containing the type property of this subject
8.     class(i) = statement(i).getObject() // obtain the class class(i) to which this subject belongs
9.     List statements_list = model.listStatements(sub(i), null, (RDFNode) null) // obtain all triples of this subject
10.    Write(class(i), statements_list) // store all triples whose subject is sub(i) in the HDFS file of class(i)
11.    Write(class(i), "END") // mark the end of a subject with the special marker (END)
End
Taking the LUBM data set above as an example, reading through all the data files yields 44 class files. Each class file contains all triples whose subjects belong to that class, the content of each subject occupies one contiguous segment, and the segments are separated by the special marker (END). Part of the data format of the class file GraduateStudent is shown in Fig. 4.
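For illustration, a class file such as GraduateStudent might look like the following (the subjects and objects are hypothetical, abbreviated examples rather than actual LUBM data):

<GraduateStudent1>, type, <GraduateStudent>
<GraduateStudent1>, takesCourse, <Course7>
<GraduateStudent1>, name, GraduateStudent1
END
<GraduateStudent2>, type, <GraduateStudent>
<GraduateStudent2>, advisor, <FullProfessor0>
END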
After the data has been read, the size of each data file is obtained through FileSystem.getLength(Path) of the distributed file system, and each filename is stored with its corresponding file size for use in generating the join plan.
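A minimal sketch of collecting these sizes (in current Hadoop APIs the length is usually obtained via getFileStatus rather than the deprecated getLength; the class and directory names are assumptions):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClassFileSizes {
    // Map each class file name to its size in bytes; used later to order the
    // query statement blocks.
    public static Map<String, Long> collect(Configuration conf, String classDir) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Map<String, Long> sizes = new HashMap<>();
        for (FileStatus status : fs.listStatus(new Path(classDir))) {
            sizes.put(status.getPath().getName(), status.getLen());
        }
        return sizes;
    }
}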
The IOMSQ data segmentation algorithm is also used to build the two-dimensional relational model of properties and classes. Input: the ontology file; output: the two-dimensional relational model of properties and classes. The algorithm is as follows:
Begin
1. Read ontology by jena // read the ontology file with Jena
2. List P_list = ontModel.listAllOntProperties() // obtain all properties defined in the ontology
3. For each Pi in P_list do
4.     List Si_list = Pi.listDomain() // obtain the domain of Pi
5.     Write(modelFile, Pi, Si_list) // store Pi and its corresponding domain in the two-dimensional relational model
End
The two-dimensional relational model file of properties and classes can thus be constructed: the first dimension is P1, P2, P3, ..., Pn, and the second dimension is the domain corresponding to each Pi, as shown in Fig. 5. This file is mainly used to determine the query scope of any query statement and to generate the join plan.
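In memory this model amounts to a mapping from each property to its set of domain classes; a minimal sketch (the class and method names are assumptions for illustration):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class PropertyDomainModel {
    // First dimension: the property; second dimension: the set of domain
    // classes of that property.
    private final Map<String, Set<String>> domains = new HashMap<>();

    public void put(String property, Set<String> domainClasses) {
        domains.put(property, domainClasses);
    }

    // Candidate class files for a triple pattern with the given predicate.
    public Set<String> domainOf(String property) {
        return domains.getOrDefault(property, Collections.emptySet());
    }
}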
In the present embodiment, preferably, the query preprocessing phase in step S2 operates on SPARQL query statements and specifically comprises the following steps:
Step S21: divide the query statements into N query statement blocks according to their subjects;
Step S22: determine the set of class-named files that each query statement block must query;
Step S23: determine the query order of the statement blocks.
The query preprocessing phase improves the efficiency of the SPARQL query in two respects: first, it narrows the data range of the query; second, it reduces the number of Jobs needed for the distributed query. To this end the segmentation result is combined with the input query statements to generate several query statement blocks. Each query statement block contains several query statements and is queried in one Job, and the result of that Job is carried into the next query statement block for a join query; this reduces the number of distributed query tasks and improves the efficiency of each distributed query. The concrete query processing flow of the IOMSQ algorithm is shown in Fig. 6.
Since the segmentation phase partitions the data by the class of the subject and places all content related to one subject on the same split, the statements with the same subject can be placed in one query statement block and queried simultaneously when the join plan is generated. Because the whole query solution adopts iterative Job tasks, the intermediate result set of the statement block queried first must be as small as possible to improve query efficiency. Two artifacts saved earlier are therefore used: one is the two-dimensional relational model of properties and classes, and the other is the size of each class file. The two-dimensional relational model of properties and classes determines the scope of the input files queried for each query statement block, so that the whole data set need not be queried. The class file sizes determine the execution order of the query statement blocks, so that the result set of the block queried first is as small as possible.
First, the query statements are divided into multiple query statement blocks according to their subjects. For example, the Q7 query of the RDF benchmark data set LUBM is shown in Fig. 7 and the Q8 query in Fig. 8. According to the IOMSQ algorithm, Q7 is divided into three query statement blocks {1,3}, {2}, and {4}, and Q8 is divided into two query statement blocks {1,3,5} and {2,4}.
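A minimal sketch of this grouping step, treating each triple pattern as a (subject, predicate, object) record and grouping by subject (the TriplePattern type and class name are assumptions):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QueryBlockSplitter {
    public record TriplePattern(String subject, String predicate, String object) {}

    // Group the triple patterns of a SPARQL basic graph pattern by subject;
    // each group becomes one query statement block answered in a single Job.
    public static Map<String, List<TriplePattern>> split(List<TriplePattern> patterns) {
        Map<String, List<TriplePattern>> blocks = new LinkedHashMap<>();
        for (TriplePattern p : patterns) {
            blocks.computeIfAbsent(p.subject(), k -> new ArrayList<>()).add(p);
        }
        return blocks;
    }
}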
Secondly, the set of class files that each query statement block must query is determined. If a query statement block contains a statement with the type property, all subclasses class_list_type of the object of that statement are obtained; through the two-dimensional relational model of properties and classes saved in the segmentation phase, the classes class_list_property of all properties in the block other than type are obtained. If class_list_type is not empty, the intersection of class_list_type and class_list_property is output; if class_list_type is empty, class_list_property is output.
IOMSQ algorithm for determining the input file set of a query statement block. Input: all query statement blocks; output: the set of class files each block must query. The algorithm is as follows:
Begin:
Set class_list_final = null;
Set class_list_type = null;
Set class_list_property = null;
/* For each query statement in each query statement block: if the statement contains RDF.type, obtain the subclasses of its object; otherwise obtain the domain of its predicate. */
1. For each queryBlock(i) in queryBlockList do
2.     For statement(j) in queryBlock(i)
3.         If (statement.contain(RDF.type))
               class_list_type = statement.getObject().listSubClasses();
4.         else
               class_list_property ∪= statement.getProperty().listDomain();
5.     If (class_list_type != null && class_list_property != null)
           class_list_final = class_list_property ∩ class_list_type;
6.     else if (class_list_property != null)
           class_list_final = class_list_property;
7.     else if (class_list_type != null)
           class_list_final = class_list_type;
End
In the LUBM Q7 query, statement block {1,3} contains type, so all subclasses of ub:Student are obtained: {UndergraduateStudent, GraduateStudent, Student}. The domain set obtained from the ub:takesCourse property is {UndergraduateStudent, GraduateStudent, Student}. The intersection of the class sets of these statements is {UndergraduateStudent, GraduateStudent, Student}, so the RDF data contained in these 3 class files serves as the input files of this query statement block. Statement block {2} contains only type, so all subclasses of ub:Course are obtained: {UndergraduateStudent, GraduateStudent, Student, GraduateCourse, Course}. Statement block {4} does not contain type, so the domain of the ub:teacherOf property is obtained: {Professor, PostDoc, Lecturer, VisitingProfessor, AssociateProfessor, Chair, FullProfessor, AssistantProfessor, Dean, Faculty}.
In the LUBM Q8 query, statement block {1,3,5} contains type, so all subclasses of ub:Student are obtained: {UndergraduateStudent, GraduateStudent, Student}. The domain obtained from the ub:memberOf property is {Chair, Director, Employee, Student, TeachingAssistant, GraduateStudent, ResearchAssistant, PostDoc, ClericalStaff, VisitingProfessor, AdministrativeStaff, UndergraduateStudent, AssociateProfessor, Lecturer, Professor, FullProfessor, AssistantProfessor, Dean, SystemsStaff, Faculty, Person}, and the domain obtained from the ub:emailAddress property is the same set. The union of the domains obtained from the two properties is intersected with the subclasses obtained from type, yielding {UndergraduateStudent, GraduateStudent, Student}.
Finally, the query order of the statement blocks is determined. Under the additional premise that single-variable statement blocks are queried first, the class file set size of each query statement block is computed from the stored sizes of the class files, and the blocks are then sorted; the sorted order is the query order of the query statement blocks.
The query order obtained for Q7 is therefore {4} -> {2} -> {1,3}, and the query order of Q8 is {2,4} -> {1,3,5}.
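A minimal sketch of this ordering step, reusing the class file sizes collected in the segmentation phase and the per-block class file sets computed above (the names are assumptions):

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class BlockOrdering {
    // Sort the query statement blocks by the total size of their input class
    // files, so the block with the smallest input, and hence typically the
    // smallest intermediate result, runs first.
    public static List<String> order(Map<String, Set<String>> blockToClassFiles,
                                     Map<String, Long> fileSizes) {
        return blockToClassFiles.entrySet().stream()
                .sorted(Comparator.comparingLong((Map.Entry<String, Set<String>> e) ->
                        e.getValue().stream()
                                .mapToLong(f -> fileSizes.getOrDefault(f, 0L))
                                .sum()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}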
In the present embodiment, further, the distributed query phase in step S3 comprises one Map phase and one Reduce phase;
The Map phase queries the multiple statements of one statement block, places all triples of the same subject on the same split for querying, and checks whether the retrieved data matches the triple patterns; if so, the Reduce phase is entered;
The Reduce phase takes the output of all Map phases as input and writes it to a file named after the key of the Reduce phase.
Because the IOMSQ algorithm passes the output file of the previous MapReduce Job to the next Job as a parameter, and the size of that file is likely to exceed the limits of ordinary MapReduce parameter passing, the distributed cache mechanism of MapReduce is used here, i.e. DistributedCache.getLocalCacheFiles(Configuration). Since parameter passing costs a certain amount of time, files that are as small as possible should be used as the distributed cache.
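A minimal sketch of wiring one Job's output into the next Job's distributed cache with the classic DistributedCache API (the paths and class name are hypothetical):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class JobChaining {
    // Register the previous Job's output file so that every Mapper of the next
    // Job can read it locally in its configure/setup method via
    // DistributedCache.getLocalCacheFiles(conf).
    public static void addPreviousOutputAsCache(Configuration conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/iomsq/q7/job1/part-r-00000"), conf);
    }
}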
To query the multiple statements of one statement block in a single Map simultaneously, all triples of the same subject must be placed on the same split. The InputFormat and RecordReader methods of MapReduce are therefore rewritten here so that the Split method produces a split each time it reads the special marker "END"; together with the segmentation storage strategy, this keeps all triples of one subject on the same split, i.e. the input of each Map is exactly all triples related to one subject. In addition, to let Reduce finally write the output content to a file named after the key, the OutputFormat and LineRecordWriter methods are rewritten here.
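A simplified stand-in for such a record reader is sketched below: it groups lines up to the END marker into one record, so that one map() call sees all triples of one subject. It only groups records within a split and does not adjust the split boundaries themselves, so it is an assumption-laden illustration rather than the patent's actual implementation:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class SubjectBlockRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lines = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lines.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // Accumulate lines until the special END marker; the accumulated block
        // holds all triples of one subject.
        StringBuilder block = new StringBuilder();
        boolean any = false;
        long firstOffset = 0;
        while (lines.nextKeyValue()) {
            if (!any) firstOffset = lines.getCurrentKey().get();
            String line = lines.getCurrentValue().toString();
            if ("END".equals(line.trim())) {
                if (any) break;    // end of a subject block
                continue;          // skip leading or repeated markers
            }
            if (any) block.append('\n');
            block.append(line);
            any = true;
        }
        if (!any) return false;
        key.set(firstOffset);
        value.set(block.toString());
        return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException { return lines.getProgress(); }
    @Override public void close() throws IOException { lines.close(); }
}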
The Map process and the Reduce process of the IOMSQ algorithm are given below.
In the Map process of the IOMSQ algorithm, Map: (k1, v1) -> [(k2, v2)]. The algorithm is as follows:
List<Map> inputfileMap;
Function configure(JobConf conf) {
    If (DistributedCache != null) {
        // obtain the content of the distributed cache file, e.g. data (y_z): o1,o2; k2,k3; b2,b3
        inputfileMap = ReadDistributedCache;
    }
}
Function map(k1, v1, context) {
    String data = v1; // e.g. data: s1,p1,o1; s1,p2,o2; s1,p2,o3
    For each (triple pattern in the query) {
        // query example:
        // 0: ?x p1 ?y. 1: ?x p2 ?z. 2: ?x p2 o3
        If (data does not match any triple pattern)
            return no result;
    }
    If (data matches all triple patterns) {
        // e.g. data: x_y_z: s1,o1,o2
        // match the values of the relevant variables against the values in the distributed cache
        If (relevant variables satisfy the values in inputfileMap) {
            // all variables of the query statement serve as the key, and their corresponding values as the value
            key = all variables in the query; // example: x_y_z
            value = values; // example: s1,o1,o2
            output(key, value);
        }
    }
}
In the Reduce process of the IOMSQ algorithm, Reduce: (k2, values2) -> [(k2, values2)]. The algorithm is as follows:
For each (value in values2) {
    Output(k2, value);
}
To better illustrate the invention, the distributed query case of LUBM Q7 is next used to analyze the actual running process of the implementation method of the invention.
According to the join plan of Q7, {4} -> {2} -> {1,3}, and the corresponding input files, the following is obtained:
MapReduce Job1:
Input files: Professor, PostDoc, Lecturer, VisitingProfessor, AssociateProfessor, Chair, FullProfessor, AssistantProfessor, Dean, Faculty
The input of one Map of statement block {4} is as follows:
<FullProfessor2>, undergraduateDegreeFrom, http://www.University862.edu
<FullProfessor2>, emailAddress, FullProfessor2@Department0.University0.edu
<FullProfessor2>, worksFor, http://www.Department0.University0.edu
<FullProfessor2>, teacherOf, <Course3>
<FullProfessor2>, telephone, xxx-xxx-xxxx
<FullProfessor2>, doctoralDegreeFrom, http://www.University653.edu
<FullProfessor2>, teacherOf, <Course2>
<FullProfessor2>, researchInterest, Research28
<FullProfessor2>, mastersDegreeFrom, http://www.University125.edu
<FullProfessor2>, teacherOf, <GraduateCourse3>
<FullProfessor2>, name, FullProfessor2
For the above Map, the output of the statement block {4} Map is as follows:
key:Y
Value:<Course3>
<Course2>
<GraduateCourse3>
Since the output of all Maps serves as the input of Reduce, the values are then written to the file named after the key Y.
MapReduce Job2:
Input files: UndergraduateStudent, GraduateStudent, Student, GraduateCourse, Course
The output of Job1 serves as the distributed cache.
The input of one Map of statement block {2} is as follows:
<Course3>, type, <Course>
<Course3>, name, Course3
On the premise of satisfying statement block {2}, the input of the Map is compared with the data in the distributed cache, and only the matching entries are output.
For the above Map, the output of the statement block {2} Map is as follows:
key:Y
Value:<Course3>
The output of all Maps serves as the input of Reduce and is then written to the file named after the key Y.
MapReduce Job3:
Input files: UndergraduateStudent, GraduateStudent, Student.
The output of Job2 serves as the distributed cache.
The input of one Map of statement block {1,3} is as follows:
<UndergraduateStudent170>, takesCourse, <Course50>
<UndergraduateStudent170>, takesCourse, <Course3>
<UndergraduateStudent170>, takesCourse, <Course8>
<UndergraduateStudent170>, telephone, xxx-xxx-xxxx
<UndergraduateStudent170>, emailAddress, UndergraduateStudent170@Department0.University0.edu
<UndergraduateStudent170>, memberOf, http://www.Department0.University0.edu
<UndergraduateStudent170>, name, UndergraduateStudent170
On the premise of satisfying statement block {1,3}, the input of the Map is compared with the data in the distributed cache, and only the matching entries are output.
For the above Map, the output of the statement block {1,3} Map is as follows:
key:X_Y
Value:<UndergraduateStudent170>,<Course3>
The output of all Maps serves as the input of Reduce and is then written to the file named after the key X_Y.
The final result of the query is the content of the X_Y file.
In the present embodiment, in particular, the IOMSQ algorithm segments the data by the class of the subject, which makes data updates especially convenient.
Further, adding data in step S4 specifically comprises the following steps:
Step S411: divide all triples to be added into multiple triple blocks according to their subjects;
Step S412: for each distinct triple block, first check whether the block contains a triple whose property is type; if so, the block must be added to the class file set Class1 to Classt corresponding to the object of that triple, and the method proceeds to step S414; if no triple with property type is contained, proceed to step S413;
Step S413: look up each property P1 to Pn contained in the triple block in the two-dimensional relational model of properties and classes, and determine the class file set Class1 to Classt corresponding to the properties;
Step S414: search the class files; if a triple in the class file has a subject consistent with the subject of this triple block, append the block to the segment of that subject; if no consistent subject exists, append the triple block at the end of the class file, separated by the special marker.
Further, deleting data in step S4 specifically comprises the following steps:
Step S421: divide all triples to be deleted into multiple triple blocks according to their subjects;
Step S422: for each distinct triple block, first check whether the block contains a triple whose property is type; if so, the triples in the class file set Class1 to Classt corresponding to the object of that triple must be modified, and the method proceeds to step S424; if no triple with property type is contained, proceed to step S423;
Step S423: look up each property P1 to Pn contained in the triple block in the two-dimensional relational model of properties and classes, and determine the class file set Class1 to Classt corresponding to the properties;
Step S424: search the class files, and delete the triple block identical to this triple block from the class files.
Further, modifying data in step S4 specifically comprises the following steps:
Step S431: divide all pre-modification triples and post-modification triples into multiple triple blocks according to their subjects;
Step S432: for each distinct triple block of the pre-modification triples, first check whether the block contains a triple whose property is type; if so, the triples in the class file set Class1 to Classt corresponding to the object of that triple must be modified, and the method proceeds to step S434; if no triple with property type is contained, proceed to step S433;
Step S433: look up each property P1 to Pn contained in the triple block in the two-dimensional relational model of properties and classes, and determine the class file set Class1 to Classt corresponding to the properties;
Step S434: search the class files, and delete the triple block identical to this triple block from the class files;
Step S435: apply the add-data operation to the post-modification triples.
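A minimal sketch of the add-data update (steps S411 to S414), reusing the PropertyDomainModel sketch above; deleting is analogous with the append replaced by a removal, and modifying is a delete followed by an add. The Triple record and the ClassFileStore interface are assumptions for illustration:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DataUpdater {
    public record Triple(String subject, String predicate, String object) {}

    // Assumed wrapper around the HDFS class files produced by the segmentation phase.
    public interface ClassFileStore {
        void appendToSubjectSegment(String classFile, String subject, List<Triple> triples);
    }

    private final PropertyDomainModel model;
    private final ClassFileStore store;

    public DataUpdater(PropertyDomainModel model, ClassFileStore store) {
        this.model = model;
        this.store = store;
    }

    public void add(List<Triple> newTriples) {
        // S411: divide the new triples into triple blocks by subject.
        Map<String, List<Triple>> blocks = new LinkedHashMap<>();
        for (Triple t : newTriples) {
            blocks.computeIfAbsent(t.subject(), k -> new ArrayList<>()).add(t);
        }
        for (Map.Entry<String, List<Triple>> e : blocks.entrySet()) {
            // S412: if the block contains a type triple, its object names the class files.
            Set<String> classFiles = new HashSet<>();
            for (Triple t : e.getValue()) {
                if ("type".equals(t.predicate())) {
                    classFiles.add(t.object());
                }
            }
            // S413: otherwise look up the domains of the block's properties.
            if (classFiles.isEmpty()) {
                for (Triple t : e.getValue()) {
                    classFiles.addAll(model.domainOf(t.predicate()));
                }
            }
            // S414: append the block to the subject's segment in each class file.
            for (String classFile : classFiles) {
                store.appendToSubjectSegment(classFile, e.getKey(), e.getValue());
            }
        }
    }
}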
The above is only a preferred embodiment of the invention; all equivalent changes and modifications made within the scope of the claims of the present application shall fall within the scope of protection of the invention.

Claims (7)

1. A distributed RDF storage and query optimization method combined with an ontology, characterized by comprising the following steps:
Step S1: segment and store the RDF data file using the IOMSQ algorithm;
Step S2: perform query preprocessing on the segmented data files;
Step S3: perform distributed queries on the segmented data files;
Step S4: update the data, where the data update comprises adding data, deleting data, and modifying data.
2. The distributed RDF storage and query optimization method combined with an ontology according to claim 1, characterized in that the segmentation and storage phase in step S1 comprises the following steps:
Step S11: obtain all classes in the RDF ontology file, and create one file named after each class;
Step S12: obtain all subjects in the RDF data file, and for each subject obtain the triple containing its type property and the class to which the subject belongs;
Step S13: store all triples of the subject in the file of the class to which the subject belongs, and mark the end of a subject with a special marker;
Step S14: obtain the size of each class-named file, and store each filename with its corresponding file size;
Step S15: obtain all properties in the RDF data file and the domain of each property, and build the two-dimensional relational model of properties and classes.
3. The distributed RDF storage and query optimization method combined with an ontology according to claim 1, characterized in that the query preprocessing phase in step S2 operates on SPARQL query statements and specifically comprises the following steps:
Step S21: divide the query statements into N query statement blocks according to their subjects;
Step S22: determine the set of class-named files that each query statement block must query;
Step S23: determine the query order of the statement blocks.
4. The distributed RDF storage and query optimization method combined with an ontology according to claim 1, characterized in that the distributed query phase in step S3 comprises one Map phase and one Reduce phase;
The Map phase queries the multiple statements of one statement block, places all triples of the same subject on the same split for querying, and checks whether the retrieved data matches the triple patterns; if so, the Reduce phase is entered;
The Reduce phase takes the output of all Map phases as input and writes it to a file named after the key of the Reduce phase.
5. The distributed RDF storage and query optimization method combined with an ontology according to claim 1, characterized in that adding data in step S4 specifically comprises the following steps:
Step S411: divide all triples to be added into multiple triple blocks according to their subjects;
Step S412: for each distinct triple block, first check whether the block contains a triple whose property is type; if so, the block must be added to the class file set Class1 to Classt corresponding to the object of that triple, and the method proceeds to step S414; if no triple with property type is contained, proceed to step S413;
Step S413: look up each property P1 to Pn contained in the triple block in the two-dimensional relational model of properties and classes, and determine the class file set Class1 to Classt corresponding to the properties;
Step S414: search the class files; if a triple in the class file has a subject consistent with the subject of this triple block, append the block to the segment of that subject; if no consistent subject exists, append the triple block at the end of the class file, separated by the special marker.
6. The distributed RDF storage and query optimization method combined with an ontology according to claim 1, characterized in that deleting data in step S4 specifically comprises the following steps:
Step S421: divide all triples to be deleted into multiple triple blocks according to their subjects;
Step S422: for each distinct triple block, first check whether the block contains a triple whose property is type; if so, the triples in the class file set Class1 to Classt corresponding to the object of that triple must be modified, and the method proceeds to step S424; if no triple with property type is contained, proceed to step S423;
Step S423: look up each property P1 to Pn contained in the triple block in the two-dimensional relational model of properties and classes, and determine the class file set Class1 to Classt corresponding to the properties;
Step S424: search the class files, and delete the triple block identical to this triple block from the class files.
7. The distributed RDF storage and query optimization method combined with an ontology according to claim 1, characterized in that modifying data in step S4 specifically comprises the following steps:
Step S431: divide all pre-modification triples and post-modification triples into multiple triple blocks according to their subjects;
Step S432: for each distinct triple block of the pre-modification triples, first check whether the block contains a triple whose property is type; if so, the triples in the class file set Class1 to Classt corresponding to the object of that triple must be modified, and the method proceeds to step S434; if no triple with property type is contained, proceed to step S433;
Step S433: look up each property P1 to Pn contained in the triple block in the two-dimensional relational model of properties and classes, and determine the class file set Class1 to Classt corresponding to the properties;
Step S434: search the class files, and delete the triple block identical to this triple block from the class files;
Step S435: apply the add-data operation to the post-modification triples.
CN201510003243.5A 2015-01-06 2015-01-06 Distributed RDF storage and query optimization method combined with ontology Active CN104462610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510003243.5A CN104462610B (en) 2015-01-06 2015-01-06 Distributed RDF storage and query optimization method combined with ontology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510003243.5A CN104462610B (en) 2015-01-06 2015-01-06 Distributed RDF storage and query optimization method combined with ontology

Publications (2)

Publication Number Publication Date
CN104462610A true CN104462610A (en) 2015-03-25
CN104462610B CN104462610B (en) 2018-02-06

Family

ID=52908645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510003243.5A Active CN104462610B (en) 2015-01-06 2015-01-06 Distributed RDF storages and enquiring and optimizing method with reference to body

Country Status (1)

Country Link
CN (1) CN104462610B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510213A (en) * 2009-03-23 2009-08-19 杭州电子科技大学 Large scale publishing and subscribing pipelined matching method based on noumenon
CN102567314A (en) * 2010-12-07 2012-07-11 中国电信股份有限公司 Device and method for inquiring knowledge
US20140297676A1 (en) * 2013-04-01 2014-10-02 International Business Machines Corporation Rdf graphs made of rdf query language queries

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
汪璟玢等: "基于索引的分布式RDF查询优化算法" (Wang Jingbin et al.: "Index-based distributed RDF query optimization algorithm"), 《计算机科学》 (Computer Science) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930419A (en) * 2016-04-19 2016-09-07 福州大学 RDF data distributed parallel semantic coding method
CN105930419B (en) * 2016-04-19 2019-08-09 福州大学 RDF data distributed parallel semantic coding method
CN106021457A (en) * 2016-05-17 2016-10-12 福州大学 Keyword-based RDF distributed semantic search method
CN106021457B (en) * 2016-05-17 2019-10-15 福州大学 RDF distributed semantic searching method based on keyword
CN108520035A (en) * 2018-03-29 2018-09-11 天津大学 SPARQL parent map pattern query processing methods based on star decomposition

Also Published As

Publication number Publication date
CN104462610B (en) 2018-02-06


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant