CN105608228A

CN105608228A - High-efficiency distributed RDF data storage method

Info

Publication number: CN105608228A
Application number: CN201610064516.1A
Authority: CN
Inventors: 吴志坚; 黎建辉; 周园春; 侯艳飞; 韩岳岐
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2016-01-29
Filing date: 2016-01-29
Publication date: 2016-05-25
Anticipated expiration: 2036-01-29
Also published as: CN105608228B

Abstract

The invention discloses an efficient distributed RDF data storage method. The method comprises the following steps: 1) a user selects an existing named graph or defines a new named graph for each triple to be uploaded, and sets the effective predicates and their triples according to business requirements; 2) a data control system parses each triple in the RDF data uploaded by the user, extracts the predicate of the triple and the effective predicates of the triple's named graph, and then, according to the effective predicates, splits the triples into two sets that share the same unique identifier, namely the triples of the complete predicates of a subject and the triples of the effective predicates of the same subject, wherein the effective predicates are a subset of the complete predicates; 3) the data control system stores the complete-predicate triple data and the effective-predicate triple data of the same subject in different database clusters. The method improves the availability of the data.

Description

An efficient distributed RDF data storage method

Technical field

The present invention relates to the technical field of RDF data storage, in particular to an efficient distributed RDF data storage method, and belongs to the field of computer software.

Background technology

With the rapid development of Internet technology, the Internet is applied ever more widely and has formed a huge knowledge network repository. This also brings many challenges. To connect knowledge repositories of different forms and to let computers understand the relationships between data, the concept of the Semantic Web was proposed. The goal of the Semantic Web is to make information resources on the network understandable to machines, so that network information resources can be processed automatically and their rapid growth can be accommodated.

The Semantic Web defines the Resource Description Framework (RDF) to describe information resources on the network. RDF is a data model of network resource objects and the relationships among them. It provides a general data model for describing network resources, and uses triples (subject, predicate and object) to describe resources on the network and the relationships between them. Viewed as a graph, the model consists of nodes and the edges between them: nodes represent subjects and objects, and edges represent predicates. Nodes can therefore represent resources, and edges the attributes of resources.

At present, RDF data is generally stored in single-machine RDF database management systems such as GraphDB, Stardog and AllegroGraph. This storage mode can manage a large number of triples, but with the rapid growth of Internet information resources, the storage capacity of a single machine is limited and cannot meet the current demand for storing massive triple data. Researchers have proposed a variety of schemes for storing massive triple data, but all of them are still at the research stage. One example is storing triple data in a Hadoop or HBase distributed cluster, since both Hadoop and HBase natively support the storage and management of massive data, and implementing data queries with MapReduce. However, this storage mode scatters the triples of the same subject, which may end up on many machines. In addition, RDF data relationships are complex, and any two triples may be related, so a MapReduce-based query must perform a large amount of data association and filtering. Current storage schemes therefore cannot query data at high speed, and query performance is low. In particular, when the data volume is very large, a simple query may take tens of seconds, which cannot meet actual business query requirements.

Summary of the invention

In view of the above problems in RDF data storage, the present invention proposes an efficient distributed RDF data storage method that solves the problems of limited storage capacity and scattered triple data in existing RDF data storage methods.

To solve the above problems, the present invention proposes an efficient distributed RDF data storage method, which mainly comprises the following steps:

1) A data parser parses the RDF data uploaded by the user and converts each triple into a triple object of a uniform format. The parsed data is then processed: the predicates in the triples are extracted, and the effective predicates of the named graph are retrieved. Effective predicates are defined by the user's business requirements: the user determines, according to specific business requirements, which predicates and their triples are currently in use, and these form the effective-predicate triples. According to the effective predicates of the named graph, the triple data of the same subject is split into two parts, namely the triple data of the complete predicates of the subject and the triple data of the effective predicates of the subject. The complete-predicate triple data is the complete set of triples of the subject, while the effective-predicate triple data contains only the triples of some of its predicates; the effective-predicate triple data of a subject is therefore a subset of its complete-predicate triple data. A unique ID is also generated for the triples of each subject to identify them uniquely; the complete-predicate triple data and the effective-predicate triple data of the same subject share this unique ID.

2) The data is divided into two parts for storage management, that is, the complete-predicate triple data and the effective-predicate triple data of the same subject are stored separately. An open-source distributed NoSQL database cluster stores the complete-predicate triple data of each subject; this preserves data integrity, so that the effective-predicate triple data can be expanded or reduced later when predicate requirements change. An RDF database cluster stores the effective-predicate triple data of each subject. Because this data is a subset of the complete-predicate triple data, with storage capacity unchanged the system can store and manage more triple data, and the smaller volume of triples improves query efficiency. The RDF database cluster consists of data nodes, a routing node and a configuration node.

3) The effective-predicate triple data in the RDF database cluster is extended dynamically. The RDF database cluster stores only the effective-predicate triple data of each subject, and the effective predicates can change dynamically. When the effective predicates change, the user first submits a predicate update task. The system's predicate update task monitoring module detects the submitted task and starts it in the background to determine which predicates have changed; the triples stored in the RDF database cluster must change accordingly. When the effective predicates change, the data management module imports the triples of the changed predicates from the complete-predicate triple data stored in the distributed NoSQL database cluster into the RDF database cluster, ensuring the integrity of the stored triple data.
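The change detection in step 3) can be illustrated with a minimal sketch. It assumes that the old and new effective predicate lists are available as plain collections of predicate names; the function names `diff_predicates` and `reduce_triples` and the tuple representation of a triple are hypothetical and are not taken from the original text.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object); assumed representation


def diff_predicates(old: Iterable[str], new: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) predicates between two effective predicate lists."""
    old_set, new_set = set(old), set(new)
    return new_set - old_set, old_set - new_set


def reduce_triples(rdf_triples: List[Triple], removed: Set[str]) -> List[Triple]:
    """Drop triples whose predicate is no longer effective (predicate reduction)."""
    return [t for t in rdf_triples if t[1] not in removed]
```

Triples of the added predicates would then be read from the complete-predicate store and imported into the RDF database cluster, while triples of the removed predicates can be dropped from it without loss, because the complete-predicate store still holds them.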

Further, regarding the triples and the named graph (graphname): the basic structure of RDF data is a set of triples, each consisting of a subject, a predicate and an object, where the predicate represents the relationship between the subject and the object. A set of such triples is called an RDF graph, and the name given to an RDF graph is its named graph (graphname). A named graph is the space in which data is stored, equivalent to the concept of a database in a relational database system. It is defined according to business requirements when the user uploads data; the user can select an existing named graph or add a new one.

Further, regarding the complete predicates and the effective predicates: the present invention divides the predicates of the triples of the same subject into two categories, complete predicates and effective predicates. Complete predicates are all the predicates contained in a named graph. Effective predicates are defined by the user according to business requirements and are the predicates in a named graph that the user currently needs. According to this predicate information, the triples of the same subject are divided into two parts: the complete-predicate triples and the effective-predicate triples.
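The split described above can be sketched as follows. This is only an illustration under stated assumptions: triples are modelled as `(subject, predicate, object)` tuples, and the names `SubjectRecord` and `split_by_effective_predicates` are hypothetical, not part of the described system.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object); assumed representation


@dataclass
class SubjectRecord:
    uid: int               # unique ID shared by both parts
    graph: str             # named graph the subject belongs to
    subject: str
    triples: List[Triple]


def split_by_effective_predicates(
    uid: int, graph: str, subject: str,
    triples: List[Triple], effective_predicates: Set[str],
) -> Tuple[SubjectRecord, SubjectRecord]:
    """Split one subject's triples into the complete part and the effective part."""
    complete = SubjectRecord(uid, graph, subject, list(triples))
    # The effective part keeps only triples whose predicate is currently in use,
    # so it is always a subset of the complete part.
    effective = SubjectRecord(
        uid, graph, subject,
        [t for t in triples if t[1] in effective_predicates],
    )
    return complete, effective
```

Both records carry the same `uid`, which is what allows the two stores to be re-associated later.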

Further, the complete-predicate triple data and the effective-predicate triple data of the same subject are stored and managed separately. The triples of a subject generally have many predicates, and most of them are redundant with respect to actual requirements: they are not used under current business requirements but may be needed when requirements change in the future. To preserve data integrity, this part of the data must not be lost. The data is therefore partitioned in this way: the complete-predicate triple data and the effective-predicate triple data are stored separately and associated through the unique ID. The open-source distributed NoSQL database cluster stores the complete-predicate triple data, and the RDF database cluster stores the effective-predicate triples.

Further, the RDF database cluster consists of data nodes, a routing node and a configuration node. The data nodes mainly store data and consist of multiple open-source standalone RDF databases. The routing node (router) controls the data nodes, including data update, data node selection, data sharding and data synchronization. The configuration node (config) manages the configuration information of the data nodes, including each data node's IP address and port, name, named graphs, predicate information, number of stored triples, maximum load factor and master/slave flag.
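One possible shape for the per-node record kept by the configuration node is sketched below. The field list follows the items enumerated above; the class name `DataNodeConfig`, the field names, and the `max_capacity` field (needed to compute a load factor) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DataNodeConfig:
    name: str
    ip: str
    port: int
    is_master: bool            # master/slave flag
    max_capacity: int          # assumed: maximum number of triples the node may hold
    max_load_factor: float     # largest load factor value allowed
    stored_triples: int = 0    # number of triples currently stored
    graphs: Set[str] = field(default_factory=set)                   # named graphs on this node
    predicates: Dict[str, Set[str]] = field(default_factory=dict)   # graph -> effective predicates

    @property
    def load_factor(self) -> float:
        """Current load factor: stored amount divided by maximum capacity."""
        return self.stored_triples / self.max_capacity

    @property
    def is_full(self) -> bool:
        """True once the current load factor reaches the allowed maximum."""
        return self.load_factor >= self.max_load_factor
```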

Further, regarding data sharding and data node selection: to solve the problem of data dispersion when storing triple data, the triple data of the same subject is stored on the same data node, and the data of the same named graph is stored on the same data node as long as the node's maximum storage capacity allows. This reduces both the computation required by distributed queries and the data transferred between nodes, and thus speeds up queries. During data sharding, the triple data of the same subject is treated as one atomic unit, and a data node is selected to store it according to each data node's current storage amount, storage capacity and maximum load factor and the distribution of the graph.
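Treating the triples of one subject as an atomic unit can be sketched as a simple grouping step before placement. The helper names `group_by_subject` and `place_units`, and the `choose_node` callback standing in for the routing node's selection logic, are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object); assumed representation


def group_by_subject(triples: List[Triple]) -> Dict[str, List[Triple]]:
    """Collect all triples of the same subject into one atomic unit."""
    units: Dict[str, List[Triple]] = defaultdict(list)
    for triple in triples:
        units[triple[0]].append(triple)
    return dict(units)


def place_units(
    units: Dict[str, List[Triple]],
    choose_node: Callable[[str, List[Triple]], str],
) -> Dict[str, str]:
    """Map each subject to exactly one node, so a unit is never split across nodes."""
    return {subject: choose_node(subject, unit) for subject, unit in units.items()}
```

Because placement is decided once per subject rather than once per triple, a query about a single subject only needs to touch one data node.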

Compared with the prior art, the beneficial effects of the present invention are:

For the storage of large-scale RDF data, the present invention proposes a new distributed RDF data storage scheme in which the data is divided into two parts for storage management: the complete-predicate triple data and the effective-predicate triple data are stored separately. This increases RDF storage capacity, allowing massive RDF data to be managed. It also improves data availability: the RDF database cluster has data shards and backup data, so when a data node fails the system continues to run normally without interruption. The sharding strategy treats the triple data of the same subject as one atomic unit and performs data sharding and data node selection according to named graph and subject. This reduces the dispersion of triple data across data nodes, lowers query complexity and the volume of data transferred between nodes, and improves query efficiency.

Brief description of the drawings

The accompanying drawing is a system architecture diagram of the efficient distributed RDF data storage method of the present invention.

Detailed description of the invention

To express the method of the present invention more clearly and intuitively, the invention is explained in further detail below with reference to the accompanying drawing. The efficient distributed RDF data storage method of the present invention comprises the following steps:

1) Data access: responsible for providing a unified external data access interface; all data access goes through the provided interfaces. These mainly include interfaces for data upload, data update, data query, predicate expansion and predicate information query.

2) Data control: provides control and processing functions for data, mainly including data management, predicate management and data storage management.

Data management provides management functions for RDF data, including upload, update and query control. RDF upload control comprises an RDF data parser, an RDF data splitting module and unique ID generation. When data is uploaded, the RDF data parser first parses the RDF data. It supports RDF data in multiple formats, including xml, json and nt, and parses the data, according to the format uploaded by the user, into RDF data objects of a uniform format. Next, the RDF data splitting module splits the parsed RDF data objects. The user defines the named graph of the RDF data, which determines the named graph in which the uploaded data is saved; the module obtains the effective predicate list of that named graph and splits the data into two parts according to the list: the complete-predicate triple object and the effective-predicate triple object of the same subject. Finally, a unique ID is generated to identify the triples of the subject uniquely and to associate the complete-predicate triples with the effective-predicate triples of the same subject. The ID is generated with an auto-increment strategy: a custom unique ID generator obtains the next auto-increment ID of the named graph, and a triple containing this ID is generated and encapsulated into both the complete-predicate triple object and the effective-predicate triple object of the subject.
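The auto-increment ID strategy can be sketched as a per-graph counter. This is an in-memory illustration only; a real deployment would need a persistent, cluster-wide counter, which the text does not describe. The class name `GraphIdGenerator`, the helper `id_triple` and the predicate URI `urn:example:uid` are hypothetical.

```python
import itertools
import threading
from typing import Dict, Iterator, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object); assumed representation

ID_PREDICATE = "urn:example:uid"  # hypothetical predicate used to carry the unique ID


class GraphIdGenerator:
    """Hands out auto-increment IDs, with an independent sequence per named graph."""

    def __init__(self) -> None:
        self._counters: Dict[str, Iterator[int]] = {}
        self._lock = threading.Lock()

    def next_id(self, graph: str) -> int:
        # The lock keeps concurrent uploads from receiving the same ID.
        with self._lock:
            if graph not in self._counters:
                self._counters[graph] = itertools.count(1)
            return next(self._counters[graph])


def id_triple(subject: str, uid: int) -> Triple:
    """Build the triple that carries the unique ID for a subject."""
    return (subject, ID_PREDICATE, str(uid))
```

The same ID triple would be added to both the complete-predicate object and the effective-predicate object of the subject, so the two parts can be matched across the two clusters.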

Predicate management provides management functions for the predicates of RDF data, including predicate expansion, predicate reduction and predicate information query. Predicate expansion means expanding the effective predicates. Because the RDF database cluster stores the triples of only some predicates, when the user needs a predicate of a named graph that is not among the effective predicates, the effective predicates must be expanded and the triples of these predicates added to the database. The predicate expansion steps are as follows. The user submits the named graph and the predicates to be expanded. The predicate management module obtains the submitted named graph and expansion predicates, compares them with the effective predicates of the named graph, and determines the predicates that actually need to be expanded; this check ensures that the existing effective predicates do not already contain the submitted predicates and serves to validate the user's input. A predicate expansion task is then submitted through the predicate expansion scheduler and executed asynchronously in the background: the corresponding triple data is read from the NoSQL database, the triples of the expansion predicates are extracted, and they are imported into the RDF database cluster.
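The expansion flow (validate the request, then import in the background) might look like the following sketch. Plain Python lists stand in for the NoSQL store and the RDF database cluster, and a `ThreadPoolExecutor` stands in for the predicate expansion scheduler; all function names are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object); assumed representation


def validate_expansion(requested: Iterable[str], effective: Set[str]) -> Set[str]:
    """Keep only the requested predicates that are not already effective."""
    return set(requested) - set(effective)


def import_task(new_predicates: Set[str],
                complete_triples: List[Triple],
                rdf_triples: List[Triple]) -> int:
    """Copy triples of the new predicates from the complete store to the RDF store."""
    imported = [t for t in complete_triples if t[1] in new_predicates]
    rdf_triples.extend(imported)
    return len(imported)


def submit_expansion(executor: ThreadPoolExecutor,
                     requested: Iterable[str],
                     effective: Set[str],
                     complete_triples: List[Triple],
                     rdf_triples: List[Triple]) -> Optional[Future]:
    """Validate the request and, if anything is new, schedule the import in the background."""
    new_predicates = validate_expansion(requested, effective)
    if not new_predicates:
        return None  # nothing to expand: every requested predicate is already effective
    return executor.submit(import_task, new_predicates, complete_triples, rdf_triples)
```

The caller gets a `Future` back immediately and can continue serving requests while the import runs; updating the effective predicate list of the named graph after the import succeeds is left out of this sketch.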

Data storage management provides the data management module and the predicate management module with operations on the databases. All database operations go through this module, which offers a unified data access interface and thereby separates data processing from data storage. Its functions include querying, updating and uploading data in the databases, importing data for predicate expansion, and querying, updating and uploading predicate information.

3) Data persistence: responsible for the physical storage of data, that is, saving data to disk. The data is persisted in two parts, using a NoSQL database cluster and an RDF database cluster. The NoSQL database cluster is an open-source distributed NoSQL database cluster; taking advantage of its capacity for managing massive data, it stores the complete-predicate triple data to preserve data integrity, and when the effective predicates change, the triple data of the corresponding predicates is read from it and imported into the RDF database cluster. The RDF database cluster consists of multiple data nodes, a routing node and a configuration node. The data nodes mainly store triple data and consist of multiple open-source standalone RDF databases. The routing node controls the data nodes, including data update, data node selection, data sharding and data synchronization; it manages the RDF database cluster and is the central node of the cluster, controlling every RDF database data node. The configuration node manages the configuration information of the data nodes, including each data node's IP address and port, name, named graphs, predicate information, number of stored triples, maximum load factor and master/slave flag. The load factor is the ratio of the amount of stored data to the maximum data capacity; the maximum load factor is the largest load factor value allowed; and the current load factor is the ratio of the amount of data currently stored to the maximum data capacity.

When triple data is uploaded, the routing node uses the named graph of the triples and the configuration information in the configuration node to determine the data nodes that hold the data of this named graph. If the named graph is not stored on any data node, it is a new graph, and the data node with the smallest current load factor among all data nodes is chosen to store the uploaded triple data. If the named graph is already stored on some data nodes, the one with the smallest current load factor among them is chosen. If that smallest current load factor is greater than or equal to the maximum load factor, the data of the named graph must be sharded: the data node with the smallest load factor among the other data nodes is chosen to store the uploaded triple data. Otherwise, the data node with the smallest current load factor is chosen directly to store the uploaded triple data. After the data is stored on a data node, the corresponding configuration information is updated, including the named graph information and the data node's stored triple count.

Implementation example of data upload:

1. Prepare the triple data and define its named graph (graphname), which determines the named graph to which the data will be uploaded. Call the data upload interface to upload the triple data and its named graph to the data management module.

2. The data management module calls the data parser to parse the triple data and encapsulate it into triple data objects of a uniform format.

3. The data management module calls the data splitting module and queries the effective predicate list of the named graph through the predicate management module. According to the effective predicate list, the uploaded triple data object is split into two parts: the complete-predicate triple data object and the effective-predicate triple data object.

4. The data management module uses the auto-increment unique ID generator to generate a unique ID for the uploaded triple data, and encapsulates the ID value into both the complete-predicate triple data object and the effective-predicate triple data object.

5. The data storage control module is called to store the complete-predicate triple data and the effective-predicate triple data in the NoSQL database cluster and the RDF database cluster respectively. The complete-predicate triple data is stored directly in the NoSQL database cluster. The routing node of the RDF database cluster controls the storage of the effective-predicate triple data.

6. The routing node of the RDF database cluster calls the configuration node to obtain the data nodes that hold the named graph. If the data of the named graph is not stored on any data node, the named graph is a new graph; the data node with the smallest current load factor among all data nodes is chosen to store the uploaded triple data, and data storage continues with step 10.

7. If the named graph is already stored on some data nodes, the data node with the smallest current load factor among them is chosen.

8. If the smallest current load factor among the selected data nodes is greater than or equal to the maximum load factor, the data of the named graph must be sharded: the data node with the smallest current load factor among the other data nodes is chosen to store the uploaded triple data, and data storage continues with step 10.

9. If the smallest current load factor among the selected data nodes is less than the maximum load factor, the data node with the smallest current load factor is chosen directly to store the uploaded triple data, and data storage continues with step 10.

10. After the data is stored on the data node, the corresponding configuration information is updated: the named graph information, and the data node's stored triple count and current load factor.
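Steps 6 to 10 amount to a small selection procedure, sketched below. The sketch interprets "other data nodes" as the nodes that do not yet hold the named graph, and it falls back to the least-loaded holder when no such node exists; both points are assumptions, since the text does not spell them out. The names `Node`, `select_node` and `store` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    stored: int              # number of triples currently stored
    capacity: int            # assumed: maximum number of triples
    max_load_factor: float
    graphs: Set[str] = field(default_factory=set)

    @property
    def load_factor(self) -> float:
        return self.stored / self.capacity


def select_node(graph: str, nodes: List[Node]) -> Node:
    """Pick the data node for an upload to `graph` (steps 6 to 9)."""
    holders = [n for n in nodes if graph in n.graphs]
    if not holders:
        # Step 6: new graph, take the least-loaded node overall.
        return min(nodes, key=lambda n: n.load_factor)
    best = min(holders, key=lambda n: n.load_factor)  # step 7
    if best.load_factor >= best.max_load_factor:
        # Step 8: shard the graph onto the least-loaded node that does not hold it yet.
        others = [n for n in nodes if graph not in n.graphs]
        if others:
            return min(others, key=lambda n: n.load_factor)
        # Assumed fallback: no other node exists, so keep using the least-loaded holder.
    return best  # step 9


def store(graph: str, triple_count: int, nodes: List[Node]) -> Node:
    """Store an upload and update the node's bookkeeping (step 10)."""
    node = select_node(graph, nodes)
    node.stored += triple_count
    node.graphs.add(graph)
    return node
```

Keeping a graph on the nodes that already hold it until they reach the maximum load factor is what limits the number of nodes a query on that graph has to visit.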

Claims

1. An efficient distributed RDF data storage method, comprising the steps of:

1) a user selects a named graph or sets a new named graph for each triple to be uploaded, and sets effective predicates and their triples for the triple according to business requirements;

2) a data control system parses each triple in the RDF data uploaded by the user and extracts the predicate of the triple and the effective predicates of the named graph of the triple; then, according to the effective predicates, the triple is split into two triples having the same unique identifier: a triple of the complete predicates of the same subject and a triple of the effective predicates of the same subject; wherein the complete predicates are all the predicates contained in the named graph of the triple, and the effective predicates are a subset of the complete predicates;

3) the data control system stores the resulting complete-predicate triple data and effective-predicate triple data of the same subject in different database clusters.

2. The method of claim 1, characterized in that an open-source distributed NoSQL database cluster is used to store the complete-predicate triple data of the same subject, and an RDF database cluster is used to store the effective-predicate triple data of the same subject.

3. The method of claim 2, characterized in that when the data control system receives a predicate update task, it detects the changed predicates according to the predicate update information in the task and then updates the predicates in the corresponding triples stored in the RDF database cluster.

4. The method of claim 2 or 3, characterized in that the RDF database cluster comprises data nodes, a routing node and a configuration node; wherein the data nodes are used for data storage; the routing node is used to control the data nodes, including data update, data node selection, data sharding and data synchronization; and the configuration node is used to manage the configuration information of the data nodes, including each data node's IP address and port, name, named graphs, predicate information, number of stored triples, maximum load factor and master/slave flag information.

5. The method of claim 4, characterized in that the data control system stores the triple data of the same subject on the same data node.

6. The method of claim 5, characterized in that the data control system stores the data of the same named graph on the same data node, within the data node's maximum storage capacity.

7. The method of claim 4, characterized in that the routing node determines the data nodes holding the data of a named graph according to the named graph of the triple and the configuration information of the configuration node; wherein, if the data of the named graph is not stored on any data node, the data node with the smallest current load factor is chosen from all data nodes to store the uploaded triple data; if data nodes storing the data of the named graph are found, the data node with the smallest current load factor is chosen from them; if the smallest current load factor among these data nodes is greater than or equal to the maximum load factor, the data of the named graph is sharded and the node with the smallest load factor is chosen from the other data nodes to store the uploaded triple data; otherwise, the data node with the smallest current load factor is chosen to store the uploaded triple data.

8. The method of claim 7, characterized in that after a data node stores a triple, the corresponding configuration information is updated, including the named graph information, the number of stored triples and the current load factor.

9. The method of claim 1, characterized in that the data control system expands the extracted effective predicates: for the predicates of the named graph that the user submits for expansion, the data control system obtains the submitted named graph and its expansion predicates, compares them with the effective predicates of the named graph, and determines the predicates to be expanded.