CN106227800A

CN106227800A - Storage method and management system for highly correlated big data

Info

Publication number: CN106227800A
Application number: CN201610579013.8A
Authority: CN
Inventors: 李昊; 张敏; 付艳艳; 惠榛; 陈震宇; 张宗福
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2016-07-21
Filing date: 2016-07-21
Publication date: 2016-12-14
Anticipated expiration: 2036-07-21
Also published as: CN106227800B

Abstract

The invention discloses a storage method and a management system for highly correlated big data. The system comprises a storage module and a unified data management module. The storage module comprises a Hashmap model for storing the content of data entities, a relational model for storing the attributes of data entities, and a graph data model for storing the associations between data entities. Each data entity is assigned an entity type and a unique ID, and the attributes and content of the same data entity are associated through that ID. The unified data management module adds, deletes, updates, and queries the associations, attributes, and content of data entities in the storage module. The invention enables the storage and management of large data sets while supporting efficient association queries and analysis.

Description

Storage method and management system for highly correlated big data

Technical field

The invention belongs to the field of big data storage and relates to a storage method and a management system for highly correlated big data.

Background technology

In the big data era, enterprises and organizations increasingly value data and have gradually begun to collect, store, analyze, and use big data. Associations between data items are common in these large data sets. In application scenarios closely tied to individual users, such as social network big data and medical big data, the data objects are highly correlated. The complex relationships present in such highly correlated data sets often have great analytical value, for example friend relationships between social users, or associations between medicines and patients. These highly correlated large data sets are also large in scale, fast-changing, and diverse. To analyze and use them better, the efficient storage and management of such data sets needs to be studied.

To meet the storage demands of big data, structured data is usually stored in relational databases, and semi-structured or unstructured data in NoSQL databases. Among these storage methods, relational databases and most NoSQL databases (for example key-value, document, and column databases) store and manage the relationships between data inefficiently. What they store are unconnected records, values, documents, and columns; querying and analyzing the associations between data requires additional mechanisms such as indexes, foreign keys, and table joins.

Graph databases, by contrast, focus on storing and querying the relationships between data, and their efficiency on multi-level association queries and reverse queries is far higher than that of relational databases and other NoSQL databases. A multi-level association query follows relationships between data across several levels. For example, the query "friends of friends of friends of someone's friends" is a multi-level query over friend relationships. A reverse query is one whose direction is opposite to the direction in which the index was built. For example, given an index "patient -> medicine", querying which medicines a patient has bought is efficient, but the reverse query of which patients have bought a given medicine is much less efficient. Even if a "medicine -> patient" index is built to handle that reverse query, a query such as "which patients bought medicine A and also bought medicine B" still requires repeated lookups and remains inefficient. Graph databases can solve the problem of querying such relationship-focused data. However, graph databases cannot satisfy the large scale and diversity of storage that large data sets require.
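To make the notion of a multi-level association query concrete, the following is a minimal sketch, assuming a small in-memory friendship graph. The `friends` dictionary and the `within_hops` function are hypothetical names introduced only for illustration; a real graph database would expose this kind of traversal through its own query language.

```python
from collections import deque

# Hypothetical undirected friendship graph: person -> set of direct friends.
friends = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}

def within_hops(start, max_hops):
    """Return everyone reachable from `start` in at most `max_hops` friend links."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the requested number of levels
        for friend in friends[person]:
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    seen.discard(start)
    return seen

two_level = within_hops("A", 2)
three_level = within_hops("A", 3)
```

Each extra level only follows edges out of the current frontier, which is why this kind of query suits an edge-oriented store better than repeated table joins.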

Some methods have appeared that combine NoSQL databases and file systems to store large data sets. These methods place different data in a suitable database or file system according to its characteristics. For example, structured data is stored in a relational database, and semi-structured or unstructured data in a NoSQL database. However, they do not take into account the complex associations between data in highly correlated large data sets, and they lack a matching data model, storage method, and query method. As a result, the data sets stored in the different databases or file systems are largely independent of one another and contain a great deal of redundancy, and complex association queries over them are inefficient.

In short, the field of big data storage still lacks an efficient storage and management method for large data sets with highly correlated relationships, one that meets both the storage demands of big data and the need for efficient analysis of the complex relationships within it.

Summary of the invention

In view of the above technical problem, the object of the invention is to provide a storage method and a management system for highly correlated big data that store and manage such large data sets while supporting efficient association queries and analysis.

The basic principle of the technique is as follows. A hybrid data model is proposed that is built on a graph data model, a relational model, and a Hashmap model, with the associations between data entities at its core: the graph data model describes the associations between data entities, the relational model describes the structured attributes of data entities, and the Hashmap model describes the content of entities. These models are implemented with a graph database, a relational database, and a key-value database or distributed file system, respectively. An appropriate strategy then optimizes the order in which the relationship query, attribute query, and content lookup of data entities are performed, improving query and retrieval efficiency.

Specifically, to achieve the above technical object, the invention adopts the following technical solution:

A storage method and management system for highly correlated big data, comprising:

1) establishing a hybrid data model for highly correlated large data sets;

Further, the hybrid data model comprises a graph data model, a relational model, and a Hashmap model.

Further, the graph data model is a data model in which nodes represent data entities and the edges connecting nodes represent the relationships between data entities.

Further, the relational model is the normalized model used in relational databases, that is, a data model that represents entities and the relationships between entities as two-dimensional tables. In this technical solution, two-dimensional tables are used only to describe entity attributes. In other words, although the relational model is normally used to hold both entities and the relationships between them, in the hybrid data model of the invention it is used only to describe entity attributes.

Further, the Hashmap model is a data model that uses a key to store and retrieve the value behind it. The Hashmap model may be implemented as a key-value database or as a distributed file system; either is acceptable.

Further, the hybrid data model is constructed as shown in Fig. 1. Each data entity has an entity type and a unique ID. The attributes of data entities are described with two-dimensional tables of the relational model, that is, each entity type corresponds to one two-dimensional table of the form [data entity ID | attribute A | attribute B ...]. In the Hashmap model, the key is the ID of a data entity and the value is the original content of that entity. Associations between data entities are represented by the graph data model: each node represents one data entity and is labeled with that entity's type and ID, and the edges between nodes represent the associations between data entities. The association between the attributes and the content of the same entity is represented by the mapping through the entity ID.
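The construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: an in-memory SQLite table stands in for the relational model, a plain dictionary for the Hashmap model, and a dictionary plus a set of tuples for the graph data model. The names `put_entity`, `content_store`, `graph_nodes`, and `graph_edges`, and the single `person` entity type, are hypothetical.

```python
import sqlite3

# Relational model stand-in: one two-dimensional table per entity type.
relational = sqlite3.connect(":memory:")
relational.execute(
    "CREATE TABLE person (entity_id TEXT PRIMARY KEY, attr_a TEXT, attr_b TEXT)"
)
content_store = {}   # Hashmap model stand-in: entity ID -> original content
graph_nodes = {}     # graph model stand-in: entity ID -> entity type
graph_edges = set()  # graph model stand-in: (source ID, relation, target ID)

def put_entity(entity_id, entity_type, attr_a, attr_b, content):
    """Store one entity in all three models; only its ID links the three parts."""
    relational.execute(
        "INSERT INTO person VALUES (?, ?, ?)", (entity_id, attr_a, attr_b)
    )
    content_store[entity_id] = content
    graph_nodes[entity_id] = entity_type

put_entity("p1", "person", "Alice", "30", "raw profile of Alice")
put_entity("p2", "person", "Bob", "25", "raw profile of Bob")
graph_edges.add(("p1", "friend", "p2"))

# The same ID resolves to the attributes, the content, and the graph node.
row = relational.execute(
    "SELECT attr_a, attr_b FROM person WHERE entity_id = ?", ("p1",)
).fetchone()
```

Because the three stores share nothing but the entity ID, each one could be swapped for a real database of the matching kind without changing the mapping.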

2) a storage method and management system for highly correlated big data that match the above hybrid data model;

Further, the storage method and management system for highly correlated big data are shown in Fig. 2 and comprise a storage module, a unified data management module, and a secondary index mechanism.

Further, the storage module is the set of databases or file systems that store data entities: a graph database for storing the relationships between data entities, a key-value database or distributed file system for storing the original content of data entities, and a relational database for storing the attributes of data entities. Specifically:

The graph database implements the graph data model: each data entity is stored as a node in the graph database, and the node carries the entity ID and the entity type.

The key-value database or distributed file system implements the Hashmap model. The entity ID may be used as the key, but this is not required. When something other than the entity ID is used as the key (for example, when several entities are stored together, the storage time, or a combination of the storage time and the entity ID, can serve as the key), a secondary index needs to be built to improve query efficiency, that is, an index from the entity ID to the key under which the entity's content is stored. This secondary index can be implemented as a table in the relational database.

The relational database implements the relational data model and the secondary index mechanism. One data table [entity ID | attribute A | attribute B ...] is built for each entity type, with the entity ID as its primary key.
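The case where the key is not the entity ID can be sketched as follows. This is a minimal illustration that assumes a dictionary in place of the key-value database and an in-memory SQLite table in place of the secondary index table; `store_batch`, `get_content`, and the table name `secondary_index` are hypothetical.

```python
import sqlite3

kv_store = {}  # stand-in for the key-value database or distributed file system
db = sqlite3.connect(":memory:")
# Secondary index table: entity ID -> key under which its content is stored.
db.execute(
    "CREATE TABLE secondary_index (entity_id TEXT PRIMARY KEY, storage_key TEXT NOT NULL)"
)

def store_batch(storage_key, entities):
    """Store several entities' contents under one key and index each entity ID."""
    kv_store[storage_key] = dict(entities)
    db.executemany(
        "INSERT INTO secondary_index VALUES (?, ?)",
        [(entity_id, storage_key) for entity_id in entities],
    )

def get_content(entity_id):
    """Resolve the storage key through the secondary index, then read the content."""
    row = db.execute(
        "SELECT storage_key FROM secondary_index WHERE entity_id = ?", (entity_id,)
    ).fetchone()
    if row is None:
        return None
    return kv_store[row[0]][entity_id]

# Here the storage time serves as the key for two entities stored together.
store_batch("2016-07-21T10:00", {"e1": "content one", "e2": "content two"})
```

Without the index table, finding one entity's content would mean scanning every stored batch.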

Further, the unified data management module mainly implements the add, delete, update, and query functions for the associations, attributes, and content of data entities across the different databases or file systems, and uses an appropriate strategy to optimize the order of the relationship query, attribute query, and content lookup of data entities, improving query and retrieval efficiency.

Further, the secondary index mechanism mainly serves to improve the efficiency of common queries: indexes are built on the attributes or attribute combinations that are frequently used as query conditions, so that entity IDs can be retrieved quickly. It also includes the secondary index table that is built when something other than the entity ID is used as the key of the key-value database or distributed file system.

Further, the data addition flow is as follows:

Step A1: Store the original content of the new data entity in the key-value database or distributed file system, with the key set to the entity ID or to another unique identifier. If a unique identifier other than the entity ID is used, a table [entity ID | other unique identifier] for the secondary index should be created in the relational database in advance, and when a data entity is added, a record mapping the new entity's ID to its new unique identifier is inserted into this table.

Step A2: Extract the attribute values of the new data entity and insert them into the relational table corresponding to the entity's type; the new record has the form "entity ID | attribute A | attribute B ...".

Step A3: Extract the associations between the new data entity and other data entities. Insert a new node representing the new data entity into the graph database and set two node properties on it that record the entity ID and the entity type, then create edges according to the associations between the new data entity and the other data entities. If the other data entity associated with the new one already exists in the graph database, connect the nodes of the two entities with an edge. If it does not yet exist in the graph database, first create a node for it in the graph database and then connect the two nodes with an edge.
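Steps A1 to A3 can be sketched as a single function. This is a minimal illustration under stated assumptions: the entity ID is used directly as the key (so no secondary index table is needed), a dictionary stands in for the key-value store, SQLite for the relational database, and a dictionary plus a set for the graph database. The function `add_weibo` and the `weibo` entity type with its two attributes are hypothetical.

```python
import sqlite3

kv_store = {}
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE weibo (entity_id TEXT PRIMARY KEY, publish_time TEXT, device TEXT)")
nodes = {}     # node ID -> {"entity_id": ..., "entity_type": ...}
edges = set()  # (source ID, relation, target ID)

def add_weibo(entity_id, content, publish_time, device, relations):
    """relations: list of ((src_id, src_type), relation, (dst_id, dst_type))."""
    # Step A1: store the original content, keyed by the entity ID.
    kv_store[entity_id] = content
    # Step A2: insert the attribute values into the table of this entity type.
    db.execute("INSERT INTO weibo VALUES (?, ?, ?)", (entity_id, publish_time, device))
    # Step A3: create the node with its two properties, then the edges.
    nodes[entity_id] = {"entity_id": entity_id, "entity_type": "weibo"}
    for (src_id, src_type), relation, (dst_id, dst_type) in relations:
        for other_id, other_type in ((src_id, src_type), (dst_id, dst_type)):
            # Create the other endpoint first if it is not in the graph yet.
            nodes.setdefault(other_id, {"entity_id": other_id, "entity_type": other_type})
        edges.add((src_id, relation, dst_id))

add_weibo("w2", "forwarding w1", "2016-07-21", "phone", [
    (("u1", "user"), "publish", ("w2", "weibo")),
    (("w2", "weibo"), "forward", ("w1", "weibo")),
])
```

In this usage the forwarded microblog `w1` and the user `u1` were not yet in the graph, so their nodes are created before the edges are added.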

Further, the data deletion flow is as follows:

Step B1: Delete the data entity from the key-value database or distributed file system. If a table [entity ID | other unique identifier] for the secondary index exists in the relational database, also delete the record corresponding to this data entity.

Step B2: Delete the data entity from the corresponding relational table in the relational database.

Step B3: Delete the node representing the data entity from the graph database, together with all edges that have this node as an endpoint.
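Steps B1 to B3 can be sketched as follows, again with hypothetical in-memory stand-ins (a dictionary, SQLite tables, and a node dictionary with an edge set). The function name `delete_entity` and the sample data are assumptions made for illustration.

```python
import sqlite3

kv_store = {"w1": "text of w1", "w2": "text of w2"}
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE weibo (entity_id TEXT PRIMARY KEY, publish_time TEXT)")
db.executemany("INSERT INTO weibo VALUES (?, ?)", [("w1", "2016-07-01"), ("w2", "2016-07-02")])
db.execute("CREATE TABLE secondary_index (entity_id TEXT PRIMARY KEY, other_key TEXT)")
nodes = {"w1": "weibo", "w2": "weibo", "u1": "user"}
edges = {("u1", "publish", "w1"), ("w2", "forward", "w1"), ("u1", "publish", "w2")}

def delete_entity(entity_id):
    # Step B1: remove the content and any secondary index record for the entity.
    kv_store.pop(entity_id, None)
    db.execute("DELETE FROM secondary_index WHERE entity_id = ?", (entity_id,))
    # Step B2: remove the attribute record.
    db.execute("DELETE FROM weibo WHERE entity_id = ?", (entity_id,))
    # Step B3: remove the node and every edge that has it as an endpoint.
    nodes.pop(entity_id, None)
    incident = {edge for edge in edges if entity_id in (edge[0], edge[2])}
    edges.difference_update(incident)

delete_entity("w1")
remaining_rows = db.execute("SELECT entity_id FROM weibo").fetchall()
```

Removing the incident edges in the same call keeps the graph free of edges that point at a node that no longer exists.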

Further, the data query flow is as follows:

Step C1: According to the constraints on associations between data entities in the query condition, perform a matching search in the graph database to obtain the result set R1 of entities that satisfy them. If the query condition places no constraint on associations between data entities, step C1 is skipped and R1 is set to empty.

Step C2: According to the attribute constraints on data entities in the query condition, perform a matching search in the relational database to obtain the result set R2. If the query condition places no attribute constraint on data entities, step C2 is skipped and R2 is set to empty.

Step C3: Set the result set R3 from the sets of entity IDs in R1 and R2. If R1 and R2 are both empty, R3 is empty. If exactly one of R1 and R2 is empty, R3 equals the set of entity IDs in the non-empty result set. If R1 and R2 are both non-empty, R3 is the intersection of the entity ID sets of R1 and R2 (the intersection may be empty).

Step C4: If the query result is required to include the body content of the data, look up the corresponding entity content in the key-value database or distributed file system according to the entity IDs in R3. If R3 is empty, that is, it contains no data entity, step C4 is skipped and the flow proceeds directly to step C5.

Step C5: If R3 is not empty, return the query results R1, R2, and R3 together with the content of the data entities in R3. If R3 is empty, return R1, R2, and R3.
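The query flow C1 to C5 can be sketched as follows. This is a minimal illustration that follows the rules of step C3 literally, so a skipped step and a step with no matches both yield an empty set; an implementation might prefer to tell those two cases apart. The `query` function, its parameters, and the sample data are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE user_info (entity_id TEXT PRIMARY KEY, age INTEGER)")
db.executemany("INSERT INTO user_info VALUES (?, ?)", [("u1", 20), ("u2", 35), ("u3", 40)])
edges = {("u1", "follows", "u2"), ("u3", "follows", "u2"), ("u2", "follows", "u1")}
contents = {"u1": "profile 1", "u2": "profile 2", "u3": "profile 3"}

def query(follows_target=None, min_age=None, want_content=False):
    # Step C1: graph match; skipped (R1 empty) without a relationship constraint.
    r1 = set()
    if follows_target is not None:
        r1 = {s for s, rel, t in edges if rel == "follows" and t == follows_target}
    # Step C2: attribute match; skipped (R2 empty) without an attribute constraint.
    r2 = set()
    if min_age is not None:
        rows = db.execute("SELECT entity_id FROM user_info WHERE age >= ?", (min_age,))
        r2 = {row[0] for row in rows}
    # Step C3: intersection if both are non-empty, otherwise the non-empty one.
    r3 = (r1 & r2) if (r1 and r2) else set(r1 or r2)
    # Step C4: fetch content only when requested and R3 is non-empty.
    found = {eid: contents[eid] for eid in r3} if (want_content and r3) else {}
    # Step C5: return the result sets and any content found.
    return r1, r2, r3, found

both = query(follows_target="u2", min_age=30, want_content=True)
attrs_only = query(min_age=30)
```

In the first call both constraints apply, so R3 is the intersection; in the second only the attribute constraint applies, so R3 equals R2.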

Further, the data update flow is as follows:

Step D1: Using the data query flow described above, obtain the ID set R3 of the data entities to be updated according to the query condition.

Step D2: If the content of a data entity needs to be updated, locate the corresponding entity content in the key-value database or distributed file system by the entity's ID in R3 and replace it with the new content.

Step D3: If the attributes of a data entity need to be updated, update the entity's record in the corresponding relational table of the relational database by its ID in R3.

Step D4: If some associations between a data entity in R3 and other data entities need to be updated, locate the node representing the entity by its ID and use this node as the starting point of the query. From this starting point, perform a matching search in the graph database according to the pattern of the association to be updated, and update the association once it is found.
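Steps D2 to D4 can be sketched as follows, taking the ID set R3 from step D1 as an input. The function `update_entities`, its parameters, and the idea of expressing an association update as retargeting one edge are assumptions made for illustration; the stores are the same kind of in-memory stand-ins as before.

```python
import sqlite3

kv_store = {"w1": "old text"}
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE weibo (entity_id TEXT PRIMARY KEY, device TEXT)")
db.execute("INSERT INTO weibo VALUES ('w1', 'phone')")
edges = {("u1", "publish", "w1"), ("w1", "mention", "u2")}

def update_entities(r3, new_content=None, new_device=None, retarget=None):
    """r3: entity IDs from step D1; retarget: (relation, old_target, new_target)."""
    for entity_id in r3:
        if new_content is not None:
            # Step D2: replace the stored content.
            kv_store[entity_id] = new_content
        if new_device is not None:
            # Step D3: update the attribute record.
            db.execute("UPDATE weibo SET device = ? WHERE entity_id = ?", (new_device, entity_id))
        if retarget is not None:
            # Step D4: start from the entity's node, match the edge, then update it.
            relation, old_target, new_target = retarget
            old_edge = (entity_id, relation, old_target)
            if old_edge in edges:
                edges.discard(old_edge)
                edges.add((entity_id, relation, new_target))

update_entities({"w1"}, new_content="new text", new_device="web",
                retarget=("mention", "u2", "u3"))
device = db.execute("SELECT device FROM weibo WHERE entity_id = 'w1'").fetchone()[0]
```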

Further, the strategy for optimizing the query order is as follows. When the query against the relational database involves no complex table joins, the relational database query is performed first, that is, query steps C2 and C1 are swapped: the entity IDs in the result set R2 produced by C2 are used as the start nodes for the graph database query in C1, and path matching in the graph database then begins from these start nodes. When the query against the relational database involves complex table joins, the graph database query is performed first, that is, the order of steps C1 and C2 is kept: R1 is obtained through the association query in the graph database, and the complex relational query is then performed only within the range of entity IDs in R1, which improves query efficiency.
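The ordering strategy can be sketched as follows. This is a minimal illustration, assuming the caller already knows whether the relational part of the query needs complex joins; the functions `relational_ids`, `graph_ids`, and `run` are hypothetical, and the restriction to candidate IDs is applied in Python here rather than pushed into the SQL statement.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE user_info (entity_id TEXT PRIMARY KEY, age INTEGER)")
db.executemany("INSERT INTO user_info VALUES (?, ?)", [("u1", 20), ("u2", 35), ("u3", 40)])
follows = {"u1": {"u2"}, "u2": {"u3"}, "u3": set()}  # graph stand-in: follower -> followed

def relational_ids(min_age, restrict_to=None):
    """Step C2: users with age >= min_age, optionally limited to candidate IDs."""
    rows = db.execute("SELECT entity_id FROM user_info WHERE age >= ?", (min_age,))
    ids = {row[0] for row in rows}
    return ids if restrict_to is None else ids & restrict_to

def graph_ids(start_nodes=None):
    """Step C1: users who follow someone, optionally starting only from given nodes."""
    candidates = set(follows) if start_nodes is None else start_nodes
    return {user for user in candidates if follows.get(user)}

def run(min_age, complex_join):
    if complex_join:
        # Keep C1 then C2: the costly relational query only sees IDs from R1.
        r1 = graph_ids()
        return relational_ids(min_age, restrict_to=r1)
    # Swap to C2 then C1: R2 supplies the start nodes for graph matching.
    r2 = relational_ids(min_age)
    return graph_ids(start_nodes=r2)

simple_first = run(30, complex_join=False)
graph_first = run(30, complex_join=True)
```

Both orders return the same entities; the strategy only changes which store does the narrowing first.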

The beneficial effects of the invention are as follows:

(1) The associations between data entities in a large data set are modeled with a graph model and stored in a graph database, so complex association queries are supported efficiently. Under the same hardware environment, searching the graph model for complex association data along paths of nodes and edges of length 3 (or more) takes on the order of seconds on average, whereas when a relational model or Hashmap model is used to express the associations, complex association queries mostly take tens of seconds or even hundreds of seconds.

(2) The attributes and original content of data entities in a large data set are stored in a properly associated way, so queries on both the attributes and the original content of big data can be satisfied at the same time.

(3) A unified data management module is used, which can optimize the search order according to the characteristics of the query condition and thereby improve query efficiency.

Brief description of the drawings

Fig. 1 shows the hybrid data model proposed by the invention;

Fig. 2 shows the technical architecture of the data management system proposed by the invention;

Fig. 3 shows a modeling example for the data set of a social networking site.

Detailed description of the invention

Embodiments of the key techniques and methods in the summary of the invention are illustrated below by example, but this explanation does not limit the scope of the invention.

1) data set

Take the data of a social networking site as an example. The data falls into two main categories: user information data and microblog information data. User information data includes the user account, gender, age, hobbies, registration time, the list of other users the user follows, and the list of other users who follow the user. Microblog information data includes the microblog ID, the account of the publishing user, the ID of the forwarded microblog, the microblog content, the publication time, the publication place, the device used to publish the microblog, and the accounts of the users mentioned with @. The data set contains a large number of relationships between data: the follow relationship between users, the publish relationship between users and microblogs, the forward relationship between microblogs, and the @ relationship in microblogs.

2) data modeling

As shown in Fig. 3, the structured attribute information in the user information is first modeled as the relational table UserInfo [user account | gender | age | hobbies | registration time], whose primary key is the user account. The structured attribute information in the microblog information is likewise modeled as the relational table Weibo [microblog ID | publication time | publication place | publishing device], whose primary key is the microblog ID. The original text content of each microblog is then modeled with the Hashmap model, using the microblog ID as the key. Next, user accounts and microblog IDs become nodes of the graph model, and the follow relationship between users, the publish relationship between users and microblogs, the forward relationship between microblogs, and the @ relationship between microblogs and users are described by edges between nodes. Finally, the relational table UserInfo and the user nodes of the graph model are mapped to each other through the user account; the relational table Weibo, the microblog nodes of the graph model, and the Hashmap model are mapped to one another through the microblog ID.
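The modeling above can be sketched as follows. The table names UserInfo and Weibo come from the text; the English column names, the relation labels (`follow`, `publish`, `mention`), the sample rows, and the in-memory stand-ins for the key-value and graph stores are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE UserInfo (account TEXT PRIMARY KEY, gender TEXT, age INTEGER,
                       hobbies TEXT, registered_at TEXT);
CREATE TABLE Weibo (weibo_id TEXT PRIMARY KEY, publish_time TEXT,
                    publish_place TEXT, device TEXT);
""")
weibo_text = {}  # Hashmap model: microblog ID -> original text
nodes = {}       # graph model: node ID -> entity type
edges = set()    # graph model: (source ID, relation, target ID)

db.execute("INSERT INTO UserInfo VALUES ('alice', 'F', 28, 'chess', '2015-01-01')")
db.execute("INSERT INTO UserInfo VALUES ('bob', 'M', 31, 'running', '2015-02-01')")
db.execute("INSERT INTO Weibo VALUES ('w1', '2016-07-21 09:00', 'Beijing', 'phone')")
weibo_text["w1"] = "Hello @bob"
nodes.update({"alice": "user", "bob": "user", "w1": "weibo"})
edges.update({
    ("alice", "follow", "bob"),
    ("alice", "publish", "w1"),
    ("w1", "mention", "bob"),
})

# Start from a graph edge, then resolve the same microblog ID in the other two models.
published = [t for s, rel, t in edges if s == "alice" and rel == "publish"]
place = db.execute(
    "SELECT publish_place FROM Weibo WHERE weibo_id = ?", (published[0],)
).fetchone()[0]
text = weibo_text[published[0]]
```

The last three statements show the mapping in action: the microblog ID found in the graph is the primary key in Weibo and the key in the Hashmap model.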

In addition, a suitable amount of redundancy can usually be allowed in order to improve query efficiency. For example, the relational table Weibo [microblog ID | publication time | publication place | publishing device] can be changed to Weibo [microblog ID | publication time | publication place | publishing device | account of the publishing user]. Although the publish relationship between users and microblogs is already described in the graph model and the change adds redundant data, this redundancy makes a query such as "all microblogs published by user A last month" more efficient. Relational queries that need no complex table joins are what relational databases are good at, so there is no need to query the graph database as well, and only one database query is needed in total. How much redundancy to allow at modeling time depends on the actual business requirements.
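With the redundant publisher column in place, the example query becomes a single relational lookup. The following is a minimal sketch, assuming English column names, ISO-formatted date strings (so that string comparison orders dates correctly), and hypothetical sample rows in which "last month" is June 2016.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE Weibo (weibo_id TEXT PRIMARY KEY, publish_time TEXT, "
    "publish_place TEXT, device TEXT, publisher_account TEXT)"
)
db.executemany("INSERT INTO Weibo VALUES (?, ?, ?, ?, ?)", [
    ("w1", "2016-06-03", "Beijing", "phone", "userA"),
    ("w2", "2016-06-20", "Beijing", "web", "userB"),
    ("w3", "2016-07-02", "Shanghai", "phone", "userA"),
])

# One query against the relational database; the graph database is not consulted.
rows = db.execute(
    "SELECT weibo_id FROM Weibo "
    "WHERE publisher_account = ? AND publish_time >= ? AND publish_time < ? "
    "ORDER BY weibo_id",
    ("userA", "2016-06-01", "2016-07-01"),
).fetchall()
```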

3) storage method

First, a traditional relational database is used to implement the relational tables UserInfo and Weibo of the data model, storing the attributes of the two kinds of data entity, users and microblogs. Next, a key-value database is used to implement the Hashmap model, storing the body content of microblogs. Then a graph database is used to implement the graph model, storing the follow relationship between users, the publish relationship between users and microblogs, the forward relationship between microblogs, and the @ relationship in microblogs. Finally, because this example uses the ID of the microblog data entity directly as the key of the key-value database, no secondary index table needs to be built. If certain common attribute-condition queries need to be accelerated, further secondary indexes can be created in the relational database as appropriate. The above data storage method is carried out by the unified data management module, which organizes the two categories of collected data, user information data and microblog information data, according to the data model above and places them in the different databases for management.

Claims

1. A storage method for highly correlated big data, comprising the steps of:

1) assigning each data entity an entity type and a unique ID;

2) storing the attributes of data entities in two-dimensional tables of a relational model, and storing the content of data entities in a Hashmap model;

3) establishing the association between the attributes and the content of the same data entity through the ID of the data entity, and storing the associations between data entities in a graph data model.

2. The method of claim 1, characterized in that the associations between data entities are stored in the graph data model as follows: each node in the graph data model represents one data entity and is labeled with the entity type and ID of the corresponding data entity, and the edges between nodes represent the associations between data entities.

3. The method of claim 1, characterized in that in the Hashmap model the ID of a data entity is the key and the content of the data entity is the value.

4. The method of claim 1, characterized in that the attributes of data entities are stored in two-dimensional tables of the relational model as follows: the attributes of the data entities of each entity type are stored in one two-dimensional table of the form [data entity ID | attribute A | attribute B ...].

5. The method of claim 1, characterized in that in the Hashmap model the contents of several data entities are merged and stored together; wherein the storage time of the data entities stored together is the key, the content of those data entities is the value, and an index from data entity ID to the key of the corresponding entity content is built; or the combination of the storage time of the data entities stored together and their IDs is the key, the content of those data entities is the value, and an index from data entity ID to the key of the corresponding entity content is built.

6. A management system for highly correlated big data, characterized by comprising a storage module and a unified data management module; wherein the storage module comprises a Hashmap model for storing the content of data entities, a relational model for storing the attributes of data entities, and a graph data model for storing the associations between data entities; each data entity is assigned an entity type and a unique ID, and the association between the attributes and the content of the same data entity is established through the ID of the data entity;

the unified data management module is configured to add, delete, update, and query the associations, attributes, and content of data entities in the storage module.

7. The system of claim 6, characterized in that the graph data model is implemented with a graph database, that is, each data entity is stored as a node in the graph database, the node is labeled with the entity type and ID of the corresponding data entity, and the edges between nodes represent the associations between data entities; the Hashmap model is implemented with a key-value database or distributed file system, with the ID of a data entity as the key and the content of the data entity as the value; the relational data model is implemented with a relational database, wherein the attributes of the data entities of each entity type are stored in one two-dimensional table of the form [data entity ID | attribute A | attribute B ...].

8. The system of claim 6 or 7, characterized in that when a new data entity needs to be stored in the storage module, the unified data management module stores the content of the new data entity in the Hashmap model with the key set to the entity ID; then extracts the attribute values of the new data entity and inserts them into the two-dimensional table of the relational model that corresponds to the type of the new data entity; then extracts the associations between the new data entity and other data entities, inserts a new node representing the new data entity into the graph data model, and sets two node properties that record the entity ID and the entity type respectively; and then creates edges according to the associations between the new data entity and the other data entities.

9. The system of claim 6 or 7, characterized in that when a data entity query request is received, the unified data management module performs a matching search in the graph data model according to the constraints on associations between data entities in the query condition to obtain the result set R1 of entities that satisfy them; performs a matching search in the relational model according to the attribute constraints on data entities in the query condition to obtain the result set R2; and then sets the result set R3 from the sets of entity IDs in R1 and R2: if R1 and R2 are both empty, R3 is empty; if exactly one of R1 and R2 is empty, R3 equals the set of entity IDs in the non-empty result set; if R1 and R2 are both non-empty, R3 is the intersection of the entity ID sets of R1 and R2; if the data query request requires the body content of the data, the corresponding entity content is looked up in the Hashmap model according to the entity IDs in R3, and the query result sets R1, R2, and R3 are then returned together with the content of the data entities in R3.