CN106227800B - Storage method and management system for highly-associated big data

Info

Publication number: CN106227800B
Application number: CN201610579013.8A
Authority: CN
Inventors: 李�昊; 张敏; 付艳艳; 惠榛; 陈震宇; 张宗福
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2016-07-21
Filing date: 2016-07-21
Publication date: 2020-02-21
Anticipated expiration: 2036-07-21
Also published as: CN106227800A

Abstract

The invention discloses a storage method and a management system for highly-associated big data. The system comprises a storage module and a unified data management module. The storage module comprises a Hashmap model for storing the contents of data entities, a relational model for storing the attributes of the data entities, and a graph data model for storing the association relationships between the data entities. Each data entity is assigned an entity type and a unique ID number, and the attributes and the content of the same data entity are associated through that ID number. The unified data management module is used for adding, deleting, updating and querying the association relationships, attributes and data content of the data entities in the storage module. The invention can store and manage large data sets while supporting efficient association query and analysis.

Description

Storage method and management system for highly-associated big data

Technical Field

The invention belongs to the field of big data storage, and particularly relates to a storage method and a management system of highly-associated big data.

Background

In the big data era, enterprises or organizations increasingly attach importance to the value of data, and gradually start to collect, store, analyze and utilize big data. In these large datasets, associations between data are ubiquitous. Particularly, in application scenes such as social network big data and medical big data which are closely related to individual users, the data objects are highly correlated. The complex relationships between the data in these highly correlated data sets tend to be of great analytical value. For example, a friendship between social users, an association between a drug and a patient, and so forth. Meanwhile, these highly correlated large data sets are also characterized by large scale, high speed and diversity, so in order to analyze and utilize them better, research on efficient storage and management of such data sets is needed.

In order to meet the storage requirements of big data, a relational database is generally used for storing structured data, and a NoSQL database is used for storing semi-structured or unstructured data. With these storage methods, both relational databases and most NoSQL databases (e.g., key-value databases, document databases, column databases) are very inefficient at storing and managing the associations between data. All data are stored in unrelated records, values, documents and columns, and when the associations among the data need to be queried and analyzed, additional mechanisms such as indexes, foreign keys and table joins must be used.

In contrast, graph databases are dedicated to storing and querying the links between data, and their efficiency for multi-level association queries and reverse queries is far higher than that of relational databases and other NoSQL databases. A multi-level association query performs several levels of queries over the connections between data. For example, querying "the friends of a person's friends" requires several levels of queries on the friend relationship. A reverse query is one whose query direction is opposite to the direction in which the index was built. For example, given an index "patient -> drug", it is very fast to query which drugs a certain patient has bought, but much less efficient to query in reverse which patients have bought a certain drug. Even if a "drug -> patient" index is established to handle that reverse query, a query such as "which patients bought both drug A and drug B" still requires multiple lookups, which is inefficient. Graph databases can solve the problem of querying the association relationships between data. However, graph databases do not satisfy the large-scale and diverse storage characteristics of large data sets.

Currently, some methods have emerged to store large data sets using a hybrid NoSQL database and file system. The methods respectively store different data in a large data set in a proper database or a file system according to respective characteristics of the different data. For example, a structured relational database is used to store structured data, and a NoSQL database is used to store semi-structured or unstructured data. However, due to the lack of consideration for complex association among data in a large highly-associated data set and the absence of a data model, a storage method and a query method matched with the complex association, the data sets stored in different databases or file systems are often independent of each other, so that a large amount of redundancy exists, and the complex association query is inefficient.

In summary, an efficient storage and management method for a large data set with a high degree of association is still lacking in the field of large data storage at present, and the storage requirement of large data and the efficient analysis requirement of complex association in the large data cannot be met at the same time.

Disclosure of Invention

In view of the above technical problems, an object of the present invention is to provide a storage method and a management system for highly-associated big data, which can realize storage and management of such big data sets and support efficient association query analysis.

The basic principle of the technology is as follows. A mixed data model centered on the association relationships between data entities is built on a graph data model, a relational model and a Hashmap model: the association relationships between data entities are described by the graph data model, the structured attributes of the data entities are described by the relational model, and the content of the entities is described by the Hashmap model. A graph database, a relational database, and a key-value database or distributed file system are used, respectively, to implement these models. An appropriate strategy is then used to optimize the priority order of the relationship query, attribute query and content query of the data entities, improving the query and retrieval efficiency of the data.

Specifically, in order to achieve the technical purpose, the invention adopts the following technical scheme:

A storage method and a management system for highly-associated big data comprise the following:

1) Establishing a mixed data model for the highly-associated big data set.

Further, the mixed data model comprises a graph data model, a relational model and a Hashmap model.

Further, the graph data model refers to a data model in which data entities are represented by nodes, and the edges connecting the nodes represent the connections between the data entities.

Further, the relational model refers to a normalized model adopted in a relational database, that is, a data model representing entities and relations between the entities in a two-dimensional table form. In the technical scheme, only the two-dimensional table is adopted to describe the attribute of the entity. That is, the relational model is typically used to store entities and relationships between entities, but the present invention only uses it to describe attributes of entities in the mixed data model.

Further, the Hashmap model refers to a data model that uses keys to store and retrieve the corresponding values. The Hashmap model may be implemented as either a key-value database or a distributed file system.

Further, the construction method of the mixed data model is shown in fig. 1. Each data entity has an entity type and a unique ID number. The attributes of the data entities are described in the form of a two-dimensional table in the relational model, i.e., each entity type corresponds to a two-dimensional table of the form [ data entity ID | attribute A | attribute B | ... ]. The key in the Hashmap model is the ID number of the data entity, and the value is the original content of the data entity. The associations between data entities are represented by the graph data model: a node represents a data entity and is labeled with the type and ID number of that entity, and an edge between nodes represents an association between the corresponding data entities. The association between the attributes and the content of the same entity is represented by a mapping through the entity ID number.
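The mixed data model described above can be illustrated with a minimal in-memory sketch. Plain Python containers stand in for the relational database, the key-value store and the graph database; the class name `HybridModel`, its field names and the sample entities are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class HybridModel:
    """In-memory stand-in for the three parts of the mixed data model."""
    # Relational model: one table per entity type, rows keyed by entity ID.
    tables: dict = field(default_factory=dict)    # type -> {id -> {attr: value}}
    # Hashmap model: entity ID -> original content.
    contents: dict = field(default_factory=dict)  # id -> content
    # Graph data model: nodes carry (ID, type); edges link entity IDs.
    nodes: dict = field(default_factory=dict)     # id -> type
    edges: set = field(default_factory=set)       # (src_id, label, dst_id)


model = HybridModel()
model.nodes["u1"] = "User"
model.nodes["w1"] = "Weibo"
model.tables.setdefault("User", {})["u1"] = {"gender": "F", "age": 30}
model.tables.setdefault("Weibo", {})["w1"] = {"place": "Beijing"}
model.contents["w1"] = "original microblog text"
model.edges.add(("u1", "posts", "w1"))

# The shared entity ID ties the graph node, attribute row and content together.
entity_id = "w1"
entity_type = model.nodes[entity_id]
view = (entity_type, model.tables[entity_type][entity_id], model.contents[entity_id])
```

The point of the sketch is only that a single entity ID is enough to move between the three representations; a real deployment would replace each container with the corresponding database.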

2) A storage method and a management system of highly-associated big data matched with the mixed data model;

Further, the storage method and management system for highly-associated big data are shown in fig. 2 and include a storage module, a unified data management module and an auxiliary index mechanism.

Further, the storage module refers to the several databases or file systems used for storing data entities, that is, a graph database for storing the relationships between data entities, a key-value database or distributed file system for storing the original content of data entities, and a relational database for storing the attributes of data entities. Wherein:

The graph database implements the graph data model, i.e., each data entity is stored as a node in the graph database, and the node carries an entity ID identifier and an entity type identifier.

The key-value database or distributed file system implements the Hashmap model. The entity ID may be used as the key, but the key is not limited to the entity ID. When something other than the entity ID is used as the key (for example, when several entities are stored together, the storage time or a combination of entity IDs may be used as the key), an auxiliary index needs to be constructed to improve query efficiency, that is, an index from the entity ID to the key under which the entity content is stored. The auxiliary index may be implemented as a table in the relational database.

The relational database implements the relational data model and the auxiliary index mechanism, i.e., a data table [ entity ID | attribute A | attribute B | ... ] is constructed for each entity type, with the entity ID as the primary key of the table.
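The case in which the key is not the entity ID can be sketched as follows. This is an illustration under stated assumptions: a Python dict stands in for the key-value database, an in-memory SQLite table stands in for the auxiliary index table, and the names `kv_store`, `aux_index`, `put_batch`, `get_content` and the `batch:` key format are all hypothetical.

```python
import sqlite3

kv_store = {}  # stand-in for a key-value database: key -> {entity_id: content}

# Auxiliary index kept as a relational table: entity ID -> storage key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE aux_index (entity_id TEXT PRIMARY KEY, store_key TEXT)")


def put_batch(storage_time, batch):
    """Store several entity contents together under a time-based key."""
    key = f"batch:{storage_time}"
    kv_store[key] = dict(batch)
    db.executemany(
        "INSERT INTO aux_index (entity_id, store_key) VALUES (?, ?)",
        [(entity_id, key) for entity_id in batch],
    )
    return key


def get_content(entity_id):
    """Resolve the storage key through the auxiliary index, then read the content."""
    row = db.execute(
        "SELECT store_key FROM aux_index WHERE entity_id = ?", (entity_id,)
    ).fetchone()
    if row is None:
        return None
    return kv_store[row[0]][entity_id]


put_batch("2016-07-21T10:00:00", {"w1": "hello", "w2": "world"})
```

Without the index, finding the content of one entity would require scanning every stored batch; with it, the lookup is one indexed read followed by one key-value read.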

Furthermore, the unified data management module mainly realizes the functions of adding, deleting, updating and querying the association relationship, the attribute and the data content of the data entity in different databases or file systems, and optimizes the priority order of the relationship query, the attribute query and the content query of the data entity by adopting a proper strategy to improve the query and retrieval efficiency of the data.

Further, the auxiliary index mechanism is mainly used for improving the efficiency of common queries, that is, an index is constructed for attributes or attribute combinations that are often used as query conditions, so that entity IDs can be retrieved quickly. In addition, the mechanism also includes the auxiliary index table constructed when something other than the entity ID is used as the key of the key-value database or distributed file system.

Further, the data adding process is as follows:

Step A1: Store the original content of the newly added data entity in the key-value database or distributed file system, with the key set to the entity ID or another unique identifier. If a unique identifier other than the entity ID is used, a table [ entity ID | other unique identifier ] for the auxiliary index is established in the relational database in advance, and when a data entity is added, a record mapping the ID of the newly added data entity to its other unique identifier is inserted into the table.

Step A2: Extract the attribute values of the newly added data entity and insert them into the relational table corresponding to the type of the data entity, the new record having the form "entity ID | attribute A | attribute B | ...".

Step A3: Extract the association relationships between the newly added data entity and other data entities, insert a new node into the graph database to represent the newly added data entity, set two node attributes on it that record the entity ID and the entity type respectively, and finally establish edges according to the association relationships between the newly added data entity and other data entities. If another data entity associated with the newly added data entity already exists in the graph database, the nodes representing the two entities are connected by an edge. If it does not yet exist in the graph database, a node representing that other data entity is first created in the graph database, and the two nodes are then connected by an edge.
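Steps A1 to A3 can be sketched with in-memory stand-ins for the three stores. The sketch assumes the entity ID is used directly as the key, so no auxiliary index record is written; the function name `add_entity`, the container names and the edge representation `(source ID, label, target ID)` are hypothetical.

```python
contents = {}   # Hashmap model: entity ID -> original content
tables = {}     # relational model: entity type -> {entity ID -> attribute row}
nodes = {}      # graph nodes: entity ID -> entity type
edges = set()   # graph edges: (source ID, label, target ID)


def add_entity(entity_id, entity_type, content, attributes, relations):
    """Add one entity; relations is an iterable of (label, other_id, other_type)."""
    # Step A1: store the original content, keyed by the entity ID.
    contents[entity_id] = content
    # Step A2: insert the attribute row into the table for this entity type.
    tables.setdefault(entity_type, {})[entity_id] = dict(attributes)
    # Step A3: create the node, then one edge per association.
    nodes[entity_id] = entity_type
    for label, other_id, other_type in relations:
        # Create the other endpoint first if it is not yet in the graph.
        nodes.setdefault(other_id, other_type)
        edges.add((entity_id, label, other_id))


add_entity("w1", "Weibo", "hello", {"place": "Beijing"},
           [("posted_by", "u1", "User")])
```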

Further, the data deleting process is as follows:

Step B1: Delete the data entity to be deleted from the key-value database or distributed file system, and if a table [ entity ID | other unique identifier ] for the auxiliary index exists in the relational database, delete the record corresponding to the data entity.

Step B2: Delete the data entity to be deleted from the corresponding relational table of the relational database.

Step B3: Delete the node representing the data entity to be deleted from the graph database, together with all edges that have the node as one of their endpoints.
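The deletion steps B1 to B3 can be sketched in the same in-memory style. The containers and the function name `delete_entity` are hypothetical, and the entity ID is again assumed to be the key of the content store.

```python
contents = {"w1": "hello"}
tables = {"Weibo": {"w1": {"place": "Beijing"}}}
nodes = {"w1": "Weibo", "u1": "User", "u2": "User"}
edges = {("u1", "posts", "w1"), ("u1", "follows", "u2")}


def delete_entity(entity_id):
    """Remove one entity from all three stores."""
    # Step B1: delete the original content.
    contents.pop(entity_id, None)
    # Step B2: delete the attribute row from the table for the entity's type.
    entity_type = nodes.get(entity_id)
    if entity_type is not None:
        tables.get(entity_type, {}).pop(entity_id, None)
    # Step B3: delete the node and every edge that has it as an endpoint.
    nodes.pop(entity_id, None)
    incident = {e for e in edges if entity_id in (e[0], e[2])}
    edges.difference_update(incident)


delete_entity("w1")
```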

Further, the data query process is as follows:

Step C1: According to the association relationship constraints between data entities in the query condition, perform a matching search in the graph database to obtain the result set R1 that satisfies the condition. If the query contains no constraint on the association relationships between data entities, step C1 is not performed and R1 is set to empty.

Step C2: According to the attribute constraints on the data entities in the query condition, perform a matching search in the relational database to obtain the result set R2. If the query contains no attribute constraint on the data entities, step C2 is not performed and R2 is set to empty.

Step C3: Set the result set R3 according to the sets of entity IDs in R1 and R2. If both R1 and R2 are empty, R3 is empty. If exactly one of R1 and R2 is empty, R3 equals the set of entity IDs in the non-empty result set. If neither R1 nor R2 is empty, R3 is set to the intersection of the entity ID sets in R1 and R2 (the intersection may be empty).

Step C4: If the query result requires the original text content of the data, find the corresponding entity content in the key-value database or distributed file system according to the entity IDs in R3. If R3 is empty, i.e., there is no data entity in R3, step C4 is not performed and the process proceeds directly to step C5.

Step C5: If R3 is not empty, return the query results R1, R2 and R3 together with the contents of the data entities in R3. If R3 is empty, return R1, R2 and R3.
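The combination rule of step C3 and the content lookup of steps C4 and C5 can be sketched as follows. R1 and R2 are taken as already-computed sets of entity IDs, and the function names `combine_result_sets` and `run_query` and the shape of the returned dictionary are hypothetical.

```python
def combine_result_sets(r1, r2):
    """Step C3, applied literally: an empty set is ignored, otherwise intersect."""
    if not r1 and not r2:
        return set()
    if not r1:
        return set(r2)
    if not r2:
        return set(r1)
    return set(r1) & set(r2)


def run_query(r1, r2, contents, need_content=True):
    """Steps C3 to C5, given the graph result R1 and the relational result R2."""
    r3 = combine_result_sets(r1, r2)
    result = {"R1": r1, "R2": r2, "R3": r3}
    # Step C4: fetch original content only when requested and R3 is non-empty.
    if r3 and need_content:
        result["contents"] = {eid: contents[eid] for eid in r3 if eid in contents}
    return result


result = run_query({"w1", "w2"}, {"w2", "w3"}, {"w2": "text of w2"})
```

Because the rule treats a skipped step and a step that matched nothing identically (both yield an empty set), an implementation may wish to distinguish the two cases, for instance by passing `None` for a skipped step. That refinement is a design suggestion and is not described in the source.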

Further, the data updating process is as follows:

Step D1: Obtain the ID set R3 of the data entities to be updated according to the query condition, using the data query process described above.

Step D2: If the content of a certain data entity needs to be updated, find the corresponding entity content in the key-value database or distributed file system according to the data entity ID in R3, and replace it with the new content.

Step D3: If the attributes of a certain data entity need to be updated, update the record corresponding to the entity in the corresponding relational table of the relational database according to the ID of the data entity in R3.

Step D4: If some association relationships between a certain data entity in R3 and other data entities need to be updated, locate the node representing the entity according to the entity ID and use it as the starting point of the query. Starting from that node, perform a matching search in the graph database according to the pattern of the association relationships to be updated, and update the matching association relationships once they are found.
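Steps D2 to D4 can be sketched for one entity whose ID is assumed to have been obtained in step D1. The sketch simplifies step D4: instead of a general pattern match in a graph database, it replaces exact edges that start at the entity. The function name `update_entity` and all container names are hypothetical.

```python
contents = {"w1": "old text"}
tables = {"Weibo": {"w1": {"place": "Beijing"}}}
nodes = {"w1": "Weibo", "u1": "User", "u2": "User"}
edges = {("w1", "mentions", "u1")}


def update_entity(entity_id, new_content=None, new_attributes=None,
                  replace_edges=None):
    """Update one entity; replace_edges is an iterable of (old_edge, new_edge)."""
    # Step D2: replace the original content.
    if new_content is not None:
        contents[entity_id] = new_content
    # Step D3: update the attribute row in the table for the entity's type.
    if new_attributes:
        tables[nodes[entity_id]][entity_id].update(new_attributes)
    # Step D4 (simplified): start from this entity's node and replace
    # only edges that begin there and actually exist.
    for old_edge, new_edge in replace_edges or ():
        if old_edge[0] == entity_id and old_edge in edges:
            edges.discard(old_edge)
            edges.add(new_edge)


r3 = {"w1"}  # assumed result of step D1
for eid in r3:
    update_entity(
        eid,
        new_content="new text",
        new_attributes={"place": "Shanghai"},
        replace_edges=[(("w1", "mentions", "u1"), ("w1", "mentions", "u2"))],
    )
```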

Further, the strategy for optimizing the query order is as follows. When the relational database query does not contain complex table joins, the relational database query is performed first, i.e., the order of query steps C1 and C2 is exchanged: the entity IDs in the result set R2 produced by C2 are taken as the starting nodes of the graph database query in C1, and the path matching query in the graph database then starts from these nodes. When the relational database query contains complex table joins, the graph database query is performed first, i.e., the order of steps C1 and C2 is kept: R1 is obtained through the association query in the graph database, and the complex relational database query is then performed only within the range of entity IDs in R1, thereby improving query efficiency.
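The ordering strategy can be sketched as a small dispatcher. Each query is modeled as a callable that accepts an optional set of candidate entity IDs produced by the step that ran first; the function `execute`, the flag `has_complex_join` and the toy data are hypothetical, and how a real system decides that a join is "complex" is left open by the source.

```python
def execute(graph_query, relational_query, has_complex_join):
    """Run C1 and C2 in the order suggested by the strategy; return (order, R1, R2)."""
    if has_complex_join:
        # Graph first; the relational query is restricted to IDs in R1.
        r1 = graph_query(None)
        r2 = relational_query(r1)
        order = ["C1", "C2"]
    else:
        # Relational first; IDs in R2 become the starting nodes of the graph query.
        r2 = relational_query(None)
        r1 = graph_query(r2)
        order = ["C2", "C1"]
    return order, r1, r2


# Toy data: user ages (relational side) and follow edges (graph side).
ages = {"u1": 25, "u2": 40, "u3": 28}
follows = {("u1", "u2"), ("u3", "u2"), ("u2", "u1")}


def users_under_30(candidates):
    ids = ages if candidates is None else candidates
    return {u for u in ids if ages[u] < 30}


def followers_of_u2(candidates):
    starts = {src for src, dst in follows if dst == "u2"}
    return starts if candidates is None else starts & set(candidates)


simple = execute(followers_of_u2, users_under_30, has_complex_join=False)
joined = execute(followers_of_u2, users_under_30, has_complex_join=True)
```

Both orders yield the same entity IDs here; the strategy only changes which store does the narrowing first.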

The invention has the following beneficial effects:

(I) The association relationships between data entities in a big data set are modeled with a graph model and stored in a graph database, so that efficient complex association queries can be supported. That is, under the same software and hardware environment, searching complex association data (where the length of the path formed by nodes and edges is 3 or more) on the graph model takes on the order of seconds on average, whereas when a relational model or a Hashmap model is used to express the association relationships, complex association queries mostly take tens of seconds or even hundreds of seconds.

(II) The attributes and the original content of the data entities in the big data set are properly associated and stored together, so that queries on both the attributes and the original content of the big data can be satisfied at the same time.

(III) A unified data management module is adopted, so that the query order can be optimized according to the characteristics of the query conditions, improving query efficiency.

Drawings

FIG. 1 is a hybrid data model proposed by the present invention;

FIG. 2 is a technical architecture of a data management system proposed by the present invention;

FIG. 3 is an example of modeling a data set for a social networking site presented by the present invention.

Detailed Description

The following is an illustrative explanation of embodiments of the key techniques and methods in this summary, but the scope of the invention is not limited by this explanation.

1) Data set

Taking data of a certain social network site as an example, the data mainly comprises user information data and microblog information data. The user information data includes the user account, gender, age, hobbies, registration time, and a list of the other users that the user follows. The microblog information data includes the microblog ID, the account of the publishing user, the ID of the forwarded microblog, the content of the microblog, the publishing time, the publishing place, the device used to publish the microblog, and the user accounts mentioned with @. There are a large number of relationships between the data in this data set: the follow relationship among users, the publishing relationship between users and microblogs, the forwarding relationship among microblogs, and the @ relationship in microblogs.

2) Data modeling

As shown in fig. 3, the structured attribute information in the user information is modeled as a relational table UserInfo [ user account | gender | age | hobby | registration time ], whose primary key is the user account. The structured attribute information in the microblog information is modeled as a relational table Weibo [ microblog ID | issuing time | issuing place | microblog issuing equipment ], whose primary key is the microblog ID. The original text content of each microblog is modeled with the Hashmap model, using the microblog ID as the key. The user accounts and microblog IDs are then used as nodes of the graph model, and the follow relationship among users, the publishing relationship between users and microblogs, the forwarding relationship among microblogs, and the @ relationship between microblogs and users are described by edges between the nodes. Finally, the relational table UserInfo and the user nodes of the graph model are mapped to each other through the user account, and the relational table Weibo, the microblog nodes of the graph model and the Hashmap model are mapped to each other through the microblog ID.
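The modeling of this example data set can be sketched with an in-memory SQLite database for the two relational tables, a dict for the Hashmap model and a set of labeled edges for the graph model. The English column names, edge labels and sample rows are hypothetical choices made for the illustration.

```python
import sqlite3

# Relational model: attribute tables keyed by user account and microblog ID.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE UserInfo (account TEXT PRIMARY KEY, gender TEXT, "
           "age INTEGER, hobby TEXT, registered TEXT)")
db.execute("CREATE TABLE Weibo (weibo_id TEXT PRIMARY KEY, posted_at TEXT, "
           "place TEXT, device TEXT)")
db.execute("INSERT INTO UserInfo VALUES ('alice', 'F', 30, 'chess', '2015-01-01')")
db.execute("INSERT INTO Weibo VALUES ('w1', '2016-07-01', 'Beijing', 'phone')")

# Hashmap model: microblog ID -> original text.
weibo_text = {"w1": "Hello @bob"}

# Graph model: nodes are user accounts and microblog IDs; edges are relationships.
nodes = {"alice": "User", "bob": "User", "w1": "Weibo"}
edges = {("alice", "posts", "w1"), ("w1", "mentions", "bob"),
         ("alice", "follows", "bob")}

# Follow the "posts" edges from alice, then join attributes and text by microblog ID.
posted = sorted(dst for src, label, dst in edges
                if src == "alice" and label == "posts")
rows = [
    (wid,
     db.execute("SELECT place FROM Weibo WHERE weibo_id = ?", (wid,)).fetchone()[0],
     weibo_text[wid])
    for wid in posted
]
```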

Furthermore, to improve query efficiency, an appropriate amount of redundancy may generally be allowed. For example, the relational table Weibo [ microblog ID | issuing time | issuing place | microblog issuing equipment ] may be modified to Weibo [ microblog ID | issuing user account | issuing time | issuing place | microblog issuing equipment ]. Although the publishing relationship between users and microblogs is already described in the graph model, so that the modification increases data redundancy, it makes a query such as "all microblogs published by user A in the last month" more efficient. Because relational databases are good at queries without complex table joins, no query in the graph database is needed, and only one database query operation is required in total. Whether to introduce such redundancy during modeling should be decided based on actual business requirements.
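The effect of the redundant user-account column can be sketched with SQLite. With the account stored in the Weibo table, the example query is answered by one relational query and no graph traversal. The column names, sample rows and the fixed date window standing in for "the last month" are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Weibo table with the redundant publishing-user column added.
db.execute("CREATE TABLE Weibo (weibo_id TEXT PRIMARY KEY, author TEXT, "
           "posted_at TEXT, place TEXT, device TEXT)")
db.executemany("INSERT INTO Weibo VALUES (?, ?, ?, ?, ?)", [
    ("w1", "A", "2016-06-25", "Beijing", "phone"),
    ("w2", "A", "2016-05-01", "Beijing", "web"),
    ("w3", "B", "2016-06-30", "Shanghai", "phone"),
])


def recent_posts(author, since, until):
    """Single-table query; ISO dates compare correctly as text."""
    rows = db.execute(
        "SELECT weibo_id FROM Weibo "
        "WHERE author = ? AND posted_at BETWEEN ? AND ? "
        "ORDER BY posted_at",
        (author, since, until),
    )
    return [row[0] for row in rows]


result = recent_posts("A", "2016-06-21", "2016-07-21")
```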

3) Storage method

First, a traditional relational database is used to construct the relational tables UserInfo and Weibo of the data model, completing the storage of the attributes of the two kinds of data entities, users and microblogs. Then, a key-value database is used to implement the Hashmap model, completing the storage of the original text content of the microblogs. Then, a graph database is used to construct the graph model, completing the storage of the follow relationship among users, the publishing relationship between users and microblogs, the forwarding relationship among microblogs, and the @ relationship in microblogs. Finally, since the ID of the microblog data entity is used directly as the key of the key-value database, no auxiliary index needs to be established. If certain common attribute-condition queries need to be sped up, other auxiliary indexes can be built in the relational database as appropriate. The data storage method is carried out by the unified data management module, i.e., the two kinds of collected data, user information data and microblog information data, are organized according to the data model and stored in the different databases for management.

Claims

1. A storage method of highly-associated big data comprises the following steps:

1) setting an entity type and a unique ID number for each data entity;

2) storing only the attributes of the data entity using a two-dimensional table in the relational model; storing the content of the data entity using a Hashmap model; in the Hashmap model, the contents of a plurality of data entities are merged and stored together, the storage time of the stored data entities is used as the key, the contents of the stored data entities are used as the value, and an index from the data entity ID to the key corresponding to the entity content is constructed; or a combination of the storage time and the IDs of the plurality of data entities stored together is used as the key, the contents of the plurality of data entities stored together are used as the value, and an index from the data entity ID to the key corresponding to the entity content is constructed;

3) establishing an association relation between the attribute and the content of the same data entity through the ID number of the data entity; storing the incidence relation between the data entities by adopting a graph data model; the method for storing the incidence relation between the data entities by adopting the graph data model comprises the following steps: one node in the graph data model represents one data entity, the entity type and the ID number of the corresponding data entity are identified on the node, and the association relationship between the data entities is represented by the edges between the nodes.

2. The method of claim 1, wherein the attributes of data entities are stored using a two-dimensional table in the relational model as follows: the attributes of the data entities of each entity type are stored in one two-dimensional table, the format of the table being [ data entity ID | attribute A | attribute B | ... ].

3. A management system for highly-associated big data, characterized by comprising a storage module and a unified data management module; wherein:

the storage module comprises a Hashmap model used for storing the contents of the data entities, a relation model used for storing the attributes of the data entities and a graph data model used for storing the association relation between the data entities; each data entity sets an entity type and a unique ID number, and the association relationship is established between the attribute and the content of the same data entity through the ID number of the data entity; wherein, only the attribute of the data entity is stored by adopting a two-dimensional table in the relational model; the graph data model is realized by utilizing a graph database, namely, each data entity is stored as a node in the graph database, the entity type and the ID number of the corresponding data entity are identified on the node, and the association relationship between the data entities is represented by utilizing edges between the nodes;

the unified data management module is used for adding, deleting, updating and inquiring the incidence relation, the attribute and the data content of the data entity in the storage module;

in the Hashmap model, the contents of a plurality of data entities are merged and stored together, the storage time of the stored data entities is used as the key, the contents of the stored data entities are used as the value, and an index from the data entity ID to the key corresponding to the entity content is constructed; or a combination of the storage time and the IDs of the plurality of data entities stored together is used as the key, the contents of the plurality of data entities stored together are used as the value, and an index from the data entity ID to the key corresponding to the entity content is constructed.

4. The system of claim 3, wherein the relational data model is implemented using a relational database, and the attributes of the data entities of each entity type are stored in one two-dimensional table having the format [ data entity ID | attribute A | attribute B | ... ].

5. The system according to claim 3 or 4, wherein when a newly added data entity needs to be stored in the storage module, the unified data management module stores the content of the newly added data entity in the Hashmap model, and the key of the newly added data entity is set as entity ID; then extracting attribute values of the newly added data entity, and inserting the attribute values into a two-dimensional table corresponding to the type of the newly added data entity in a relation model; then extracting the incidence relation between the newly added data entity and other data entities, inserting a new node in the graph data model to represent the newly added data entity, and setting two node attributes to record the entity ID and the entity type respectively; and then establishing an edge according to the incidence relation between the newly added data entity and other data entities.

6. The system according to claim 3 or 4, wherein when a data entity query request is received, the unified data management module performs a matching search in the graph data model according to the association relationship constraints between data entities in the query condition to obtain a result set R1 satisfying the condition; performs a matching search in the relational model according to the attribute constraints on the data entities in the query condition to obtain a result set R2; and then sets a result set R3 from the sets of entity IDs in the result sets R1 and R2: if both result sets R1 and R2 are empty, the result set R3 is empty; if exactly one of the result sets R1 and R2 is empty, the result set R3 equals the set of entity IDs in the result set that is not empty; if neither result set R1 nor R2 is empty, the result set R3 is set to the intersection of the sets of entity IDs in R1 and R2; if the data query request requires the original text content of the data, the corresponding entity content is found in the Hashmap model according to the entity IDs in the result set R3, and the query result sets R1, R2 and R3 are then returned together with the contents of the data entities in R3.