CN109582831B

CN109582831B - Graph database management system supporting unstructured data storage and query

Info

Publication number: CN109582831B
Application number: CN201811202708.XA
Authority: CN
Inventors: 沈志宏; 周园春; 赵子豪
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2018-10-16
Filing date: 2018-10-16
Publication date: 2022-02-01
Anticipated expiration: 2038-10-16
Also published as: CN109582831A

Abstract

The invention discloses a graph database management system supporting unstructured data storage and query, comprising: a labeled Neo4j property graph model, based on the Neo4j graph database, for storing structured data and the non-content attribute information of BLOBs, and also for storing the unique ID assigned to each BLOB; a BLOB data storage model supporting the BLOB type, comprising an abstract storage model and a concrete storage model, for storing the content of each BLOB and maintaining the mapping between the ID of the BLOB and the attribute information of the BLOB; and a query module for querying the stored structured data and BLOB-type data. The system can store structured data, store unstructured data as the BLOB type, and query both kinds of data.

Description

Graph database management system supporting unstructured data storage and query

Technical Field

The invention relates to the technical field of big data, databases and distributed systems, and provides a database management system supporting unstructured data storage and query.

Background

Conventional relational databases are typically used to store structured data that can be represented in two-dimensional tables, and the techniques for storing and querying structured data are well established. As the data age has progressed, however, the forms of data have become more complex. In practical applications, large amounts of semi-structured data with self-describing structures and unstructured data without a fixed structure have appeared; such data is highly extensible and can freely express a great deal of useful information. Because of this freedom of format, however, storing and managing such data is a difficult problem. Traditional relational database systems are mainly oriented toward object processing and data analysis and cannot store and manage massive semi-structured and unstructured data well. NoSQL technologies, and in particular graph database technologies such as Neo4j, offer a new approach to managing and processing unstructured data efficiently.

Graph databases originate from Euler's graph theory and may also be called graph-oriented or graph-based databases. The data model of a graph database consists of nodes and relationships, and information can also be stored as properties of nodes, which supports fast queries over the relationships between entities.

As data sources increase, the variety of data grows richer, and more than 85% of newly generated data is unstructured, yet current big data applications have limited capability for processing unstructured data (http://www.cio.com.cn/eyan/2295.html). Unstructured data is typically stored as Binary Large Objects (BLOBs). Nonetheless, storing BLOBs in a conventional relational database still has many drawbacks, such as low efficiency and inconvenient retrieval.

In addition, many application scenarios require a large number of complex relationship queries, and online systems are time-sensitive. The great advantage of graph databases is that they resolve complex relationship queries quickly. However, graph databases such as the currently popular Neo4j use a property graph model, which has the drawback of not natively supporting BLOB storage. It is therefore particularly important to combine graph databases with BLOB storage to achieve unified management and query of BLOB data and other types of data.

Disclosure of Invention

The invention aims to provide a graph database management system supporting unstructured data storage and query, which can store structured data, store unstructured data as the BLOB type, and query both kinds of data.

In order to achieve the purpose, the invention adopts the following technical scheme:

a labeled Neo4j property graph model, based on the Neo4j graph database, for storing structured data and the non-content attribute information of a BLOB, the structured data including text, boolean, numerical, and temporal types, the non-content attribute information of the BLOB including the length of the BLOB-type data, a MIME type, and a 128-bit digest, and also for storing the unique ID assigned to the BLOB;

a BLOB data storage model supporting the BLOB type, which comprises an abstract storage model and a concrete storage model; the abstract storage model can be implemented by a local file system or a Ceph distributed file system, is used for storing the content of the BLOB, and maintains the mapping between the ID of the BLOB and the attribute information of the BLOB; the concrete storage model is implemented by a local file system, stores the content of the BLOB as files, and maintains the mapping between the ID of the BLOB and the attribute information of the BLOB;

and a query module for querying the stored structured data and BLOB-type data.

Further, the system also comprises a data type identification module for determining the type of received data: if the data is structured data, it is stored in the Neo4j property graph model; if the data is BLOB-type data, the ID of the BLOB and the non-content attribute information of the BLOB are stored in the Neo4j property graph model, and the content of the BLOB is stored in the BLOB data storage model.

Further, the system also comprises a BLOB attribute information extraction module for extracting the attribute information of the BLOB from the data text.

Further, the query module executes queries in the Cypher language.

A method for constructing a graph database management system supporting unstructured data storage and query comprises the following steps:

taking the labeled property graph model of the Neo4j graph database as the Neo4j property graph model;

modifying the original code of Neo4j on the basis of the labeled property graph model of the Neo4j graph database, so that values are obtained and set through a getRecord() method and a setRecord() method;

adding BLOB type support in the PropertyType of the original code of Neo4j, implementing operations such as reading values, and creating a BLOB data storage model;

a query module is created that supports queries using Cypher.

Further, the step of adding BLOB type support in the PropertyType of the original code of Neo4j includes:

adding support for the BLOB type in the getPropertyTypeOrNull() method so that the BLOB type can be returned when the method is called;

adding registration of the BLOB type in the registerScalarsAndCollection() method, that is, injecting the Java class into Neo4j.

A storage method of a graph database management system supporting unstructured data storage and query comprises the following steps:

determining whether the received data is structured data or unstructured data;

if the data is structured data, storing the data in the Neo4j property graph model;

if the data is unstructured data, extracting the attribute information of each BLOB and allocating a unique ID to each BLOB;

storing the ID of the BLOB and the content of the BLOB in the BLOB data storage model, and storing the non-content attribute information of the BLOB in the Neo4j property graph model;

the content of the BLOB is read by the ID.
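The storage steps above can be illustrated with a minimal sketch. It is written in Python rather than in the Java/Scala code of Neo4j, and it uses plain dictionaries as stand-ins for the two stores; the names graph_store, blob_store, store_property, and read_blob are hypothetical and do not come from the invention.

```python
import hashlib
import itertools

# Hypothetical in-memory stand-ins for the two stores (not Neo4j APIs).
graph_store = {}  # node id -> property dict (role of the property graph model)
blob_store = {}   # BLOB id -> raw bytes (role of the BLOB data storage model)
_next_blob_id = itertools.count(1)


def store_property(node_id, key, value):
    """Route one property value to the appropriate store."""
    props = graph_store.setdefault(node_id, {})
    if isinstance(value, (bytes, bytearray)):
        # Unstructured data: assign a unique ID and extract non-content attributes.
        blob_id = next(_next_blob_id)
        content = bytes(value)
        blob_store[blob_id] = content
        props[key] = {
            "blob_id": blob_id,
            "length": len(content),
            "digest": hashlib.md5(content).hexdigest(),
        }
    else:
        # Structured data (text, boolean, numeric, temporal) stays in the graph.
        props[key] = value


def read_blob(node_id, key):
    """Read BLOB content back through the ID kept in the graph."""
    return blob_store[graph_store[node_id][key]["blob_id"]]


store_property(1, "name", "Alice")
store_property(1, "photo", b"\x89PNG data")
```

After these two calls, the node holds the text property and the BLOB attributes, while the bytes themselves live only in blob_store and are reachable through the stored ID.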

Further, the step of storing the content of the BLOB comprises:

creating a new file under the specified directory, writing the content of the BLOB into the new file using a file output stream, and saving it in the bid format;

creating another new file under the specified directory, writing the MD5 value of the BLOB digest into it, and saving it in the md5 format.

Further, the step of reading the content of the BLOB comprises: taking the bid of the BLOB as a parameter, searching for the corresponding file in the specified directory, and reading and returning the content of the file through the fromFile() method.

A method for creating attribute information for a BLOB of a graph database management system supporting unstructured data storage and querying, the steps comprising:

reading byte array content of the BLOB from the file as the content of the BLOB;

reading the length of the byte array of the BLOB from the file as the length of the BLOB;

reading the content digest of the BLOB from the file as the digest of the BLOB by using DigestUtils.getMd5Digest;

reading the encoding of the first 8 bytes of the BLOB content from the file as the 32-bit flag value of the BLOB;

generating the unique ID of the BLOB by the IdGenerator method.
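As a rough illustration of the attribute-creation steps above, the following Python sketch derives the content, length, MD5 digest, and first 8 bytes from a file. It is not the Java implementation of the invention: the function name blob_attributes_from_file is hypothetical, and a simple counter is assumed in place of the IdGenerator.

```python
import hashlib
import itertools
import os
import tempfile

# Hypothetical stand-in for the IdGenerator: a simple increasing counter.
_id_counter = itertools.count(1)


def blob_attributes_from_file(path):
    """Derive the BLOB attribute values from the bytes of a file."""
    with open(path, "rb") as f:
        content = f.read()
    return {
        "id": next(_id_counter),
        "content": content,
        "length": len(content),
        "digest": hashlib.md5(content).hexdigest(),  # 128-bit MD5 digest
        "first8bytes": content[:8],
    }


# Usage: write a small temporary file and extract its attributes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello world!")
attrs = blob_attributes_from_file(path)
os.remove(path)
```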

The system adds BLOB storage functions to the open-source Neo4j graph database on the basis of the property graph model, combining the graph database with binary large objects. It supports the storage of large BLOB-type data while fully exploiting the strengths of graph databases in handling relationship problems. It thereby supports multiple data types: in addition to the text, Boolean, numerical, and time types supported by the property graph model of Neo4j, it supports the BLOB type. The system stores the non-content attribute information of BLOBs in Neo4j and stores the large-volume content in a pluggable back-end storage system.

The invention defines and implements the read and write operations of BLOBs, namely how to create a BLOB attribute value from a file and how to read and construct a BLOB object from a given file. It enriches the intrinsic attributes and related operations of BLOB objects, implementing methods that derive attribute values such as the digest, the length, and the 8-byte flag from the content of the BLOB, as well as a method for allocating a unique ID to each BLOB object. It also supports retrieval of BLOB-related content: by providing operation functions on BLOB attribute values in the Cypher query language, it supports matching on BLOB attribute values, and results can be filtered using other attribute values and relationships as constraints.

The invention has the following beneficial effects: graph database technology and BLOB storage are organically integrated and can be used for mixed storage and query of structured and unstructured data. Compared with traditional big data fusion management tools, the capability of handling relationship problems is enhanced and relationship retrieval performance is improved, which to some extent fills a gap left by big data fusion management tools in this area.

Drawings

FIG. 1 is a diagram of a graph database management system according to an embodiment.

Detailed Description

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

The present embodiment provides a graph database management system supporting unstructured data storage and query, as shown in FIG. 1, specifically as follows:

1) The system creates a BLOB data storage model supporting BLOBs on the basis of the labeled property graph model in Neo4j, to which BLOB-related content is added. Neo4j serves as the front-end storage system and stores the structured data and part of the attribute information of each BLOB. The attribute value of the BLOB type comprises the data length, the digest, the MIME type, and the unique ID assigned to the BLOB by the system. A back-end storage system is also created to store BLOB-type data.

2) The system examines the received data and stores ordinary structured data in the native graph storage of Neo4j. If the data is of the BLOB type, the attribute values other than the BLOB content are stored in the Neo4j front-end storage system, and the BLOB content is then stored in the back-end storage system according to the ID. The back-end storage system maintains the mapping between the ID and the data content of each BLOB-type attribute value.

The construction process of the system is as follows:

The system is mainly an extension of the open-source database Neo4j (https://neo4j.com/), so the data model used by the system has all the characteristics of the labeled property graph in Neo4j: a data set contains nodes and edges, and entities can carry attribute values, which can be of a text, boolean, numerical, or time type and, in the present system, also of the BLOB type. The system is built by modifying part of the native code of Neo4j to add support for BLOBs, with the following modifications:

(1) On the basis of the original functions of Neo4j, a trait named WithRecord is designed, which comprises a getRecord() method and a setRecord() method, used respectively for getting and setting a value.

(2) Support for the BLOB type is added in PropertyType, and the methods for reading the value and returning the locked number are implemented. Support for the BLOB type is added to the getPropertyTypeOrNull() method so that the BLOB type can be returned when this method is called. Registration of the BLOB type is added in the registerScalarsAndCollection() method, that is, the Java class is injected into Neo4j.
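The idea behind modification (2) can be sketched as follows. This is a Python illustration only, not Neo4j's actual Java code: the type codes, the Blob class, and the helper names get_property_type_or_none, register, and type_of are all hypothetical. It shows a type lookup that returns the added BLOB type, or nothing for an unknown code, together with a registry that binds a value class to its property type.

```python
from enum import Enum


class PropertyType(Enum):
    # Type codes are illustrative only, not Neo4j's actual values.
    BOOL = 1
    INT = 2
    STRING = 3
    BLOB = 4  # the newly added type


def get_property_type_or_none(code):
    """Return the matching type, or None for an unknown code."""
    try:
        return PropertyType(code)
    except ValueError:
        return None


class Blob:
    """Hypothetical value class for BLOB properties."""

    def __init__(self, content):
        self.content = content


# Registry that maps a value class to its property type.
_registry = {}


def register(value_class, property_type):
    _registry[value_class] = property_type


register(bool, PropertyType.BOOL)
register(int, PropertyType.INT)
register(str, PropertyType.STRING)
register(Blob, PropertyType.BLOB)  # registering the BLOB class


def type_of(value):
    """Look up the property type registered for the exact class of a value."""
    return _registry.get(type(value))
```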

The storage and reading method of the present system is illustrated as follows:

The content of a BLOB object tends to occupy a large amount of storage space. Storing entire BLOB objects directly in the graph database would make it cumbersome and cause serious memory consumption, storage consumption, and performance problems. In practical use, queries are mostly performed on certain attribute information of the BLOB, so the invention proposes a storage scheme that is integrated with, yet relatively independent of, the storage of Neo4j:

(1) In the present system, several relevant attributes are extracted for each BLOB object: the unique ID assigned by the system to each BLOB object, the length of the BLOB object, the 128-bit digest of the BLOB object, and the MIME type. These attribute values are stored in the property graph model of Neo4j.

(2) The content of the BLOB is stored in a storage system outside Neo4j. The invention uses an abstract storage model, a class named BlobStorage, which can be implemented by a local file system or a Ceph distributed file system and is used for storing the content of the BLOB and maintaining the mapping between the ID of the BLOB and the attribute information of the BLOB.

The BlobStorage system provides the following interface methods:

i. save: saves the specified BLOB content;

ii. configure: creates or modifies the path of the BLOB store;

iii. load: obtains the corresponding BLOB data according to the ID.

(3) The invention provides a back-end storage system named FileBlobStorage as a concrete storage model based on a local file system. FileBlobStorage inherits from BlobStorage, stores the content of each BLOB as a file, and maps IDs to BLOBs. It implements the three methods save, load, and configure.

When the save method is executed, a new file is created under the specified directory, the content of the BLOB is written into the new file using a file output stream, and the new file name is BLOB. A write method is then called to create another new file under the specified directory and write the MD5 value of the BLOB digest into it, with the name of BLOB.

When the load method is executed, the bid of the BLOB object is passed in as a parameter, the corresponding file is located in the specified directory, and the data content is read and returned through the fromFile() method.
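The relationship between the abstract and the concrete storage model can be sketched as below. This is a Python illustration under stated assumptions rather than the actual implementation: the class and method names follow the description, but the method signatures and the file naming scheme (the ID followed by a .bid or .md5 extension) are assumptions.

```python
import hashlib
import os
import tempfile
from abc import ABC, abstractmethod


class BlobStorage(ABC):
    """Abstract storage model: configure, save, and load."""

    @abstractmethod
    def configure(self, directory): ...

    @abstractmethod
    def save(self, blob_id, content): ...

    @abstractmethod
    def load(self, blob_id): ...


class FileBlobStorage(BlobStorage):
    """Concrete storage model backed by the local file system."""

    def configure(self, directory):
        # Create or change the directory that holds the BLOB files.
        os.makedirs(directory, exist_ok=True)
        self.directory = directory

    def _path(self, blob_id, ext):
        # Assumed naming scheme: "<id>.bid" for content, "<id>.md5" for the digest.
        return os.path.join(self.directory, f"{blob_id}.{ext}")

    def save(self, blob_id, content):
        with open(self._path(blob_id, "bid"), "wb") as f:
            f.write(content)
        with open(self._path(blob_id, "md5"), "w") as f:
            f.write(hashlib.md5(content).hexdigest())

    def load(self, blob_id):
        with open(self._path(blob_id, "bid"), "rb") as f:
            return f.read()


# Usage: store one BLOB in a temporary directory and keep its digest beside it.
storage = FileBlobStorage()
storage.configure(tempfile.mkdtemp())
storage.save(7, b"abc")
```

Keeping the interface abstract is what would allow a different back end, such as a distributed file system, to replace the local one without changing the callers.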

The attribute value creation and reading method of the BLOB of the present system is as follows:

The invention provides a method for creating BLOB attribute values with a file as the data source; after the method is executed, each attribute value of the BLOB content is obtained:

Blob.fromFile(file): generates the BLOB attribute values from the content of the file.

When a user calls the method, the system reads the content from the file designated by the user and creates a Blob object; it calls the calculateDigest() method to obtain the digest of the Blob object and the calculateLength() method to obtain its length; it then calls the calculateFirst8Bytes() method to compute the first 8 bytes of the content of the Blob object; finally, it returns the attribute information together.

Regarding reading the value of a BLOB, the invention adds a method for reading the BLOB type to readValue(). The method for reading the value of a Blob object is readBlobValue(), which takes two parameters, values and conf: values stores the attribute information of the Blob object, and conf is the system configuration information, which includes the storage location of the Blob in the file system. The system automatically fetches the attribute information of the Blob object from values and reads the content of the Blob object from the file system.
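The read path described above can be sketched in Python as follows. This is only an illustration of combining stored attributes with file content: the function name read_blob_value mirrors readBlobValue(), but the configuration key blob.storage.dir and the file name pattern are hypothetical.

```python
import os
import tempfile


def read_blob_value(values, conf):
    """Combine stored attributes with content read from the file system."""
    # The key "blob.storage.dir" and the "<id>.bid" file name are hypothetical.
    path = os.path.join(conf["blob.storage.dir"], f"{values['id']}.bid")
    with open(path, "rb") as f:
        content = f.read()
    return {**values, "content": content}


# Usage: write one BLOB file, then read it back through its attributes.
storage_dir = tempfile.mkdtemp()
with open(os.path.join(storage_dir, "42.bid"), "wb") as f:
    f.write(b"example bytes")

values = {"id": 42, "length": 13, "mime": "text/plain"}
conf = {"blob.storage.dir": storage_dir}
blob = read_blob_value(values, conf)
```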

The attributes and operation of the BLOB of the present invention are illustrated below:

The invention enriches the intrinsic attributes of a BLOB, which include content (content), length (length), digest (digest), and a 32-bit flag (first8Bytes), and provides operations on these attributes. See the tables below for details.

TABLE 1 Attribute Table for BLOB Attribute values

TABLE 2 operation Table for BLOB attribute values

Operation	Meaning
getBlobLength	Obtains the length of a BLOB object
calculateDigest	Obtains the digest of the BLOB content
calculateLength	Obtains the length of a BLOB
calculateFirst8Bytes	Obtains the content of the first 8 bytes of a BLOB
computeHash	Computes the hash value of the BLOB digest
equals	Determines whether two BLOB objects are equal

The invention uses an IdGenerator method to generate the ID of each BLOB. The ID serves as the unique identifier of the BLOB object in the system, and the corresponding BLOB content can be found with the help of BlobStorage.

The BLOB Cypher query method of the system is as follows:

To bring the convenience and efficiency of the Cypher language in Neo4j into full play, the system allows a user program to call the BLOB operation functions within a Cypher query statement, thereby supporting associative queries.

When creating a node, a user creates the node in the conventional Neo4j manner, stores a Blob object as an attribute of the node, and specifies a source file of the Blob object using the Blob.

When a user queries related content using the Cypher language, Blob objects under a node are treated the same as ordinary attributes of other types, and the relevant fields are specified directly in the query.
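The effect of treating BLOB attributes like ordinary attributes in a query can be illustrated with a small Python sketch. It does not use Cypher, since the exact Cypher function names are not given here; the node data and the match helper are hypothetical.

```python
# Hypothetical in-memory nodes; each BLOB property holds only non-content attributes.
nodes = [
    {"name": "Alice", "photo": {"id": 1, "length": 2048, "mime": "image/png"}},
    {"name": "Bob", "photo": {"id": 2, "length": 512, "mime": "image/jpeg"}},
    {"name": "Carol", "photo": {"id": 3, "length": 4096, "mime": "image/png"}},
]


def match(candidates, predicate):
    """Return the nodes for which the predicate holds."""
    return [n for n in candidates if predicate(n)]


# Combine conditions on BLOB attributes with a condition on an ordinary attribute.
result = match(
    nodes,
    lambda n: n["photo"]["mime"] == "image/png"
    and n["photo"]["length"] > 1024
    and n["name"] != "Carol",
)
names = [n["name"] for n in result]
```

Because the filter touches only the attributes kept in the graph, no BLOB content needs to be loaded from the back-end store to answer it.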

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A method for constructing a graph database management system supporting unstructured data storage and query comprises the following steps:

adding BLOB type support in the PropertyType of the original code of Neo4j, realizing reading and returning the number of locks to be added, and creating a BLOB data storage model;

a query module is created that supports queries using Cypher.

2. The method of claim 1, wherein the step of adding BLOB type support to the PropertyType of the original code of Neo4j comprises: