CN111221785A

CN111221785A - Semantic data lake construction method of multi-source heterogeneous data

Info

Publication number: CN111221785A
Application number: CN201811427793.XA
Authority: CN
Inventors: 陈刚
Original assignee: Sinocbd Inc
Current assignee: Sinocbd Inc
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2020-06-02

Abstract

The invention discloses a semantic data lake construction method of multi-source heterogeneous data, which comprises the following steps: S1, constructing an ontology, confirming the attributes and parameters of the ontology, and storing them in a graph database of the data lake server; S2, extracting the semantics of the content of the imported data file, establishing an RDF description, and storing the RDF description in a document database of the data lake server; and S3, according to the RDF description and with reference to the related ontology, associating the file corresponding to the RDF description with the ontology at the semantic level, and writing the association into the graph database. The invention can display the node relationships of the semantic data, process the ontology data and the data files associated with an ontology, allow manual intervention, facilitate retrieval from the semantic data lake, and make it easy to obtain the retrieved result files in detail.

Description

Semantic data lake construction method of multi-source heterogeneous data

Technical Field

The invention relates to the field of collection, management and application of multi-source heterogeneous data, in particular to a semantic data lake construction method of multi-source heterogeneous data.

Background

Database technology is the foundation and core of modern computer information systems and computer application systems, and is an important component of information systems. When developing a database application system, the database data is generally required to be exported for backup of the system or data sharing and exchange with other systems.

The concept of a data lake or data hub was originally proposed by big data vendors. On the surface, it meant keeping data on inexpensive storage hardware based on the scalable HDFS (Hadoop Distributed File System). But the larger the amount of data, the more different kinds of storage are needed. Eventually, all enterprise data may be considered big data, but not all enterprise data is suitable for storage on an inexpensive HDFS cluster. One part of the value of a data lake is gathering different kinds of data together; the other part is performing data analysis without a predefined model. Today's big data architectures are scalable and can provide users with more and more real-time analytics. The data lake architecture is oriented to information storage from multiple data sources, including the Internet of Things. Big data analysis or archiving can be handled, or delivered to the requesting user, by accessing the data lake.

To reduce the complexity and improve the convenience of semantic retrieval, with semantic associations established automatically and easy to search, a semantic data lake construction method of multi-source heterogeneous data is needed that can display the node relationships of the semantic data and process both the ontology data and the data files related to an ontology.

Disclosure of Invention

The invention aims to provide a semantic data lake construction method of multi-source heterogeneous data. The method constructs an ontology for the graph database, confirms the attributes and parameters of the ontology, and stores them in the graph database; establishes an RDF description of each file related to the ontology and stores the RDF description in a document database; and, according to the RDF description and with reference to the existing ontologies and attributes in the graph database, establishes an RDF association between the ontology and the related file and writes the association into the graph database. The method thereby displays the node relationships of the semantic data, processes the ontology data and the data files related to an ontology, allows manual intervention, facilitates retrieval from the semantic data lake, and makes it easy to obtain the retrieved result files in detail.

In order to achieve the aim, the invention provides a semantic data lake construction method of multi-source heterogeneous data, which comprises the following steps:

S1, constructing an ontology, confirming the attributes and parameters of the ontology, and storing them in a graph database of the data lake server;

S2, extracting the semantics of the content of the imported data file, establishing an RDF description, and storing the RDF description in a document database of the data lake server;

S3, according to the RDF description and with reference to the related ontology, associating the file corresponding to the RDF description with the ontology at the semantic level, and writing the association into the graph database.

Preferably, the RDF description contains nodes and edges, wherein the nodes represent entities/resources/attributes, and the edges represent relationships between entities and relationships between entities and attributes.
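As a rough illustration of this node-and-edge view, the following Python sketch turns a few subject-predicate-object triples into a node set and an edge list. The triple values, the predicate names and the helper to_graph are invented for the example and do not come from the patent.

```python
from collections import namedtuple

# One RDF statement: subject, predicate, object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical example triples (all names are assumptions).
triples = [
    Triple("drawing_001.pdf", "describes", "Pump"),
    Triple("Pump", "hasAttribute", "flow_rate"),
    Triple("Pump", "partOf", "CoolingSystem"),
]


def to_graph(triples):
    """Split triples into graph nodes and labeled, directed edges."""
    nodes = set()
    edges = []
    for t in triples:
        # Subjects and objects become nodes (entities, resources, attributes).
        nodes.update((t.subject, t.object))
        # The predicate labels the edge between them.
        edges.append((t.subject, t.predicate, t.object))
    return nodes, edges


nodes, edges = to_graph(triples)
```

In this reading, the edge labeled hasAttribute is an entity-to-attribute relationship and the edge labeled partOf is an entity-to-entity relationship.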

Preferably, the data lake server is a data storage and management service platform comprising four databases, namely a relational database, a document database, a distributed file system and a graph database, the platform adopts a distributed operation and storage architecture, integrates various computers, servers and computer clusters/server clusters with data storage and operation functions, and provides various functional components including data management and algorithm development.

Preferably, the data storage and management service platform organizes and manages the data files, and their storage and exchange, through log files and metadata files. The log record data contained in the log file exists as key-value pairs and contains fields for the following:

the operator name of the current operation;

the type of current operation;

the content of the current operation, namely the execution object of the operation action; when the operation type is modification, creation or addition, the position of the corresponding data source is stored; when the operation type is query, storing a corresponding query statement;

the date and time of the current operation;

the state of the current operation, used to judge whether the operation succeeded;

the data type of the current operation;

wherein the file metadata contained in the metadata file exists as key-value pairs and contains fields for the following:

the name of the data being processed;

a description of current data;

the user to which the current data belongs;

the group to which the current data belongs;

the storage destination, which corresponds to a database type;

the RDF (resource description framework) description generated from the data;

a metadata creation time;

the metadata update time.

Preferably, the graph database is Neo4j, Cayley or GraphDB; and/or the document database is MongoDB or CouchDB.

Preferably, step S1 further includes: selecting ontology keywords according to the subject of the graph database to be established, and then adding attribute and parameter descriptions of each ontology, which together form the basis of the graph database.
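Step S1 can be pictured with a minimal Python sketch in which an in-memory class stands in for the graph database. The class OntologyGraph, its method names, and the example keywords and attributes are assumptions made for illustration; a real deployment would issue the equivalent writes to Neo4j or another graph database.

```python
class OntologyGraph:
    """In-memory stand-in for the graph database (an assumption of this sketch)."""

    def __init__(self):
        self.nodes = {}  # ontology keyword -> attribute/parameter dict
        self.edges = []  # (source, relation, target)

    def add_ontology(self, keyword, **attributes):
        """Create the ontology node if needed and attach its attributes."""
        self.nodes.setdefault(keyword, {}).update(attributes)
        return keyword

    def relate(self, source, relation, target):
        """Add a labeled edge between two existing ontology nodes."""
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends must exist before they are related")
        self.edges.append((source, relation, target))


# Hypothetical ontology for an engineering-drawing subject area.
graph = OntologyGraph()
graph.add_ontology("Pump", category="equipment", unit="m3/h")
graph.add_ontology("CoolingSystem", category="system")
graph.relate("Pump", "partOf", "CoolingSystem")
```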

Preferably, the graph database of the data lake server is established based on software for graph database construction and management.

Preferably, the step S2 may be executed at the same time when the data file of the external data source is imported, or may be executed after the data file of the external data source is imported.

Preferably, after the semantic data lake is constructed, one or more of the following processes are further implemented: querying an ontology through a graph database in the semantic data lake to obtain attributes related to the ontology; the semantic data lake provides a graphical retrieval interface and a layer-by-layer query interface, supports data relation map display of query results, and supports related operations of maps; obtaining a source file corresponding to a query result, a matching list of files or data and contents of the queryable files; the user can further confirm the nodes in the network map and drill down the detailed query results.

Compared with the prior art, the invention has the following beneficial effects: (1) it reduces the complexity and improves the convenience of semantic retrieval, since the semantic associations are established automatically and are easy to search; (2) it addresses the accuracy of ontology construction, so that a complete ontology can be built; (3) it handles the multi-source heterogeneity of the data to be stored, so that data of various kinds can be stored in the data lake; (4) it provides the hardware platform support needed to construct the semantic data lake; (5) it is convenient to use and manage, the retrieval process is traceable, and the retrieved result files are easy to obtain in detail; (6) it makes the semantic data lake convenient both to build and to search, and allows manual intervention; (7) it addresses the security and stability of data storage.

Drawings

FIG. 1 is a schematic diagram of the architecture of a data lake of the present invention.

Detailed Description

In order that the invention may be more readily understood, reference will now be made to the following description taken in conjunction with the accompanying drawings.

As shown in FIG. 1, the data lake server is a data storage and management service platform composed of four types of storage: a relational database (e.g., MariaDB or MySQL), a document database (e.g., MongoDB or CouchDB), a distributed file system (e.g., HDFS, PVFS or PanFS), and a graph database (e.g., Neo4j, Cayley or GraphDB). The platform adopts a distributed computing and storage architecture, integrates all kinds of machines with data storage and computing capability (single computers, servers, and computer or server clusters), and provides functional components including data management and algorithm development.

The distributed computing and storage architecture works as follows: a PaaS cloud computing platform allocates the computing resources, and service containers are distributed to all nodes in the cluster to provide distributed computing resources.

In this embodiment, the data exchange management of the entire data lake is based on the log record data and the file metadata stored in the MongoDB.

(a) The log record data exists in the form of key-value pairs, and the field names and contents of the log record data are as follows:

field "user": the name of the operator who performed the current operation;

field "operation_type": the type of the current operation, such as create, modify or add;

field "operation_record": the content of the current operation, that is, the object on which the operation acts. When the operation type is modify, create or add, the location of the corresponding data source is saved; when the operation type is query, the corresponding query statement is saved;

field "operation_time": the date and time of the current operation, for example "2018-06-28T03:18:58.91";

field "operation_status": the state of the current operation, an auxiliary field used to judge whether the operation succeeded;

field "operation_source": the data type of the current operation, for example "hdfs" for file-type data.
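For illustration, one log record with these six fields might be assembled as in the Python sketch below. The helper make_log_record, the example values and the "success"/"failure" encoding of the status are assumptions; the description only fixes the field names and their meaning.

```python
from datetime import datetime

# Field names as listed above; everything else in this sketch is illustrative.
LOG_FIELDS = (
    "user", "operation_type", "operation_record",
    "operation_time", "operation_status", "operation_source",
)


def make_log_record(user, op_type, record, source, ok=True, when=None):
    """Build one key-value log record (hypothetical helper)."""
    when = when or datetime.now()
    return {
        "user": user,
        "operation_type": op_type,      # e.g. create, modify, add, query
        "operation_record": record,     # data source location, or the query statement
        "operation_time": when.isoformat(),
        "operation_status": "success" if ok else "failure",  # assumed encoding
        "operation_source": source,     # e.g. "hdfs" for file-type data
    }


record = make_log_record(
    "admin", "create", "/lake/drawings/a.pdf", "hdfs",
    when=datetime(2018, 6, 28, 3, 18, 58, 910000),
)
```

A dictionary of this shape can be stored directly as one document in a document database such as MongoDB.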

(b) The file metadata exists in the form of key-value pairs, and the field names and contents of the file metadata are mainly as follows:

data name: the name of the data being processed;

description: a description of the current data;

owner: the user to which the current data belongs;

group: the group to which the current data belongs;

storage backend: the storage destination, which refers to a particular database type;

auxiliary annotation: the RDF generated from the data. RDF is the English abbreviation of "Resource Description Framework". It is essentially a data model that provides a unified standard for describing entities and resources; put simply, it is a way of representing things, formally expressed as "subject-predicate-object" triples;

metadata creation time: the time at which the metadata was created;

metadata update time: the time at which the metadata was last updated;

The "storage backend" field contains different sub-fields depending on the data type (file, document, table or graph).

For file-type data, the fields are: file physical path, file physical name, space occupied on HDFS, real file owner, real file group, front-end display file path, front-end display file name, file extension, MIME type of the file (Multipurpose Internet Mail Extensions type), real file size, and stop-word list (for RDF processing).

For document-type data (such as JSON data), the fields are: physical database location, physical collection name, display database name, display collection name, document structure (JSON data structure), and stop-word list (for RDF processing).

For table-type data (such as MySQL data), the fields are: physical database name, physical table name, display database name, display table name, column list, and stop-word list (for RDF processing).

For graph-type data (such as Neo4j data), the fields are: Neo4j ID (also called the ID of the ontology), front-end display name, and stop-word list (for RDF processing).
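The type-dependent storage backend can be sketched in Python as follows. All key spellings (data_name, storage_backend, neo4j_id and so on) are English stand-ins chosen for this example, and the helper make_file_metadata is hypothetical; the description names the fields only descriptively.

```python
# Sub-fields of the storage backend per data type. Key spellings are
# assumptions of this sketch, not names fixed by the patent.
BACKEND_FIELDS = {
    "file": ["physical_path", "physical_name", "hdfs_size", "real_owner",
             "real_group", "display_path", "display_name", "extension",
             "mime_type", "real_size", "stop_words"],
    "document": ["physical_database", "physical_collection", "display_database",
                 "display_collection", "document_structure", "stop_words"],
    "table": ["physical_database", "physical_table", "display_database",
              "display_table", "column_list", "stop_words"],
    "graph": ["neo4j_id", "display_name", "stop_words"],
}


def make_file_metadata(name, description, owner, group, data_type, backend, rdf, now):
    """Assemble one metadata record and check the backend fields for its type."""
    missing = set(BACKEND_FIELDS[data_type]) - set(backend)
    if missing:
        raise ValueError(f"missing backend fields: {sorted(missing)}")
    return {
        "data_name": name,
        "description": description,
        "owner": owner,
        "group": group,
        "storage_backend": {"type": data_type, **backend},
        "rdf": rdf,                 # auxiliary annotation: triples generated from the data
        "created_at": now,
        "updated_at": now,
    }


meta = make_file_metadata(
    "Pump", "pump ontology node", "admin", "engineering", "graph",
    {"neo4j_id": 42, "display_name": "Pump", "stop_words": ["the", "a"]},
    rdf=[("Pump", "partOf", "CoolingSystem")],
    now="2018-06-28T03:18:58",
)
```

Validating the sub-fields against the data type at write time is a design choice of this sketch; it keeps records for the four storage types from drifting apart.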

Through the log file and the metadata file, the data management service platform can efficiently and safely organize and manage the data file and accelerate the data storage speed.

The semantic data lake construction method of the multi-source heterogeneous data comprises the following steps:

S1, constructing an ontology, confirming the attributes and parameters of the ontology, and storing them in a graph database of the data lake server.

The graph database of the data lake server is established with software for graph database construction and management. "Constructing an ontology, confirming the attributes and parameters of the ontology" means: according to the subject of the graph database to be established, ontology keywords are selected and attribute and parameter descriptions of each ontology are then added, which lays the foundation of the graph database.

S2, while the data file of an external data source is being imported, or after it has been imported, extracting the semantics of the content of the imported data file (for example, the semantics of each document, or of each row of data records in a document, depending on the type of data), establishing an RDF description, and saving the RDF description (for example, semantic information and keywords) into a document database (for example, MongoDB) of the data lake server.

"Data file" is used here as a broad concept: it covers all types of electronically stored files and denotes whatever the data lake server treats as a unit of imported data.

In addition, RDF is the English abbreviation of "Resource Description Framework". It is essentially a data model that provides a unified standard for describing entities and resources. Put simply, it is a way of representing things, formally expressed as "subject-predicate-object" triples. In a graph database, RDF consists of nodes, which represent entities, resources and attributes, and edges, which represent relationships between entities and between entities and attributes.

S3, according to the RDF description (for example, semantic information and keywords) and with reference to the related ontology, associating the file corresponding to the RDF description with the ontology at the semantic level, and writing the association into the graph database.
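The association in step S3 can be sketched as below, under the simplifying assumption that a file is associated with an ontology whenever one of its extracted keywords exactly equals an ontology keyword. The patent does not specify the matching rule; the function associate, the relation name relatedTo and the example data are all hypothetical.

```python
def associate(rdf_description, ontology_keywords):
    """Return (file, relation, ontology) edges for keywords that match an ontology.

    Exact keyword matching is an assumption of this sketch; a real system
    could use any semantic similarity measure instead.
    """
    file_id = rdf_description["file"]
    matched = sorted(set(rdf_description["keywords"]) & set(ontology_keywords))
    return [(file_id, "relatedTo", keyword) for keyword in matched]


# RDF description as it might be read back from the document database (S2).
rdf_description = {
    "file": "drawing_001.pdf",
    "keywords": ["Pump", "valve", "CoolingSystem"],
}
# Ontology keywords as they might be read from the graph database (S1).
ontology_keywords = {"Pump", "CoolingSystem", "Boiler"}

# These edges are what step S3 would write into the graph database.
edges = associate(rdf_description, ontology_keywords)
```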

As shown in FIG. 1, the external data sources of the data lake of the present embodiment may be IT data (existing data), open data (e.g., data from various networks), and OT data (e.g., data generated during production processes).

By constructing the semantic data lake of multi-source heterogeneous data, the node relationships of the semantic data lake can be displayed, and the ontology data and the data files related to an ontology can be processed. Specifically, an ontology can be queried through the graph database in the semantic data lake to obtain the attributes related to it; the data lake provides a graphical retrieval interface and a layer-by-layer query interface, supports displaying the query result as a data relationship graph, and supports add, delete, modify and search operations on that graph (for example, add: adding a file to be associated with an ontology); the source file corresponding to a query result can be obtained, together with a matching list of files or data; and the file content can also be queried. The user can further select nodes in the network graph and drill down into the detailed query results.
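The query and drill-down flow described above can be sketched in Python with plain dictionaries and lists standing in for the graph database and the document database. The function names, the hasFile relation and the sample data are assumptions for illustration only.

```python
def files_for(ontology, graph_edges):
    """List files linked to an ontology node by a 'hasFile' edge."""
    return [dst for src, rel, dst in graph_edges
            if src == ontology and rel == "hasFile"]


def drill_down(ontology, attributes, graph_edges, document_store):
    """Collect the ontology's attributes and the details of its associated files."""
    return {
        "ontology": ontology,
        "attributes": attributes.get(ontology, {}),
        "files": [
            {"file": f, **document_store.get(f, {})}
            for f in files_for(ontology, graph_edges)
        ],
    }


# Stand-ins for graph database content (attributes and edges) ...
attributes = {"pump": {"category": "equipment"}}
graph_edges = [
    ("pump", "hasFile", "drawing_001.pdf"),
    ("valve", "hasFile", "drawing_002.pdf"),
    ("pump", "hasFile", "drawing_003.pdf"),
]
# ... and for document database content (per-file RDF descriptions).
document_store = {"drawing_001.pdf": {"keywords": ["pump", "loop"]}}

result = drill_down("pump", attributes, graph_edges, document_store)
```

The first lookup corresponds to querying the ontology in the graph database; the second corresponds to following the stored pointers to the source files and their descriptions.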

As a first embodiment of the invention:

The local data lake server consists of the relational database MariaDB, the document database MongoDB, the distributed file system HDFS, and the graph database Neo4j. A set of engineering drawing files in PDF format is available from an FTP server whose IP address is 192.168.12.101, port is 8080, user name is admin, and password is password.

On the software interface of the data lake server, a connection service is started, the access information of the file data source is entered, and the connection succeeds.

The 20 PDF files to be imported then appear on the interface. The 20 PDF engineering drawing files are imported through the interface menu, and while the files are being imported the data lake server performs the following operations on each PDF file in the background:

1. converting the PDF file into plain text;

2. extracting semantic information and key words in the text by using a natural language processing method;

3. checking the semantics and keywords extracted in the previous step against the existing ontologies, attributes and tags in the graph database, establishing an RDF association between the ontology and the PDF file, and writing the association into the graph database Neo4j. The identifier of the ontology is recorded in Neo4j, while other information, such as the pointer to the PDF file associated with the ontology (e.g., ID information and the source file for query results), is stored in MongoDB.
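The three background operations can be sketched as follows. The sketch assumes that step 1 (PDF to plain text) has already produced a string, and it replaces the unspecified natural language processing method with a naive tokenizer plus a stop-word filter. The function names, the hasFile relation and the sample text are hypothetical; a list and a dictionary stand in for Neo4j and MongoDB.

```python
import re


def extract_keywords(text, stop_words):
    """Naive keyword extraction: lowercase word tokens minus stop words, de-duplicated."""
    seen, keywords = set(), []
    for token in re.findall(r"[A-Za-z]+", text.lower()):
        if token not in stop_words and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


def import_file(file_name, text, ontology, graph_edges, document_store, stop_words):
    """Process one imported file whose plain text is already available (step 1)."""
    keywords = extract_keywords(text, stop_words)          # step 2
    matched = [kw for kw in keywords if kw in ontology]    # step 3: compare with ontology
    for kw in matched:
        graph_edges.append((kw, "hasFile", file_name))     # graph database side
    # Document database side: RDF description and pointer information.
    document_store[file_name] = {"keywords": keywords, "ontologies": matched}
    return matched


ontology = {"pump", "valve"}
graph_edges, document_store = [], {}
text = "The pump feeds the cooling loop through a valve."
matched = import_file(
    "drawing_001.pdf", text, ontology, graph_edges, document_store,
    stop_words={"the", "a", "through"},
)
```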

As a second embodiment of the invention:

The local data lake server consists of the relational database MariaDB, the document database MongoDB, the distributed file system HDFS, and the graph database Neo4j. A set of data record files in XLS format is available from an FTP server whose IP address is 192.168.12.101, port is 8080, user name is admin, and password is passsd.

The XLS file to be imported then appears on the interface. The XLS data file is imported through the interface menu, and while it is being imported the data lake server performs the following operations on each row of the file in the background:

1. reading the row of data records;

2. extracting semantic information and keywords in the row of data records;

3. checking the semantics and keywords extracted in the previous step against the existing ontologies, attributes and tags in the graph database, establishing an RDF association between the ontology and the XLS file record, and writing the association into the graph database Neo4j. The identifier of the ontology is recorded in Neo4j, while other information, such as the pointer to the XLS file record associated with the ontology (e.g., ID information and the source file for query results), is stored in MongoDB.
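The row-by-row variant can be sketched in the same way. To stay within the Python standard library, the sketch reads CSV text as a stand-in for the XLS file, and it treats each cell value as a candidate keyword. The function import_rows, the hasRecord relation, the row-pointer format and the sample data are all assumptions.

```python
import csv
import io


def import_rows(file_name, csv_text, ontology):
    """Associate each data row with the ontology keywords found in its cells."""
    associations = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):     # step 1: read the row
        # Step 2: cell values act as the row's keywords (a simplification).
        values = {str(value).strip().lower() for value in row.values()}
        # Step 3: compare with the ontology and record the association.
        for keyword in sorted(values & ontology):
            pointer = f"{file_name}#row{row_number}"
            associations.append((keyword, "hasRecord", pointer))
    return associations


csv_text = "device,reading\npump,12.5\nfan,3.1\nvalve,0.8\n"
associations = import_rows("records.xls", csv_text, {"pump", "valve"})
```

The pointer string identifies the individual record rather than the whole file, which mirrors the per-row association described in this embodiment.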

While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims

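1. A semantic data lake construction method of multi-source heterogeneous data, characterized by comprising the following steps:

S1, constructing an ontology, confirming the attributes and parameters of the ontology, and storing them in a graph database of the data lake server;

S2, extracting the semantics of the content of the imported data file, establishing an RDF description, and storing the RDF description in a document database of the data lake server;

S3, according to the RDF description and with reference to the related ontology, associating the file corresponding to the RDF description with the ontology at the semantic level, and writing the association into the graph database.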

2. The method for constructing the semantic data lake of the multi-source heterogeneous data according to claim 1, wherein the RDF description comprises nodes and edges, wherein the nodes represent entities/resources/attributes, and the edges represent relationships between the entities and relationships between the entities and the attributes.

3. The method for constructing a semantic data lake of multi-source heterogeneous data according to claim 1, wherein the data lake server is a data storage and management service platform comprising four types of databases, namely a relational database, a document database, a distributed file system and a graph database, the platform adopts a distributed operation and storage architecture, integrates various types of computers, servers and computer clusters/server clusters with data storage and operation functions, and provides various functional components including data management and algorithm development.

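4. The method for constructing the semantic data lake of the multi-source heterogeneous data according to claim 3, wherein the data storage and management service platform organizes and manages the data files, and their storage and exchange, through log files and metadata files;

the log record data contained in the log file exists as key-value pairs and contains fields for the following:

the operator name of the current operation;

the type of the current operation;

the content of the current operation, namely the object on which the operation acts; when the operation type is modification, creation or addition, the location of the corresponding data source is stored; when the operation type is query, the corresponding query statement is stored;

the date and time of the current operation;

the state of the current operation, used to judge whether the operation succeeded;

the data type of the current operation;

wherein the file metadata contained in the metadata file exists as key-value pairs and contains fields for the following:

the name of the data being processed;

a description of the current data;

the user to which the current data belongs;

the group to which the current data belongs;

the storage destination, which corresponds to a database type;

the RDF (resource description framework) description generated from the data;

the metadata creation time;

the metadata update time.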

5. The method for constructing a semantic data lake of multi-source heterogeneous data according to claim 1, wherein the graph database is Neo4j, Cayley or GraphDB; and/or the document database is MongoDB or CouchDB.

6. The method for constructing a semantic data lake of multi-source heterogeneous data according to claim 1, wherein the step S1 further comprises:

and selecting an ontology keyword according to a main body of the graph database to be established, and further adding attribute parameter description of the ontology for constructing the graph database.

7. The semantic data lake construction method of multi-source heterogeneous data according to claim 1, wherein the graph database of the data lake server is established with software for graph database construction and management.

8. The method for constructing semantic data lake of multi-source heterogeneous data according to claim 1, wherein the step S2 is executed at the same time when the data file of the external data source is imported or after the data file of the external data source is imported.

9. The method for constructing semantic data lake of multi-source heterogeneous data according to any one of claims 1 to 8,

after the semantic data lake is constructed, one or more of the following processes are further realized:

querying an ontology through a graph database in the semantic data lake to obtain attributes related to the ontology;

the semantic data lake provides a graphical retrieval interface and a layer-by-layer query interface, supports data relation map display of query results, and supports related operations of maps;

obtaining a source file corresponding to a query result, a matching list of files or data and contents of the queryable files;

the user can further confirm the nodes in the network map and drill down the detailed query results.