CN111221791A - Method for importing multi-source heterogeneous data into data lake


Info

Publication number: CN111221791A
Application number: CN201811438360.4A
Authority: CN (China)
Prior art keywords: data, external, database, file, lake
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 陈刚 (Chen Gang)
Original and current assignee: Sinocbd Inc
Priority date / filing date: 2018-11-27
Publication date: 2020-06-02

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for importing multi-source heterogeneous data into a data lake, which comprises the following steps: obtaining the access interface address of external file-type data, importing the file-type data, and storing it in the distributed file system of a local data lake server; or obtaining the access interface information of an external data source, connecting it to the local data lake server, and then either importing the data of the external data source and storing it in the distributed file system in the form of data files, converting non-relational data of the external data source into relational data and storing it in a relational database, directly importing relational data of the external data source into the relational database, or importing non-relational data of the external data source and storing it in a document database. The invention solves the problem of the multi-source heterogeneity of the data to be stored, facilitates the collection, management, application and expansion of multi-source heterogeneous data, meets the varied requirements of an organizational structure, and ensures data access security and flexibility when importing data.

Description

Method for importing multi-source heterogeneous data into data lake
Technical Field
The invention relates to the field of collection, management and application of multi-source heterogeneous data, in particular to a method for importing multi-source heterogeneous data into a data lake.
Background
Database technology is the foundation and core of modern computer information systems and computer application systems, and is an important component of information systems. When developing a database application system, database data generally needs to be exported in order to back up the system or to share and exchange data with other systems.
The concept of a data lake (or data hub) was originally proposed by big-data vendors: on the surface, data is simply carried on inexpensive storage hardware based on the scalable HDFS (Hadoop Distributed File System). But the larger the volume of data, the more kinds of storage are needed. Eventually all enterprise data may be regarded as big data, yet not all enterprise data is suitable for storage on an inexpensive HDFS cluster. Part of the value of a data lake lies in gathering different kinds of data together; another part lies in performing data analysis without a predefined model. Today's big-data architectures are scalable and can provide users with ever more real-time analytics. The data lake architecture is oriented toward storing information from multiple data sources, including the Internet of Things; big-data analysis or archiving can be handled, or delivered to requesting users, by accessing the data lake.
For the collection, management and application of multi-source heterogeneous data, data application and sharing must be supported under an organization's data-rights arrangements. A method for importing multi-source heterogeneous data into a data lake is therefore needed, so that multi-source heterogeneous data can be conveniently collected, managed, applied and further expanded, and the varied requirements of the organizational structure can be met.
Disclosure of Invention
The invention aims to provide a method for importing multi-source heterogeneous data into a data lake. By connecting external data sources to a local data lake server, various kinds of external data are stored in the databases of the local data lake, solving the problem of the multi-source heterogeneity of the data to be stored. The invention facilitates the collection, management, application and further expansion of multi-source heterogeneous data, and meets the varied requirements of the organizational structure; it ensures data access security and flexibility during import, allows data to be tracked, and facilitates future query, tracing and reproduction of operations; it provides data access and read/write permissions as well as sharability, facilitating data access and sharing; it also guarantees the import speed of the data and facilitates data preprocessing.
In order to achieve the purpose, the invention discloses a method for importing multi-source heterogeneous data into a data lake, which comprises the following steps:
acquiring access interface information of an external data source, connecting a local data lake server with the external data source, importing data of the external data source, and storing the data in a distributed file system of the local data lake server in a data file form; wherein the external data source comprises an external database and an external stream data source;
and/or obtaining access interface information of an external data source, connecting the local data lake server with the external data source, converting non-relational data of the external data source into relational data, storing the relational data into a relational database of the local data lake server, or directly importing the relational data of the external data source, and storing the relational data into the relational database of the local data lake server;
and/or obtaining access interface information of an external data source, connecting the local data lake server with the external data source, importing non-relational data of the external data source, and storing the non-relational data in a document type database of the local data lake server;
and/or obtaining an access interface address of the external file type data, directly importing the external file type data, and storing the external file type data in a distributed file system of the local data lake server.
Preferably, the obtaining of the access interface information of the external data source refers to obtaining one or more of an IP address, a port number, a user name, and a password of an interface of the external data source.
Preferably, the user may share the data file stored in the distributed file system by itself to other users, and the method further includes:
the user has sharing authority when registering in the data lake server and has the right to share the data file imported into the distributed file system to other users;
various data sources can be imported by different users, and each user can only see the imported data files under the default condition;
when the data file is in the distributed file system, the user can share the data file;
the user can set various permissions including private permissions, permissions visible in the group and public permissions for the data file imported by the user, and the various permissions of the user are set by an administrator of the data lake server.
Preferably, the data lake server is a data storage and management service platform comprising four databases, namely a relational database, a document database, a distributed file system and a graph database, the platform adopts a distributed operation and storage architecture, integrates various computers, servers and computer clusters/server clusters with data storage and operation functions, and provides various functional components including data management and algorithm development.
Preferably, the local data lake server records the operation process and relevant operation parameters of importing external database data, external stream data, or external file-type data into the local data lake server, and stores them in the document database of the local data lake server for tracking data processing and log analysis;
the data exchange management of the local data lake server may be based on journaling data stored in a document-based database, the journaling data being in the form of key-value pairs, and file metadata, the file metadata being in the form of key-value pairs.
Preferably, the method for importing the multi-source heterogeneous data into the data lake further comprises the following steps:
selecting data fields of an external data source to be loaded, and storing the data of the selected data fields into a distributed file system of a local data lake server in a data file form;
the data field of the external data source to be loaded is selected, namely after the local data lake server is connected with the external data source, a user sees field information of the external data source on a management interface of the local data lake server, and further selects a data field to be imported; the user can select all the data fields, and the data corresponding to the fields selected by the user can be imported when the data is copied to the local data lake server in the next step.
Preferably, the method for importing the multi-source heterogeneous data into the data lake further comprises the following steps:
after the data of the external data source is copied into a file in a distributed file system of the local data lake server or copied into a relational database/document database of the local data lake server, a user can further check the data of each field of the data file and perform data cleaning operation;
and the user imports the cleaned data into a relational database of the local data lake server or stores the cleaned data into a document database of the local data lake server according to actual needs.
Preferably, the user may perform a joint query across table selection fields, where the joint query across table selection fields is a cross table query method based on a graph database, and specifically includes the following processes:
in the data lake management platform, a user inputs a database name which is to be inquired and is imported into a data lake, and after the database name is executed, the platform searches relevant information of a graph from a specified position of a graph database;
the user fills in initial information containing a data table name and a data column name, and fills in end information containing the data table name and the data column name of a required target;
inquiring a shortest path from a starting point to a destination point in a graph database according to input information of a user;
loading the tables contained in the shortest path into Spark and connecting the tables according to the information in the graph database;
performing relevant operation on the connected table;
and returning the result of the user query.
Preferably, the method for importing the multi-source heterogeneous data into the data lake further comprises the following steps:
and according to the requirement of the external application program on the data, exporting the non-relational data of the local data lake server into relational data for the external application program to use.
Preferably, the method for converting the non-relational data of the external data source into relational data comprises: for data in a non-relational database of the external data source, the conversion from non-relational data to relational data is completed by traversing all keys, parsing each value, and parsing the data according to the type of each value.
Preferably, the method for importing the multi-source heterogeneous data into the data lake further comprises the following steps:
when the file-type data is imported, the local data lake server extracts the information in the data file and stores the information in the document database and the graph database;
and/or when the relational data are imported, the local data lake server stores the information in the data file into the document database and the graph database by extracting the information in the data file;
and/or when the non-relational data are imported, the local data lake server stores the information in the data file into the graph database by extracting the information in the data file;
the method for extracting the information in the data file by the local data lake server comprises one or more of an image recognition method, a voice recognition method, a text filtering method and a video file processing method.
Preferably, when the relationship data is saved to the graph database of the local data lake server, the method further comprises:
1) acquiring relevant parameters required by a connection graph database, wherein the relevant parameters comprise an address, a port, a user name and a password;
2) reading a table to be extracted in a relational database;
3) circularly reading the related content of the fields in each table and reading the relationship between the table and other tables;
4) calling a method that maps the fields of each data record to the attributes of a node, so that the record can be imported into the graph database, with the record's primary key serving as the identifier of the node;
5) reading the relationships and importing the nodes mapped from the data records into the graph database.
Preferably, the user can view the imported data file and view the information of the data lake server, further comprising: the local data lake server administrator or an authorized user can check the data in different databases, check the statistics, the total data amount, the file types and the data amounts of various file types of the data lake server, and perform retrieval and export operations on the data lake server.
Compared with the prior art, the invention has the following beneficial effects: (1) it facilitates the collection, management, application and further expansion of multi-source heterogeneous data, and meets the varied requirements of the organizational structure; (2) it ensures data access security and flexibility of data import, allows data to be tracked, facilitates future query, tracing and reproduction of operations, and provides data access and read/write permissions as well as sharability, facilitating data access and sharing; (3) it solves the problem of the multi-source heterogeneity of the data to be stored, so that all kinds of data can be stored in the data lake; (4) it addresses the convenience and speed of storing and retrieving file-type data, guarantees the import speed of the data, facilitates data preprocessing, and solves the problem of selectively loading data into the databases, which is convenient for users; (5) it addresses content understanding of picture-type data files, indexing of file-type data, and the effectiveness of data retrieval: picture content can be extracted and understood, and file-type data becomes structured, which facilitates the future construction of a knowledge graph and a semantic data lake and improves the retrieval of file-type data; (6) it solves the problem of storage compatibility between traditional databases and new types of databases; (7) it solves the problem that industrial field data is usually staged and stored locally before being loaded into a database, reducing unnecessary staging and storage steps so that storage speed and storage space are theoretically unlimited; (8) it also makes information searchable and improves transparency of use for the user.
Drawings
FIG. 1 is a schematic diagram of the architecture of a data lake of the present invention.
Detailed Description
In order that the invention may be more readily understood, reference will now be made to the following description taken in conjunction with the accompanying drawings.
The method for importing the multi-source heterogeneous data into the data lake comprises the following steps: the method comprises the steps of obtaining access interface information of an external data source, connecting a local data lake server and the external data source, importing data of the external data source, and storing the imported data in a distributed file system of the local data lake server in a data file form. Wherein the external data source comprises an external database and an external stream data source. As shown in fig. 1, the external data sources of the data lake of the present embodiment may be IT data (existing data), open data (e.g., data from various networks), and OT data (e.g., data in the process of generation).
Illustratively, acquiring the access interface information of the external database or the external stream data source refers to acquiring an IP address, a port number, a user name and a password of the external data source interface.
The invention can also import the external file type data and store the external file type data in the distributed file system of the local data lake server by acquiring the FTP or HTTP address of the access interface of the external file type data. In addition, the invention can also convert the non-relational data of the external data source into relational data and store the relational data in the relational database of the local data lake server, or directly import the relational data of the external data source and store the relational data in the relational database of the local data lake server, or import the non-relational data of the external data source and store the imported non-relational data (in this case, the non-relational data is document data) in the document database of the local data lake server.
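As a concrete illustration of the file-type import path just described, the following Python sketch downloads a file from an HTTP access-interface address and writes it into HDFS through WebHDFS. It is only a minimal sketch under assumed names: the interface URL, the namenode address, the target path and the user name are placeholders, and the `hdfs` client library is one possible choice rather than an implementation mandated by the invention.

```python
# Minimal sketch: import external file-type data (reachable over HTTP) into the
# data lake's distributed file system. URLs, paths, and user names are assumptions.
import requests
from hdfs import InsecureClient  # WebHDFS client; one possible choice


def import_file_to_hdfs(file_url: str, hdfs_url: str, hdfs_path: str, user: str = "datalake"):
    # Download the external file-type data from its access-interface address.
    response = requests.get(file_url, stream=True, timeout=60)
    response.raise_for_status()

    # Write the content into the distributed file system of the local data lake server.
    client = InsecureClient(hdfs_url, user=user)
    with client.write(hdfs_path, overwrite=True) as writer:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            writer.write(chunk)


if __name__ == "__main__":
    import_file_to_hdfs(
        "http://file-server.example.com/exports/report.pdf",  # hypothetical interface address
        "http://namenode.example.com:9870",                    # hypothetical namenode
        "/datalake/files/report.pdf",
    )
```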
The term "data file" is a broad data concept covering all kinds of electronically stored files; from the point of view of the data lake server, every imported file is treated as a data file.
In this embodiment, the external database refers to a conventional relational database, such as Oracle, MySQL, SQLServer, and the like. When a database is imported, the database management component of the data lake server can (1) add, delete, modify, and query the fields or data of the database; (2) import data from an external database; (3) select databases, tables, and fields; (4) view the imported data records and the numbers of successfully and unsuccessfully imported records.
The stream data means: for recorded data generated during logistics, production-site and similar processes, or during the occurrence of events, the data lake server supports the access and conversion of various stream-data protocols. Data using the TCP/IP protocol can be imported directly; stream data using other protocols is imported after a peripheral device parses the protocol. Mobile terminal devices compatible with the TCP/IP protocol are also supported for importing stream data. Further, the text content of the transmitted stream data (such as JSON) can be viewed, and field names can be set on the interface, specifying the storage location of the stream data, the transmission speed and total data volume, and the source address of the transmission; the data fields, docked APPs, storage locations, IP addresses, APP configuration information, and total data size of the imported stream data can be viewed within the data lake server through the data management component.
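To make the TCP/IP stream-data path concrete, the sketch below shows one way a receiver on the data lake server might accept newline-delimited JSON records over a TCP socket and land them in the document database. The port, field names, and the choice of MongoDB as the landing store are assumptions made for illustration, not part of the claimed method.

```python
# Minimal sketch: receive newline-delimited JSON stream data over TCP/IP and
# store each record in the document database. Port and collection names are assumptions.
import json
import socketserver

from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")   # hypothetical document database
collection = mongo["datalake"]["stream_records"]


class StreamHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Each line is one JSON record sent by the external stream data source.
        for raw_line in self.rfile:
            line = raw_line.decode("utf-8").strip()
            if not line:
                continue
            record = json.loads(line)
            record["source_address"] = self.client_address[0]  # keep the transmission source
            collection.insert_one(record)


if __name__ == "__main__":
    # Listen on an assumed port for incoming TCP/IP stream data.
    with socketserver.TCPServer(("0.0.0.0", 9500), StreamHandler) as server:
        server.serve_forever()
```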
As shown in FIG. 1, the data lake server refers to a data storage and management service platform composed of four types of databases: a relational database (e.g., MariaDB, MySQL), a document database (e.g., MongoDB, CouchDB), a distributed file system (e.g., HDFS, PVFS, PanFS), and a graph database (e.g., Neo4j, Cayley, GraphDB). The platform adopts a distributed computing and storage architecture, integrates various machines with data storage and computing capability (single computers, servers, and computer or server clusters), and provides various functional components including data management and algorithm development.
The distributed computing and storage architecture is as follows: a PaaS cloud computing platform is used to distribute computing resources, and service containers are deployed to all nodes in the cluster to provide distributed computing resources.
In this embodiment, the data exchange management of the entire data lake is based on the log record data and the file metadata stored in the MongoDB.
(a) The log record data exists in the form of key-value pairs; its field names and contents are as follows (an illustrative record is sketched after this list):
Field "user": saves the name of the operator of the current operation;
Field "operation_type": saves the type of the current operation, such as create, modify, add, and so on;
Field "operation_record": saves the content of the current operation, that is, the object on which the operation acts. When the operation type is modify, create, or add, the location of the corresponding data source is saved; when the operation type is query, the corresponding query statement is saved;
Field "operation_time": saves the date and time of the current operation, for example: "2018-06-28T03:18:58.91";
Field "operation_status": saves the state of the current operation; this is an auxiliary field used to judge whether the current operation succeeded;
Field "operation_source": saves the data type of the current operation, for example: "hdfs" represents file-type data.
(b) The file metadata exists in the form of key-value pairs; its main field names and contents are as follows (an illustrative metadata record for file-type data is sketched after these lists):
Data name: the name of the data being processed;
Description: a description of the current data;
Belonging user: the user to which the current data belongs;
Belonging group: the group to which the current data belongs;
Storage back end: the destination of storage, referring to a certain database type;
Auxiliary label: the RDF generated from the data. RDF is the English abbreviation of "Resource Description Framework" and is essentially a data model that provides a unified standard for describing entities and resources; simply put, it is a method and means of representing things, formally expressed as "subject-predicate-object" triples;
Metadata creation time: the creation time of the metadata;
Metadata update time: the update time of the metadata;
the "storage back end" field contains different fields according to different data types (file type, document type, table type, and graph type).
For data of the file type, there are the following fields: file physical path, file physical name, HDFS space occupied, real file owner, real file group, front-end display file path, front-end display file name, file extension, MIME type of the file (Multipurpose Internet Mail Extensions type), real file size, stop-word list (for RDF processing).
For document type data (such as JSON type data), there are the following fields: physical database location, physical collection name, display database name, display collection name, document structure (JSON data structure), stop word table (for RDF processing).
For data of the table type (for example, data from MySQL), there are the following fields: physical database name, physical table name, display database name, display table name, column list, stop-word list (for RDF processing).
For data of the graph type (such as data in Neo4j), there are the following fields: the Neo4j ID (also called the ID of the ontology), the front-end display name, and the stop-word list (for RDF processing).
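Putting the metadata fields together, a metadata record for a file-type entry might look like the following sketch. The English key names stand in for the fields listed above, and all values, as well as the structure of the nested "storage back end" object, are illustrative assumptions.

```python
# Minimal sketch of a file metadata record for file-type data.
# Key names and values are illustrative assumptions based on the fields listed above.
file_metadata = {
    "data_name": "report.pdf",
    "description": "external engineering design document",
    "belonging_user": "alice",
    "belonging_group": "engineering",
    "storage_backend": {                        # fields specific to file-type data
        "file_physical_path": "/datalake/files/",
        "file_physical_name": "report.pdf",
        "hdfs_space_bytes": 1048576,
        "file_owner": "alice",
        "file_group": "engineering",
        "display_path": "/files/",
        "display_name": "report.pdf",
        "file_extension": "pdf",
        "mime_type": "application/pdf",         # MIME type of the file
        "real_size_bytes": 1032192,
        "stop_word_list": [],                   # for RDF processing
    },
    "auxiliary_label": [],                      # RDF triples generated from the data
    "metadata_create_time": "2018-06-28T03:18:58.91",
    "metadata_update_time": "2018-06-28T03:18:58.91",
}
```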
Through the log file and the metadata file, the data management service platform can efficiently and safely organize and manage the data file and accelerate the data storage speed.
In this embodiment, a data field of an external data source to be loaded may be selected, and data of the selected data field may be stored in a distributed file system of a local data lake in the form of a data file. The method comprises the steps that a user can see field information of an external data source on a local data lake server management interface after connecting the external data source, and can further select a data field to be imported; further, the user can select all data fields, and the data corresponding to the fields selected by the user can be imported when the data is further copied to the local data lake server.
In this embodiment, converting the non-relational data of the external data source into relational data and storing it in the relational database of the local data lake server means: for data in an external non-relational database, such as JSON data in MongoDB, all keys can be traversed, each value (as an attribute or a keyword) can be parsed, and the data can be parsed according to the type of each value, thereby completing the conversion from non-relational data to relational data.
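The following Python sketch illustrates one way such a conversion could be done: it traverses the keys of each JSON document, flattens nested values into column names, and emits rows that could then be written to a relational table. The function names, the dotted-column convention, and the example documents are illustrative assumptions, not the exact routine of the invention.

```python
# Minimal sketch: flatten JSON (non-relational) documents into relational rows
# by traversing all keys and parsing each value according to its type.
import json
from typing import Any, Dict, List


def flatten_document(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Traverse keys; nested objects become dotted column names, arrays become JSON text."""
    row: Dict[str, Any] = {}
    for key, value in doc.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten_document(value, prefix=f"{column}."))  # recurse into objects
        elif isinstance(value, list):
            row[column] = json.dumps(value)                           # serialize arrays
        else:
            row[column] = value                                       # scalars map directly
    return row


def documents_to_rows(documents: List[Dict[str, Any]]):
    rows = [flatten_document(d) for d in documents]
    columns = sorted({c for r in rows for c in r})                    # union of all columns
    return columns, [[r.get(c) for c in columns] for r in rows]


# Example usage with two MongoDB-style JSON documents:
cols, rows = documents_to_rows([
    {"_id": 1, "name": "pump", "spec": {"power_kw": 5.5}},
    {"_id": 2, "name": "valve", "tags": ["brass", "dn50"]},
])
print(cols)   # ['_id', 'name', 'spec.power_kw', 'tags']
print(rows)
```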
In addition, the invention can lead out the local non-relational data into the relational data for the external application program to use according to the requirement of the external application program on the data.
In the invention, the data in a file can be cleaned and fields extracted on the local data lake server and stored in its relational database or document database, or the data file can be stored in the distributed file system. Specifically, after the data of the data source has been copied into a file in the distributed file system of the data lake server, or copied into the relational or document database, the user may further view the data of the various fields of the data file and perform data cleaning operations, including: deleting null values, filling null values, replacement, aggregation, completion, binning, clustering, regression, and so on; further, the user can select fields across tables for a joint query; the user can also import the cleaned data into the relational database of the local data lake server, or save it into the document database of the local data lake server, according to actual needs.
The "cross-table selection field for joint query" refers to a cross-table query method based on Neo4j, and specifically includes the following steps: at a data lake management platform, a user inputs a database name to be queried, wherein the database name is imported into a data lake, and after execution, the platform searches relevant information of a graph from a specified position of Neo4 j; further, the user fills in initial information including "data table name" and "data column name", and fills in end information including "data table name" and "data column name" of the desired target; further, according to the input information of the user, the shortest path from the starting point to the end point is queried in Neo4 j; further, the tables contained in the shortest path are loaded into Spark (Spark is an open source cluster computing environment similar to Hadoop), and the tables are connected according to the information in Neo4 j; further, ETL operations (extraction, cleaning, conversion, loading of data) can be performed on the linked tables; further, the results of the user query are returned.
The preparation basis of the cross-table query is that, when an external database is imported into the local data lake, the data lake management platform simultaneously reads the metadata description (schema) of the database, obtains the descriptions of all fields in the database to be imported, then establishes relationships such as the one from column a of table A to column b of table B, traverses all variables, and saves the relationship between any two variables into Neo4j.
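Given such a column-relationship graph, the query side of this mechanism could be sketched as follows under stated assumptions: the graph database is asked for the shortest path between the start and end columns, the tables on that path are loaded into Spark over JDBC, and the tables are joined along the recorded relationships. The node labels, the Cypher query, the JDBC URL, and the join keys are all illustrative assumptions.

```python
# Minimal sketch: cross-table joint query driven by the shortest path in the graph database.
# Connection parameters, node labels, table names, and join columns are assumptions.
from neo4j import GraphDatabase
from pyspark.sql import SparkSession


def shortest_table_path(uri, auth, start_table, start_column, end_table, end_column):
    """Ask the graph database for the shortest path from the start column to the end column."""
    query = (
        "MATCH (a:Column {table: $st, name: $sc}), (b:Column {table: $et, name: $ec}), "
        "p = shortestPath((a)-[*..10]-(b)) RETURN [n IN nodes(p) | n.table] AS tables"
    )
    with GraphDatabase.driver(uri, auth=auth) as driver, driver.session() as session:
        record = session.run(query, st=start_table, sc=start_column,
                             et=end_table, ec=end_column).single()
        return record["tables"] if record else []


def join_along_path(tables, jdbc_url, jdbc_props, join_keys):
    """Load every table on the path into Spark and join them along the recorded relationships."""
    spark = SparkSession.builder.appName("datalake-cross-table-query").getOrCreate()
    result = None
    for name in tables:
        df = spark.read.jdbc(jdbc_url, name, properties=jdbc_props)
        result = df if result is None else result.join(df, on=join_keys[name])
    return result
```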
In the invention, the user can also share the file which is stored in the distributed file system by the user to other users. The method specifically comprises the following steps: when the user registers in the data lake server, the user has the sharing authority, and the user has the authority to share the file imported into the distributed file system to other users. The various data sources can be imported by different users, each user can only see the imported data by himself under the default condition, and if the data file is in the distributed file system, the user can share the data file. The user can also set three permissions for the data imported by the user: private, visible within the group and public, the various permissions of the user are set by the administrator of the data lake server, including: access data, access functional module components, create user groups, and the like.
In the invention, the data lake server records the operation processes of data import, cleaning and extraction, together with the related operation logs, and stores them in the document database for tracking data processing and log analysis. Specifically, this means that the operation process and relevant operation parameters of importing external database data, external data files or external stream data into the local data lake server can be stored by the data lake server into the local document database, specifically including: metadata information such as the file name, data fields, type, total number of records, number of successfully written records, number of unsuccessfully written records, and the write date and time, and further the user information of the data operation, the source of the data, and the storage location.
When file-type data is imported, the local data lake server can also extract information in the data file and store the information in the document database and the graph database, so as to facilitate retrieval and the establishment of graph-data relationships; the user can also view the imported data files and view the information of the data lake. Specifically, "the local data lake server may also extract information in the data file and store the information in the document database and the graph database" means: the data lake server can extract metadata information such as the header information of relational data and the type, source and file attributes of file data; it can also extract text, picture content and the like from PDF files, picture files, video files and other kinds of engineering design files, and the extracted information is stored in the document database in JSON form; further, an RDF description may be created for each file or each data record to facilitate retrieval and the creation of graph-data relationships, in preparation for building the graph database. The methods for extracting information in a data file include: image recognition methods, voice recognition methods, text filtering methods, video file processing methods, and the like. Further, through the file-processing functional component, the data lake server can fill in a data file and import it to a target address of the distributed file system, display the fields and remarks, and determine whether to perform the RDF operation on each piece of data. In a graph database, RDF consists of nodes representing entities/resources and attributes, and edges representing the relationships between entities and attributes.
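As an illustration of creating an RDF description for a data record, the sketch below builds "subject-predicate-object" triples with the rdflib library. The namespace, predicate names, record fields, and the serialization format are assumptions chosen for the example and are not prescribed by the invention.

```python
# Minimal sketch: build an RDF description (subject-predicate-object triples)
# for one imported data record. Namespace and predicate names are assumptions.
from rdflib import Graph, Literal, Namespace

LAKE = Namespace("http://datalake.example.com/resource/")   # hypothetical namespace


def record_to_rdf(record_id: str, fields: dict) -> Graph:
    g = Graph()
    subject = LAKE[record_id]                    # the record is the subject
    for field_name, value in fields.items():
        predicate = LAKE[field_name]             # each field becomes a predicate
        g.add((subject, predicate, Literal(value)))   # the field value becomes the object
    return g


graph = record_to_rdf("file-0001", {
    "data_name": "report.pdf",
    "mime_type": "application/pdf",
    "belonging_user": "alice",
})
print(graph.serialize(format="turtle"))   # triples ready for retrieval / the graph database
```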
When relational data is imported, the local data lake server can extract the information in the data file and store the information in the document database and the graph database. When non-relational data is imported, the local data lake server can extract the information in the data file and store the information in the graph database.
The specific implementation for saving the data of a relational database into the graph database is as follows (a minimal sketch follows these steps); for any relational database:
1) acquiring relevant parameters required by a user connection graph database, such as an address, a port, a user name and a password;
2) reading a table to be extracted in the relational database;
3) cyclically reading the names, attributes, and other contents of the fields in each table, and, in the same way, reading the relationships (mainly foreign-key relationships) between the table and other tables;
4) calling a method that maps the fields of each data record to the attributes of a node, so that the record can be imported into Neo4j, with the record's primary key serving as the identifier of the node;
5) reading the relationships and importing the nodes mapped from the data records into Neo4j.
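A minimal sketch of steps 1) to 5), under assumed connection parameters and a generic table layout, could look like the following. SQLite stands in for "any relational database", the Cypher statements are simplified, and the way foreign keys are supplied (as an explicit mapping) is an assumption rather than the exact implementation.

```python
# Minimal sketch of steps 1)-5): map relational rows to Neo4j nodes (primary key as
# node identifier) and foreign-key references to relationships. Names are assumptions.
import sqlite3                      # stands in for "any relational database"
from neo4j import GraphDatabase


def import_table_to_neo4j(db_path, table, pk, fk_map, neo4j_uri, auth):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()       # 2)-3) read the table

    with GraphDatabase.driver(neo4j_uri, auth=auth) as driver, driver.session() as session:
        for row in rows:
            props = dict(row)
            # 4) map record fields to node attributes; the primary key is the node identifier
            session.run(
                f"MERGE (n:{table} {{id: $id}}) SET n += $props",
                id=props[pk], props=props,
            )
            # 5) read the (foreign-key) relationships and import them as edges
            for fk_column, target_table in fk_map.items():
                if props.get(fk_column) is not None:
                    session.run(
                        f"MATCH (a:{table} {{id: $id}}), (b:{target_table} {{id: $ref}}) "
                        "MERGE (a)-[:REFERENCES]->(b)",
                        id=props[pk], ref=props[fk_column],
                    )


# Example usage (hypothetical schema): orders.customer_id references customers.id
# import_table_to_neo4j("lake.db", "orders", "id", {"customer_id": "customers"},
#                       "bolt://localhost:7687", ("neo4j", "password"))
```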
Specifically, the user may also view the imported data file and view the information of the data lake, which refer to: the data lake server administrator or authorized users can view data in different databases, view statistics of the data lake, total data amount, file types, data amount of various file types, and perform retrieval and export operations on the data lake.
As shown in FIG. 1, as an embodiment of the present invention: the local data lake server is composed of the relational database MariaDB, the document database MongoDB, the distributed file system HDFS, and the graph database Neo4j. An external data source from an Oracle database has the IP address 192.168.12.101, port 8080, user name admin, and password passwd.
The user starts the connection service on the software interface, inputs the access interface information of the data source, and the connection succeeds. The data table information of the Oracle data source is then shown on the interface; the data table to be imported can be selected there, and the field information of the data table is further displayed. The user checks each field of the data table, selects the fields to be imported, and confirms the import; the server background then executes the import operation and imports the data table of the selected fields into local memory. Next, the user checks the data of each field in the local data table, performs data cleaning operations, replaces null values in some fields, and also replaces some abnormal values. The processed data table is then saved to the local MariaDB, while the data file is saved in CSV (comma-separated values) format to the local distributed file system.
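A compact sketch of that embodiment's cleaning and save steps, using pandas with assumed connection strings, column names, and paths, might look like this (the cleaning rules shown are placeholders for whatever the user actually chooses):

```python
# Minimal sketch of the embodiment: clean the selected fields, save the table to MariaDB,
# and keep a CSV copy in HDFS. Connection strings, columns, and paths are assumptions.
import pandas as pd
from sqlalchemy import create_engine
from hdfs import InsecureClient

# Data table of the selected fields, already copied from the Oracle source (assumed columns).
df = pd.read_csv("/tmp/imported_oracle_table.csv")

# Data cleaning: fill null values and replace abnormal values in assumed columns.
df["quantity"] = df["quantity"].fillna(0)
df["status"] = df["status"].replace({"N/A": "unknown"})
df.loc[df["quantity"] < 0, "quantity"] = 0          # treat negative quantities as abnormal

# Save the processed table to the local MariaDB.
engine = create_engine("mysql+pymysql://admin:passwd@localhost:3306/datalake")
df.to_sql("oracle_orders", engine, if_exists="replace", index=False)

# Save the data file in CSV format to the local distributed file system.
client = InsecureClient("http://namenode.example.com:9870", user="datalake")
with client.write("/datalake/files/oracle_orders.csv", overwrite=True, encoding="utf-8") as w:
    df.to_csv(w, index=False)
```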
In the above operation process, the server platform records the log of each step operation, and stores the log in the document type database of the server platform. Further, the user can share the data file to other users for use.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (13)

1. A method for importing multi-source heterogeneous data into a data lake, which is characterized by comprising the following processes:
acquiring access interface information of an external data source, connecting a local data lake server with the external data source, importing data of the external data source, and storing the data into a distributed file system of the local data lake server in the form of a data file, wherein the external data source comprises an external database and an external stream data source;
and/or obtaining access interface information of an external data source, connecting the local data lake server with the external data source, converting non-relational data of the external data source into relational data, storing the relational data into a relational database of the local data lake server, or directly importing the relational data of the external data source, and storing the relational data into the relational database of the local data lake server;
and/or obtaining access interface information of an external data source, connecting the local data lake server with the external data source, importing non-relational data of the external data source, and storing the non-relational data in a document type database of the local data lake server;
and/or obtaining an access interface address of the external file type data, directly importing the external file type data, and storing the external file type data in a distributed file system of the local data lake server.
2. The method of importing multi-source heterogeneous data into a data lake of claim 1,
the step of obtaining the access interface information of the external data source refers to obtaining one or more of an IP address, a port number, a user name and a password of an interface of the external data source.
3. The method of importing multi-source heterogeneous data into a data lake of claim 1,
the user can share the data file stored in the distributed file system to other users, and the method further comprises the following steps:
the user has sharing authority when registering in the data lake server and has the right to share the data file imported into the distributed file system to other users;
various data sources can be imported by different users, and each user can only see the imported data files under the default condition;
when the data file is in the distributed file system, the user can share the data file;
the user can set various permissions including private permissions, permissions visible in the group and public permissions for the data file imported by the user, and the various permissions of the user are set by an administrator of the data lake server.
4. The method of importing multi-source heterogeneous data into a data lake of claim 1,
the data lake server is a data storage and management service platform comprising four databases, namely a relational database, a document database, a distributed file system and a graph database, the platform adopts a distributed operation and storage architecture, integrates various computers, servers and computer clusters/server clusters with data storage and operation functions, and provides various functional components including data management and algorithm development.
5. The method of importing multi-source heterogeneous data into a data lake of claim 1,
the local data lake server records the operation process and relevant operation parameters of importing external database data, external stream data, or external file-type data into the local data lake server, and stores them in the document database of the local data lake server for tracking data processing and log analysis;
the data exchange management of the local data lake server may be based on journaling data stored in a document-based database in the form of key-value pairs and file metadata in the form of key-value pairs.
6. The method of importing multi-source heterogeneous data into a data lake of claim 1, further comprising:
selecting data fields of an external data source to be loaded, and storing the data of the selected data fields into a distributed file system of a local data lake server in a data file form;
the data field of the external data source to be loaded is selected, namely after the local data lake server is connected with the external data source, a user sees field information of the external data source on a management interface of the local data lake server, and further selects a data field to be imported; the user can select all the data fields, and the data corresponding to the fields selected by the user can be imported when the data is copied to the local data lake server in the next step.
7. The method of importing multi-source heterogeneous data into a data lake of claim 1 or 6, further comprising:
after the data of the external data source is copied into a file in a distributed file system of the local data lake server or copied into a relational database/document database of the local data lake server, a user can further check the data of each field of the data file and perform data cleaning operation;
and the user imports the cleaned data into a relational database of the local data lake server or stores the cleaned data into a document database of the local data lake server according to actual needs.
8. The method of importing multi-source heterogeneous data into a data lake of claim 7,
the user can perform joint query across table selection fields, the joint query across table selection fields is a table-crossing query method based on a graph database, and the method specifically comprises the following processes:
in the data lake management platform, a user inputs a database name which is to be inquired and is imported into a data lake, and after the database name is executed, the platform searches relevant information of a graph from a specified position of a graph database;
the user fills in initial information containing a data table name and a data column name, and fills in end information containing the data table name and the data column name of a required target;
inquiring a shortest path from a starting point to a destination point in a graph database according to input information of a user;
loading the tables contained in the shortest path into Spark and connecting the tables according to the information in the graph database;
performing relevant operation on the connected table;
and returning the result of the user query.
9. The method of importing multi-source heterogeneous data into a data lake of claim 1, further comprising:
and according to the requirement of the external application program on the data, exporting the non-relational data of the local data lake server into relational data for the external application program to use.
10. The method of importing multi-source heterogeneous data into a data lake of claim 9,
the method for converting the non-relational data of the external data source into relational data comprises: for data in a non-relational database of the external data source, the conversion from non-relational data to relational data is completed by traversing all keys, parsing each value, and parsing the data according to the type of each value.
11. The method of importing multi-source heterogeneous data into a data lake of claim 1, further comprising:
when the file-type data is imported, the local data lake server extracts the information in the data file and stores the information in the document database and the graph database;
and/or when the relational data are imported, the local data lake server stores the information in the data file into the document database and the graph database by extracting the information in the data file;
and/or when the non-relational data are imported, the local data lake server stores the information in the data file into the graph database by extracting the information in the data file;
the method for extracting the information in the data file by the local data lake server comprises one or more of an image recognition method, a voice recognition method, a text filtering method and a video file processing method.
12. The method of importing multi-source heterogeneous data into a data lake of claim 11,
when the relation data is saved to the graph database of the local data lake server, the method further comprises the following steps:
1) acquiring relevant parameters required by a connection graph database, wherein the relevant parameters comprise an address, a port, a user name and a password;
2) reading a table to be extracted in a relational database;
3) circularly reading the related content of the fields in each table and reading the relationship between the table and other tables;
4) calling a method that maps the fields of each data record to the attributes of a node, so that the record can be imported into the graph database, with the record's primary key serving as the identifier of the node;
5) reading the relationships and importing the nodes mapped from the data records into the graph database.
13. The method for importing multi-source heterogeneous data into a data lake according to claim 1, 6, 11 or 12,
the user can view the imported data file and view the information of the data lake server, and the method further comprises the following steps: the local data lake server administrator or an authorized user can check the data in different databases, check the statistics, the total data amount, the file types and the data amounts of various file types of the data lake server, and perform retrieval and export operations on the data lake server.
Priority Applications (1)

CN201811438360.4A, priority date 2018-11-27, filing date 2018-11-27: Method for importing multi-source heterogeneous data into data lake (Pending)

Publications (1)

CN111221791A, publication date 2020-06-02

Family

ID=70827437

Country Status (1)

CN (1) CN111221791A (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination