CN115774750A

CN115774750A - Database lake entering configuration method and system, electronic equipment and storage medium

Info

Publication number: CN115774750A
Application number: CN202211698949.4A
Authority: CN
Inventors: 侯宇辉; 边琪; 闫云鹏; 蒙泽敏
Original assignee: Jingying Digital Technology Co Ltd
Current assignee: Jingying Digital Technology Co Ltd
Priority date: 2022-12-28
Filing date: 2022-12-28
Publication date: 2023-03-10

Abstract

The invention provides a database lake entering configuration method and system, an electronic device and a storage medium. The method comprises: first, acquiring configuration items and a mapping relation table of a data source needing lake entering operation, where the data source comprises a plurality of databases and the configuration items at least include a data source type, a data task and a data source connection item; then determining field information of the database in the lake according to the configuration items and the mapping relation table; generating a create directory (catalog) statement by using an identifier of the data source, and generating a create downstream table statement by using the data task; generating a data source mapping statement by using the data source connection item and the mapping relation table, and generating a full query insertion statement by using the mapping relation table; and finally, executing these statements according to the configuration items, so as to control the plurality of databases in the data source to complete the lake entering operation in sequence. The method automates the lake entering configuration of the database, supports automatic pulling of metadata, flexibly supports the integration of stream and batch processing, and reduces the cost to the user of accessing multi-source heterogeneous data.

Description

Method and system for configuring database in lake, electronic equipment and storage medium

Technical Field

The invention relates to the field of database interaction, in particular to a method and a system for configuring a database in lake, electronic equipment and a storage medium.

Background

At present, with the digital transformation of all industries, high-frequency acquisition equipment and intelligent equipment acquire more and more data, so that efficient, simple and reliable data transmission becomes the most critical link in the field of big data.

A data lake is a system or repository that stores data in its natural (raw) format, typically as object blobs or files, and that accommodates data of various schemas and structural forms. The main idea of the data lake is to store all data in the enterprise uniformly, converting raw data into target data for various tasks such as reporting, visualization, analysis, and machine learning.

As data lakes grow in scale, more and more data sources need to be connected, and these data sources differ greatly from one another. Users with multi-source data have to configure lake entering manually, which greatly increases the cost of using the data lake.

Disclosure of Invention

In view of this, the present invention provides a method, a system, an electronic device and a storage medium for configuring a database in a lake. The method automates the lake entering configuration of the database, supports automatic pulling of metadata, flexibly supports the integration of stream and batch processing, and reduces the cost to the user of accessing multi-source heterogeneous data. The user can easily connect an external data source to the local data lake, so that data enters the lake efficiently.

In a first aspect, an embodiment of the present invention provides a method for configuring a database in a lake, where the method includes:

acquiring configuration items and a mapping relation table of a data source needing lake entering operation; the data source comprises a plurality of databases; the configuration items at least include: data source type, data task and data source connection item;

determining field information of the database in the lake according to the configuration item and the mapping relation table;

generating a create directory statement by using an identifier of the data source, and generating a create downstream table statement by using the data task;

generating a data source mapping statement by using the data source connection item and the mapping relation table, and generating a full query insertion statement by using the mapping relation table;

and executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration items, so as to control the plurality of databases in the data source to complete the lake entering operation in sequence.

In one embodiment, the step of determining the field information of the database in the lake according to the configuration item and the mapping relation table includes:

recording the data source entering the lake by using the configuration items of the data source;

determining field information of a data table contained in a database according to the structural parameters of the data source;

and inquiring the mapping relation table according to the type of the data source, acquiring a field type corresponding to the lake entering field according to the inquiry result, and determining field information of the database in the lake by using the field type.

In one embodiment, the steps of generating a create directory statement using an identifier of a data source and generating a create downstream table statement using a data task include:

acquiring a data subject contained in the data source, determining a directory English name according to a pinyin abbreviation of the data subject and the identifier of the data source, and generating the create directory statement according to the directory English name;

and acquiring downstream configuration information according to the data source type and the data task, determining the downstream table information by using the downstream configuration information, the upstream data table information and the data theme, and generating a downstream table statement according to the downstream table information.

In one embodiment, the step of generating a data source mapping statement using the data source connection item and the mapping relation table, and generating a full query insertion statement using the mapping relation table includes:

acquiring a lake internal data table field and a preset virtual mapping table, determining a condition parameter of the data mapping table to be generated according to the data source connection item, the lake internal data table field, the virtual mapping table and the mapping relation table, and determining a data source mapping statement according to the condition parameter;

and determining a full query insertion statement according to the mapping table name, the downstream table name and the field name determined by the mapping relation table.

In one embodiment, the step of executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration items includes:

adding the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement into a preset instruction group;

and traversing the instruction group, and sequentially executing, under the control of the configuration items, the statements contained in the instruction group in the order in which they were added.

In one embodiment, the obtaining process of the mapping relation table includes:

determining the field type contained in the data source according to the data source type;

and generating a mapping relation table of the field types according to the corresponding relation between the data source type and the types in the lake by using the field types.

In one embodiment, the configuration item acquisition process includes:

determining the type of a table storage engine, the compression format of a data file, the target size of metadata and the historical outdated snapshot duration which are obtained from a data source as configuration items of a downstream data table;

acquiring parallelism data and checkpoint interval duration corresponding to a data source, and determining the parallelism data and the checkpoint interval duration as task operation configuration items;

and determining the downstream data table configuration item and the task operation configuration item as configuration items.

In a second aspect, an embodiment of the present invention provides a database lake entering configuration system, including:

the configuration module is used for acquiring configuration items of a data source needing lake entering operation and a mapping relation table; the data source comprises a plurality of databases; the configuration items at least include: data source type, data task and data source connection item;

the field information determining module is used for determining the field information of the database in the lake according to the configuration item and the mapping relation table;

the first generation module is used for generating a create directory statement by using an identifier of the data source and generating a create downstream table statement by using the data task;

the second generation module is used for generating a data source mapping statement by using the data source connection item and the mapping relation table and generating a full query insertion statement by using the mapping relation table;

and the lake entering module is used for executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration items, and controlling the plurality of databases in the data source to sequentially complete the lake entering operation.

In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a processor and a memory, where the memory stores computer-executable instructions capable of being executed by the processor, and the processor executes the computer-executable instructions to implement the steps of the database-lake-entering configuration method provided in the first aspect.

In a fourth aspect, embodiments of the present invention further provide a storage medium, where the storage medium stores computer-executable instructions, and when the computer-executable instructions are called and executed by a processor, the computer-executable instructions cause the processor to implement the steps of the database-in-lake configuring method provided in the first aspect.

The embodiments of the invention provide a database lake entering configuration method and system, an electronic device and a storage medium. The method comprises: first, acquiring configuration items and a mapping relation table of a data source needing lake entering operation, where the data source comprises a plurality of databases and the configuration items at least include a data source type, a data task and a data source connection item; then determining field information of the database in the lake according to the configuration items and the mapping relation table; generating a create directory statement by using an identifier of the data source, and generating a create downstream table statement by using the data task; generating a data source mapping statement by using the data source connection item and the mapping relation table, and generating a full query insertion statement by using the mapping relation table; and finally, executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration items, so as to control the plurality of databases in the data source to complete the lake entering operation in sequence. The method automates the lake entering configuration of the database, supports automatic pulling of metadata, flexibly supports the integration of stream and batch processing, and reduces the cost to the user of accessing multi-source heterogeneous data. The user can easily connect an external data source to the local data lake, so that data enters the lake efficiently.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flowchart of a method for configuring a database into a lake according to an embodiment of the present invention;

FIG. 2 is a flowchart of determining field information of a database in a lake according to the configuration items and the mapping relation table in the database lake entering configuration method according to an embodiment of the present invention;

FIG. 3 is a flowchart of generating a create directory statement by using an identifier of a data source and generating a create downstream table statement by using a data task in the database lake entering configuration method according to an embodiment of the present invention;

FIG. 4 is a flowchart of generating a data source mapping statement by using a data source connection item and a mapping relation table, and generating a full query insertion statement by using the mapping relation table, in the database lake entering configuration method according to an embodiment of the present invention;

FIG. 5 is a flowchart of executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration items in the database lake entering configuration method according to an embodiment of the present invention;

FIG. 6 is a flowchart of the process of obtaining the mapping relation table in the database lake entering configuration method according to an embodiment of the present invention;

FIG. 7 is a flowchart of the process of obtaining the configuration items in the database lake entering configuration method according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a method for configuring a database into a lake according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a database lake entering configuration system according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Reference numerals:

910-configuration module; 920-field information determining module; 930-first generation module; 940-second generation module; 950-lake entering module;

101-processor; 102-memory; 103-bus; 104-communication interface.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As data lakes grow in scale, more and more data sources need to be connected, and these data sources differ greatly from one another. Users with multi-source data have to configure lake entering manually, which greatly increases the cost of using the data lake. Based on this, the database lake entering configuration method and system, electronic device and storage medium provided by the embodiments of the present invention can automate the lake entering configuration of the database, support automatic pulling of metadata, flexibly support the integration of stream and batch processing, and reduce the cost to the user of accessing multi-source heterogeneous data. The user can easily connect an external data source to the local data lake, so that data enters the lake efficiently.

To facilitate understanding of the embodiment, a method for configuring a database in a lake, disclosed in the embodiment of the present invention, is first described in detail, and as shown in fig. 1, the method includes:

Step S101, acquiring configuration items and a mapping relation table of a data source needing lake entering operation; the data source comprises a plurality of databases; the configuration items at least include: a data source type, a data task and a data source connection item.

The configuration items of the data source mainly comprise the data source type, the data task, the data source connection item and the like; the mapping relation table mainly represents the correspondence between the types of the external data source and the types inside the data lake. The data source comprises a plurality of databases, which can be of different types, such as MySQL, Oracle, SQL Server, MongoDB and the like.

Step S102, determining field information of the database in the lake according to the configuration items and the mapping relation table.

The data source entering the lake is recorded according to the configuration item information, in combination with the mapping relation table. When the data source is structured, the metadata of the data source is queried to obtain all table information, and the field information of the selected tables is then queried. When the data source is semi-structured, the user can manually fill in sample data, and the data structure is parsed from the sample to generate the field information of the table.

Step S103, generating a create directory statement by using the identifier of the data source, and generating a create downstream table statement by using the data task.

This step uses the identifier of the data source to generate the create directory statement, which may be an SQL statement. When the create downstream table statement is generated from the data task, the name of the lake entering data table is kept unchanged, and the configuration information for creating the downstream table is generated from the type-related data of the data task, from which the statement for creating the downstream table is generated. Since the lake entering task inserts data in full, the create downstream table statement is built according to the mapped field types of the data source table.

Step S104, generating a data source mapping statement by using the data source connection item and the mapping relation table, and generating a full query insertion statement by using the mapping relation table.

In this step, the data source mapping statement is generated by combining the connection information corresponding to the data source connection item with the mapping relation; the full query insertion statement is generated according to the mapping table name and the attributes of the corresponding downstream table.

Step S105, executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration items, and controlling the plurality of databases in the data source to sequentially complete the lake entering operation.

After the statements are obtained, they are executed in the corresponding order, thereby controlling the databases in the data source to complete the lake entering operation in sequence.

In one embodiment, the step S102 of determining the field information of the database in the lake according to the configuration item and the mapping relation table, as shown in fig. 2, includes:

Step S201, recording the data source entering the lake by using the configuration items of the data source;

Step S202, determining field information of the data tables contained in the database according to the structural parameters of the data source;

Step S203, querying the mapping relation table according to the data source type, acquiring the field type corresponding to each lake entering field according to the query result, and determining the field information of the database in the lake by using the field types.

Specifically, the data source entering the lake is recorded according to the configuration item information of the data source. When the data source is structured, the metadata of the data source is queried to obtain the information of the data tables contained in the database, and the field information of the selected tables is obtained. The mapping relation table is then queried according to the data source type, the field type corresponding to each lake entering field is obtained from the query result, and the field information of the database in the lake is determined by using the field types.
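The lookup described in steps S201 to S203 can be sketched as follows. This is a minimal illustration only: the `TYPE_MAPPING` entries, the `Field` class and the function name `to_lake_fields` are hypothetical and not taken from the source, and the in-lake type names are placeholders.

```python
from dataclasses import dataclass

# Hypothetical excerpt of a mapping relation table:
# (data source type, source field type) -> field type in the lake.
TYPE_MAPPING = {
    ("mysql", "varchar"): "STRING",
    ("mysql", "int"): "INT",
    ("mysql", "datetime"): "TIMESTAMP",
    ("oracle", "varchar2"): "STRING",
    ("oracle", "number"): "DECIMAL",
}


@dataclass(frozen=True)
class Field:
    name: str
    type: str


def to_lake_fields(source_type, source_fields, mapping=TYPE_MAPPING):
    """Translate source field types to in-lake field types via the mapping table."""
    lake_fields = []
    for field in source_fields:
        key = (source_type.lower(), field.type.lower())
        if key not in mapping:
            # An unmapped type is reported instead of being guessed.
            raise KeyError(f"no lake type mapped for {key}")
        lake_fields.append(Field(field.name, mapping[key]))
    return lake_fields


# Field information as it might be read from the source metadata.
source_fields = [Field("id", "INT"), Field("name", "VARCHAR")]
lake_fields = to_lake_fields("MySQL", source_fields)
```

Keying the table on the pair of data source type and field type keeps one table usable for several heterogeneous sources.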

In one embodiment, the step S103 of generating a create directory statement by using an identifier of a data source and generating a create downstream table statement by using a data task includes:

Step S301, obtaining a data subject contained in the data source, determining a directory English name according to the pinyin abbreviation of the data subject and the identifier of the data source, and generating the create directory statement according to the directory English name.

A directory English name is generated from the subject to which the data in the selected data source belongs, using the pinyin abbreviation of the subject and the data source identifier; the create directory SQL statement is generated from this English name and added to the SQL group information.

Step S302, obtaining downstream configuration information according to the data source type and the data task, determining the downstream table information by using the downstream configuration information, the upstream data table information and the data subject, and generating a downstream table creating statement according to the downstream table information.

The downstream configuration information is acquired according to the selected data source type and the selected task type. The name of the lake entering data table is kept unmodified; the create downstream table SQL statement is generated according to the upstream data table name, the selected data subject, the field information of the data table in the lake and the downstream table configuration information, and is added to the SQL group information.
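Steps S301 and S302 can be illustrated with a short sketch. The SQL dialect is not named in the source, so the statement shapes here are assumptions: the directory is rendered as `CREATE CATALOG`, the downstream table is qualified as `directory.table`, and the option keys are placeholders. The pinyin abbreviation is passed in as an input rather than computed, and all function names are hypothetical.

```python
def directory_name(subject_abbr, source_id):
    """Join the subject's pinyin abbreviation and the data source identifier."""
    return f"{subject_abbr}_{source_id}".lower()


def create_directory_sql(name):
    # Rendering the directory as a catalog is an assumption about the dialect.
    return f"CREATE CATALOG {name};"


def create_downstream_table_sql(directory, upstream_table, lake_fields, table_options):
    """Build the downstream table DDL; the upstream table name is reused unchanged."""
    columns = ",\n  ".join(f"{name} {type_}" for name, type_ in lake_fields)
    options = ",\n  ".join(f"'{k}' = '{v}'" for k, v in table_options.items())
    return (
        f"CREATE TABLE {directory}.{upstream_table} (\n  {columns}\n)"
        f" WITH (\n  {options}\n);"
    )


# "SB" stands for a subject's pinyin abbreviation, "DS01" for a source identifier.
name = directory_name("SB", "DS01")
sql_group = [
    create_directory_sql(name),
    create_downstream_table_sql(
        name,
        "orders",
        [("id", "INT"), ("name", "STRING")],
        {"storage_engine": "example_engine", "compression": "example_codec"},
    ),
]
```

Appending both statements to one list mirrors the SQL group information mentioned in the text, which preserves the order of addition.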

In one embodiment, the step S104 of generating a data source mapping statement by using the data source connection item and the mapping relation table, and generating a full query insertion statement by using the mapping relation table includes:

Step S401, obtaining the fields of the data table in the lake and a preset virtual mapping table, determining the condition parameters of the data mapping table to be generated according to the data source connection item, the fields of the data table in the lake, the virtual mapping table and the mapping relation table, and determining the data source mapping statement according to the condition parameters.

The data source mapping table SQL statement is generated according to the selected data source connection information, the fields of the data table in the lake, the fixed virtual mapping table name and the data source configuration information, and is added to the SQL group.

Step S402, determining a full query insertion statement according to the mapping table name, the downstream table name and the field name determined by the mapping relation table.

The full query insertion SQL statement is generated according to the mapping table name, the downstream table name and the field names, and is added to the SQL group.
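A minimal sketch of steps S401 and S402 follows, assuming a generic `CREATE TABLE ... WITH (...)` form for the mapping table. The option keys `url` and `table`, the example address and the table name `virtual_orders` are hypothetical placeholders, not values from the source.

```python
def source_mapping_sql(mapping_table, lake_fields, connection):
    """Declare a virtual mapping table over the external source."""
    columns = ", ".join(f"{name} {type_}" for name, type_ in lake_fields)
    options = ", ".join(f"'{k}' = '{v}'" for k, v in connection.items())
    return f"CREATE TABLE {mapping_table} ({columns}) WITH ({options});"


def full_query_insert_sql(mapping_table, downstream_table, field_names):
    """Copy every row of the mapping table into the downstream table."""
    fields = ", ".join(field_names)
    return (
        f"INSERT INTO {downstream_table} ({fields}) "
        f"SELECT {fields} FROM {mapping_table};"
    )


lake_fields = [("id", "INT"), ("name", "STRING")]
connection = {"url": "jdbc:mysql://db.example.com:3306/shop", "table": "orders"}

mapping_sql = source_mapping_sql("virtual_orders", lake_fields, connection)
insert_sql = full_query_insert_sql(
    "virtual_orders", "sb_ds01.orders", [name for name, _ in lake_fields]
)
```

Listing the field names explicitly on both sides of the insert keeps the column order of the mapping table and the downstream table aligned.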

In one embodiment, the step of executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration items, as shown in fig. 5, includes:

Step S501, adding the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement into a preset instruction group.

The instruction group in this step is the SQL group mentioned in the above embodiments; the order in which statements are added to the instruction group needs to be preserved for subsequent execution.

Step S502, traversing the instruction group, and sequentially executing, under the control of the configuration items, the statements contained in the instruction group in the order in which they were added.

The SQL group is submitted to the cluster according to the task configuration information, the SQL statements are executed sequentially in the order in which they were added to the SQL group, and the data task is run.
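The ordered execution of steps S501 and S502 can be sketched with a small container that preserves insertion order. The class name `SqlGroup` is hypothetical, and an in-memory SQLite database stands in for the cluster purely so that the example runs; the statements are simplified accordingly.

```python
import sqlite3


class SqlGroup:
    """Instruction group that keeps statements in the order they were added."""

    def __init__(self):
        self._statements = []

    def add(self, sql):
        self._statements.append(sql)

    def run(self, execute):
        # Traverse the group and execute each statement in insertion order.
        for sql in self._statements:
            execute(sql)


group = SqlGroup()
group.add("CREATE TABLE virtual_orders (id INTEGER, name TEXT)")
group.add("INSERT INTO virtual_orders VALUES (1, 'a'), (2, 'b')")
group.add("CREATE TABLE downstream_orders (id INTEGER, name TEXT)")
group.add(
    "INSERT INTO downstream_orders (id, name) "
    "SELECT id, name FROM virtual_orders"
)

conn = sqlite3.connect(":memory:")  # stand-in for the cluster
group.run(conn.execute)
rows = conn.execute("SELECT id, name FROM downstream_orders ORDER BY id").fetchall()
```

The order matters here: the insert statements would fail if they ran before the tables they reference were created.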

In one embodiment, the obtaining process of the mapping relationship table, as shown in fig. 6, includes:

Step S601, determining the field types contained in the data source according to the data source type;

Step S602, generating, by using the field types, a mapping relation table of the field types according to the correspondence between the data source types and the types in the lake.

The field type mapping relation table is generated according to the correspondence between the types of the external data source and the types in the data lake. In the process of acquiring the mapping relation table, SQL statement templates can also be prepared for the subsequent generation of SQL statements, including a template SQL for creating a directory, a template SQL for creating the mapping table of an external data table, a template SQL for creating a downstream data table, and a template SQL for querying and storing data.
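The four SQL statement templates mentioned above could be held in a small registry and filled in later. The sketch below uses Python's standard `string.Template`; the template texts, the placeholder names and the `render` helper are assumptions for illustration, since the source does not give the template contents.

```python
from string import Template

# Hypothetical template texts, one per statement kind named in the text.
TEMPLATES = {
    "create_directory": Template("CREATE CATALOG $directory;"),
    "create_mapping_table": Template(
        "CREATE TABLE $mapping_table ($columns) WITH ($options);"
    ),
    "create_downstream_table": Template(
        "CREATE TABLE $downstream_table ($columns) WITH ($options);"
    ),
    "query_insert": Template(
        "INSERT INTO $downstream_table SELECT $fields FROM $mapping_table;"
    ),
}


def render(kind, **values):
    """Fill a template; substitute() raises KeyError if a placeholder is missing."""
    return TEMPLATES[kind].substitute(**values)


insert_sql = render(
    "query_insert",
    downstream_table="orders",
    fields="id, name",
    mapping_table="virtual_orders",
)
```

Using `substitute` rather than `safe_substitute` means an incomplete statement is rejected at generation time instead of being submitted with a placeholder left in it.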

In one embodiment, the configuration item obtaining process, as shown in fig. 7, includes:

Step S701, determining the table storage engine type, the data file compression format, the metadata target size and the historical expired snapshot duration obtained from the data source as downstream data table configuration items;

Step S702, acquiring the parallelism data and the checkpoint interval duration corresponding to the data source, and determining the parallelism data and the checkpoint interval duration as task running configuration items;

Step S703, determining the downstream data table configuration items and the task running configuration items as the configuration items.

Specifically, the configuration items may include three categories: data source configuration items, including the data source type, the data source structure type, the data task type, the data source connection item and the like; configuration items for creating the downstream data table, including the table storage engine type, the data file compression format, the metadata target size, the historical expired snapshot duration and the like; and task running configuration items, including the parallelism, the checkpoint interval duration and the like.
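The three categories of configuration items can be modelled as plain data classes, as in the sketch below. The field names follow the items listed in the text, but the class names, the units and every example value are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DataSourceConfig:
    source_type: str       # e.g. "mysql"
    structure_type: str    # "structured" or "semi-structured"
    task_type: str         # e.g. "full" or "incremental"
    connection: dict       # data source connection item


@dataclass(frozen=True)
class DownstreamTableConfig:
    storage_engine: str
    compression_format: str
    metadata_target_size_mb: int   # unit is an assumption
    expired_snapshot_hours: int    # unit is an assumption


@dataclass(frozen=True)
class TaskRunConfig:
    parallelism: int
    checkpoint_interval_s: int     # unit is an assumption


@dataclass(frozen=True)
class LakeEnteringConfig:
    source: DataSourceConfig
    table: DownstreamTableConfig
    run: TaskRunConfig


config = LakeEnteringConfig(
    source=DataSourceConfig(
        "mysql", "structured", "full", {"host": "db.example.com", "port": 3306}
    ),
    table=DownstreamTableConfig("example_engine", "example_codec", 8, 72),
    run=TaskRunConfig(parallelism=4, checkpoint_interval_s=60),
)
config_dict = asdict(config)  # nested dict, convenient for saving or display
```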

As shown in fig. 8, the data source includes multiple types of databases such as MySQL, Oracle, SQL Server and MongoDB, whose data is automatically pulled and flows into the data lake.

The method can be realized by the following steps:

the first step is as follows: preparing a data source configuration item, comprising: data source type, data source structure type, data task type, data source link item and the like. Configuration items for preparing to create a downstream data table comprise a table storage engine type, a data file compression format, a metadata target size, a history expired snapshot duration and the like. And preparing task running configuration items including parallelism and checkpoint interval duration.

The second step: preparing an SQL statement template, including creating a directory template SQL, creating a mapping table template SQL of an external data table, creating a template SQL of a downstream data table, and inquiring and storing the data template SQL. And sorting all required data source types, sorting field types according to different data sources of the data sources, and generating a field type mapping relation table according to the corresponding relation between the types of the external data sources and the types in the data lake.

The third step: and recording a data source entering the lake according to the configuration information. And when the data source is structured, inquiring the metadata of the data source, acquiring all table information, and searching field information in all tables according to the selected table information. When the data source is in a semi-structured mode, a user manually fills in sample data, analyzes the data structure and generates table field information.

The fourth step: and inquiring a field type mapping relation table according to the data source type and the data source field type. And according to the mapping relation table, acquiring the type corresponding to the lake entering field, and generating field information of the data table in the lake.

The fifth step: and generating a directory English name according to the theme of the data in the selected data source and the theme Pinyin abbreviation and the data source identifier. And generating a creating directory SQL according to the English name and the creating directory template, and adding the creating directory SQL into the SQL group information.

And a sixth step: and acquiring the downstream configuration information according to the selected data source type and the selected task type. 5, the name of the lake entering data table is not modified, and according to the name of the upstream data table, the selected data theme, the field information of the lake data table and the downstream table configuration information, the created downstream table SQL is generated according to the created downstream table template and is added into the SQL group information.

The seventh step: according to the selected data source connection information, the fields of the data table in the lake and the fixed virtual

And generating 0 external data source mapping table SQL according to the mapping table template for creating the external data table by the mapping table name, the data source configuration information and the external, and adding the SQL into the SQL group.

The eighth step: generate a full query insert SQL statement from the query-and-store data template according to the mapping table name, the downstream table name and the field names, and add the full query insert SQL statement to the SQL group.

The ninth step: submit the SQL to the cluster according to the task configuration information, execute the statements sequentially in the order in which they were added to the SQL group, and run the data processing task.
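
As a non-limiting illustration, the assembly of the SQL group in the fifth to eighth steps may be sketched as follows. The source names the templates but does not give their wording, so the four template strings, the SQL dialect, the `theme_abbr + "_" + source_id` naming rule, the function names and the default mapping table name `virtual_mapping_table` are all assumptions made for the example.

```python
from string import Template

# Hypothetical templates: the source does not disclose their text or the SQL dialect.
CREATE_DIRECTORY = Template("CREATE CATALOG $directory WITH ($options)")
CREATE_DOWNSTREAM = Template(
    "CREATE TABLE IF NOT EXISTS $directory.$table ($columns) WITH ($options)"
)
CREATE_MAPPING = Template("CREATE TABLE $mapping ($columns) WITH ($options)")
FULL_INSERT = Template("INSERT INTO $directory.$table SELECT $names FROM $mapping")


def render_options(options):
    """Render a dict as comma-separated 'key' = 'value' pairs, in insertion order."""
    return ", ".join(f"'{key}' = '{value}'" for key, value in options.items())


def build_sql_group(theme_abbr, source_id, table, fields,
                    directory_opts, downstream_opts, connection_opts,
                    mapping="virtual_mapping_table"):
    """Return the four statements in the order in which steps five to eight add them."""
    # Assumed naming rule: pinyin abbreviation of the theme, underscore, source identifier.
    directory = f"{theme_abbr}_{source_id}"
    columns = ", ".join(f"{name} {lake_type}" for name, lake_type in fields)
    names = ", ".join(name for name, _ in fields)
    return [
        CREATE_DIRECTORY.substitute(
            directory=directory, options=render_options(directory_opts)),
        CREATE_DOWNSTREAM.substitute(
            directory=directory, table=table, columns=columns,
            options=render_options(downstream_opts)),
        CREATE_MAPPING.substitute(
            mapping=mapping, columns=columns,
            options=render_options(connection_opts)),
        FULL_INSERT.substitute(
            directory=directory, table=table, names=names, mapping=mapping),
    ]
```

Because the list preserves insertion order, the order in which the statements are appended is also the order in which they would later be executed.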

To address the problem that, when upstream data enters the lake, the user is required to fill in the relevant lake-entering fields and the lake-entering table name, the method can automatically pull the fields of the lake-entering table according to the database and table name selected by the user, which reduces the cost of writing SQL statements by the user; meanwhile, the method can intelligently determine the type of the data source entering the lake, so that intelligent pulling of the metadata is achieved. According to the different types of data sources, enumerated key-value pair mapping relations map the different field types of different data sources onto the field types of a single data source. The unification of the field types of different data sources is thereby realized, and the problems of inconsistency and incompatibility between the data field types of different data sources are solved.
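
By way of example only, the enumerated key-value mapping of field types may be sketched as below. The two source types (`MYSQL`, `ORACLE`), the listed source field types and the lake-side type names are illustrative assumptions; the source does not enumerate the actual entries of the mapping relation table.

```python
from enum import Enum


class SourceType(Enum):
    # Illustrative data source types; the real list is not given in the source.
    MYSQL = "mysql"
    ORACLE = "oracle"


# Hypothetical mapping relation table: (source type, source field type) -> lake field type.
FIELD_TYPE_MAP = {
    (SourceType.MYSQL, "int"): "INT",
    (SourceType.MYSQL, "varchar"): "STRING",
    (SourceType.MYSQL, "datetime"): "TIMESTAMP",
    (SourceType.ORACLE, "number"): "DECIMAL",
    (SourceType.ORACLE, "varchar2"): "STRING",
    (SourceType.ORACLE, "date"): "TIMESTAMP",
}


def to_lake_fields(source_type, source_fields):
    """Translate (name, source field type) pairs into (name, lake field type) pairs."""
    lake_fields = []
    for name, source_field_type in source_fields:
        key = (source_type, source_field_type.lower())
        if key not in FIELD_TYPE_MAP:
            # Fail loudly rather than guess a type for an unmapped field.
            raise KeyError(
                f"no lake type configured for {source_type.value}.{source_field_type}")
        lake_fields.append((name, FIELD_TYPE_MAP[key]))
    return lake_fields
```

In this sketch, `varchar` from one source and `varchar2` from another both resolve to the same lake type, which is the unification effect described above. Length or precision qualifiers such as `varchar(32)` are not handled.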

After the user inputs the relevant information of a database, the method can store this information as configuration; when the user performs a lake-entering operation, the stored database information can be displayed to the user in list form for selection, so that the configurable lake-entering and full-volume lake-entering processes are realized.

In a specific implementation, the information on the supported data source types can be imported into the relevant database in advance, and the data of a designated database can be entered into the lake by adding the corresponding database driver jar. The connection information of the data source connector and the database address is likewise configurable. Based on this configurability, the user can select incremental lake entering or full-volume lake entering, and can select the data tables entering the lake in arbitrary batches by regular expression. In different service scenarios the lake-entering data are subject to different requirements, and users may need full-volume lake entering for statistical analysis. Therefore, the system can switch to full-volume lake entering according to the selection of the user, and screens the data according to the time period selected by the user when full-volume lake entering is performed. For the incremental service scenario, the invention can automatically discover the primary key of the user's upstream incremental table, and perform incremental update and upsert operations on the user's data according to the primary key.
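
The following is a minimal sketch of the two modes described above, under stated assumptions: an in-memory SQLite database stands in for the upstream source, the lake table is modelled as a Python dict keyed by primary key, and the selected period is treated as the half-open interval [start, end). None of these choices comes from the source; the function names are hypothetical.

```python
import sqlite3
from datetime import datetime


def discover_primary_key(conn, table):
    """Return the primary-key column names of `table`, in key order."""
    # Each row is (cid, name, type, notnull, dflt_value, pk); pk > 0 marks key columns.
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    key_columns = sorted((row for row in info if row[5] > 0), key=lambda row: row[5])
    return [row[1] for row in key_columns]


def upsert(lake_rows, changes, key_columns):
    """Insert new rows and overwrite existing rows that share the same key."""
    for row in changes:
        key = tuple(row[column] for column in key_columns)
        lake_rows[key] = row
    return lake_rows


def full_load(rows, time_field, start, end):
    """Full-volume load limited to the user-selected period [start, end)."""
    return [row for row in rows if start <= row[time_field] < end]
```

For instance, after `discover_primary_key` returns `["id"]` for an upstream table, two successive change batches carrying the same `id` leave a single row in the lake table, holding the later values.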

Meanwhile, in addition to configurable lake entering, the method supports stream-batch integrated configuration: different modes can be provided according to different requirements, and for a real-time scenario the batch lake-entering mode can be switched to the streaming lake-entering mode with one click.

The method for configuring a database into a lake provided by the embodiment of the invention can automate the lake-entering configuration of the database, supports automatic pulling of metadata, flexibly supports stream-batch integration, and reduces the cost of accessing multi-source heterogeneous data by a user. The user can easily connect data between an external data source and the local data lake, and efficient lake entering of the data is realized.

As to the method for configuring a database into a lake provided in the foregoing embodiment, an embodiment of the present invention provides a system for configuring a database into a lake, as shown in fig. 9, where the system includes:

the configuration module 910 is configured to obtain a configuration item of a data source that needs to go into a lake for operation and a mapping relation table; the data source comprises a plurality of databases; the configuration items at least include: data source type, data tasks and data source connection items;

the field information determining module 920 is configured to determine field information of the database in the lake according to the configuration item and the mapping relation table;

a first generating module 930, configured to generate a create directory statement using an identifier of a data source, and generate a create downstream table statement using a data task;

a second generating module 940, configured to generate a data source mapping statement by using the data source connection item and the mapping relationship table, and generate a full query insertion statement by using the mapping relationship table;

and a lake entering module 950, configured to execute a create directory statement, a create downstream table statement, a data source mapping statement, and a full query insertion statement according to the configuration item, and control the multiple databases in the data source to sequentially complete a lake entering operation.

In one embodiment, the field information determining module 920 is further configured to: record the data source entering the lake by using the configuration item of the data source; determine field information of a data table contained in a database according to the structural parameters of the data source; and query the mapping relation table according to the type of the data source, acquire the field type corresponding to the lake-entering field according to the query result, and determine the field information of the database in the lake by using the field type.
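
For the semi-structured case described in the third step of the method, in which the user fills in sample data and the table field information is derived from it, one possible sketch is given below. It assumes that the sample is a JSON object (or a list whose first element is such an object) and that nested values are stored as strings; the lake type names and the function names are hypothetical.

```python
import json


def _lake_type(value):
    """Map a parsed JSON value to an assumed lake field type."""
    if isinstance(value, bool):  # checked before int, because bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    # Strings, nulls and nested objects or arrays all fall back to STRING here.
    return "STRING"


def infer_fields(sample_json):
    """Derive (field name, lake type) pairs from one user-supplied JSON sample."""
    record = json.loads(sample_json)
    if isinstance(record, list):
        record = record[0]
    if not isinstance(record, dict):
        raise ValueError("sample must be a JSON object or a list of objects")
    return [(name, _lake_type(value)) for name, value in record.items()]
```

A single sample cannot reveal optional fields or mixed types, so in practice the inferred field information would presumably remain editable by the user.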

In one embodiment, the first generating module 930 is further configured to: acquire a data theme contained in the data source, determine a directory English name according to the pinyin abbreviation of the data theme and the identifier of the data source, and generate the create directory statement according to the directory English name; and acquire downstream configuration information according to the data source type and the data task, determine downstream table information by using the downstream configuration information, the upstream data table information and the data theme, and generate the create downstream table statement according to the downstream table information.

In one embodiment, the second generating module 940 is further configured to: acquire the fields of the data table in the lake and a preset virtual mapping table, determine the condition parameters of the data mapping table to be generated according to the data source connection item, the fields of the data table in the lake, the virtual mapping table and the mapping relation table, and determine the data source mapping statement according to the condition parameters; and determine the full query insertion statement according to the mapping table name, the downstream table name and the field names determined from the mapping relation table.

In one embodiment, the lake entering module 950 is further configured to, when executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration item: add the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement to a preset instruction group; and traverse the instruction group and, under the control of the configuration items, sequentially execute the statements contained in the instruction group in the order in which they were added.
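
The traversal of the instruction group may be illustrated by the following sketch, in which an in-memory SQLite connection stands in for the cluster to which the statements would actually be submitted. Stopping at the first failing statement is an assumption of the example; the source only states that the statements are executed in the order in which they were added.

```python
import sqlite3


def run_instruction_group(conn, instruction_group):
    """Execute the statements one by one, in the order in which they were added."""
    executed = []
    for statement in instruction_group:  # a list preserves insertion order
        try:
            conn.execute(statement)
        except sqlite3.Error as exc:
            # Assumed policy: abort on the first failure and report its position.
            position = len(executed) + 1
            raise RuntimeError(f"statement {position} failed: {statement}") from exc
        executed.append(statement)
    conn.commit()
    return executed
```

The order matters because later statements depend on earlier ones: the full query insertion can only succeed once both the downstream table and the mapping table exist.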

In one embodiment, the configuration module 910 is further configured to, when obtaining the mapping relation table: determine the field types contained in the data source according to the data source type; and, using the field types, generate the mapping relation table of field types according to the correspondence between the data source types and the types in the lake.

In one embodiment, the configuration module 910 is further configured to, when obtaining the configuration items: determine the table storage engine type, the data file compression format, the metadata target size and the historical expired snapshot duration obtained from the data source as downstream data table configuration items; acquire the parallelism data and the checkpoint interval duration corresponding to the data source, and determine the parallelism data and the checkpoint interval duration as task operation configuration items; and determine the downstream data table configuration items and the task operation configuration items as the configuration items.
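
One way to picture the two groups of configuration items is the sketch below. The class and attribute names, the units (bytes, hours, seconds) and the validation rules are assumptions introduced for illustration; the source lists the items but not their representation.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DownstreamTableConfig:
    storage_engine: str              # table storage engine type
    compression: str                 # data file compression format
    metadata_target_size_bytes: int  # metadata target size (unit assumed)
    snapshot_expire_hours: int       # historical expired snapshot duration (unit assumed)


@dataclass(frozen=True)
class TaskRunConfig:
    parallelism: int            # parallelism data
    checkpoint_interval_s: int  # checkpoint interval duration (unit assumed)


@dataclass(frozen=True)
class LakeEntryConfig:
    downstream: DownstreamTableConfig
    task: TaskRunConfig

    def validate(self):
        """Reject values that could not describe a runnable task."""
        if self.task.parallelism < 1:
            raise ValueError("parallelism must be at least 1")
        if self.task.checkpoint_interval_s <= 0:
            raise ValueError("checkpoint interval must be positive")
        return self
```

Calling `asdict` on a `LakeEntryConfig` yields a nested dictionary, which is convenient if the configuration is to be stored and later shown to the user for selection.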

The system for configuring a database into a lake provided by the embodiment of the invention can automate the lake-entering configuration of the database, supports automatic pulling of metadata, flexibly supports stream-batch integration, and reduces the cost of accessing multi-source heterogeneous data by a user. The user can easily connect data between an external data source and the local data lake, and efficient lake entering of the data is realized.

The implementation principle and the generated technical effects of the database lake entering configuration system provided by the embodiment of the invention are the same as those of the database lake entering configuration method embodiment, and for the sake of brief description, the corresponding contents in the method embodiment can be referred to where the device embodiment is not mentioned.

The embodiment also provides an electronic device, a schematic structural diagram of which is shown in fig. 10, and the electronic device includes a processor 101 and a memory 102; the memory 102 is used for storing one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the database lake-entering configuration method.

The electronic device shown in fig. 10 further includes a bus 103 and a communication interface 104, and the processor 101, the communication interface 104, and the memory 102 are connected through the bus 103.

The memory 102 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The bus 103 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 10, but this does not indicate that there is only one bus or one type of bus.

The communication interface 104 is used for connecting with at least one user terminal and other network units through a network interface, and for sending the packaged IPv4 message to the user terminal through the network interface.

The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present disclosure may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an erasable programmable read-only memory, or a register. The storage medium is located in the memory 102, and the processor 101 reads the information in the memory 102 and completes the steps of the method of the foregoing embodiment in combination with its hardware.

Embodiments of the present invention further provide a storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the steps of the method of the foregoing embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some communication interfaces, indirect coupling or communication connection between devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in software functional units and sold or used as a stand-alone product, may be stored in a non-transitory computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention or a part thereof which contributes to the prior art in essence can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present invention, used for illustrating the technical solutions of the present invention rather than limiting them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and shall be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for configuring a database into a lake is characterized by comprising the following steps:

acquiring a configuration item and a mapping relation table of a data source needing lake entering operation; the data source comprises a plurality of databases; the configuration items at least comprise: data source type, data tasks and data source connection items;

determining field information of the database in the lake according to the configuration item and the mapping relation table;

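generating a create directory statement by using the identifier of the data source, and generating a create downstream table statement by using the data task;

generating a data source mapping statement by using the data source connection item and the mapping relation table, and generating a full query insertion statement by using the mapping relation table;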

and executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration item, and controlling the plurality of databases in the data source to sequentially complete the lake entering operation.

2. The method for configuring the database in the lake according to claim 1, wherein the step of determining the field information of the database in the lake according to the configuration item and the mapping relation table comprises:

recording the data source entering the lake by using the configuration item of the data source;

determining field information of a data table contained in the database according to the structural parameters of the data source;

and querying the mapping relation table according to the type of the data source, acquiring a field type corresponding to a lake entering field according to a query result, and determining field information of the database in the lake by using the field type.

3. The database lake-entering configuration method according to claim 1, wherein the step of generating a create directory statement using the identifier of the data source and generating a create downstream table statement using the data task comprises:

acquiring a data theme contained in the data source, determining a directory English name according to a pinyin abbreviation of the data theme and the identifier of the data source, and generating the create directory statement according to the directory English name;

and acquiring downstream configuration information according to the data source type and the data task, determining downstream table information by using the downstream configuration information, upstream data table information and the data theme, and generating the create downstream table statement according to the downstream table information.

4. The method for configuring the database into the lake according to claim 1, wherein the step of generating the data source mapping statement by using the data source connection item and the mapping relation table, and generating the full query insertion statement by using the mapping relation table comprises:

acquiring a field of a data table in a lake and a preset virtual mapping table, determining a condition parameter of the data mapping table to be generated according to the data source connection item, the field of the data table in the lake, the virtual mapping table and the mapping relation table, and determining a data source mapping statement according to the condition parameter;

and determining the full-scale query insertion statement according to the mapping table name, the downstream table name and the field name determined by the mapping relation table.

5. The database lake entering configuration method according to claim 1, wherein the step of executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration item comprises:

adding the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement into a preset instruction group;

and traversing the instruction group, and controlling the configuration items to sequentially execute the statements contained in the instruction group according to the order in which the statements were added to the instruction group.

6. The method for configuring the database into the lake according to claim 1, wherein the obtaining process of the mapping relation table comprises:

determining the field type contained in the data source according to the data source type;

and generating a mapping relation table of the field type according to the corresponding relation between the data source type and the type in the lake by using the field type.

7. The database lake entering configuration method according to claim 1, wherein the acquisition process of the configuration items comprises:

determining the table storage engine type, the data file compression format, the metadata target size and the historical expired snapshot duration obtained from the data source as downstream data table configuration items;

acquiring parallelism data and checkpoint interval duration corresponding to the data source, and determining the parallelism data and the checkpoint interval duration as task operation configuration items;

and determining the downstream data table configuration item and the task operation configuration item as the configuration items.

8. A database lake entry configuration system, the system comprising:

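the configuration module is used for acquiring a configuration item and a mapping relation table of a data source needing lake entering operation; the data source comprises a plurality of databases; the configuration items at least comprise: data source type, data tasks and data source connection items;

the field information determining module is used for determining field information of the database in the lake according to the configuration item and the mapping relation table;

the first generation module is used for generating a create directory statement by using the identifier of the data source and generating a create downstream table statement by using the data task;

the second generation module is used for generating a data source mapping statement by using the data source connection item and the mapping relation table, and generating a full query insertion statement by using the mapping relation table;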

and the lake entering module is used for executing the create directory statement, the create downstream table statement, the data source mapping statement and the full query insertion statement according to the configuration items and controlling the plurality of databases in the data source to sequentially complete the lake entering operation.

9. An electronic device, comprising: a processor and a storage device; the storage device has stored thereon computer-executable instructions executable by a processor to perform the steps of the database-in-lake configuring method of any one of the preceding claims 1 to 7.

10. A storage medium storing computer-executable instructions which, when invoked and executed by a processor, cause the processor to perform the steps of the database lake entering configuration method of any one of claims 1 to 7.