CN110569317A - Metadata collection method and device for data source

Info

Publication number: CN110569317A
Application number: CN201910866414.5A
Authority: CN
Inventors: 宋柯; 张毅然
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-09-12
Filing date: 2019-09-12
Publication date: 2019-12-13

Abstract

The invention provides a metadata collection method and a metadata collection device for a data source. The method comprises the following steps: estimating the required cluster resources according to the metadata scale of the collected data source; loading a database table required for running metadata collection into the memory of the cluster to form a temporary table; and executing the SQL set for collecting metadata based on the temporary table. Because the metadata collection SQL for the relational database is run on the cluster, the invention solves the problem that this SQL cannot be run over a direct JDBC connection to the database when the data volume is large.

Description

Metadata collection method and device for data source

Technical Field

The invention relates to the field of databases, in particular to a metadata acquisition method and device for a data source.

Background

Most existing approaches to collecting the metadata of relational data sources such as Hive, MySQL, Oracle, and Postgres connect through JDBC to the database in which each data source's metadata is stored, and then use SQL to query the database tables that hold the metadata information in order to extract the metadata of the data source.

In this collection mode, the core SQL logic is actually executed on the server hosting the metadata database, in the memory of that data source server. This works without problems for small data volumes, but with large data volumes the result often cannot be computed. In addition, the user frequently has no authority to scale up the database server, and the benefit of scaling up a single machine is limited.

Disclosure of Invention

The embodiments of the invention provide a metadata collection method and a metadata collection device for a data source, which at least solve the problem of insufficient computing capacity in the metadata collection approach of the related art.

According to an embodiment of the present invention, there is provided a metadata collection method for a data source, including: estimating the required cluster resources according to the metadata scale of the collected data source; loading a database table required for running metadata collection into the memory of the cluster to form a temporary table; and executing the SQL set for collecting metadata based on the temporary table.

Preferably, after estimating the required cluster resources according to the metadata scale of the collected data source, the method further includes: initializing a cluster session with the estimated cluster resources.

Preferably, the method further includes: when the cluster resources are insufficient during execution, increasing the cluster resources and reinitializing the cluster session.

Preferably, the cluster resources include memory and CPU resources of the cluster.

According to another embodiment of the present invention, there is provided a metadata collection apparatus for a data source, including: an estimation module configured to estimate the required cluster resources according to the metadata scale of the collected data source; a loading module configured to load a database table required for running metadata collection into the memory of the cluster to form a temporary table; and an execution module configured to execute the SQL set for collecting metadata based on the temporary table.

Preferably, the apparatus further includes: an initialization module configured to initialize a cluster session with the estimated cluster resources.

Preferably, the apparatus further includes: an adjusting module configured to increase the cluster resources and reinitialize the cluster session when the cluster resources are insufficient during execution.

According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.

In the embodiments of the invention, the metadata collection SQL for the relational database is run on the cluster, which solves the problem that this SQL cannot be run over a direct JDBC connection to the database when the data volume is large.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a block diagram of the hardware structure of a computer terminal on which a method of an embodiment of the present invention runs;

FIG. 2 is a flow diagram of a method of metadata collection for a data source according to an embodiment of the invention;

FIG. 3 is a flow diagram of a method for metadata collection of a data source in accordance with an alternative embodiment of the present invention;

FIG. 4 is a schematic diagram of a metadata collection apparatus for a data source according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a metadata collection apparatus for a data source according to an alternative embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The method provided by the first embodiment of the present application may be executed in a computer terminal, a server, or a similar computing device. Taking execution on a computer terminal as an example, FIG. 1 is a block diagram of the hardware structure of a computer terminal on which the method of the embodiment of the present invention runs. As shown in FIG. 1, the computer terminal 100 may include one or more processors 102 (only one is shown in FIG. 1) and a memory 104 for storing data. The processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA). Optionally, the computer terminal 100 may further include a transmission device 106 for communication functions and an input/output device 108. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only an illustration and is not intended to limit the structure of the computer terminal. For example, the computer terminal 100 may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.

The memory 104 can be used for storing computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the methods in the embodiments of the present invention, and the processor 102 executes the computer programs stored in the memory 104 to execute various functional applications and data processing, i.e., to implement the methods described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 100 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 100. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) through which it can communicate with the internet.

This embodiment provides a metadata collection method for a data source that runs on the computer terminal described above. In the embodiment of the invention, the database tables that store metadata in different relational databases are loaded into the cluster, and the metadata collection SQL is then run using the computing capacity of the cluster.

FIG. 2 is a flowchart of a method according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:

Step S202, pre-estimating the required cluster resources according to the metadata scale of the collected data source;

Step S204, loading a database table required by running metadata acquisition into a memory of the cluster to form a temporary table;

Step S206, executing the SQL set for collecting metadata based on the temporary table.

After step S202 in this embodiment, the method may further include: initializing a Spark session with the estimated Spark cluster resources.

In step S206 of this embodiment, the method may further include: when the Spark cluster resources are insufficient during execution, increasing the Spark cluster resources and reinitializing the Spark session.

In this embodiment, the cluster may be a Spark cluster, and the cluster resources may include the memory and the number of CPU cores of the Spark cluster.

In the embodiment of the invention, the metadata collection SQL for relational data sources can be run and computed efficiently to extract the metadata of different relational data sources, so that the metadata of very large relational data sources can be collected efficiently in data governance.
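As an illustration of step S202, the following is a minimal sketch of how memory and CPU cores might be estimated from the metadata scale. The description does not specify an estimation formula, so the sizing rule (bytes per row, overhead factor, rows per core), the default values, and the names `ClusterResources` and `estimate_resources` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterResources:
    """Memory and CPU cores requested from the cluster (hypothetical type)."""
    memory_mb: int
    cpu_cores: int


def estimate_resources(
    row_count: int,
    avg_row_bytes: int = 512,        # assumed average size of one metadata row
    overhead_factor: float = 3.0,    # assumed headroom for joins and intermediates
    rows_per_core: int = 1_000_000,  # assumed rows one core handles comfortably
    min_memory_mb: int = 1024,       # assumed lower bound for any session
) -> ClusterResources:
    """Estimate cluster memory and core count from the number of metadata rows."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    raw_mb = row_count * avg_row_bytes * overhead_factor / (1024 * 1024)
    memory_mb = max(min_memory_mb, math.ceil(raw_mb))
    cpu_cores = max(1, math.ceil(row_count / rows_per_core))
    return ClusterResources(memory_mb=memory_mb, cpu_cores=cpu_cores)
```

With a Spark cluster, values of this kind would typically be supplied as session configuration, for example through the `spark.executor.memory` and `spark.executor.cores` settings, when the session is initialized.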

In order to facilitate an understanding of the technical solutions provided by the embodiments of the present invention, an embodiment of a specific application will be described in detail below.

FIG. 3 shows a metadata collection method for a data source according to this specific embodiment. As shown in FIG. 3, the method mainly includes the following steps:

Step S301, estimating the required Spark cluster resources according to the metadata scale of the collected data source, and initializing a Spark session with the required Spark cluster resources;

Step S302, loading the metadata table into the memory of the Spark cluster to form a temporary table;

Step S303, executing the metadata collection SQL against the temporary table in the Spark cluster;

Step S304, judging whether the execution succeeded; if not, returning to step S301, and if so, proceeding to step S305;

Step S305, ending the flow.
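The flow of steps S301 to S305 can be sketched as a retry loop. The description states only that resources are increased and the session is reinitialized when execution fails; the doubling policy, the attempt limit, the `InsufficientResources` exception, and the function names below are assumptions. Session creation and SQL execution are passed in as callables so that the sketch does not depend on any particular cluster API.

```python
from typing import Callable, Dict, TypeVar

S = TypeVar("S")  # session type, e.g. a Spark session in a real deployment
R = TypeVar("R")  # result type of the metadata collection


class InsufficientResources(Exception):
    """Hypothetical error signalling that the cluster ran out of resources."""


def collect_with_retry(
    initial: Dict[str, int],
    init_session: Callable[[Dict[str, int]], S],
    load_and_run: Callable[[S], R],
    max_attempts: int = 4,
) -> R:
    """Run the collection, scaling resources up after each failed attempt."""
    resources = dict(initial)
    for _ in range(max_attempts):
        # S301: (re)initialize the session with the current resource estimate
        session = init_session(resources)
        try:
            # S302 and S303: load the temporary tables and execute the SQL set;
            # a normal return corresponds to S304 "success" followed by S305
            return load_and_run(session)
        except InsufficientResources:
            # S304 "failure": increase resources (assumed policy: double each
            # value) and go back to S301
            resources = {name: value * 2 for name, value in resources.items()}
    raise RuntimeError(f"metadata collection failed after {max_attempts} attempts")
```

Bounding the number of attempts is a design choice of this sketch, not something the description requires; without a bound, a cluster that can never satisfy the request would be retried indefinitely.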

In this embodiment, Spark SQL, a component of Spark, may be used to load the information in the metadata database of each data source into the memory of the Spark cluster, so that the metadata analysis is performed on a computing cluster with greater computing power and memory capacity.
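The load-then-query pattern of steps S302 and S303 can be illustrated without a cluster. The sketch below uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for the memory of the Spark cluster: rows taken from the metadata database are copied into temporary tables, and the collection SQL then runs against those copies rather than on the source server. The table names `tbls` and `cols`, their columns, the sample rows, and the helper names are hypothetical.

```python
import sqlite3
from typing import Dict, Iterable, List, Sequence, Tuple


def load_temp_table(
    conn: sqlite3.Connection,
    name: str,
    columns: Sequence[str],
    rows: Iterable[Tuple],
) -> None:
    """Copy rows from the source metadata database into an in-memory temp table.

    Table and column names are trusted constants here; they are interpolated
    into the SQL text, so they must never come from user input.
    """
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TEMP TABLE {name} ({col_list})")
    conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)


def collect_metadata(conn: sqlite3.Connection, sql_set: Dict[str, str]) -> Dict[str, List[Tuple]]:
    """Execute every statement of the collection SQL set against the temp tables."""
    return {key: conn.execute(sql).fetchall() for key, sql in sql_set.items()}


# Stand-in for cluster memory: nothing below touches the source database server.
conn = sqlite3.connect(":memory:")
load_temp_table(conn, "tbls", ["tbl_id", "tbl_name"], [(1, "orders"), (2, "users")])
load_temp_table(
    conn,
    "cols",
    ["tbl_id", "col_name", "col_type"],
    [(1, "id", "bigint"), (1, "amount", "double"), (2, "id", "bigint")],
)

SQL_SET = {
    "column_counts": (
        "SELECT t.tbl_name, COUNT(*) FROM tbls t "
        "JOIN cols c ON c.tbl_id = t.tbl_id "
        "GROUP BY t.tbl_name ORDER BY t.tbl_name"
    ),
}
result = collect_metadata(conn, SQL_SET)
```

In a Spark deployment, the corresponding steps would typically be reading each metadata table over JDBC into a DataFrame, registering it as a temporary view, and running the same kind of SQL through Spark SQL, so that the join and aggregation are computed by the cluster.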

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

In this embodiment, a metadata collection apparatus for a data source is further provided. The apparatus is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

FIG. 4 is a block diagram of a metadata collection apparatus for a data source according to an embodiment of the present invention. As shown in FIG. 4, the apparatus includes an estimation module 10, a loading module 20, and an execution module 30.

The estimation module 10 is used for estimating the required cluster resources according to the metadata scale of the collected data source.

The loading module 20 is configured to load a database table required for running metadata collection into the memory of the cluster to form a temporary table.

The execution module 30 is configured to execute the SQL set for collecting metadata based on the temporary table.

FIG. 5 is a block diagram of a metadata collection apparatus for a data source according to an alternative embodiment of the present invention. As shown in FIG. 5, the apparatus includes an initialization module 40 and an adjusting module 50 in addition to the estimation module 10, the loading module 20, and the execution module 30 shown in FIG. 4.

The initialization module 40 is configured to initialize a cluster session with the estimated cluster resources.

The adjusting module 50 is configured to increase the cluster resources and reinitialize the cluster session when the cluster resources are insufficient during execution.

It should be noted that the above modules may be implemented by software or hardware. For the latter, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.

Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that described herein. Alternatively, the modules or steps may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for metadata collection from a data source, comprising:

estimating required cluster resources according to the metadata scale of the collected data source;

loading a database table required for running metadata collection into a memory of the cluster to form a temporary table;

and executing an SQL set for collecting metadata based on the temporary table.

2. The method of claim 1, further comprising, after estimating the required cluster resources based on the metadata size of the collected data sources:

initializing a cluster session with the estimated cluster resources.

3. The method of claim 1, further comprising:

when the cluster resources are insufficient during execution, increasing the cluster resources and reinitializing the cluster session.

4. The method of any of claims 1 to 3, wherein the cluster resources comprise memory and CPU resources of the cluster.

5. A metadata collection apparatus for a data source, comprising:

an estimation module configured to estimate the required cluster resources according to the metadata scale of the collected data source;

a loading module configured to load a database table required for running metadata collection into the memory of the cluster to form a temporary table;

and an execution module configured to execute the SQL set for collecting metadata based on the temporary table.

6. The apparatus of claim 5, further comprising:

an initialization module configured to initialize a cluster session with the estimated cluster resources.

7. The apparatus of claim 5, further comprising:

an adjusting module configured to increase the cluster resources and reinitialize the cluster session when the cluster resources are insufficient during execution.

8. The apparatus of any of claims 5 to 7, wherein the cluster resources comprise memory and CPU resources of the cluster.

9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 4 when executed.

10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 4.