CN116578570B

CN116578570B - Method, system and equipment for automatically optimizing table data structure layout

Info

Publication number: CN116578570B
Application number: CN202310851427.1A
Authority: CN
Inventors: 赵扬名; 张敢; 吴文池; 吴小前
Original assignee: Beijing Deepexi Technology Co Ltd
Current assignee: Beijing Deepexi Technology Co Ltd
Priority date: 2023-07-12
Filing date: 2023-07-12
Publication date: 2023-10-13
Anticipated expiration: 2043-07-12
Also published as: CN116578570A

Abstract

The application relates to the technical field of big data and data lakes, and in particular to a method, a system and equipment for automatically optimizing the layout of a table data structure. The method comprises: acquiring optimized resource information; using the optimized resource information to perform resource binding between k8s and a data lake Iceberg table, obtaining metadata information of the Iceberg table and a resource isolation result; automatically generating an optimization task from the metadata information of the Iceberg table and the resource isolation result; and running the optimization task to obtain a related operation result for the data lake Iceberg table. Because table optimization is carried out through optimization tasks, the number of small files and delete files of a table can be reduced, and the query performance of the table can be improved by rewriting the table structure. In addition, the resource settings for optimizing a table are generated automatically from the task generation rules and the scoring rules, so users do not need to track the number of small files and delete files of the table, which improves operation and maintenance efficiency.

Description

Method, system and equipment for automatically optimizing table data structure layout

Technical Field

The application relates to the technical field of big data and data lakes, in particular to a method, a system and equipment for automatically optimizing the layout of a table data structure.

Background

Currently, Iceberg's compaction service operates on one table at a time and requires manually calling a Spark stored procedure. In real deployments, Iceberg holds hundreds or thousands of tables, sometimes more than ten thousand, which brings a huge operation and maintenance cost. In addition, when the Spark stored procedure is called to run a compaction task, compute units cannot be sized according to the current state of the table. A small table may therefore consume many compute units and waste resources, while a large table may be given too few resources, so that the task runs for a long time, query efficiency stays low, and the task may eventually fail. Furthermore, when a compaction task runs, resource isolation can only be achieved by manually designating a cluster; it cannot be applied automatically by catalog, database, table and so on.

In the prior art, when the table structure is optimized with compaction tasks, the operation and maintenance cost is too high and query performance falls short of expectations, so overall operation efficiency is low.

Disclosure of Invention

The application provides a method, a system and equipment for automatically optimizing the layout of a table data structure, which aim to solve, at least to some extent, the problem in the related art that optimizing the table structure with compaction tasks incurs an excessive operation and maintenance cost and fails to deliver the expected query performance, resulting in low overall operation efficiency.

The scheme of the application is as follows:

in a first aspect, the present application provides a method of automatically optimizing a table data structure layout, the method comprising:

acquiring resource information to be optimized;

performing resource binding between k8s and the data lake iceberg table by utilizing the optimized resource information to obtain metadata information of the iceberg table and a result of isolating resources; automatically generating an optimization task by utilizing metadata information of the iceberg table and the result of the isolated resource;

and setting a spark task by using a k8s spark operator service, and operating the optimization task through the set spark task to obtain a related operation result of the data lake iceberg table.

Further, the performing resource binding between k8s and the data lake iceberg table by utilizing the optimized resource information to obtain metadata information of the iceberg table and a result of isolating resources includes:

judging whether the optimized resource information meets a first preset condition or not;

if yes, generating a Namespace through an API client function provided by k8s;

applying a resource quota to the Namespace, configuring the maximum CPU and the maximum memory available to it, thereby completing the resource configuration;

using the result of the resource configuration, binding a one-to-one association between the Namespace under k8s and the catalog under the data lake iceberg, to obtain the metadata information of the iceberg table and the result of isolating resources;

wherein the first preset condition is: the optimized resource information does not already exist in the backend, and the optimized resource information meets the k8s resource requirements.
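As a rough illustration of the binding steps above, the following Python sketch builds the two k8s objects involved, a Namespace and a ResourceQuota, as plain manifest dictionaries, and keeps a one-to-one catalog-to-Namespace map. The function names, the `optimizer-sales` example name and the in-memory `bindings` dictionary are hypothetical and not taken from the patent; a real deployment would submit the manifests through a k8s API client and persist the binding in a database.

```python
from typing import Dict, List


def meets_first_condition(namespace: str, cpu: int, memory_gi: int,
                          bindings: Dict[str, str],
                          cluster_cpu: int, cluster_memory_gi: int) -> bool:
    """First preset condition: the resource is not yet registered in the
    backend and its request fits what the k8s cluster can offer."""
    not_registered = namespace not in bindings.values()
    fits_cluster = 0 < cpu <= cluster_cpu and 0 < memory_gi <= cluster_memory_gi
    return not_registered and fits_cluster


def build_manifests(namespace: str, cpu: int, memory_gi: int) -> List[dict]:
    """Return a Namespace manifest and a ResourceQuota capping its CPU/memory."""
    ns = {"apiVersion": "v1", "kind": "Namespace",
          "metadata": {"name": namespace}}
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {"hard": {"limits.cpu": str(cpu),
                          "limits.memory": f"{memory_gi}Gi"}},
    }
    return [ns, quota]


def bind_catalog(catalog: str, namespace: str, bindings: Dict[str, str]) -> None:
    """Record a one-to-one catalog <-> Namespace association."""
    if catalog in bindings or namespace in bindings.values():
        raise ValueError("catalog or namespace is already bound")
    bindings[catalog] = namespace
```

Keeping the binding strictly one-to-one means every optimization task for a catalog runs under exactly one quota, which is what gives the isolation described above.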

Further, the automatically generating an optimization task by using metadata information of the iceberg table and the result of the isolated resource includes:

acquiring the running state information of the data lake iceberg table by utilizing the result of the isolated resource;

acquiring related running state factors by using the running state information of the data lake iceberg table, wherein the related running state factors comprise: a task type factor, a small file count factor, a delete file count factor, a newly added snapshot factor, and a newly added data factor;

judging, from the related running state factors, whether a related optimization operation and maintenance task currently needs to be created;

if the threshold number of the related running state factors meets a second preset condition, creating a corresponding optimization operation and maintenance task;

and carrying out priority scheduling on the optimization operation and maintenance task.

Further, the acquiring the running state information of the data lake iceberg table by utilizing the result of the isolated resource includes:

using the result of the isolated resource, scanning the information of the databases and tables under the catalog through the catalog manager function of the data lake iceberg, and acquiring the storage path of each table;

calling the iceberg API with the acquired storage path of the table to acquire the running state information of the data lake iceberg table;

wherein the running state information of the data lake iceberg table comprises: table snapshot information, file information under each snapshot, delete file information, and record count information.
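To make the running state information concrete, the sketch below condenses per-file metadata into the counts that later steps compare against thresholds. The `FileInfo` and `TableState` types and the 32 MiB small-file cutoff are assumptions made for illustration; the patent does not name the data structures or the cutoff, and in practice the inputs would come from the Iceberg API rather than hand-built lists.

```python
from dataclasses import dataclass
from typing import List

SMALL_FILE_BYTES = 32 * 1024 * 1024  # assumed cutoff for a "small" data file


@dataclass
class FileInfo:
    size_bytes: int
    record_count: int
    is_delete_file: bool = False


@dataclass
class TableState:
    snapshot_count: int
    small_file_count: int
    delete_file_count: int
    record_count: int


def summarize(snapshot_ids: List[int], files: List[FileInfo]) -> TableState:
    """Condense snapshot and file metadata into per-table counts."""
    data_files = [f for f in files if not f.is_delete_file]
    return TableState(
        snapshot_count=len(snapshot_ids),
        small_file_count=sum(1 for f in data_files
                             if f.size_bytes < SMALL_FILE_BYTES),
        delete_file_count=sum(1 for f in files if f.is_delete_file),
        record_count=sum(f.record_count for f in data_files),
    )
```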

Further, the creating a corresponding optimization operation and maintenance task if the threshold number of the related running state factors meets a second preset condition includes:

if a newly added snapshot factor exists in the current data lake iceberg table, the total snapshot data exceeds a first snapshot threshold preset in the backend, and no corresponding cleaning task exists in the current task queue, creating a corresponding cleaning task;

if a newly added data factor exists in the current data lake iceberg table, the small file count factor or the delete file count factor exceeds a second newly-added-factor threshold preset in the backend, and no corresponding merging task exists in the current task queue, creating a corresponding merging task;

if a newly added data factor exists in the current data lake iceberg table, the small file count factor or the delete file count factor exceeds the second newly-added-factor threshold preset in the backend, the proportion of the newly added data in the total data volume exceeds a third newly-added-factor threshold preset in the backend, and no corresponding sorting task exists in the current task queue, creating a corresponding sorting task.
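The three creation rules above can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the type names, the task labels `"clean"`, `"merge"` and `"sort"`, and every default threshold value are hypothetical, since the patent only says the thresholds are preset in the backend.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class TableFactors:
    has_new_snapshot: bool
    total_snapshots: int
    has_new_data: bool
    small_files: int
    delete_files: int
    new_data_ratio: float  # newly added data as a share of total data volume


@dataclass
class Thresholds:
    snapshots: int = 100          # "first" threshold (assumed value)
    files: int = 50               # "second" threshold (assumed value)
    new_data_ratio: float = 0.3   # "third" threshold (assumed value)


def tasks_to_create(f: TableFactors, t: Thresholds, queued: Set[str]) -> List[str]:
    """Return the task types to enqueue, skipping types already in the queue."""
    tasks: List[str] = []
    if (f.has_new_snapshot and f.total_snapshots > t.snapshots
            and "clean" not in queued):
        tasks.append("clean")
    merge_condition = f.has_new_data and (
        f.small_files > t.files or f.delete_files > t.files)
    if merge_condition and "merge" not in queued:
        tasks.append("merge")
    if (merge_condition and f.new_data_ratio > t.new_data_ratio
            and "sort" not in queued):
        tasks.append("sort")
    return tasks
```

Checking the queue before creating a task keeps a slow-running table from accumulating duplicate tasks of the same type.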

Further, the performing priority scheduling on the optimization operation and maintenance task includes:

scoring the optimization operation and maintenance task;

generating task priorities from high to low according to the scoring results, and executing the optimization operation and maintenance tasks in order of task priority;

wherein the starting priority score of the merging task and the sorting task is higher than that of the cleaning task;

the priority of the cleaning task is increased in gradient steps according to the snapshot data volume;

and the priorities of the merging task and the sorting task are each increased according to the delete file count factor and the small file count factor.
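One way to read the scoring rules above is as a base score per task type plus stepwise ("gradient") bonuses. The sketch below is only an assumption-laden illustration: the base scores, step sizes, bonus points and caps are invented for the example, and the patent does not disclose concrete values.

```python
# Assumed base scores: merging and sorting start above cleaning.
BASE_SCORE = {"merge": 50, "sort": 50, "clean": 10}


def gradient(value: int, step: int = 100, points: int = 5, cap: int = 30) -> int:
    """Add `points` for every full `step` of `value`, up to `cap`."""
    return min((value // step) * points, cap)


def score(task_type: str, snapshots: int = 0,
          small_files: int = 0, delete_files: int = 0) -> int:
    """Score a task; higher scores are scheduled first."""
    total = BASE_SCORE[task_type]
    if task_type == "clean":
        total += gradient(snapshots)
    else:
        total += gradient(small_files) + gradient(delete_files)
    return total
```

For example, under these assumed values a cleaning task on a table with 250 snapshots scores 10 + 2 × 5 = 20, while a merging task with 1000 small files and 100 delete files scores 50 + 30 + 5 = 85.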

Further, before the setting a spark task by using the k8s spark operator service and running the optimization task through the set spark task to obtain the related operation result of the data lake iceberg table, the method further includes:

using the result of the isolated resource and the automatically generated optimization tasks, generating task acquisition threads according to the number of optimization resources bound to the catalog;

judging, by the task acquisition thread, whether available resources exist under the current catalog;

if so, acquiring a task to be run from the task queue according to priority;

if priorities are equal, using the creation time of the task as the execution criterion, so that the task created earlier is executed first.
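The selection rule above (highest priority first, earlier creation time on ties, and only when the catalog has free resources) maps naturally onto a heap. The following is a minimal single-threaded sketch; the function names and the `free_slots` counter are hypothetical, and the task acquisition threads described above are left out for brevity.

```python
import heapq
from typing import List, Optional, Tuple

# Heap entries are (-priority, created_at, name). Negating the priority turns
# Python's min-heap into "highest priority first", and created_at breaks ties
# in favour of the task that was created earlier.
Entry = Tuple[int, float, str]


def enqueue(queue: List[Entry], name: str, priority: int, created_at: float) -> None:
    heapq.heappush(queue, (-priority, created_at, name))


def next_task(queue: List[Entry], free_slots: int) -> Optional[str]:
    """Pop the next task to run, or None if no resources or no tasks."""
    if free_slots <= 0 or not queue:
        return None
    return heapq.heappop(queue)[2]
```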

Further, the setting a spark task by using the k8s spark operator service, and running the optimization task through the set spark task to obtain a relevant running result of the data lake iceberg table, including:

acquiring optimization parameters of the optimization task by utilizing the related information of the optimization task, wherein the optimization parameters comprise: maximum number of small files, minimum file size, maximum file size, sort field, delete file threshold, and snapshot retention time;

judging whether the requirements of the resource clusters are met or not by utilizing the optimization parameters of the optimization task;

if yes, submitting the optimization task to the resource cluster and setting resources of the spark task through a k8s spark operator;

and operating the optimization task through the set spark task to obtain a related operation result of the data lake iceberg table.
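As an illustration of setting Spark task resources through the k8s spark operator, the sketch below sizes the executor count from the table's data volume and emits a `SparkApplication` manifest as a dictionary. The sizing rule (one executor per 10 GiB, capped at 20), the image name, main class, jar path and Spark version are all assumptions for the example; the patent does not specify how resources are derived from the optimization parameters.

```python
GIB = 1024 ** 3


def executors_for(table_bytes: int, bytes_per_executor: int = 10 * GIB,
                  max_executors: int = 20) -> int:
    """Assumed sizing rule: one executor per 10 GiB, between 1 and the cap."""
    needed = -(-table_bytes // bytes_per_executor)  # ceiling division
    return max(1, min(needed, max_executors))


def spark_application(name: str, namespace: str, table_bytes: int) -> dict:
    """Build a SparkApplication manifest for the k8s spark operator."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": "example/spark-iceberg:latest",            # hypothetical
            "mainClass": "com.example.OptimizeJob",             # hypothetical
            "mainApplicationFile": "local:///opt/jobs/optimize.jar",  # hypothetical
            "sparkVersion": "3.4.1",                            # assumed
            "driver": {"cores": 1, "memory": "2g"},
            "executor": {"cores": 2, "memory": "4g",
                         "instances": executors_for(table_bytes)},
        },
    }


def fits_quota(app: dict, quota_cpu: int) -> bool:
    """Check that driver plus executor cores stay within the Namespace quota."""
    spec = app["spec"]
    requested = (spec["driver"]["cores"]
                 + spec["executor"]["cores"] * spec["executor"]["instances"])
    return requested <= quota_cpu
```

Submitting the task only when `fits_quota` holds mirrors the check that the resource cluster's requirements are met before the optimization task is handed to the operator.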

In a second aspect, the present application provides a system for automatically optimizing a table data structure layout, the system comprising:

the acquisition module is used for acquiring the resource information to be optimized;

the resource isolation module is used for carrying out resource binding between k8s and the data lake iceberg table by utilizing the optimized resource information to obtain metadata information of the iceberg table and a result of isolating resources;

the data processing module is used for automatically generating an optimization task by utilizing metadata information of the iceberg table and the result of the isolated resource;

and the execution module is used for setting a spark task by using a k8s spark operator service, and running the optimization task through the set spark task to obtain a related running result of the data lake iceberg table.

In a third aspect, the present application provides an apparatus for automatically optimizing a table data structure layout, the apparatus comprising:

a memory having an executable program stored thereon;

a processor for executing the executable program in the memory to implement the steps of any of the methods described above.

The technical scheme provided by the application can comprise the following beneficial effects:

The application acquires the resource information to be optimized; uses the optimized resource information to perform resource binding between k8s and the data lake Iceberg table, obtaining a resource isolation result; automatically generates an optimization task from the resource isolation result; and runs the optimization task to obtain a related operation result for the data lake Iceberg table. Because table optimization is carried out through optimization tasks, the number of small files and delete files of a table can be reduced, and the query performance of the table can be improved by rewriting the table structure. In addition, the resource settings for optimizing a table are generated automatically from the task generation rules and the scoring rules, so users do not need to track the number of small files and delete files of the table, which improves efficiency. Because resource binding is performed between k8s and the data lake Iceberg table and optimization tasks are generated and executed automatically, a dedicated optimizer role is no longer needed in DataOps scenarios, and file structure optimization can run continuously in the background. This minimizes the operation and maintenance workload, reduces operation and maintenance cost, and improves operation and maintenance efficiency.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

FIG. 1 is a flow chart of a method for automatically optimizing a table data structure layout according to one embodiment of the present application;

FIG. 2 is a flow chart of optimizing resource management of a dynamic optimization table data structure layout according to another embodiment of the present application;

FIG. 3 is a schematic flow chart of a running optimization task of a dynamic optimization table data structure layout according to another embodiment of the present application;

FIG. 4 is a schematic diagram of a system composition flow for automatically optimizing a table data structure layout according to one embodiment of the present application;

FIG. 5 is a schematic diagram of an apparatus for automatically optimizing a layout of a table data structure according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.

Iceberg is an open table format that redefines how the metadata of data is organized. It is a middle layer between the upper-layer compute engine and the underlying storage format, and can be used in big data analysis scenarios. Iceberg provides the compute engine with unified SQL-like semantics, but underneath it still uses storage formats such as Parquet and ORC.

One of the key tradeoffs in managing an Iceberg data lake is choosing between write throughput and query performance. To achieve better write throughput and lower data visibility latency, incoming data is written to smaller data files. This greatly increases the parallelism of the compute engine and thus the data ingestion speed. However, this approach creates many small files, which can degrade query performance. In addition, on many file systems, including HDFS, the performance of the file system itself may also degrade when there are many small files. To support an architecture that allows fast insertion without hurting query performance, Iceberg provides a compaction service that rewrites the data layout, merging small files into large files, which improves query efficiency without affecting write throughput.

Currently, Iceberg's compaction service operates on one table at a time and requires manually calling a Spark stored procedure. In real deployments, Iceberg holds hundreds or thousands of tables, sometimes more than ten thousand, which inevitably brings a huge operation and maintenance cost.

When the Spark stored procedure is called to run a compaction task, compute units cannot be sized according to the current state of the table. A small table may consume many compute units and waste resources, while a large table may be given too few resources, so that the task runs for a long time, query efficiency drops, and the task may eventually fail. When a compaction task runs, resource isolation can only be achieved by manually designating a cluster; it cannot be applied automatically by catalog, database, table and so on.
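For context, the manual, per-table stored procedure call described above looks roughly like the statement generated below. `rewrite_data_files` and its `min-input-files` option are part of Iceberg's Spark procedures; the helper function, the catalog name and the table names are hypothetical, and the sketch only builds the SQL text rather than running it.

```python
from typing import List


def compaction_call(catalog: str, table: str, min_input_files: int = 5) -> str:
    """Build the Spark SQL CALL statement that compacts one Iceberg table."""
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('min-input-files', '{min_input_files}'))"
    )


def compaction_calls(catalog: str, tables: List[str]) -> List[str]:
    """One statement per table: this is the manual workload that grows with table count."""
    return [compaction_call(catalog, t) for t in tables]
```

With thousands of tables, an operator would have to issue and size one such call per table, which is the burden the automatic scheme is meant to remove.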

Example 1

Referring to fig. 1, fig. 1 is a flowchart of a method for automatically optimizing a table data structure layout according to an embodiment of the present application, where the method includes:

S1, acquiring resource information to be optimized;

S2, performing resource binding between k8s and the data lake iceberg table by utilizing the optimized resource information to obtain metadata information of the iceberg table and a result of isolating resources;

S3, automatically generating an optimization task by utilizing the metadata information of the iceberg table and the result of isolating resources;

S4, setting a spark task by using a k8s spark operator service, and operating the optimization task through the set spark task to obtain a related operation result of the data lake iceberg table.

In one embodiment, the resource information acquired in step S1 refers broadly to a resource pool in which the optimization task can run; in this embodiment it includes, but is not limited to, a k8s Namespace.

In one embodiment, as described in step S2, performing resource binding between k8s and the data lake iceberg table by utilizing the optimized resource information to obtain the metadata information of the iceberg table and the result of isolating resources includes:

judging whether the optimized resource information meets the first preset condition;

if yes, generating a Namespace through an API client function provided by k8s;

applying a resource quota to the Namespace to complete the resource configuration;

using the result of the resource configuration, binding a one-to-one association between the Namespace under k8s and the catalog under the data lake iceberg, to obtain a result of resource isolation;

Referring to FIG. 2, it is first determined whether the optimized resource information satisfies the first preset condition; if so, the resource information is persisted, i.e., written to the database, before the k8s Namespace is created.

When creating the k8s Namespace, it is first determined whether the Namespace already exists in k8s, and it is created only if it does not. Finally, the HDFS configuration of the Spark job is set to the ConfigMap configuration file created in the preceding step, which completes the resource isolation.

In one embodiment, as described in step S3, the automatically generating an optimization task using metadata information of the iceberg table and the result of the isolating resource includes:

scoring the optimized operation task;

In a specific implementation, the state details of the iceberg table are obtained as follows:

The catalog manager scans the information of the databases and tables under the catalog and obtains the storage path of each table; the iceberg API (Application Programming Interface) is then called to acquire the table's snapshot information, the file information under each snapshot, the delete file information and the record count information.

Whether an optimization operation and maintenance task currently needs to be created is judged from factors such as the task type, the number of small files and the number of delete files; task priorities are then scored and tasks are scheduled by priority:

If the current table has a new snapshot, the total snapshot data exceeds a threshold, and the task queue contains no task of the same type, a cleaning task is generated.

If there is new data, the number of small files or the number of delete files exceeds a threshold, and the task queue contains no task of the same type, a merging task is generated.

If the table state meets the merging task condition, a sort field is set in the configuration parameters, the proportion of newly added data in the total data volume exceeds a threshold, and the task queue contains no task of the same type, a sorting task is generated.

Generating task priorities by scoring the tasks, and determining the running sequence of the tasks:

The starting priority score of the merging task and the sorting task is higher than that of the cleaning task.

The priority of the cleaning task is increased in gradient steps according to the snapshot data volume.

The priorities of the merging task and the sorting task are increased in gradient steps according to the number of delete files and the number of small files.

If the task in the queue does not run for a long time, the priority is automatically raised.
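The automatic priority increase for long-waiting tasks is a form of priority aging. A minimal sketch, assuming a boost of 5 points per full hour of waiting capped at 50 points (all three values are invented for the example, as the patent gives no concrete schedule):

```python
def effective_priority(base_score: int, waited_seconds: float,
                       step_seconds: float = 3600.0,
                       boost_per_step: int = 5,
                       max_boost: int = 50) -> int:
    """Raise a queued task's priority the longer it has waited (aging)."""
    steps = int(waited_seconds // step_seconds)
    return base_score + min(steps * boost_per_step, max_boost)
```

Capping the boost keeps a very old low-value task from permanently outranking freshly created high-score tasks, while still guaranteeing it eventually moves up the queue.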

In one embodiment, before the running the optimization task and obtaining the relevant running result of the data lake iceberg table, the method further includes:

wherein the relevant information of the optimization task comprises task information, which includes catalogid, catalogname, database information and table information;

In a specific implementation, referring to FIG. 3, the lakehouse catalog associated with the optimization resource is queried, and the task queue, i.e., the optimization tasks, is obtained according to the lakehouse catalog; the optimization parameters of a task are obtained from the acquired optimization task; and it is judged whether the requirements of the resource cluster are met;

In a specific implementation, the resource information to be optimized is acquired; the optimized resource information is used to perform resource binding between k8s and the data lake Iceberg table, obtaining a resource isolation result; an optimization task is automatically generated from the resource isolation result; and the optimization task is run to obtain a related operation result for the data lake Iceberg table. Because table optimization is carried out through optimization tasks, the number of small files and delete files of a table can be reduced, and the query performance of the table can be improved by rewriting the table structure. In addition, the resource settings for optimizing a table are generated automatically from the task generation rules and the scoring rules, so users do not need to track the number of small files and delete files of the table, which improves efficiency. Because resource binding is performed between k8s and the data lake Iceberg table and optimization tasks are generated and executed automatically, a dedicated optimizer role is no longer needed in DataOps scenarios, and file structure optimization can run continuously in the background, which minimizes the operation and maintenance workload and reduces operation and maintenance cost.

Example 2

Referring to fig. 4, fig. 4 is a schematic flow diagram of a system for automatically optimizing a layout of a table data structure according to an embodiment of the present application, where the system includes:

an obtaining module 41, configured to obtain resource information to be optimized;

the resource isolation module 42 is configured to obtain a resource isolation result by performing resource binding between k8s and the data lake iceberg table by using the optimized resource information;

the data processing module 43 is configured to automatically generate an optimization task by using the result of the isolated resource;

and the execution module 44 is configured to set a spark task by using a k8s spark operator service, and execute the optimization task through the set spark task to obtain a related operation result of the data lake iceberg table.

Example 3

Referring to fig. 5, fig. 5 is a schematic flow diagram of an apparatus for automatically optimizing a layout of a table data structure according to an embodiment of the present application, where the apparatus includes:

a memory 51 on which an executable program is stored;

a processor 52 for executing the executable program in the memory 51 to implement the steps of the method as described in any of the above.

It is to be understood that the same or similar parts of the above embodiments may be referred to one another, and that content not described in detail in one embodiment may be found in the same or similar parts of other embodiments.

It should be noted that in the description of the present application, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present application, unless otherwise indicated, "plurality" means at least two.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.

Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.

The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims

1. A method for automatically optimizing a table data structure layout, the method comprising:

acquiring resource information to be optimized;

performing resource binding between k8s and the data lake iceberg table by utilizing the optimized resource information to obtain metadata information of the iceberg table and a result of isolating resources;

automatically generating an optimization task by utilizing metadata information of the iceberg table and the result of the isolated resource;

setting a spark task by using a k8s spark operator service, and operating the optimization task through the set spark task to obtain a related operation result of the data lake iceberg table;

the step of performing resource binding between k8s and the data lake iceberg table by using the optimized resource information to obtain the metadata information of the iceberg table and the result of isolating resources comprises the following steps:

if yes, generating a Namespace through an API client function provided by k8s;

wherein, the first preset condition is: the optimized resource information does not exist in the relevant background and meets k8s resource requirements;

the automatic generation of the optimization task by utilizing the metadata information of the iceberg table and the result of the isolated resource comprises the following steps:

performing priority scheduling on the optimized operation task;

setting a spark task by using a k8s spark operator service, and running the optimization task through the set spark task to obtain a relevant running result of the data lake iceberg table, wherein the method comprises the following steps:

2. The method of claim 1, wherein the obtaining the operating state of the data lake iceberg table using the result of the isolated resource comprises:

3. The method of claim 1, wherein creating a corresponding optimized operation and maintenance task if the threshold number of relevant operation state factors satisfies a second preset condition comprises:

4. The method of claim 1, wherein prioritizing the optimized operation and maintenance tasks comprises:

scoring the optimized operation task;

5. The method according to claim 1, wherein the setting a spark task by using the k8s spark operator service, and before running the optimization task through the set spark task, obtaining the relevant running result of the data lake iceberg table, further includes:

6. A system for automatically optimizing a layout of a table data structure, the system comprising:

if yes, generating a Namespace through an API client function provided by k8s;

performing priority scheduling on the optimized operation task;

the execution module is used for setting a spark task by utilizing a k8s spark operator service, and running the optimization task through the set spark task to obtain a related running result of the data lake iceberg table;

7. An apparatus for automatically optimizing a layout of a table data structure, the apparatus comprising:

a memory having an executable program stored thereon;

a processor for executing the executable program in the memory to implement the steps of the method of any one of claims 1-5.