CN116578570B - Method, system and equipment for automatically optimizing table data structure layout


Publication number
CN116578570B
Authority
CN
China
Prior art keywords
task
iceberg
resource
optimization
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310851427.1A
Other languages
Chinese (zh)
Other versions
CN116578570A (en)
Inventor
赵扬名
张敢
吴文池
吴小前
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Deepexi Technology Co Ltd
Original Assignee
Beijing Deepexi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Deepexi Technology Co Ltd
Priority to CN202310851427.1A
Publication of CN116578570A
Application granted
Publication of CN116578570B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to the technical field of big data and data lakes, and in particular to a method, a system and equipment for automatically optimizing the layout of a table data structure. The method comprises: acquiring optimization resource information; performing resource binding between k8s and the data lake Iceberg table by using the optimization resource information to obtain metadata information of the Iceberg table and a resource isolation result; automatically generating an optimization task by using the metadata information of the Iceberg table and the resource isolation result; and running the optimization task to obtain the corresponding operation result for the data lake Iceberg table. Because table optimization is carried out by the optimization task, the number of small files and the number of delete files of a table can be reduced, and the query performance of the table can be improved by rewriting the table structure. Meanwhile, when a table is optimized, the resource settings for the optimization can be generated automatically through the task generation rules and the scoring rules, so users do not need to track the number of small files and delete files of the table, and operation and maintenance efficiency is improved.

Description

Method, system and equipment for automatically optimizing table data structure layout
Technical Field
The application relates to the technical field of big data and data lakes, in particular to a method, a system and equipment for automatically optimizing the layout of a table data structure.
Background
Currently, the compaction service of Iceberg operates on a per-table basis and requires manually calling Spark stored procedures. In real deployments the Iceberg tables number in the hundreds or thousands, or even more than ten thousand, which brings a huge operation and maintenance cost. In addition, when a Spark stored procedure is called to run a compaction task, the compute units cannot be sized according to the current state of the table. A small table may therefore occupy many compute units and waste resources, or a large table may be given too few resources, so that the task runs for a long time, query efficiency drops, and the task may eventually fail. Furthermore, when a compaction task is run, resource isolation can only be achieved by manually designating a cluster; automatic isolation by catalog, database, table and the like is not possible.
In the prior art, therefore, optimizing the table structure with compaction tasks incurs an excessive operation and maintenance cost, the query performance falls short of the expected effect, and the overall operating efficiency is low.
Disclosure of Invention
The application provides a method, a system and equipment for automatically optimizing the layout of a table data structure, in order to solve, at least to some extent, the problems in the related art that optimizing the table structure with compaction tasks incurs an excessive operation and maintenance cost, fails to reach the expected query performance, and results in low overall operating efficiency.
The scheme of the application is as follows:
In a first aspect, the present application provides a method of automatically optimizing a table data structure layout, the method comprising:
acquiring optimization resource information;
performing resource binding between k8s and the data lake Iceberg table by using the optimization resource information to obtain metadata information of the Iceberg table and a resource isolation result;
automatically generating an optimization task by using the metadata information of the Iceberg table and the resource isolation result;
and setting a Spark task by using the k8s Spark operator service, and running the optimization task through the set Spark task to obtain the corresponding operation result for the data lake Iceberg table.
Further, performing resource binding between k8s and the data lake Iceberg table by using the optimization resource information to obtain the metadata information of the Iceberg table and the resource isolation result includes:
judging whether the optimization resource information meets a first preset condition;
if yes, creating a Namespace through the API client provided by k8s;
applying a resource quota to the Namespace, configuring the maximum CPU and the maximum memory available to it under the current conditions, thereby completing resource configuration;
binding a one-to-one association between the Namespace under k8s and the catalog under the data lake Iceberg by using the result of the resource configuration, to obtain the metadata information of the Iceberg table and the resource isolation result;
wherein the first preset condition is: the optimization resource information does not yet exist in the backend, and the optimization resource information meets the k8s resource requirement.
Further, automatically generating an optimization task by using the metadata information of the Iceberg table and the resource isolation result includes:
acquiring running state information of the data lake Iceberg table by using the resource isolation result;
deriving running state factors from the running state information of the data lake Iceberg table, wherein the running state factors comprise: a task type factor, a small file count factor, a delete file count factor, a newly added snapshot factor and a newly added data factor;
judging, by using the running state factors, whether an optimization operation and maintenance task currently needs to be created;
if the values of the running state factors meet a second preset condition defined by thresholds, creating the corresponding optimization operation and maintenance task;
and performing priority scheduling on the optimization operation and maintenance task.
Further, acquiring the running state information of the data lake Iceberg table by using the resource isolation result includes:
scanning the databases and tables under the catalog through the catalog manager of the data lake Iceberg by using the resource isolation result, and acquiring the storage path of each table;
calling the Iceberg API with the acquired storage path of the table to acquire the running state information of the data lake Iceberg table;
wherein the running state information of the data lake Iceberg table comprises: table snapshot information, information on the files under each snapshot, delete file information and record count information.
Further, creating the corresponding optimization operation and maintenance task if the running state factors meet the second preset condition includes:
if the current data lake Iceberg table has a newly added snapshot factor, the total snapshot data exceeds a first snapshot threshold preset in the backend, and no corresponding cleaning task exists in the current task queue, creating a corresponding cleaning task;
if the current data lake Iceberg table has a newly added data factor, the small file count factor or the delete file count factor exceeds a second threshold preset in the backend, and no corresponding merging task exists in the current task queue, creating a corresponding merging task;
if the current data lake Iceberg table has a newly added data factor, the small file count factor or the delete file count factor exceeds the second threshold preset in the backend, the proportion of the newly added data in the total data volume exceeds a third threshold preset in the backend, and no corresponding sorting task exists in the current task queue, creating a corresponding sorting task.
Further, performing priority scheduling on the optimization operation and maintenance task includes:
scoring the optimization operation and maintenance task;
generating task priorities from high to low according to the scoring results, and executing the optimization operation and maintenance tasks in order of task priority;
wherein the initial priority score of the merging task and the sorting task is higher than that of the cleaning task;
the priority of the cleaning task is increased in steps according to the snapshot data volume;
and the priorities of the merging task and the sorting task are each increased according to the delete file count factor and the small file count factor.
Further, before setting a Spark task by using the k8s Spark operator service and running the optimization task through the set Spark task to obtain the corresponding operation result for the data lake Iceberg table, the method further includes:
generating task acquisition threads according to the number of optimization resources of the bound catalog, by using the resource isolation result and the automatically generated optimization tasks;
judging, by using the task acquisition thread, whether available resources exist under the current catalog;
if so, acquiring the task to be run from the task queue according to priority;
if priorities are equal, using the creation time of the task as the ordering criterion, so that the task created earlier is executed first.
Further, setting a Spark task by using the k8s Spark operator service and running the optimization task through the set Spark task to obtain the corresponding operation result for the data lake Iceberg table includes:
acquiring optimization parameters of the optimization task by using the information of the optimization task, wherein the optimization parameters comprise: maximum number of small files, minimum file size, maximum file size, sorting field, delete file threshold and snapshot retention time;
judging, by using the optimization parameters of the optimization task, whether the requirements of the resource cluster are met;
if yes, submitting the optimization task to the resource cluster and setting the resources of the Spark task through the k8s Spark operator;
and running the optimization task through the set Spark task to obtain the corresponding operation result for the data lake Iceberg table.
In a second aspect, the present application provides a system for automatically optimizing a table data structure layout, the system comprising:
the acquisition module is used for acquiring optimization resource information;
the resource isolation module is used for performing resource binding between k8s and the data lake Iceberg table by using the optimization resource information to obtain metadata information of the Iceberg table and a resource isolation result;
the data processing module is used for automatically generating an optimization task by using the metadata information of the Iceberg table and the resource isolation result;
and the execution module is used for setting a Spark task by using the k8s Spark operator service, and running the optimization task through the set Spark task to obtain the corresponding operation result for the data lake Iceberg table.
In a third aspect, the present application provides an apparatus for automatically optimizing a table data structure layout, the apparatus comprising:
a memory having an executable program stored thereon;
a processor for executing the executable program in the memory to implement the steps of any of the methods described above.
The technical scheme provided by the application can comprise the following beneficial effects:
The method and the device acquire the optimization resource information; perform resource binding between k8s and the data lake Iceberg table by using the optimization resource information to obtain a resource isolation result; automatically generate an optimization task by using the resource isolation result; and run the optimization task to obtain the corresponding operation result for the data lake Iceberg table. Because table optimization is carried out by the optimization task, the number of small files and the number of delete files of a table can be reduced, and the query performance of the table can be improved by rewriting the table structure. Meanwhile, when a table is optimized, the resource settings for the optimization can be generated automatically through the task generation rules and the scoring rules, so users do not need to track the number of small files and delete files of the table, which improves efficiency. Since resources are bound between k8s and the data lake Iceberg table and optimization tasks are generated and executed automatically, no dedicated optimizer role is needed in DataOps scenarios, and file structure optimization can run continuously in the background. This minimizes the operation and maintenance workload, reduces operation and maintenance cost, and improves operation and maintenance efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of a method for automatically optimizing a table data structure layout according to one embodiment of the present application;
FIG. 2 is a flow chart of optimizing resource management of a dynamic optimization table data structure layout according to another embodiment of the present application;
FIG. 3 is a schematic flow chart of a running optimization task of a dynamic optimization table data structure layout according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a system composition flow for automatically optimizing a table data structure layout according to one embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for automatically optimizing a layout of a table data structure according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
Iceberg is an open table format that redefines how the metadata of data is organized. It is a middle layer between the upper-layer compute engines and the underlying storage formats and can be used in big data analysis scenarios. Iceberg provides compute engines with unified SQL-like semantics, while the underlying data is still stored in formats such as Parquet and ORC.
One of the key trade-offs in managing an Iceberg data lake is between write throughput and query performance. To achieve better write throughput and lower data visibility latency, incoming data is written to smaller data files. This greatly increases the parallelism of the compute engine and thus the data ingestion speed. However, this approach creates many small files, which can degrade query performance. In addition, on many file systems, including HDFS, the performance of the file system itself may also decrease when there are many small files. To support an architecture that allows fast insertion without hurting query performance, Iceberg provides a compaction service that rewrites the data structure layout and merges small files into large files, improving query efficiency without affecting write throughput.
Currently, the compaction service of Iceberg operates on a per-table basis and requires manually calling Spark stored procedures. In real deployments the Iceberg tables number in the hundreds or thousands, or even more than ten thousand, which inevitably brings a huge operation and maintenance cost.
When a Spark stored procedure is called to run a compaction task, the compute units cannot be sized according to the current state of the table. A small table may occupy many compute units and waste resources, or a large table may be given too few resources, so that the task runs for a long time, query efficiency drops, and the task may eventually fail. When a compaction task is run, resource isolation can only be achieved by manually designating a cluster; automatic isolation by catalog, database, table and the like is not possible.
The application provides a method, a system and equipment for automatically optimizing the layout of a table data structure, in order to solve, at least to some extent, the problems in the related art that optimizing the table structure with compaction tasks incurs an excessive operation and maintenance cost, fails to reach the expected query performance, and results in low overall operating efficiency.
Example 1
Referring to fig. 1, fig. 1 is a flowchart of a method for automatically optimizing a table data structure layout according to an embodiment of the present application, where the method includes:
S1, acquiring optimization resource information;
S2, performing resource binding between k8s and the data lake Iceberg table by using the optimization resource information to obtain metadata information of the Iceberg table and a resource isolation result;
S3, automatically generating an optimization task by using the metadata information of the Iceberg table and the resource isolation result;
S4, setting a Spark task by using the k8s Spark operator service, and running the optimization task through the set Spark task to obtain the corresponding operation result for the data lake Iceberg table.
In one embodiment, the optimization resource information acquired in step S1 refers broadly to a resource pool in which optimization tasks can run; in this embodiment it includes, but is not limited to, a k8s Namespace.
In one embodiment, performing resource binding between k8s and the data lake Iceberg table by using the optimization resource information, as described in step S2, to obtain the metadata information of the Iceberg table and the resource isolation result includes:
judging whether the optimization resource information meets a first preset condition;
if yes, creating a Namespace through the API client provided by k8s;
applying a resource quota to the Namespace, configuring the maximum CPU and the maximum memory available to it under the current conditions, thereby completing resource configuration;
binding a one-to-one association between the Namespace under k8s and the catalog under the data lake Iceberg by using the result of the resource configuration, to obtain the resource isolation result;
wherein the first preset condition is: the optimization resource information does not yet exist in the backend, and the optimization resource information meets the k8s resource requirement.
Referring to fig. 2 in detail, it is first determined whether the optimization resource information satisfies the first preset condition; if so, the resource information is persisted, i.e. written into the database, before the k8s Namespace is created.
When creating the k8s Namespace, it is judged whether the Namespace already exists in k8s, and it is created only if it does not; finally, the HDFS configuration of the Spark job is set to the configmap configuration file created in the preceding step, which completes the resource isolation.
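The resource isolation step can be illustrated with a minimal sketch. It only builds the Namespace and ResourceQuota manifests as plain dictionaries and records the one-to-one binding in memory; the class `OptimizeResource`, the helper names, the capacity check and the `-quota` naming suffix are assumptions made for illustration and are not taken from the application. Submitting the manifests to a cluster and persisting the binding in a database are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizeResource:
    """Hypothetical description of one optimization resource pool."""
    namespace: str       # name of the k8s Namespace to create
    catalog: str         # Iceberg catalog bound one-to-one to the Namespace
    max_cpu_cores: int   # CPU ceiling for all optimization jobs in the pool
    max_memory_gib: int  # memory ceiling for all optimization jobs in the pool


def meets_first_condition(res, known_namespaces, free_cpu_cores, free_memory_gib):
    """First preset condition: not registered yet, and the cluster can host it."""
    is_new = res.namespace not in known_namespaces
    fits = (res.max_cpu_cores <= free_cpu_cores
            and res.max_memory_gib <= free_memory_gib)
    return is_new and fits


def namespace_manifest(res):
    """Manifest for the Namespace that isolates the optimization jobs."""
    return {"apiVersion": "v1", "kind": "Namespace",
            "metadata": {"name": res.namespace}}


def quota_manifest(res):
    """ResourceQuota capping the CPU and memory usable inside the Namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{res.namespace}-quota", "namespace": res.namespace},
        "spec": {"hard": {
            "limits.cpu": str(res.max_cpu_cores),
            "limits.memory": f"{res.max_memory_gib}Gi",
        }},
    }


def bind_catalog(res, bindings):
    """Record the one-to-one Namespace <-> catalog association."""
    if res.catalog in bindings.values():
        raise ValueError(f"catalog {res.catalog!r} is already bound")
    bindings[res.namespace] = res.catalog
    return bindings
```

A caller would check `meets_first_condition` first, then apply both manifests with whichever k8s client it uses, and only then call `bind_catalog`.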
In one embodiment, automatically generating an optimization task by using the metadata information of the Iceberg table and the resource isolation result, as described in step S3, includes:
acquiring running state information of the data lake Iceberg table by using the resource isolation result;
deriving running state factors from the running state information of the data lake Iceberg table, wherein the running state factors comprise: a task type factor, a small file count factor, a delete file count factor, a newly added snapshot factor and a newly added data factor;
judging, by using the running state factors, whether an optimization operation and maintenance task currently needs to be created;
if the values of the running state factors meet a second preset condition defined by thresholds, creating the corresponding optimization operation and maintenance task;
and performing priority scheduling on the optimization operation and maintenance task.
Further, acquiring the running state information of the data lake Iceberg table by using the resource isolation result includes:
scanning the databases and tables under the catalog through the catalog manager of the data lake Iceberg by using the resource isolation result, and acquiring the storage path of each table;
calling the Iceberg API with the acquired storage path of the table to acquire the running state information of the data lake Iceberg table;
wherein the running state information of the data lake Iceberg table comprises: table snapshot information, information on the files under each snapshot, delete file information and record count information.
Further, creating the corresponding optimization operation and maintenance task if the running state factors meet the second preset condition includes:
if the current data lake Iceberg table has a newly added snapshot factor, the total snapshot data exceeds a first snapshot threshold preset in the backend, and no corresponding cleaning task exists in the current task queue, creating a corresponding cleaning task;
if the current data lake Iceberg table has a newly added data factor, the small file count factor or the delete file count factor exceeds a second threshold preset in the backend, and no corresponding merging task exists in the current task queue, creating a corresponding merging task;
if the current data lake Iceberg table has a newly added data factor, the small file count factor or the delete file count factor exceeds the second threshold preset in the backend, the proportion of the newly added data in the total data volume exceeds a third threshold preset in the backend, and no corresponding sorting task exists in the current task queue, creating a corresponding sorting task.
Further, performing priority scheduling on the optimization operation and maintenance task includes:
scoring the optimization operation and maintenance task;
generating task priorities from high to low according to the scoring results, and executing the optimization operation and maintenance tasks in order of task priority;
wherein the initial priority score of the merging task and the sorting task is higher than that of the cleaning task;
the priority of the cleaning task is increased in steps according to the snapshot data volume;
and the priorities of the merging task and the sorting task are each increased according to the delete file count factor and the small file count factor.
In a specific implementation, the state details of the Iceberg table are obtained as follows:
the databases and tables under the catalog are scanned through the catalog manager, the storage path of each table is obtained, and the Iceberg API (Application Programming Interface) is called to acquire the table's snapshot information, the file information under each snapshot, the delete file information and the record count information.
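How the collected state details might be condensed into counts can be sketched as follows. The sketch does not call the Iceberg API; it assumes the snapshot ids and per-file details have already been read, and the record types `FileInfo` and `TableState` as well as the 32 MiB small-file cut-off are hypothetical choices for illustration, not values from the application.

```python
from dataclasses import dataclass

# Assumed cut-off below which a data file counts as "small".
SMALL_FILE_BYTES = 32 * 1024 * 1024


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical per-file record read from a table's current snapshot."""
    path: str
    size_bytes: int
    record_count: int
    is_delete_file: bool = False


@dataclass(frozen=True)
class TableState:
    """Condensed running state of one table."""
    snapshot_count: int
    data_file_count: int
    small_file_count: int
    delete_file_count: int
    record_count: int


def summarize_table(snapshot_ids, files, small_file_bytes=SMALL_FILE_BYTES):
    """Reduce snapshot and file details to the counts used for task decisions."""
    data_files = [f for f in files if not f.is_delete_file]
    return TableState(
        snapshot_count=len(set(snapshot_ids)),
        data_file_count=len(data_files),
        small_file_count=sum(1 for f in data_files if f.size_bytes < small_file_bytes),
        delete_file_count=sum(1 for f in files if f.is_delete_file),
        record_count=sum(f.record_count for f in data_files),
    )
```

Keeping the summary separate from the metadata reader means the later threshold checks can be exercised without a live catalog.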
Whether an optimization operation and maintenance task currently needs to be created is judged according to factors such as the task type, the number of small files and the number of delete files; task priorities are then scored and the tasks are scheduled by priority:
If the current table has a new snapshot, the total snapshot data exceeds a threshold, and the task queue contains no task of the same type, a cleaning task is generated.
If there is new data, the number of small files or the number of delete files exceeds a threshold, and the task queue contains no task of the same type, a merging task is generated.
If the table state meets the merging task condition, a sorting field is set in the configuration parameters, the proportion of newly added data in the total data volume exceeds a threshold, and the task queue contains no task of the same type, a sorting task is generated.
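The three generation rules above can be sketched as a single decision function. The names `Factors`, `Thresholds` and `tasks_to_create` are hypothetical, all threshold defaults are placeholders, and "total snapshot data" is interpreted here as a snapshot count, which is an assumption. The rules are evaluated independently, so a table may receive more than one task type in the same pass.

```python
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    CLEAN = "clean"
    MERGE = "merge"
    SORT = "sort"


@dataclass(frozen=True)
class Factors:
    """Hypothetical running state factors of one table."""
    has_new_snapshot: bool
    snapshot_total: int      # assumed to be a snapshot count
    has_new_data: bool
    small_files: int
    delete_files: int
    new_data_ratio: float    # newly added data / total data volume
    sort_fields: tuple = ()  # sorting fields from the configuration, if any


@dataclass(frozen=True)
class Thresholds:
    """Backend-preset thresholds; every default here is a placeholder."""
    snapshot_total: int = 100
    small_files: int = 50
    delete_files: int = 20
    new_data_ratio: float = 0.3


def tasks_to_create(f, queued, t=Thresholds()):
    """Return the task types to enqueue; `queued` holds types already waiting."""
    tasks = []
    if (f.has_new_snapshot and f.snapshot_total > t.snapshot_total
            and TaskType.CLEAN not in queued):
        tasks.append(TaskType.CLEAN)
    merge_condition = f.has_new_data and (
        f.small_files > t.small_files or f.delete_files > t.delete_files)
    if merge_condition and TaskType.MERGE not in queued:
        tasks.append(TaskType.MERGE)
    if (merge_condition and f.sort_fields
            and f.new_data_ratio > t.new_data_ratio
            and TaskType.SORT not in queued):
        tasks.append(TaskType.SORT)
    return tasks
```

Passing the set of already queued types is what enforces the "no task of the same type in the queue" clause of each rule.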
Task priorities are generated by scoring the tasks, which determines the order in which the tasks run:
The initial priority score of merging tasks and sorting tasks is higher than that of cleaning tasks.
The priority of a cleaning task is increased in steps according to the snapshot data volume.
The priorities of merging tasks and sorting tasks are increased in steps according to the number of delete files and the number of small files.
If a task in the queue has not run for a long time, its priority is raised automatically.
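One way to read these scoring rules is as a base score per task type plus stepwise bonuses. The sketch below follows that reading; the base scores, the step size of 100 and the one-point-per-hour aging are placeholder values chosen for illustration, since the application does not state concrete numbers.

```python
# Placeholder base scores: merging and sorting start above cleaning.
BASE_SCORE = {"clean": 10, "merge": 50, "sort": 50}
STEP = 100            # assumed step width for the gradient bonuses
AGING_SECONDS = 3600  # assumed waiting time that earns one extra point


def score(task_type, snapshot_total=0, small_files=0, delete_files=0,
          waited_seconds=0.0):
    """Return a priority score; higher scores run first."""
    points = BASE_SCORE[task_type]
    if task_type == "clean":
        # Cleaning tasks gain priority as snapshots accumulate.
        points += snapshot_total // STEP
    else:
        # Merging and sorting tasks gain priority with small and delete files.
        points += small_files // STEP + delete_files // STEP
    # Aging: tasks that have waited long are raised automatically.
    points += int(waited_seconds // AGING_SECONDS)
    return points
```

Integer steps keep the ordering stable between scans, so a task only moves ahead when a table has degraded by a full step or has waited a full aging interval.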
In one embodiment, before running the optimization task and obtaining the relevant running result of the data lake iceberg table, the method further includes:
generating task acquisition threads according to the number of optimization resources of the bound catalog, by using the result of the isolated resources and the automatically generated optimization tasks;
judging, by the task acquisition thread, whether available resources exist under the current catalog;
if so, acquiring a task to be run from the task queue according to priority;
if priorities are equal, using the creation time of the tasks as the tie-breaker, so that the task created earlier is executed first.
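The acquisition order described above (highest priority first, earlier creation time on ties, and nothing acquired when the catalog has no free resources) can be sketched with a binary heap. The `QueuedTask` and `TaskQueue` names and the `free_slots` parameter are hypothetical, and threading is left out to keep the ordering logic visible.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class QueuedTask:
    name: str
    priority: int
    created_at: float  # seconds since the epoch


class TaskQueue:
    """Priority queue: higher priority first, earlier creation time on ties."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, float, int, QueuedTask]] = []
        self._seq = itertools.count()  # final tie-breaker keeps tuples comparable

    def put(self, task: QueuedTask) -> None:
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-task.priority, task.created_at, next(self._seq), task))

    def acquire(self, free_slots: int) -> Optional[QueuedTask]:
        """Return the next task, or None if no resources are free or the queue is empty."""
        if free_slots <= 0 or not self._heap:
            return None
        return heapq.heappop(self._heap)[-1]
```

In a threaded setting, each task acquisition thread would call `acquire` under a lock with the number of free slots of its catalog.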
Further, setting a spark task by using the k8s spark operator service, and running the optimization task through the set spark task to obtain the relevant running result of the data lake iceberg table, includes:
acquiring optimization parameters of the optimization task by using the related information of the optimization task, wherein the optimization parameters comprise: the maximum number of small files, the minimum file size, the maximum file size, the sort field, the delete file threshold and the snapshot retention time;
wherein the related information of the optimization task comprises: catalogid, catalogname, database information and table information;
judging, by using the optimization parameters of the optimization task, whether the requirements of the resource cluster are met;
if yes, submitting the optimization task to the resource cluster and setting the resources of the spark task through the k8s spark operator;
and running the optimization task through the set spark task to obtain the relevant running result of the data lake iceberg table.
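The application does not say which statements the spark task runs. One plausible realization, shown here only as an assumption, maps each task type onto the Spark stored procedures that Iceberg provides (`rewrite_data_files` and `expire_snapshots`). The `OptimizationParams` class, the `build_call` helper, the mapping of "maximum number of small files" onto the `min-input-files` option, and the use of an absolute `expire_older_than` timestamp in place of a retention duration are all illustrative choices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OptimizationParams:
    max_small_files: int
    min_file_size_bytes: int
    max_file_size_bytes: int
    sort_field: Optional[str]
    delete_file_threshold: int
    expire_older_than: str  # timestamp literal derived from the snapshot retention time


def build_call(task_type: str, catalog: str, database: str, table: str,
               p: OptimizationParams) -> str:
    """Build the Spark SQL CALL statement for one optimization task."""
    target = f"table => '{database}.{table}'"
    if task_type == "clean":
        return (f"CALL {catalog}.system.expire_snapshots("
                f"{target}, older_than => TIMESTAMP '{p.expire_older_than}')")
    if task_type not in ("merge", "sort"):
        raise ValueError(f"unknown task type: {task_type}")

    # Assumed mapping of the optimization parameters onto rewrite options.
    options = {
        "min-input-files": p.max_small_files,
        "min-file-size-bytes": p.min_file_size_bytes,
        "max-file-size-bytes": p.max_file_size_bytes,
        "delete-file-threshold": p.delete_file_threshold,
    }
    option_sql = ", ".join(f"'{k}', '{v}'" for k, v in options.items())

    args = [target]
    if task_type == "sort":
        if not p.sort_field:
            raise ValueError("a sorting task needs a sort field")
        args += ["strategy => 'sort'", f"sort_order => '{p.sort_field}'"]
    args.append(f"options => map({option_sql})")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"
```

The identifiers are interpolated directly for brevity; a production version would validate the catalog, database and table names before building the statement.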
In a specific implementation, as detailed in fig. 3, the lakehouse catalog associated with the optimization resource is queried, and the task queue, that is, the optimization tasks, is obtained according to that catalog; the optimization parameters of each obtained optimization task are then acquired, and whether the requirements of the resource cluster are met is judged;
if yes, the optimization task is submitted to the resource cluster and the resources of the spark task are set through the k8s spark operator;
and the optimization task is run through the set spark task to obtain the relevant running result of the data lake iceberg table.
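As a sketch of the submission step, the following builds a `SparkApplication` manifest of the kind accepted by the Kubernetes spark operator (`sparkoperator.k8s.io/v1beta2`) after a simple capacity check. The `SparkResources` class, the `fits_quota` check, the image name, script path, Spark version and service account are hypothetical placeholders; the application only states that resources are set through the k8s spark operator.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SparkResources:
    driver_cores: int
    driver_memory: str
    executor_cores: int
    executor_instances: int
    executor_memory: str

    def total_cores(self) -> int:
        return self.driver_cores + self.executor_cores * self.executor_instances


def fits_quota(res: SparkResources, free_cores: int) -> bool:
    """Assumed form of the cluster requirement check: compare CPU cores only."""
    return res.total_cores() <= free_cores


def spark_application(name: str, namespace: str, image: str, main_file: str,
                      arguments: List[str], res: SparkResources) -> dict:
    """Build a SparkApplication manifest for the spark operator."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "pythonVersion": "3",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.3.2",          # placeholder version
            "arguments": list(arguments),
            "restartPolicy": {"type": "Never"},
            "driver": {
                "cores": res.driver_cores,
                "memory": res.driver_memory,
                "serviceAccount": "spark",     # placeholder service account
            },
            "executor": {
                "cores": res.executor_cores,
                "instances": res.executor_instances,
                "memory": res.executor_memory,
            },
        },
    }
```

The resulting dictionary can be serialized to YAML or JSON and created in the Namespace bound to the table's catalog, so that the task runs inside that catalog's resource quota.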
In a specific implementation, the resource information to be optimized is obtained; resource binding is carried out between k8s and the data lake iceberg table by using the optimized resource information, so that a resource isolation result is obtained; an optimization task is automatically generated by using the result of the isolated resources; and the optimization task is run to obtain the relevant running result of the data lake iceberg table. According to the application, table optimization is realized through the optimization task, so that the number of small files and the number of delete files of the table can be reduced, and the query performance of the table can be improved by rewriting the table structure. Meanwhile, when a table is optimized, the optimization tasks and their resource settings are generated automatically through the task generation rules and the scoring rules, so the user does not need to track the number of small files and delete files of the table, which improves efficiency. Because the optimization tasks are generated and executed automatically through the resource binding between k8s and the data lake iceberg table, a dedicated optimizer role is no longer needed in DataOps scenarios, file structure optimization can run continuously in the background, and the operation and maintenance workload and cost are reduced.
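The resource binding step can be illustrated with the Namespace-plus-quota approach recited in the claims: one Namespace per catalog, capped by a `ResourceQuota`, with a strictly one-to-one association. The manifests below use standard Kubernetes `v1` objects; the `namespace_manifests` helper, the `BindingRegistry` class and the quota naming scheme are hypothetical.

```python
from typing import Dict, List


def namespace_manifests(namespace: str, max_cpu: str, max_memory: str) -> List[dict]:
    """Build a Namespace and a ResourceQuota capping its CPU and memory."""
    return [
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": namespace}},
        {
            "apiVersion": "v1",
            "kind": "ResourceQuota",
            "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
            "spec": {"hard": {"limits.cpu": max_cpu, "limits.memory": max_memory}},
        },
    ]


class BindingRegistry:
    """Keeps the catalog-to-Namespace association strictly one-to-one."""

    def __init__(self) -> None:
        self._by_catalog: Dict[str, str] = {}
        self._by_namespace: Dict[str, str] = {}

    def bind(self, catalog: str, namespace: str) -> None:
        if catalog in self._by_catalog or namespace in self._by_namespace:
            raise ValueError("catalog and Namespace must be bound one-to-one")
        self._by_catalog[catalog] = namespace
        self._by_namespace[namespace] = catalog

    def namespace_for(self, catalog: str) -> str:
        return self._by_catalog[catalog]
```

Because every spark task for a catalog is submitted into that catalog's Namespace, the quota bounds the resources that optimization of its tables can consume.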
Example two
Referring to fig. 4, fig. 4 is a schematic flow diagram of a system for automatically optimizing a layout of a table data structure according to an embodiment of the present application, where the system includes:
an obtaining module 41, configured to obtain resource information to be optimized;
a resource isolation module 42, configured to perform resource binding between k8s and the data lake iceberg table by using the optimized resource information to obtain a resource isolation result;
a data processing module 43, configured to automatically generate an optimization task by using the result of the isolated resources;
and an execution module 44, configured to set a spark task by using the k8s spark operator service, and run the optimization task through the set spark task to obtain the relevant running result of the data lake iceberg table.
Example III
Referring to fig. 5, fig. 5 is a schematic flow diagram of an apparatus for automatically optimizing a layout of a table data structure according to an embodiment of the present application, where the apparatus includes:
a memory 51 on which an executable program is stored;
a processor 52 for executing the executable program in the memory 51 to implement the steps of the method as described in any of the above.
It is to be understood that the same or similar parts of the above embodiments may be referred to each other, and that content not described in detail in one embodiment may be found in the same or similar parts of other embodiments.
It should be noted that in the description of the present application, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present application, unless otherwise indicated, the meaning of "plurality" means at least two.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (7)

1. A method for automatically optimizing a table data structure layout, the method comprising:
acquiring resource information to be optimized;
performing resource binding between k8s and the data lake iceberg table by utilizing the optimized resource information to obtain metadata information of the iceberg table and a result of isolating resources;
automatically generating an optimization task by utilizing metadata information of the iceberg table and the result of the isolated resource;
setting a spark task by using a k8s spark operator service, and operating the optimization task through the set spark task to obtain a related operation result of the data lake iceberg table;
the step of obtaining the result of the isolation of the resources and the metadata information of the iceberg table by using the optimized resource information and performing resource binding between k8s and the data lake iceberg table comprises the following steps:
judging whether the optimized resource information meets a first preset condition or not;
if yes, generating a Namespace through an API client function provided by k8s;
setting a resource quota on the Namespace by configuring the maximum available CPU and the maximum available memory under the current conditions, thereby realizing resource configuration;
binding a one-to-one association between the Namespace under k8s and the catalog under the data lake iceberg by using the result of the resource configuration, to obtain the metadata information of the iceberg table and the result of the isolated resources;
wherein, the first preset condition is: the optimized resource information does not exist in the relevant background and meets k8s resource requirements;
the automatic generation of the optimization task by utilizing the metadata information of the iceberg table and the result of the isolated resource comprises the following steps:
acquiring the running state information of the data lake iceberg table by utilizing the result of the isolated resource;
acquiring relevant operation state factors by using the operation state information of the data lake iceberg table, wherein the relevant operation state factors comprise: task type factors, small file quantity factors, delete file quantity factors, newly added snapshot factors and newly added data factors;
judging, by using the relevant operation state factors, whether a related optimized operation and maintenance task currently needs to be created;
if the threshold number of the relevant operation state factors meets a second preset condition, creating a corresponding optimized operation and maintenance task;
performing priority scheduling on the optimized operation and maintenance task;
setting a spark task by using a k8s spark operator service, and running the optimization task through the set spark task to obtain a relevant running result of the data lake iceberg table, wherein the method comprises the following steps:
acquiring optimization parameters of the optimization task by using the related information of the optimization task, wherein the optimization parameters comprise: the maximum number of small files, the minimum file size, the maximum file size, the sort field, the delete file threshold and the snapshot retention time;
judging whether the requirements of the resource clusters are met or not by utilizing the optimization parameters of the optimization task;
if yes, submitting the optimization task to the resource cluster and setting resources of the spark task through a k8s spark operator;
and operating the optimization task through the set spark task to obtain a related operation result of the data lake iceberg table.
2. The method of claim 1, wherein the obtaining the operating state of the data lake iceberg table using the result of the isolated resource comprises:
scanning related information of the databases and tables under the catalog through a catalog manager function under the data lake iceberg by using the result of the isolated resources, and acquiring the storage path of each table;
calling an API of the iceberg by using the acquired storage path of the table, and acquiring the running state information of the data lake iceberg table;
wherein the running state information of the data lake iceberg table comprises: table snapshot information, file information under the snapshot, delete file information and record count information.
3. The method of claim 1, wherein creating a corresponding optimized operation and maintenance task if the threshold number of relevant operation state factors satisfies a second preset condition comprises:
if a newly added snapshot factor exists in the current data lake iceberg table, the total snapshot data exceeds a first snapshot threshold preset in the background, and no corresponding cleaning task has been created in the current task queue, creating a corresponding cleaning task;
if a newly added data factor exists in the current data lake iceberg table, the small file quantity factor or the delete file quantity factor exceeds a second new factor threshold preset in the background, and no corresponding merging task has been created in the current task queue, creating a corresponding merging task;
if a newly added data factor exists in the current data lake iceberg table, the small file quantity factor or the delete file quantity factor exceeds the second new factor threshold preset in the background, the ratio of the newly added data to the total data volume exceeds a third new factor threshold preset in the background, and no corresponding sorting task has been created in the current task queue, creating a corresponding sorting task.
4. The method of claim 1, wherein prioritizing the optimized operation and maintenance tasks comprises:
scoring the optimized operation and maintenance tasks;
generating task priorities from high to low according to the scoring results, and executing the optimized operation and maintenance tasks according to the task priorities;
wherein the starting priority score of the merging task and the sorting task is higher than that of the cleaning task;
the priority of the cleaning task is raised stepwise according to the snapshot data volume;
and the priorities of the merging task and the sorting task are raised according to the delete file quantity factor and the small file quantity factor, respectively.
5. The method according to claim 1, wherein, before setting a spark task by using the k8s spark operator service and running the optimization task through the set spark task to obtain the relevant running result of the data lake iceberg table, the method further comprises:
generating task acquisition threads according to the number of optimization resources of the bound catalog, by using the result of the isolated resources and the automatically generated optimization tasks;
judging, by the task acquisition thread, whether available resources exist under the current catalog;
if so, acquiring a task to be run from the task queue according to priority;
if priorities are equal, using the creation time of the tasks as the tie-breaker, so that the task created earlier is executed first.
6. A system for automatically optimizing a layout of a table data structure, the system comprising:
the acquisition module is used for acquiring the resource information to be optimized;
the resource isolation module is used for carrying out resource binding between k8s and the data lake iceberg table by utilizing the optimized resource information to obtain metadata information of the iceberg table and a result of isolating resources;
the step of obtaining the result of the isolation of the resources and the metadata information of the iceberg table by using the optimized resource information and performing resource binding between k8s and the data lake iceberg table comprises the following steps:
judging whether the optimized resource information meets a first preset condition or not;
if yes, generating a Namespace through an API client function provided by k8s;
setting a resource quota on the Namespace by configuring the maximum available CPU and the maximum available memory under the current conditions, thereby realizing resource configuration;
binding a one-to-one association between the Namespace under k8s and the catalog under the data lake iceberg by using the result of the resource configuration, to obtain the metadata information of the iceberg table and the result of the isolated resources;
wherein, the first preset condition is: the optimized resource information does not exist in the relevant background and meets k8s resource requirements;
the data processing module is used for automatically generating an optimization task by utilizing metadata information of the iceberg table and the result of the isolated resource;
the automatic generation of the optimization task by utilizing the metadata information of the iceberg table and the result of the isolated resource comprises the following steps:
acquiring the running state information of the data lake iceberg table by utilizing the result of the isolated resource;
acquiring relevant operation state factors by using the operation state information of the data lake iceberg table, wherein the relevant operation state factors comprise: task type factors, small file quantity factors, delete file quantity factors, newly added snapshot factors and newly added data factors;
judging, by using the relevant operation state factors, whether a related optimized operation and maintenance task currently needs to be created;
if the threshold number of the relevant operation state factors meets a second preset condition, creating a corresponding optimized operation and maintenance task;
performing priority scheduling on the optimized operation and maintenance task;
the execution module is used for setting a spark task by utilizing a k8s spark operator service, and running the optimization task through the set spark task to obtain a related running result of the data lake iceberg table;
setting a spark task by using a k8s spark operator service, and running the optimization task through the set spark task to obtain a relevant running result of the data lake iceberg table, wherein the method comprises the following steps:
acquiring optimization parameters of the optimization task by using the related information of the optimization task, wherein the optimization parameters comprise: the maximum number of small files, the minimum file size, the maximum file size, the sort field, the delete file threshold and the snapshot retention time;
judging whether the requirements of the resource clusters are met or not by utilizing the optimization parameters of the optimization task;
if yes, submitting the optimization task to the resource cluster and setting resources of the spark task through a k8s spark operator;
and operating the optimization task through the set spark task to obtain a related operation result of the data lake iceberg table.
7. An apparatus for automatically optimizing a layout of a table data structure, the apparatus comprising:
a memory having an executable program stored thereon;
a processor for executing the executable program in the memory to implement the steps of the method of any one of claims 1-5.
CN202310851427.1A 2023-07-12 2023-07-12 Method, system and equipment for automatically optimizing table data structure layout Active CN116578570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310851427.1A CN116578570B (en) 2023-07-12 2023-07-12 Method, system and equipment for automatically optimizing table data structure layout

Publications (2)

Publication Number Publication Date
CN116578570A CN116578570A (en) 2023-08-11
CN116578570B true CN116578570B (en) 2023-10-13

Family

ID=87538195

Country Status (1)

Country Link
CN (1) CN116578570B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116880993A (en) * 2023-09-04 2023-10-13 北京滴普科技有限公司 Method and device for processing large number of small files in Iceberg

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10216455B1 (en) * 2017-02-14 2019-02-26 Veritas Technologies Llc Systems and methods for performing storage location virtualization
CN112379935A (en) * 2019-07-29 2021-02-19 中兴通讯股份有限公司 Spark performance optimization control method, device, equipment and storage medium
CN112448846A (en) * 2020-11-05 2021-03-05 北京浪潮数据技术有限公司 Health inspection method, device and equipment for k8s cluster
CN115509693A (en) * 2022-11-02 2022-12-23 广西壮族自治区公众信息产业有限公司 Data optimization method based on cluster Pod scheduling combined with data lake
CN116166191A (en) * 2022-12-30 2023-05-26 中国电信股份有限公司 Integrated system of lake and storehouse

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant