CN117807064A - Data lake information processing method, device, equipment and storage medium - Google Patents

Data lake information processing method, device, equipment and storage medium

Info

Publication number
CN117807064A
Authority
CN
China
Prior art keywords
file
data
information
target partition
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311624911.7A
Other languages
Chinese (zh)
Inventor
刘子鸿
姜雪明
黄乙元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Supcon Information Industry Co Ltd
Original Assignee
Zhejiang Supcon Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Supcon Information Industry Co Ltd filed Critical Zhejiang Supcon Information Industry Co Ltd
Priority to CN202311624911.7A priority Critical patent/CN117807064A/en
Publication of CN117807064A publication Critical patent/CN117807064A/en
Pending legal-status Critical Current

Abstract

The application provides a data lake information processing method, apparatus, device and storage medium, and relates to the technical field of data processing. The method comprises: in response to an information deletion operation for a first target partition in an information table of a preset data lake, determining, from the information table, a first manifest file path corresponding to the information of the first target partition; scanning a first manifest folder of the first target partition in the preset data lake according to the first manifest file path to obtain a first manifest file in the first manifest folder; deleting the metadata information of each data file in the first manifest file; and deleting the path of each data file from the information table. Because only the metadata information of the data files in the first manifest file is deleted when the first target partition is deleted, a preset query engine can still find the first manifest file path when it queries the data files of the first target partition, which avoids query errors.

Description

Data lake information processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing information in a data lake.
Background
The data lake Delta Lake stores the metadata information of partitioned tables created with the data warehouse tool Hive in Hive's metadata store, while the specific storage path of the data is managed by Delta Lake. Partition data is stored under a specific table on the storage medium HDFS, and a specified partition of the corresponding table can be selected directly for deletion and modification; when a partition is added or deleted, the corresponding metadata changes as well.
However, the partition deletion provided by Delta Lake deletes the manifest folder under the specific partition path, whereas the preset query engine Presto looks up data through the manifest folder under the target partition. As a result, when Presto queries the partition, the manifest folder is reported as missing and the query fails.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a data lake information processing method, apparatus, device and storage medium in which, when a first target partition is deleted, only the metadata information of each data file in a first manifest file is deleted, so that a preset query engine can still find the first manifest file path when querying the data files of the first target partition, thereby avoiding query errors.
In order to achieve the above purpose, the technical solution adopted in the embodiment of the present application is as follows:
in a first aspect, an embodiment of the present application provides a data lake information processing method, including:
in response to an information deletion operation for a first target partition in an information table of a preset data lake, determining, from the information table, a first manifest file path corresponding to the information of the first target partition;
scanning, according to the first manifest file path, a first manifest folder of the first target partition in the preset data lake to obtain a first manifest file in the first manifest folder;
deleting metadata information of each data file in the first manifest file;
and deleting the paths of the data files in the information table.
In an alternative embodiment, the method further comprises:
if a preset data partition synchronization strategy is triggered, acquiring a data file to be synchronized;
determining information of a second target partition corresponding to the data file to be synchronized by adopting the preset data partition synchronization strategy;
determining a second manifest file path corresponding to the information of the second target partition from the information table;
according to the second manifest file path, performing manifest file scanning on a second manifest file folder of the second target partition in the preset data lake to obtain a second manifest file in the second manifest file folder;
synchronizing the metadata information of the data file to be synchronized into the second manifest file to obtain a path of the data file to be synchronized;
and adding the path of the data file to be synchronized into the information table.
In an alternative embodiment, the acquiring the data file to be synchronized includes:
searching a preset distributed data file system to obtain a path of the data file to be synchronized;
and acquiring the data file to be synchronized from the preset distributed data file system based on the path of the data file to be synchronized.
In an alternative embodiment, the method further comprises:
and responding to an information adding operation aiming at a third target partition in the information table, and adding the third target partition in the preset data lake.
In an alternative embodiment, the method further comprises:
and adding the information of the third target partition in the information table at regular time by adopting a preset partition information adding command.
In an alternative embodiment, the method further comprises:
in response to a data query operation for a fourth target partition in the information table, determining, from the information table, a fourth manifest file path corresponding to the information of the fourth target partition;
scanning, according to the fourth manifest file path, a fourth manifest folder of the fourth target partition in the preset data lake to obtain a fourth manifest file in the fourth manifest folder;
and scanning the data files of the fourth target partition in the preset data lake according to the path of each data file in the fourth manifest file to obtain the data file to be queried.
In an alternative embodiment, before deleting the path of each data file in the information table, the method includes:
and determining the path of each data file according to the metadata information of each data file.
In a second aspect, an embodiment of the present application further provides a data lake information processing apparatus, including:
the determining module is used for responding to the information deleting operation of the first target partition in the information table aiming at the preset data lake, and determining a first list file path corresponding to the information of the first target partition from the information table;
the scanning module is used for scanning a first manifest folder of the first target partition in the preset data lake according to the first manifest file path to obtain a first manifest file in the first manifest folder;
the deleting module is used for deleting the metadata information of each data file in the first list file;
the deleting module is further configured to delete paths of the data files in the information table.
In a third aspect, embodiments of the present application further provide a computer device, including: a processor, a storage medium, and a bus, the storage medium storing program instructions executable by the processor, the processor and the storage medium communicating over the bus when the computer device is running, the processor executing the program instructions to perform the steps of the data lake information processing method according to any one of the first aspects.
In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor performs the steps of the data lake information processing method according to any one of the first aspects.
The beneficial effects of this application are:
the embodiments of the application provide a data lake information processing method, apparatus, device and storage medium. The method comprises: in response to an information deletion operation for a first target partition in an information table of a preset data lake, determining, from the information table, a first manifest file path corresponding to the information of the first target partition; scanning a first manifest folder of the first target partition in the preset data lake according to the first manifest file path to obtain a first manifest file in the first manifest folder; deleting the metadata information of each data file in the first manifest file; and deleting the path of each data file from the information table. When the first target partition is deleted, the first manifest file is located and only the metadata information of the data files recorded in it is deleted; the first manifest file and the first manifest file path themselves are not deleted. A preset query engine can therefore still find the first manifest file path when it queries the data files of the first target partition, which avoids query errors.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a data lake information processing method provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of another method for processing information of a data lake according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of another method for processing information of a data lake according to an embodiment of the present application;
Fig. 4 is a schematic functional block diagram of a data lake information processing device according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention.
Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In the description of the present application, it should be noted that, if the terms "upper", "lower", and the like indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, or an azimuth or the positional relationship that is commonly put when the product of the application is used, it is merely for convenience of description and simplification of the description, and does not indicate or imply that the apparatus or element to be referred to must have a specific azimuth, be configured and operated in a specific azimuth, and therefore should not be construed as limiting the present application.
Furthermore, the terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, without conflict, features in embodiments of the present application may be combined with each other.
Presto is an open-source distributed SQL query engine for running interactive analytic queries on large-scale data sets, and it provides good real-time big-data query capability on top of Delta Lake. When the Presto query engine queries the data files of a target partition in the preset data lake, it first needs to obtain the manifest file of the target partition and read the data file paths from it, and only then reads the data files of the target partition. At present, if a target partition in the preset data lake is deleted, the manifest file of the target partition is deleted along with it; that is, both the manifest file path of the target partition and the data file entries in the manifest file are removed. When the data files of that partition are then queried through the preset query engine, the manifest file path cannot be obtained and a query error occurs.
Therefore, in order to avoid query errors when a preset query engine queries the data files of a deleted partition in a preset data lake, the embodiments of the application provide a data lake information processing method, which includes: in response to an information deletion operation for a first target partition in an information table of a preset data lake, determining, from the information table, a first manifest file path corresponding to the information of the first target partition; scanning a first manifest folder of the first target partition in the preset data lake according to the first manifest file path to obtain a first manifest file in the first manifest folder; deleting the metadata information of each data file in the first manifest file; and deleting the path of each data file from the information table. By identifying the data files of the first target partition, deleting only their metadata information and retaining the manifest file, query errors are avoided when the preset query engine queries the data files of the first target partition.
The data lake information processing method provided in the embodiments of the present application is explained in detail below through specific examples with reference to the accompanying drawings. The method can be implemented by a computer device on which a data lake information processing algorithm or software is pre-installed, by running that algorithm or software. The computer device may be, for example, a server or a terminal, and the terminal may be a user computer. Fig. 1 is a schematic flow chart of a data lake information processing method according to an embodiment of the present application. As shown in Fig. 1, the method includes:
S101, in response to an information deletion operation for a first target partition in an information table of a preset data lake, determining, from the information table, a first manifest file path corresponding to the information of the first target partition.
In this embodiment, the preset data lake is Delta Lake. Delta Lake is an open-source project that makes it possible to build a Lakehouse architecture on top of a data lake. It provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing on existing data lakes. The information table of the preset data lake is a Delta table, and the metadata of the information table includes the table structure, partition information, data file paths and the like; for example, it may include the partition information of partitions A, B and C and the data file paths corresponding to each partition. Metadata, also called intermediary data or relay data, is data that describes data (data about data). It mainly describes the properties of the data and supports functions such as indicating storage locations, keeping history, locating resources and recording files.
When the first target partition in the preset data lake is deleted, the metadata in the information table is parsed by the ad hoc query tool Delta Connector, and the first manifest file path is determined from the information table according to the information (that is, the identifier) of the first target partition. The metadata of the information table is stored in the metadata directory of the information table and in the transaction log (Transaction Log).
S102, scanning, according to the first manifest file path, the first manifest folder of the first target partition in the preset data lake to obtain the first manifest file in the first manifest folder.
A manifest file is a file that records the data files of a table. It stores metadata about the data files, such as file path, size and checksum. Manifest files can improve query performance and allow the preset data lake Delta Lake to manage and maintain the metadata of the table more efficiently.
The first manifest folder of the first target partition in the preset data lake is determined according to the first manifest file path, and the first manifest folder is scanned to obtain the first manifest file in it.
S103, deleting metadata information of each data file in the first list file.
The data files are Parquet files. Parquet is a free, open-source, column-oriented data storage format in the Apache Hadoop ecosystem, and the preset data lake Delta Lake uses Parquet to store the actual data.
Specifically, the first manifest file contains the metadata information of each data file in the first target partition, and that metadata information includes the path of the data file. The metadata information of each data file in the first manifest file is deleted while the first manifest file and the first manifest file path are retained, so that the first manifest file no longer lists any data file.
S104, deleting paths of all data files in the information table.
Optionally, before deleting the path of each data file in the information table, the method further includes:
and determining the path of each data file according to the metadata information of each data file.
Specifically, since the metadata information of each data file in the first manifest file is deleted, and the metadata information of each data file includes the path of each data file, the path of each data file can be determined according to the metadata information of each data file.
Because the information table comprises the data file paths of the target partitions, the paths of the data files of the first target partition are determined from the information table according to the determined paths of the data files and deleted, so that the data in the information table is updated.
In summary, the embodiments of the application provide a data lake information processing method, which includes: in response to an information deletion operation for a first target partition in an information table of a preset data lake, determining, from the information table, a first manifest file path corresponding to the information of the first target partition; scanning a first manifest folder of the first target partition in the preset data lake according to the first manifest file path to obtain a first manifest file in the first manifest folder; deleting the metadata information of each data file in the first manifest file; and deleting the path of each data file from the information table. When the first target partition is deleted, the first manifest file is located and only the metadata information of the data files recorded in it is deleted; the first manifest file and the first manifest file path themselves are not deleted. A preset query engine can therefore still find the first manifest file path when it queries the data files of the first target partition, which avoids query errors.
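Steps S101 to S104 can be illustrated with a minimal in-memory sketch. It is not the implementation of the application: the `InfoTable` class, the `delete_partition_info` function, the dictionary that stands in for the manifest files, and the layout of a manifest entry (a dict with a `"path"` key) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InfoTable:
    """Hypothetical stand-in for the information table (Delta table metadata)."""
    # partition identifier -> manifest file path
    manifest_paths: dict = field(default_factory=dict)
    # partition identifier -> paths of the partition's data files
    data_file_paths: dict = field(default_factory=dict)


def delete_partition_info(table, manifests, partition):
    """Clear a partition's manifest entries but keep the manifest itself.

    `manifests` maps a manifest file path to its list of entries and plays
    the role of the manifest files stored in the data lake (an assumption).
    """
    # S101: look up the manifest file path for the partition.
    manifest_path = table.manifest_paths[partition]
    # S102: "scan" the manifest folder; here this is only a dictionary lookup.
    entries = manifests[manifest_path]
    # The data file paths are read from the metadata before it is removed.
    removed = [entry["path"] for entry in entries]
    # S103: delete the per-file metadata; the manifest stays, now empty.
    entries.clear()
    # S104: delete the same paths from the information table.
    current = table.data_file_paths.get(partition, [])
    table.data_file_paths[partition] = [p for p in current if p not in removed]
    return removed
```

After the call, the partition's manifest path is still registered in the table and the manifest still exists, which is the property the method relies on to keep later queries from failing.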
The embodiment of the application also provides a possible implementation manner of another data lake information processing method, and fig. 2 is a schematic flow chart of another data lake information processing method provided in the embodiment of the application. As shown in fig. 2, the method further includes:
S201, if a preset data partition synchronization strategy is triggered, acquiring a data file to be synchronized.
In this embodiment, the preset data partition synchronization policy is used to synchronize the data file to be synchronized to the second manifest file of the second target partition of the preset data lake.
Optionally, searching a preset distributed data file system to obtain a path of the data file to be synchronized; and acquiring the data file to be synchronized from a preset distributed data file system based on the path of the data file to be synchronized.
The preset distributed data file system is the Hadoop Distributed File System (HDFS), a distributed file system (Distributed File System) designed to run on general-purpose hardware (commodity hardware).
And obtaining a path of the data file to be synchronized by retrieving a preset distributed data file system, and obtaining the data file to be synchronized from the preset distributed data file system.
S202, determining information of a second target partition corresponding to the data file to be synchronized by adopting a preset data partition synchronization strategy.
S203, determining a second manifest file path corresponding to the information of the second target partition from the information table.
S204, scanning, according to the second manifest file path, the second manifest folder of the second target partition in the preset data lake to obtain the second manifest file in the second manifest folder.
Specifically, the metadata of the information table includes the table structure, partition information, data file paths and the like, and the second manifest file path is determined from the information table according to the information (that is, the identifier) of the second target partition.
The second manifest folder of the second target partition in the preset data lake is determined according to the second manifest file path, and the second manifest folder is scanned to obtain the second manifest file in it.
S205, synchronizing the metadata information of the data files to be synchronized into the second manifest file to obtain paths of the data files to be synchronized.
S206, adding the paths of the data files to be synchronized into the information table.
Specifically, the second manifest file contains the metadata information of each data file in the second target partition, where the metadata information of each data file includes the identifier of the second target partition, the data file path and the data. The metadata information of the data file to be synchronized is synchronized into the second manifest file, so that the second manifest file also contains the metadata information of the data file to be synchronized.
Because the information table contains the data file paths of each target partition, the path of the data file to be synchronized of the second target partition is added to the information table, so that the data in the information table is updated.
In the method provided by the embodiment of the invention, if the preset data partition synchronization strategy is triggered, the data file to be synchronized is acquired, the preset data partition synchronization strategy is adopted to determine the information of the second target partition corresponding to the data file to be synchronized, the second list file path corresponding to the information of the second target partition is determined from the information table, then the list file scanning is performed on the second list folder of the second target partition in the preset data lake according to the second list file path, the second list file in the second list folder is obtained, the metadata information of the data file to be synchronized is synchronized to the second list file, the path of the data file to be synchronized is obtained, and finally the path of the data file to be synchronized is added to the information table, so that the synchronization of the data file to be synchronized is realized.
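The synchronization flow S202 to S206 can be sketched in the same in-memory style. The application does not specify the synchronization strategy, so `partition_for` below, which takes the partition identifier from the parent directory name of the data file, is purely an assumed policy; the function names, the plain dictionaries and the manifest entry layout are likewise hypothetical.

```python
def partition_for(data_file_path):
    """Assumed policy: the partition identifier is the parent directory name."""
    parts = data_file_path.rstrip("/").split("/")
    return parts[-2]


def sync_data_file(manifest_paths, data_file_paths, manifests, data_file_path):
    """Register one data file in its partition's manifest and in the table.

    manifest_paths:  partition identifier -> manifest file path
    data_file_paths: partition identifier -> list of data file paths
    manifests:       manifest file path -> list of entries (dicts)
    """
    # S202: determine the target partition for the file.
    partition = partition_for(data_file_path)
    # S203: look up the manifest file path of that partition.
    manifest_path = manifest_paths[partition]
    # S204: "scan" the manifest folder; here this is only a dictionary lookup.
    entries = manifests[manifest_path]
    # S205: add the file's metadata to the manifest unless it is already there.
    if all(entry["path"] != data_file_path for entry in entries):
        entries.append({"partition": partition, "path": data_file_path})
    # S206: add the file's path to the information table.
    paths = data_file_paths.setdefault(partition, [])
    if data_file_path not in paths:
        paths.append(data_file_path)
    return partition
```

The membership checks make the sketch idempotent, so a strategy that is triggered repeatedly does not register the same file twice; that is a design choice of the sketch, not something stated in the application.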
The embodiment of the application also provides a possible implementation manner of another data lake information processing method, and the method further comprises the following steps:
and adding the third target partition in the preset data lake in response to the information adding operation aiming at the third target partition in the information table.
Specifically, when a third target partition is added to the information table, the third target partition is added to the preset data lake in response to the information adding operation for the third target partition in the information table. The target partitions in the preset data lake may be re-divided so as to add the third target partition, or the third target partition may be added on top of the existing partitions; this is not limited herein.
Optionally, the information of the third target partition is added to the information table at regular intervals using a preset partition information adding command.
Specifically, the data warehouse is Apache Hive, a Hadoop-based data warehouse tool that provides an SQL-like query language (HiveQL) for querying and analyzing data stored in the Hadoop Distributed File System (HDFS). When a third target partition is added to the information table, the metadata in the data warehouse is not updated immediately, which leaves the metadata of the information table inconsistent.
Therefore, the information of the third target partition needs to be added to the information table at regular intervals using a preset partition information adding command. The preset partition information adding command is the MSCK REPAIR TABLE command, which repairs the metadata of a partitioned table (Partitioned Table) in Hive: it scans the storage path of the table, identifies new partitions and updates Hive's metadata, thereby repairing the partition information of the table and adding the information of the third target partition to the information table.
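The timed repair can be sketched as follows. Only the statement text `MSCK REPAIR TABLE <table>` comes from Hive; the function names, the identifier check and the `run_statement` callable (a placeholder for whatever Hive client is actually used to submit the statement) are assumptions of this sketch.

```python
import re
import time

# Accept plain "table" or "database.table" identifiers only (an assumed,
# deliberately strict rule to avoid building a malformed statement).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_repair_statement(table_name):
    """Return the HiveQL statement that repairs a table's partition metadata."""
    if not _IDENTIFIER.match(table_name):
        raise ValueError("unexpected table name: %r" % (table_name,))
    return "MSCK REPAIR TABLE " + table_name


def repair_periodically(run_statement, table_name, interval_seconds, rounds,
                        sleep=time.sleep):
    """Submit the repair statement `rounds` times, pausing between rounds.

    `run_statement` is a hypothetical callable that sends one HiveQL
    statement to Hive; `sleep` is injectable so the loop can be tested.
    """
    statement = build_repair_statement(table_name)
    for i in range(rounds):
        run_statement(statement)
        if i + 1 < rounds:
            sleep(interval_seconds)
```

In a deployment the loop would more likely be replaced by an external scheduler; the bounded `rounds` parameter is only there to keep the sketch finite.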
The embodiment of the application also provides another possible implementation manner of the data lake information processing method, and fig. 3 is a schematic flow chart of another data lake information processing method provided in the embodiment of the application. As shown in fig. 3, the method further includes:
S301, in response to a data query operation for a fourth target partition in the information table, determining, from the information table, a fourth manifest file path corresponding to the information of the fourth target partition.
In this embodiment, when the data of the fourth target partition in the information table is queried through the preset query engine Presto, the metadata in the information table, including the table structure, partition information, data file paths and the like, is parsed by the ad hoc query tool, and the fourth manifest file path corresponding to the information of the fourth target partition is determined.
S302, scanning, according to the fourth manifest file path, the fourth manifest folder of the fourth target partition in the preset data lake to obtain the fourth manifest file in the fourth manifest folder.
Specifically, the fourth manifest folder of the fourth target partition in the preset data lake is determined according to the fourth manifest file path, and the fourth manifest folder is scanned to obtain the fourth manifest file in it.
S303, scanning the data file of the fourth target partition in the preset data lake according to the path of each data file in the fourth manifest file to obtain the data file to be queried.
Specifically, the preset query engine Presto reads the fourth manifest file to obtain the paths of the data files, and scans the data files of the fourth target partition in the preset data lake according to those paths to read the data file to be queried. The preset query engine can parse the data files directly.
In the method provided by the embodiments of the application, in response to a data query operation for a fourth target partition in the information table, a fourth manifest file path corresponding to the information of the fourth target partition is determined from the information table; according to the fourth manifest file path, the fourth manifest folder of the fourth target partition in the preset data lake is scanned to obtain the fourth manifest file in the fourth manifest folder; and the data files of the fourth target partition in the preset data lake are scanned according to the path of each data file in the fourth manifest file to obtain the data file to be queried. Even if the fourth target partition is a deleted partition, only the metadata information of each data file in its fourth manifest file has been deleted, while the fourth manifest file and the fourth manifest file path are retained; the data file to be queried is therefore empty, but no query error occurs.
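The difference between an emptied manifest and a missing one can be made concrete with a short sketch of S301 to S303. It does not reproduce how Presto actually reads manifests: the `query_partition` function, the dictionaries standing in for the information table, the manifest files and the data files, and the choice of `FileNotFoundError` to model the query error are all assumptions.

```python
def query_partition(manifest_paths, manifests, data_files, partition):
    """Return the contents of the data files listed in a partition's manifest.

    manifest_paths: partition identifier -> manifest file path
    manifests:      manifest file path -> list of entries (dicts with "path")
    data_files:     data file path -> file contents (any object)
    """
    # S301: determine the manifest file path from the information table.
    manifest_path = manifest_paths[partition]
    # S302: scan for the manifest; a missing manifest models the query error
    # that occurs when the manifest folder itself has been deleted.
    if manifest_path not in manifests:
        raise FileNotFoundError("manifest not found: " + manifest_path)
    entries = manifests[manifest_path]
    # S303: read every data file listed in the manifest.
    return [data_files[entry["path"]] for entry in entries]
```

With a retained but emptied manifest the function returns an empty list, whereas a manifest that was removed together with the partition raises the error.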
The data lake information processing apparatus and the computer device provided by the embodiments of the present application are described below. Their specific implementation processes and technical effects are the same as those of the corresponding method embodiments; for brevity, for anything not mentioned in this part, reference may be made to the corresponding content in the method embodiments.
Fig. 4 is a schematic functional block diagram of a data lake information processing device according to an embodiment of the present application. As shown in fig. 4, the data lake information processing apparatus 100 includes:
a determining module 110, configured to determine, from an information table of a preset data lake, a first manifest file path corresponding to information of a first target partition in response to an information deletion operation for the first target partition in the information table;
the scanning module 120 is configured to perform manifest file scanning on a first manifest file folder of the first target partition in the preset data lake according to the first manifest file path, so as to obtain a first manifest file in the first manifest file folder;
a deleting module 130, configured to delete metadata information of each data file in the first manifest file;
the deleting module 130 is further configured to delete paths of the data files in the information table.
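The cooperation of the determining, scanning, and deleting modules can be illustrated with a minimal Python sketch. It assumes a simplified in-memory model: `delete_partition`, the dictionary layout, and every path shown are hypothetical and are not taken from the application. The sketch derives each data file path from its metadata entry before the entry is removed, which is one way to know which paths to delete from the information table.

```python
def delete_partition(info_table, manifests, partition):
    """Delete a partition's data file metadata while keeping its manifest file path."""
    entry = info_table[partition]
    manifest_path = entry["manifest_path"]        # determining module
    manifest = manifests[manifest_path]           # scanning module
    # Derive each data file path from its metadata before the metadata is removed.
    removed = {meta["path"] for meta in manifest}
    manifest.clear()                              # deleting module: metadata information
    entry["data_file_paths"] = [                  # deleting module: paths in the table
        p for p in entry["data_file_paths"] if p not in removed
    ]
    return manifest_path


# Hypothetical example data for one partition with two data files.
info_table = {
    "dt=2023-11-30": {
        "manifest_path": "/lake/t/dt=2023-11-30/manifest",
        "data_file_paths": [
            "/lake/t/dt=2023-11-30/part-0",
            "/lake/t/dt=2023-11-30/part-1",
        ],
    }
}
manifests = {
    "/lake/t/dt=2023-11-30/manifest": [
        {"path": "/lake/t/dt=2023-11-30/part-0", "size": 10},
        {"path": "/lake/t/dt=2023-11-30/part-1", "size": 20},
    ]
}

kept_path = delete_partition(info_table, manifests, "dt=2023-11-30")
```

After the call, the manifest file still exists but holds no entries, and the information table still maps the partition to its manifest file path, so a later query can resolve the path and simply finds nothing to read.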
Optionally, the data lake information processing apparatus 100 further includes:
the acquisition module is used for acquiring a data file to be synchronized if a preset data partition synchronization strategy is triggered;
the determining module 110 is further configured to determine information of a second target partition corresponding to the data file to be synchronized by adopting a preset data partition synchronization policy; determining a second manifest file path corresponding to the information of the second target partition from the information table;
the scanning module 120 is further configured to perform a manifest file scanning on a second manifest file folder of a second target partition in the preset data lake according to the second manifest file path, so as to obtain a second manifest file in the second manifest file folder;
the synchronization module is used for synchronizing the metadata information of the data files to be synchronized into the second manifest file to obtain paths of the data files to be synchronized;
and the adding module is used for adding the path of the data file to be synchronized into the information table.
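The synchronization path through these modules can be sketched as follows. The application does not specify the preset data partition synchronization policy, so this sketch assumes, purely for illustration, that the target partition is the name of the directory containing the data file. The functions `partition_for` and `sync_file`, the dictionary layout, and the example paths are all hypothetical.

```python
import posixpath


def partition_for(path):
    """Assumed policy: the partition is the name of the file's parent directory."""
    return posixpath.basename(posixpath.dirname(path))


def sync_file(info_table, manifests, file_meta):
    """Record a data file's metadata in its partition's manifest and information table."""
    partition = partition_for(file_meta["path"])             # second target partition
    manifest_path = info_table[partition]["manifest_path"]   # second manifest file path
    manifest = manifests[manifest_path]                       # second manifest file
    path = file_meta["path"]
    if all(meta["path"] != path for meta in manifest):
        manifest.append(file_meta)                            # synchronize the metadata
    paths = info_table[partition]["data_file_paths"]
    if path not in paths:
        paths.append(path)                                    # add the path to the table
    return path


# Hypothetical example: an empty partition that receives one data file.
info_table = {
    "dt=2023-11-30": {
        "manifest_path": "/lake/t/dt=2023-11-30/manifest",
        "data_file_paths": [],
    }
}
manifests = {"/lake/t/dt=2023-11-30/manifest": []}
new_file = {"path": "/warehouse/t/dt=2023-11-30/part-0.parquet", "size": 128}

synced_path = sync_file(info_table, manifests, new_file)
sync_file(info_table, manifests, new_file)  # a repeated sync adds no duplicate entry
```

The duplicate checks are a design choice of this sketch so that a policy triggered repeatedly does not register the same data file twice; the application itself does not state how repeated triggers are handled.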
Optionally, the acquisition module is configured to search a preset distributed data file system to obtain a path of the data file to be synchronized, and to acquire the data file to be synchronized from the preset distributed data file system based on that path.
Optionally, the adding module is further configured to add the third target partition in the preset data lake in response to an information adding operation for the third target partition in the information table.
Optionally, the adding module is further configured to periodically add the information of the third target partition to the information table by using a preset partition information adding command.
Optionally, the determining module 110 is further configured to determine, from the information table, a fourth manifest file path corresponding to information of the fourth target partition in response to a data query operation for the fourth target partition in the information table;
the scanning module 120 is further configured to perform, according to the fourth manifest file path, manifest file scanning on a fourth manifest file folder of a fourth target partition in the preset data lake, to obtain a fourth manifest file in the fourth manifest file folder; and scanning the data file of the fourth target partition in the preset data lake according to the path of each data file in the fourth manifest file to obtain the data file to be queried.
Optionally, the determining module 110 is further configured to determine a path of each data file according to metadata information of each data file.
The foregoing apparatus is used for executing the method provided in the foregoing embodiment, and its implementation principle and technical effects are similar, and are not described herein again.
The above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application-specific integrated circuits (ASICs), one or more microprocessors, or one or more field-programmable gate arrays (FPGAs). As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can invoke the program code. As a further example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
Fig. 5 is a schematic diagram of a computer device according to an embodiment of the present application, where the computer device may be used for data lake information processing. As shown in fig. 5, the computer device 200 includes: a processor 210, a storage medium 220, and a bus 230.
The storage medium 220 stores machine-readable instructions executable by the processor 210. When the computer device is running, the processor 210 communicates with the storage medium 220 via the bus 230, and the processor 210 executes the machine-readable instructions to perform the steps of the method embodiments described above. The specific implementation manner and the technical effect are similar, and are not repeated here.
Optionally, the present application further provides a storage medium 220, where the storage medium 220 stores a computer program, which when executed by a processor performs the steps of the above-mentioned method embodiments. The specific implementation manner and the technical effect are similar, and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods according to the embodiments of the invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto; any variation or alternative that a person skilled in the art can readily conceive falls within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A data lake information processing method, characterized by comprising:
in response to an information deletion operation for a first target partition in an information table of a preset data lake, determining a first manifest file path corresponding to the information of the first target partition from the information table;
according to the first manifest file path, performing manifest file scanning on a first manifest file folder of the first target partition in the preset data lake to obtain a first manifest file in the first manifest file folder;
deleting metadata information of each data file in the first manifest file;
and deleting the paths of the data files in the information table.
2. The method of claim 1, wherein the method further comprises:
if a preset data partition synchronization strategy is triggered, acquiring a data file to be synchronized;
determining information of a second target partition corresponding to the data file to be synchronized by adopting the preset data partition synchronization strategy;
determining a second manifest file path corresponding to the information of the second target partition from the information table;
according to the second manifest file path, performing manifest file scanning on a second manifest file folder of the second target partition in the preset data lake to obtain a second manifest file in the second manifest file folder;
synchronizing the metadata information of the data file to be synchronized into the second manifest file to obtain a path of the data file to be synchronized;
and adding the path of the data file to be synchronized into the information table.
3. The method of claim 2, wherein the obtaining the data file to be synchronized comprises:
searching a preset distributed data file system to obtain a path of the data file to be synchronized;
and acquiring the data file to be synchronized from the preset distributed data file system based on the path of the data file to be synchronized.
4. The method of claim 1, wherein the method further comprises:
and responding to an information adding operation aiming at a third target partition in the information table, and adding the third target partition in the preset data lake.
5. The method of claim 4, wherein the method further comprises:
and adding the information of the third target partition in the information table at regular time by adopting a preset partition information adding command.
6. The method of claim 1, wherein the method further comprises:
in response to a data query operation for a fourth target partition in the information table, determining a fourth manifest file path corresponding to the information of the fourth target partition from the information table;
according to the fourth manifest file path, performing manifest file scanning on a fourth manifest file folder of the fourth target partition in the preset data lake to obtain a fourth manifest file in the fourth manifest file folder;
and scanning the data file of the fourth target partition in the preset data lake according to the path of each data file in the fourth manifest file to obtain the data file to be queried.
7. The method of claim 1, wherein prior to deleting the path of each data file in the information table, further comprising:
and determining the path of each data file according to the metadata information of each data file.
8. A data lake information processing apparatus, comprising:
the determining module is configured to determine, in response to an information deletion operation for a first target partition in an information table of a preset data lake, a first manifest file path corresponding to the information of the first target partition from the information table;
the scanning module is configured to perform manifest file scanning on a first manifest file folder of the first target partition in the preset data lake according to the first manifest file path, so as to obtain a first manifest file in the first manifest file folder;
the deleting module is configured to delete the metadata information of each data file in the first manifest file;
the deleting module is further configured to delete paths of the data files in the information table.
9. A computer device, comprising: a processor, a storage medium and a bus, the storage medium storing program instructions executable by the processor, the processor and the storage medium communicating via the bus when the computer device is running, the processor executing the program instructions to perform the steps of the data lake information processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the data lake information processing method of any one of claims 1 to 7.
CN202311624911.7A 2023-11-30 2023-11-30 Data lake information processing method, device, equipment and storage medium Pending CN117807064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311624911.7A CN117807064A (en) 2023-11-30 2023-11-30 Data lake information processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311624911.7A CN117807064A (en) 2023-11-30 2023-11-30 Data lake information processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117807064A true CN117807064A (en) 2024-04-02

Family

ID=90424328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311624911.7A Pending CN117807064A (en) 2023-11-30 2023-11-30 Data lake information processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117807064A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination