CN116010345A - Method, device and equipment for implementing a table service scheme for a stream-batch integrated data lake - Google Patents

Method, device and equipment for implementing a table service scheme for a stream-batch integrated data lake

Info

Publication number
CN116010345A
Authority
CN
China
Prior art keywords
data
partition
file
files
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211385638.2A
Other languages
Chinese (zh)
Inventor
周朝卫
刘钧
周世军
覃华云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unihub China Information Technology Co Ltd
Original Assignee
Unihub China Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unihub China Information Technology Co Ltd filed Critical Unihub China Information Technology Co Ltd
Priority to CN202211385638.2A
Publication of CN116010345A
Pending legal-status Critical Current

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a method, a device and equipment for implementing a table service scheme for a stream-batch integrated data lake, wherein the method comprises the following steps: creating a data table, determining the primary key field(s) of the table, and specifying a partition-number-threshold attribute; performing Hash calculation on the primary key to generate an integer Hash value, then taking the remainder of the Hash value divided by the partition number threshold, the remainder being the partition number corresponding to the primary key record, thereby completing partitioning; reading data, performing Hash calculation on the primary key field to obtain a partition number field, and, when writing the data, partitioning according to the partition number field to generate partition directories, each partition directory corresponding to a plurality of data files; and operating on the data as required, including data writing, data query and data merging. In this way, a single data processing architecture supports both offline and real-time data processing, data can be updated and deleted rapidly by primary key, and the data can be queried in real time.

Description

Method, device and equipment for implementing a table service scheme for a stream-batch integrated data lake
Technical Field
The embodiments of the invention relate to the technical field of data processing, in particular to a method, a device and equipment for implementing a table service scheme for a stream-batch integrated data lake.
Background
With the development of technology, the data produced in many application scenarios is unbounded and batch processing can no longer meet the requirements of those scenarios; as market demand has grown, stream processing engines have rapidly gained popularity. A stream processing engine computes in real time, processing each piece of data as it arrives instead of waiting for a whole batch as batch processing does; because of this real-time computation advantage, stream processing is widely used in network monitoring, e-commerce and the like.
Hive is a Hadoop-based data warehouse analysis and processing tool that can be queried with SQL statements. Because the bottom layer of Hive is an encapsulation of Hadoop, Hive data is stored in Hadoop files, so Hive has good batch processing capability, but it does not support stream processing. Each partition is a folder, and the files under that folder hold the partition's data. In Hive, newly added data for an old partition is automatically written to a new file in HDFS; data cannot be appended to an existing file. As a result, a stream processing engine cannot read the newly added files in an old partition of a Hive table, i.e. dynamic reading of new files in old Hive partitions is not supported.
When a large number of new partition files are added at once, the stream processing engine cannot read all of them in a single pass; because of the resulting time difference, the partitions seen by the stream processing engine and the partitions in Hive do not match, and some new partitions are missed.
For example, the patent "Method, apparatus and device for supporting dynamic reading of hive table data by stream processing" (application number CN202111194393.0) comprises the following steps: setting a partitionMap and a partitionValueList, and setting the timestamp of the first reading round to a preset initial value; if a partition name to be read in the partitionValueList can be found in the partitionMap, the corresponding partition to be read is judged to be an old partition; if the file modification time of a file to be read in the old partition is greater than both the maximum file modification time already read for the corresponding partition in the partitionMap and the current round's timestamp, the file is judged to be a new file of the old partition and is read according to the corresponding path in the partitionValueList. This method of supporting dynamic reading of Hive table data by stream processing allows a Hive data source to be executed on a streaming mechanism, so that a stream processing engine can dynamically read new files of old partitions in Hive. However, it cannot handle offline data and real-time data in Hive at the same time, the logic for writing processed data into the Hive table is complex, and migrating the data processing to a new storage format that supports updates, such as Iceberg, involves a large transformation workload and high risk.
Disclosure of Invention
To solve the above problems, the invention uses a single data processing architecture that supports both offline and real-time data processing; real-time processing supports updating and deleting data, data can be updated and deleted rapidly by primary key, and the data can be queried in real time; three different types of merging can be performed on the data; and the scheme is compatible with Hive, supporting the upgrade of an offline Hive architecture to a real-time architecture and supporting update and delete operations on top of Hive.
According to embodiments of the invention, a method, a device and equipment for implementing a table service scheme for a stream-batch integrated data lake are provided.
In a first aspect of the invention, a method of implementing a table service scheme for a stream-batch integrated data lake is provided. The method comprises the following steps:
S01: creating a data table, determining the primary key field(s) of the table, and specifying a partition-number-threshold attribute;
S02: performing Hash calculation on the primary key to generate an integer Hash value, then taking the remainder of the Hash value divided by the partition number threshold, the remainder being the partition number corresponding to the primary key record, thereby completing partitioning;
S03: reading data, performing Hash calculation on the primary key field to obtain a partition number field, and, when writing the data, partitioning according to the partition number field to generate partition directories, each partition directory corresponding to a plurality of data files;
S04: operating on the data as required, including: data writing, data query and data merging.
Further, there is at least one primary key field in S01; multiple primary key fields are separated by commas.
Further, the Hash computation in S02 is not limited to a particular algorithm.
Further, in S03 the written data is stored in a real-time data directory; the real-time data is merged periodically, with newly added data files and deleted data files merged together, and the merged files are stored in the Hive data directory; offline data is written directly into Hive; and when data is read, the Hive table data and the real-time data are read at the same time, so that compatibility with Hive tables is achieved.
Further, the data writing operations described in S04 include: data addition, data deletion and data update;
the data addition step is: adding the new data as a new data file and partitioning it;
the data deletion step is: designating a primary key, marking the data of that primary key as deleted, and storing the primary key in a file whose type is the deletion type;
the data update step is: adding a deletion-type file and an addition-type file respectively, storing the primary key of the data to be updated in the deletion-type file, and storing the updated data in the addition-type file.
Further, the data to be updated must contain all fields of the record (a sketch of the two-file update follows).
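For illustration, a minimal sketch in Scala of how an update decomposes into the two file types; the class names (DeleteFile, AddFile) and the in-memory representation are assumptions, not part of the original disclosure:

// Sketch: an update of the record with id = 13 is materialised as one
// deletion-type file holding only the primary key, plus one addition-type
// file holding the full-field replacement row (hence the full-field rule).
sealed trait LakeFile
case class DeleteFile(primaryKeys: Seq[Long]) extends LakeFile
case class AddFile(rows: Seq[Map[String, Any]]) extends LakeFile

val updateOfId13: Seq[LakeFile] = Seq(
  DeleteFile(Seq(13L)),                                // marks the old version of id = 13 as deleted
  AddFile(Seq(Map("id" -> 13L, "value" -> "cc1")))     // complete new version of the row
)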
Further, when data is written, the ordering of the data is guaranteed by a commit sequence number plus a record offset (a minimal sketch follows below);
the commit sequence number is generated for each write, is an increasing integer, and is shared by all files of the current write batch;
the record offset is an increasing sequence number generated for each record; it is recorded in each data file, and each record corresponds to one record offset.
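A minimal Scala sketch of this ordering metadata, assuming each written record is tagged with the pair (commit sequence number, record offset) and that the larger pair always wins; the names are illustrative:

// Sketch: ordering metadata attached to every written record. All files of one
// write batch share the commit sequence number; the record offset increases
// monotonically within that batch.
case class WritePosition(commitSeq: Long, recordOffset: Long)

implicit val writeOrdering: Ordering[WritePosition] =
  Ordering.by(p => (p.commitSeq, p.recordOffset))

// Example: within batch 10, a delete written at offset 7 supersedes an insert
// of the same key written earlier at offset 3.
val inserted = WritePosition(commitSeq = 10, recordOffset = 3)
val deleted  = WritePosition(commitSeq = 10, recordOffset = 7)
assert(writeOrdering.gt(deleted, inserted))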
Further, the data query in S04 proceeds as follows:
CX01: reading the metadata file of each partition;
CX02: reading the data files in each partition in the order defined by the partition's metadata file;
CX03: in each partition, filtering by data file type and then performing the logical merge;
CX04: returning the merged data to the user.
Further, the logical merge described in CX03 proceeds as follows (a minimal sketch follows this list):
CX031: traversing each file in turn;
CX032: if the file is an addition-type file, reading it directly and using the read data as input to the subsequent steps;
CX033: if the file is a deletion-type file, deleting from the previously read data any record whose primary key is contained in the file;
CX034: using the processed data as input for the next file traversed, continuing the logical merge.
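A minimal Scala sketch of the merge in CX031-CX034, assuming each data file is represented in memory as a list of rows tagged with its type; the type and function names are illustrative only:

// Sketch: walk the files in their defined order, appending rows from
// addition-type files and removing rows whose primary key appears in a
// subsequently encountered deletion-type file.
sealed trait DataFile
case class AdditionFile(rows: Seq[Map[String, Any]]) extends DataFile
case class DeletionFile(primaryKeys: Set[Any]) extends DataFile

def logicalMerge(filesInOrder: Seq[DataFile], primaryKeyField: String): Seq[Map[String, Any]] =
  filesInOrder.foldLeft(Seq.empty[Map[String, Any]]) {
    case (acc, AdditionFile(rows)) => acc ++ rows                                              // CX032
    case (acc, DeletionFile(keys)) => acc.filterNot(row => keys.contains(row(primaryKeyField)))  // CX033
  }

// Example mirroring the description: two addition-type files insert ids 11-14,
// a deletion-type file then removes id 12, leaving ids 11, 13 and 14.
val merged = logicalMerge(Seq(
  AdditionFile(Seq(Map("id" -> 11), Map("id" -> 12))),
  AdditionFile(Seq(Map("id" -> 13), Map("id" -> 14))),
  DeletionFile(Set(12))
), "id")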
Further, the data merging described in S04 includes three types: merging deletion-type small files, merging addition-type files, and merging deletion-type and addition-type files at the same time;
merging deletion-type small files: finding the deletion-type files in each partition, removing them, and merging and storing the removed files in the parent node;
merging addition-type files: finding the addition-type small files in each partition and combining the multiple small files within the same partition into one large addition-type file;
merging deletion-type and addition-type files at the same time: finding the deletion-type files and the addition-type small files in each partition, and merging the addition-type small files while removing the deletion-type files.
In a second aspect of the invention, an apparatus for implementing a table service scheme for a stream-batch integrated data lake is provided. The apparatus comprises:
a data table creation module: for creating a data table, determining the primary key field(s) of the table, and specifying a partition-number-threshold attribute;
a partition module: for performing Hash calculation on the primary key to generate an integer Hash value, then taking the remainder of the Hash value divided by the partition number threshold, the remainder being the partition number corresponding to the primary key record, thereby completing partitioning;
a data writing module: for reading data, performing Hash calculation on the primary key field to obtain a partition number field, and, when writing the data, partitioning according to the partition number field to generate partition directories, each partition directory corresponding to a plurality of data files;
an operation module: for operating on the data as required, including: data writing, data query and data merging.
In a third aspect of the invention, an electronic device is provided. The electronic device includes: a memory and a processor, the memory having stored thereon a computer program, the processor implementing a method according to the first aspect of the invention when executing the program.
In a fourth aspect of the invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method according to the first aspect of the invention.
The English abbreviations mentioned above are defined as follows:
Hash: hash function
ID: identifier (identity document number)
Spark: a general-purpose big data analysis engine
Hive: a Hadoop-based data warehouse tool
Hadoop: a distributed system infrastructure developed by the Apache Foundation
SQL: Structured Query Language
HDFS: Hadoop Distributed File System
The invention uses a single data processing architecture that supports both offline and real-time data processing; real-time processing supports updating and deleting data, data can be updated and deleted rapidly by primary key, and the data can be queried in real time; three different types of merging can be performed on the data; and the scheme is compatible with Hive, supporting the upgrade of an offline Hive architecture to a real-time architecture and supporting update and delete operations on top of Hive.
It should be understood that this summary is not intended to identify critical or essential features of the embodiments of the invention, nor to limit the scope of the invention. Other features of the invention will become apparent from the description that follows.
Drawings
The above and other features, advantages and aspects of embodiments of the present invention will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. Wherein:
FIG. 1 shows a flow chart of a method for implementing a table service scheme for a stream-batch integrated data lake according to an embodiment of the invention;
FIG. 2 shows a block diagram of an apparatus for implementing a table service scheme for a stream-batch integrated data lake according to an embodiment of the invention;
FIG. 3 shows a partition schematic diagram according to an embodiment of the invention;
FIG. 4 shows a schematic diagram of partitioning by Hash calculation on primary keys according to an embodiment of the invention;
FIG. 5 shows a schematic diagram of merging deletion-type small files according to an embodiment of the invention;
FIG. 6 shows a schematic diagram of merging addition-type files according to an embodiment of the invention;
FIG. 7 shows a schematic diagram of simultaneously merging deletion-type and addition-type files according to an embodiment of the invention;
FIG. 8 shows a further schematic diagram of simultaneously merging deletion-type and addition-type files according to an embodiment of the invention;
FIG. 9 shows a schematic diagram of a device for implementing a table service scheme for a stream-batch integrated data lake according to an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
According to embodiments of the invention, a method, a device and equipment for implementing a table service scheme for a stream-batch integrated data lake are provided: a single data processing architecture supports both offline and real-time data processing; real-time processing supports updating and deleting data; data can be updated and deleted rapidly by primary key and queried in real time; three different types of merging can be performed on the data; and the scheme is compatible with Hive, supporting the upgrade of an offline Hive architecture to a real-time architecture and supporting update and delete operations on top of Hive.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments thereof.
FIG. 1 is a flow chart of a method for implementing a table service scheme for a stream-batch integrated data lake according to one embodiment of the invention. The method comprises the following steps:
S01: creating a data table, determining the primary key field(s) of the table, and specifying a partition-number-threshold attribute;
S02: performing Hash calculation on the primary key to generate an integer Hash value, then taking the remainder of the Hash value divided by the partition number threshold, the remainder being the partition number corresponding to the primary key record, thereby completing partitioning;
S03: reading data, performing Hash calculation on the primary key field to obtain a partition number field, and, when writing the data, partitioning according to the partition number field to generate partition directories, each partition directory corresponding to a plurality of data files;
S04: operating on the data as required, including: data writing, data query and data merging.
It should be noted that although the operations of the method of the present invention are described in a particular order in the above embodiments and the accompanying drawings, this does not require or imply that the operations must be performed in the particular order or that all of the illustrated operations be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
In order to explain the above method of implementing the table service scheme for a stream-batch integrated data lake more clearly, a specific embodiment is described below; it should be noted, however, that this embodiment only serves to better illustrate the invention and is not meant to limit it unduly.
The method for implementing a table service scheme for a stream-batch integrated data lake is described in more detail below with a specific example.
A data table is created and the primary key field of the table is determined; an example is as follows:
Figure SMS_1
The primary key option designates the primary key field of the table as id; when the table is created, multiple fields can be designated as primary keys, separated by commas. The table's partition key field can also be specified with a partition-by clause.
The USING clause specifies the format of the data table, i.e. the format of our custom data source, here designated by the name lake.
When creating the table, the relevant table attributes can also be specified, including the partition number threshold, for example as follows:
Figure SMS_2
Figure SMS_3
The table's partition number threshold (max_partition_threshold) is set to 4 via the table properties.
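As an illustrative sketch only (the actual statements are shown in Figure SMS_1 to Figure SMS_3 above and are not reproduced here), a table-creation statement of the kind described might look as follows when issued through Spark SQL from Scala; the table name t_user, the column names and the exact option keys are assumptions, and whether primaryKey and max_partition_threshold are passed as OPTIONS or as table properties is likewise assumed:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lake-ddl-sketch").getOrCreate()

// Hypothetical DDL: "lake" is the custom data source name from the text.
// Several primary keys would be written comma-separated, e.g. 'id,create_time'.
spark.sql(
  """CREATE TABLE t_user (
    |  id          BIGINT,
    |  value       STRING,
    |  create_time TIMESTAMP
    |)
    |USING lake
    |OPTIONS (
    |  primaryKey 'id',
    |  max_partition_threshold '4'
    |)""".stripMargin)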
A unique integer value is generated from the primary key: Hash calculation is performed on the single primary key, or on the combination of multiple primary keys, to produce an integer Hash value, and the remainder of that Hash value divided by the partition number threshold is the partition number corresponding to the primary key record.
The partition number threshold is a fixed value and is normally not changed once determined. Partitions store the data files: one partition can hold multiple data files, and all records in a partition's files share the same remainder of their primary key Hash value divided by the partition number threshold.
The remainder is used as the partition number and is stored on a leaf node of a binary tree; the partition number corresponds to the number of that leaf node.
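A minimal Scala sketch of the partition-number computation just described; the concrete hash algorithm is deliberately left open by the text, so String.hashCode is used here purely for illustration:

// Sketch: derive the partition number for a record from its primary key value(s).
def partitionNumber(primaryKeyValues: Seq[String], partitionThreshold: Int): Int = {
  val hash = primaryKeyValues.mkString(",").hashCode   // integer hash of the (possibly composite) key
  Math.floorMod(hash, partitionThreshold)              // remainder in [0, threshold), also for negative hashes
}

// With a threshold of 4, every record whose key hashes to the same remainder
// lands in the same partition directory, e.g.:
val partitionForId1 = partitionNumber(Seq("1"), 4)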
By way of specific example, assume that there is data as shown in table 1:
TABLE 1
id  value  create_time
1   aaa    2022-09-07 12:11:21
2   aab    2022-09-07 13:12:25
3   aac    2022-09-07 11:15:30
4   bbb    2022-09-07 14:22:45
5   bbc    2022-09-07 15:35:52
6   fef    2022-09-07 15:35:55
7   qee    2022-09-07 15:37:37
8   eee    2022-09-07 16:35:56
A data source write/query interface is defined using the Spark data source API: a new data source class is created that inherits the abstract classes CreatableRelationProvider and DataSourceRegister and implements the relevant methods, as follows:
Figure SMS_4
Figure SMS_5
In the code of the above class, two methods are overridden: createRelation, which implements the data-writing logic, and shortName, which returns the name of the data source.
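Since the class body is only shown as a figure (Figure SMS_4 and Figure SMS_5), the following is a hedged sketch of what such a Spark DataSource V1 write entry point could look like; the class name and the body of createRelation are assumptions, with the write logic reduced to comments:

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, DataSourceRegister}
import org.apache.spark.sql.types.StructType

// Sketch of a custom "lake" data source write entry point (Spark DataSource V1).
class LakeDataSource extends CreatableRelationProvider with DataSourceRegister {

  // Name under which the source is addressed, e.g. CREATE TABLE ... USING lake
  override def shortName(): String = "lake"

  // Called for each written batch: here the source would hash the primary key,
  // take the remainder against the partition-number threshold and append the
  // rows as addition-type files under the matching partition directories.
  override def createRelation(
      ctx: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // ... write the partitioned data files here (omitted in this sketch) ...
    new BaseRelation {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType = data.schema
    }
  }
}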
As specified in the previous steps, the threshold value of 4 means that at most 4 partitions are generated to store data, but each partition can contain multiple data files.
There are many Hash algorithms that can compute a Hash value from the primary key; here it is assumed that some Hash algorithm is chosen, a value is generated, and the remainder is then taken against the partition number threshold, as shown in Table 2:
TABLE 2
id  Hash value (from the chosen Hash algorithm)  Remainder against the partition number threshold
1   100                                          100 % 4 = 0
2   101                                          101 % 4 = 1
3   102                                          102 % 4 = 2
4   103                                          103 % 4 = 3
5   104                                          104 % 4 = 0
6   105                                          105 % 4 = 1
7   106                                          106 % 4 = 2
8   107                                          107 % 4 = 3
In each batch, one or more files may be generated under each partition. As multiple batches of data are written, each partition accumulates the data files generated by those batches.
How many files a batch generates depends on the parallelism of the data writing; if the parallelism is 2, at most 2 files are generated per partition. The partition layout is shown in FIG. 3.
Data addition is performed as follows:
First, two records are inserted, generating a new file, File 1. The contents of File 1 are shown in Table 3:
TABLE 3
id value
11 aa
12 bb
Then another 2 records are inserted, generating a new file, File 2. The contents of File 2 are shown in Table 4:
TABLE 4
id value
13 cc
14 dd
Both of the above files are of the addition type.
Data deletion is performed as follows:
To delete the record with id = 12, a new file is created: File 3. Its type is the deletion type and it records the primary key to be deleted. The contents of File 3 are shown in Table 5:
TABLE 5
id
12
When a query is performed, File 1, File 2 and File 3 are read in order. File 3 records the primary key of the deleted data. The primary keys read from Files 1 and 2 are 11, 12, 13, 14; the primary key read from File 3 is 12. Because File 3 is of the deletion type, the record with primary key 12 is removed from the list [11, 12, 13, 14] produced by Files 1 and 2. Eventually only the data with id = 11, 13, 14 is returned to the user, as shown in Table 6.
TABLE 6
id value
11 aa
13 cc
14 dd
The data is updated as follows:
Assume the record with id = 13 has its value updated to cc1.
First, a deletion-type file is added: File 4, which records the primary key of the data to be updated. Its contents are shown in Table 7:
TABLE 7
id
13
Then an addition-type file is generated: File 5, which records the updated data. Its contents are shown in Table 8:
TABLE 8
id value
13 cc1
When a query is performed, the results are merged according to the order in which the files were generated and their types: the data with id = 13 in the original files is removed, the data with id = 13 from the last addition-type file is added, and the result is returned to the user. The data ultimately returned to the user is shown in Table 9:
TABLE 9
id value
11 aa
13 cc1
14 dd
Because an update is split into a delete step and an add step, a series of delete and update operations produces many addition-type and deletion-type files. To guarantee ordering, a commit sequence number plus a record offset is maintained at write time. The commit sequence number is generated for each write and is an increasing integer; all files of the batch currently being written share this commit sequence number. The record offset is an increasing sequence number generated for each record, and only needs to be increasing and unique within the current batch.
By way of example, the following five consecutive operations are performed in one batch:
INSERT(100,101,104,105,108)
DELETE 101,104
INSERT(112,113,116,120,124)
DELETE 96
INSERT(128,129)
the number is the primary key inserted into the record, the current batch, a 5-step operation is performed, assuming a partition number threshold of 4, and the commit sequence number of the batch is 10, hash calculation is performed using the primary key according to the previous rules, and the above data will generate data for two partitions, partition 0 and partition 1, as shown in fig. 4.
The data is subjected to the new searching operation, and the steps are as follows:
Taking partition 0 above as an example: when the record with primary key 104 is queried, the key appears in both addition-type file 1 and deletion-type file 1, but the offset of the deletion-type file (7) is greater than the offset of the addition-type file (3), meaning the addition happened before the deletion; the record therefore ends in a deleted state and the query result is empty.
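A minimal Scala sketch of how this resolution can be expressed, using the offsets from the example above (insert at offset 3, delete at offset 7); the Entry representation is an assumption:

// Sketch: for a primary key, the entry with the largest record offset decides
// whether the key is still visible to the query.
case class Entry(key: Long, offset: Long, isDelete: Boolean)

val entriesForKey104 = Seq(
  Entry(104L, offset = 3, isDelete = false),  // from addition-type file 1
  Entry(104L, offset = 7, isDelete = true)    // from deletion-type file 1
)

val latest  = entriesForKey104.maxBy(_.offset)
val visible = !latest.isDelete                // false: key 104 is absent from the query result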
For reading, a new data source class is created that inherits the abstract classes RelationProvider and DataSourceRegister and implements the relevant methods, as follows:
Figure SMS_6
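The read-side class is likewise shown only as a figure (Figure SMS_6); a hedged DataSource V1 sketch might look like the following, with the scan logic reduced to comments and the schema hard-coded for illustration. In practice the read and write interfaces would typically live in one class; they are split here only to mirror the order of the description:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Sketch of the read entry point: the relation's scan would read each partition's
// metadata file, read the data files in the order it defines, and apply the
// addition/deletion merge before handing rows back to Spark.
class LakeReadSource extends RelationProvider with DataSourceRegister {

  override def shortName(): String = "lake"

  override def createRelation(ctx: SQLContext, parameters: Map[String, String]): BaseRelation =
    new BaseRelation with TableScan {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType =
        StructType(Seq(StructField("id", LongType), StructField("value", StringType)))
      override def buildScan(): RDD[Row] = {
        // ... list partition directories, read files in metadata order and merge ...
        ctx.sparkContext.emptyRDD[Row]  // placeholder so the sketch stays self-contained
      }
    }
}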
the data are combined as follows:
1. merging small files of a deletion type: as shown in fig. 5, when deleting the deletion type file 1 in the partition 0 and the partition 1, the two partitions remove the deletion type file 1 respectively, and merge the deletion type files 1 and store them in the parent node.
2. Merging newly added types of files: as shown in fig. 6, there are three files in partition 0, respectively: newly added type file 1, newly added type file 2, deleted type file 1; there are three files in partition 1, respectively: newly added type file 1, newly added type file 2, deleted type file 1; the newly added type file 1 and the newly added type file 2 in the partition 0 are combined to form a larger newly added type file 1, and the newly added type file 1 and the newly added type file 2 in the partition 1 are combined to form a larger newly added type file 1.
3. Simultaneously merging deleted and newly added types of files: as shown in fig. 7 and 8, three files exist in partition 0, respectively: newly added type file 1, newly added type file 2, deleted type file 1; there are three files in partition 1, respectively: newly added type file 1, newly added type file 2, deleted type file 1; and merging the newly added type file 1 and the newly added type file 2 in the respective partitions to form a larger newly added type file 1 while removing the deleted type file 1 from the respective two partitions.
Deletion-type small files store only the primary key, offset and similar information, so they are very small, and merging them has the lowest cost of the three types; it can therefore be done frequently. For addition-type files, a threshold on the number of small files can be set, and they are merged once the count reaches the threshold. Merging deletion-type and addition-type files at the same time is relatively costly but yields the highest query efficiency on the merged data, so it can be done periodically, for example once per day or per hour.
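A minimal Scala sketch of this scheduling policy; the threshold parameters and the enumeration names are assumptions, since the text only states the relative costs:

// Sketch: choosing which of the three merge types to run for a partition.
sealed trait MergeAction
case object MergeDeletionFiles        extends MergeAction  // cheapest, can run frequently
case object MergeAdditionFiles        extends MergeAction  // run once small files pile up
case object MergeDeletionAndAddition  extends MergeAction  // most expensive, run periodically
case object NoMerge                   extends MergeAction

def chooseMerge(smallAdditionFiles: Int, deletionFiles: Int,
                smallFileThreshold: Int, fullMergeDue: Boolean): MergeAction =
  if (fullMergeDue) MergeDeletionAndAddition                          // e.g. once per day or hour
  else if (smallAdditionFiles >= smallFileThreshold) MergeAdditionFiles
  else if (deletionFiles > 0) MergeDeletionFiles
  else NoMerge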
In practice, however, enterprises have a large number of existing data processing flows that store and read data with Hive, and migrating them directly to the architecture above in order to support real-time updates and deletions would mean regenerating Hive's underlying files according to the new architecture; the rework is large, the risk is high, and it is almost infeasible, especially when the Hive tables are very large. Therefore, an optimization is made on top of the above architecture so that compatibility with Hive is achieved.
Data is stored in two locations: real-time data and Hive's offline data. Real-time data is written through the above architecture and stored in a real-time data directory; it is merged periodically, with addition-type and deletion-type files merged together, and the merged files are stored into the Hive data directory. Offline data can still be written directly into Hive. When data is read, the Hive table data and the real-time data are read at the same time, so that compatibility with the Hive table is achieved.
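A hedged Scala sketch of the combined read just described; the Hive table name, the directory path and the "lake" format string are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Offline (already merged) data lives in the Hive table; real-time (not yet
// merged) data lives in the real-time directory of the custom source.
val offline  = spark.table("dw.t_user")
val realtime = spark.read.format("lake").load("/warehouse/t_user/realtime")

// A query sees both at once, which is how Hive-table compatibility is achieved.
val combined = offline.unionByName(realtime)
combined.createOrReplaceTempView("t_user_all")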
Based on the same inventive concept, the invention also provides an apparatus for implementing the table service scheme for a stream-batch integrated data lake. The implementation of the apparatus parallels the implementation of the above method, and repeated details are omitted. As shown in FIG. 2, the apparatus 100 includes:
a data table creation module 101: for creating a data table, determining the primary key field(s) of the table, and specifying a partition-number-threshold attribute;
a partition module 102: for performing Hash calculation on the primary key to generate an integer Hash value, then taking the remainder of the Hash value divided by the partition number threshold, the remainder being the partition number corresponding to the primary key record, thereby completing partitioning;
a data writing module 103: for reading data, performing Hash calculation on the primary key field to obtain a partition number field, and, when writing the data, partitioning according to the partition number field to generate partition directories, each partition directory corresponding to a plurality of data files;
an operation module 104: for operating on the data as required, including: data writing, data query and data merging.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the described modules may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
As shown in fig. 9, the apparatus includes a Central Processing Unit (CPU) that can perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) or computer program instructions loaded from a storage unit into a Random Access Memory (RAM). In the RAM, various programs and data required for the operation of the device can also be stored. The CPU, ROM and RAM are connected to each other by a bus. An input/output (I/O) interface is also connected to the bus.
A plurality of components in a device are connected to an I/O interface, comprising: an input unit such as a keyboard, a mouse, etc.; an output unit such as various types of displays, speakers, and the like; a storage unit such as a magnetic disk, an optical disk, or the like; and communication units such as network cards, modems, wireless communication transceivers, and the like. The communication unit allows the device to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processing unit performs the respective methods and processes described above, for example, the methods S01 to S04. For example, in some embodiments, methods S01-S04 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device via the ROM and/or the communication unit. When the computer program is loaded into RAM and executed by the CPU, one or more steps of the methods S01 to S04 described above may be performed. Alternatively, in other embodiments, the CPU may be configured to perform methods S01-S04 by any other suitable means (e.g., by means of firmware).
The functions described above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a complex programmable logic device (CPLD), etc.
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although operations are depicted in a particular order, this should be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the invention. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (13)

1. A method of implementing a table service scheme for a stream-batch integrated data lake, the method comprising:
S01: creating a data table, determining the primary key field(s) of the table, and specifying a partition-number-threshold attribute;
S02: performing Hash calculation on the primary key to generate an integer Hash value, then taking the remainder of the Hash value divided by the partition number threshold, the remainder being the partition number corresponding to the primary key record, thereby completing partitioning;
S03: reading data, performing Hash calculation on the primary key field to obtain a partition number field, and, when writing the data, partitioning according to the partition number field to generate partition directories, each partition directory corresponding to a plurality of data files;
S04: operating on the data as required, including: data writing, data query and data merging.
2. The method for implementing a table service scheme for a stream-batch integrated data lake of claim 1, wherein there is at least one primary key field in S01, and multiple primary key fields are separated by commas.
3. The method for implementing a table service scheme for a stream-batch integrated data lake of claim 1, wherein the Hash computation in S02 is not limited to a particular algorithm.
4. The method for implementing a table service scheme for a stream-batch integrated data lake of claim 1, wherein in S03 the written data is stored in a real-time data directory; the real-time data is merged periodically, with newly added data files and deleted data files merged together, and the merged files are stored in the Hive data directory; offline data is written directly into Hive; and when data is read, the Hive table data and the real-time data are read at the same time, so that compatibility with Hive tables is achieved.
5. The method of claim 1, wherein the data writing operations in S04 include: data addition, data deletion and data update;
the data addition step is: adding the new data as a new data file and partitioning it;
the data deletion step is: designating a primary key, marking the data of that primary key as deleted, and storing the primary key in a file whose type is the deletion type;
the data update step is: adding a deletion-type file and an addition-type file respectively, storing the primary key of the data to be updated in the deletion-type file, and storing the updated data in the addition-type file.
6. The method of claim 5, wherein the data to be updated must contain all fields of the record.
7. The method for implementing a table service scheme for a stream-batch integrated data lake of claim 4, wherein when data is written, the ordering of the data is guaranteed by a commit sequence number plus a record offset;
the commit sequence number is generated for each write, is an increasing integer, and is shared by all files of the current write batch;
the record offset is an increasing sequence number generated for each record; it is recorded in each data file, and each record corresponds to one record offset.
8. The method for implementing a table service scheme for a stream-batch integrated data lake of claim 1, wherein the data query in S04 comprises:
CX01: reading the metadata file of each partition;
CX02: reading the data files in each partition in the order defined by the partition's metadata file;
CX03: in each partition, filtering by data file type and then performing the logical merge;
CX04: returning the merged data to the user.
9. The method of claim 8, wherein the logical merge in CX03 comprises:
CX031: traversing each file in turn;
CX032: if the file is an addition-type file, reading it directly and using the read data as input to the subsequent steps;
CX033: if the file is a deletion-type file, deleting from the previously read data any record whose primary key is contained in the file;
CX034: using the processed data as input for the next file traversed, continuing the logical merge.
10. The method of claim 1, wherein the data merging in S04 includes three types: merging deletion-type small files, merging addition-type files, and merging deletion-type and addition-type files at the same time;
merging deletion-type small files: finding the deletion-type files in each partition, removing them, and merging and storing the removed files in the parent node;
merging addition-type files: finding the addition-type small files in each partition and combining the multiple small files within the same partition into one large addition-type file;
merging deletion-type and addition-type files at the same time: finding the deletion-type files and the addition-type small files in each partition, and merging the addition-type small files while removing the deletion-type files.
11. An apparatus for implementing a table service scheme for a stream-batch integrated data lake, the apparatus comprising:
a data table creation module: for creating a data table, determining the primary key field(s) of the table, and specifying a partition-number-threshold attribute;
a partition module: for performing Hash calculation on the primary key to generate an integer Hash value, then taking the remainder of the Hash value divided by the partition number threshold, the remainder being the partition number corresponding to the primary key record, thereby completing partitioning;
a data writing module: for reading data, performing Hash calculation on the primary key field to obtain a partition number field, and, when writing the data, partitioning according to the partition number field to generate partition directories, each partition directory corresponding to a plurality of data files;
an operation module: for operating on the data as required, including: data writing, data query and data merging.
12. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, characterized in that the processor, when executing the program, implements the method of any of claims 1-10.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1-10.
CN202211385638.2A 2022-11-07 2022-11-07 Method, device and equipment for realizing table service scheme of flow batch integrated data lake Pending CN116010345A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211385638.2A CN116010345A (en) 2022-11-07 2022-11-07 Method, device and equipment for realizing table service scheme of flow batch integrated data lake

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211385638.2A CN116010345A (en) 2022-11-07 2022-11-07 Method, device and equipment for realizing table service scheme of flow batch integrated data lake

Publications (1)

Publication Number Publication Date
CN116010345A true CN116010345A (en) 2023-04-25

Family

ID=86026205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211385638.2A Pending CN116010345A (en) 2022-11-07 2022-11-07 Method, device and equipment for realizing table service scheme of flow batch integrated data lake

Country Status (1)

Country Link
CN (1) CN116010345A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116880993A (en) * 2023-09-04 2023-10-13 北京滴普科技有限公司 Method and device for processing large number of small files in Iceberg



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination