CN117251214A - Execution method of data operation instruction based on Apache Hudi table format of distributed database - Google Patents


Info

Publication number
CN117251214A
Authority
CN
China
Prior art keywords
instruction
data file
data
read
reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311532626.2A
Other languages
Chinese (zh)
Inventor
邝金清
陶征霖
常雷
姚佳丽
霍瑞龙
刘大伟
宋宜旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Even Number Technology Co ltd
Original Assignee
Beijing Even Number Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Even Number Technology Co ltd
Priority to CN202311532626.2A
Publication of CN117251214A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the present disclosure disclose a method, an apparatus, an electronic device, and a computer-readable medium for executing data operation instructions based on the Apache Hudi table format of a distributed database. One embodiment of the method comprises: acquiring an operation instruction for target data in the Apache Hudi table format; in response to determining that the operation instruction is a read instruction, executing a read execution plan according to a read step; in response to determining that the operation instruction is a change instruction, executing a change execution plan according to a change step; and returning an execution receipt of the read execution plan and/or an execution receipt of the change execution plan. In this embodiment, reading and writing of the Hudi table format are implemented in C/C++, which improves the read, write, and change performance of the table format. User tables without a primary key column are supported, as are long transactions. Memory consumption is controllable in all scenarios: the resources occupied by reads and writes do not exceed the limit specified by the user, which raises the achievable upper limit on concurrent reads and writes.

Description

Execution method of data operation instruction based on Apache Hudi table format of distributed database
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a method, an apparatus, an electronic device, and a computer-readable medium for executing data operation instructions based on the Apache Hudi table format of a distributed database.
Background
Information push, also known as "web broadcast", is a technology that reduces information overload by pushing the information users need over the Internet according to a certain technical standard or protocol. Information push technology can reduce the time users spend searching the network by actively pushing to the user the execution of data operation instructions based on the distributed database Apache Hudi table format.
The related information push approach usually loads the various data operation instructions based on the distributed database Apache Hudi table format directly on a web page, and the execution of those instructions differs markedly from the content of the web page on which they are located.
Disclosure of Invention
The disclosure is in part intended to introduce concepts in a simplified form that are further described below in the detailed description. The disclosure is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a method, an apparatus, an electronic device, and a computer-readable medium for executing data operation instructions based on the Apache Hudi table format of a distributed database, to solve the technical problems mentioned in the Background section above.
In a first aspect, some embodiments of the present disclosure provide a method for executing data operation instructions based on the Apache Hudi table format of a distributed database, the method including: acquiring an operation instruction for target data in the Apache Hudi table format, wherein the operation instruction includes a change instruction and/or a read instruction; in response to determining that the operation instruction is a read instruction, executing a read execution plan according to the following read step: determining the target data according to the read instruction; determining visible data files from the distributed database according to the read instruction, and determining, from the visible data files, a query range corresponding to the read instruction; pruning the visible data files within the query range to obtain data files to be read; reading the target data from the data files to be read according to the read instruction; and taking the target data as an execution receipt of the read execution plan; in response to determining that the operation instruction is a change instruction, executing a change execution plan according to the following change step: determining change data according to the change instruction; changing a target data file in the distributed database according to the change instruction and the change data; adding statistical information of the target data file and file information of the target data file to a commit record file of the target data file according to the change data and the change instruction; and taking the commit record file as an execution receipt of the change execution plan; and returning the execution receipt of the read execution plan and/or the execution receipt of the change execution plan.
In a second aspect, some embodiments of the present disclosure provide an apparatus for executing data operation instructions based on the Apache Hudi table format of a distributed database, the apparatus including: an acquisition unit configured to acquire an operation instruction for target data in the Apache Hudi table format, wherein the operation instruction includes a change instruction and/or a read instruction; a reading unit configured to, in response to determining that the operation instruction is a read instruction, execute a read execution plan according to the following read step: determining the target data according to the read instruction; determining visible data files from the distributed database according to the read instruction, and determining, from the visible data files, a query range corresponding to the read instruction; pruning the visible data files within the query range to obtain data files to be read; reading the target data from the data files to be read according to the read instruction; and taking the target data as an execution receipt of the read execution plan; a change unit configured to, in response to determining that the operation instruction is a change instruction, execute a change execution plan according to the following change step: determining change data according to the change instruction; changing a target data file in the distributed database according to the change instruction and the change data; adding statistical information of the target data file and file information of the target data file to a commit record file of the target data file according to the change data and the change instruction; and taking the commit record file as an execution receipt of the change execution plan; and a return unit configured to return the execution receipt of the read execution plan and/or the execution receipt of the change execution plan.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
One of the above embodiments of the present disclosure has the following advantageous effects: reading and writing of the Hudi table format are implemented in C/C++, which improves the read, write, and change performance of the table format; user tables without a primary key column are supported; long transactions are supported; and memory consumption is controllable in all scenarios, so the resources occupied by reads and writes do not exceed the limit specified by the user, which raises the achievable upper limit on concurrent reads and writes.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of an application scenario of a method for executing data operation instructions based on the Apache Hudi table format of a distributed database according to some embodiments of the present disclosure;
FIG. 2 is a flowchart of some embodiments of a method for executing data operation instructions based on the Apache Hudi table format of a distributed database according to the present disclosure;
FIG. 3 is a schematic structural diagram of some embodiments of an apparatus for executing data operation instructions based on the Apache Hudi table format of a distributed database according to the present disclosure;
FIG. 4 is a schematic structural diagram of an electronic device suitable for implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that the modifiers "a" and "a plurality of" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 is a schematic diagram of an application scenario of a method for executing data operation instructions based on the Apache Hudi table format of a distributed database according to some embodiments of the present disclosure.
As shown in FIG. 1, the server 100 may acquire an operation instruction 101 for target data in the Apache Hudi table format, where the operation instruction 101 includes a change instruction 102 and/or a read instruction 103; in response to determining that the operation instruction 101 is the read instruction 103, execute a read execution plan according to a read step 104; in response to determining that the operation instruction 101 is the change instruction 102, execute a change execution plan according to a change step 105; and return the execution receipt 106 of the read execution plan and/or the execution receipt 107 of the change execution plan.
It will be appreciated that the method for executing data operation instructions based on the Apache Hudi table format of a distributed database may be executed by a terminal device or by the server 100; the execution body of the method may also be a device formed by integrating the terminal device and the server 100 through a network, or the method may be executed by various software programs. The terminal device may be any of various electronic devices with information processing capabilities, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, and desktop computers. The execution body may also be embodied as the server 100, software, or the like. When the execution body is software, it can be installed in the electronic devices enumerated above. It may be implemented as multiple pieces of software or software modules, for example to provide distributed services, or as a single piece of software or software module. No specific limitation is imposed here.
It should be understood that the number of servers in fig. 1 is merely illustrative. There may be any number of servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of some embodiments of a method for executing data operation instructions based on the Apache Hudi table format of a distributed database according to the present disclosure is shown. The method comprises the following steps:
Step 201, acquiring an operation instruction for target data in the Apache Hudi table format, wherein the operation instruction includes a change instruction and/or a read instruction.
In some embodiments, the execution body of the method (such as the server shown in FIG. 1) may receive the operation instruction for the target data in the Apache Hudi table format through a wired or wireless connection, where the operation instruction includes a change instruction and/or a read instruction. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, WiFi, Bluetooth, WiMAX, Zigbee, and UWB (ultra-wideband) connections, as well as other wireless connection means now known or developed in the future.
Specifically, a change instruction generally refers to an instruction issued by a user to change data; for example, it may delete, update, or insert data.
A read instruction generally refers to an instruction issued by a user to read data; for example, it may search for or present data.
Step 202, in response to determining that the operation instruction is a read instruction, executing a read execution plan according to the following read step: determining the target data according to the read instruction; determining visible data files from the distributed database according to the read instruction, and determining, from the visible data files, a query range corresponding to the read instruction; pruning the visible data files within the query range to obtain data files to be read; reading the target data from the data files to be read according to the read instruction; and taking the target data as an execution receipt of the read execution plan.
In some embodiments, in response to determining that the operation instruction is a read instruction, the execution body may execute the read execution plan according to the read step set out in step 202.
Specifically, the target data generally refers to the data at which the read instruction is directed. A visible data file generally refers to a data file that the user can access and read. As an example, when the user has access rights to data file b but not to data file a, data file b is a visible data file. As another example, when data file a is being modified by another user and that modification has not yet been committed, the unmodified version of data file a is a visible data file, while the content under modification is not. Here, the query range generally refers to the specific range of data that the read instruction needs to read.
The execution body may prune the visible data files based on the WHERE condition of the query statement (the read instruction). Pruning methods include partition path pruning, pruning based on statistical information, bucket index pruning, and the like. These methods do not conflict with one another and can be combined. Pruning the data files avoids full-table scans and greatly improves read performance.
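The three pruning (clipping) methods named above can be illustrated with a minimal Python sketch. It is not the patented C/C++ implementation: the `DataFile` fields, the `NUM_BUCKETS` value, the `bucket_of` function, and the single `id` column are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

NUM_BUCKETS = 4  # hypothetical bucket count for the table


@dataclass
class DataFile:
    path: str        # hypothetical file path
    partition: str   # partition path the file belongs to
    bucket: int      # bucket number the file belongs to
    min_id: int      # minimum of the "id" column (file statistics)
    max_id: int      # maximum of the "id" column (file statistics)


def bucket_of(key: int) -> int:
    """Hypothetical bucket function: key modulo the bucket count."""
    return key % NUM_BUCKETS


def prune(files, partition=None, id_range=None, id_equals=None):
    """Keep only the files that can contain rows matching the predicates."""
    kept = []
    for f in files:
        # Partition path pruning: skip files in other partitions.
        if partition is not None and f.partition != partition:
            continue
        # Statistics pruning: skip files whose [min, max] misses the range.
        if id_range is not None:
            lo, hi = id_range
            if f.max_id < lo or f.min_id > hi:
                continue
        # Bucket index pruning: an equality predicate maps to one bucket.
        if id_equals is not None and f.bucket != bucket_of(id_equals):
            continue
        kept.append(f)
    return kept
```

Because each check only discards files, the three checks compose freely, which matches the statement that the methods can be used in combination.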
In some optional implementations of some embodiments, in response to determining that the incremental modification proportion of the visible data files reaches a threshold, the execution body may merge the base data file and the incremental data files among the visible data files to obtain the data files to be read; and in response to determining that the degree of overlap of the visible data files reaches a threshold, prune the visible data files according to the read instruction to obtain the data files to be read.
In some optional implementations of some embodiments, the execution body may determine the target data file at which the read instruction is directed, determine the visible range of the target data file according to the commit record file of the target data file, and take the visible range as the query range corresponding to the read instruction.
In some optional implementations of some embodiments, the execution body may merge the commit record files in response to determining that the number of commit record files corresponding to the target data file exceeds a threshold.
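As a rough illustration of how a visible range might be derived from commit records, and of merging commit records once there are too many, consider the following Python sketch. The `CommitRecord` layout (a commit timestamp plus sets of added and removed file names) is an assumption made for this example and is not the actual Hudi timeline format or the format used by the patent.

```python
from dataclasses import dataclass, field


@dataclass
class CommitRecord:
    commit_ts: int                             # hypothetical commit timestamp
    added: set = field(default_factory=set)    # files added by this commit
    removed: set = field(default_factory=set)  # files replaced or removed


def visible_files(records, snapshot_ts):
    """Replay commits up to snapshot_ts to get the files visible at that time."""
    visible = set()
    for rec in sorted(records, key=lambda r: r.commit_ts):
        if rec.commit_ts > snapshot_ts:
            break  # later commits are not visible to this snapshot
        visible -= rec.removed
        visible |= rec.added
    return visible


def compact(records, max_records):
    """Merge all commit records into one when their number exceeds max_records."""
    if len(records) <= max_records:
        return list(records)
    latest = max(r.commit_ts for r in records)
    return [CommitRecord(latest, added=visible_files(records, latest))]
```

In this simplified form, compaction keeps only the latest state, so snapshots older than the merged record can no longer be reconstructed; a real system would have to account for transactions that still need them.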
As an example, in a data reading scenario, the read execution plan may be executed through the following read steps:
1. Parse all data files visible to the current transaction from the commit record file on HDFS: the commit record file of the table is parsed to obtain all data files visible to the current transaction; if the current transaction has uncommitted modifications, they are merged with the visible data files; the parsing result is cached locally to avoid repeated parsing.
2. Prune the data files according to the WHERE condition of the current query statement, leaving only the data files that need to be scanned. When reading data, if the incremental modification proportion of the current user table reaches a threshold, the base data file and the incremental data files are merged to generate a new version of the base data file, to preserve read performance. If the value ranges of multiple data files are found to overlap heavily, read performance degrades, so pruning is performed once the degree of overlap reaches a threshold. If the table is found to have too many commit record files, the existing files are merged into one file to preserve read performance.
3. Have multiple executors read the table data in parallel.
4. If incremental modification files exist (generated by update/delete), merge these data modifications.
5. Return the data read; if hidden-column data needs to be transmitted, convert the hidden columns from character strings to integers before transmission.
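The threshold checks described in step 2 can be sketched as follows. The text does not define how the incremental modification proportion or the degree of overlap is measured, so the two measures below (incremental rows divided by base rows, and the fraction of file pairs whose value intervals intersect), the default thresholds, and the action names are all assumptions made for illustration.

```python
def incremental_ratio(base_rows: int, delta_rows: int) -> float:
    """Assumed measure: incremental rows relative to base rows."""
    if base_rows == 0:
        return 1.0 if delta_rows else 0.0
    return delta_rows / base_rows


def overlap_ratio(intervals) -> float:
    """Assumed measure: fraction of file pairs whose [min, max] ranges intersect."""
    n = len(intervals)
    if n < 2:
        return 0.0
    overlapping = 0
    for i in range(n):
        for j in range(i + 1, n):
            (a_lo, a_hi), (b_lo, b_hi) = intervals[i], intervals[j]
            if a_lo <= b_hi and b_lo <= a_hi:
                overlapping += 1
    return overlapping / (n * (n - 1) / 2)


def plan_maintenance(base_rows, delta_rows, intervals, commit_files, *,
                     ratio_threshold=0.2, overlap_threshold=0.5,
                     commit_threshold=100):
    """Return the maintenance actions triggered by the three threshold checks."""
    actions = []
    if incremental_ratio(base_rows, delta_rows) >= ratio_threshold:
        actions.append("merge_base_and_incremental")
    if overlap_ratio(intervals) >= overlap_threshold:
        actions.append("prune_data_files")
    if commit_files > commit_threshold:
        actions.append("merge_commit_records")
    return actions
```

The pairwise overlap check is quadratic in the number of files, which is acceptable for a sketch but would likely be replaced by a sort-based sweep for large tables.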
When a data file is read, the user's incremental modifications must be merged in. The open-source implementation compares keys as character strings; here the generated primary key is optimized so that, after serialization, keys are compared as integers, which is about 3 to 10 times faster. Memory usage is also significantly reduced and is guaranteed not to exceed the user's limit.
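A minimal Python sketch of merging incremental modifications into base rows using integer keys is given below. The tuple layouts, the operation names `"update"` and `"delete"`, and the treatment of an update to an unknown key as an insert are assumptions for illustration. The sketch shows only the merge logic, not the performance characteristics of the C/C++ implementation.

```python
def merge_increments(base_rows, increments):
    """Apply incremental modifications to base rows, keyed by integer keys.

    base_rows:  list of (key, row) pairs, where key is an int
    increments: list of (key, op, row) in commit order, op is "update" or "delete"
    """
    latest = {}
    for key, op, row in increments:
        latest[key] = (op, row)  # a later modification of the same key wins

    merged = []
    for key, row in base_rows:
        # Integer keys are hashed and compared directly, with no string compare.
        op, new_row = latest.pop(key, ("keep", row))
        if op == "delete":
            continue
        merged.append((key, new_row))

    # Assumption: an update for a key absent from the base file is an insert.
    for key, (op, new_row) in latest.items():
        if op != "delete":
            merged.append((key, new_row))

    merged.sort(key=lambda kv: kv[0])
    return merged
```

Python integers are arbitrary precision, so a 16-byte key fits in a single `int`; in C/C++ the same idea would need a 128-bit integer type or a pair of 64-bit words.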
Read and write performance is thus improved in this embodiment: caching the parsed set of visible data files reduces read latency; pruning the data files avoids full-table scans and greatly improves read performance; merging incremental files promptly keeps read performance consistently good; rearranging the data within data files promptly preserves the effectiveness of data file pruning; a cache service is used for the actual reading and writing of data files, which improves I/O performance; zero-copy encoding and decoding further reduce read and write overhead; the optimized incremental merge reduces memory usage while improving read performance; and the integer conversion of hidden columns uses SIMD, which reduces the cost of transmitting data between nodes while keeping the conversion cost small.
Step 203, in response to determining that the operation instruction is a change instruction, executing a change execution plan according to the following change step: determining change data according to the change instruction; changing a target data file in the distributed database according to the change instruction and the change data; adding statistical information of the target data file and file information of the target data file to a commit record file of the target data file according to the change data and the change instruction; and taking the commit record file as an execution receipt of the change execution plan.
In some embodiments, in response to determining that the operation instruction is a change instruction, the execution body may execute the change execution plan according to the change step set out in step 203.
In some optional implementations of some embodiments, the change step further includes: in response to determining that the change data belongs to a table without a primary key, concatenating the node number of the change data, the row number of the insertion on that node, and the timestamp of the read instruction into a 16-byte character string; and transcoding the character string into 22 bytes and using the transcoded character string as the primary key of the change data. When data is inserted, a hidden column value is generated automatically, according to a specific rule, for a user table without a primary key. The generation rule guarantees that the value is unique across the whole table and introduces no single-point bottleneck, so user tables without a primary key column can be supported.
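The key generation described above can be sketched in Python. The text fixes only the sizes (16 bytes before transcoding, 22 bytes after); the field widths (2-byte node number, 6-byte row number, 8-byte timestamp) and the use of unpadded URL-safe Base64 as the transcoding are assumptions that happen to produce exactly those sizes, not details confirmed by the source.

```python
import base64
import struct


def make_hidden_key(node_id: int, row_number: int, timestamp_us: int) -> str:
    """Build a 22-character key from an assumed 16-byte big-endian layout."""
    raw = (
        struct.pack(">H", node_id)        # 2 bytes: node number (assumed width)
        + row_number.to_bytes(6, "big")   # 6 bytes: insertion row number on the node
        + struct.pack(">Q", timestamp_us) # 8 bytes: timestamp
    )
    assert len(raw) == 16
    # 16 bytes encode to 24 Base64 characters, 2 of which are "=" padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def hidden_key_to_int(key: str) -> int:
    """Decode a 22-character key back into a single 128-bit integer."""
    raw = base64.urlsafe_b64decode(key + "==")
    return int.from_bytes(raw, "big")
```

Because each node numbers its own inserted rows, two nodes never need to coordinate to produce distinct keys, which is one way to avoid a single-point bottleneck. Converting the key back to an integer is what allows integer comparison during incremental merging.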
As an example, in a data change scenario, the change execution plan may be executed through the following change steps:
1. If the user table currently being written has no primary key column, an additional hidden column can be generated as the primary key, with a unique key value generated for each row of data. Specifically, for a user table with no designated primary key column, an extra hidden column is added as the primary key column, and a value is generated automatically for the hidden column on insertion; the value is guaranteed to be unique across the whole table and does not depend on a centralized service, and the generated string value can be serialized into a 16-byte integer. In addition, Hudi has some hidden columns (all of string type) that must be transmitted during actual query processing; these columns can be converted to integers before transmission, and the conversion is accelerated with SIMD (single instruction, multiple data) to improve transmission performance.
2. All user data is encoded into a specific open-source data file format (Apache ORC, Apache Parquet, Apache Avro, etc.). Specifically, this may include encoding and decoding of ORC/Parquet base data files and of Avro/Kryo incremental modification files. Compared with the open-source implementation, memory allocation and data copying in a number of intermediate steps are eliminated, and zero-copy handling of data can be achieved in some scenarios.
3. Data is cached and periodically flushed to the HDFS distributed file system to ensure I/O performance. Writes go directly to HDFS, while reads go through an existing local cache service, which reduces the load on the HDFS cluster and improves I/O performance.
4. After all data has been written, all file information generated by the write (including file name, file length, number of rows written, etc.) and the statistical information of the files (the number of non-null values and the maximum/minimum value of each column, etc.) must be returned.
5. After the result is received, the file information is cached in memory; if the transaction is a long transaction, the file information must be merged with the earlier modification information; when the transaction is committed, the file information is written to the commit record file on HDFS.
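Steps 4 and 5 can be sketched in Python as follows. The dictionary layout of the write result, the `Transaction` class, and the in-memory list standing in for the commit record file are hypothetical; the sketch only shows file information and column statistics being collected per write, accumulated across a long transaction, and appended on commit.

```python
def column_stats(rows, columns):
    """Per-column non-null count and min/max, as described in step 4."""
    stats = {}
    for col in columns:
        values = [r[col] for r in rows if r.get(col) is not None]
        stats[col] = {
            "non_null": len(values),
            "min": min(values) if values else None,
            "max": max(values) if values else None,
        }
    return stats


def write_result(file_name, file_length, rows, columns):
    """File information plus statistics returned after a file is written."""
    return {
        "file_name": file_name,
        "file_length": file_length,
        "row_count": len(rows),
        "stats": column_stats(rows, columns),
    }


class Transaction:
    """Caches write results in memory and records them on commit (step 5)."""

    def __init__(self):
        self.pending = []

    def record(self, result):
        # In a long transaction, later results accumulate with earlier ones.
        self.pending.append(result)

    def commit(self, commit_log, commit_ts):
        # commit_log is a plain list standing in for the commit record file.
        commit_log.append({"commit_ts": commit_ts, "files": list(self.pending)})
        self.pending.clear()
```

The statistics recorded here are the same kind of per-file minimum and maximum values that statistics-based pruning consults at read time.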
When many data files are written at the same time (for example, a single insert into a partitioned table touches tens of thousands of partitions), each file occupies a certain amount of memory; a Huffman-tree-based data rewriting mechanism can be used to limit the number of files open simultaneously and thereby limit memory usage. If multiple tables are modified in one transaction, the modification information is first recorded in a dedicated file and then written to the commit record files of the tables one by one, so that if a machine fails partway through, the interrupted commit process can be resumed after restart. During incremental merging, string comparison is replaced with integer comparison, which reduces memory usage; when multiple files are written, the number of simultaneously open files is controlled by a Huffman tree algorithm, which limits memory usage.
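The text does not say how the Huffman tree is applied. One plausible reading, offered here only as an assumption, is the classic optimal merge pattern: data is first spilled into many small runs, and the runs are then rewritten by repeatedly merging the smallest ones, never opening more than a fixed number of inputs at once. The Python sketch below computes such a merge schedule; the function name and the idea of spilled runs are hypothetical.

```python
import heapq


def huffman_merge_plan(sizes, max_open):
    """Schedule merges of runs so that at most max_open inputs are open at once.

    Returns (steps, total_rewritten), where each step lists the sizes of the
    runs merged in that step. Always merging the smallest runs first is the
    Huffman (optimal merge pattern) strategy, which minimizes the total
    amount of data rewritten.
    """
    if max_open < 2:
        raise ValueError("max_open must be at least 2")
    heap = list(sizes)
    # Pad with zero-size dummy runs so every step can merge exactly max_open
    # runs; this is the standard construction for k-ary Huffman trees.
    while len(heap) > 1 and (len(heap) - 1) % (max_open - 1) != 0:
        heap.append(0)
    heapq.heapify(heap)

    steps, total = [], 0
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(max_open)]
        merged = sum(group)
        total += merged
        steps.append([s for s in group if s > 0])  # drop the dummy runs
        heapq.heappush(heap, merged)
    return steps, total
```

For example, five runs of sizes 1 to 5 with at most three files open are merged in two steps, rewriting 21 units of data in total; merging the largest runs first would rewrite more.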
Step 204, returning the execution receipt of the read execution plan and/or the execution receipt of the modified execution plan.
In some embodiments, the execution body may return an execution response piece to the read execution plan and/or an execution response piece to the change execution plan.
One of the above embodiments of the present disclosure has the following advantageous effects: reading and writing of the Hudi table format are implemented in C/C++, which improves the read, write, and change performance of the table format; user tables without a primary key column are supported; long transactions are supported; and memory consumption can be controlled in all scenarios, so that the resources used for reading and writing do not exceed the limit specified by the user, which allows a higher upper bound on concurrent reads and writes.
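For tables without a primary key, claim 2 describes splicing the node number, the node insertion row number, and a timestamp into a 16-byte string and transcoding it into 22 bytes. The encoding is not named in the text; unpadded Base64 is one encoding that maps 16 bytes to exactly 22 characters, so the sketch below assumes it. The field widths (4 + 4 + 8 bytes) and the function names are also assumptions.

```python
import base64
import struct
from typing import Tuple


def make_row_key(node_id: int, row_number: int, timestamp: int) -> str:
    """Pack three integers into 16 bytes and encode them as 22 characters."""
    # Assumed layout: 4-byte node number, 4-byte row number,
    # 8-byte timestamp, all big-endian and unsigned.
    raw = struct.pack(">IIQ", node_id, row_number, timestamp)
    # Base64 of 16 bytes is 24 characters, the last two being '=' padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_row_key(key: str) -> Tuple[int, int, int]:
    """Inverse of make_row_key: restore the padding, decode, and unpack."""
    raw = base64.urlsafe_b64decode(key + "==")
    return struct.unpack(">IIQ", raw)
```

Because the three components together identify one inserted row, a key built this way can serve as a synthetic primary key without any coordination between nodes.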
With further reference to fig. 3, as an implementation of the method shown in the foregoing figures, the present disclosure provides some embodiments of an apparatus for executing data operation instructions based on the Apache Hudi table format of a distributed database. These apparatus embodiments correspond to the method embodiments shown in fig. 2, and the apparatus is applicable to various electronic devices.
As shown in fig. 3, an execution apparatus 300 for data operation instructions based on the Apache Hudi table format of a distributed database, in some embodiments, includes: an acquisition unit 301, a reading unit 302, a modification unit 303, and a return unit 304. The acquisition unit 301 is configured to acquire an operation instruction for target data in the Apache Hudi table format, where the operation instruction includes a change instruction and/or a read instruction. The reading unit 302 is configured to, in response to determining that the operation instruction is a read instruction, execute a read execution plan according to the following read steps: determining target data according to the read instruction; determining a visible data file from the distributed database according to the read instruction, and determining a query range corresponding to the read instruction from the visible data file; pruning the visible data file within the query range to obtain a data file to be read; reading the target data from the data file to be read according to the read instruction; and taking the target data as the execution receipt of the read execution plan. The modification unit 303 is configured to, in response to determining that the operation instruction is a change instruction, execute a change execution plan according to the following change steps: determining change data according to the change instruction; changing the target data file in the distributed database according to the change instruction and the change data; adding statistical information and file information of the target data file to the commit record file of the target data file according to the change data and the change instruction; and taking the commit record file as the execution receipt of the change execution plan. The return unit 304 is configured to return the execution receipt of the read execution plan and/or the execution receipt of the change execution plan.
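The division of work among the four units can be sketched as a small dispatcher. This is an illustration only: the instruction shape (a dict with a `kind` field) and the names `execute`, `read_plan`, and `change_plan` are assumptions, and the two plans are passed in as plain callables.

```python
from typing import Any, Callable, Dict, List, Tuple

Instruction = Dict[str, Any]


def execute(
    instructions: List[Instruction],
    read_plan: Callable[[Instruction], Any],
    change_plan: Callable[[Instruction], Any],
) -> List[Tuple[str, Any]]:
    """Route each instruction to its execution plan and collect the receipts."""
    receipts: List[Tuple[str, Any]] = []
    for instruction in instructions:      # acquisition unit: take the next instruction
        kind = instruction["kind"]
        if kind == "read":                # reading unit: receipt is the target data
            receipts.append(("read", read_plan(instruction)))
        elif kind == "change":            # modification unit: receipt is the commit record
            receipts.append(("change", change_plan(instruction)))
        else:
            raise ValueError(f"unknown instruction kind: {kind!r}")
    return receipts                       # return unit: hand back all receipts
```

A batch containing both kinds of instruction yields one receipt per instruction, which matches the "and/or" wording of the return step.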
It will be appreciated that the elements described in the apparatus 300 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting benefits described above with respect to the method are equally applicable to the apparatus 300 and the units contained therein, and are not described in detail herein.
One of the above embodiments of the present disclosure has the following advantageous effects: reading and writing of the Hudi table format are implemented in C/C++, which improves the read, write, and change performance of the table format; user tables without a primary key column are supported; long transactions and cross-table transactions are supported; and memory consumption can be controlled in all scenarios, so that the resources used for reading and writing do not exceed the limit specified by the user, which allows a higher upper bound on concurrent reads and writes.
Referring now to fig. 4, a schematic diagram of an electronic device (e.g., server in fig. 1) 400 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 4 is merely an example and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 4, the electronic device 400 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 401, which may perform various suitable actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage means 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic device 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
In general, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic device 400 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 shows an electronic device 400 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 4 may represent one device or a plurality of devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 409, or from storage 408, or from ROM 402. The above-described functions defined in the methods of some embodiments of the present disclosure are performed when the computer program is executed by the processing device 401.
It should be noted that, in some embodiments of the present disclosure, the computer readable medium may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device, or it may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire an operation instruction for target data in the Apache Hudi table format, where the operation instruction includes a change instruction and/or a read instruction; in response to determining that the operation instruction is a read instruction, execute a read execution plan according to the following read steps: determining target data according to the read instruction; determining a visible data file from the distributed database according to the read instruction, and determining a query range corresponding to the read instruction from the visible data file; pruning the visible data file within the query range to obtain a data file to be read; reading the target data from the data file to be read according to the read instruction; and taking the target data as the execution receipt of the read execution plan; in response to determining that the operation instruction is a change instruction, execute a change execution plan according to the following change steps: determining change data according to the change instruction; changing the target data file in the distributed database according to the change instruction and the change data; adding statistical information and file information of the target data file to the commit record file of the target data file according to the change data and the change instruction; and taking the commit record file as the execution receipt of the change execution plan; and return the execution receipt of the read execution plan and/or the execution receipt of the change execution plan.
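The read steps listed above can be sketched as follows. The sketch assumes that visibility is decided by comparing each file's commit timestamp with the timestamp of the read, and that pruning uses the per-file minimum/maximum statistics recorded at write time; neither detail is confirmed by the text, and `DataFile`, `visible_files`, and `prune` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    name: str
    commit_ts: int  # timestamp of the commit that added the file (assumed)
    min_key: int    # minimum value of the filter column in the file
    max_key: int    # maximum value of the filter column in the file


def visible_files(files: List[DataFile], read_ts: int) -> List[DataFile]:
    """Keep only the files committed at or before the read timestamp."""
    return [f for f in files if f.commit_ts <= read_ts]


def prune(files: List[DataFile], low: int, high: int) -> List[DataFile]:
    """Drop files whose [min_key, max_key] cannot overlap [low, high]."""
    return [f for f in files if f.max_key >= low and f.min_key <= high]


def files_to_read(files: List[DataFile], read_ts: int, low: int, high: int) -> List[DataFile]:
    """Visible files first, then pruning against the queried key range."""
    return prune(visible_files(files, read_ts), low, high)
```

Only the files returned by `files_to_read` would then be opened, which is where the saving in I/O comes from.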
Computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including an acquisition unit, a reading unit, a modification unit, and a return unit. In some cases, the names of these units do not limit the units themselves. For example, the acquisition unit may also be described as "a unit that acquires an operation instruction for target data in the Apache Hudi table format, where the operation instruction includes a change instruction and/or a read instruction".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
The foregoing description covers only the preferred embodiments of the present disclosure and the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by substituting the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A method for executing data operation instructions based on the Apache Hudi table format of a distributed database, comprising:
acquiring an operation instruction for target data in the Apache Hudi table format, wherein the operation instruction comprises a change instruction and/or a read instruction;
in response to determining that the operation instruction is a read instruction, executing a read execution plan according to the following read steps: determining target data according to the read instruction; determining a visible data file from a distributed database according to the read instruction, and determining a query range corresponding to the read instruction from the visible data file; pruning the visible data file within the query range to obtain a data file to be read; reading the target data from the data file to be read according to the read instruction; and taking the target data as an execution receipt of the read execution plan;
in response to determining that the operation instruction is a change instruction, executing a change execution plan according to the following change steps: determining change data according to the change instruction; changing the target data file in the distributed database according to the change instruction and the change data; adding statistical information of the target data file and file information of the target data file to a commit record file of the target data file according to the change data and the change instruction; and taking the commit record file as an execution receipt of the change execution plan;
and returning the execution receipt of the read execution plan and/or the execution receipt of the change execution plan.
2. The method of claim 1, wherein the change steps further comprise:
in response to determining that the change data belongs to a table without a primary key, splicing the node number and the node insertion row number of the change data and the timestamp of the read instruction into a 16-byte character string;
and transcoding the character string into 22 bytes, and taking the transcoded character string as the primary key of the change data.
3. The method of claim 1, wherein pruning the visible data file within the query range to obtain the data file to be read comprises:
in response to determining that the incremental modification proportion of the visible data file reaches a threshold, merging the base data file and the incremental data file in the visible data file to obtain the data file to be read;
and in response to determining that the degree of overlap of the visible data files reaches a threshold, pruning the visible data files according to the query instruction to obtain the data file to be read.
4. The method according to claim 1, wherein determining a visible data file from the distributed database according to the read instruction and determining a query range corresponding to the read instruction from the visible data file comprises:
determining a target data file targeted by the read instruction;
and determining the visible range of the target data file according to the commit record file of the target data file, and taking the visible range as the query range corresponding to the read instruction.
5. The method according to claim 4, wherein the method further comprises:
merging the commit record files in response to determining that the number of commit record files corresponding to the target data file exceeds a threshold.
6. An apparatus for executing data operation instructions based on the Apache Hudi table format of a distributed database, comprising:
an acquisition unit configured to acquire an operation instruction for target data in the Apache Hudi table format, wherein the operation instruction comprises a change instruction and/or a read instruction;
a reading unit configured to, in response to determining that the operation instruction is a read instruction, execute a read execution plan according to the following read steps: determining target data according to the read instruction; determining a visible data file from a distributed database according to the read instruction, and determining a query range corresponding to the read instruction from the visible data file; pruning the visible data file within the query range to obtain a data file to be read; reading the target data from the data file to be read according to the read instruction; and taking the target data as an execution receipt of the read execution plan;
a modification unit configured to, in response to determining that the operation instruction is a change instruction, execute a change execution plan according to the following change steps: determining change data according to the change instruction; changing the target data file in the distributed database according to the change instruction and the change data; adding statistical information of the target data file and file information of the target data file to a commit record file of the target data file according to the change data and the change instruction; and taking the commit record file as an execution receipt of the change execution plan;
and a return unit configured to return the execution receipt of the read execution plan and/or the execution receipt of the change execution plan.
7. The apparatus of claim 6, wherein the change steps further comprise:
in response to determining that the change data belongs to a table without a primary key, splicing the node number and the node insertion row number of the change data and the timestamp of the read instruction into a 16-byte character string;
and transcoding the character string into 22 bytes, and taking the transcoded character string as the primary key of the change data.
8. The apparatus of claim 6, wherein the reading unit is further configured to:
in response to determining that the incremental modification proportion of the visible data file reaches a threshold, merge the base data file and the incremental data file in the visible data file to obtain the data file to be read;
and in response to determining that the degree of overlap of the visible data files reaches a threshold, prune the visible data files according to the query instruction to obtain the data file to be read.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
10. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-5.
CN202311532626.2A 2023-11-17 2023-11-17 Execution method of data operation instruction based on Apache Hudi table format of distributed database Pending CN117251214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311532626.2A CN117251214A (en) 2023-11-17 2023-11-17 Execution method of data operation instruction based on Apache Hudi table format of distributed database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311532626.2A CN117251214A (en) 2023-11-17 2023-11-17 Execution method of data operation instruction based on Apache Hudi table format of distributed database

Publications (1)

Publication Number Publication Date
CN117251214A true CN117251214A (en) 2023-12-19

Family

ID=89128034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311532626.2A Pending CN117251214A (en) 2023-11-17 2023-11-17 Execution method of data operation instruction based on Apache Hudi table format of distributed database

Country Status (1)

Country Link
CN (1) CN117251214A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180150490A1 (en) * 2015-05-07 2018-05-31 Cloudera, Inc. Mutations in a column store
CN113094340A (en) * 2021-04-28 2021-07-09 杭州海康威视数字技术股份有限公司 Data query method, device and equipment based on Hudi and storage medium
CN114528127A (en) * 2022-03-31 2022-05-24 Oppo广东移动通信有限公司 Data processing method and device, storage medium and electronic equipment
CN115470223A (en) * 2022-09-02 2022-12-13 上海浪潮云计算服务有限公司 Data lake data incremental consumption method based on two-layer time identification
CN116186041A (en) * 2023-02-21 2023-05-30 中移动信息技术有限公司 Data lake index creation method and device, electronic equipment and computer storage medium

Similar Documents

Publication Publication Date Title
CN110609872B (en) Method and apparatus for synchronizing node data
CN109508326B (en) Method, device and system for processing data
WO2021203918A1 (en) Method for processing model parameters, and apparatus
CN111857720B (en) User interface state information generation method and device, electronic equipment and medium
CN112395253A (en) Index file generation method, terminal device, electronic device and medium
US20180018367A1 (en) Remote query optimization in multi data sources
CN113190517B (en) Data integration method and device, electronic equipment and computer readable medium
CN109697034B (en) Data writing method and device, electronic equipment and storage medium
CN109558251B (en) Method and terminal for modifying page structure information
CN113220281A (en) Information generation method and device, terminal equipment and storage medium
CN117251214A (en) Execution method of data operation instruction based on Apache Hudi table format of distributed database
CN115203210A (en) Hash table processing method, device and equipment and computer readable storage medium
CN112000667B (en) Method, apparatus, server and medium for retrieving tree data
CN113127496B (en) Method and device for determining change data in database, medium and equipment
CN112307061A (en) Method and device for querying data
CN111581930A (en) Online form data processing method and device, electronic equipment and readable medium
CN112507676A (en) Energy report generation method and device, electronic equipment and computer readable medium
CN112445820A (en) Data conversion method and device
CN116483808B (en) Data migration method, device, electronic equipment and computer readable medium
CN116467178B (en) Database detection method, apparatus, electronic device and computer readable medium
CN117422556B (en) Derivative transaction system, device and computer medium based on replication state machine
CN110716885B (en) Data management method and device, electronic equipment and storage medium
US20240028519A1 (en) Data processing method, electronic device and computer program product
CN117349288A (en) Data query method and device based on online analysis processing and electronic equipment
CN116226189A (en) Cache data query method, device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination