WO2021004266A1 - Data insertion method and apparatus, device and storage medium - Google Patents

Data insertion method and apparatus, device and storage medium Download PDF

Info

Publication number
WO2021004266A1
WO2021004266A1 PCT/CN2020/097563 CN2020097563W WO2021004266A1 WO 2021004266 A1 WO2021004266 A1 WO 2021004266A1 CN 2020097563 W CN2020097563 W CN 2020097563W WO 2021004266 A1 WO2021004266 A1 WO 2021004266A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
group
partition
target table
current
Prior art date
Application number
PCT/CN2020/097563
Other languages
French (fr)
Chinese (zh)
Inventor
张泓毅
陈振强
Original Assignee
星环信息科技(上海)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 星环信息科技(上海)有限公司 filed Critical 星环信息科技(上海)有限公司
Publication of WO2021004266A1 publication Critical patent/WO2021004266A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof

Definitions

  • This application relates to data storage technology, such as a data insertion method, device, equipment, and storage medium.
  • Partitioning is a commonly used data organization method in databases. Most of the related technologies use a single partition to insert one by one, and the performance cannot meet the requirements when processing large quantities of data. If you want to insert different partitions at the same time, it will cause system instability due to operating too many files in different partitions at the same time for a long time, such as excessive memory usage and too many open file handles.
  • This application provides a data insertion method, device, equipment, and storage medium to improve the performance of dynamic partition insertion while ensuring system stability.
  • a data insertion method including:
  • the insertion action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table;
  • each group of data to be inserted is dynamically inserted into the corresponding target table file in turn.
  • a data insertion device including:
  • An insert action type determining module configured to determine the insert action type of the data insert command according to the acquired data insert command
  • the partitioning bucket information determining module is configured to determine the partitioning and bucketing information of the target table according to the meta-information of the target table when the insertion action type is dynamic partition insertion;
  • a data grouping module configured to divide the data to be inserted into at least one group according to the partition and bucket information
  • a data sorting module configured to sort the data to be inserted in each group according to the partition and bucket information
  • the data insertion module is configured to dynamically insert each group of data to be inserted into the corresponding target table file in sequence according to the sequence of the data to be inserted in each group.
  • a device is also provided, and the device includes:
  • One or more processors are One or more processors;
  • Memory set to store one or more programs
  • the one or more processors When the one or more programs are executed by the one or more processors, the one or more processors implement the data insertion method provided in any embodiment of the present application.
  • a storage medium containing computer-executable instructions is also provided, and the computer-executable instructions are used to execute the data insertion method provided in any embodiment of the present application when executed by a computer processor.
  • FIG. 1 is a flowchart of a data insertion method in Embodiment 1 of the present application
  • FIG. 2 is a flowchart of a data insertion method in Embodiment 2 of the present application.
  • FIG. 3 is a flowchart of a data insertion method in Embodiment 3 of the present application.
  • FIG. 5 is a flowchart of a data insertion method in Embodiment 5 of the present application.
  • FIG. 6 is a schematic structural diagram of a data insertion device in the sixth embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a device in Embodiment 7 of the present application.
  • Fig. 1 is a flowchart of a data insertion method provided in the first embodiment of this application. This embodiment is applicable to the case of importing data into a database.
  • the method can be executed by a data insertion device, which can be hardware and/or Software implementation, including the following steps:
  • Step 110 Determine the insertion action type of the data insertion command according to the acquired data insertion command.
  • a corresponding insertion command when a data insertion task needs to be executed, a corresponding insertion command will be executed.
  • the data insertion command of the database engine it can be determined whether it is a dynamic partition insertion or a static partition insertion.
  • Static partition insertion refers to specifying the target partition when inserting data, and it can only be inserted into one partition at a time;
  • dynamic partition insertion refers to not specifying the target partition when inserting data, but dynamically judging the target partition of the data according to the partition column. Insert into multiple partitions.
  • Step 120 When the insert action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table.
  • the meta information of the target table is obtained, and the partition and bucket information of the target table is determined. For example, according to the insert command of the sql statement, determine whether it is a dynamic partition insert, and then obtain the meta-information of the target table, determine whether the target table is a partitioned non-bucketed table, or a partitioned bucketed table, and partition columns , Bucket column and other information.
  • Step 130 According to the partition and bucket information, divide the data to be inserted into at least one group.
  • the target table is a bucketed table
  • the data to be inserted when the data to be inserted is grouped, ensure that the data of the same bucketed file is in the same group. If the target table is not a bucketed table, theoretically any grouping of the inserted data will not be inserted into the dynamic partition. Make an impact.
  • the data to be inserted can be grouped to avoid too much data to be inserted in a single group. Generally, the data to be inserted can be Group evenly.
  • the data to be inserted is divided into at least one group according to the first grouping rule.
  • dividing the data to be inserted into at least one group according to the first grouping rule includes: obtaining a preset number of groups; determining the group identification number of each piece of data to be inserted, wherein the group identification number includes the preset field of the data to be inserted The hash value or the random number corresponding to the data to be inserted; the result of modulo the preset group number and the group number according to the group identification number, and the data to be inserted is allocated to the result of the modulus and the group number corresponding to the Array.
  • the group identification number of the data to be inserted is modulo the preset grouping number according to the group identification number, and the result of the modulus is matched with the group number, so that the data to be inserted is allocated to the array corresponding to each group of data to be inserted.
  • the preset group number is 7 (group number is 0-6)
  • the hash value of a selected field of the data to be inserted is 123878437872
  • the modulo 7 is 0, so the data to be inserted is assigned to the group number 0 group. It is also possible to divide the data to be inserted into at least one group according to the amount of data to be inserted and the amount of the preset single group of data; wherein the amount of data to be inserted in each group is less than or equal to the amount of the preset single group of data.
  • the number of data to be inserted is less than or equal to the preset single group of data, then only one group is enough; if the number of data to be inserted is greater than the preset single group of data, then it is divided into multiple groups, each group to be inserted
  • the amount of data is less than or equal to the preset single set of data amount.
  • the data can be grouped in equal proportions or in unequal proportions, which is not limited here.
  • Step 140 Sort the data to be inserted in each group according to the partition and bucket information.
  • each group of data to be inserted is sorted within the group according to the value of the partition column, and the sorting is to ensure that the data of the same partition is arranged continuously.
  • the data to be inserted corresponding to the same target table partition is adjacent and arranged consecutively.
  • the data to be inserted corresponding to the same target table partition is inserted into the partition file of the target table. Insert the data to be inserted into the next target table partition.
  • Step 150 According to the order of the data to be inserted in each group, dynamically insert each group of data to be inserted into the corresponding target table file in turn.
  • the technical solution of this embodiment solves the problem of system instability caused by operating a large number of files in different partitions at the same time for a long time by sorting the data to be inserted into the target table one by one, inserting into the target table one by one, and ensuring the stability of the system. Next, the effect of improving the performance of dynamic partition insertion.
  • Fig. 2 is a flowchart of a data insertion method provided in the second embodiment of the application.
  • the technical solution of this embodiment is described on the basis of the above technical solution, and includes the following steps:
  • Step 210 Determine the insertion action type of the data insertion command according to the acquired data insertion command.
  • Step 220 When the insert action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table.
  • Step 230 When the target table is a partitioned and bucketed table, group the data to be inserted according to the hash value of the bucketed column.
  • the target table is a partition and bucket table
  • the data must be in the same group, which is semantically equivalent to distribute by buckets.
  • Step 240 Sort the data to be inserted in the group according to the partition and bucket information.
  • Step 250 According to the sequence of the data to be inserted in the group, dynamically insert each group of data to be inserted into the corresponding target table file in turn.
  • the technical solution of this embodiment by grouping the data in the same bucket, ensures that the data to be inserted that needs to be inserted into the same bucket file are in the same group, so as to avoid repeated scanning of partitions during dynamic partition insertion, and improve dynamic partition insertion. performance.
  • Fig. 3 is a flowchart of a data insertion method provided in the third embodiment of the application.
  • the technical solution of this embodiment is described on the basis of the above technical solution, and includes the following steps:
  • Step 310 Determine the insertion action type of the data insertion command according to the acquired data insertion command.
  • Step 320 When the insert action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table.
  • Step 330 According to the partition and bucket information, divide the data to be inserted into at least one group.
  • Step 340 When the target table is a multi-level partition table, sort the data to be inserted in the group according to the order of the partitions of the target table.
  • the target table is a multi-level partition table determined by the partition and bucket information
  • the data to be inserted in the group is sorted in order according to the partition order of the target table. Semantically equivalent to sort by partition column 1, partition column 2... This ensures the continuous arrangement of data in the same partition.
  • Step 350 When the target table is a multi-level range partition table, sort the data to be inserted in the group according to the sequence after the target table partition field is mapped to the range partition of the target table.
  • the target table is determined by the partition and bucket information to be a multi-level range partition table
  • the multi-level range partition table is not a single-value partition corresponding to a value of a partition key, but a range of a partition corresponding to a partition key value. That is, the range of the value of a partition key. Therefore, multi-level range partition sorting should sort the data to be inserted in the group after the partition field is mapped to the partition.
  • the range partition r1, r2 with 2 fields.
  • r1 is the date (date) type
  • r2 is the integer (int) type
  • the range partition is first partitioned by the month of r1 ([1970-01-01,1970-01-31] a partition, [1970-02-01,1970 -02-28] a partition), and then partition according to the hundreds of r2 (ie [0,99] is in a partition, [100,199] is in a partition, etc.). Then, to_month(r1), r2/10 should be sorted.
  • the equivalent syntax is distribute by to_month(r1), r2/10.
  • the data to be inserted in the group is sorted directly by the value of the partition key, the following order may appear: (1970-01-01,1),(1970-01-01,11),(1970-01-02,2 ), and they belong to partition 1, partition 2, and partition 1, respectively.
  • the data in the same partition is not arranged together. If it is for a single range partition field, it can be simplified to directly sort the data to be inserted in the group by the value of the partition field.
  • Step 360 According to the sequence of the data to be inserted in the group, dynamically insert each group of data to be inserted into the corresponding target table file in turn.
  • step 340 and step 350 are executed selectively according to the partition type of the target table, and the order of execution is not limited as shown in FIG. 3.
  • the data to be inserted in the group is sorted through the case that the target table is a range partition to ensure that the data to be inserted in the same partition is arranged continuously, so that when the data in the group is dynamically partitioned and inserted sequentially, There is no need to repeatedly open different partition files, which improves the performance of dynamic partition insertion.
  • FIG. 4 is a flowchart of a data insertion method provided in Embodiment 4 of this application.
  • the technical solution of this embodiment is described on the basis of the above technical solution, and includes the following steps:
  • Step 410 Determine the insertion action type of the data insertion command according to the acquired data insertion command.
  • Step 420 When the insert action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table.
  • Step 430 According to the partition and bucket information, divide the data to be inserted into at least one group.
  • Step 440 Sort the data to be inserted in the group according to the partition and bucket information.
  • Step 450 Read the first row in the current group, open the file handle corresponding to the first row as the current file handle, and write the data to be inserted in the first row into the current file handle.
  • Step 460 Read the next row as the current row in sequence, and if the current row belongs to the target table file corresponding to the current file handle, write the data to be inserted in the current row into the current file handle.
  • Step 470 If the current row does not belong to the target table file corresponding to the current file handle, close the current file handle, open the file handle corresponding to the current row, and write the data to be inserted in the current row into the file handle corresponding to the current row.
  • Step 480 When the next row cannot be read in the current group, the insertion operation of the data to be inserted in the group is ended; the insertion operation of all groups is completed in a preset order.
  • the target table is a partitioned and bucketed table
  • the data to be inserted in the group is sorted according to the partition attribution, thus ensuring The data belonging to the same partition must be arranged continuously during the insertion process. Therefore, in the process of inserting files, you only need to open one file at a time for continuous insertion. When a row is found that does not belong to the currently opened file, then all the data to be inserted belonging to the currently opened file in this group must have been inserted previously. Therefore, you can directly close the currently opened file, and open the corresponding file to which the current line belongs, and you can continue to insert.
  • the SQL insert instruction is executed to query the meta information of the target table to obtain the partition and bucket information of the target table. Determine whether it is a dynamic partition insertion. If it is, determine whether the target table is bucketed according to the meta-information of the target table. If the target table is bucketed, group the source table data according to the hash value of the bucket column. If the target table is not bucketed according to The preset rules divide the data to be inserted into at least one group. Perform sorting within the group according to the partition column field for each grouping.
  • the target table is a partitioned and bucketed table, as shown in Table 1.
  • FIG. 6 is a schematic structural diagram of a data insertion device provided by Embodiment 6 of the application.
  • the device can be configured in a database.
  • the device includes: an insertion action type determination module 610, a partition and bucket information determination module 620, and a data grouping module 630 , The data sorting module 640 and the data inserting module 650.
  • the insertion action type determining module 610 is configured to determine the insertion action type of the data insertion command according to the acquired data insertion command;
  • the partition bucket information determining module 620 is configured to determine the partition bucket information of the target table according to the meta information of the target table when the insert action type is dynamic insert;
  • the data grouping module 630 is configured to divide the data to be inserted into at least one group according to the partition and bucket information;
  • the data sorting module 640 is configured to sort the data to be inserted in each group according to the partition and bucket information
  • the data insertion module 650 is configured to dynamically insert each group of data to be inserted into the corresponding target table file in sequence according to the sequence of the data to be inserted in each group.
  • the technical solution of this embodiment solves the problem of system instability caused by operating a large number of files in different partitions at the same time for a long time by sorting the data to be inserted into the target table one by one, inserting into the target table one by one, and ensuring the stability of the system. Next, the effect of improving the performance of dynamic partition insertion.
  • the data grouping module 630 includes:
  • the first grouping unit is configured to divide the data to be inserted into at least one group according to the first grouping rule when the target table is a partitioned non-bucketed table.
  • the first grouping unit is set to:
  • the data to be inserted is divided into at least one group according to the amount of data to be inserted and the amount of the preset single group of data; wherein the amount of data to be inserted in each group is less than or equal to the amount of preset single group of data.
  • the first grouping unit is set to:
  • the group identification number includes a hash value of a preset field of the data to be inserted or a random number corresponding to the data to be inserted;
  • the data to be inserted is allocated to the corresponding array.
  • the data grouping module 630 includes:
  • the second grouping unit is set to group the data to be inserted according to the hash value of the bucketing column when the target table is a partitioned bucketing table.
  • the data sorting module 640 includes:
  • the first sorting unit is set to sort the data to be inserted in each group according to the order of the partitions of the target table when the target table is a multi-level partition table.
  • the data sorting module 640 includes:
  • the second sorting unit is set to sort the data to be inserted in each group according to the sequence after the target table partition field is mapped to the range partition of the target table when the target table is a multi-level range partition table.
  • the data insertion module 650 includes:
  • the first writing unit is set to read the first row in the current group, open the file handle corresponding to the first row as the current file handle, and write the data to be inserted in the first row into the current file handle;
  • the second writing unit is set to read the next row as the current row in sequence. If the current row belongs to the target table file corresponding to the current file handle, write the data to be inserted in the current row into the current file handle;
  • the third writing unit is set to close the current file handle if the current row does not belong to the target table file corresponding to the current file handle, open the file handle corresponding to the current row, and write the data to be inserted in the current row into the file handle corresponding to the current row ;
  • the insertion end unit of the group is set to end the insertion operation of the data to be inserted in the group when the next line cannot be read in the current group;
  • Full group insertion unit set to complete the insertion operation of all groups in a preset order.
  • the data insertion device provided in the embodiment of the present application can execute the data insertion method provided in any embodiment of the present application, and has functional modules and beneficial effects corresponding to the execution method.
  • FIG. 7 is a schematic structural diagram of a device provided in Embodiment 7 of the application.
  • the device includes a processor 710, a memory 720, an input device 730, and an output device 740; the number of processors 710 in the device may be One or more, one processor 710 is taken as an example in FIG. 7; the processor 710, the memory 720, the input device 730, and the output device 740 in the device can be connected by a bus or other means. In FIG. 7, the connection by a bus is taken as an example .
  • the memory 720 can be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the data insertion method in the embodiment of the present application (for example, the insertion action in the data insertion device).
  • the processor 710 executes various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 720, that is, realizes the aforementioned data insertion method.
  • the memory 720 may include a program storage area and a data storage area.
  • the program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created according to the use of the terminal, and the like.
  • the memory 720 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the memory 720 may include a memory remotely provided with respect to the processor 710, and these remote memories may be connected to the device through a network. Examples of the aforementioned networks include the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 730 may be used to receive inputted numeric or character information, and generate key signal input related to user settings and function control of the device.
  • the output device 740 may include a display device such as a display screen.
  • the eighth embodiment of the present application also provides a storage medium containing computer-executable instructions, which are used to execute a data insertion method when the computer-executable instructions are executed by a computer processor, and the method includes:
  • the insertion action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table;
  • each group of data to be inserted is dynamically inserted into the corresponding target table file in turn.
  • An embodiment of the present application provides a storage medium containing computer-executable instructions.
  • the computer-executable instructions include the method operations described above, and can also perform related operations in the data insertion method provided in any embodiment of the present application.
  • This application can be implemented with the help of software and necessary general-purpose hardware, or can be implemented with hardware.
  • the technical solution of this application can be embodied in the form of a software product.
  • the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), and random access memory ( Random Access Memory, flash memory (FLASH), hard disk or optical disk, etc., including at least one instruction to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute the multiple embodiments of this application Methods.
  • the multiple units and modules included are only divided according to the functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, the multiple functional units
  • the names are only for the convenience of distinguishing each other, and are not used to limit the scope of protection of this application.

Abstract

Disclosed are a data insertion method and apparatus, a device and a storage medium. The data insertion method comprises: determining, according to an acquired data insertion command, an insertion action type of the data insertion command; insofar as the insertion action type is dynamic partition insertion, determining, according to meta-information of a target table, partition and bucket division information of the target table; dividing, according to the partition and bucket division information, data, which is to be inserted, into at least one group; sorting said data in each group according to the partition and bucket division information; and according to the sequence of said data in each group, dynamically inserting said data in each group into a corresponding target table file in sequence.

Description

数据插入方法、装置、设备和储存介质Data insertion method, device, equipment and storage medium
本申请要求在2019年07月10日提交中国专利局、申请号为201910619873.3的中国专利申请的优先权,该申请的全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed with the Chinese Patent Office with application number 201910619873.3 on July 10, 2019. The entire content of this application is incorporated into this application by reference.
技术领域Technical field
本申请涉及数据存储技术,例如涉及一种数据插入方法、装置、设备和储存介质。This application relates to data storage technology, such as a data insertion method, device, equipment, and storage medium.
背景技术Background technique
随着应用场景的复杂化,数据经常在不同数据库之间流转。而随着大数据时代的到来,数据库之间导入或导出的数据量也越来越大。As application scenarios become more complex, data often flows between different databases. With the advent of the big data era, the amount of data imported or exported between databases is also increasing.
分区是数据库中常用的数据组织方式,相关技术大都采用单个分区逐个插入的方式,在处理大批量数据时性能无法满足要求。如果要同时插入不同分区,则会因为长时间同时操作不同分区的太多文件,导致系统不稳定,例如内存使用过多、打开的文件句柄数量过大等问题。Partitioning is a commonly used data organization method in databases. Most of the related technologies use a single partition to insert one by one, and the performance cannot meet the requirements when processing large quantities of data. If you want to insert different partitions at the same time, it will cause system instability due to operating too many files in different partitions at the same time for a long time, such as excessive memory usage and too many open file handles.
发明内容Summary of the invention
本申请提供一种数据插入方法、装置、设备和储存介质,以实现在保证系统稳定的情况下,提高动态分区插入的性能。This application provides a data insertion method, device, equipment, and storage medium to improve the performance of dynamic partition insertion while ensuring system stability.
提供了一种数据插入方法,包括:Provides a data insertion method, including:
根据获取到的数据插入命令,确定所述数据插入命令的插入动作类型;Determine the type of insertion action of the data insertion command according to the acquired data insertion command;
在所述插入动作类型为动态分区插入的情况下,根据目标表的元信息,确定所述目标表的分区分桶信息;In a case where the insertion action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table;
根据所述分区分桶信息,将待插入数据分为至少一组;Divide the data to be inserted into at least one group according to the partition and bucket information;
根据所述分区分桶信息,对每组组内的待插入数据进行排序;Sort the data to be inserted in each group according to the partition and bucket information;
按照所述待插入数据在每组组内的顺序,将每组待插入数据依次动态插入对应的目标表文件。According to the sequence of the data to be inserted in each group, each group of data to be inserted is dynamically inserted into the corresponding target table file in turn.
还提供了一种数据插入装置,包括:A data insertion device is also provided, including:
插入动作类型确定模块,设置为根据获取到的数据插入命令,确定所述数据插入命令的插入动作类型;An insert action type determining module, configured to determine the insert action type of the data insert command according to the acquired data insert command;
分区分桶信息确定模块,设置为在所述插入动作类型为动态分区插入的情况下,根据目标表的元信息,确定所述目标表的分区分桶信息;The partitioning bucket information determining module is configured to determine the partitioning and bucketing information of the target table according to the meta-information of the target table when the insertion action type is dynamic partition insertion;
数据分组模块,设置为根据所述分区分桶信息,将待插入数据分为至少一组;A data grouping module, configured to divide the data to be inserted into at least one group according to the partition and bucket information;
数据排序模块,设置为根据所述分区分桶信息,对每组组内的待插入数据进行排序;A data sorting module, configured to sort the data to be inserted in each group according to the partition and bucket information;
数据插入模块,设置为按照所述待插入数据在每组组内的顺序,将每组待插入数据依次动态插入对应的目标表文件。The data insertion module is configured to dynamically insert each group of data to be inserted into the corresponding target table file in sequence according to the sequence of the data to be inserted in each group.
还提供了一种设备,所述设备包括:A device is also provided, and the device includes:
一个或多个处理器;One or more processors;
存储器,设置为存储一个或多个程序,Memory, set to store one or more programs,
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如本申请任意实施例所提供的数据插入方法。When the one or more programs are executed by the one or more processors, the one or more processors implement the data insertion method provided in any embodiment of the present application.
还提供了一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行本申请任意实施例所提供的数据插入方法。A storage medium containing computer-executable instructions is also provided, and the computer-executable instructions are used to execute the data insertion method provided in any embodiment of the present application when executed by a computer processor.
附图说明Description of the drawings
图1是本申请实施例一中的一种数据插入方法的流程图;FIG. 1 is a flowchart of a data insertion method in Embodiment 1 of the present application;
图2是本申请实施例二中的一种数据插入方法的流程图;2 is a flowchart of a data insertion method in Embodiment 2 of the present application;
图3是本申请实施例三中的一种数据插入方法的流程图;FIG. 3 is a flowchart of a data insertion method in Embodiment 3 of the present application;
图4是本申请实施例四中的一种数据插入方法的流程图;4 is a flowchart of a data insertion method in the fourth embodiment of the present application;
图5是本申请实施例五中的一种数据插入方法的流程图;FIG. 5 is a flowchart of a data insertion method in Embodiment 5 of the present application;
图6是本申请实施例六中的一种数据插入装置的结构示意图;6 is a schematic structural diagram of a data insertion device in the sixth embodiment of the present application;
图7是本申请实施例七中的一种设备的结构示意图。FIG. 7 is a schematic structural diagram of a device in Embodiment 7 of the present application.
具体实施方式Detailed ways
下面结合附图和实施例对本申请进行说明。附图中仅示出了与本申请相关的部分而非全部结构。The application will be described below with reference to the drawings and embodiments. The drawings only show a part but not all of the structure related to this application.
实施例一Example one
图1为本申请实施例一提供的一种数据插入方法的流程图,本实施例可适用于向数据库导入数据的情况,该方法可以由数据插入装置来执行,该装置可以由硬件和/或软件来实现,包括如下步骤:Fig. 1 is a flowchart of a data insertion method provided in the first embodiment of this application. This embodiment is applicable to the case of importing data into a database. The method can be executed by a data insertion device, which can be hardware and/or Software implementation, including the following steps:
步骤110、根据获取到的数据插入命令,确定数据插入命令的插入动作类型。Step 110: Determine the insertion action type of the data insertion command according to the acquired data insertion command.
其中,当需要执行数据插入任务时,会执行相应的插入命令,根据数据库引擎的数据插入命令可以确定是动态分区插入还是静态分区插入。静态分区插入是指插入数据的时候指定目标分区,一次只能插入到一个分区;动态分区插入是指插入数据的时候不指定目标分区,而是根据分区列动态的判断数据的目标分区,一次可以插入到多个分区。Among them, when a data insertion task needs to be executed, a corresponding insertion command will be executed. According to the data insertion command of the database engine, it can be determined whether it is a dynamic partition insertion or a static partition insertion. Static partition insertion refers to specifying the target partition when inserting data, and it can only be inserted into one partition at a time; dynamic partition insertion refers to not specifying the target partition when inserting data, but dynamically judging the target partition of the data according to the partition column. Insert into multiple partitions.
步骤120、当插入动作类型为动态分区插入时,根据目标表的元信息,确定目标表的分区分桶信息。Step 120: When the insert action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table.
其中,如果是动态分区插入,则获取目标表的元信息,确定目标表的分区分桶信息。例如,根据sql语句的插入命令,判断是否为动态分区插入,在确定为动态分区插入,进而获取目标表的元信息,确定目标表是分区非分桶表,还是分区分桶表,以及分区列、分桶列等信息。Among them, if it is a dynamic partition insertion, the meta information of the target table is obtained, and the partition and bucket information of the target table is determined. For example, according to the insert command of the sql statement, determine whether it is a dynamic partition insert, and then obtain the meta-information of the target table, determine whether the target table is a partitioned non-bucketed table, or a partitioned bucketed table, and partition columns , Bucket column and other information.
步骤130、根据分区分桶信息,将待插入数据分为至少一组。Step 130: According to the partition and bucket information, divide the data to be inserted into at least one group.
其中,如果目标表是分桶表,对待插入数据分组时,保证同一个分桶文件的数据在同一组中,如果目标表不是分桶表,理论上对待插入数据任意分组不会对动态分区插入造成影响。为了单个分组中待插入数据的数量不会过大,造成组内需要排序的数据过多,可以将待插入数据进行分组,避免单个分组中待插入数据的数量过大,一般可以将待插入数据进行均匀分组。可选的,当目标表为分区非分桶表,按照第一分组规则将待插入数据分为至少一组。示例的,按照第一分组规则将待插入数据分为至少一组,包括:获取预设分组数;确定每一条待插入数据的分组标识数,其中,分组标识数包括待插入数据的预设字段的哈希(hash)值或待插入数据对应的随机数;根据分组标识数对预设分组数取模的结果和分组号,将待插入数据分配到与所述取模的结果和分组号对应的数组。也就是,确定将要把待插入数据分为多少组,对每一条待插入数据的选定字段求hash值,或者对每一条待插入数据生成一个随机数,将得到的hash值或随机数作为该条待插入数据的分组标识数,根据分组标识数对预设分组数取模,将取模的结果与分组号进行匹配,从而将待插入数据分配到与每组待插入数据对应的数组中。例如,预设分组数是7(组号是0-6),一个待插入数据选定字段的hash值是123878437872,对7取模是0,所以将该待插入数据分配到组号 为0的组。还可以根据待插入数据的数量和预设单组数据量,将待插入数据分为至少一组;其中,每个组中的待插入数据的数量小于或等于预设单组数据量。如果待插入数据的数量小于或等于预设单组数据量,则只分一组即可;如果待插入数据的数量大于预设单组数据量,则分为多组,每组中的待插入数据的数量均小于或等于预设单组数据量,其中,可以等比例分组,也可以不等比例分组,在此不做限定。Among them, if the target table is a bucketed table, when the data to be inserted is grouped, ensure that the data of the same bucketed file is in the same group. If the target table is not a bucketed table, theoretically any grouping of the inserted data will not be inserted into the dynamic partition. Make an impact. In order that the amount of data to be inserted in a single group will not be too large, resulting in too much data that needs to be sorted in the group, the data to be inserted can be grouped to avoid too much data to be inserted in a single group. Generally, the data to be inserted can be Group evenly. Optionally, when the target table is a partitioned non-bucketed table, the data to be inserted is divided into at least one group according to the first grouping rule. For example, dividing the data to be inserted into at least one group according to the first grouping rule includes: obtaining a preset number of groups; determining the group identification number of each piece of data to be inserted, wherein the group identification number includes the preset field of the data to be inserted The hash value or the random number corresponding to the data to be inserted; the result of modulo the preset group number and the group number according to the group identification number, and the data to be inserted is allocated to the result of the modulus and the group number corresponding to the Array. That is, determine how many groups the data to be inserted will be divided into, calculate the hash value for each selected field of the data to be inserted, or generate a random number for each data to be inserted, and use the obtained hash value or random number as the The group identification number of the data to be inserted is modulo the preset grouping number according to the group identification number, and the result of the modulus is matched with the group number, so that the data to be inserted is allocated to the array corresponding to each group of data to be inserted. For example, the preset group number is 7 (group number is 0-6), the hash value of a selected field of the data to be inserted is 123878437872, and the modulo 7 is 0, so the data to be inserted is assigned to the group number 0 group. It is also possible to divide the data to be inserted into at least one group according to the amount of data to be inserted and the amount of the preset single group of data; wherein the amount of data to be inserted in each group is less than or equal to the amount of the preset single group of data. If the number of data to be inserted is less than or equal to the preset single group of data, then only one group is enough; if the number of data to be inserted is greater than the preset single group of data, then it is divided into multiple groups, each group to be inserted The amount of data is less than or equal to the preset single set of data amount. Among them, the data can be grouped in equal proportions or in unequal proportions, which is not limited here.
步骤140、根据分区分桶信息,对每组组内的待插入数据进行排序。Step 140: Sort the data to be inserted in each group according to the partition and bucket information.
其中,在分组之后,将每组待插入数据根据分区列的值进行组内排序,排序是为了保证同一个分区的数据连续排列。这样在以组为单位执行插入时,对应同一个目标表分区的待插入数据是相邻的,且连续排列,在对应同一个目标表分区的待插入数据插入目标表的分区文件之后,才会插入下一个目标表分区的待插入数据。Among them, after grouping, each group of data to be inserted is sorted within the group according to the value of the partition column, and the sorting is to ensure that the data of the same partition is arranged continuously. In this way, when inserting is performed in a group, the data to be inserted corresponding to the same target table partition is adjacent and arranged consecutively. The data to be inserted corresponding to the same target table partition is inserted into the partition file of the target table. Insert the data to be inserted into the next target table partition.
步骤150、按照待插入数据在每组组内的顺序,将每组待插入数据依次动态插入对应的目标表文件。Step 150: According to the order of the data to be inserted in each group, dynamically insert each group of data to be inserted into the corresponding target table file in turn.
其中,按照待插入数据在组内的顺序进行插入,在对应同一个目标表分区的待插入数据插入之后,自动识别出下一个需要插入数据的目标表分区,直至组内所有的待插入数据插入完毕,对每组待插入数据都进行插入,从而完成所有待插入数据插入到目标表文件中。Among them, insert according to the order of the data to be inserted in the group. After the data to be inserted corresponding to the same target table partition is inserted, the next target table partition that needs to be inserted data is automatically identified until all the data to be inserted in the group is inserted When finished, insert each group of data to be inserted, thereby completing all the data to be inserted into the target table file.
自动识别动态分区插入的场景,针对分区和分桶列对数据进行划分和排序,使得动态分区插入的过程中,达到像静态分区插入一样每次对一个分区插入的效果,在执行一次操作的前提下,既避免了多次分区的重复扫表,也避免了对内存或文件句柄造成巨大的压力。Automatically identify the scenario of dynamic partition insertion, divide and sort the data for partitions and bucket columns, so that during the process of dynamic partition insertion, the effect of inserting one partition at a time is the same as static partition insertion. The premise is to perform an operation This not only avoids repeated table scans for multiple partitions, but also avoids huge pressure on memory or file handles.
本实施例的技术方案,通过对待插入数据分组排序,逐个插入目标表,同时只打开一个文件,解决长时间同时操作不同分区的大量文件,导致系统不稳定的问题,实现在保证系统稳定的情况下,提高动态分区插入的性能的效果。The technical solution of this embodiment solves the problem of system instability caused by operating a large number of files in different partitions at the same time for a long time by sorting the data to be inserted into the target table one by one, inserting into the target table one by one, and ensuring the stability of the system. Next, the effect of improving the performance of dynamic partition insertion.
实施例二Example two
图2为本申请实施例二提供的一种数据插入方法的流程图,本实施例的技术方案在上述技术方案的基础上进行说明,包括如下步骤:Fig. 2 is a flowchart of a data insertion method provided in the second embodiment of the application. The technical solution of this embodiment is described on the basis of the above technical solution, and includes the following steps:
步骤210、根据获取到的数据插入命令,确定数据插入命令的插入动作类型。Step 210: Determine the insertion action type of the data insertion command according to the acquired data insertion command.
步骤220、当插入动作类型为动态分区插入时,根据目标表的元信息,确定目标表的分区分桶信息。Step 220: When the insert action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table.
步骤230、当目标表为分区分桶表,将待插入数据根据分桶列的hash值分 组。Step 230: When the target table is a partitioned and bucketed table, group the data to be inserted according to the hash value of the bucketed column.
其中,在得到分区分桶信息之后,如果目标表是分区分桶表,首先将源表的数据根据目标表的分桶列的hash值划分成分桶数组,这样保证最后应该插入同一个分桶文件的数据一定分在同一组,语义上等价于distribute by分桶列。Among them, after obtaining the partition and bucket information, if the target table is a partition and bucket table, first divide the source table data into a bucket array according to the hash value of the bucket column of the target table, so as to ensure that the same bucket file should be inserted at the end The data must be in the same group, which is semantically equivalent to distribute by buckets.
步骤240、根据分区分桶信息,对组内的待插入数据进行排序。Step 240: Sort the data to be inserted in the group according to the partition and bucket information.
步骤250、按照待插入数据在组内的顺序,将每组待插入数据依次动态插入对应的目标表文件。Step 250: According to the sequence of the data to be inserted in the group, dynamically insert each group of data to be inserted into the corresponding target table file in turn.
本实施例的技术方案,通过将同一分桶的数据进行分组,保证需要插入同一个分桶文件的待插入数据在同一分组,实现在动态分区插入时,避免反复扫描分区,提高动态分区插入的性能。The technical solution of this embodiment, by grouping the data in the same bucket, ensures that the data to be inserted that needs to be inserted into the same bucket file are in the same group, so as to avoid repeated scanning of partitions during dynamic partition insertion, and improve dynamic partition insertion. performance.
实施例三Example three
图3为本申请实施例三提供的一种数据插入方法的流程图,本实施例的技术方案在上述技术方案的基础上进行说明,包括如下步骤:Fig. 3 is a flowchart of a data insertion method provided in the third embodiment of the application. The technical solution of this embodiment is described on the basis of the above technical solution, and includes the following steps:
步骤310、根据获取到的数据插入命令,确定数据插入命令的插入动作类型。Step 310: Determine the insertion action type of the data insertion command according to the acquired data insertion command.
步骤320、当插入动作类型为动态分区插入时,根据目标表的元信息,确定目标表的分区分桶信息。Step 320: When the insert action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table.
步骤330、根据分区分桶信息,将待插入数据分为至少一组。Step 330: According to the partition and bucket information, divide the data to be inserted into at least one group.
步骤340、当目标表为多级分区表,按照目标表的分区的顺序,对组内的待插入数据进行排序。Step 340: When the target table is a multi-level partition table, sort the data to be inserted in the group according to the order of the partitions of the target table.
其中,如果由分区分桶信息确定目标表是多级分区表,则按目标表的分区的顺序对组内的待插入数据依次排序。语义上等价于sort by分区列1,分区列2……。这样保证了同一个分区的数据连续排列。Wherein, if the target table is a multi-level partition table determined by the partition and bucket information, the data to be inserted in the group is sorted in order according to the partition order of the target table. Semantically equivalent to sort by partition column 1, partition column 2... This ensures the continuous arrangement of data in the same partition.
步骤350、当目标表为多级范围分区表,按照目标表分区字段映射到目标表的范围分区之后的顺序,对组内的待插入数据进行排序。Step 350: When the target table is a multi-level range partition table, sort the data to be inserted in the group according to the sequence after the target table partition field is mapped to the range partition of the target table.
其中,如果由分区分桶信息确定目标表是多级范围分区表,由于多级范围分区表不是一个分区对应分区键的一个值的单值分区,而是一个分区对应分区键值的一个范围,也就是一个分区键的值的区间,因此,多级范围分区排序要按照分区字段映射到分区之后对组内的待插入数据进行排序。示例性的,对于有2个字段的范围分区r1,r2。r1是日期(date)类型,r2是整数(int)类型,范围分区是先按r1的月份分区([1970-01-01,1970-01-31]一个分区,[1970-02-01,1970-02-28]一个分区),再按r2的百位进行分区(即[0,99]在一个分区,[100,199]在一个分区etc.)。那么,应该对to_month(r1),r2/10排序。等价语 法是distribute by to_month(r1),r2/10。如果直接按分区键的值对组内的待插入数据排序,则可能出现以下顺序:(1970-01-01,1),(1970-01-01,11),(1970-01-02,2),而它们分别归属分区1,分区2,分区1,相同分区的数据没有排到一起。如果是对于单个范围分区字段,则可以简化为直接按分区字段的值对组内的待插入数据排序。Among them, if the target table is determined by the partition and bucket information to be a multi-level range partition table, because the multi-level range partition table is not a single-value partition corresponding to a value of a partition key, but a range of a partition corresponding to a partition key value. That is, the range of the value of a partition key. Therefore, multi-level range partition sorting should sort the data to be inserted in the group after the partition field is mapped to the partition. Exemplarily, for the range partition r1, r2 with 2 fields. r1 is the date (date) type, r2 is the integer (int) type, and the range partition is first partitioned by the month of r1 ([1970-01-01,1970-01-31] a partition, [1970-02-01,1970 -02-28] a partition), and then partition according to the hundreds of r2 (ie [0,99] is in a partition, [100,199] is in a partition, etc.). Then, to_month(r1), r2/10 should be sorted. The equivalent syntax is distribute by to_month(r1), r2/10. If the data to be inserted in the group is sorted directly by the value of the partition key, the following order may appear: (1970-01-01,1),(1970-01-01,11),(1970-01-02,2 ), and they belong to partition 1, partition 2, and partition 1, respectively. The data in the same partition is not arranged together. If it is for a single range partition field, it can be simplified to directly sort the data to be inserted in the group by the value of the partition field.
步骤360、按照待插入数据在组内的顺序,将每组待插入数据依次动态插入到对应的目标表文件。Step 360: According to the sequence of the data to be inserted in the group, dynamically insert each group of data to be inserted into the corresponding target table file in turn.
在一实施例中,步骤340和步骤350根据目标表的分区类型择一执行,图3中所示不对其执行的顺序构成限定。In an embodiment, step 340 and step 350 are executed selectively according to the partition type of the target table, and the order of execution is not limited as shown in FIG. 3.
本实施例的技术方案,通过目标表为范围分区的情况,对组内的待插入数据进行排序,保证同一个分区的待插入数据为连续排列,使得对组内数据依次进行动态分区插入时,不必反复打开不同分区文件,提高动态分区插入的性能。In the technical solution of this embodiment, the data to be inserted in the group is sorted through the case that the target table is a range partition to ensure that the data to be inserted in the same partition is arranged continuously, so that when the data in the group is dynamically partitioned and inserted sequentially, There is no need to repeatedly open different partition files, which improves the performance of dynamic partition insertion.
实施例四Example four
图4为本申请实施例四提供的一种数据插入方法的流程图,本实施例的技术方案在上述技术方案的基础上进行说明,包括如下步骤:FIG. 4 is a flowchart of a data insertion method provided in Embodiment 4 of this application. The technical solution of this embodiment is described on the basis of the above technical solution, and includes the following steps:
步骤410、根据获取到的数据插入命令,确定数据插入命令的插入动作类型。Step 410: Determine the insertion action type of the data insertion command according to the acquired data insertion command.
步骤420、当插入动作类型为动态分区插入时,根据目标表的元信息,确定目标表的分区分桶信息。Step 420: When the insert action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table.
步骤430、根据分区分桶信息,将待插入数据分为至少一组。Step 430: According to the partition and bucket information, divide the data to be inserted into at least one group.
步骤440、根据分区分桶信息,对组内的待插入数据进行排序。Step 440: Sort the data to be inserted in the group according to the partition and bucket information.
步骤450、读取当前组内第一行,打开第一行对应的文件句柄作为当前文件句柄,将第一行的待插入数据写入当前文件句柄。Step 450: Read the first row in the current group, open the file handle corresponding to the first row as the current file handle, and write the data to be inserted in the first row into the current file handle.
步骤460、依次读取后一行作为当前行,如果当前行属于当前文件句柄对应的目标表文件,将当前行的待插入数据写入当前文件句柄。Step 460: Read the next row as the current row in sequence, and if the current row belongs to the target table file corresponding to the current file handle, write the data to be inserted in the current row into the current file handle.
步骤470、如果当前行不属于当前文件句柄对应的目标表文件,关闭当前文件句柄,打开当前行对应的文件句柄,将当前行的待插入数据写入当前行对应的文件句柄。Step 470: If the current row does not belong to the target table file corresponding to the current file handle, close the current file handle, open the file handle corresponding to the current row, and write the data to be inserted in the current row into the file handle corresponding to the current row.
步骤480、当当前组内读取不到下一行时,结束本组内的待插入数据的插入操作;按预设顺序完成所有组的插入操作。Step 480: When the next row cannot be read in the current group, the insertion operation of the data to be inserted in the group is ended; the insertion operation of all groups is completed in a preset order.
其中,在对待插入数据进行分组后,如果目标表是分区分桶表,保证了同一个桶的数据一定分到一组,并且对组内的待插入数据根据分区归属进行了排序,因此保证了插入过程中属于同一个分区的数据一定连续排列。所以在插入 文件的过程中,一次只需要同时打开一个文件连续插入。当发现有一行不属于当前打开的文件时,那么本组内属于当前打开的文件的待插入数据一定已经在前面全部插入完毕。所以,可以直接关闭当前打开的文件,而打开当前行所属的对应文件,即可继续插入。Among them, after the data to be inserted is grouped, if the target table is a partitioned and bucketed table, it is guaranteed that the data in the same bucket must be grouped into a group, and the data to be inserted in the group is sorted according to the partition attribution, thus ensuring The data belonging to the same partition must be arranged continuously during the insertion process. Therefore, in the process of inserting files, you only need to open one file at a time for continuous insertion. When a row is found that does not belong to the currently opened file, then all the data to be inserted belonging to the currently opened file in this group must have been inserted previously. Therefore, you can directly close the currently opened file, and open the corresponding file to which the current line belongs, and you can continue to insert.
本实施例的技术方案,在对待插入数据进行分组,并组内排序之后,一次只需要同时打开一个文件,进行待插入数据的连续插入,不必反复打开同一个文件,实现在保证系统稳定的情况下,提高动态分区插入的性能。In the technical solution of this embodiment, after the data to be inserted is grouped and sorted within the group, only one file needs to be opened at a time to perform continuous insertion of the data to be inserted, and the same file does not need to be opened repeatedly, thus ensuring the stability of the system Next, improve the performance of dynamic partition insertion.
实施例五Example five
本实施例为本申请的一个可选实施例,如图5所示,执行sql插入指令,查询目标表的元信息,得到目标表的分区分桶信息。判断是否为动态分区插入,如果是,根据目标表的元信息判断目标表是否分桶,如果目标表分桶,将源表数据根据分桶列的哈希值分组,如果目标表未分桶根据预先设定的规则将待插入数据分成至少一组。对每个分组均执行根据分区列字段组内排序。之后,将组内第一行作为当前行,打开当前行对应句柄,判断当前行是否属于当前句柄对应文件,如果当前行属于当前句柄对应文件就将当前行写入文件句柄,并读取下一行作为当前行;如果当前行不属于当前句柄对应文件,关闭当前句柄,打开当前行对应句柄,将当前行写入文件句柄,再读取下一行作为当前行。判断下一行是否存在,如果存在下一行返回判断当前行是否属于当前句柄对应文件,如果下一行不存在,本组插入结束。之后,照上述步骤进行其他组的数据插入。示例的,目标表为分区分桶表,如表1所示。This embodiment is an optional embodiment of the application. As shown in FIG. 5, the SQL insert instruction is executed to query the meta information of the target table to obtain the partition and bucket information of the target table. Determine whether it is a dynamic partition insertion. If it is, determine whether the target table is bucketed according to the meta-information of the target table. If the target table is bucketed, group the source table data according to the hash value of the bucket column. If the target table is not bucketed according to The preset rules divide the data to be inserted into at least one group. Perform sorting within the group according to the partition column field for each grouping. After that, take the first row in the group as the current row, open the handle corresponding to the current row, determine whether the current row belongs to the file corresponding to the current handle, if the current row belongs to the file corresponding to the current handle, write the current row to the file handle, and read the next row As the current line; if the current line does not belong to the file corresponding to the current handle, close the current handle, open the handle corresponding to the current line, write the current line to the file handle, and read the next line as the current line. Judge whether the next line exists, if there is the next line, return to determine whether the current line belongs to the file corresponding to the current handle, if the next line does not exist, this group of insertion ends. After that, follow the steps above to insert other groups of data. For example, the target table is a partitioned and bucketed table, as shown in Table 1.
表1Table 1
Figure PCTCN2020097563-appb-000001
Figure PCTCN2020097563-appb-000001
获取目标表的元信息,之后把源数据根据分桶列分成4组,如表2所示。Obtain the meta-information of the target table, and then divide the source data into 4 groups according to the bucket column, as shown in Table 2.
表2Table 2
Figure PCTCN2020097563-appb-000002
Figure PCTCN2020097563-appb-000002
再对每个组针对分区列进行组内排序,保证相同分区键的列一定相邻。根据上述分组和分区信息,可以同时只打开一个文件,把对应该分区分桶的数据插入进去再插入下一个文件,如表3所示。Then perform intra-group sorting on partition columns for each group to ensure that columns with the same partition key must be adjacent. According to the above grouping and partitioning information, you can open only one file at the same time, insert the data corresponding to the partition and bucket, and then insert the next file, as shown in Table 3.
表3table 3
Figure PCTCN2020097563-appb-000003
Figure PCTCN2020097563-appb-000003
实施例六Example Six
图6为本申请实施例六提供的一种数据插入装置的结构示意图,该装置可以配置于数据库中,该装置包括:插入动作类型确定模块610、分区分桶信息确定模块620、数据分组模块630、数据排序模块640以及数据插入模块650。6 is a schematic structural diagram of a data insertion device provided by Embodiment 6 of the application. The device can be configured in a database. The device includes: an insertion action type determination module 610, a partition and bucket information determination module 620, and a data grouping module 630 , The data sorting module 640 and the data inserting module 650.
插入动作类型确定模块610,设置为根据获取到的数据插入命令,确定数据插入命令的插入动作类型;The insertion action type determining module 610 is configured to determine the insertion action type of the data insertion command according to the acquired data insertion command;
分区分桶信息确定模块620,设置为当插入动作类型为动态插入时,根据目标表的元信息,确定目标表的分区分桶信息;The partition bucket information determining module 620 is configured to determine the partition bucket information of the target table according to the meta information of the target table when the insert action type is dynamic insert;
数据分组模块630,设置为根据分区分桶信息,将待插入数据分为至少一组;The data grouping module 630 is configured to divide the data to be inserted into at least one group according to the partition and bucket information;
数据排序模块640,设置为根据分区分桶信息,对每组组内的待插入数据进行排序;The data sorting module 640 is configured to sort the data to be inserted in each group according to the partition and bucket information;
数据插入模块650,设置为按照待插入数据在每组组内的顺序,将每组待插入数据依次动态插入对应的目标表文件。The data insertion module 650 is configured to dynamically insert each group of data to be inserted into the corresponding target table file in sequence according to the sequence of the data to be inserted in each group.
本实施例的技术方案,通过对待插入数据分组排序,逐个插入目标表,同 时只打开一个文件,解决长时间同时操作不同分区的大量文件,导致系统不稳定的问题,实现在保证系统稳定的情况下,提高动态分区插入的性能的效果。The technical solution of this embodiment solves the problem of system instability caused by operating a large number of files in different partitions at the same time for a long time by sorting the data to be inserted into the target table one by one, inserting into the target table one by one, and ensuring the stability of the system. Next, the effect of improving the performance of dynamic partition insertion.
可选的,数据分组模块630,包括:Optionally, the data grouping module 630 includes:
第一分组单元,设置为当目标表为分区非分桶表,按照第一分组规则将待插入数据分为至少一组。The first grouping unit is configured to divide the data to be inserted into at least one group according to the first grouping rule when the target table is a partitioned non-bucketed table.
可选的,第一分组单元是设置为:Optionally, the first grouping unit is set to:
根据待插入数据的数量和预设单组数据量,将待插入数据分为至少一组;其中,每个组中的待插入数据的数量小于或等于预设单组数据量。The data to be inserted is divided into at least one group according to the amount of data to be inserted and the amount of the preset single group of data; wherein the amount of data to be inserted in each group is less than or equal to the amount of preset single group of data.
可选的,第一分组单元是设置为:Optionally, the first grouping unit is set to:
获取预设分组数;Get the preset number of groups;
确定每一条待插入数据的分组标识数,其中,所述分组标识数包括所述待插入数据的预设字段的哈希hash值或所述待插入数据对应的随机数;Determining the group identification number of each piece of data to be inserted, where the group identification number includes a hash value of a preset field of the data to be inserted or a random number corresponding to the data to be inserted;
根据所述分组标识数对所述预设分组数取模的结果和分组号,将所述待插入数据分配到对应的数组。According to the result of modulo the preset group number and the group number according to the group identification number, the data to be inserted is allocated to the corresponding array.
可选的,数据分组模块630,包括:Optionally, the data grouping module 630 includes:
第二分组单元,设置为当目标表为分区分桶表,将待插入数据根据分桶列的hash值分组。The second grouping unit is set to group the data to be inserted according to the hash value of the bucketing column when the target table is a partitioned bucketing table.
可选的,数据排序模块640,包括:Optionally, the data sorting module 640 includes:
第一排序单元,设置为当目标表为多级分区表,按照目标表的分区的顺序,对每组组内的待插入数据进行排序。The first sorting unit is set to sort the data to be inserted in each group according to the order of the partitions of the target table when the target table is a multi-level partition table.
可选的,数据排序模块640,包括:Optionally, the data sorting module 640 includes:
第二排序单元,设置为当目标表为多级范围分区表,按照目标表分区字段映射到目标表的范围分区之后的顺序,对每组组内的待插入数据进行排序。The second sorting unit is set to sort the data to be inserted in each group according to the sequence after the target table partition field is mapped to the range partition of the target table when the target table is a multi-level range partition table.
可选的,数据插入模块650,包括:Optionally, the data insertion module 650 includes:
第一写入单元,设置为读取当前组内第一行,打开第一行对应的文件句柄作为当前文件句柄,将第一行的待插入数据写入当前文件句柄;The first writing unit is set to read the first row in the current group, open the file handle corresponding to the first row as the current file handle, and write the data to be inserted in the first row into the current file handle;
第二写入单元,设置为依次读取后一行作为当前行,如果当前行属于当前文件句柄对应的目标表文件,将当前行的待插入数据写入当前文件句柄;The second writing unit is set to read the next row as the current row in sequence. If the current row belongs to the target table file corresponding to the current file handle, write the data to be inserted in the current row into the current file handle;
第三写入单元,设置为如果当前行不属于当前文件句柄对应的目标表文件,关闭当前文件句柄,打开当前行对应的文件句柄,将当前行的待插入数据写入 当前行对应的文件句柄;The third writing unit is set to close the current file handle if the current row does not belong to the target table file corresponding to the current file handle, open the file handle corresponding to the current row, and write the data to be inserted in the current row into the file handle corresponding to the current row ;
本组插入结束单元,设置为在所述当前组内当读取不到下一行时,结束本组内的待插入数据的插入操作;The insertion end unit of the group is set to end the insertion operation of the data to be inserted in the group when the next line cannot be read in the current group;
全组插入单元,设置为按预设顺序完成所有组的插入操作。Full group insertion unit, set to complete the insertion operation of all groups in a preset order.
本申请实施例所提供的数据插入装置可执行本申请任意实施例所提供的数据插入方法,具备执行方法相应的功能模块和有益效果。The data insertion device provided in the embodiment of the present application can execute the data insertion method provided in any embodiment of the present application, and has functional modules and beneficial effects corresponding to the execution method.
实施例七Example Seven
图7为本申请实施例七提供的一种设备的结构示意图,如图7所示,该设备包括处理器710、存储器720、输入装置730和输出装置740;设备中处理器710的数量可以是一个或多个,图7中以一个处理器710为例;设备中的处理器710、存储器720、输入装置730和输出装置740可以通过总线或其他方式连接,图7中以通过总线连接为例。FIG. 7 is a schematic structural diagram of a device provided in Embodiment 7 of the application. As shown in FIG. 7, the device includes a processor 710, a memory 720, an input device 730, and an output device 740; the number of processors 710 in the device may be One or more, one processor 710 is taken as an example in FIG. 7; the processor 710, the memory 720, the input device 730, and the output device 740 in the device can be connected by a bus or other means. In FIG. 7, the connection by a bus is taken as an example .
存储器720作为一种计算机可读存储介质,可用于存储软件程序、计算机可执行程序以及模块,如本申请实施例中的数据插入方法对应的程序指令/模块(例如,数据插入装置中的插入动作类型确定模块610、分区分桶信息确定模块620、数据分组模块630、数据排序模块640和数据插入模块650)。处理器710通过运行存储在存储器720中的软件程序、指令以及模块,从而执行设备的多种功能应用以及数据处理,即实现上述的数据插入方法。As a computer-readable storage medium, the memory 720 can be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the data insertion method in the embodiment of the present application (for example, the insertion action in the data insertion device). Type determination module 610, partition and bucket information determination module 620, data grouping module 630, data sorting module 640, and data insertion module 650). The processor 710 executes various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 720, that is, realizes the aforementioned data insertion method.
存储器720可包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序;存储数据区可存储根据终端的使用所创建的数据等。此外,存储器720可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实例中,存储器720可包括相对于处理器710远程设置的存储器,这些远程存储器可以通过网络连接至设备。上述网络的实例包括互联网、企业内部网、局域网、移动通信网及其组合。The memory 720 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory 720 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices. In some examples, the memory 720 may include a memory remotely provided with respect to the processor 710, and these remote memories may be connected to the device through a network. Examples of the aforementioned networks include the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
输入装置730可用于接收输入的数字或字符信息,以及产生与设备的用户设置以及功能控制有关的键信号输入。输出装置740可包括显示屏等显示设备。The input device 730 may be used to receive inputted numeric or character information, and generate key signal input related to user settings and function control of the device. The output device 740 may include a display device such as a display screen.
实施例八Example eight
本申请实施例八还提供一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行一种数据插入方法,该方法包括:The eighth embodiment of the present application also provides a storage medium containing computer-executable instructions, which are used to execute a data insertion method when the computer-executable instructions are executed by a computer processor, and the method includes:
根据获取到的数据插入命令,确定所述数据插入命令的插入动作类型;Determine the type of insertion action of the data insertion command according to the acquired data insertion command;
当所述插入动作类型为动态分区插入时,根据目标表的元信息,确定所述目标表的分区分桶信息;When the insertion action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table;
根据所述分区分桶信息,将待插入数据分为至少一组;Divide the data to be inserted into at least one group according to the partition and bucket information;
根据所述分区分桶信息,对每组组内的待插入数据进行排序;Sort the data to be inserted in each group according to the partition and bucket information;
按照所述待插入数据在每组组内的顺序,将每组待插入数据依次动态插入对应的目标表文件。According to the sequence of the data to be inserted in each group, each group of data to be inserted is dynamically inserted into the corresponding target table file in turn.
本申请实施例所提供的一种包含计算机可执行指令的存储介质,其计算机可执行指令包括如上所述的方法操作,还可以执行本申请任意实施例所提供的数据插入方法中的相关操作。An embodiment of the present application provides a storage medium containing computer-executable instructions. The computer-executable instructions include the method operations described above, and can also perform related operations in the data insertion method provided in any embodiment of the present application.
本申请可借助软件及必需的通用硬件来实现,也可以通过硬件实现。本申请的技术方案可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如计算机的软盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、闪存(FLASH)、硬盘或光盘等,包括至少一个指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请多个实施例所述的方法。This application can be implemented with the help of software and necessary general-purpose hardware, or can be implemented with hardware. The technical solution of this application can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), and random access memory ( Random Access Memory, flash memory (FLASH), hard disk or optical disk, etc., including at least one instruction to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute the multiple embodiments of this application Methods.
上述数据插入装置的实施例中,所包括的多个单元和模块只是按照功能逻辑进行划分的,但并不局限于上述的划分,只要能够实现相应的功能即可;另外,多个功能单元的名称也只是为了便于相互区分,并不用于限制本申请的保护范围。In the above embodiment of the data insertion device, the multiple units and modules included are only divided according to the functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, the multiple functional units The names are only for the convenience of distinguishing each other, and are not used to limit the scope of protection of this application.

Claims (10)

  1. 一种数据插入方法,包括:A data insertion method, including:
    根据获取到的数据插入命令,确定所述数据插入命令的插入动作类型;Determine the type of insertion action of the data insertion command according to the acquired data insertion command;
    在所述插入动作类型为动态分区插入的情况下,根据目标表的元信息,确定所述目标表的分区分桶信息;In a case where the insertion action type is dynamic partition insertion, determine the partition and bucket information of the target table according to the meta information of the target table;
    根据所述分区分桶信息,将待插入数据分为至少一组;Divide the data to be inserted into at least one group according to the partition and bucket information;
    根据所述分区分桶信息,对每组组内的待插入数据进行排序;Sort the data to be inserted in each group according to the partition and bucket information;
    按照所述待插入数据在每组组内的顺序,将每组待插入数据依次动态插入对应的目标表文件。According to the sequence of the data to be inserted in each group, each group of data to be inserted is dynamically inserted into the corresponding target table file in turn.
  2. 根据权利要求1所述的方法,其中,所述根据所述分区分桶信息,将待插入数据分为至少一组,包括:The method according to claim 1, wherein the dividing the data to be inserted into at least one group according to the partition and bucketing information comprises:
    在所述目标表为分区非分桶表的情况下,按照第一分组规则将所述待插入数据分为至少一组。In a case where the target table is a partitioned non-bucketed table, the data to be inserted is divided into at least one group according to the first grouping rule.
  3. 根据权利要求2所述的方法,其中,所述按照第一分组规则将所述待插入数据分为至少一组,包括:The method according to claim 2, wherein the dividing the data to be inserted into at least one group according to the first grouping rule comprises:
    获取预设分组数;Get the preset number of groups;
    确定每一条待插入数据的分组标识数,其中,所述分组标识数包括所述待插入数据的预设字段的哈希hash值或所述待插入数据对应的随机数;Determining the group identification number of each piece of data to be inserted, where the group identification number includes a hash value of a preset field of the data to be inserted or a random number corresponding to the data to be inserted;
    根据所述分组标识数对所述预设分组数取模的结果和分组号将所述待插入数据分配到与所述取模的结果和分组号对应的数组。The data to be inserted is allocated to an array corresponding to the result of the modulus and the group number according to the result of modulo the preset group number and the group number.
  4. 根据权利要求1所述的方法,其中,所述根据所述分区分桶信息,将待插入数据分为至少一组,包括:The method according to claim 1, wherein the dividing the data to be inserted into at least one group according to the partition and bucketing information comprises:
    在所述目标表为分区分桶表的情况下,将所述待插入数据根据分桶列的hash值分组。When the target table is a partitioned and bucketed table, the data to be inserted is grouped according to the hash value of the bucketed column.
  5. 根据权利要求1所述的方法,其中,所述根据所述分区分桶信息,对每组组内的待插入数据进行排序,包括:The method according to claim 1, wherein the sorting the data to be inserted in each group according to the partition and bucket information comprises:
    在所述目标表为多级分区表的情况下,按照所述目标表的分区的顺序,对每组组内的待插入数据进行排序。In the case that the target table is a multi-level partition table, the data to be inserted in each group is sorted according to the order of the partition of the target table.
  6. 根据权利要求1所述的方法,其中,所述根据所述分区分桶信息,对每组组内的待插入数据进行排序,包括:The method according to claim 1, wherein the sorting the data to be inserted in each group according to the partition and bucket information comprises:
    在所述目标表为多级范围分区表的情况下,按照所述目标表分区字段映射 到所述目标表的范围分区之后的顺序,对每组组内的待插入数据进行排序。In the case that the target table is a multi-level range partition table, the data to be inserted in each group is sorted according to the sequence after the target table partition field is mapped to the range partition of the target table.
  7. 根据权利要求1所述的方法,其中,所述按照所述待插入数据在每组组内的顺序,将每组待插入数据依次动态插入对应的目标表文件,包括:The method according to claim 1, wherein the step of dynamically inserting each group of data to be inserted into the corresponding target table file in turn according to the sequence of the data to be inserted in each group includes:
    读取当前组内第一行,打开所述第一行对应的文件句柄作为当前文件句柄,将所述第一行的待插入数据写入所述当前文件句柄;Read the first row in the current group, open the file handle corresponding to the first row as the current file handle, and write the data to be inserted in the first row into the current file handle;
    依次读取后一行作为当前行,在所述当前行属于所述当前文件句柄对应的目标表文件的情况下,将所述当前行的待插入数据写入所述当前文件句柄;Sequentially read the next row as the current row, and in the case that the current row belongs to the target table file corresponding to the current file handle, write the data to be inserted in the current row into the current file handle;
    在所述当前行不属于所述当前文件句柄对应的目标表文件的情况下,关闭所述当前文件句柄,打开所述当前行对应的文件句柄,将所述当前行的待插入数据写入所述当前行对应的文件句柄;In the case that the current row does not belong to the target table file corresponding to the current file handle, the current file handle is closed, the file handle corresponding to the current row is opened, and the data to be inserted in the current row is written to the State the file handle corresponding to the current line;
    在所述当前组内读取不到下一行的情况下,结束本组内的待插入数据的插入操作;In the case that the next row cannot be read in the current group, end the insertion operation of the data to be inserted in the group;
    按预设顺序完成所有组的插入操作。Complete the insert operation of all groups in the preset order.
  8. 一种数据插入装置,包括:A data insertion device includes:
    插入动作类型确定模块,设置为根据获取到的数据插入命令,确定所述数据插入命令的插入动作类型;An insert action type determining module, configured to determine the insert action type of the data insert command according to the acquired data insert command;
    分区分桶信息确定模块,设置为在所述插入动作类型为动态分区插入的情况下,根据目标表的元信息,确定所述目标表的分区分桶信息;The partitioning bucket information determining module is configured to determine the partitioning and bucketing information of the target table according to the meta-information of the target table when the insertion action type is dynamic partition insertion;
    数据分组模块,设置为根据所述分区分桶信息,将待插入数据分为至少一组;A data grouping module, configured to divide the data to be inserted into at least one group according to the partition and bucket information;
    数据排序模块,设置为根据所述分区分桶信息,对每组组内的待插入数据进行排序;A data sorting module, configured to sort the data to be inserted in each group according to the partition and bucket information;
    数据插入模块,设置为按照所述待插入数据在每组组内的顺序,将每组待插入数据依次动态插入对应的目标表文件。The data insertion module is configured to dynamically insert each group of data to be inserted into the corresponding target table file in sequence according to the sequence of the data to be inserted in each group.
  9. 一种设备,包括:A device that includes:
    一个或多个处理器;One or more processors;
    存储器,设置为存储一个或多个程序,Memory, set to store one or more programs,
    所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如权利要求1-7中任一项所述的数据插入方法。The one or more programs are executed by the one or more processors, so that the one or more processors implement the data insertion method according to any one of claims 1-7.
  10. 一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行如权利要求1-7中任一项所述的数据插入方法。A storage medium containing computer-executable instructions, when the computer-executable instructions are executed by a computer processor, are used to execute the data insertion method according to any one of claims 1-7.
PCT/CN2020/097563 2019-07-10 2020-06-23 Data insertion method and apparatus, device and storage medium WO2021004266A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910619873.3 2019-07-10
CN201910619873.3A CN110362577B (en) 2019-07-10 2019-07-10 Data insertion method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021004266A1 true WO2021004266A1 (en) 2021-01-14

Family

ID=68218660

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097563 WO2021004266A1 (en) 2019-07-10 2020-06-23 Data insertion method and apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN110362577B (en)
WO (1) WO2021004266A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362577B (en) * 2019-07-10 2020-06-09 星环信息科技(上海)有限公司 Data insertion method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022763A (en) * 2014-04-30 2015-11-04 博雅网络游戏开发(深圳)有限公司 Method and system for implementing data query
EP3098730A1 (en) * 2015-05-29 2016-11-30 Sap Se Aggregating database entries by hashing
CN110362577A (en) * 2019-07-10 2019-10-22 星环信息科技(上海)有限公司 A kind of data insertion method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241121A (en) * 2017-06-29 2019-01-18 阿里巴巴集团控股有限公司 The storage of time series data and querying method, device, system and electronic equipment
CN108446399B (en) * 2018-03-29 2021-07-30 重庆大学 Dynamic storage optimization method for structured massive real-time data
CN109299190B (en) * 2018-09-10 2020-11-17 华为技术有限公司 Method and device for processing metadata of object in distributed storage system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022763A (en) * 2014-04-30 2015-11-04 博雅网络游戏开发(深圳)有限公司 Method and system for implementing data query
EP3098730A1 (en) * 2015-05-29 2016-11-30 Sap Se Aggregating database entries by hashing
CN110362577A (en) * 2019-07-10 2019-10-22 星环信息科技(上海)有限公司 A kind of data insertion method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, YUCONG ET AL.: "NTCI-Flow:A Scalable High-speed Network Traffic Processing Framework", ADVANCED ENGINEERING SCIENCES, no. A1, 30 April 2017 (2017-04-30), ISSN: 2096-3246, DOI: 20200921232540 *

Also Published As

Publication number Publication date
CN110362577B (en) 2020-06-09
CN110362577A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
US20140351239A1 (en) Hardware acceleration for query operators
CN106897322B (en) A kind of access method and device of database and file system
Parker et al. Comparing nosql mongodb to an sql db
US10769147B2 (en) Batch data query method and apparatus
CN104135437B (en) The apparatus and method searched for for the table that centralised storage device pond is utilized in the network switch
AU2014201593C1 (en) Shared cache used to provide zero copy memory mapped database
US9195701B2 (en) System and method for flexible distributed massively parallel processing (MPP) database
WO2015184762A1 (en) Database query method and device
US8380737B2 (en) Computing intersection of sets of numbers
Cossu et al. Prost: Distributed execution of sparql queries using mixed partitioning strategies
US20160103858A1 (en) Data management system comprising a trie data structure, integrated circuits and methods therefor
US10645090B2 (en) Access control for objects having attributes defined against hierarchically organized domains containing fixed number of values
CN112015741A (en) Method and device for storing massive data in different databases and tables
CN105706092A (en) Methods and systems of four-valued simulation
CN106970929A (en) Data lead-in method and device
TW201926081A (en) Data allocating system
CN112579595A (en) Data processing method and device, electronic equipment and readable storage medium
CN105677687A (en) Data processing method and device
US10678789B2 (en) Batch data query method and apparatus
WO2021004266A1 (en) Data insertion method and apparatus, device and storage medium
CN111475511A (en) Data storage method, data access method, data storage device, data access device and data access equipment based on tree structure
CN113297266B (en) Data processing method, device, equipment and computer storage medium
WO2016175880A1 (en) Merging incoming data in a database
CN108710640B (en) Method for improving search efficiency of Spark SQL
CN106991116A (en) The optimization method and device of database executive plan

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20836655

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20836655

Country of ref document: EP

Kind code of ref document: A1