CN117216059A - Data table merging method, device, equipment and medium

Info

Publication number: CN117216059A
Application number: CN202311147041.9A
Authority: CN
Inventors: 胡荣辉
Original assignee: Beijing Oceanbase Technology Co Ltd
Current assignee: Beijing Oceanbase Technology Co Ltd
Priority date: 2023-09-06
Filing date: 2023-09-06
Publication date: 2023-12-12

Abstract

One or more embodiments of the present disclosure provide a data table merging method, apparatus, device, and medium. A plurality of first merging tasks are generated based on a plurality of column groups to be merged, and the column groups are merged based on these first merging tasks to obtain a merged first data table. Because the merging of the column groups is split into a plurality of first merging tasks, a merge failure caused by a single merging task occupying too much memory is avoided, and the merging success rate of the data table is improved.

Description

Data table merging method, device, equipment and medium

Technical Field

One or more embodiments of the present disclosure relate to the field of database technologies, and in particular, to a method, an apparatus, a device, and a medium for merging data tables.

Background

Column storage has become a key technology in scenarios such as big data analysis, data warehousing, and real-time data analysis. Column storage offers low Input/Output (IO) cost, a high compression ratio, efficient query support, and memory savings, and can provide high-performance data processing and query capability to meet growing data volumes and query requirements.

Column storage data is typically static and difficult to update in place. The sorted string tables (Sorted String Table, SSTable) in a log-structured merge tree (Log Structured Merge Tree, LSM-Tree) are also static, which makes SSTables naturally suitable for implementing column storage.

When SSTables are used to implement column storage, Column Groups may be set as required. Each Column Group may include a plurality of columns and corresponds to one SSTable. In this case, how to merge column-storage SSTables becomes a problem to be solved.

Disclosure of Invention

In view of this, one or more embodiments of the present disclosure provide a data table merging method, apparatus, device, and medium.

In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:

according to a first aspect of one or more embodiments of the present disclosure, a data table merging method is provided, including:

generating a plurality of first merging tasks based on a plurality of column groups to be merged, wherein for any first merging task, the first merging task is used for merging part of column groups in the plurality of column groups, and column groups corresponding to different first merging tasks are different;

and combining the plurality of column groups based on the plurality of first combining tasks to obtain a combined first data table.

According to a second aspect of one or more embodiments of the present specification, there is provided a data table merging apparatus, comprising:

the generation module is used for generating a plurality of first merging tasks based on a plurality of column groups to be merged, and for any one first merging task, the first merging task is used for merging part of column groups in the plurality of column groups, and column groups corresponding to different first merging tasks are different;

and the merging module is used for merging the plurality of column groups based on the plurality of first merging tasks so as to obtain a merged first data table.

According to a third aspect of one or more embodiments of the present specification, there is provided an electronic device comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor implements the method of the first aspect by executing the executable instructions.

According to a fourth aspect of one or more embodiments of the present description, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a method as described in the first aspect.

According to the scheme provided in this specification, a plurality of first merging tasks are generated based on a plurality of column groups to be merged, and the column groups are merged based on these first merging tasks to obtain a merged first data table. Because the merging of the column groups is split into a plurality of first merging tasks, a merge failure caused by a single merging task occupying too much memory is avoided, and the merging success rate of the data table is improved.

Drawings

FIG. 1 is a flow chart of a data table merge method provided in an exemplary embodiment.

Fig. 2 is a schematic diagram of a table structure provided in an exemplary embodiment.

Fig. 3 is a schematic diagram of a data table according to an exemplary embodiment.

FIG. 4 is a schematic block diagram of a computing device provided by an exemplary embodiment.

Fig. 5 is a block diagram of a data table merging apparatus according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.

It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.

User information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in this specification are both information and data authorized by the user or sufficiently authorized by the parties, and the collection, use and processing of relevant data requires compliance with relevant laws and regulations and standards of the relevant country and region, and is provided with corresponding operation portals for the user to choose authorization or denial.

For ease of understanding, technical terms referred to in the present specification will be described first.

Memory table (Memtable): when data is written to the database, it is first written to memory, and the corresponding in-memory data structure is called a Memtable.

Sorted string table (Sorted String Table, SSTable): a data structure that is persisted on disk and provides read-only access. SSTables may be divided into two layers: minor SSTables (Minor SSTable) and major SSTables (Major SSTable). A Minor SSTable holds the data of a frozen Memtable. A Major SSTable is the baseline data generated by merging Memtables and Minor SSTables, and contains the complete information of a row.

Merging and dumping: in the LSM-Tree architecture, data is divided into two parts, the Memtable and the SSTable. When the amount of data in the Memtable exceeds a certain threshold, the data in the Memtable needs to be transferred to an SSTable to release memory; this process is called dumping. Each dump generates a new SSTable. When the number of dumps exceeds a certain threshold, or during the daily off-peak period of the business, the baseline SSTable and the incremental SSTables dumped afterwards are merged into one SSTable; this process is called merging.

Column Group (Column Group): the Column Group consists of several columns, representing a collection of certain columns. If a Column Group contains all columns of a table, it is a row store; if each Column of a table is a separate Column Group, then it is the "purest" Column store.

After related technical terms of the present specification are introduced, the following detailed description is made of aspects of the present specification.

In the related art, each Column Group exists as a separate SSTable in the storage layer. That is, an independent SSTable is constructed for each Column Group, and the SSTable corresponding to a Column Group contains only the data of that Column Group.

When column storage data is merged, the data of multiple SSTables is not simply spliced together physically. Instead, the SSTable data is sorted by the key field (Rowkey), and the other non-primary-key Column Groups are then written together into a new SSTable according to the sorting result. This realizes the merging of the Column Groups and ensures that the SSTable obtained by merging satisfies the Rowkey ordering requirement.

However, when the number of Column Groups is large, the number of SSTables that must be written at the same time is also large. This leads to excessive memory usage, so the SSTable merge may fail due to insufficient memory.

Accordingly, it is desirable to provide a data table merging method for merging the SSTables corresponding to a plurality of Column Groups into one SSTable. With the scheme provided in this specification, the merging task of the first data table can be split into a plurality of first merging tasks, which avoids a merge failure caused by a single merging task occupying too much memory and improves the merging success rate of the data table.
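The Rowkey-ordered merge described above can be illustrated with a minimal sketch. It assumes rows are plain dictionaries, that both inputs are already sorted by Rowkey, and that an incremental row fully replaces a baseline row with the same Rowkey (deletes and partial updates are not modeled). The names `ROWKEY_COLUMNS`, `merge_by_rowkey`, and `write_column_groups` are hypothetical and do not come from the specification.

```python
import heapq

# Hypothetical primary-key columns, matching the five-column example table.
ROWKEY_COLUMNS = ("Rowkey1", "Rowkey2")


def rowkey(row):
    """Extract the composite Rowkey of a row as a tuple."""
    return tuple(row[c] for c in ROWKEY_COLUMNS)


def merge_by_rowkey(baseline, increment):
    """Merge two Rowkey-sorted row lists into one Rowkey-sorted list.

    Assumption: on equal Rowkeys the incremental row replaces the baseline row.
    """
    tagged = heapq.merge(
        ((rowkey(r), 0, r) for r in baseline),   # 0 = older (baseline) data
        ((rowkey(r), 1, r) for r in increment),  # 1 = newer (incremental) data
        key=lambda item: (item[0], item[1]),
    )
    merged = []
    for key, _, row in tagged:
        if merged and rowkey(merged[-1]) == key:
            merged[-1] = row  # newer version wins
        else:
            merged.append(row)
    return merged


def write_column_groups(rows, column_groups):
    """Project the merged rows onto each column group (one output per group)."""
    return {
        name: [tuple(row[c] for c in cols) for row in rows]
        for name, cols in column_groups.items()
    }
```

In this sketch every column group output is produced in the same pass, which is exactly the situation in which the number of simultaneously written outputs grows with the number of column groups.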

The data table merging method may be executed by a computing device. The computing device may be a server, such as a single server, a plurality of servers, a server cluster, or a cloud computing platform, which is not limited in the present specification.

Alternatively, the data table merging method provided in the present specification may be used to implement merging of data tables in multiple types of databases. For example, the data table merging method provided in the present specification can be applied to a distributed database (such as a distributed relational database, a columnar storage database, etc.), a Key-Value (Key-Value) database, a cloud data warehouse, etc., and the specific type of the database to be applied is not limited in the present specification.

The foregoing is merely an exemplary description of application scenarios of the present specification and does not limit them. In more possible implementations, the solution provided in the present specification may be applied to any database that uses column storage as its storage engine, and the present specification does not limit the specific application scenario.

After the application scenario of the present specification is described, a description is next given of a specific implementation procedure of the present specification.

Referring to fig. 1, fig. 1 is a flowchart of a data table merging method according to an exemplary embodiment, and as shown in fig. 1, the method includes:

step 101, generating a plurality of first merging tasks based on a plurality of column groups to be merged, wherein for any one first merging task, the first merging task is used for merging part of column groups in the plurality of column groups, and column groups corresponding to different first merging tasks are different.

Optionally, the plurality of column groups to be merged may be split into a plurality of batches (Batch), and one first merging task may be generated from the column groups of each batch, so that a plurality of first merging tasks are obtained.

Step 102, merging the plurality of column groups based on the plurality of first merging tasks to obtain a merged first data table.

According to the scheme provided by the specification, the merging tasks of the plurality of column groups can be split into the plurality of first merging tasks, so that the merging failure of the data table caused by overlarge occupied memory of the single merging task is avoided, and the merging success rate of the data table is improved.

Having described the basic implementation of the present description, alternative implementations of the present description are described below.

In some embodiments, for step 101, when generating a plurality of first merging tasks based on a plurality of column groups to be merged, this may be achieved by:

and generating a first merging task based on a first set number of column groups in the plurality of column groups to obtain a plurality of first merging tasks.

Optionally, every first set number of column groups may be placed in one batch, so that the plurality of column groups are divided into a plurality of batches. One first merging task is then generated from the column groups of each batch, which yields the plurality of first merging tasks.

The first set number may be any value, and the specific value of the first set number is not limited in this specification.
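As an illustration of the batching described above, the following minimal sketch splits a list of column groups into batches of at most a first set number of groups. The function name `split_into_batches` and the representation of column groups as strings are assumptions made for this example only.

```python
def split_into_batches(column_groups, first_set_number):
    """Split column groups into consecutive batches.

    Each batch holds at most `first_set_number` column groups and would back
    one first merging task. Different batches never share a column group.
    """
    if first_set_number < 1:
        raise ValueError("first_set_number must be at least 1")
    return [
        column_groups[i:i + first_set_number]
        for i in range(0, len(column_groups), first_set_number)
    ]


# Hypothetical column group names; five groups with a first set number of 2.
batches = split_into_batches(["cg_rowkey", "cg1", "cg2", "cg3", "cg4"], 2)
```

With five column groups and a first set number of 2, this produces three batches, the last of which contains a single column group.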

In some embodiments, for step 102, when merging a plurality of column groups based on a plurality of first merging tasks to obtain a merged first data table, the following steps may be implemented:

step 1021, merging the column groups corresponding to each first merging task respectively, to obtain a plurality of second data tables.

In one possible implementation, a plurality of first merging tasks may be performed in series, so as to merge the column groups corresponding to each first merging task respectively.

In another possible implementation manner, part of the first merging tasks in the plurality of first merging tasks may be executed in parallel, so as to merge column groups corresponding to each first merging task respectively.
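The two execution modes can be sketched as follows. This is only an illustration: each first merging task is modeled as a zero-argument callable, and the thread pool from the Python standard library stands in for whatever scheduler a real storage engine would use. The names `run_serial` and `run_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_serial(tasks):
    """Execute the merging tasks one after another, in order."""
    return [task() for task in tasks]


def run_parallel(tasks, max_workers=2):
    """Execute the merging tasks concurrently, at most `max_workers` at a time.

    Results are returned in task order, regardless of completion order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]
```

Limiting `max_workers` keeps the number of concurrently running tasks, and hence the memory they hold at once, bounded.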

It should be noted that the column groups to be combined may include a first column group and a second column group, where the first column group is a column group including a key field (Rowkey), and the second column group is a column group not including a key field.

For example, consider a table with five columns (Rowkey1, Rowkey2, Col1, Col2, Col3), where Rowkey1 and Rowkey2 together form the primary key of the table. (Rowkey1, Rowkey2), (Col1, Col2), and (Col2, Col3) may each be set as a Column Group. The table structure is shown in fig. 2, which is a schematic diagram of a table structure provided by an exemplary embodiment: Rowkey1 and Rowkey2 form one Column Group (denoted Column Group Rowkey), Col1 and Col2 form one Column Group (denoted Column Group 1), and Col2 and Col3 form one Column Group (denoted Column Group 2).

The SSTables corresponding to the Column Groups of fig. 2 may be as shown in fig. 3, which is a schematic diagram of a data table according to an exemplary embodiment. Major SSTable Column Group Rowkey is the SSTable corresponding to Column Group Rowkey, Major SSTable Column Group 1 is the SSTable corresponding to Column Group 1, and Major SSTable Column Group 2 is the SSTable corresponding to Column Group 2.

Optionally, whether the plurality of first merging tasks are executed in series or in parallel, when each column group corresponding to each first merging task is merged, merging processing may be performed based on the first merging task corresponding to the first column group, and log data may be generated; and then, according to the log data, carrying out merging processing based on the first merging task corresponding to the second column group.

By generating log data based on the merging process of the key field column groups, it is possible to record where the data in the merging result comes from, so that the merging of the non-key field column groups can be realized based on the log data later.

The log data may be used to record the operation type, operation range, and operation content of the merging process, among other things. For example, the format of the log data may be as shown in table 1 below:

TABLE 1

Operation type (Log Type)	Operation range (Parameters)	Operation content (MergeLog)
INSERT_ROW	row_index	Insert a row at the specified row offset
UPDATE_ROW	row_index	Update the row at the specified row offset
DELETE_ROW	row_index	Delete the row at the specified row offset
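To make the role of the log data more concrete, the following sketch replays a list of log entries against the rows of a non-key column group. The specification gives only the three operation types and the `row_index` parameter; everything else here is an assumption for illustration, in particular that entries are applied in order, that each `row_index` refers to the row list as it stands when the entry is applied, and that new values are supplied by a separate iterable. The names `LogType`, `MergeLog`, and `replay` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class LogType(Enum):
    INSERT_ROW = "INSERT_ROW"
    UPDATE_ROW = "UPDATE_ROW"
    DELETE_ROW = "DELETE_ROW"


@dataclass(frozen=True)
class MergeLog:
    log_type: LogType
    row_index: int  # row offset the operation applies to


def replay(baseline_rows, logs, new_values):
    """Apply log entries, in order, to the rows of a non-key column group.

    Assumption: INSERT_ROW and UPDATE_ROW each consume the next item of
    `new_values`; DELETE_ROW consumes nothing.
    """
    rows = list(baseline_rows)
    values = iter(new_values)
    for log in logs:
        if log.log_type is LogType.INSERT_ROW:
            rows.insert(log.row_index, next(values))
        elif log.log_type is LogType.UPDATE_ROW:
            rows[log.row_index] = next(values)
        else:  # LogType.DELETE_ROW
            del rows[log.row_index]
    return rows
```

Because the log already records where each output row comes from, a replay of this kind does not need to compare Rowkeys again.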

Optionally, the first column group may include non-key fields in addition to the key fields. When the first column group is merged, besides merging the key fields to generate the SSTable, the other non-key fields of the group are also written into the new SSTable along with them, so that the merging of the first column group is completed directly.

Step 1022, merging the plurality of second data tables to obtain the first data table.

In the above embodiments, the merging task of the plurality of column groups is split into the plurality of first merging tasks, and the plurality of first merging tasks are processed respectively to implement data table merging, and in more possible implementations, if the number of the plurality of column groups is less than or equal to the first set number, the plurality of column groups may be directly merged to obtain the first data table.

That is, for a plurality of column groups to be merged, the number of column groups may be determined first. If the number of column groups is less than or equal to the first set number, the column groups may be written simultaneously in a single merging process. Because the number of column groups is small, this avoids the situation in which many SSTables are written at the same time, too much memory is occupied, and the data table merge fails. In this case, since only one merging process is involved, no log data needs to be generated. If the number of column groups is greater than the first set number, the merging task of the plurality of column groups may be split into a plurality of first merging tasks through the scheme provided in the above embodiments, and the first merging tasks are processed respectively to implement data table merging. That is, the merging task is split into a plurality of batches, each of which is responsible for merging several column groups: the batch containing the Rowkey is merged first and generates log data, and the remaining batches process the non-primary-key data by reading the log data. Compared with performing the merge separately for each column group, performing the merge in batches may reduce the read overhead of the log data.
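The decision described above can be sketched as a small planning function. It is a simplified illustration under stated assumptions: column groups are identified by strings, the Rowkey column group is passed in explicitly, and the plan is returned as a plain dictionary. The name `plan_merge` and the dictionary keys are hypothetical.

```python
def plan_merge(column_groups, rowkey_group, first_set_number):
    """Decide between a direct merge and a batched merge.

    Returns a plan whose first batch always contains the Rowkey column group,
    so that in the batched case it can be merged first and produce log data.
    """
    if first_set_number < 1:
        raise ValueError("first_set_number must be at least 1")
    if rowkey_group not in column_groups:
        raise ValueError("rowkey_group must be one of the column groups")

    # Put the Rowkey column group first; keep the others in their given order.
    ordered = [rowkey_group] + [g for g in column_groups if g != rowkey_group]

    if len(ordered) <= first_set_number:
        # Few column groups: one merging process, no log data required.
        return {"mode": "direct", "batches": [ordered], "needs_log": False}

    batches = [
        ordered[i:i + first_set_number]
        for i in range(0, len(ordered), first_set_number)
    ]
    # Many column groups: batch 0 generates the log, later batches read it.
    return {"mode": "batched", "batches": batches, "needs_log": True}
```

For example, with the column groups `["cg1", "cg_rowkey", "cg2"]` and a first set number of 2, the plan contains two batches, and the first one holds `cg_rowkey`.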

In some embodiments, if any of the first merge tasks fails, a plurality of second merge tasks are regenerated based on the plurality of column groups to be merged.

Alternatively, when regenerating a plurality of second merging tasks based on a plurality of column groups to be merged, this may be achieved by:

and generating a second merging task based on a second set number of column groups in the plurality of column groups to obtain a plurality of second merging tasks. Wherein the second set number is smaller than the first set number.

Optionally, every second set number of column groups may be placed in one batch, so that the plurality of column groups are divided into a plurality of batches. One second merging task is then generated from the column groups of each batch, which yields the plurality of second merging tasks.

The second set number may be any value, and the specific value of the second set number is not limited in this specification.

Through the embodiment, when the first merging task fails to merge due to the memory problem, the number of the column groups scheduled by each batch can be reduced, so that the sufficient memory can be ensured in the execution process of each second merging task, and the success rate of merging the data table is improved.
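The fallback to smaller batches can be sketched as follows. The sketch assumes that a failed merging task signals its failure by raising Python's built-in `MemoryError`, and that `run_task` is a caller-supplied callable that merges one batch; both are illustrative choices, not details given in the specification. The name `merge_with_retry` is hypothetical.

```python
def merge_with_retry(column_groups, first_set_number, second_set_number, run_task):
    """Merge in batches; on failure, regenerate smaller batches and retry once.

    `run_task(batch)` merges one batch of column groups and is assumed to
    raise MemoryError when it runs out of memory.
    """
    if not 1 <= second_set_number < first_set_number:
        raise ValueError("second_set_number must be >= 1 and < first_set_number")

    for batch_size in (first_set_number, second_set_number):
        batches = [
            column_groups[i:i + batch_size]
            for i in range(0, len(column_groups), batch_size)
        ]
        try:
            # First pass: first merging tasks; second pass: second merging tasks.
            return [run_task(batch) for batch in batches]
        except MemoryError:
            continue  # any failed task triggers regeneration with smaller batches
    raise MemoryError("merge failed even with the second set number")
```

Note that in this sketch all tasks are regenerated after a failure, including those that had already succeeded, which mirrors the wording that the second merging tasks are regenerated based on the plurality of column groups to be merged.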

Corresponding to the embodiments of the method described above, the present description also provides corresponding device embodiments.

Referring to fig. 4, fig. 4 is a schematic block diagram of a computing device provided by an exemplary embodiment. At the hardware level, the device includes a processor 402, an internal bus 404, a network interface 406, a memory 408, and a non-volatile memory 410, and may of course also include hardware required by other tasks. One or more embodiments of the present description may be implemented in software, for example by the processor 402 reading a corresponding computer program from the non-volatile memory 410 into the memory 408 and then running it. Of course, in addition to a software implementation, one or more embodiments of the present disclosure do not exclude other implementations, such as a logic device or a combination of software and hardware. That is, the execution subject of the following processing flow is not limited to logic units, and may also be hardware or a logic device.

The present disclosure further provides a data table merging device, please refer to fig. 5, fig. 5 is a block diagram of a data table merging device provided in an exemplary embodiment, and the data table merging device may be applied to the computing device shown in fig. 4 to implement the technical solution of the present disclosure. Wherein, the data table merging device may include:

the generating module 501 is configured to generate a plurality of first merging tasks based on a plurality of column groups to be merged, where for any one first merging task, the first merging task is configured to merge a part of column groups in the plurality of column groups, and column groups corresponding to different first merging tasks are different;

the merging module 502 is configured to merge the plurality of column groups based on the plurality of first merging tasks to obtain a merged first data table.

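In some embodiments, the generating module 501, when configured to generate a plurality of first merging tasks based on a plurality of column groups to be merged, is configured to:

generate a first merging task based on a first set number of column groups in the plurality of column groups to obtain the plurality of first merging tasks.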

In some embodiments, the merging module 502 is further configured to directly merge the plurality of column groups to obtain the first data table if the number of the plurality of column groups is less than or equal to the first set number.

In some embodiments, the merging module 502, when configured to merge a plurality of column groups based on a plurality of first merging tasks to obtain a merged first data table, is configured to:

respectively merging the column groups corresponding to each first merging task to obtain a plurality of second data tables;

and merging the plurality of second data tables to obtain a first data table.

In some embodiments, the merging module 502, when configured to respectively merge the column groups corresponding to each first merging task, is configured to perform any one of the following:

serially executing a plurality of first merging tasks to merge column groups corresponding to each first merging task respectively;

and executing part of the first merging tasks in the plurality of first merging tasks in parallel so as to merge the column groups corresponding to each first merging task respectively.

In some embodiments, the plurality of column groups includes a first column group that includes key fields and a second column group that does not include key fields;

the merging module 502, when configured to merge the column groups corresponding to each first merging task, is configured to:

carrying out merging processing based on a first merging task corresponding to the first column group, and generating log data;

and carrying out merging processing based on the first merging task corresponding to the second column group according to the log data.

In some embodiments, the generating module 501 is further configured to, if any of the first merging tasks fails, regenerate a plurality of second merging tasks based on a plurality of column groups to be merged.

In some embodiments, the generating module 501, when configured to regenerate the plurality of second merging tasks based on the plurality of column groups to be merged, is configured to:

generating a second merging task based on a second set number of column groups in the plurality of column groups to obtain a plurality of second merging tasks;

the second set number is smaller than the first set number, and the first set number is the number of column groups included in each first merging task.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.

In a typical configuration, a computer includes one or more processors (Central Processing Unit, CPU), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (Random Access Memory, RAM), and/or non-volatile memory, such as read-only memory (Read-Only Memory, ROM) or flash RAM. Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (Phase Change Random Access Memory, PRAM), static random access memory (Static Random Access Memory, SRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), flash memory or other memory technology, compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), digital versatile disc (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.

The foregoing description of the preferred embodiment(s) is (are) merely intended to illustrate the embodiment(s) of the present invention, and it is not intended to limit the embodiment(s) of the present invention to the particular embodiment(s) described.

Claims

1. A data table merging method, comprising:

2. The method of claim 1, the generating a plurality of first merge tasks based on a plurality of column groups to be merged, comprising:

generating a first merging task based on a first set number of column groups in the plurality of column groups to obtain the plurality of first merging tasks.

3. The method of claim 2, the method further comprising:

if the number of the plurality of column groups is smaller than or equal to the first set number, the plurality of column groups are directly combined to obtain the first data table.

4. The method of claim 1, the merging the plurality of column groups based on the plurality of first merging tasks to obtain a merged first data table, comprising:

and merging the plurality of second data tables to obtain the first data table.

5. The method according to claim 4, wherein the merging the column groups corresponding to each first merging task includes any one of the following:

the plurality of first merging tasks are executed in series so as to merge column groups corresponding to each first merging task respectively;

6. The method of claim 4, wherein the plurality of column groups comprises a first column group and a second column group, the first column group being a column group comprising key fields, the second column group being a column group not comprising key fields;

the merging of the column groups corresponding to each first merging task includes:

7. The method of claim 1, the method further comprising:

if any first merging task fails, regenerating a plurality of second merging tasks based on a plurality of column groups to be merged.

8. The method of claim 7, the regenerating a plurality of second merge tasks based on a plurality of column groups to be merged, comprising:

generating a second merging task based on a second set number of column groups in the plurality of column groups to obtain the plurality of second merging tasks;

9. A data table merging apparatus comprising:

10. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to implement the method of any of claims 1-8 by executing the executable instructions.

11. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any of claims 1-8.