CN117194355B - Data processing method and device based on database and electronic equipment - Google Patents

Data processing method and device based on database and electronic equipment

Info

Publication number
CN117194355B
Authority
CN
China
Prior art keywords
data
data table
batch size
preset
repeated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311479500.3A
Other languages
Chinese (zh)
Other versions
CN117194355A (en)
Inventor
胡浩
郑启洋
邹翔宇
夏文
李诗逸
张程伟
张皖川
蒋兆恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Primitive Data Beijing Information Technology Co ltd
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Primitive Data Beijing Information Technology Co ltd
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Primitive Data Beijing Information Technology Co ltd, Shenzhen Graduate School Harbin Institute of Technology filed Critical Primitive Data Beijing Information Technology Co ltd
Priority to CN202311479500.3A priority Critical patent/CN117194355B/en
Publication of CN117194355A publication Critical patent/CN117194355A/en
Application granted granted Critical
Publication of CN117194355B publication Critical patent/CN117194355B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a database-based data processing method and apparatus and an electronic device, belonging to the technical field of data processing. The method comprises the following steps: screening out a selected data table according to the data type of a newly added data table, and extracting a preset batch size from the selected data table; constructing a preset batch range according to the preset batch size and a preset value, the preset batch range comprising a plurality of first batch sizes, and compressing the newly added data table according to each first batch size to obtain a plurality of candidate data tables; acquiring the first scanning time of each candidate data table, and taking the first batch size corresponding to the minimum first scanning time as the target batch size; and compressing the newly added data table according to the target batch size to obtain a target data table. Because the target data table obtained by compressing the newly added data table according to the target batch size has the minimum scanning time, a balance between decompression speed and compression rate is achieved.

Description

Data processing method and device based on database and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a database-based data processing method and apparatus, and an electronic device.
Background
At present, databases compress data so that it occupies less storage space, saving storage cost and reducing the amount of disk IO (input/output), thereby improving database performance. Existing database compression schemes include the following. (1) Page-level compression scheme: an existing compression algorithm is designated when the data table is created, and compression and decompression are performed at page granularity. The compression rate is high, but this causes the decompression-amplification problem: even when only one row of data is read, the whole page must be decompressed, so the decompression speed is low. (2) Row-level compression scheme: a dictionary is created per page, each row of data in the page is compressed and decompressed with the same compression algorithm through the dictionary, and the dictionary is stored in the page. Compared with the page-level compression scheme, which decompresses the whole page even when only one row of data is read, the row-level compression scheme has a higher decompression speed but a lower compression rate. With the existing database compression schemes, decompression speed can only be sacrificed to obtain a higher compression rate, or the compression rate lowered to improve the decompression speed, so database performance cannot be optimized. To optimize database performance, decompression speed and compression rate should be balanced. Therefore, how to balance decompression speed and compression rate and improve database performance has become a technical problem to be solved urgently.
Disclosure of Invention
The main purpose of the embodiments of the application is to provide a database-based data processing method, a data processing device and an electronic device, aiming to balance decompression speed and compression rate and improve the performance of the database.
To achieve the above object, a first aspect of an embodiment of the present application proposes a database-based data processing method, including:
acquiring a new data table and the data type of the new data table;
screening a selected data table from a preset historical data table according to the data type of the newly added data table;
extracting a preset batch size from the selected data table; wherein the preset batch size represents that, when the selected data table is obtained by compression at the preset batch size, the scanning time of the selected data table is minimum;
constructing a preset batch range according to the preset batch size and the preset value; wherein the preset batch range includes a plurality of first batch sizes;
compressing the newly added data table according to the first batch size to obtain at least one candidate data table;
acquiring a first scanning time of each candidate data table, and taking a first batch size corresponding to the minimum first scanning time as a target batch size; and compressing the newly added data table according to the target batch size to obtain a target data table.
In some embodiments, before the obtaining the new data table and the data type of the new data table, the method further includes:
acquiring a preliminary data table; wherein the preliminary data table comprises at least one data page;
acquiring a plurality of second batch sizes; wherein the plurality of second batch sizes are 1 to N, respectively, N being the number of rows of data contained in a data page;
compressing the preliminary data table according to the second batch size to obtain at least one compressed data table;
acquiring a second scanning time of each compressed data table, and taking a second batch size corresponding to the minimum second scanning time as a preset batch size;
performing data compression on the preliminary data table according to the preset batch size to obtain a historical data table;
and storing the historical data table and the preset batch size together.
In some embodiments, before the obtaining the preliminary data table, the method further comprises:
acquiring an original data table;
compressing the original data table to obtain an original compressed data table;
acquiring the scanning time of the original data table and the scanning time of the original compressed data table;
And if the scanning time of the original data table is longer than that of the original compressed data table, taking the original data table as the preliminary data table.
In some embodiments, the compressing the new data table according to the target batch size to obtain a target data table includes:
acquiring data of each row in the newly added data table to obtain newly added row data;
compressing each newly added row of data according to a preset compression format and the target batch size to obtain at least one data frame and/or a matching frame; the preset compression format includes the data frame and the matching frame, where the data frame includes: a non-repeating data length, non-repeating data, and a separator, the separator serving as a terminal position to distinguish the non-repeating data; the matching frame comprises an offset from the current position of the repeated data to the matching position, the total length of repeated reference matching and the first matching length of the repeated data;
repeating the steps until the new row data is traversed, and combining at least one data frame and/or the matching frame to obtain the target data table.
In some embodiments, the compressing each newly added row data according to the preset compression format and the target batch size to obtain at least one data frame and/or matching frame includes:
Traversing each newly added row of data through a sliding window, and calculating a hash value of the sliding window;
if the hash value does not exist in the shared table, the data traversed by the sliding window is non-repeated data, and the hash value, the offset of the current position of the non-repeated data in the compressed data and the pointer pointing to the current position of the uncompressed data are recorded in the shared table;
storing continuous non-repeated data as data frames according to the length of the non-repeated data and the non-repeated data;
if the hash value exists in the shared table, the data traversed by the sliding window is repeated data, the total length matched by repeated reference of the repeated data is obtained according to the corresponding pointer pointing to the current position of uncompressed data in the shared table, the offset from the current position of the repeated data to the matched position is obtained according to the offset of the current position of the non-repeated data in the compressed data and the offset from the current position of the repeated data in the compressed data, and the first matched length of the repeated data is obtained according to the offset from the current position of the repeated data to the matched position;
and storing the repeated data as a matching frame according to the offset from the current position of the repeated data to the matching position, the total length of the repeated reference matching and the first matching length of the repeated data.
In some embodiments, after the combining at least one of the data frames and/or the matching frames to obtain the target data table, the method further includes:
acquiring a query plan;
analyzing the query plan to obtain decompressed data selection information;
selecting data from the target data table according to the decompressed data selection information to obtain target compressed data; wherein the target compressed data comprises at least one data frame and/or a matching frame;
traversing the data frames and the matching frames in the target compressed data in sequence;
copying non-repeated data in the data frame, acquiring the position and the length of the repeated data according to the matching frame, and copying the repeated data from the data frame according to the position and the length of the repeated data;
and sequentially combining all the non-repeated data and all the repeated data to obtain decompressed data.
In some embodiments, the acquiring the scan time of the original data table includes:
acquiring the physical space of the original data table and the page eviction overhead of a database;
and predicting the scanning time according to the physical space, the preset memory budget, the preset disk I/O bandwidth and the page eviction overhead to obtain the scanning time of the original data table.
In some embodiments, the obtaining the scan time of the original compressed data table includes:
acquiring the physical space of the original data table and the page eviction overhead of a database;
acquiring the proportion of the buffer pool consumed by memory alignment, the compression rate under the second batch size and the decompression speed under the second batch size; wherein the original data table is stored in the buffer pool;
and carrying out scanning time prediction according to the physical space, the preset memory budget, the preset disk I/O bandwidth, the page eviction overhead, the proportion of the buffer pool consumed by memory alignment, the compression rate under the second batch size and the decompression speed under the second batch size, to obtain the scanning time of the original compressed data table.
To achieve the above object, a second aspect of the embodiments of the present application proposes a database-based data processing apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a new data table and the data types of the new data table;
the screening module is used for screening a selected data table from a preset historical data table according to the data type of the newly added data table;
the extraction module is used for extracting the preset batch size from the selected data table; wherein the preset batch size represents that, when the selected data table is obtained by compression at the preset batch size, the scanning time of the selected data table is minimum;
The construction module is used for constructing a preset batch range according to the preset batch size and the preset value; wherein the preset batch range includes a plurality of first batch sizes;
the first compression module is used for compressing the newly added data table according to the first batch size to obtain at least one candidate data table;
the second acquisition module is used for acquiring the first scanning time of each candidate data table and taking the first batch size corresponding to the minimum first scanning time as a target batch size;
and the second compression module is used for compressing the newly-added data table according to the target batch size to obtain a target data table.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, the memory storing a computer program, the processor implementing the method according to the first aspect when executing the computer program.
According to the database-based data processing method and apparatus and the electronic device provided by the embodiments of the application, a selected data table is screened from the preset historical data tables according to the data type of the newly added data table, and a preset batch size is extracted from the selected data table; the preset batch size is the batch size at which the selected data table, when compressed, has the minimum scanning time. A preset batch range is then constructed according to the preset batch size and a preset value, the preset batch range comprising a plurality of first batch sizes, and the newly added data table is compressed according to each first batch size to obtain at least one candidate data table. The first scanning time of each candidate data table is acquired, the first batch size corresponding to the minimum first scanning time is taken as the target batch size, and the newly added data table is compressed according to the target batch size to obtain a target data table. Because the target data table obtained by compressing the newly added data table according to the target batch size has the minimum scanning time, a balance between decompression speed and compression rate is achieved; and because the target batch size is selected within the preset batch range, the amount of calculation is reduced and the target batch size can be selected quickly, improving the performance of the database.
Drawings
FIG. 1 is a flow chart of a database-based data processing method provided in an embodiment of the present application;
FIG. 2 is a flow chart of a database-based data processing method provided in another embodiment of the present application;
FIG. 3 is a flowchart of a database-based data processing method provided in a third embodiment of the present application;
fig. 4 is a flowchart of step S107 in fig. 1;
FIG. 5 is a schematic diagram of compressing data in a predetermined compression format according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another embodiment of compressing data in a predetermined compression format;
fig. 7 is a flowchart of step S402 in fig. 4;
FIG. 8 is a flow chart of compressed data provided by an embodiment of the present application;
FIG. 9 is a flowchart of a database-based data processing method provided in a fourth embodiment of the present application;
fig. 10 is a first flowchart of step S301 in fig. 3;
fig. 11 is a second flowchart of step S301 in fig. 3;
FIG. 12 is a schematic diagram of a database-based data processing apparatus according to an embodiment of the present application;
fig. 13 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms used in this application are explained:
Data page (page): the most basic unit of data storage in a database, i.e., the smallest unit of data processed when the disk space manager handles external storage.
A key factor affecting decompression speed and compression rate is the aggregate compression size (batch size), i.e., how many rows of data are aggregated together for compression. Prior-art compression schemes aggregate either all rows of an entire data page together for compression or only a single row, namely the page-level compression scheme and the row-level compression scheme, respectively. The page-level compression scheme compresses and decompresses at page granularity, that is, all rows of the whole data page are aggregated and compressed together; its compression rate is high, but it suffers from the decompression-amplification problem: even when only one row of data is read, the whole page must be decompressed, so the decompression speed is low. The row-level compression scheme creates a dictionary per page and compresses and decompresses each row of data in the page with the same compression algorithm. With the existing database compression schemes, decompression speed can only be sacrificed to obtain a higher compression rate, or the compression rate lowered to improve the decompression speed; that is, decompression speed and compression rate cannot be balanced, so database performance cannot reach its optimum. To optimize database performance, decompression speed and compression rate should be balanced.
Based on the above, the embodiments of the application provide a database-based data processing method and apparatus and an electronic device, which calculate the corresponding target batch size for each data table and compress data at the granularity of that target batch size, thereby balancing decompression speed and compression rate for different data tables and improving the performance of the database.
The method, the device and the electronic equipment for processing data based on the database provided by the embodiment of the application are specifically described through the following embodiments, and the method for processing data based on the database in the embodiment of the application is described first.
The data processing method based on the database provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server side can be configured as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligent platforms and the like; the software may be an application or the like that implements a database-based data processing method, but is not limited to the above form.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network personal computers (Personal Computer, PCs), minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Referring to fig. 1, fig. 1 is an optional flowchart of a database-based data processing method according to an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, obtaining a new data table and the data types of the new data table;
step S102, screening a selected data table from a preset historical data table according to the data type of the newly added data table;
step S103, extracting a preset batch size from the selected data table; wherein the preset batch size represents that, when the selected data table is obtained by compression at the preset batch size, the scanning time of the selected data table is minimum;
step S104, a preset batch range is constructed according to the preset batch size and the preset value; wherein the preset batch range includes a plurality of first batch sizes;
step S105, compressing the newly added data table according to the first batch size to obtain at least one candidate data table;
step S106, obtaining a first scanning time of each candidate data table, and taking a first batch size corresponding to the minimum first scanning time as a target batch size;
and S107, compressing the newly added data table according to the target batch size to obtain a target data table.
In steps S101 to S103 of some embodiments, the preset batch size is calculated in advance, each of the history data tables corresponds to a preset batch size, and the history data tables are stored together with the corresponding preset batch size. For example, when the preset batch size is 5, the historical data table is obtained by compressing with 5-row granularity, and when the historical data table is obtained by compressing with the preset batch size, the scanning time of the historical data table is minimum, and the decompression speed and the compression rate of the historical data table are balanced. When a new data table exists in the database and the new data table needs to be compressed, a selected data table with the same data type is screened from a preset historical data table according to the data type of the new data table, and the preset batch size in the selected data table is extracted.
When the new data table needs to be compressed, it may be determined whether the new data table is stored in a disk or a buffer pool (buffer pool). If the data is stored in the buffer pool, the new data table in the buffer pool can be directly used for calculating the corresponding target batch size. If the data is stored in the disk, the new data table in the disk is written into the buffer pool, and then the corresponding target batch size is calculated by using the new data table in the buffer pool, so that the disk I/O cost can be saved.
It can be understood that, since the new data table and the selected data table are the same type of data table, the preset batch size of the selected data table is similar to the target batch size of the new data table, and the new data table is compressed according to the preset batch size to obtain a better compression strategy.
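As an illustrative sketch of how steps S101 to S103 might be organized, the historical data tables and their preset batch sizes can be kept in a mapping keyed by data type; the structure and all names below (such as HistoricalEntry and history_store) are assumptions for illustration, not part of the embodiment:

```python
from typing import NamedTuple, Optional

class HistoricalEntry(NamedTuple):
    table_id: str           # identifier of the stored historical data table
    preset_batch_size: int  # batch size that minimized its scanning time

# Historical data tables are stored together with their preset batch sizes;
# here they are indexed by data type for the screening step.
history_store: dict[str, HistoricalEntry] = {}

def extract_preset_batch_size(new_table_data_type: str) -> Optional[int]:
    """Screen a selected data table of the same data type and return its preset
    batch size; None means no historical table of this type exists yet, in
    which case the preset batch size is computed as in steps S201 to S206."""
    entry = history_store.get(new_table_data_type)
    return entry.preset_batch_size if entry is not None else None
```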
In steps S104 to S107 of some embodiments, when the preset batch size is 5 and the preset value is 3, the preset batch range is [2,8], then the first batch sizes are 2, 3, 4, 5, 6, 7, and 8, respectively. And compressing the newly added data table by taking the first batch size as granularity to obtain 7 candidate data tables, and respectively obtaining 7 first scanning times of the 7 candidate data tables. If the first batch size is 4, the corresponding first scanning time is the smallest, the target batch size is 4, the newly added data table is compressed according to the 4-row granularity to obtain a target data table, the target data table is stored in the database, the candidate data table is not stored, and redundant data tables are prevented from being stored in the database. Or, the candidate data table corresponding to the first batch size of 4 is directly stored in the database as the target data table, and the rest candidate data tables are not stored. Under the condition that too many data tables are prevented from being stored in the database, repeated compression can be prevented, the compression speed is improved, and the compression rate and the decompression speed in the target data table at the moment reach balance.
Specifically, when the newly added data table is compressed according to the first batch size to obtain at least one candidate data table, the sample compression may be performed. The method comprises the steps of compressing partial data in a newly added data table according to a first batch size to obtain at least one candidate data table, obtaining first scanning time of each candidate data table, and taking the first batch size corresponding to the minimum first scanning time as a target batch size. In this embodiment, the calculation amount for obtaining the target batch size is further reduced through sampling and compression, and the process for obtaining the target batch size is quickened.
It should be noted that, the selection of the target batch size may be accelerated by a dichotomy, when the preset batch range is [2,8], the newly added data table is compressed with the first batch size being 5 and 8 respectively to obtain two candidate data tables, when the candidate data table scanning time corresponding to the first batch size being 5 is smaller than the candidate data table scanning time corresponding to the first batch size being 8, the candidate data table scanning time corresponding to the first batch size being 3 is calculated. When the candidate data table scanning time corresponding to the first batch size of 5 is smaller than the candidate data table scanning time corresponding to the first batch size of 3, calculating the candidate data table scanning time corresponding to the first batch size of 4, and when the candidate data table scanning time corresponding to the first batch size of 4 is smaller than the candidate data table scanning time corresponding to the first batch size of 5, the target batch size of 4, according to the embodiment, the selection of the target batch size is accelerated without calculating the candidate data table scanning time corresponding to all the first batch sizes.
In some examples, after the preset batch size is extracted from the selected data table, the preset batch range may be constructed in other manners. For example, the lower limit of the preset batch range is the preset batch size divided by 2, rounded up when the result is not an integer, and the upper limit is the preset batch size multiplied by 2; for example, when the preset batch size is 5, the preset batch range is [3, 10]. The target batch size is then selected from the preset batch range by dichotomy, so that the target batch size can be selected more quickly.
In the steps S101 to S107 illustrated in the embodiment of the present application, when compressing the new data table, the data type of the new data table is obtained, and the selected data table with the same data type is selected from the preset historical data table. Extracting a preset batch size from the selected data table, and constructing a preset batch range according to the preset batch size and a preset value; wherein the preset batch range includes a plurality of first batch sizes. And compressing the newly added data table according to the first batch size to obtain at least one candidate data table, and obtaining the scanning time of each candidate data table to obtain the first scanning time. And taking the first batch size corresponding to the minimum first scanning time as a target batch size, and compressing the newly added data table according to the target batch size to obtain a target data table. According to the embodiment, the preset batch size is extracted from the selected data table, the preset batch range is constructed, the target batch size is positioned in the preset batch range, the target batch size can be selected rapidly, when the newly-added data table is compressed according to the target batch size, the scanning time of the obtained target data table is minimum, and the balance between the decompression speed and the compression rate is realized.
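The selection of the target batch size described above can be sketched as follows; compress_sample and measure_scan_time are assumed callbacks standing in for the sampling compression and scan-time measurement of the embodiment, not APIs it defines:

```python
def select_target_batch_size(new_table, preset_batch_size: int, preset_value: int,
                             compress_sample, measure_scan_time) -> int:
    """Sketch of steps S104 to S106: build the preset batch range around the
    preset batch size and return the first batch size whose candidate data
    table has the smallest first scanning time."""
    low = max(1, preset_batch_size - preset_value)   # e.g. 5 - 3 = 2
    high = preset_batch_size + preset_value          # e.g. 5 + 3 = 8
    target_batch_size, best_time = low, float("inf")
    for first_batch_size in range(low, high + 1):
        # Sampling compression: only part of the newly added data table is
        # compressed at this granularity, keeping the search cheap.
        candidate_table = compress_sample(new_table, first_batch_size)
        first_scan_time = measure_scan_time(candidate_table)
        if first_scan_time < best_time:
            target_batch_size, best_time = first_batch_size, first_scan_time
    return target_batch_size
```

With a preset batch size of 5 and a preset value of 3, this sketch scans the range [2, 8] exhaustively; the dichotomy described above can replace the linear loop when the scanning time is roughly unimodal in the batch size.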
Referring to fig. 2, in some embodiments, before step S101, the database-based data processing method of the present application specifically further includes, but is not limited to, steps S201 to S206:
step S201, a preliminary data table is obtained; wherein the preliminary data table comprises at least one data page;
step S202, obtaining a plurality of second batch sizes; wherein the plurality of second batch sizes are 1 to N, respectively, N being the number of rows of data contained in a data page;
step S203, compressing the preliminary data table according to the second batch size to obtain at least one compressed data table;
step S204, obtaining a second scanning time of each compressed data table, and taking a second batch size corresponding to the minimum second scanning time as a preset batch size;
step S205, compressing the preliminary data table according to the preset batch size to obtain a historical data table;
step S206, the history data table and the preset batch size are stored together.
In the steps S201 to S206 illustrated in the embodiment of the present application, the original data table in the database is referred to as a preliminary data table. If the number of rows contained in one data page is 10, the second batch sizes are 1, 2, …, and 10, respectively. The preliminary data table is compressed with each second batch size as granularity to obtain 10 compressed data tables, and the scanning times of the 10 compressed data tables are obtained respectively, giving 10 second scanning times. If the second scanning time corresponding to the second batch size of 4 is the smallest, the preset batch size is 4; the preliminary data table is compressed at 4-row granularity to obtain a historical data table, and the historical data table is stored together with the preset batch size. In this embodiment, the preliminary data table is compressed according to the calculated preset batch size, and the resulting historical data table has the minimum scanning time, which indicates that compression rate and decompression speed are balanced, improving database performance. When a new data table is added to the database, a selected data table with the same data type can be screened from the preset historical data tables according to the data type of the newly added data table, and the preset batch size is extracted from the selected data table. The target batch size of the newly added data table can then be obtained quickly from the preset batch size, avoiding having to calculate the scanning time corresponding to every possible batch size whenever a data table is newly added.
It should be noted that, when the data type of the new data table does not exist in the preset history data table, the target batch size of the new data table may be obtained by the method of this embodiment.
Specifically, when the preliminary data table is compressed according to the second batch size to obtain at least one compressed data table, sampling compression may be performed: part of the data in the preliminary data table is compressed according to each second batch size to obtain at least one compressed data table, the second scanning time of each compressed data table is obtained, and the second batch size corresponding to the minimum second scanning time is taken as the preset batch size. In this embodiment, sampling compression further reduces the amount of calculation for obtaining the preset batch size and accelerates the process of obtaining it.
In some examples, the selection of the preset batch size may also be accelerated by dichotomy. If a data page holds at most N rows of data, the specific implementation process is: compress the preliminary data table with second batch sizes of N and N/2 respectively to obtain two compressed data tables; if the scanning time of the compressed data table corresponding to the second batch size of N/2 is smaller than that corresponding to N, calculate the scanning time of the compressed data table corresponding to the second batch size of N/4. Compare the scanning times corresponding to N/4 and N/2; if the scanning time corresponding to N/4 is smaller, select the second batch size of N/8 and compare the scanning times corresponding to N/8 and N/4 again, and so on until the preset batch size is found.
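A minimal sketch of this dichotomy, assuming the scanning time is roughly unimodal in the batch size and reusing the assumed compress_sample and measure_scan_time callbacks from the sketch above:

```python
def preset_batch_size_by_dichotomy(preliminary_table, n: int,
                                   compress_sample, measure_scan_time) -> int:
    """Repeatedly halve the second batch size (N, N/2, N/4, ...) while the
    smaller candidate still yields a smaller second scanning time."""
    current = n
    current_time = measure_scan_time(compress_sample(preliminary_table, current))
    while current > 1:
        half = max(1, current // 2)
        half_time = measure_scan_time(compress_sample(preliminary_table, half))
        if half_time < current_time:
            current, current_time = half, half_time   # keep halving
        else:
            break  # the larger batch size scans faster; keep it
    return current
```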
Referring to fig. 3, in some embodiments, before step S201, the database-based data processing method of the present application specifically further includes, but is not limited to, steps S301 to S304:
step S301, an original data table is obtained;
step S302, compressing an original data table to obtain an original compressed data table;
step S303, acquiring the scanning time of the original data table and the scanning time of the original compressed data table;
in step S304, if the scanning time of the original data table is longer than the scanning time of the original compressed data table, the original data table is used as the preliminary data table.
In the steps S301 to S304 illustrated in this embodiment, most data tables occupy less storage space after compression, but for some data tables the required storage space is not reduced by compression. Therefore, not all data tables need to be compressed; only the data tables in the database with compression potential should be compressed. A data table that can save storage space after compression is called a data table with compression potential. Whether a data table has compression potential is judged by comparing the scanning time of the data table before compression with its scanning time after compression, and a data table with compression potential is taken as a preliminary data table for compression. Since only data tables with compression potential are compressed, the transaction delay caused by unnecessary compression is reduced, further improving the performance of the database.
It should be noted that the original data table is compressed according to the second batch sizes; if at least one original compressed data table has a scanning time smaller than that of the original data table, the original data table is a data table with compression potential and is taken as the preliminary data table.
Specifically, sampling compression can also be performed when compressing the original data table to obtain the original compressed data table: part of the data in the original data table is compressed according to the second batch size to obtain at least one original compressed data table, and if the scanning time of that part of the original data table is longer than the scanning time of the original compressed data table, the original data table is taken as the preliminary data table. In this embodiment, sampling compression speeds up the process of determining whether the original data table has compression potential.
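A short sketch of this compression-potential check, again with assumed sampling and measurement callbacks rather than APIs defined by the embodiment:

```python
def has_compression_potential(original_table, second_batch_sizes,
                              compress_sample, measure_scan_time) -> bool:
    """Steps S301 to S304 in sketch form: the original data table is kept as a
    preliminary data table (i.e. worth compressing) only if at least one
    sampled compressed variant scans faster than the uncompressed sample."""
    uncompressed_time = measure_scan_time(original_table)
    return any(
        measure_scan_time(compress_sample(original_table, size)) < uncompressed_time
        for size in second_batch_sizes
    )
```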
Referring to fig. 4, in some embodiments, step S107 may include, but is not limited to, steps S401 to S403:
step S401, each line of data in the newly added data table is obtained, and the newly added line of data is obtained;
step S402, compressing each new line data according to a preset compression format and a target batch size to obtain at least one data frame and/or a matching frame; the preset compression format comprises a data frame and a matching frame, wherein the data frame comprises: non-repeating data length, non-repeating data, and a separator, the separator serving as a terminal position to distinguish the non-repeating data; the matching frame comprises an offset from the current position of the repeated data to the matching position, the total length of repeated reference matching and the first matching length of the repeated data;
Step S403, repeating the steps until the new row data is traversed, and combining at least one data frame and/or matching frame to obtain the target data table.
In steps S401 to S403 illustrated in this embodiment, a data frame carries non-repeated data and its meta information, and a matching frame carries the meta information of repeated data. Fig. 5 is a schematic diagram of compressing data according to the preset compression format. As shown in Fig. 5, the target batch size is 2, the data is compressed at 2-row granularity, and the two newly added rows of data are "www.icde2024.github.io" and "www.icde2023.ics.uci.edu", respectively. The non-repeated data length is represented by one byte; since the first segment of non-repeated data in the two rows is 22 bytes long, which is 0x16 in hexadecimal, the length field of the first data frame is 0x16 and the data field of the first data frame is "www.icde2024.github.io". The separator tag is 0x24; the separator only needs to be set to a character that does not occur in the data.
The second row has repeated data "www.icde202", and this portion of the data is replaced with a matching frame. The offset from the current position of the repeated data to the matching position is carried in the Offset field, which occupies two bytes. From the first "w" of the second row back to the first "w" of the first data frame's data field there are 22 bytes of non-repeated data, plus one byte for the first data frame's length field and two bytes for the first matching frame's Offset field, 25 bytes in total; therefore the Offset field is 0x0019. The total length of the repeated reference match is carried in the match_length field, which occupies one byte; the repeated data is 11 bytes long, so the match_length field is 0x0B. The first match length of the repeated data is carried in the tag_length field, which can be omitted when the reference does not overlap. The non-repeated data of the second row is "3.ics.uci.edu", which is 13 bytes long, so the length field of the second data frame is 0x0D and the data field of the second data frame is "3.ics.uci.edu".
In some examples, Fig. 6 is a schematic diagram of another embodiment of compressing data according to the preset compression format. As shown in Fig. 6, the target batch size is 3 and compression is performed at 3-row granularity. The newly added rows of data are "QWQUSGZAZP", "MNQSU1YPOZAZP" and "PAJIDXC1YPOZAZP", respectively. The non-repeated data consists of the first row "QWQUSGZAZP" and the beginning of the second row "MNQSU1YPO", 19 bytes in total, so the length field of the first data frame is 0x13 and the data field of the first data frame is "QWQUSGZAZPMNQSU1YPO". The second row has repeated data "ZAZP", which is replaced with a matching frame. From the "Z" of the second row's repeated data back to the first "Z" of the first data frame's data field there are 13 bytes, plus one byte for the first data frame's length field and two bytes for the first matching frame's Offset field, 16 bytes in total, so the Offset field of the first matching frame is 0x0010. The repeated data is 4 bytes, so the match_length field of the first matching frame is 0x04; the first match length is also 4 bytes, so the tag_length field of the first matching frame is 0x04. The non-repeated data of the third row is "PAJIDXC", so the length field of the second data frame is 0x07 and the data field of the second data frame is "PAJIDXC". The repeated data of the third row is "1YPOZAZP", which is replaced with a matching frame. From the "1" of the third row back to the "1" of the first data frame's data field there are 11 bytes of non-repeated data, plus two bytes of the second matching frame's Offset field, one byte of the second data frame's tag field and one byte of its length field, four bytes of the first matching frame and one byte of the first data frame's tag field, 20 bytes in total, so the Offset field of the second matching frame is 0x0014. The repeated data is 8 bytes, so the match_length field of the second matching frame is 0x08; the first match length is 4 bytes, so the tag_length field of the second matching frame is 0x04. Here "ZAZP" of the third row refers to "ZAZP" of the second row, which in turn refers to "ZAZP" of the first row; that is, the third row involves an overlapping reference.
It should be noted that, since the match_offset field is an offset within the compressed data rather than within the data before compression, any reference in the compressed data can be located directly without decompressing all preceding data. This realizes fine-grained decompression: after compressing at 3-row granularity, reading one row requires decompressing only that row rather than all three rows, so the decompression speed is improved along with the compression rate.
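The frame layout described above and illustrated in Figs. 5 and 6 can be summarized with the following sketch; the field widths follow the worked examples (one-byte lengths, two-byte offsets), while the class and attribute names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class DataFrame:
    length: int   # non-repeated data length, one byte (e.g. 0x16 = 22 in Fig. 5)
    data: bytes   # the non-repeated data itself; a separator byte that does not
                  # occur in the data (0x24 in Fig. 5) marks its end position

@dataclass
class MatchFrame:
    offset: int        # two bytes: offset from the current position of the
                       # repeated data back to the matching position, counted
                       # in the compressed data
    match_length: int  # one byte: total length of the repeated reference match
    tag_length: int    # one byte: first match length of the repeated data;
                       # omitted when the reference does not overlap
```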
Referring to fig. 7, in some embodiments, step S402 may include, but is not limited to, steps S701 to S705:
step S701, traversing each new row of data through the sliding window, and calculating a hash value of the sliding window;
step S702, if the hash value does not exist in the shared table, the data traversed by the sliding window is non-repeated data, and the hash value, the offset of the current position of the non-repeated data in the compressed data and the pointer pointing to the current position of the uncompressed data are recorded in the shared table;
step S703, storing continuous non-repeated data as data frames according to the length of the non-repeated data and the non-repeated data;
Step S704, if the hash value exists in the shared table, the traversed data of the sliding window is repeated data, the total length of repeated reference matching of the repeated data is obtained according to the corresponding pointer pointing to the current position of uncompressed data in the shared table, the offset of the current position of the repeated data to the matching position is obtained according to the offset of the current position of the non-repeated data in the compressed data and the offset of the current position of the repeated data in the compressed data, and the first matching length of the repeated data is obtained according to the offset of the current position of the repeated data to the matching position;
step S705, storing the repeated data as a matching frame according to the offset of the current position of the repeated data to the matching position, the total length of the repeated reference matching, and the first matching length of the repeated data.
In steps S701 to S705 illustrated in this embodiment, the data to be compressed is traversed with a sliding window. When the hash value of a sliding window does not exist in the shared table, the data covered by the sliding window is non-repeated data; the relevant information of the sliding window is recorded in the shared table and the window slides forward by one byte. When the hash value of a sliding window exists in the shared table, the data covered by the sliding window is repeated data, and the accumulated consecutive non-repeated data is stored as a data frame. The matching position in the pre-compression data is located through the corresponding pointer in the shared table that points to the current position of the uncompressed data, a byte-by-byte comparison finds the total length of the repeated reference match (i.e., the match_length field), and then the offset from the current position of the repeated data to the matching position (i.e., the offset field) and the first match length of the repeated data (i.e., the tag_length field) are obtained, so that the repeated data is stored as a matching frame. The sliding window then slides forward by match_length bytes to skip over the repeated data. The shared table exists only during compression; it is destroyed when compression finishes and does not occupy storage space.
Referring to fig. 8, the sliding window is four bytes, so when four consecutive bytes match a previous occurrence they are determined to be repeated data. Referring to fig. 5, the sliding window traverses the data to be compressed, "www.icde2024.github.io" and "www.icde2023.ics.uci.edu". The sliding window first covers "www." of the first row, and its hash value is calculated; this hash value is not in the shared table, so "www." of the first row is non-repeated data. The length field of the first data frame is one byte, and the current position of the non-repeated data at this moment is the first "w" of the first row, so the offset of the current position of the non-repeated data in the compressed data is 0x01. The hash value, 0x01 and a pointer pointing to the current position of the uncompressed data are recorded in the first entry of the shared table, where the pointer points to the first "w" of the first row. The sliding window then slides by one byte to "ww.i" of the first row; its hash value is not in the shared table, so "ww.i" of the first row is non-repeated data, and the relevant information of this sliding window is recorded in the shared table. These steps are repeated until the sliding window reaches "www." of the second row, whose hash value is found in the first entry of the shared table, indicating that "www." of the second row is repeated data. The data before "www." of the second row is non-repeated data of length 22, namely "www.icde2024.github.io", so the length field of the first data frame is 0x16 and the data field of the first data frame is "www.icde2024.github.io".
According to the pointer in the first entry of the shared table that points to the current position of the uncompressed data, the matching position is located at the first "w" of the first row. A byte-by-byte comparison finds that "www.icde202" of the second row is repeated data whose total repeated-reference match length is 11 bytes, i.e., the match_length field of the first matching frame is 0x0B. The offset, recorded in the first entry of the shared table, of the current position of the non-repeated data in the compressed data is 0x01, and the offset of the current position of the repeated data in the compressed data is 0x1A; their difference is the offset from the current position of the repeated data to the matching position, i.e., the Offset field of the first matching frame is 0x0019, and the matching frame replaces the repeated data. The sliding window then slides directly onto "3.ic" in the second row, skipping the length of the repeated data, and the remaining non-repeated data "3.ics.uci.edu" is stored as the second data frame.
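The sliding-window compression of steps S701 to S705 can be sketched as follows, reusing the DataFrame and MatchFrame classes from the sketch above. This is a simplification, not the embodiment's implementation: the rows of one batch are treated as a single byte stream, the window bytes themselves are used as the shared-table key instead of a hash value, offsets are counted in the uncompressed stream rather than in the compressed stream, and the tag_length handling for overlapping references (Fig. 6) is not modeled.

```python
WINDOW = 4  # four consecutive bytes matching an earlier occurrence count as repeated data

def compress_batch(rows: list[bytes]) -> list:
    """Simplified sketch of one batch being compressed into data frames and
    matching frames."""
    data = b"".join(rows)
    shared_table: dict[bytes, int] = {}   # window contents -> earlier position
    frames, literal_start, i = [], 0, 0
    while i + WINDOW <= len(data):
        window = data[i:i + WINDOW]
        ref = shared_table.get(window)
        if ref is None:
            shared_table[window] = i      # record non-repeated window position
            i += 1                        # slide the window by one byte
            continue
        # Repeated data found: flush pending non-repeated bytes as a data frame.
        if i > literal_start:
            frames.append(DataFrame(i - literal_start, data[literal_start:i]))
        match_length = 0                  # extend the match byte by byte
        while (i + match_length < len(data)
               and data[ref + match_length] == data[i + match_length]):
            match_length += 1
        frames.append(MatchFrame(offset=i - ref, match_length=match_length,
                                 tag_length=match_length))
        i += match_length                 # skip over the repeated data
        literal_start = i
    if literal_start < len(data):         # flush the trailing non-repeated data
        frames.append(DataFrame(len(data) - literal_start, data[literal_start:]))
    return frames
```

Run on the two rows of Fig. 5, this sketch emits a data frame for "www.icde2024.github.io", a matching frame with a match length of 11 bytes for "www.icde202", and a data frame for "3.ics.uci.edu"; only the offset value differs from Fig. 5 (22 here instead of 0x19), because the embodiment counts the offset in the compressed stream.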
Referring to fig. 9, in some embodiments, the database-based data processing method may further include, but is not limited to, steps S901 to S906:
step S901, obtaining a query plan;
step S902, analyzing the query plan to obtain decompressed data selection information;
Step S903, selecting data in the target data table according to the decompressed data selection information to obtain target compressed data; wherein the target compressed data comprises at least one data frame and/or a matching frame;
step S904, traversing the data frames and the matching frames in the target compressed data in sequence;
step S905, copying non-repeated data in the data frame, acquiring the position and the length of the repeated data according to the matched frame, and copying the repeated data from the data frame according to the position and the length of the repeated data;
step S906, combining all non-repeated data and all repeated data in sequence to obtain decompressed data.
In steps S901 to S906 illustrated in this embodiment, when the database executes a read query, the storage engine parses the query plan to obtain decompressed data selection information, reads the relevant pages into the buffer pool according to that information, and locates the rows to be decompressed; each required row (i.e., the row of the page that the query plan maps to) is decompressed individually. The decompressor restores the data by interpreting each byte of the compression format: non-repeated data is copied directly from the data frame according to the non-repeated data length in the data frame, the position of repeated data is located according to the matching frame and the repeated data is copied from the data frame, and all non-repeated data and all repeated data are combined in order to obtain the decompressed data.
In an example, referring to fig. 6, suppose only the second data frame and the second matching frame (i.e., the third row) need to be decompressed. The non-repeated data "PAJIDXC" in the second data frame is copied first. The match_length field 0x08 of the second matching frame is greater than its tag_length field 0x04, which indicates an overlapping reference. Following the Offset field of the second matching frame, the position 0x14 bytes back lands in the data field of the first data frame; the tag_length field of the second matching frame is 0x04, indicating that four bytes are to be copied there, so "1YPO" is copied. The remaining four bytes of repeated data are resolved through the first matching frame: following its Offset field, the position 0x10 bytes back again lands in the data field of the first data frame, starting at "Z", and four bytes "ZAZP" are copied. The repeated data thus obtained is "1YPOZAZP", and combining the non-repeated data and the repeated data in order yields the decompressed row "PAJIDXC1YPOZAZP". Therefore, in this embodiment, although three rows of data are aggregated and compressed together, a single row can be decompressed alone, realizing fine-grained decompression and solving the decompression-amplification problem.
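The decompression path of steps S904 to S906 can be sketched to mirror the simplified compressor above; note that its offsets therefore refer to positions in the reconstructed stream, whereas the embodiment's match_offset refers to the compressed stream, which is what allows a single row to be located without decompressing everything before it:

```python
def decompress(frames) -> bytes:
    """Sketch of traversing data frames and matching frames in order, copying
    non-repeated data directly and repeated data from the already restored
    output."""
    out = bytearray()
    for frame in frames:
        if isinstance(frame, DataFrame):
            out += frame.data                    # copy non-repeated data
        else:                                    # MatchFrame
            start = len(out) - frame.offset      # locate the matching position
            for k in range(frame.match_length):  # byte-wise copy also handles
                out.append(out[start + k])       # self-overlapping matches
    return bytes(out)
```

Here decompress(compress_batch([...])) restores the concatenated rows of a batch; fine-grained single-row decompression as in the embodiment additionally requires the compressed-stream offsets and the tag_length bookkeeping omitted from these sketches.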
It should be noted that the overhead on the decompression path can be further reduced with single instruction multiple data (Single Instruction Multiple Data, SIMD) instructions, i.e., 16 bytes are examined at a time by the compare function and data is copied with a SIMD-accelerated memcpy() function, thereby maximizing the decompression speed.
Referring to fig. 10, in some embodiments, step S301 may include, but is not limited to, steps S1001 to S1002:
step S1001, obtaining the physical space of an original data table and the page eviction overhead of a database;
step S1002, predicting the scanning time according to the physical space, the preset memory budget, the preset disk I/O bandwidth and the page eviction cost, and obtaining the scanning time of the original data table.
In steps S1001 to S1002 illustrated in this embodiment, the scanning time of the original data table is predicted according to Equation (1), whose inputs are the physical space of the original data table, the preset memory budget, the preset disk I/O bandwidth and the page eviction overhead of the database.
Referring to fig. 11, in some embodiments, step S301 may include, but is not limited to, steps S1101 to S1103:
step S1101, obtaining the physical space of the original data table and the page eviction overhead of the database;
step S1102, obtaining the proportion of the buffer pool consumed by memory alignment, the compression rate at the second batch size, and the decompression speed at the second batch size; wherein the original data table is stored in the buffer pool;
step S1103, performing scanning time prediction according to the physical space, the preset memory budget, the preset disk I/O bandwidth, the page eviction overhead, the proportion of the buffer pool consumed by memory alignment, the compression rate at the second batch size, and the decompression speed at the second batch size, to obtain the scanning time of the original compressed data table.
In steps S1101 to S1103 illustrated in this embodiment, because a memory alignment, i.e., "punching", operation must be performed within the database pages for the original compressed data table, its space scanning overhead also has to be accounted for. The scanning time of the original compressed data table is calculated as shown in Equations (2) to (4), which predict the scanning time from the physical space of the original data table, the proportion of the buffer pool consumed by memory alignment, the preset memory budget, the preset disk I/O bandwidth, the page eviction overhead, the compression rate at the given batch size, the decompression speed at that batch size, the physical space of the row data in a data page, the physical space at that batch size, the compressed line length, the line length before compression, and the batch size itself. Since the scanning time is determined by the compression rate and the decompression speed at a given batch size, the batch size corresponding to the minimum scanning time is taken as the target batch size, so that compressing the data table according to the target batch size achieves a balance between compression rate and decompression speed.
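In the same spirit, the sketch below stands in for Equations (2) to (4): it derives the compressed footprint from the compression rate, shrinks the usable buffer pool by the alignment loss, and adds disk I/O, page eviction, and decompression terms. The algebra and the compression-rate convention are assumptions for illustration; the concrete Equations (2) to (4) are those of the original filing.

```c
/* Illustrative model for the compressed table: smaller I/O volume, a buffer
 * pool reduced by the memory-alignment ("punching") loss, and an extra
 * decompression term that depends on the batch size through the compression
 * rate and the decompression speed at that batch size. */
static double predict_scan_time_compressed(double row_space_bytes,       /* physical space of the row data     */
                                           double compression_rate,      /* assumed: compressed / original size */
                                           double align_loss_fraction,   /* buffer pool lost to alignment      */
                                           double mem_budget_bytes,
                                           double io_bandwidth_bytes_per_s,
                                           double evict_cost_s_per_byte,
                                           double decompress_bytes_per_s)
{
    double compressed_bytes = row_space_bytes * compression_rate;   /* footprint at this batch size */
    double usable_budget    = mem_budget_bytes * (1.0 - align_loss_fraction);
    double spill = compressed_bytes > usable_budget
                 ? compressed_bytes - usable_budget
                 : 0.0;
    return compressed_bytes / io_bandwidth_bytes_per_s
         + spill * evict_cost_s_per_byte
         + compressed_bytes / decompress_bytes_per_s;
}
```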
Referring to fig. 12, an embodiment of the present application further provides a database-based data processing apparatus, which may implement the above database-based data processing method; the apparatus includes the following modules, whose cooperation is illustrated by the sketch after the list:
a first obtaining module 1201, configured to obtain a new data table and a data type of the new data table;
a screening module 1202, configured to screen a selected data table from a preset historical data table according to a data type of the newly added data table;
an extracting module 1203, configured to extract a preset batch size from the selected data table; wherein the preset batch size characterizes that, when the selected data table is obtained by compression according to the preset batch size, the scanning time of the selected data table is minimized;
a construction module 1204, configured to construct a preset batch range according to the preset batch size and the preset value; wherein the preset batch range includes a plurality of first batch sizes;
a first compression module 1205, configured to compress the newly added data table according to the first batch size to obtain at least one candidate data table;
a second obtaining module 1206, configured to obtain a first scanning time of each candidate data table, and take the first batch size corresponding to the smallest first scanning time as a target batch size;
a second compression module 1207, configured to compress the newly added data table according to the target batch size to obtain the target data table.
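The cooperation of these modules can be pictured with the C sketch below, which enumerates the first batch sizes in a preset range around the preset batch size, compresses one candidate table per size, and keeps the size with the smallest first scanning time; compress_with_batch() and scan_time() are hypothetical stand-ins for the first compression module and the second obtaining module, not APIs defined by this application.

```c
#include <stddef.h>

/* Hypothetical hooks standing in for the first compression module and the
 * scan-time measurement performed by the second obtaining module. */
typedef struct table table_t;
table_t *compress_with_batch(const table_t *new_table, size_t batch_size);
double   scan_time(const table_t *candidate);

/* Pick the target batch size from a preset range around the preset batch size. */
static size_t pick_target_batch_size(const table_t *new_table,
                                     size_t preset_batch, size_t range_delta)
{
    size_t lo = preset_batch > range_delta ? preset_batch - range_delta : 1;
    size_t hi = preset_batch + range_delta;

    size_t best_batch = lo;
    double best_time  = -1.0;

    for (size_t b = lo; b <= hi; b++) {            /* the first batch sizes   */
        table_t *candidate = compress_with_batch(new_table, b);
        double t = scan_time(candidate);           /* first scanning time     */
        if (best_time < 0.0 || t < best_time) {
            best_time  = t;
            best_batch = b;                        /* minimum scanning time   */
        }
    }
    return best_batch;                             /* target batch size       */
}
```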
The specific implementation of the database-based data processing apparatus is substantially the same as the specific embodiment of the database-based data processing method described above, and will not be described herein.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the database-based data processing method when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 13, fig. 13 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
a processor 1301, which may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided by the embodiments of the present application;
a memory 1302, which may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 1302 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present application are implemented in software or firmware, the relevant program code is stored in the memory 1302 and is invoked by the processor 1301 to perform the database-based data processing method of the embodiments of the present application;
an input/output interface 1303 for implementing information input and output;
the communication interface 1304 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);
A bus 1305 to transfer information between the various components of the device (e.g., the processor 1301, memory 1302, input/output interfaces 1303, and communication interfaces 1304);
wherein the processor 1301, the memory 1302, the input/output interface 1303 and the communication interface 1304 enable a communication connection between each other inside the device via a bus 1305.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the database-based data processing method when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate the following three cases: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b, or c may indicate: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, and thus do not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A database-based data processing method, the method comprising:
acquiring a new data table and the data type of the new data table;
screening a selected data table from a preset historical data table according to the data type of the newly added data table;
extracting a preset batch size from the selected data table; wherein the preset batch size characterizes that, when the selected data table is obtained by compression according to the preset batch size, the scanning time of the selected data table is minimized;
constructing a preset batch range according to the preset batch size and the preset value; wherein the preset batch range includes a plurality of first batch sizes;
compressing the newly added data table according to the first batch size to obtain at least one candidate data table;
acquiring a first scanning time of each candidate data table, and taking a first batch size corresponding to the minimum first scanning time as a target batch size;
and compressing the newly added data table according to the target batch size to obtain a target data table.
2. The method of claim 1, wherein prior to the obtaining the new data table and the data type of the new data table, the method further comprises:
Acquiring a preliminary data table; wherein the preliminary data sheet comprises at least one data page;
acquiring a plurality of second batch sizes; wherein the plurality of second batch sizes are 1 to N, respectively, N being the number of data pages that contain data;
compressing the preliminary data table according to the second batch size to obtain at least one compressed data table;
acquiring a second scanning time of each compressed data table, and taking a second batch size corresponding to the minimum second scanning time as a preset batch size;
compressing the preliminary data table according to the preset batch size to obtain a historical data table;
and storing the historical data table and the preset batch size together.
3. The method of claim 2, wherein prior to the obtaining the preliminary data table, the method further comprises:
acquiring an original data table;
compressing the original data table to obtain an original compressed data table;
acquiring the scanning time of the original data table and the scanning time of the original compressed data table;
and if the scanning time of the original data table is longer than that of the original compressed data table, taking the original data table as the preliminary data table.
4. The method of claim 1, wherein compressing the new data table according to the target batch size to obtain a target data table comprises:
acquiring data of each row in the newly added data table to obtain newly added row data;
compressing each newly added row of data according to a preset compression format and the target batch size to obtain at least one data frame and/or matching frame; wherein the preset compression format includes the data frame and the matching frame, the data frame including a non-repeated data length, non-repeated data, and a separator serving as an end position that delimits the non-repeated data, and the matching frame including an offset from the current position of the repeated data to the matching position, the total length of the repeated reference match, and the first matching length of the repeated data;
repeating the above steps until all of the newly added row data have been traversed, and combining at least one of the data frames and/or the matching frames to obtain the target data table.
5. The method according to claim 4, wherein compressing each of the newly added line data according to a preset compression format and the target batch size to obtain at least one data frame and/or matching frame comprises:
Traversing each newly added row of data through a sliding window, and calculating a hash value of the sliding window;
if the hash value does not exist in the shared table, the data traversed by the sliding window is non-repeated data, and the hash value, the offset of the current position of the non-repeated data in the compressed data and the pointer pointing to the current position of the uncompressed data are recorded in the shared table;
storing continuous non-repeated data as data frames according to the length of the non-repeated data and the non-repeated data;
if the hash value exists in the shared table, the data traversed by the sliding window is repeated data; the total length of the repeated reference match of the repeated data is obtained according to the corresponding pointer, recorded in the shared table, pointing to the current position of uncompressed data; the offset from the current position of the repeated data to the matching position is obtained according to the offset of the current position of the non-repeated data in the compressed data and the offset of the current position of the repeated data in the compressed data; and the first matching length of the repeated data is obtained according to the offset from the current position of the repeated data to the matching position;
and storing the repeated data as a matching frame according to the offset from the current position of the repeated data to the matching position, the total length of the repeated reference matching and the first matching length of the repeated data.
6. The method according to claim 4, wherein after said combining at least one of said data frames and/or said matching frames to obtain said target data table, said method further comprises:
acquiring a query plan;
analyzing the query plan to obtain decompressed data selection information;
selecting data from the target data table according to the decompressed data selection information to obtain target compressed data; wherein the target compressed data comprises at least one data frame and/or a matching frame;
traversing the data frames and the matching frames in the target compressed data in sequence;
copying non-repeated data in the data frame, acquiring the position and the length of the repeated data according to the matching frame, and copying the repeated data from the data frame according to the position and the length of the repeated data;
and sequentially combining all the non-repeated data and all the repeated data to obtain decompressed data.
7. A method according to claim 3, wherein the obtaining the scanning time of the original data table comprises:
acquiring the physical space of the original data table and the page eviction overhead of a database;
And predicting the scanning time according to the physical space, the preset memory budget, the preset disk I/O bandwidth and the page eviction overhead to obtain the scanning time of the original data table.
8. A method according to claim 3, wherein obtaining the scan time of the original compressed data table comprises:
acquiring the physical space of the original data table and the page eviction overhead of a database;
acquiring the proportion of the buffer pool consumed by memory alignment, the compression rate at the second batch size, and the decompression speed at the second batch size; wherein the original data table is stored in the buffer pool;
and performing scanning time prediction according to the physical space, the preset memory budget, the preset disk I/O bandwidth, the page eviction overhead, the proportion of the buffer pool consumed by memory alignment, the compression rate at the second batch size, and the decompression speed at the second batch size, to obtain the scanning time of the original compressed data table.
9. A database-based data processing apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a new data table and the data types of the new data table;
The screening module is used for screening a selected data table from a preset historical data table according to the data type of the newly added data table;
the extraction module is used for extracting the preset batch size from the selected data table; wherein the preset batch size characterizes that, when the selected data table is obtained by compression according to the preset batch size, the scanning time of the selected data table is minimized;
the construction module is used for constructing a preset batch range according to the preset batch size and the preset value; wherein the preset batch range includes a plurality of first batch sizes;
the first compression module is used for compressing the newly added data table according to the first batch size to obtain at least one candidate data table;
the second acquisition module is used for acquiring the first scanning time of each candidate data table and taking the first batch size corresponding to the minimum first scanning time as a target batch size;
and the second compression module is used for compressing the newly-added data table according to the target batch size to obtain a target data table.
10. An electronic device comprising a memory storing a computer program and a processor implementing the database-based data processing method according to any one of claims 1 to 8 when the computer program is executed by the processor.
CN202311479500.3A 2023-11-08 2023-11-08 Data processing method and device based on database and electronic equipment Active CN117194355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311479500.3A CN117194355B (en) 2023-11-08 2023-11-08 Data processing method and device based on database and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311479500.3A CN117194355B (en) 2023-11-08 2023-11-08 Data processing method and device based on database and electronic equipment

Publications (2)

Publication Number Publication Date
CN117194355A CN117194355A (en) 2023-12-08
CN117194355B true CN117194355B (en) 2024-02-13

Family

ID=89001984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311479500.3A Active CN117194355B (en) 2023-11-08 2023-11-08 Data processing method and device based on database and electronic equipment

Country Status (1)

Country Link
CN (1) CN117194355B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609491A (en) * 2012-01-20 2012-07-25 东华大学 Column-storage oriented area-level data compression method
CN112182034A (en) * 2019-07-03 2021-01-05 河南许继仪表有限公司 Data compression method and device
CN116166197A (en) * 2023-02-16 2023-05-26 阿里巴巴(中国)有限公司 Data storage method, system, storage node and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190081637A1 (en) * 2017-09-08 2019-03-14 Nvidia Corporation Data inspection for compression/decompression configuration and data type determination

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609491A (en) * 2012-01-20 2012-07-25 东华大学 Column-storage oriented area-level data compression method
CN112182034A (en) * 2019-07-03 2021-01-05 河南许继仪表有限公司 Data compression method and device
CN116166197A (en) * 2023-02-16 2023-05-26 阿里巴巴(中国)有限公司 Data storage method, system, storage node and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Adaptive Column Compression Family for Self-Driving Databases;Marcell Fehér等;《arXiv》;第1-11页 *
Huawei Cloud GaussDB(for Influx) Revealed, Issue 2: Demystifying Data Compression in GaussDB(for Influx); GeminiDB Influx blog; 《https://bbs.huaweicloud.com/blogs/314775》; pages 1-14 *

Also Published As

Publication number Publication date
CN117194355A (en) 2023-12-08

Similar Documents

Publication Publication Date Title
US7240069B2 (en) System and method for building a large index
US11151126B2 (en) Hybrid column store providing both paged and memory-resident configurations
CN107682016B (en) Data compression method, data decompression method and related system
CA2876466A1 (en) Scan optimization using bloom filter synopsis
US8847797B1 (en) Byte-aligned dictionary-based compression and decompression
CN109684290B (en) Log storage method, device, equipment and computer readable storage medium
CN108628898B (en) Method, device and equipment for data storage
US11204873B2 (en) Pre-decompressing a compressed form of data that has been pre-fetched into a cache to facilitate subsequent retrieval of a decompressed form of the data from the cache
CN113553300B (en) File processing method and device, readable medium and electronic equipment
CN115208414A (en) Data compression method, data compression device, computer device and storage medium
CN110995273A (en) Data compression method, device, equipment and medium for power database
CN109360605B (en) Genome sequencing data archiving method, server and computer readable storage medium
CN109302449B (en) Data writing method, data reading device and server
CN117194355B (en) Data processing method and device based on database and electronic equipment
US20220199202A1 (en) Method and apparatus for compressing fastq data through character frequency-based sequence reordering
US20170097981A1 (en) Apparatus and method for data compression
CN112579357B (en) Snapshot difference obtaining method, device, equipment and storage medium
CN109271463B (en) Method for recovering inodb compressed data of MySQL database
US10797724B2 (en) Method and apparatus for processing data
CN113779932A (en) Digital formatting method, device, terminal equipment and storage medium
CN113836157A (en) Method and device for acquiring incremental data of database
CN112054805A (en) Model data compression method, system and related equipment
CN112184027A (en) Task progress updating method and device and storage medium
CN111538730A (en) Data statistics method and system based on Hash bucket algorithm
CN104954280A (en) Data message processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant