CN115774699B - Database shared dictionary compression method and device, electronic equipment and storage medium


Info

Publication number: CN115774699B (granted); earlier publication CN115774699A
Application number: CN202310045920.4A
Authority: CN (China)
Prior art keywords: data, dictionary, database, page, training
Legal status: Active (application granted)
Original language: Chinese (zh)
Inventors: 林科旭, 张程伟, 张皖川
Assignee: Primitive Data Beijing Information Technology Co ltd

Abstract

The embodiment of the application discloses a database shared dictionary compression method and apparatus, an electronic device, and a storage medium, relating to the technical field of data compression. A write operation is performed on a data page to write data into its data rows; after the written data reaches a preset threshold, a dictionary is trained on that data. First metadata stored in the data page records the mapping relation between the data page and the dictionary, and the trained dictionary is stored in a separate dictionary file. Finally, the corresponding dictionary is selected from the dictionary file according to the mapping relation to compress the written data of the page's data rows, with the first metadata kept uncompressed throughout. This effectively reduces the number of decompressions; training the dictionary on a small amount of written data improves dictionary-training efficiency; and storing dictionaries in an independent file that can be cached in memory for convenient query and management effectively improves database performance.

Description

Database shared dictionary compression method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data compression technologies, and in particular, to a method and apparatus for compressing a shared dictionary of a database, an electronic device, and a storage medium.
Background
With the continuous development of the information age, information data grows exponentially, and data must be processed during transmission, storage, and so on, so data compression is effective for saving space and reducing cost. Database compression stores the contents of a database in compressed form to save space, and dictionary compression is currently the most widely used compression algorithm; however, the database dictionary compression algorithms in the related art suffer from problems such as a low compression rate and the difficulty of managing dictionaries that are stored together with the data.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. Accordingly, the embodiments of the application provide a database shared dictionary compression method and apparatus, an electronic device, and a storage medium, which can improve the compression rate of database data. At the same time, data pages are associated with dictionaries through a mapping relation, so that the corresponding dictionary for compressing and decompressing data can be found conveniently and quickly, which improves database performance and also facilitates storage management of the dictionaries.
In a first aspect, an embodiment of the present application provides a method for compressing a shared dictionary of a database, where the database includes a plurality of database tables, the database tables include a plurality of data pages, and the data pages include a plurality of data rows, and the method includes:
performing a write operation on the data page, the write operation being used to write the write data into a plurality of the data rows;
training at least one dictionary by using the written data after the written data of the database table reaches a preset threshold, where first metadata is stored in the data page and second metadata is stored in the data row; the first metadata is used to store the mapping relation between the dictionary and at least one data page, and the second metadata is used to store attribute information of the data row;
storing the dictionary after training into a dictionary file;
and based on the mapping relation, selecting the corresponding dictionary from the dictionary file to compress the writing data of the data row in the data page.
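The four steps above can be sketched end-to-end. This is a minimal illustration under stated assumptions, not the patented implementation: all names (`write_rows`, `train_dictionary`, and so on) are hypothetical, and the "dictionary" here is a simple token-substitution table rather than a production codec such as zstd.

```python
# Minimal sketch of the four-step flow: write -> train -> store -> compress.
# All names are hypothetical; the "dictionary" is a token-substitution table.

def write_rows(page, rows):
    """Step 1: write data rows into a data page."""
    page["rows"].extend(rows)

def train_dictionary(pages):
    """Step 2: build one shared dictionary from the pages' written data."""
    tokens = [tok for p in pages for row in p["rows"] for tok in row.split()]
    freq = {}
    for t in tokens:
        freq[t] = freq.get(t, 0) + 1
    # Tokens that repeat get an index; the index sequence acts as the dictionary.
    common = [t for t, c in sorted(freq.items(), key=lambda kv: -kv[1]) if c > 1]
    return {t: str(i + 1) for i, t in enumerate(common)}

def store_dictionary(dict_file, dictionary):
    """Step 3: persist the dictionary in a separate dictionary file."""
    dict_file.append(dictionary)
    return len(dict_file) - 1          # offset used by the mapping relation

def compress_page(page, dict_file):
    """Step 4: find the dictionary via the page's first metadata, compress rows."""
    d = dict_file[page["meta"]["dict_offset"]]
    page["rows"] = [" ".join(d.get(tok, tok) for tok in row.split())
                    for row in page["rows"]]

page = {"rows": [], "meta": {}}
dict_file = []
write_rows(page, ["San Jose Street", "San Francisco Street"])
offset = store_dictionary(dict_file, train_dictionary([page]))
page["meta"]["dict_offset"] = offset   # first metadata: page -> dictionary mapping
compress_page(page, dict_file)
print(page["rows"])
```

Note how the first metadata (`dict_offset`) itself stays uncompressed, so the page can always locate its dictionary.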
In some embodiments of the present application, after the writing data of the database table reaches a preset threshold, training at least one dictionary by using the writing data further includes:
acquiring the uncompressed quantity of the uncompressed data pages in the database table;
when the uncompressed quantity reaches a preset quantity threshold, taking the writing data of the uncompressed data page as dictionary training data;
and inputting the dictionary training data into a dictionary generating model to generate a plurality of dictionaries, where the number of the dictionaries is a preset number.
In some embodiments of the present application, the inputting the dictionary training data into a dictionary generating model generates a plurality of dictionaries, and further includes:
generating the preset quantity and the dictionary size according to the dictionary training data and the preset compression rate;
inputting the dictionary training data into a dictionary generation model, and generating a plurality of dictionaries based on the preset quantity and the dictionary size;
the first metadata of the data page to the dictionary is generated based on the mapping relationship.
In some embodiments of the present application, after the uncompressed number reaches a preset number threshold, the writing data of the uncompressed data page is used as dictionary training data, and further includes:
taking the uncompressed written data of the data page as initial dictionary training data;
and selecting the initial dictionary training data according to a preset selection strategy to obtain the dictionary training data.
In some embodiments of the present application, the selecting, according to a preset selection policy, the dictionary training data from the initial dictionary training data further includes:
acquiring a training data threshold;
and, according to the training data threshold, either randomly selecting the corresponding amount of data from the initial dictionary training data or selecting the corresponding amount of data from a preset position, as the dictionary training data.
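The two selection strategies above (random sampling versus taking a run of data from a preset position) can be sketched as follows; the function name and parameters are hypothetical, and the threshold is modeled simply as a number of samples to keep.

```python
import random

def select_training_data(initial_data, threshold, strategy="random", start=0):
    """Pick dictionary-training samples from the uncompressed pages' data.

    initial_data: write data (e.g. row payloads) from uncompressed pages.
    threshold:    training-data threshold, i.e. how many samples to keep.
    strategy:     "random"   -> random sample of `threshold` items, or
                  "position" -> `threshold` items starting at a preset position.
    """
    if threshold >= len(initial_data):
        return list(initial_data)          # nothing to trim
    if strategy == "random":
        return random.sample(initial_data, threshold)
    return initial_data[start:start + threshold]   # preset-position selection

rows = [f"row-{i}" for i in range(100)]
sample = select_training_data(rows, 10)
print(len(sample))
print(select_training_data(rows, 3, strategy="position", start=5))
```

Training on such a subset, rather than on all written data, is what keeps dictionary training fast.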
In some embodiments of the present application, after the compressing the write data of the data line in the data page by selecting the corresponding dictionary from the dictionary file based on the mapping relationship, the method further includes:
if the compression ratio of the data page does not reach a preset compression threshold, taking the written data of the data page as alternative data;
acquiring newly added write-in data;
training the dictionary with the newly added write data and the alternative data to update the dictionary of the data page;
and compressing the data page by using the updated dictionary.
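The retraining path above (retrain when a page's compression ratio misses the preset threshold, using the old write data as alternative data plus newly added writes) can be sketched as below. The names and the stand-in trainer are hypothetical; a real implementation would plug in an actual dictionary trainer.

```python
def maybe_retrain(page, compression_ratio, threshold, new_writes, train):
    """If a page's compression ratio misses the preset threshold, retrain its
    dictionary on the page's old write data (alternative data) plus new writes."""
    if compression_ratio >= threshold:
        return None                         # dictionary is good enough; keep it
    candidates = list(page["rows"])         # old write data as alternative data
    candidates += new_writes                # newly added write data
    page["dictionary"] = train(candidates)  # retrain and update the page's dictionary
    return page["dictionary"]

page = {"rows": ["San Jose", "San Francisco"], "dictionary": None}
trained = maybe_retrain(page, compression_ratio=1.1, threshold=2.0,
                        new_writes=["San Mateo"],
                        train=lambda data: {"San": "1"})   # stand-in trainer
print(trained)
```

The page is then recompressed with the updated dictionary, as the text describes.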
In some embodiments of the present application, the first metadata is stored in a first preset location of the data page, and the first metadata includes: one or more of dictionary file name, file offset, dictionary length.
In some embodiments of the present application, the second metadata is stored in a second preset location of the data line, and the second metadata includes: line data length and/or line transaction information.
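The two metadata records described above can be sketched as plain structures; the field names mirror the text (dictionary file name, file offset, dictionary length; row data length, row transaction information), but the exact on-page encoding is not specified in the source, so this layout is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FirstMetadata:
    """Stored at a first preset location of each data page: locates the shared dictionary."""
    dictionary_file_name: str   # which dictionary file to open
    file_offset: int            # offset of this page's dictionary inside that file
    dictionary_length: int      # length of the dictionary at that offset

@dataclass
class SecondMetadata:
    """Stored at a second preset location of each data row: row attribute information."""
    row_data_length: int                 # length of the (compressed) row payload
    row_transaction_info: Optional[str]  # e.g. a transaction id; optional

meta = FirstMetadata("orders.dict", file_offset=4096, dictionary_length=1024)
print(meta.dictionary_file_name, meta.file_offset, meta.dictionary_length)
```

Keeping these records uncompressed is what lets a reader locate the right dictionary before any decompression happens.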
In some embodiments of the present application, the mapping relationship between the dictionary and at least one of the data pages is generated according to a preset mapping rule, where the preset mapping rule includes a continuous mapping rule, a discontinuous mapping rule, or a content-related mapping rule;
when the preset mapping rule is a continuous mapping rule, selecting a continuous first number of data pages to be associated with the same dictionary;
when the preset mapping rule is a discontinuous mapping rule, selecting a discontinuous second number of data pages to be associated with the same dictionary;
and when the preset mapping rule is a content-related mapping rule, selecting a third number of data pages with the written data having relevance to be associated with the same dictionary.
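The three mapping rules above can be sketched as one function that assigns pages to dictionary ids. The interpretation of "discontinuous" as round-robin striping and of "content-related" as grouping by a caller-supplied key are assumptions for illustration; the function name and parameters are hypothetical.

```python
def map_pages_to_dictionaries(pages, rule, group_size=2, key=None):
    """Associate data pages with dictionary ids under one of three mapping rules.

    "continuous"    : consecutive runs of `group_size` pages share a dictionary.
    "discontinuous" : pages are striped round-robin over the dictionaries.
    "content"       : pages whose written data is related (same `key(page)`)
                      share a dictionary.
    Returns {page_index: dictionary_id}.
    """
    if rule == "continuous":
        return {i: i // group_size for i in range(len(pages))}
    if rule == "discontinuous":
        n_dicts = (len(pages) + group_size - 1) // group_size
        return {i: i % n_dicts for i in range(len(pages))}
    if rule == "content":
        groups, mapping = {}, {}
        for i, page in enumerate(pages):
            mapping[i] = groups.setdefault(key(page), len(groups))
        return mapping
    raise ValueError(rule)

pages = ["aa", "ab", "ba", "bb"]
print(map_pages_to_dictionaries(pages, "continuous"))
print(map_pages_to_dictionaries(pages, "content", key=lambda p: p[0]))
```

Whichever rule is chosen, the resulting page-to-dictionary association is what the first metadata records.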
In some embodiments of the present application, the first metadata of the data page and the second metadata of the data row in each of the data pages are not compressed when the write data of the data page in the database table is compressed.
In some embodiments of the present application, after the compressing the write data of the data line in the data page by selecting the corresponding dictionary from the dictionary file based on the mapping relationship, the method further includes: decompressing the compressed data line to obtain the corresponding written data;
The decompression process comprises the following steps:
reading the first metadata of the data page to obtain the mapping relation;
searching the corresponding dictionary in the dictionary file by utilizing the mapping relation;
decompressing the write data of the data line using the dictionary.
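The three decompression steps above can be sketched as follows. This is a hedged illustration: the metadata keys and the in-memory representation of cached dictionary files are hypothetical, and the "dictionary" is again a token-substitution table.

```python
def decompress_row(page_meta, dict_files, compressed_row):
    """Decompress one data row following the three steps above.

    page_meta:  first metadata of the page (dictionary file name, offset).
    dict_files: {file name: [dictionary, ...]}, standing in for dictionary
                files cached in memory.
    A dictionary here maps index -> original token.
    """
    # Step 1: read the first metadata to obtain the mapping relation.
    file_name, offset = page_meta["dict_file"], page_meta["offset"]
    # Step 2: look up the corresponding dictionary in the dictionary file.
    dictionary = dict_files[file_name][offset]
    # Step 3: decompress the row's write data with that dictionary.
    return " ".join(dictionary.get(tok, tok) for tok in compressed_row.split())

dict_files = {"orders.dict": [{"1": "San", "2": "Street"}]}
meta = {"dict_file": "orders.dict", "offset": 0}
print(decompress_row(meta, dict_files, "1 Jose 2"))
```

Because the first metadata is never compressed, steps 1 and 2 need no decompression themselves, which is what keeps the lookup cheap.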
In a second aspect, an embodiment of the present application further provides a database shared dictionary compression apparatus, including:
a write module for performing a write operation on a page of data, the write operation for writing write data to a plurality of rows of data;
the training module is used for training at least one dictionary by using the written data after the written data of the database table reaches a preset threshold;
the storage module is used for storing the dictionary after training into a dictionary file;
and the compression module is used for selecting the corresponding dictionary from the dictionary file based on the mapping relation and compressing the written data of the data line in the data page.
In a third aspect, an embodiment of the present application further provides an electronic device, including a memory, and a processor, where the memory stores a computer program, and the processor implements the method for compressing a database sharing dictionary according to the embodiment of the first aspect of the present application when executing the computer program.
In a fourth aspect, embodiments of the present application further provide a computer readable storage medium storing a program, where the program is executed by a processor to implement a method for compressing a database shared dictionary according to embodiments of the first aspect of the present application.
The embodiments of the application include at least the following beneficial effects. The embodiments provide a database shared dictionary compression method and apparatus, an electronic device, and a storage medium. A write operation is first performed on a data page in a database table to write data into the page's data rows; after the written data reaches a preset threshold, a dictionary is trained with it. First metadata stored in the data page records the mapping relation between the data page and the dictionary, and the trained dictionary is stored in a separate dictionary file. Finally, the corresponding dictionary is selected from the dictionary file according to the mapping relation to compress the written data of the page's data rows.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, wherein:
FIG. 1 is a flow chart of a method for compressing a shared dictionary of a database according to one embodiment of the present application;
fig. 2 is a schematic flow chart of step S102 in fig. 1;
fig. 3 is a schematic flow chart of step S203 in fig. 2;
FIG. 4 is a schematic diagram of a first metadata mapping provided in one embodiment of the present application;
fig. 5 is a schematic flow chart of step S202 in fig. 2;
FIG. 6 is a flow chart of step S402 in FIG. 5;
fig. 7 is a schematic flow chart after step S104 in fig. 1;
fig. 8 is a schematic flow chart after step S104 in fig. 1;
FIG. 9 is a schematic diagram of a mapping relationship provided in one embodiment of the present application;
FIG. 10 is a schematic diagram of a database shared dictionary compression apparatus according to one embodiment of the present application;
FIG. 11 is a flow chart of a database sharing dictionary creation application provided in one embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: writing module 100, training module 200, storage module 300, compression module 400, electronic device 1000, processor 1001, memory 1002.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
In the description of the present application, it should be understood that references to orientation descriptions, such as directions of up, down, front, back, left, right, etc., are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present application and simplifying the description, and do not indicate or imply that the apparatus or element referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present application.
In the description of the present application, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding", and the like are understood to exclude the stated number, while "above", "below", "within", and the like are understood to include it. Descriptions of "first" and "second" are only for distinguishing technical features and should not be construed as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical solution.
For a better understanding of the technical solutions provided in the present application, terms appearing herein will be described accordingly:
data compression: the method is a technical method for reducing the data volume to reduce the storage space and improving the transmission, storage and processing efficiency on the premise of not losing information. Or reorganizing the data according to a certain algorithm, so as to reduce redundancy and storage space of the data.
Database: a database is a "repository" that organizes, stores, and manages data according to a data structure, and is a collection of large amounts of data that are stored in a computer for a long period of time, organized, sharable, and uniformly managed.
Database table: table, which contains all the data in the database, is the object in the database for storing data, is the set of structured data, and is the basis of the whole database system.
Data page: page, the basic unit of exchange between disk and memory in a database and the basic unit in which the database manages storage space; it represents the unit in which the database writes data to disk, and is generally 8 KB, 16 KB, etc.
Metadata: the descriptive information of data and information resources, mainly describing data attribute information, is used for supporting functions such as indicating storage positions, historical data, resource searching, file recording and the like.
Compression rate (compression ratio): a measure of how much compression shrinks the data; the higher the compression rate, the smaller the data after compression.
Disk IO: disk input and output (IO, short for Input and Output); refers to the speed at which bytes are read and written, i.e., the read-write capability of the disk.
Persistence: a mechanism for transferring program data between a transient state and a persistent state, i.e., storing data (e.g., objects in memory) in a storage device that can preserve it permanently (e.g., disk). The primary application of persistence is storing in-memory objects in a database, a disk file, an XML data file, and the like; that is, transient data (e.g., data in memory that is not permanently preserved) is persisted into persistent data (e.g., into a database, where it is permanently preserved).
Cache: memory that can exchange data at high speed; it exchanges data with the CPU ahead of main memory and is therefore fast.
Bytes: byte, abbreviated as B, is a unit of measure used by computer information technology to measure storage capacity, and also represents data types and language characters in some computer programming languages, one Byte stores 8-bit unsigned numbers, and the stored numerical value ranges from 0 to 255. Wherein, 1 Byte (Byte) =8 bits (bit), 1KB (Kilobyte ) =1024b, 1MB (Megabyte) =1024kb, 1gb (Gigabyte ) =1024 MB.
Database compression stores the contents of a database in compressed form to save space. Compressing data in the database has at least the following advantages. 1. Less space is used, saving storage costs: studies have shown that in practice storing data costs more than processors and memory, and storage requirements grow faster than computational needs. 2. In many business scenarios the database's performance bottleneck is disk IO; if the IO operates on compressed data instead of the original uncompressed data, the amount of IO can be greatly reduced, easing the bottleneck and improving database performance. 3. The cache hit rate is improved: modern server storage hardware is quite capable, but memory is always limited and cannot hold all user data, so if the cached data is also compressed, more data can be cached, improving the cache hit rate and thus performance.
In the related art, database row-level compression techniques mainly include table-level dictionaries and page-level dictionaries. The table-level dictionary, represented mainly by DB2, creates a corresponding static dictionary for each database table and stores it in hidden data rows of that table; the static dictionary compresses the data of every data row in the table. However, the dictionary is not updated for data written after it is created, yet all data must still be compressed through it, so the compression rate decreases as written data grows; moreover, the dictionary size is fixed per table and cannot adapt to tables of different sizes. The page-level dictionary, a compression algorithm represented mainly by Oracle and DB2, creates a dictionary per data page and compresses and decompresses the data of each data row in that page with it; because a dictionary is stored in every data page, it occupies considerable space (for example, an 8 KB data page generally needs 512 B to 1 KB just to store the dictionary), thereby reducing the compression rate.
Based on the above, the embodiment of the application provides a method, a device, an electronic device and a storage medium for compressing a database shared dictionary, which can improve the compression rate and performance of database data and facilitate the searching and management of the dictionary.
Referring to fig. 1, an embodiment of the present application provides a database shared dictionary compression method applied to a database. It should be understood that the database includes a plurality of database tables, each database table includes a plurality of data pages, and each data page includes a plurality of records, that is, data rows. The database shared dictionary compression method includes, but is not limited to, the following steps S101 to S104.
Step S101, a write operation is performed on the data page.
In some embodiments, the write operation is used to write the write data into a plurality of data rows, that is, to store the write data into data pages of a database table in the database. Specifically, the write operation may be triggered by a user's input request, by the system's cached data, or by a system timer at a preset time, which is not limited in this embodiment. Further, in some embodiments, when a write operation is performed on a data page, it must also be determined whether the page has enough remaining space. For example, if 16 KB of write data is to be written into a 32 KB data page that already holds 24 KB, the page's remaining space is 8 KB, which is smaller than the 16 KB of write data, so the determination is that the data page does not have enough remaining space.
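The free-space check described above reduces to a one-line comparison; this small sketch (with hypothetical names) reproduces the 32 KB / 24 KB / 16 KB example from the text.

```python
def has_free_space(page_size, used, incoming):
    """Check whether a data page can accept `incoming` bytes of write data."""
    return page_size - used >= incoming

KB = 1024
# The example from the text: 16 KB of write data, a 32 KB page already holding
# 24 KB. Only 8 KB remain, so the page does not have enough space.
print(has_free_space(32 * KB, 24 * KB, 16 * KB))
print(has_free_space(32 * KB, 8 * KB, 16 * KB))
```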
Step S102, training at least one dictionary by using the written data after the written data of the database table reaches a preset threshold.
In some embodiments, the written data of the database table reaches the preset threshold when, for example, its amount reaches the amount stored by a preset number of data pages, or reaches a compressed-data-amount threshold adaptively selected by the database algorithm. Specifically, for the zstd compression algorithm, the lz4 compression algorithm, or other compression algorithms, a relatively high compression rate (more space saved after compression) usually means a slow decompression speed, and vice versa, so a high compression rate and a fast speed are difficult to achieve at the same time. Database algorithm adaptation means the algorithm must be selected for the specific situation: the database file can be divided into blocks that are then compressed with a suitable algorithm; alternatively, a compression format that allows splitting and recompression can be chosen, or a container file format that supports splitting and compression, such as Avro or Parquet, can be combined with other compression formats to achieve the desired speed and compression rate. Hence, under adaptive selection, different compression algorithms correspond to different compressed-data-amount thresholds. In some embodiments, at least one dictionary is trained with the written data.
In some embodiments, each data page stores first metadata and each data row stores second metadata. Specifically, the first metadata stores the mapping relation between the dictionary and a preset number of data pages; for example, if ten data pages share one dictionary, the mapping relation for that dictionary is stored in each of those ten data pages. The second metadata stores attribute information of the corresponding data row.
Step S103, storing the trained dictionary into a dictionary file.
In some embodiments, after dictionary training is completed, the dictionary is stored in a separate dictionary file for persistence. It should be understood that the dictionary is stored apart from the database data: it resides neither in the database table nor in the data page, but in an independent dictionary file. Further, the dictionary file is cached in memory. It should be understood that a cache is a memory chip on the hard-disk controller with an extremely fast access speed, acting as a buffer between the internal storage of the hard disk and its external interface. Because the internal and external data transfer speeds of a hard disk differ, a buffer is needed inside the disk. The size and speed of the cache are important factors directly related to the disk's transfer speed and can greatly improve its overall performance. When the hard disk accesses fragmented data, data must be exchanged continuously between the disk and memory; with a large cache, the fragmented data can be held temporarily in the cache, improving the data transfer speed. The hard-disk cache has three main functions. The first is read-ahead: when the processor instructs the disk to start reading data, the control chip on the disk directs the head to read the next cluster or clusters after the requested cluster into the cache; when that data is subsequently needed, the disk need not read it again, and the data is transferred from the cache into memory directly. Since the cache is far faster than the head's read-write speed, performance improves markedly. The second function is caching write operations, and the third is temporarily storing recently accessed data.
Therefore, in the embodiment, the corresponding dictionary can be searched by reading the dictionary file of the memory, so that the read data can be conveniently and rapidly compressed or decompressed, and the performance of the database is effectively improved.
In some embodiments, since a database table includes a plurality of data pages, it may be preset that a certain number of data pages share one dictionary; one database table therefore generally corresponds to a plurality of dictionaries, and one dictionary file may store a plurality of dictionaries. For example, if the database table stores the write data of 1024 data pages and every 32 data pages share one dictionary, the dictionary file corresponding to the table should store 1024/32 = 32 dictionaries. It should be understood that each database table corresponds to an independent dictionary file, the name of the dictionary file is consistent with the name of the database table, and the table's dictionaries are arranged sequentially in the dictionary file.
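With dictionaries laid out sequentially in the per-table dictionary file, finding a page's dictionary is simple integer arithmetic. This sketch (hypothetical names, continuous-mapping assumption) reproduces the 1024-page, 32-pages-per-dictionary example.

```python
def dictionary_index(page_no, pages_per_dict=32):
    """Index of the dictionary, within the table's dictionary file, for a page.

    Assumes consecutive groups of `pages_per_dict` pages share one dictionary
    and the dictionaries are arranged sequentially in the file.
    """
    return page_no // pages_per_dict

# 1024 pages, 32 pages per dictionary -> 1024/32 = 32 dictionaries in the file.
print(dictionary_index(0), dictionary_index(31), dictionary_index(32))
print(dictionary_index(1023) + 1)
```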
In some embodiments, once a dictionary has been persisted into the dictionary file, its training is considered complete.
Step S104, selecting a corresponding dictionary from the dictionary file to compress the written data of the data line in the data page.
In some embodiments, according to the first metadata stored in the data page and based on the mapping relation that the first metadata stores, the corresponding dictionary is selected from the dictionary file to compress the write data of the data rows in the data page. It will be appreciated that to compress a data page's write data, the corresponding dictionary is first located from the first metadata stored in the page, and the data is then compressed with that dictionary.
The write data is written into the data rows by a write operation; after the write data reaches a preset threshold, the dictionary is trained with it and the trained dictionary is stored in an independent dictionary file; first metadata stored in the data page records the mapping relation between the dictionary and the data page, so that the corresponding dictionary can be found from the mapping relation when the write data is compressed. The dictionary is thus easy to store and manage as well as to look up, and setting the preset threshold effectively improves the dictionary's training speed.
As shown with reference to fig. 2, in some embodiments of the present application, the step S102 may further include, but is not limited to, the following steps S201 to S203.
In step S201, the uncompressed number of uncompressed data pages in the database table is acquired.
In some embodiments, the acquisition of uncompressed data pages in the database table may be triggered by a compression request, and then the number of uncompressed data pages is acquired, e.g., the written data for 16 or 32 data pages in the database table is not compressed, and the corresponding number is acquired by statistics.
In step S202, after the uncompressed number reaches the preset number threshold, the written data of the uncompressed data pages is used as dictionary training data.
In some embodiments, the preset number threshold may be set to 20 data pages. It can be understood that when the write data of one data page is 16 KB, the write data of the 20 data pages of the preset number threshold is 320 KB, that is, the preset threshold of write data is 320 KB, so the write data of those 20 uncompressed data pages may be used as the dictionary's training data.
Step S203, inputting dictionary training data into a dictionary generating model to generate a plurality of dictionaries.
In some embodiments, a plurality of dictionaries are generated through a dictionary generation model. The number actually generated is a preset number, and the model can be configured by the user, for example to generate different dictionaries for different compression algorithms and keep the dictionary with the highest compression rate. It may be understood that the dictionary generation model is a neural network model that compresses according to the frequency of the fields of the data in the database table; the compression algorithm may be zstd, lzw, or lz4, which is not limited in this application. Specifically, referring to the write data of the data page shown in Table 1 below: Boo appears three times in the NAME column; Street appears 7 times in the ADDRESS column; in the CITY column, San appears 7 times with the highest frequency, Francisco three times, and Jose four times; and CA 9 appears 7 times in the STATE ZIP column. The dictionary generation model therefore performs compression training by counting the frequency of each field, obtaining the dictionary of Table 2. It will be appreciated that the fields occurring multiple times in the original data page are represented by an index sequence in the dictionary, specifically 1 representing Boo, 2 representing Street, 3 representing San, 4 representing Francisco, 5 representing Jose, and 6 representing CA 9; through this dictionary, the original data page may be compressed and represented as shown in Table 3, thereby compressing the data of the database.
TABLE 1
[Table 1 is rendered as an image in the original publication: sample rows of the data page with NAME, ADDRESS, CITY, and STATE ZIP columns.]
TABLE 2
[Table 2 is rendered as an image: the trained dictionary mapping the frequent fields to indices 1 through 6.]
TABLE 3
[Table 3 is rendered as an image: the data page of Table 1 compressed using the dictionary of Table 2.]
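The frequency-based index dictionary of Tables 1 through 3 can be sketched in pure Python: tokens that repeat often enough are assigned small integer indices, and rows are rewritten using those indices. The sample rows and the `min_freq` cutoff below are illustrative values, not taken from the patent.

```python
from collections import Counter

def build_dictionary(rows, min_freq=3):
    """Map tokens appearing at least min_freq times to small integer
    indices, mimicking the index sequence of Table 2 (1=Boo, 2=Street...)."""
    counts = Counter(tok for row in rows for tok in row.split())
    frequent = [tok for tok, n in counts.items() if n >= min_freq]
    return {tok: i + 1 for i, tok in enumerate(frequent)}

def compress_row(row, dictionary):
    # Replace each frequent token with its index; keep rare tokens verbatim.
    return [dictionary.get(tok, tok) for tok in row.split()]

rows = [
    "Boo 21 Street San Francisco CA 94016",
    "Boo 22 Street San Jose CA 94088",
    "Boo 23 Street San Jose CA 94088",
]
d = build_dictionary(rows)
print(compress_row(rows[0], d))  # → [1, '21', 2, 3, 'Francisco', 4, '94016']
```

Decompression is the inverse lookup: indices are mapped back to their tokens through the same shared dictionary.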
The uncompressed number of uncompressed data pages in the database table is acquired, and when it reaches the preset number threshold, the write data of those pages is input as dictionary training data into the dictionary generation model and trained according to the compression algorithm to generate a preset number of dictionaries. Training the dictionary on a selected portion of the data effectively improves training speed, and by comparing the multiple dictionaries, the one with the highest compression rate is selected and stored for compression of subsequent database data, thereby reaching the optimal compression rate.
As shown in fig. 3, in some embodiments of the present application, the step S203 may further include, but is not limited to, the following steps S301 to S303.
Step S301, a preset number and a dictionary size are generated according to dictionary training data and a preset compression rate.
In some embodiments, the compression performance of the dictionary is ensured by setting a preset compression rate. It can be understood that, in generating the preset number and the dictionary size according to the dictionary training data and the preset compression rate, both values may be configured by the user, or the optimal compression rate may be reached through adaptive selection by a database algorithm.
Specifically, if the size of the dictionary is configured by the user, this is done by modifying the startup configuration file of the database, for example by modifying shared_dic_row_num. By default shared_dic_row_num=1000000, meaning one dictionary is shared per one million rows, and the user can configure the size of the dictionary by modifying this value. If the size of the dictionary is not configured by the user, the optimal compression rate is reached through adaptive selection by a database algorithm, for example by choosing, according to the data type of the write data, how many data rows share one dictionary.
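A minimal sketch of reading this startup parameter, assuming the configuration key is spelled shared_dic_row_num (the text prints it as "hared_dic_row_num", apparently with the leading character dropped); the function name and config representation are illustrative.

```python
DEFAULT_SHARED_DIC_ROW_NUM = 1_000_000  # one dictionary per million rows (default per the text)

def rows_per_dictionary(config: dict) -> int:
    """Return how many data rows share one dictionary, falling back to the
    default when the startup file does not set shared_dic_row_num."""
    value = config.get("shared_dic_row_num", DEFAULT_SHARED_DIC_ROW_NUM)
    if value <= 0:
        raise ValueError("shared_dic_row_num must be positive")
    return value

print(rows_per_dictionary({}))                                # → 1000000
print(rows_per_dictionary({"shared_dic_row_num": 500_000}))   # → 500000
```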
In step S302, dictionary training data is input into a dictionary generating model, and a plurality of dictionaries are generated based on a preset number and the dictionary size.
In some embodiments, after dictionary training data is input into the dictionary generation model, dictionaries are generated according to the preset number and dictionary size, so that the generated dictionaries meet the preset conditions; for example, the preset number may be 16 dictionaries, and the dictionary size may be 64K, 4MB, and so on. Generally speaking, the larger the dictionary, the better the compression effect and the smaller the compressed file, but the slower the compression and the more memory and processor resources are occupied during compression; a smaller dictionary compresses faster and occupies fewer resources, but the compression effect is worse and the compressed file is larger. The dictionary is therefore neither the smaller the better nor the larger the better: it should reach the optimal compression rate within a preset range without occupying excessive additional resources.
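This trade-off can be made concrete by generating several candidate dictionaries and keeping the one that compresses best. The sketch below uses Python's stdlib zlib, whose `compressobj` accepts a preset dictionary via `zdict`, as a stand-in for zstd's trained dictionaries; the sample data and the two candidate dictionaries are illustrative.

```python
import zlib

def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    # A preset dictionary seeds the compressor's history window,
    # analogous to a trained shared dictionary.
    c = zlib.compressobj(zdict=zdict)
    return c.compress(data) + c.flush()

def pick_best_dictionary(samples: list, candidates: list) -> bytes:
    """Return the candidate dictionary yielding the smallest total
    compressed size over the training samples (highest compression rate)."""
    def total(zdict: bytes) -> int:
        return sum(len(compress_with_dict(s, zdict)) for s in samples)
    return min(candidates, key=total)

samples = [b"San Jose CA 94088 Street"] * 4
small = b"Street"                           # small dictionary: fast, weak
large = b"San Jose CA 94088 Street " * 4   # larger dictionary: stronger
best = pick_best_dictionary(samples, [small, large])
print(len(compress_with_dict(samples[0], small)),
      len(compress_with_dict(samples[0], large)))
```

Here the larger dictionary wins because each sample is already contained in it, so the sample reduces to a single back-reference; a dictionary far larger than the data's repeated content would only add resource cost.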
Step S303, generating first metadata mapping the data page to the dictionary based on the mapping relation.
In some embodiments, the dictionary is stored in a separate dictionary file after training completes, so to compress or decompress the write data of a data page through the dictionary, the corresponding dictionary must be found in the dictionary file through the mapping relation, and the mapping relation is stored in the first metadata of the data page; first metadata mapping the data page to the dictionary therefore needs to be generated synchronously during dictionary generation. Specifically, the first metadata is stored at a first preset location of each data page, for example the tail or the head of the page, and may include: the dictionary file name, the file offset, and the dictionary length. It will be appreciated that the dictionary file name generally corresponds to the name of the database table, one database table corresponds to one dictionary file, and a plurality of dictionaries are stored in one dictionary file; the file offset represents the offset of the dictionary within the dictionary file, and the dictionary length is the length of the dictionary's data. Through the first metadata, the dictionary in the dictionary file corresponding to the data page can be conveniently and quickly located and found.
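A possible fixed-width layout for the first metadata at the tail of a data page might look as follows; the 32-byte name field, the 16KB page size, and the little-endian encoding are assumptions for illustration, not specified by the patent.

```python
import struct

# First-metadata record stored at the tail of a data page:
# a 32-byte dictionary file name, an 8-byte file offset, and
# an 8-byte dictionary length (little-endian).
META_FMT = "<32sQQ"
META_SIZE = struct.calcsize(META_FMT)  # 48 bytes

def write_first_metadata(page: bytearray, name: str, offset: int, length: int) -> None:
    record = struct.pack(META_FMT, name.encode().ljust(32, b"\0"), offset, length)
    page[-META_SIZE:] = record  # place the record at the page tail

def read_first_metadata(page) -> tuple:
    name, offset, length = struct.unpack(META_FMT, bytes(page[-META_SIZE:]))
    return name.rstrip(b"\0").decode(), offset, length

page = bytearray(16 * 1024)  # a 16KB data page
write_first_metadata(page, "employees.dic", offset=4096, length=2048)
print(read_first_metadata(page))  # → ('employees.dic', 4096, 2048)
```

Because the record has a fixed size at a fixed location, it can be read without decompressing the page body, which is what keeps dictionary lookup cheap.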
Referring to fig. 4, the left side of the diagram represents a database table in the database containing n data pages. The enlarged data page shown is page 1, which stores header information (page header); its free space is used to store write data, and the tail of page 1 stores the first metadata (meta). Specifically, the first metadata stores the dictionary file name, the file offset, and the dictionary length; according to this information, the compression dictionary corresponding to data page 1 is the first dictionary (dict 0) stored in the dictionary file, so the corresponding dictionary can be successfully found for compression or decompression.
After the dictionary training data is input into the dictionary generation model, the preset number and dictionary size are generated according to the dictionary training data and the preset compression rate, and a plurality of dictionaries are then generated. The dictionary size and preset number are selected through user configuration or adaptively by a database algorithm, which facilitates reaching the optimal compression rate. After generation, the dictionaries are stored in an independent dictionary file, and first metadata storing the mapping relation between data page and dictionary is synchronously generated in each data page. This facilitates storage and management of the dictionary and at the same time allows fast location and lookup, enabling compression and decompression of write data in the database.
Referring to fig. 5, in some embodiments of the present application, the step S202 may further include, but is not limited to, the following steps S401 to S402.
In step S401, the writing data of the uncompressed data page is used as the initial dictionary training data.
In some embodiments, the write data of uncompressed data pages is not completely random but exhibits certain correlations: for example, for a nationality field in the database, most data rows may be Chinese; for an age field, most values fall in the range 0-80; and for a gender field, the written values are male or female. The write data of uncompressed data pages is therefore used as initial dictionary training data for further analysis and processing, which helps improve the performance of the trained dictionary.
Step S402, selecting dictionary training data from the initial dictionary training data according to a preset selection strategy.
In some embodiments, relevant write data is selected from the initial dictionary training data according to a preset selection policy as the dictionary's training data. It can be understood that 10GB of write data in a database table may share one dictionary, that is, all 10GB is compressed and decompressed with the same dictionary; however, the database does not need to train the dictionary on the full 10GB, but instead samples it and selects a portion reaching the preset threshold for training.
The write data of uncompressed data pages is used as initial dictionary training data, from which dictionary training data is further selected according to a preset selection strategy. Compression can thus proceed on the basis of the selected partial data without losing compression rate, improving the training performance of the dictionary.
Referring to fig. 6, in some embodiments of the present application, the step S402 may further include, but is not limited to, the following steps S501 to S502.
In step S501, a training data threshold is acquired.
In some embodiments, if a large amount of write data is to be compressed or decompressed using the same dictionary, only a portion of it need be selected for training to generate the shared dictionary with which all of the write data is compressed or decompressed. It can be understood that the selected portion must reach the training data threshold; otherwise, a dictionary trained on too little write data will compress poorly. Specifically, when 10GB of write data shares one dictionary, the training data threshold may be set to 128MB or 256MB, which is not limited in this embodiment.
Step S502, selecting data with corresponding quantity and size from the initial dictionary training data randomly or at a preset position as dictionary training data.
In some embodiments, after the training data threshold is obtained, write data of the corresponding amount may be randomly selected from the initial dictionary training data as the dictionary training data, or write data of the corresponding amount may be selected from a preset position of the initial dictionary training data. Specifically, if the acquired training data threshold is 128MB, 128MB of write data may be randomly selected from the 10GB of write data, or the first 128MB or the last 128MB may be taken, which is not limited in this application.
By acquiring the training data threshold and then selecting write data of the corresponding amount, either sampled randomly or taken from a preset position, as dictionary training data, the dictionary is trained, and the remaining write data sharing the dictionary can subsequently be compressed or decompressed with it, so that no compression rate is lost while the training speed of the dictionary is improved.
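The two selection strategies, random sampling versus taking data from a preset position, can be sketched as follows; the toy page sizes and threshold stand in for the 16KB pages and 128MB threshold discussed above.

```python
import random

def select_training_data(pages: list, threshold: int,
                         strategy: str = "random", seed: int = 0) -> list:
    """Pick whole pages until the training-data threshold (in bytes) is met.
    'random' samples pages uniformly; 'head' takes them from the front."""
    order = list(range(len(pages)))
    if strategy == "random":
        random.Random(seed).shuffle(order)  # seeded for reproducibility
    picked, total = [], 0
    for i in order:
        if total >= threshold:
            break
        picked.append(pages[i])
        total += len(pages[i])
    return picked

pages = [bytes([i]) * 16 for i in range(64)]  # 64 toy "pages" of 16 bytes
sample = select_training_data(pages, threshold=128, strategy="head")
print(len(sample))  # → 8  (128 bytes / 16 bytes per page)
```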
Referring to fig. 7, in some embodiments of the present application, after the above step S104, if the compression ratio of a data page does not reach the preset compression threshold, the write data of that page is taken as candidate data, and the method may then further include, but is not limited to, the following steps S601 to S603.
Step S601, obtaining newly added write data.
In some embodiments, the newly added writing data is written into the database through a writing operation, and the newly added writing data can be acquired through a triggering condition such as a user request or a database command.
Step S602, training the dictionary by using the newly added writing data and the alternative data to update the dictionary of the data page.
In some embodiments, if the compression ratio of a data page does not reach the preset compression threshold, for example 80%, that is, the compression ratio achieved during compression is less than 80%, the page is considered to have failed compression, and its write data is taken as candidate data. It will be appreciated that the dictionary is then retrained with the newly added write data and the candidate data to update the dictionary of the data page whose compression failed, specifically updating the associated first metadata stored in that page.
In step S603, the data page is compressed using the updated dictionary.
In some embodiments, the data page is compressed by finding the corresponding updated dictionary in the dictionary file through the mapping relation stored in the first metadata. If the compressed data page reaches the preset compression threshold, compression is considered successful; if it still does not, the write data of the page continues to be used as candidate data, newly added write data is acquired again for training, and the process is repeated until the compressed page reaches the preset compression threshold.
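The retrain-until-threshold loop of steps S601 to S603 can be sketched as below. The chunk-counting "trainer" and the 50% space-saved threshold are stand-ins: a real implementation would use a proper dictionary builder (e.g. zstd's) and the 80% threshold the text mentions.

```python
import zlib
from collections import Counter

RATIO_THRESHOLD = 0.5  # demand at least 50% space saved (the text uses 80%)

def train(samples: list, dict_size: int = 256) -> bytes:
    # Toy trainer: keep 8-byte chunks that repeat across the samples,
    # up to dict_size bytes of dictionary.
    counts = Counter()
    for s in samples:
        for i in range(0, len(s) - 8, 8):
            counts[s[i:i + 8]] += 1
    chunks = [c for c, n in counts.most_common() if n > 1]
    return b"".join(chunks)[:dict_size]

def compress(page: bytes, zdict: bytes) -> bytes:
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(page) + c.flush()

def compress_page(page: bytes, new_writes: list, max_rounds: int = 3):
    """Retrain with the failed page (candidate data) plus newly added write
    data until the space saved reaches RATIO_THRESHOLD (steps S601-S603)."""
    samples = [page]                      # the candidate data
    for round_no in range(max_rounds):
        out = compress(page, train(samples))
        if 1 - len(out) / len(page) >= RATIO_THRESHOLD:
            return out                    # compression succeeded
        if round_no < len(new_writes):    # fold in newly added write data
            samples.append(new_writes[round_no])
    return None                           # still failing: leave uncompressed
```

A repetitive page passes on the first round, while an incompressible page exhausts the rounds and is left uncompressed, matching the retry behavior described above.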
In some embodiments, when compressing the write data of a data page in a database table, neither the first metadata of the data page nor the second metadata of the data rows in that page is compressed. Specifically, the first metadata is stored in each data page; when the page is read, the corresponding dictionary can be found through the mapping relation stored in the first metadata, and the relevant information of the dictionary can be obtained without any decompression operation, so that compression or decompression of the write data can proceed and database performance is improved. It will be appreciated that the write data stored in each data row includes user data and second metadata, where the second metadata stores attribute information of the row, including the row data length and/or row transaction information: the row data length represents the length of the data row, and the row transaction information represents the ordering of operations. When a data row is compressed, mainly the user data is compressed; the second metadata need not be, which reduces the number of decompressions and improves database performance. Because the second metadata occupies little space, usually only 12B or 24B, its influence on the compression rate is almost negligible.
The write data of a data page that fails to reach the preset compression threshold is kept as candidate data, then used together with newly added write data in the database as dictionary training data to retrain the dictionary and update the related information, after which the data is compressed with the updated dictionary. During compression, the first metadata stored in the data page and the second metadata stored in the data rows are not compressed, so the corresponding dictionary can be found without decompression; this reduces the number of decompressions and effectively improves database performance, allowing the database to reach the optimal compression rate.
Referring to fig. 8, in some embodiments of the present application, after the step S104, the compressed write data is read, decompression is required to obtain the corresponding write data, where the decompression process includes, but is not limited to, the following steps S701 to S703.
In step S701, the first metadata of the data page is read to obtain a mapping relationship.
In some embodiments, the mapping relationship between the data page and the dictionary is obtained by reading the first metadata of the data page, specifically, after the data page is read, whether the data page is compressed is judged by the written data of the data line first, and if the written data of the data line is compressed, the mapping relationship between the data page and the dictionary is obtained by the first metadata stored in the data page where the data line is located.
Step S702, searching a corresponding dictionary in the dictionary file by using the mapping relation.
In some embodiments, the mapping relation is embodied by one or more of the dictionary file name, the offset of the dictionary within the dictionary file, and the dictionary length. It can be understood that the corresponding dictionary file is first found by its name, and the specific dictionary is then located within it by calculating from the offset and the dictionary length.
In step S703, the written data of the data page is decompressed using the dictionary.
In some embodiments, the dictionary is utilized to decompress the compressed data page to obtain corresponding write data, specifically, decompress the read data line, it may be understood that the dictionary may be cached in the memory, further improving the performance of compression or decompression, and further, an encoder used in the compression process or a decoder used in the decompression process may be cached in the memory.
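A sketch of this decompression path: locate the dictionary from the file name, offset, and length recorded in the first metadata, cache it in memory, and decompress the row. zlib's preset-dictionary support again stands in for the actual compression algorithm, and the `lru_cache` models the in-memory dictionary cache the text suggests.

```python
import zlib
from functools import lru_cache

@lru_cache(maxsize=128)  # in-memory dictionary cache keyed by (file, offset, length)
def load_dictionary(dict_file: str, offset: int, length: int) -> bytes:
    with open(dict_file, "rb") as f:
        f.seek(offset)         # locate the dictionary via the file offset...
        return f.read(length)  # ...and read exactly its recorded length

def decompress_row(compressed: bytes, dict_file: str, offset: int, length: int) -> bytes:
    """Decompress one data row using the dictionary named in the page's
    first metadata; repeated rows of the same page hit the cache."""
    zdict = load_dictionary(dict_file, offset, length)
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(compressed) + d.flush()
```

The cache means that reading many rows of one page touches the dictionary file only once, which is the performance benefit the embodiment attributes to caching the dictionary in memory.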
After a data row of a data page in the database table is read and its write data is found to be compressed, the mapping relation between the data page and the dictionary is obtained from the first metadata stored in that page, and the corresponding dictionary in the dictionary file is quickly located and found based on the mapping relation, so that the data row can be decompressed to recover the write data; caching the dictionary in memory further improves performance effectively.
In some embodiments of the present application, the mapping of each dictionary in the dictionary file to at least one data page is generated according to a preset mapping rule, and in particular, the preset mapping rule includes a continuous mapping rule, a discontinuous mapping rule, or a content-related mapping rule.
In some embodiments, when the preset mapping rule is a continuous mapping rule, a first number of continuous data pages is selected to be associated with the same dictionary, for example, the first number is set to be 128, that is, 128 continuous data pages are selected to be associated with the same dictionary, for example, 0 th to 127 th data pages are selected to share a first dictionary, 128 th to 255 th data pages share a second dictionary, and so on, the same dictionary is used to compress or decompress 128 continuous data pages, and corresponding mapping relationships are stored in each data page through the first metadata.
In some embodiments, when the preset mapping rule is a discontinuous mapping rule, a second number of non-continuous data pages is selected to be associated with the same dictionary; for example, with the second number set to 128, 128 non-continuous data pages such as the 0th, 2nd, 4th, 5th, 6th, 9th, 12th, 16th, 32nd, ... data pages may share one dictionary. Specifically, referring to the mapping relationship diagram shown in fig. 9, the first data page 1 and the third data page 3 share the same dictionary (dict 1), while the second data page 2, the fourth data page 4, and the fifth data page 5 share the same dictionary (dict 2); through the mapping relation stored in the first metadata of each data page, the corresponding dictionary can be quickly located and found to compress or decompress the data.
In some embodiments, when the preset mapping rule is a content-related mapping rule, a third number of data pages whose write data is correlated is selected to be associated with the same dictionary. For example, with the third number set to 128, 128 data pages that are correlated with one another, such as pages containing the same fields, are selected. Specifically, an enterprise database may contain a table of employee information in which each data page includes the same fields, such as an age field and a gender field; such pages can be considered correlated data pages and can share a dictionary for compression or decompression, improving both database performance and dictionary training efficiency.
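The three preset mapping rules can be sketched as a page-to-dictionary assignment function; the round-robin stride for the discontinuous rule and the hash for the content rule are illustrative choices, not taken from the patent.

```python
def dictionary_id(page_no: int, rule: str, group_size: int = 128,
                  page_field=None) -> int:
    """Assign a data page to a shared dictionary under the preset rules.
    'continuous' groups runs of consecutive pages (0-127 -> dict 0, ...);
    'discontinuous' here spreads pages round-robin across dictionaries;
    'content' hashes a representative field so pages sharing that field
    (e.g. the same age/gender columns) share a dictionary."""
    if rule == "continuous":
        return page_no // group_size
    if rule == "discontinuous":
        return page_no % group_size
    if rule == "content":
        return hash(page_field) % group_size
    raise ValueError(f"unknown mapping rule: {rule}")

print(dictionary_id(0, "continuous"),
      dictionary_id(127, "continuous"),
      dictionary_id(128, "continuous"))  # → 0 0 1
```

Whichever rule is chosen, the resulting assignment is what gets persisted into each page's first metadata as the page-to-dictionary mapping relation.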
The embodiment of the present application further provides a database shared dictionary compression device that can implement the database shared dictionary compression method described above. Referring to fig. 10, in some embodiments, the device includes a writing module 100, a training module 200, a storage module 300, and a compression module 400. Specifically, the writing module 100 is configured to perform a write operation on a data page, the write operation writing write data into a plurality of data rows; the training module 200 is configured to train at least one dictionary using the write data after the write data of the database table reaches a preset threshold; the storage module 300 is configured to persistently store the trained dictionary into a dictionary file; and the compression module 400 is configured to select the corresponding dictionary from the dictionary file based on the mapping relation and compress the write data of the data rows in the data page.
Referring to FIG. 11, in some embodiments of the present application, the compressed object is each data row in a data page, so the data rows in the same data page share a dictionary. Specifically, in fig. 11.1, an empty database table is created with no write data yet. In fig. 11.2, some data has been written into the database table, but the preset threshold for creating a dictionary has not been reached; for example, with the preset threshold set to 128MB, the amount of write data must reach 128MB before a new dictionary is created. In fig. 11.3, writing continues until the amount of data in the database table reaches the preset threshold for creating a dictionary. In fig. 11.4.1, this write data is used as dictionary training data and input into the dictionary generation model; dictionaries are trained and generated according to the preset dictionary size and number, and the trained dictionaries are persisted separately into a dictionary file, which may further be cached in memory. In fig. 11.4.2, the data pages in the database table are compressed using the successfully trained dictionary of fig. 11.4.1. Specifically, some data pages are compressed successfully, for example data page 1, data page 3, and data page 128, while some are not, for example data page 2. It is understood that such pages fail to compress because the dictionary's compression rate on the page does not reach the preset compression threshold; for example, with the preset compression threshold set to 80%, a compression rate below 80% on data page 2 means its compression fails, and the dictionary must be retrained with the newly added write data as new dictionary training data, after which the page is compressed again with the new dictionary.
The specific implementation manner of the database shared dictionary compression device in this embodiment is basically the same as that of the database shared dictionary compression method described above, and will not be described in detail here.
Fig. 12 shows an electronic device 1000 provided in an embodiment of the present application. The electronic device 1000 includes: the processor 1001, the memory 1002, and a computer program stored on the memory 1002 and executable on the processor 1001, the computer program when executed is configured to perform the database sharing dictionary compression method described above.
The processor 1001 and the memory 1002 may be connected by a bus or other means.
The memory 1002 is used as a non-transitory computer readable storage medium for storing non-transitory software programs and non-transitory computer executable programs, such as the database sharing dictionary compression method described in the embodiments of the present application. The processor 1001 implements the database sharing dictionary compression method described above by running non-transitory software programs and instructions stored in the memory 1002.
Memory 1002 may include a storage program area, which may store an operating system and at least one application program required for functionality, and a storage data area, which may store data used in performing the database shared dictionary compression method described above. In addition, the memory 1002 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state memory device. In some implementations, the memory 1002 optionally includes memory located remotely relative to the processor 1001, and such remote memory can be connected to the electronic device 1000 over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the database-sharing dictionary compression method described above are stored in the memory 1002, which when executed by the one or more processors 1001, perform the database-sharing dictionary compression method described above, for example, performing method steps S101 through S104 in fig. 1, method steps S201 through S203 in fig. 2, method steps S301 through S303 in fig. 3, method steps S401 through S402 in fig. 5, method steps S501 through S502 in fig. 6, method steps S601 through S603 in fig. 7, and method steps S701 through S703 in fig. 8.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium, and the storage medium stores a computer program, and the computer program realizes the database sharing dictionary compression method when being executed by a processor. The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the database shared dictionary compression method, device, electronic equipment, and storage medium of the present application, write data is written into the data rows of data pages by performing write operations on the data pages in a database table. After the write data reaches a preset threshold, the dictionary is trained with it, first metadata recording the mapping relation between data page and dictionary is stored in each data page, and the trained dictionary is stored in an independent dictionary file. Finally, the corresponding dictionary is selected from the dictionary file according to the mapping relation to compress the write data of the data rows, with compression and decompression performed at the row granularity of the data rows. The first metadata of the data page is kept uncompressed during compression, which effectively reduces the number of decompressions and improves database performance. A small amount of write data reaching the preset threshold is used to train the dictionary, and the size of the dictionary and the number of data pages sharing it can be configured by the user or selected adaptively by a database algorithm, improving dictionary training efficiency while reaching the optimal compression rate. The dictionary is stored in an independent file, which is convenient for lookup and management and can be cached in memory, and the mapping between data pages and dictionaries enables the corresponding dictionary to be found for compression and decompression.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically include computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media.
It should also be appreciated that the embodiments provided herein may be combined arbitrarily to achieve different technical effects. While the preferred embodiments of the present application have been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit and scope of the present application.

Claims (14)

1. A database shared dictionary compression method, which is characterized by being applied to a database, wherein the database comprises a plurality of database tables, the database tables comprise a plurality of data pages, and the data pages comprise a plurality of data rows; the method comprises the following steps:
performing a write operation on the page of data, the write operation for writing write data to a plurality of the rows of data;
training at least one dictionary using the written data after the written data of the database table reaches a preset threshold, wherein the data page stores first metadata and the data row stores second metadata, the first metadata being used to store a mapping relationship between the dictionary and at least one data page, and the second metadata being used to store attribute information of the data row;
storing the trained dictionary into a dictionary file;
and selecting, based on the mapping relationship, the corresponding dictionary from the dictionary file to compress the written data of the data rows in the data page.
2. The database shared dictionary compression method according to claim 1, wherein said training at least one dictionary using the written data after the written data of the database table reaches a preset threshold further comprises:
acquiring the number of uncompressed data pages in the database table;
when the number of uncompressed data pages reaches a preset number threshold, taking the written data of the uncompressed data pages as dictionary training data;
and inputting the dictionary training data into a dictionary generation model to generate a plurality of dictionaries, the number of dictionaries being a preset number.
3. The database shared dictionary compression method according to claim 2, wherein said inputting the dictionary training data into a dictionary generation model to generate a plurality of dictionaries further comprises:
determining the preset number and a dictionary size according to the dictionary training data and a preset compression rate;
inputting the dictionary training data into the dictionary generation model, and generating a plurality of dictionaries based on the preset number and the dictionary size;
and generating the first metadata of the data page based on the mapping relationship between the data page and the dictionary.
4. The database shared dictionary compression method according to claim 2, wherein when the number of uncompressed data pages reaches the preset number threshold, the method further comprises:
taking the written data of the uncompressed data pages as initial dictionary training data;
and selecting from the initial dictionary training data according to a preset selection strategy to obtain the dictionary training data.
5. The database shared dictionary compression method according to claim 4, wherein said selecting from the initial dictionary training data according to a preset selection strategy further comprises:
acquiring a training data threshold;
and, according to the training data threshold, taking as the dictionary training data either data of a corresponding size selected at random from the initial dictionary training data or data of a corresponding size selected from preset positions.
6. The database shared dictionary compression method according to claim 1, further comprising, after selecting the corresponding dictionary from the dictionary file based on the mapping relationship to compress the written data of the data rows in the data page:
if the compression ratio of the data page does not reach a preset compression threshold, taking the written data of the data page as candidate data;
acquiring newly added written data;
training the dictionary with the newly added written data and the candidate data to update the dictionary of the data page;
and compressing the data page using the updated dictionary.
7. The database shared dictionary compression method according to claim 1, wherein the first metadata is stored at a first preset location of the data page, the first metadata comprising one or more of a dictionary file name, a file offset, and a dictionary length.
8. The database shared dictionary compression method according to claim 1, wherein the second metadata is stored at a second preset location of the data row, the second metadata comprising a row data length and/or row transaction information.
9. The database shared dictionary compression method according to claim 1, wherein the mapping relationship between the dictionary and at least one data page is generated according to a preset mapping rule, the preset mapping rule comprising a continuous mapping rule, a discontinuous mapping rule, or a content-related mapping rule;
when the preset mapping rule is the continuous mapping rule, a first number of consecutive data pages are selected to be associated with the same dictionary;
when the preset mapping rule is the discontinuous mapping rule, a second number of non-consecutive data pages are selected to be associated with the same dictionary;
and when the preset mapping rule is the content-related mapping rule, a third number of data pages whose written data are related are selected to be associated with the same dictionary.
10. The database shared dictionary compression method according to claim 1, wherein when the written data of the data pages in the database table is compressed, neither the first metadata of each data page nor the second metadata of the data rows in each data page is compressed.
11. The database shared dictionary compression method according to claim 1, further comprising, after selecting the corresponding dictionary from the dictionary file based on the mapping relationship to compress the written data of the data rows in the data page: decompressing the compressed data rows to obtain the corresponding written data;
wherein the decompression comprises:
reading the first metadata of the data page to obtain the mapping relationship;
searching the dictionary file for the corresponding dictionary using the mapping relationship;
and decompressing the written data of the data row using the dictionary.
12. A database shared dictionary compression apparatus, comprising:
a write module, configured to perform a write operation on a data page, the write operation being used to write written data to a plurality of data rows;
a training module, configured to train at least one dictionary using the written data after the written data of a database table reaches a preset threshold;
a storage module, configured to store the trained dictionary into a dictionary file;
and a compression module, configured to select, based on a mapping relationship, the corresponding dictionary from the dictionary file and compress the written data of the data rows in the data page.
13. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the database shared dictionary compression method according to any one of claims 1 to 11.
14. A computer-readable storage medium storing a program which, when executed by a processor, implements the database shared dictionary compression method according to any one of claims 1 to 11.
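The write → train → map → compress → decompress flow in claims 1 and 11 can be sketched in Python. This is a toy model only: it uses zlib's preset-dictionary support as a stand-in for a real dictionary trainer (such as zstd's ZDICT builder, which the patent does not name), and every class, method, and parameter name below is illustrative rather than taken from the patent.

```python
import zlib

class SharedDictCompressor:
    """Toy model of the claimed flow: buffer writes per table, build a shared
    dictionary once enough pages are written, record a page->dictionary mapping
    (the "first metadata"), then compress rows with that dictionary.
    All names here are illustrative, not taken from the patent."""

    def __init__(self, train_threshold=3):
        self.train_threshold = train_threshold  # "preset threshold" of uncompressed pages
        self.pending_pages = []                 # (page_id, rows) awaiting dictionary training
        self.dict_file = {}                     # dictionary id -> dictionary bytes (the "dictionary file")
        self.page_dict_map = {}                 # page id -> dictionary id, kept uncompressed like the first metadata
        self.compressed_pages = {}              # page id -> list of compressed row payloads

    def write_page(self, page_id, rows):
        """Write a page of raw row payloads; train and compress once the threshold is hit."""
        self.pending_pages.append((page_id, rows))
        if len(self.pending_pages) >= self.train_threshold:
            self._train_and_compress()

    def _train_and_compress(self):
        # "Training" here is just sampling recent row bytes into a zlib preset dictionary;
        # a production system would run a real trainer over the sampled write data.
        sample = b"".join(r for _, rows in self.pending_pages for r in rows)[-4096:]
        dict_id = len(self.dict_file)
        self.dict_file[dict_id] = sample
        for page_id, rows in self.pending_pages:
            self.page_dict_map[page_id] = dict_id  # record the page->dictionary mapping
            compressed_rows = []
            for row in rows:
                c = zlib.compressobj(zdict=sample)
                compressed_rows.append(c.compress(row) + c.flush())
            self.compressed_pages[page_id] = compressed_rows
        self.pending_pages.clear()

    def read_row(self, page_id, row_index):
        """Decompress one row: read the mapping, fetch the dictionary, inflate (claim 11)."""
        zdict = self.dict_file[self.page_dict_map[page_id]]
        d = zlib.decompressobj(zdict=zdict)
        return d.decompress(self.compressed_pages[page_id][row_index]) + d.flush()
```

Because the page-to-dictionary mapping and the dictionary store stay outside the compressed payloads, a read only needs one dictionary lookup and one inflate call, which mirrors why the claims keep the first metadata uncompressed.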
CN202310045920.4A 2023-01-30 2023-01-30 Database shared dictionary compression method and device, electronic equipment and storage medium Active CN115774699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310045920.4A CN115774699B (en) 2023-01-30 2023-01-30 Database shared dictionary compression method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115774699A CN115774699A (en) 2023-03-10
CN115774699B true CN115774699B (en) 2023-05-23

Family

ID=85393752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310045920.4A Active CN115774699B (en) 2023-01-30 2023-01-30 Database shared dictionary compression method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115774699B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076388A (en) * 2023-10-12 2023-11-17 中科信工创新技术(北京)有限公司 File processing method and device, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103326732A (en) * 2013-05-10 2013-09-25 华为技术有限公司 Method for packing data, method for unpacking data, coder and decoder
CN105630864A (en) * 2014-11-25 2016-06-01 Sap欧洲公司 Forced ordering of a dictionary storing row identifier values
CN113518088A (en) * 2021-07-12 2021-10-19 北京百度网讯科技有限公司 Data processing method, device, server, client and medium
CN115208414A (en) * 2022-09-15 2022-10-18 本原数据(北京)信息技术有限公司 Data compression method, data compression device, computer device and storage medium
WO2022257124A1 (en) * 2021-06-11 2022-12-15 Huawei Technologies Co., Ltd. Adaptive compression
CN115483935A (en) * 2021-05-31 2022-12-16 华为技术有限公司 Data processing method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030086620A1 (en) * 2001-06-27 2003-05-08 Lucco Steven E. System and method for split-stream dictionary program compression and just-in-time translation
US10235377B2 (en) * 2013-12-23 2019-03-19 Sap Se Adaptive dictionary compression/decompression for column-store databases
CN111782660A (en) * 2020-07-17 2020-10-16 支付宝(杭州)信息技术有限公司 Data compression method and system based on key value storage
US20220308763A1 (en) * 2021-03-26 2022-09-29 James Guilford Method and apparatus for a dictionary compression accelerator


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on metering data compression based on dynamic dictionaries and differential encoding; 梁捷, 蒋雯倩, 李金瑾; Information Technology (信息技术) (10); full text *
A proportional compression algorithm for dynamically grouped data dictionaries based on a real-time database; 卢秉亮, 朱健, 王玉湘, 蒙刚, 甄雷; Microelectronics & Computer (微电子学与计算机) (05); full text *
Adaptive dictionary compression algorithms and installation programs; 周其力, 严朝军, 王勇佳; Journal of Hangzhou Institute of Electronics Engineering (杭州电子工业学院学报) (04); full text *

Also Published As

Publication number Publication date
CN115774699A (en) 2023-03-10

Similar Documents

Publication Publication Date Title
US11789860B2 (en) Logical to physical mapping management using low-latency non-volatile memory
CN109213772B (en) Data storage method and NVMe storage system
US11010300B2 (en) Optimized record lookups
US10402096B2 (en) Unaligned IO cache for inline compression optimization
CN108268219B (en) Method and device for processing IO (input/output) request
CN107491523B (en) Method and device for storing data object
US11580162B2 (en) Key value append
CN106951375B (en) Method and device for deleting snapshot volume in storage system
US10719450B2 (en) Storage of run-length encoded database column data in non-volatile memory
US11347711B2 (en) Sparse infrastructure for tracking ad-hoc operation timestamps
US11886401B2 (en) Database key compression
CN115774699B (en) Database shared dictionary compression method and device, electronic equipment and storage medium
US11947826B2 (en) Method for accelerating image storing and retrieving differential latency storage devices based on access rates
CN107888687B (en) Proxy client storage acceleration method and system based on distributed storage system
US10963377B2 (en) Compressed pages having data and compression metadata
US6353871B1 (en) Directory cache for indirectly addressed main memory
CN112346659B (en) Storage method, equipment and storage medium for distributed object storage metadata
WO2022037015A1 (en) Column-based storage method, apparatus and device based on persistent memory
CN110716940B (en) Incremental data access system
US10698834B2 (en) Memory system
US10417215B2 (en) Data storage over immutable and mutable data stages
CN114415966B (en) Method for constructing KV SSD storage engine
CN107506156B (en) Io optimization method of block device
US20240070135A1 (en) Hash engine for conducting point queries
US20230026824A1 (en) Memory system for accelerating graph neural network processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant