WO2023216575A1 - Data page processing method and apparatus thereof - Google Patents

Data page processing method and apparatus thereof

Info

Publication number
WO2023216575A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
offsets
row
page
offset
Prior art date
Application number
PCT/CN2022/137287
Other languages
English (en)
French (fr)
Inventor
纪德东
尼古拉·科夫里日尼赫
王建朋
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023216575A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • the embodiments of the present application relate to the field of information technology, and more specifically, to a data page processing method and an apparatus thereof.
  • Embodiments of the present application provide a data page processing method and a device thereof.
  • This data page processing method not only achieves a higher compression rate (or decompression rate), but its compression time (or decompression time) is also basically the same as that of existing compression (or decompression) methods.
  • a data page processing method including: obtaining a second data page according to a first data page; and compressing the second data page to obtain a compressed data page; wherein the first data page includes first data based on a row storage method and a first set of offsets, the first set of offsets being used to indicate the offset of each row of data in the first data; the second data page includes second data based on a row storage method and a second set of offsets, the second set of offsets being used to indicate the offset of each row of data in the second data; the second data is the data obtained by preprocessing the first data, and the second set of offsets is the set of offsets obtained by preprocessing the first set of offsets; the preprocessing includes byte-level row-column conversion.
  • byte-level row-column conversion is performed on the data stored in the first data page based on the row storage method; that is, the row-stored data is converted, in an ordered and reversible manner, into a column-stored form, so that the data can be updated in place within the data page.
  • the converted second data page is then compressed. Since the rows of data stored in the resulting second data page exhibit similarity, repetition and a certain regularity, compressing the second data page yields a higher compression rate than directly compressing the first data page, thereby improving the compression rate of data pages.
  • the compression time of the embodiment of the present application is basically the same as that of the existing compression method.
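The idea above can be illustrated with a minimal Python sketch. It assumes equal-length rows and uses `zlib` as a stand-in compressor; the patent text does not name a compressor, and the helper names `transpose_bytes` and `untranspose_bytes` as well as the sample rows are hypothetical. The sketch only shows that the conversion is ordered and reversible and that bytes at the same position of every row end up adjacent before compression.

```python
import zlib


def transpose_bytes(rows: list[bytes]) -> bytes:
    """Byte-level row-column conversion for equal-length rows.

    Output byte i * N + n is byte i of row n, so bytes that sit at the
    same position in every row become neighbours.
    """
    width = len(rows[0])
    assert all(len(r) == width for r in rows), "sketch assumes equal row lengths"
    return bytes(row[i] for i in range(width) for row in rows)


def untranspose_bytes(data: bytes, n_rows: int) -> list[bytes]:
    """Inverse of transpose_bytes: rebuild the original rows."""
    width = len(data) // n_rows
    return [bytes(data[i * n_rows + n] for i in range(width)) for n in range(n_rows)]


# Hypothetical rows: a 4-byte big-endian id followed by a constant 4-byte tag.
rows = [i.to_bytes(4, "big") + b"ITEM" for i in range(1000, 1008)]
converted = transpose_bytes(rows)        # stand-in for the "second data"
compressed = zlib.compress(converted)    # stand-in for the compressed data page

# Round trip: decompress, then undo the conversion.
restored = untranspose_bytes(zlib.decompress(compressed), len(rows))
```

After the conversion, the constant tag bytes form runs such as eight `I` bytes in a row, which is the kind of repetition a general-purpose compressor can exploit. Whether the compressed size actually shrinks depends on the data and the compressor.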
  • obtaining the second data page according to the first data page includes: obtaining the first data and the first set of offsets from the first data page, respectively; performing the preprocessing on the first data byte by byte to obtain the second data; performing the preprocessing on the first set of offsets byte by byte to obtain the second set of offsets; and obtaining the second data page according to the second data and the second set of offsets.
  • the first data page includes a first row data part and a first directory part, the first row data part being used to store the first data and the first directory part being used to store the first set of offsets; and obtaining the second data page according to the second data and the second set of offsets includes: updating the first data stored in the first row data part to the second data, and updating the first set of offsets stored in the first directory part to the second set of offsets, to obtain the second data page.
  • performing the preprocessing on the first data byte by byte to obtain the second data includes: obtaining the starting point and end point of the offsets of the first data; obtaining, according to that starting point and end point and the unit offset length of the first set of offsets, the number M of offsets included in the first set of offsets; removing invalid offsets from the M offsets to obtain N offsets, where N is less than or equal to M, and N and M are both positive integers; sorting the N offsets in ascending order to obtain the sorted N offsets; and dividing the first row data part into N regions according to the sorted N offsets and obtaining the length of each row of data of the first data, where the amount of data in the n-th region among the N regions is the length of the n-th row of data of the first data.
  • the order of the N offsets is the ascending order or the arrangement order of the N offsets in the first set of offsets; performing the preprocessing on the first set of offsets byte by byte to obtain the second set of offsets includes: performing the preprocessing on the N offsets in the first set of offsets byte by byte to obtain the second set of offsets.
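A rough Python sketch of the region-splitting steps follows. Several details are assumptions not stated in the text: the directory is modeled as a flat byte string of little-endian slots, an invalid offset is modeled as the sentinel `0xFFFF`, and each region is taken to run from one sorted offset to the next (the last one to the end of the row data part). The function name `split_into_regions` is hypothetical.

```python
def split_into_regions(row_area: bytes, directory: bytes,
                       unit: int = 2, invalid: int = 0xFFFF):
    """Derive the N valid offsets, the N regions and the row lengths."""
    # M: number of directory slots, from the directory extent and the
    # unit offset length.
    m = len(directory) // unit
    slots = [int.from_bytes(directory[k * unit:(k + 1) * unit], "little")
             for k in range(m)]
    # N valid offsets, sorted in ascending order (assumption: a sentinel
    # value marks an invalid slot).
    valid = sorted(o for o in slots if o != invalid)
    # Region n spans from the n-th sorted offset to the next one; the
    # last region ends at the end of the row data part (assumption).
    bounds = valid + [len(row_area)]
    regions = [row_area[bounds[n]:bounds[n + 1]] for n in range(len(valid))]
    lengths = [len(r) for r in regions]
    return valid, regions, lengths


# Hypothetical page: three rows and one invalid directory slot.
row_area = b"aaabbbbcc"
directory = b"".join(o.to_bytes(2, "little") for o in [3, 0xFFFF, 0, 7])
valid, regions, lengths = split_into_regions(row_area, directory)
```

Here M is 4, N is 3, and the three regions hold rows of length 3, 4 and 2.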
  • the method further includes: determining that the difference between the lengths of each row of data in the first data is less than or equal to a first threshold.
  • when it is determined that the difference between the lengths of each row of data in the first data is less than or equal to the first threshold, the data corresponding to the i-th byte is taken from the N regions in turn, in the order of the N offsets, as the data in the i-th row and Nth column of the second data. In this way, byte-level row-column conversion is performed on the first data only when its row lengths do not differ greatly, thereby avoiding waste of resources.
  • the preprocessing further includes differential processing based on byte level, and the differential processing includes differential processing between column data.
  • performing the preprocessing on the first data byte by byte to obtain the second data includes: performing row-column conversion on the first data byte by byte to obtain third data; and differencing, byte by byte, the data of adjacent columns in the a1-th row of the third data to obtain the second data, where 1 ≤ a1 ≤ a2, a1 and a2 are both positive integers, and a2 is equal to the maximum row length of the first data or to the minimum row length of the first data.
  • performing the preprocessing on the first set of offsets byte by byte to obtain the second set of offsets includes: performing row-column conversion on the first set of offsets byte by byte to obtain a third set of offsets; and differencing, byte by byte, the data of adjacent columns in the b1-th row of the third set of offsets to obtain the second set of offsets, where 1 ≤ b1 ≤ b2, b1 and b2 are both positive integers, and b2 is equal to the maximum row length of the first set of offsets or to the minimum row length of the first set of offsets.
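The differencing between adjacent columns can be sketched in Python as below, together with the accumulation that undoes it. Two points are assumptions rather than statements from the text: differences are taken modulo 256 so that each result still fits in one byte, and the first column of a row is kept unchanged. The names `delta_row` and `accumulate_row` are hypothetical.

```python
def delta_row(row: bytes) -> bytes:
    """Replace every byte except the first by its difference from the
    previous column, modulo 256."""
    if not row:
        return b""
    out = [row[0]]
    for j in range(1, len(row)):
        out.append((row[j] - row[j - 1]) % 256)
    return bytes(out)


def accumulate_row(delta: bytes) -> bytes:
    """Inverse of delta_row: running sum over the columns, modulo 256."""
    out = bytearray()
    acc = 0
    for j, d in enumerate(delta):
        acc = d if j == 0 else (acc + d) % 256
        out.append(acc)
    return bytes(out)


# A row of the converted data holding slowly increasing byte values.
row = bytes([0xE8, 0xE9, 0xEA, 0xEB])
diffed = delta_row(row)
```

A row of steadily increasing bytes turns into a run of identical small values, which tends to be easier to compress, and `accumulate_row` restores the original row exactly, including across wrap-around.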
  • performing row-column conversion on the first data byte by byte to obtain the third data includes: obtaining the starting point and end point of the offsets of the first data; obtaining, according to that starting point and end point and the unit offset length of the first set of offsets, the number M of offsets included in the first set of offsets; removing invalid offsets from the M offsets to obtain N offsets, where N is less than or equal to M, and N and M are both positive integers; sorting the N offsets in ascending order to obtain the sorted N offsets; and dividing the first row data part into N regions according to the sorted N offsets and obtaining the length of each row of data of the first data, where the amount of data in the n-th region among the N regions is the length of the n-th row of data of the first data.
  • the order of the N offsets is the ascending order or the arrangement order of the N offsets in the first set of offsets; performing row-column conversion on the first set of offsets byte by byte to obtain the third set of offsets includes: performing row-column conversion on the N offsets in the first set of offsets byte by byte to obtain the third set of offsets.
  • the method further includes: determining that the difference between the lengths of each row of data in the first data is less than or equal to a first threshold.
  • when it is determined that the difference between the lengths of each row of data in the first data is less than or equal to the first threshold, the data corresponding to the i-th byte is taken from the N regions in turn, in the order of the N offsets, as the data in the i-th row and Nth column of the third data. In this way, byte-level row-column conversion is performed on the first data only when its row lengths do not differ greatly, thereby avoiding waste of resources.
  • before dividing the first row data part into N regions according to the N offsets, the method further includes: determining that N is less than or equal to a second threshold.
  • when N is less than or equal to the second threshold, the first row data part is divided into N regions according to the N offsets. In this way, byte-level row-column conversion is performed on the first data only when the first data does not have many rows, thereby avoiding waste of resources.
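The two threshold checks can be combined into a single guard, sketched below in Python. The text does not define how "the difference between the lengths of each row" is measured; this sketch assumes it is the gap between the longest and the shortest row. The function name and the threshold values are hypothetical.

```python
def should_convert(lengths: list[int], first_threshold: int,
                   second_threshold: int) -> bool:
    """Decide whether byte-level row-column conversion is worthwhile."""
    if not lengths:
        return False
    # Row-count check: N must not exceed the second threshold.
    if len(lengths) > second_threshold:
        return False
    # Row-length check (assumption: spread = longest row - shortest row).
    return max(lengths) - min(lengths) <= first_threshold


ok = should_convert([30, 32, 31], first_threshold=4, second_threshold=512)
too_uneven = should_convert([4, 900, 31], first_threshold=4, second_threshold=512)
```

When the guard returns `False`, the page would simply be compressed without the conversion.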
  • the method further includes: reorganizing a plurality of consecutive third data pages that have the same structure to obtain the first data page; wherein each third data page includes fourth data based on a row storage method and a fourth set of offsets, the fourth set of offsets being used to indicate the offset of each row of data in the fourth data; the first data includes the plurality of fourth data corresponding to the plurality of third data pages, the maximum row lengths of the plurality of fourth data being the same; and the first set of offsets includes the plurality of fourth sets of offsets corresponding to the plurality of third data pages.
  • multiple consecutive data pages with the same structure may be reorganized to obtain one data page.
  • the structural characteristics of the data page can be fully utilized to reorganize multiple data pages with high similarity into one data page, thereby further improving the compression rate of the data page.
  • the compression time is basically the same as that of existing compression methods.
  • reorganizing a plurality of consecutive third data pages with the same structure to obtain the first data page includes: obtaining the plurality of fourth data and the plurality of fourth sets of offsets from the plurality of third data pages, respectively; and storing the first data and the first set of offsets in the first data page, respectively.
  • the first data page includes information indicating that the first data page has been reorganized.
  • the second data page includes information indicating that the second data page has undergone the preprocessing.
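The reorganization of several pages into one, with a marker recording that it happened, could look roughly like the Python sketch below. The in-memory `Page` model, the `extents` bookkeeping and the function names are all assumptions made for illustration; the text only says that the merged page carries information indicating that it has been reorganized. An inverse `split_page` is included to check that the merge loses nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    rows: bytes                  # row data part
    offsets: list[int]           # directory part
    reorganized: bool = False    # marker: page was produced by merging
    # (row-byte count, offset count) of each source page (assumption).
    extents: list[tuple[int, int]] = field(default_factory=list)


def merge_pages(pages: list[Page]) -> Page:
    """Reorganize several same-structure pages into one page."""
    return Page(
        rows=b"".join(p.rows for p in pages),
        offsets=[o for p in pages for o in p.offsets],
        reorganized=True,
        extents=[(len(p.rows), len(p.offsets)) for p in pages],
    )


def split_page(merged: Page) -> list[Page]:
    """Recover the source pages from the recorded start and end points."""
    pages, r, d = [], 0, 0
    for n_bytes, n_offsets in merged.extents:
        pages.append(Page(merged.rows[r:r + n_bytes],
                          merged.offsets[d:d + n_offsets]))
        r += n_bytes
        d += n_offsets
    return pages


a = Page(b"aaabbb", [0, 3])
b = Page(b"cccddd", [0, 3])
merged = merge_pages([a, b])
```

The merged page can then go through the same conversion and compression as any other first data page.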
  • the method further includes: decompressing the compressed data page to obtain the second data page; and obtaining the first data page according to the second data page, where the first data is the data obtained by performing the preprocessing on the second data, and the first set of offsets is the set of offsets obtained by performing the preprocessing on the second set of offsets.
  • the decompression rate of decompressing the second data page is relatively high, thereby improving the decompression rate of data pages.
  • the decompression time of the embodiment of the present application is basically the same as that of the existing decompression method.
  • obtaining the first data page according to the second data page includes: obtaining the second data and the second set of offsets from the second data page, respectively; performing the preprocessing on the second set of offsets byte by byte to obtain the first set of offsets; performing the preprocessing on the second data byte by byte according to the first set of offsets to obtain the first data; and obtaining the first data page according to the first data and the first set of offsets.
  • the second data page includes a second row data part and a second directory part, the second row data part being used to store the second data and the second directory part being used to store the second set of offsets; and obtaining the first data page according to the first data and the first set of offsets includes: updating the second data stored in the second row data part to the first data, and updating the second set of offsets stored in the second directory part to the first set of offsets, to obtain the first data page.
  • performing the preprocessing on the second set of offsets byte by byte to obtain the first set of offsets includes: performing, according to the unit offset length of the second set of offsets, the preprocessing on the second set of offsets byte by byte to obtain the first set of offsets.
  • performing the preprocessing on the second data byte by byte according to the first set of offsets to obtain the first data includes: removing invalid offsets from the first set of offsets to obtain a fifth set of offsets, the fifth set of offsets including P offsets; sorting the P offsets in ascending order to obtain the sorted P offsets; creating P regions according to the sorted P offsets and obtaining the length of each row of data of the first data, the P regions corresponding one-to-one to the P offsets; and sequentially reading the data corresponding to R bytes from the second row data part and sequentially storing the data corresponding to the p-th byte among the R bytes as the data corresponding to the q-th byte of the s-th region among the P regions, thereby completing the q-th round of reading and writing, where p is a positive integer ranging from 1 to R, and R is the number of regions among the P regions that are not yet filled with data.
  • when the s-th region is full, the amount of data in the s-th region is the length of the s-th row of data of the first data, where s is a positive integer; the offset corresponding to the s-th region is the s-th offset, and the s-th offset is the p-th offset among the offsets in the fifth set of offsets other than those corresponding to regions already filled with data.
  • obtaining the first data includes: sequentially overwriting the second data stored in the second row data part with the data in the P regions.
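The round-based reading and writing described above can be sketched in Python as follows. The forward function `interleave` is an assumption about how the converted data was produced for rows of unequal length (round q emits byte q of every region that still has one); the inverse `deinterleave` follows the described procedure, reading R bytes per round, where R is the number of regions not yet full, and giving the p-th byte to the p-th region that is still open. Both function names are hypothetical.

```python
def interleave(regions: list[bytes]) -> bytes:
    """Forward conversion: round q emits byte q of each region that has one."""
    out = bytearray()
    for q in range(max((len(r) for r in regions), default=0)):
        out.extend(r[q] for r in regions if len(r) > q)
    return bytes(out)


def deinterleave(data: bytes, lengths: list[int]) -> list[bytes]:
    """Inverse conversion driven by the row lengths."""
    regions = [bytearray() for _ in lengths]
    pos = 0
    while True:
        # The R regions that are not yet filled with data, in offset order.
        open_regions = [s for s, n in enumerate(lengths) if len(regions[s]) < n]
        if not open_regions:
            break
        # Read R bytes; the p-th byte goes to the p-th open region.
        chunk = data[pos:pos + len(open_regions)]
        for p, s in enumerate(open_regions):
            regions[s].append(chunk[p])
        pos += len(open_regions)
    return [bytes(r) for r in regions]


rows = [b"aaa", b"bbbb", b"cc"]
mixed = interleave(rows)
restored = deinterleave(mixed, [len(r) for r in rows])
```

In this example the third round reads only two bytes because the shortest region is already full, and the fourth round reads one. Concatenating the restored regions gives the bytes that overwrite the row data part.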
  • before the data corresponding to P bytes is sequentially read from the second row data part and the data corresponding to the p-th byte among the P bytes is stored as the data corresponding to the q-th byte of the s-th region, the method further includes: determining that the difference between the lengths of each row of data of the first data is less than or equal to a third threshold.
  • the data corresponding to P bytes is sequentially read from the second row data part, and the data corresponding to the p-th byte among the P bytes is sequentially stored as the data corresponding to the q-th byte of the s-th region, thereby completing the q-th round of reading and writing.
  • in this way, the second data can be converted between rows and columns at byte level, thereby avoiding waste of resources.
  • the preprocessing further includes accumulation processing based on byte level, and the accumulation processing includes accumulation between column data.
  • performing the preprocessing on the second set of offsets byte by byte to obtain the first set of offsets includes: accumulating, byte by byte, the data of adjacent columns in the c1-th row of the second set of offsets to obtain the third set of offsets, where 1 ≤ c1 ≤ c2, c1 and c2 are both positive integers, and c2 is equal to the maximum row length of the second set of offsets or to the minimum row length of the second set of offsets; and performing row-column conversion on the third set of offsets byte by byte to obtain the first set of offsets; performing the preprocessing on the second data byte by byte according to the first set of offsets to obtain the first data includes: accumulating, byte by byte, the data of adjacent columns in the d1-th row of the second data to obtain the third data, where 1 ≤ d1 ≤ d2, d1 and d2 are both positive integers, and d2 is equal to the second
  • performing row-column conversion on the third set of offsets byte by byte to obtain the first set of offsets includes: performing, according to the unit offset length of the third set of offsets, row-column conversion on the third set of offsets byte by byte to obtain the first set of offsets.
  • performing row-column conversion on the third data byte by byte according to the first set of offsets to obtain the first data includes: removing invalid offsets from the first set of offsets to obtain a fifth set of offsets, the fifth set of offsets including P offsets; sorting the P offsets in ascending order to obtain the sorted P offsets; creating P regions according to the sorted P offsets and obtaining the length of each row of data of the first data, the P regions corresponding one-to-one to the P offsets; and sequentially reading the data corresponding to R bytes from the third data and sequentially storing the data corresponding to the p-th byte among the R bytes as the data corresponding to the q-th byte of the s-th region among the P regions, where p is a positive integer ranging from 1 to R, and R is the number of regions among the P regions that are not yet filled with data.
  • when the s-th region is full, the amount of data in the s-th region is the length of the s-th row of data of the first data, where s is a positive integer; the offset corresponding to the s-th region is the s-th offset, and the s-th offset is the p-th offset among the offsets in the fifth set of offsets other than those corresponding to regions already filled with data.
  • before the data corresponding to P bytes is sequentially read from the second row data part and the data corresponding to the p-th byte among the P bytes is stored as the data corresponding to the q-th byte of the s-th region, the method further includes: determining that the difference between the lengths of each row of data of the first data is less than or equal to the third threshold.
  • the data corresponding to P bytes is sequentially read from the second row data part, and the data corresponding to the p-th byte among the P bytes is sequentially stored as the data corresponding to the q-th byte of the s-th region, thereby completing the q-th round of reading and writing.
  • in this way, the third data can be converted between rows and columns at byte level, thereby avoiding waste of resources.
  • before creating the P regions according to the sorted P offsets and obtaining the length of each row of data of the first data, the method further includes: determining that P is less than or equal to a fourth threshold.
  • when P is less than or equal to the fourth threshold, the P regions are created according to the sorted P offsets. In this way, byte-level row-column conversion is performed on the third data only when the first data does not have many rows, thereby avoiding waste of resources.
  • the method further includes: splitting the first data page to obtain the plurality of third data pages.
  • splitting the first data page to obtain the plurality of third data pages includes: obtaining the starting point and end point of each of the plurality of fourth data, and the starting point and end point of each of the plurality of fourth sets of offsets; obtaining the plurality of fourth data from the first data page according to the starting points and end points of the plurality of fourth data; obtaining the plurality of fourth sets of offsets from the first data page according to the starting points and end points of the plurality of fourth sets of offsets; and storing the plurality of fourth data and the plurality of fourth sets of offsets in the plurality of third data pages, respectively.
  • the first data page includes information indicating that the first data page has been reorganized.
  • a data page processing method including: decompressing a compressed data page to obtain a second data page; and obtaining a first data page according to the second data page; wherein the second data page includes second data based on a row storage method and a second set of offsets, the second set of offsets being used to indicate the offset of each row of data in the second data; the first data page includes first data based on a row storage method and a first set of offsets, the first set of offsets being used to indicate the offset of each row of data in the first data; the first data is the data obtained by preprocessing the second data, and the first set of offsets is the set of offsets obtained by preprocessing the second set of offsets.
  • a device for processing data pages includes a processing unit, and the processing unit is configured to: obtain a second data page based on the first data page; compress the second data page, Obtain a compressed data page; wherein the first data page includes first data based on row storage and a first set of offsets, the first set of offsets being used to indicate each of the first data The offset of the row data; the second data page includes second data based on the row storage method and a second set of offsets, the second set of offsets is used to indicate each row of the second data offset, the second data is the data obtained by preprocessing the first data, and the second set of offsets is a set of offsets obtained by preprocessing the first set of offsets.
  • the preprocessing includes byte-level row-column conversion.
  • the processing unit of the data page processing device performs byte-level row-column conversion on the row-stored data in the first data page. That is, the row-stored data is converted into column-oriented data in an ordered and reversible manner, so that the data can be updated in place within the data page.
  • the converted second data page is then compressed. Because the rows of data stored in the second data page are similar, repetitive, and regular, compressing the second data page achieves a higher compression rate than compressing the first data page directly, thereby improving the compression rate of data pages.
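  • As an illustration of why the converted page tends to compress better, the following minimal sketch runs a general-purpose compressor on row-ordered data and on the same data after byte-level row-column conversion. The function name `rows_to_columns`, the synthetic rows, and the choice of zlib are assumptions made for illustration and are not taken from the embodiments; the sizes printed depend on the data.

```python
import zlib

def rows_to_columns(rows):
    """Emit byte 0 of every row, then byte 1 of every row that has one, and so on."""
    out = bytearray()
    max_len = max((len(r) for r in rows), default=0)
    for i in range(max_len):
        for row in rows:
            if i < len(row):
                out.append(row[i])
    return bytes(out)

# Synthetic rows with a shared layout: 4-byte id, 2-byte status, 6-byte name.
rows = [n.to_bytes(4, "big") + b"OK" + b"user%02d" % (n % 100) for n in range(500)]

row_major = b"".join(rows)          # bytes as stored row by row
col_major = rows_to_columns(rows)   # same bytes after the conversion

print("row-ordered compressed size:", len(zlib.compress(row_major)))
print("converted compressed size:  ", len(zlib.compress(col_major)))
```

The conversion only reorders bytes, so the converted buffer has the same length as the original and can be restored once the row lengths are known.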
  • the compression time of the data page processing device is basically the same as that of the existing compression device.
  • the processing unit is further specifically configured to: obtain the first data and the first set of offsets respectively from the first data page; perform the preprocessing on the first data by bytes to obtain the second data; perform the preprocessing on the first set of offsets by bytes to obtain the second set of offsets; and obtain the second data page according to the second data and the second set of offsets.
  • the first data page includes a first row data part and a first directory part, the first row data part being used to store the first data and the first directory part being used to store the first set of offsets; the processing unit is also specifically configured to: update the first data stored in the first row data part to the second data, and update the first set of offsets stored in the first directory part to the second set of offsets, to obtain the second data page.
  • the processing unit is further specifically configured to: obtain the starting point and the end point of the first set of offsets; obtain, according to the starting point and the end point of the first set of offsets and the unit offset length of the first set of offsets, the number M of offsets included in the first set of offsets; remove invalid offsets from the M offsets to obtain N offsets, where N is less than or equal to M, and N and M are both positive integers; arrange the N offsets in ascending order to obtain the sorted N offsets; divide the first row data part into N areas according to the sorted N offsets, and obtain the length of each row of data of the first data, where the amount of data in the n-th area among the N areas is the length of the n-th row of data of the first data; and, following the arrangement order of the N offsets, sequentially take from the N areas the data corresponding to the i-th byte, the byte taken from the n-th area being the data of the i-th row and n-th column of the second data, where i is a positive integer taken from 1 to L1 in sequence, L1 is the maximum row length of the first data, and the arrangement order of the N offsets is either ascending order or the order in which the N offsets appear in the first set of offsets; the processing unit is also specifically configured to: perform the preprocessing on the N offsets in the first set of offsets by bytes to obtain the second set of offsets.
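  • The steps above (counting the directory entries, removing invalid offsets, sorting, cutting the row data part into N areas, and taking the i-th byte of each area in turn) can be sketched as follows. This is a minimal sketch under stated assumptions: the 2-byte little-endian directory entries, the `INVALID` marker value, the contiguous row layout, and all function names are hypothetical and are not specified by the embodiments.

```python
INVALID = 0xFFFF  # hypothetical marker for an invalid directory entry

def read_offsets(directory, unit=2):
    """M = directory length / unit offset length; decode each entry."""
    m = len(directory) // unit
    return [int.from_bytes(directory[k * unit:(k + 1) * unit], "little")
            for k in range(m)]

def split_rows(page, offsets, data_end):
    """Drop invalid offsets, sort the rest, and cut the row data into N areas."""
    valid = sorted(o for o in offsets if o != INVALID)
    bounds = valid + [data_end]  # assumes rows are stored back to back
    return [page[bounds[n]:bounds[n + 1]] for n in range(len(valid))]

def to_columns(areas):
    """For i = 1..L1, take the i-th byte of every area that is long enough."""
    longest = max((len(a) for a in areas), default=0)
    out = bytearray()
    for i in range(longest):
        for area in areas:
            if i < len(area):
                out.append(area[i])
    return bytes(out)

# A toy page: 4-byte header followed by three rows of 3, 4 and 5 bytes.
page = b"HDR!" + b"abc" + b"abcd" + b"abcde"
directory = b"".join(o.to_bytes(2, "little") for o in [4, INVALID, 7, 11])

offsets = read_offsets(directory)
regions = split_rows(page, offsets, data_end=len(page))
second_data = to_columns(regions)
```

Here the area lengths (3, 4, 5) fall out of the differences between consecutive sorted offsets, which is why the directory alone is enough to reverse the conversion later.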
  • the processing unit is further configured to: before sequentially taking from the N areas, in the arrangement order of the N offsets, the data corresponding to the i-th byte as the data of the i-th row and n-th column of the second data, determine that the difference between the lengths of the rows of data in the first data is less than or equal to a first threshold.
  • the preprocessing further includes differential processing based on byte levels, and the differential processing includes differential processing between column data.
  • the processing unit is further specifically configured to: perform row-column conversion on the first data by bytes to obtain third data; and difference, by bytes, the data of adjacent columns on the a1-th row of the third data to obtain the second data, where 1 ≤ a1 ≤ a2, a1 and a2 are both positive integers, and a2 is equal to the maximum row length of the first data or a2 is equal to the minimum row length of the first data; the processing unit is also specifically configured to: perform row-column conversion on the first set of offsets by bytes to obtain a third set of offsets; and difference, by bytes, the data of adjacent columns on the b1-th row of the third set of offsets to obtain the second set of offsets, where 1 ≤ b1 ≤ b2, b1 and b2 are both positive integers, and b2 is equal to the maximum row length of the first set of offsets or b2 is equal to the minimum row length of the first set of offsets.
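  • The byte-level differencing of adjacent columns, and the accumulation that undoes it, can be sketched as follows. This is a minimal sketch: the function names are hypothetical, and wrapping the difference modulo 256 so that each result stays within one byte is an assumption, since the text does not state how negative differences are represented.

```python
def diff_row(row):
    """Keep the first byte; replace every later byte by its difference
    from the byte in the previous column (modulo 256)."""
    out = bytearray(row[:1])
    for k in range(1, len(row)):
        out.append((row[k] - row[k - 1]) % 256)
    return bytes(out)

def accumulate_row(row):
    """Inverse of diff_row: running sum over the columns (modulo 256)."""
    out = bytearray(row[:1])
    for k in range(1, len(row)):
        out.append((out[k - 1] + row[k]) % 256)
    return bytes(out)

# One row of converted data, i.e. the same byte position taken from five rows.
row = bytes([10, 11, 12, 14, 14])
delta = diff_row(row)
restored = accumulate_row(delta)
```

When neighbouring values are close, as with sorted offsets, the differenced row consists mostly of small numbers, which a general compressor can encode compactly.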
  • the processing unit is further specifically configured to: obtain the starting point and the end point of the first set of offsets; obtain, according to the starting point and the end point of the first set of offsets and the unit offset length of the first set of offsets, the number M of offsets included in the first set of offsets; remove invalid offsets from the M offsets to obtain N offsets, where N is less than or equal to M, and N and M are both positive integers; arrange the N offsets in ascending order to obtain the sorted N offsets; divide the first row data part into N areas according to the sorted N offsets, and obtain the length of each row of data of the first data, where the amount of data in the n-th area among the N areas is the length of the n-th row of data of the first data; and, following the arrangement order of the N offsets, sequentially take from the N areas the data corresponding to the i-th byte, the byte taken from the n-th area being the data of the i-th row and n-th column of the third data, where i is a positive integer taken from 1 to L1 in sequence, L1 is the maximum row length of the first data, and the arrangement order of the N offsets is either ascending order or the order in which the N offsets appear in the first set of offsets; the processing unit is also specifically configured to perform row-column conversion on the N offsets in the first set of offsets by bytes to obtain the third set of offsets.
  • the processing unit is further configured to: before sequentially taking from the N areas, in the arrangement order of the N offsets, the data corresponding to the i-th byte as the data of the i-th row and n-th column of the third data, determine that the difference between the lengths of the rows of data in the first data is less than or equal to a first threshold.
  • the processing unit is further configured to: before dividing the first row of data part into N areas according to the N offsets , determine that N is less than or equal to the second threshold.
  • the processing unit is further configured to: reorganize multiple third data pages that are continuous and have the same structure to obtain the first data page; wherein, The third data page includes fourth data based on a row storage method and a fourth set of offsets.
  • the fourth set of offsets is used to indicate the offset of each row of data in the fourth data.
  • the first data includes a plurality of fourth data corresponding to a plurality of third data pages, and the maximum row lengths of the plurality of fourth data are the same, and the first set of offsets includes a plurality of the third data. A plurality of the fourth set of offsets corresponding to the page.
  • the processing unit is further specifically configured to: respectively obtain the plurality of fourth data and the plurality of fourth sets of offsets corresponding to the plurality of third data pages; arrange the plurality of fourth data in a target order to obtain the first data; arrange the plurality of fourth sets of offsets in the target order to obtain the first set of offsets, the target order being the arrangement order of the plurality of third data pages; and store the first data and the first set of offsets in the first data page.
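  • Reorganizing several pages into one, and splitting the result back by start and end points, can be sketched as follows. This is a minimal sketch under assumptions not stated in the text: each page is modelled as a hypothetical `(row_data, offsets)` pair, offsets are kept as they were in their original page rather than rebased, and the start and end points are recorded in a separate `bounds` list.

```python
def merge_pages(pages):
    """pages: list of (row_data: bytes, offsets: list[int]) in page order.

    Returns the merged data, the merged offsets, and, per page, the start
    and end points of its data and of its offsets within the merged result.
    """
    first_data = b"".join(data for data, _ in pages)
    first_offsets = [o for _, offs in pages for o in offs]
    bounds = []
    d_pos = 0
    o_pos = 0
    for data, offs in pages:
        bounds.append(((d_pos, d_pos + len(data)), (o_pos, o_pos + len(offs))))
        d_pos += len(data)
        o_pos += len(offs)
    return first_data, first_offsets, bounds

def split_page(first_data, first_offsets, bounds):
    """Cut the merged data and offsets back into the original pages."""
    return [(first_data[ds:de], first_offsets[os_:oe])
            for (ds, de), (os_, oe) in bounds]

pages = [(b"abcabcd", [0, 3]), (b"xyzxy", [0, 3]), (b"pq", [0])]
merged_data, merged_offsets, bounds = merge_pages(pages)
restored_pages = split_page(merged_data, merged_offsets, bounds)
```

Merging pages with the same structure gives the row-column conversion more rows to line up, at the cost of having to keep the split points.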
  • the first data page includes information indicating that the first data page has been reorganized.
  • the second data page includes information indicating that the second data page has undergone the preprocessing.
  • the processing unit is further configured to: decompress the compressed data page to obtain the second data page; and obtain the first data page according to the second data page, where the first data is data obtained by performing the preprocessing on the second data, and the first set of offsets is a set of offsets obtained by performing the preprocessing on the second set of offsets.
  • the processing unit is further specifically configured to: obtain the second data and the second set of offsets respectively from the second data page; perform the preprocessing on the second set of offsets by bytes to obtain the first set of offsets; perform the preprocessing on the second data by bytes according to the first set of offsets to obtain the first data; and obtain the first data page according to the first data and the first set of offsets.
  • the second data page includes a second row data part and a second directory part, the second row data part being used to store the second data and the second directory part being used to store the second set of offsets; the processing unit is also specifically configured to: update the second data stored in the second row data part to the first data, and update the second set of offsets stored in the second directory part to the first set of offsets, to obtain the first data page.
  • the processing unit is further specifically configured to: perform the preprocessing on the second set of offsets by bytes, according to the unit offset length of the second set of offsets, to obtain the first set of offsets; the processing unit is also specifically configured to: remove invalid offsets from the first set of offsets to obtain a fifth set of offsets, the fifth set of offsets including P offsets; arrange the P offsets in ascending order to obtain the sorted P offsets; create P areas according to the sorted P offsets, and obtain the length of each row of data of the first data, the P areas corresponding one-to-one to the P offsets; and, in sequence, read the data corresponding to R bytes from the second row data part, and sequentially store the data corresponding to the p-th byte among the R bytes as the data corresponding to the q-th byte of the s-th area among the P areas, to complete the q-th data read and write, where p is a positive integer taken from 1 to R, R is the number of areas among the P areas that are not yet filled with data, the amount of data in the s-th area is the length of the s-th row of data of the first data, s is a positive integer, the offset corresponding to the s-th area is the s-th offset, and the s-th offset is an offset in the fifth set of offsets other than the offsets corresponding to areas already filled with data.
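  • The round-by-round read and write described above can be sketched as follows. This is a minimal sketch: the function name `restore_rows` is hypothetical, and the row lengths are passed in directly, whereas the text derives them from the sorted offsets.

```python
def restore_rows(second_data, lengths):
    """Refill P areas round by round.

    In round q, R bytes are read, where R is the number of areas that are
    not yet full; the p-th of those bytes becomes byte q of the p-th
    area that still has room.
    """
    areas = [bytearray() for _ in lengths]
    pos = 0
    while True:
        unfilled = [s for s, length in enumerate(lengths)
                    if len(areas[s]) < length]
        if not unfilled:
            break
        chunk = second_data[pos:pos + len(unfilled)]
        if len(chunk) < len(unfilled):
            raise ValueError("converted data is shorter than the row lengths require")
        for p, s in enumerate(unfilled):
            areas[s].append(chunk[p])
        pos += len(unfilled)
    return [bytes(a) for a in areas]

rows_a = restore_rows(b"aaabbbcccdde", [3, 4, 5])
rows_b = restore_rows(b"aaabbbcccdde", [5, 3, 4])
```

The same converted bytes restore to different rows depending on the lengths, which is why the offsets have to be restored before the row data.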
  • the processing unit is further specifically configured to: before sequentially reading the data corresponding to P bytes from the second row data part and storing the data corresponding to the p-th byte among the P bytes as the data corresponding to the q-th byte of the s-th area to complete the q-th data read and write, determine that the difference between the lengths of the rows of data in the first data is less than or equal to a third threshold.
  • the preprocessing further includes byte-level accumulation processing, and the accumulation processing includes accumulation between column data.
  • the processing unit is further specifically configured to: accumulate, by bytes, the data of adjacent columns on the c1-th row of the second set of offsets to obtain the third set of offsets, where 1 ≤ c1 ≤ c2, c1 and c2 are both positive integers, and c2 is equal to the maximum row length of the second set of offsets or c2 is equal to the minimum row length of the second set of offsets; and perform row-column conversion on the third set of offsets by bytes to obtain the first set of offsets; the processing unit is also specifically configured to: accumulate, by bytes, the data of adjacent columns on the d1-th row of the second data to obtain the third data, where 1 ≤ d1 ≤ d2, d1 and d2 are both positive integers, and d2 is equal to the maximum row length of the second data or d2 is equal to the minimum row length of the second data; and, according to the first set of offsets, perform row-column conversion on the third data by bytes to obtain the first data.
  • the processing unit is further specifically configured to: perform row-column conversion on the third set of offsets by bytes, according to the unit offset length of the third set of offsets, to obtain the first set of offsets; the processing unit is also specifically configured to: remove invalid offsets from the first set of offsets to obtain a fifth set of offsets, the fifth set of offsets including P offsets; arrange the P offsets in ascending order to obtain the sorted P offsets; create P areas according to the sorted P offsets, and obtain the length of each row of data of the first data, the P areas corresponding one-to-one to the P offsets; and, in sequence, read the data corresponding to R bytes from the third data, and sequentially store the data corresponding to the p-th byte among the R bytes as the data corresponding to the q-th byte of the s-th area among the P areas, to complete the q-th data read and write, where p is a positive integer taken from 1 to R, R is the number of areas among the P areas that are not yet filled with data, the amount of data in the s-th area is the length of the s-th row of data of the first data, s is a positive integer, the offset corresponding to the s-th area is the s-th offset, and the s-th offset is the p-th offset among the offsets in the fifth set of offsets other than the offsets corresponding to areas already filled with data; the second data stored in the second row data part is overwritten.
  • the processing unit is further configured to: before sequentially reading the data corresponding to P bytes from the second row data part and storing the data corresponding to the p-th byte among the P bytes as the data corresponding to the q-th byte of the s-th area to complete the q-th data read and write, determine that the difference between the lengths of the rows of data in the first data is less than or equal to a third threshold.
  • the processing unit is further configured to: before creating the P areas according to the sorted P offsets and obtaining the length of each row of data of the first data, determine that P is less than or equal to a fourth threshold.
  • the processing unit is further configured to split the first data page to obtain the plurality of third data pages.
  • the processing unit is further specifically configured to: obtain the starting points and end points of the fourth data of the plurality of third data pages, and the starting points and end points of the fourth sets of offsets; obtain the plurality of fourth data from the first data page according to the starting points and end points of the fourth data; obtain the plurality of fourth sets of offsets from the first data page according to the starting points and end points of the fourth sets of offsets; and store the plurality of fourth data and the plurality of fourth sets of offsets in the plurality of third data pages, respectively.
  • the first data page includes information indicating that the first data page has been reorganized.
  • a device for processing data pages is provided. The device includes a processing unit configured to: decompress a compressed data page to obtain a second data page; and obtain a first data page according to the second data page; wherein the second data page includes second data based on a row storage method and a second set of offsets, the second set of offsets being used to indicate the offset of each row of data in the second data; the first data page includes first data based on the row storage method and a first set of offsets, the first set of offsets being used to indicate the offset of each row of data in the first data; the first data is data obtained by preprocessing the second data, and the first set of offsets is a set of offsets obtained by preprocessing the second set of offsets.
  • the processing unit of the data page processing device decompresses the compressed data page to obtain the second data page at a relatively high rate, thereby improving the decompression rate of the data page.
  • the decompression time of the data page processing device is basically the same as that of the existing decompression device.
  • the processing unit is further specifically configured to: obtain the second data and the second set of offsets respectively from the second data page; perform the preprocessing on the second set of offsets by bytes to obtain the first set of offsets; perform the preprocessing on the second data by bytes according to the first set of offsets to obtain the first data; and obtain the first data page according to the first data and the first set of offsets.
  • the second data page includes a second row data part and a second directory part, the second row data part being used to store the second data and the second directory part being used to store the second set of offsets; the processing unit is also specifically configured to: update the second data stored in the second row data part to the first data, and update the second set of offsets stored in the second directory part to the first set of offsets, to obtain the first data page.
  • the processing unit is further specifically configured to: perform the preprocessing on the second set of offsets by bytes, according to the unit offset length of the second set of offsets, to obtain the first set of offsets; the processing unit is also specifically configured to: remove invalid offsets from the first set of offsets to obtain a fifth set of offsets, the fifth set of offsets including P offsets; arrange the P offsets in ascending order to obtain the sorted P offsets; create P areas according to the sorted P offsets, and obtain the length of each row of data of the first data, the P areas corresponding one-to-one to the P offsets; and, in sequence, read the data corresponding to R bytes from the second row data part, and sequentially store the data corresponding to the p-th byte among the R bytes as the data corresponding to the q-th byte of the s-th area among the P areas, to complete the q-th data read and write, where p is a positive integer taken from 1 to R, R is the number of areas among the P areas that are not yet filled with data, the amount of data in the s-th area is the length of the s-th row of data of the first data, s is a positive integer, the offset corresponding to the s-th area is the s-th offset, and the s-th offset is an offset in the fifth set of offsets other than the offsets corresponding to areas already filled with data.
  • the processing unit is further specifically configured to: before sequentially reading the data corresponding to P bytes from the second row data part and storing the data corresponding to the p-th byte among the P bytes as the data corresponding to the q-th byte of the s-th area to complete the q-th data read and write, determine that the difference between the lengths of the rows of data in the first data is less than or equal to a third threshold.
  • the preprocessing further includes byte-level accumulation processing, and the accumulation processing includes accumulation between column data.
  • the processing unit is further specifically configured to: accumulate, by bytes, the data of adjacent columns on the c1-th row of the second set of offsets to obtain the third set of offsets, where 1 ≤ c1 ≤ c2, c1 and c2 are both positive integers, and c2 is equal to the maximum row length of the second set of offsets or c2 is equal to the minimum row length of the second set of offsets; and perform row-column conversion on the third set of offsets by bytes to obtain the first set of offsets; the processing unit is also specifically configured to: accumulate, by bytes, the data of adjacent columns on the d1-th row of the second data to obtain the third data, where 1 ≤ d1 ≤ d2, d1 and d2 are both positive integers, and d2 is equal to the maximum row length of the second data or d2 is equal to the minimum row length of the second data; and, according to the first set of offsets, perform row-column conversion on the third data by bytes to obtain the first data.
  • the processing unit is further specifically configured to: perform row-column conversion on the third set of offsets by bytes, according to the unit offset length of the third set of offsets, to obtain the first set of offsets; the processing unit is also specifically configured to: remove invalid offsets from the first set of offsets to obtain a fifth set of offsets, the fifth set of offsets including P offsets; arrange the P offsets in ascending order to obtain the sorted P offsets; create P areas according to the sorted P offsets, and obtain the length of each row of data of the first data, the P areas corresponding one-to-one to the P offsets; and, in sequence, read the data corresponding to R bytes from the third data, and sequentially store the data corresponding to the p-th byte among the R bytes as the data corresponding to the q-th byte of the s-th area among the P areas, to complete the q-th data read and write, where p is a positive integer taken from 1 to R, R is the number of areas among the P areas that are not yet filled with data, the amount of data in the s-th area is the length of the s-th row of data of the first data, s is a positive integer, the offset corresponding to the s-th area is the s-th offset, and the s-th offset is the p-th offset among the offsets in the fifth set of offsets other than the offsets corresponding to areas already filled with data; the second data stored in the second row data part is overwritten.
  • the processing unit is further configured to: before sequentially reading the data corresponding to P bytes from the second row data part and storing the data corresponding to the p-th byte among the P bytes as the data corresponding to the q-th byte of the s-th area to complete the q-th data read and write, determine that the difference between the lengths of the rows of data in the first data is less than or equal to a third threshold.
  • the processing unit is further configured to: before creating the P areas according to the sorted P offsets and obtaining the length of each row of data of the first data, determine that P is less than or equal to a fourth threshold.
  • the processing unit is further configured to: split the first data page to obtain the plurality of third data pages.
  • the processing unit is further specifically configured to: obtain the starting points and end points of the fourth data of the plurality of third data pages, and the starting points and end points of the fourth sets of offsets; obtain the plurality of fourth data from the first data page according to the starting points and end points of the fourth data; obtain the plurality of fourth sets of offsets from the first data page according to the starting points and end points of the fourth sets of offsets; and store the plurality of fourth data and the plurality of fourth sets of offsets in the plurality of third data pages, respectively.
  • the first data page includes information indicating that the first data page has been reorganized.
  • a device for processing data pages is provided, including a processor and a memory; the memory is used to store a computer program; the processor is used to execute the computer program stored in the memory, so that the device performs the method described in the first aspect or the second aspect, or in any possible implementation thereof.
  • a computer-readable storage medium is provided. A computer program is stored on the computer-readable storage medium; when the computer program is run on a computer, it causes the computer to execute the method described in the first aspect or the second aspect, or in any possible implementation thereof.
  • a seventh aspect provides a chip system, including a processor configured to call and run a computer program from a memory, so that a device installed with the chip system executes the method described in the first aspect or the second aspect, or in any possible implementation thereof.
  • a computer program product containing instructions is provided. When the computer program product is run on a device, it causes the device to perform the method described in the first aspect or the second aspect, or in any possible implementation thereof.
  • FIG. 1 is a schematic flow chart of an example of a data page compression method 200 provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an example data page provided by an embodiment of the present application.
  • Figures 3 to 6 are schematic flow charts for obtaining the second data page provided by embodiments of the present application.
  • FIG. 7 is a schematic diagram of an example of reorganization of multiple data pages provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram illustrating an example of the compression performance of the data page compression method provided by the embodiment of the present application and the existing compression method.
  • FIG. 9 is a schematic diagram illustrating another example of the compression performance of the data page compression method provided by the embodiment of the present application and the existing compression method.
  • Figure 10 is a schematic diagram of another example of the compression performance of the data page compression method provided by the embodiment of the present application and the existing compression method.
  • Figure 11 is a schematic diagram of another example of the compression performance of the data page compression method provided by the embodiment of the present application and the existing compression method.
  • Figure 12 is a schematic flow chart of another example of a data page decompression method provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a second data reading and writing process provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of another example of a second data reading and writing process provided by an embodiment of the present application.
  • FIG. 15 is a schematic block diagram of an example of a data page processing device provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of an example of a data page processing device provided by an embodiment of the present application.
  • the data compression library compresses a single data page based on a general compression algorithm (such as zlib, lz4, or zstd) and stores the compressed data at the original address.
  • this approach relies on the hole-punching feature of the file system, and the hole-punching unit on a Linux system is 4K. Data pages of 32K and 64K do not support this kind of compression; the largest data page that supports it is 16K. Therefore, the compression ratio can reach a maximum of 4:1.
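  • The 4:1 ceiling follows directly from the 4K unit and the 16K page size stated above: space can only be released in whole 4K units, so a page can never occupy less than one unit. A minimal sketch of that arithmetic, with hypothetical function names:

```python
import math

HOLE_UNIT = 4 * 1024   # hole-punching granularity stated in the text
PAGE_SIZE = 16 * 1024  # largest page size that supports this compression

def stored_size(compressed_len):
    """Space still allocated after releasing whole 4K units (at least one unit)."""
    return max(1, math.ceil(compressed_len / HOLE_UNIT)) * HOLE_UNIT

def effective_ratio(compressed_len, page_size=PAGE_SIZE):
    """Ratio of the page size to the space the compressed page still occupies."""
    return page_size / stored_size(compressed_len)

best_case = effective_ratio(1000)   # compressed page fits in one 4K unit
typical = effective_ratio(5000)     # compressed page needs two 4K units
```

Even if the compressor shrinks a page to a few hundred bytes, the page still occupies one 4K unit on disk.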
  • For another example, in the dictionary compression used by Oracle, a dictionary of row fields is created at the block level (equivalent to the data page concept), and the fields store references to dictionary elements. When data is updated, inserted, or deleted, a threshold is used to control whether to perform compression.
  • this dictionary compression is a dictionary compression algorithm based on storage blocks. A symbol table is maintained within the page, and all operations are performed based on this symbol table for transformation and inverse transformation, making the implementation complex.
  • dictionary compression is easily affected by the characteristics of the data itself. If the data is not highly repetitive, the compression rate will be very low.
  • DB2 supports page-level compression dictionary algorithm and table-level dictionary compression algorithm. Page-level dictionaries and table-level dictionaries are stored in hidden rows of the table. Once the dictionary is created, it will not be updated unless the field table is rebuilt. The compression rate of this compression algorithm strongly depends on the characteristics of the data. If the initial data characteristics are poor and unrepresentative, then this compression algorithm will not have better compression results.
  • embodiments of the present application provide a data page processing method, where the data page processing may include data page compression and/or data page decompression.
  • the data page processing method not only has a higher compression rate (or decompression rate), but also takes basically the same compression time (or decompression time) as existing compression (or decompression) methods.
  • the embodiments of this application may not limit the application scenarios of the data page processing method provided by the embodiments of this application.
  • the data page processing method provided by the embodiments of the present application can be applied to, but is not limited to, online production environments (such as online transaction processing (OLTP)), compressed backup storage of database files, and scenarios in which primary and backup logistics files are copied.
  • embodiments of the present application provide a data page compression method.
  • This data page compression method not only has a higher compression rate, but also has a compression time that is basically the same as that of existing compression methods.
  • FIG. 1 is a schematic flow chart of an example of a data page compression method 200 provided by an embodiment of the present application.
  • the method 200 includes S210 and S220, and S220 is executed after S210.
  • S210 and S220 are described in detail below.
  • the first data page includes first data based on a row storage method and a first set of offsets, where the first set of offsets is used to indicate the offset of each row of the first data.
  • the embodiment of the present application does not limit the form of the offset.
  • the offset may be a position (eg, number of bytes) relative to the head of the data page.
  • FIG. 2 is a schematic diagram of an example data page provided by an embodiment of the present application.
  • the data page may include a row data part and a directory part.
  • the row data part is used to store data
  • the directory part is used to store the offset of each row of data stored in the row data part.
  • the data page may also include a header, an idle part, and/or a tail.
  • the header and/or trailer are used to store information related to the data page.
  • the information related to the data page may include, but is not limited to: the number of the data page, the type of the data page, the number of rows of data stored in the row data part of the data page, the number of data stored in the row data part of the data page.
  • the free part is mainly used for the expansion of the row data part and/or the tail.
  • the first data page may include a first header, a first row data part, and a first directory part.
  • the first header is used to store information related to the first data page; the first row data part is used to store the first data, and the first directory part is used to store the first set of offsets.
  • Table 1 is an example of the first data.
  • As shown in Table 1, the first data includes three rows of data: the first row occupies 3 bytes, which hold a, b, c in sequence; the second row occupies 4 bytes, which hold a, b, c, d in sequence; the third row occupies 5 bytes, which hold a, b, c, d, e in sequence.
  • Table 2 is another example of the first data.
  • As shown in Table 2, the first data includes three rows of data: the first row occupies 5 bytes, which hold a, b, c, d, e in sequence; the second row occupies 3 bytes, which hold a, b, c in sequence; the third row occupies 4 bytes, which hold a, b, c, d in sequence.
  • the embodiment of the present application does not limit the arrangement order of the offsets in the first group of offsets.
  • the offsets in the first set of offsets are arranged in order from small to large.
  • Table 3 shows an example of the first set of offsets. The offsets in the first set of offsets shown in Table 3 are arranged in ascending order, and they are the offsets corresponding to the first data shown in Table 1.
  • The first data shown in Table 1 includes three rows; therefore, the first set of offsets shown in Table 3 includes three offsets. The first offset (the offset of the first row of the first data) occupies 2 bytes, which hold 0x00 and 0x01 in sequence; the second offset (the offset of the second row of the first data) occupies 2 bytes, which hold 0x00 and 0x04 in sequence; the third offset (the offset of the third row of the first data) occupies 2 bytes, which hold 0x00 and 0x08 in sequence.
  • the offsets in the first set of offsets are arranged out of order.
  • Table 4 is another example of the first set of offsets. Among them, the offsets in the first set of offsets shown in Table 4 are arranged in disorder, and the first set of offsets shown in Table 4 are the offsets corresponding to the first data shown in Table 2.
  • The first data shown in Table 2 includes three rows; therefore, the first set of offsets shown in Table 4 includes three offsets. The first offset (the offset of the second row of the first data) occupies 2 bytes, which hold 0x00 and 0x06 in sequence; the second offset (the offset of the third row of the first data) occupies 2 bytes, which hold 0x00 and 0x09 in sequence; the third offset (the offset of the first row of the first data) occupies 2 bytes, which hold 0x00 and 0x01 in sequence.
  • the offset is expressed in hexadecimal as an example for description, which should not limit the embodiments of the present application.
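  • The relationship between the row data part and the directory part can be sketched as follows. This is only an illustration under stated assumptions: the one-byte header, the absence of an idle part, the big-endian 2-byte encoding, and the helper name `build_page` are all hypothetical and are chosen so that the offsets match the Table 1 and Table 3 example.

```python
import struct

# Hypothetical layout: one reserved byte at position 0 stands in for the header,
# so that the first row starts at offset 0x01 as in the Table 3 example.
HEADER = b"\x00"
ROWS = [b"abc", b"abcd", b"abcde"]  # first data, as described for Table 1

def build_page(rows, header=HEADER):
    """Return (page_bytes, offsets); offsets are byte positions from the page head."""
    offsets = []
    body = bytearray(header)
    for row in rows:
        offsets.append(len(body))  # position of the first byte of this row
        body += row
    # Directory part: each offset stored in 2 bytes, high byte first (0x00, 0x01 ...).
    directory = b"".join(struct.pack(">H", off) for off in offsets)
    return bytes(body) + directory, offsets

page, offsets = build_page(ROWS)
print([hex(o) for o in offsets])  # ['0x1', '0x4', '0x8']
```

  • The last six bytes of `page` are then 0x00, 0x01, 0x00, 0x04, 0x00, 0x08, which is the byte sequence described for Table 3.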
  • the second data page includes second data based on a row storage method and a second set of offsets, and the second set of offsets is used to indicate the offset of each row of the second data.
  • The second data is obtained by preprocessing the first data, and the second set of offsets is obtained by preprocessing the first set of offsets. That is, in S210, each part of the data included in the first data page is preprocessed to obtain the second data page.
  • preprocessing includes only byte-level based row-column conversion.
  • the preprocessing includes not only byte-level-based row-column conversion, but also includes byte-level-based differential processing. Among them, difference processing includes making differences between column data.
  • Byte-level row-column conversion can be understood as row-column conversion in units of bytes.
  • Differential processing based on byte level can be understood as differential processing in units of bytes.
  • the preprocessing includes only byte-level based differential processing.
  • For ease of description, the case where preprocessing includes only byte-level row-column conversion is recorded as case 1, the case where preprocessing includes both byte-level row-column conversion and byte-level differential processing is recorded as case 2, and the case where preprocessing includes only byte-level differential processing is recorded as case 3.
  • Figure 3 is a schematic flow chart of obtaining the second data page in an example provided by the embodiment of the present application when the preprocessing is case 1.
  • Figure 4 is a schematic flow chart of an example of obtaining second data provided by the embodiment of the present application when the preprocessing is case 1.
  • Figure 5 is a schematic flow chart of an example of obtaining second data provided by the embodiment of the present application when the preprocessing is case 2.
  • Figure 6 is a schematic flowchart of an example of obtaining the second set of offsets provided by the embodiment of the present application when the preprocessing is case 2.
  • Case 1: preprocessing includes only byte-level row-column conversion.
  • In case 1, S210 specifically includes S211 to S214.
  • In S211, the starting point (rows_begin) and end point (rows_end) of the first data and the starting point (dirs_begin) and end point (dirs_end) of the offsets corresponding to the first data can each be obtained from the first header. Then, the first data is obtained from the first row data part of the first data page according to rows_begin and rows_end, and the first set of offsets is obtained from the first directory part of the first data page according to dirs_begin and dirs_end.
  • the embodiments of this application do not limit the form of the starting point and/or the ending point.
  • the start point and/or the end point may be a position (eg, number of bytes) relative to the head of the data page.
  • the embodiment of the present application does not limit the execution order of obtaining the first data and the first set of offsets.
  • the first data can be obtained first, and then the first set of offsets can be obtained; the first set of offsets can be obtained first, and then the first data can be obtained; or the first data and the first set of offsets can be obtained at the same time.
  • S212 Preprocess the first data according to bytes to obtain the second data.
  • S212 includes S2121 to S2125.
  • S2121 to S2125 are introduced in detail below.
  • S2121: According to the starting point (dirs_begin) and end point (dirs_end) of the offsets of the first data and the unit offset length of the first set of offsets, obtain the number (total_dir_cnt) M of offsets included in the first set of offsets.
  • The starting point (dirs_begin) and end point (dirs_end) of the offsets of the first data may be obtained from the first header.
  • M satisfies the following formula:
  • the unit offset length of the first set of offsets is the unit offset length of the first set of offsets.
  • the embodiment of the present application does not limit the size of the unit offset length of the first set of offsets.
  • the following description takes the example that the unit offset length of the first set of offsets is 2 bytes.
  • the number M of offsets included in the first set of offsets can be understood as the number of rows of the first data.
  • S2122 remove invalid offsets from M offsets to obtain N offsets, N is less than or equal to M, and N and M are both positive integers.
  • S2122 may be implemented based on information indicating an invalid offset.
  • the embodiment of the present application does not limit the storage location of the information indicating the invalid offset.
  • information indicating an invalid offset may be stored in the directory part, header, or trailer.
  • S2122 can also be implemented based on each offset in the first set of offsets and the starting point (rows_begin) and end point (rows_end) of the first data obtained in S211.
  • If N of the M offsets lie between the starting point (rows_begin) and the end point (rows_end) of the first data and the other (M-N) offsets do not, then those N offsets are valid and the (M-N) offsets are invalid.
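  • The range-based variant of S2121 and S2122 can be sketched as follows. This is a minimal sketch under assumptions, not the method's definitive form: `dirs_end` is treated as exclusive, `rows_end` as inclusive, offsets are decoded as big-endian 2-byte values, and the stale directory entry 0xFFFF and the helper names are hypothetical.

```python
import struct

UNIT_LEN = 2  # unit offset length in bytes, as in the running example

def read_offsets(page, dirs_begin, dirs_end):
    """Decode the directory bytes into integers.

    Assumption: dirs_end is exclusive, so the count M is
    (dirs_end - dirs_begin) // UNIT_LEN.
    """
    m = (dirs_end - dirs_begin) // UNIT_LEN
    return list(struct.unpack(f">{m}H", page[dirs_begin:dirs_begin + m * UNIT_LEN]))

def valid_offsets(offsets, rows_begin, rows_end):
    """Keep offsets that point inside the row data part (rows_end inclusive)."""
    return [off for off in offsets if rows_begin <= off <= rows_end]

# Hypothetical page: 1 header byte, 12 bytes of row data (Table 2), then a
# directory with the three Table 4 offsets plus one stale entry (0xFFFF).
page = b"\x00" + b"abcdeabcabcd" + bytes([0x00, 0x06, 0x00, 0x09, 0x00, 0x01, 0xFF, 0xFF])
rows_begin, rows_end = 1, 12
dirs_begin, dirs_end = 13, 21

all_offsets = read_offsets(page, dirs_begin, dirs_end)   # M = 4 offsets
kept = valid_offsets(all_offsets, rows_begin, rows_end)  # N = 3 offsets
print(all_offsets, kept)  # [6, 9, 1, 65535] [6, 9, 1]
```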
  • S2123 Arrange the N offsets in ascending order to obtain the sorted N offsets.
  • the embodiment of the present application does not limit the order in which the N offsets are arranged in the directory part.
  • the N offsets are arranged in order from small to large or from large to small in the directory part. In another example, the N offsets may be arranged out of order in the directory part.
  • the three offsets shown in Table 3 are arranged in ascending order, and the obtained three sorted offsets are 0x01, 0x04, and 0x08.
  • the three offsets shown in Table 4 are arranged in ascending order, and the obtained three sorted offsets are 0x01, 0x06, and 0x09.
  • N offsets can be sorted by insertion sort.
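  • For illustration, an insertion sort over the N offsets might look like the sketch below. The function name `insertion_sort` is hypothetical; the input is the out-of-order offsets described for Table 4.

```python
def insertion_sort(offsets):
    """Return a new list with the offsets in ascending order."""
    result = list(offsets)
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift larger elements one slot to the right to make room for current.
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result

sorted_offsets = insertion_sort([0x06, 0x09, 0x01])
print([hex(o) for o in sorted_offsets])  # ['0x1', '0x6', '0x9']
```

  • Insertion sort is a reasonable fit when the directory is already nearly sorted, since it then does little work per element.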
  • S2124: Divide the first row data part into N areas according to the sorted N offsets, and obtain the length of each row of the first data.
  • The number of bytes in the n-th area among the N areas is the length of the n-th row of the first data, and the data in the n-th area is the data of the n-th row of the first data.
  • In the first row data part, the data from the byte at the n-th offset up to the byte immediately before the byte at the (n+1)-th offset is taken as the data in the n-th area, where n runs from 1 to (N-1). That is, the data in the n-th area is the data of the n-th row of the first data.
  • The data from the byte at the N-th offset to the end point (rows_end) of the first data is taken as the data in the N-th area. That is, the data in the N-th area is the data of the N-th row of the first data. In this way, the first row data part is divided into N areas.
  • The difference between the (n+1)-th offset and the n-th offset is taken as the length of the data in the n-th area, where n runs from 1 to (N-1), and the difference between the end point of the first data and the N-th offset, plus 1, is taken as the length of the data in the N-th area. In this way, the length of each row of the first data is obtained.
  • For example, the three offsets shown in Table 3 are sorted as 0x01, 0x04, and 0x08.
  • In the first row data part, the data from the byte at the first offset (0x01) (a in the first row shown in Table 1) up to the byte immediately before the byte at the second offset (0x04) (that is, c in the first row shown in Table 1) is taken as the data in the first area.
  • That is, the data in the first area of the first data shown in Table 1 is the data (abc) in the first row shown in Table 1.
  • The data from the byte at the second offset (0x04) (a in the second row shown in Table 1) up to the byte immediately before the byte at the third offset (0x08) (that is, d in the second row shown in Table 1) is taken as the data in the second area.
  • That is, the data in the second area of the first data shown in Table 1 is the data (abcd) in the second row shown in Table 1.
  • For another example, the three offsets shown in Table 4 are sorted as 0x01, 0x06, and 0x09.
  • The data in the first area of the first data shown in Table 2 is the data (abcde) in the first row shown in Table 2.
  • The data from the byte at the second offset (0x06) (a in the second row shown in Table 2) up to the byte immediately before the byte at the third offset (0x09) (that is, c in the second row shown in Table 2) is taken as the data in the second area.
  • That is, the data in the second area of the first data shown in Table 2 is the data (abc) in the second row shown in Table 2.
  • The embodiments of the present application do not limit the execution order between dividing the first row data part into N areas and obtaining the length of each row of the first data. For example, the first row data part can be divided into N areas first and the length of each row of the first data obtained afterwards; or the length of each row of the first data can be obtained first and the first row data part divided into N areas afterwards; or the two can be done at the same time.
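  • The division in S2124 can be sketched as follows, computing the areas and the row lengths in a single pass. The sketch assumes that `rows_end` is the inclusive position of the last byte of row data (which is what the "plus 1" in the length of the N-th area suggests) and reuses the hypothetical one-byte header; `split_rows` is an illustrative name.

```python
def split_rows(page, sorted_offsets, rows_end):
    """Split the row data part into one area per offset.

    sorted_offsets must be ascending; rows_end is the inclusive position of
    the last byte of row data. Returns (areas, lengths).
    """
    areas, lengths = [], []
    for n, start in enumerate(sorted_offsets):
        if n + 1 < len(sorted_offsets):
            length = sorted_offsets[n + 1] - start  # up to the next row's first byte
        else:
            length = rows_end - start + 1           # last row runs to rows_end
        lengths.append(length)
        areas.append(page[start:start + length])
    return areas, lengths

# Table 1 example: 1 hypothetical header byte followed by abc, abcd, abcde.
page = b"\x00" + b"abcabcdabcde"
areas, lengths = split_rows(page, [0x01, 0x04, 0x08], rows_end=12)
print(areas, lengths)  # [b'abc', b'abcd', b'abcde'] [3, 4, 5]
```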
  • Optionally, before S2124, it may be determined whether N is less than or equal to a second threshold, and S2124, S2125, S213, and S214 are executed only if N is less than or equal to the second threshold. In this way, byte-level row-column conversion is performed on the first data only when the first data does not have many rows, thereby avoiding a waste of resources.
  • the embodiment of the present application does not limit the specific value of the second threshold, which can be set according to the actual situation.
  • S2125: According to the arrangement order of the N offsets, take the data corresponding to the i-th byte from each of the N areas in turn as the data in columns 1 to N of the i-th row of the second data, where i runs from 1 to L1, i is a positive integer, and L1 is the maximum row length of the first data. The arrangement order of the N offsets is either ascending order or the order in which the N offsets are arranged in the first set of offsets.
  • In one example, the N offsets are arranged in ascending order.
  • For the first data shown in Table 1, the data in the first area is the first row of data (abc), the data in the second area is the second row of data (abcd), and the data in the third area is the third row of data (abcde) shown in Table 1.
  • The maximum row length L1 of the first data is 5.
  • In the first read, the data (a) corresponding to the 1st byte is taken from the first, second, and third areas in turn as the data in columns 1 to 3 of the 1st row of the second data. In other words, the first read yields the first row of the second data, namely aaa.
  • In the second read, the data (b) corresponding to the 2nd byte is taken from the three areas in turn as the data in columns 1 to 3 of the 2nd row of the second data. That is, the second read yields the second row of the second data, namely bbb.
  • In the third read, the data (c) corresponding to the 3rd byte is taken from the three areas in turn as the data in columns 1 to 3 of the 3rd row of the second data. That is, the third read yields the third row of the second data, namely ccc.
  • In the same way, the fourth read yields the fourth row of the second data, namely *dd, and the fifth read yields the fifth row of the second data, namely **e.
  • In this way, byte-level row-column conversion of the first data shown in Table 1 yields the second data shown in Table 5.
  • As shown in Table 5, the second data includes five rows of data, each of which occupies 3 bytes: the 3 bytes of the first row hold a, a, a; those of the second row hold b, b, b; those of the third row hold c, c, c; those of the fourth row hold *, d, d; and those of the fifth row hold *, *, e.
  • In another example, the arrangement order of the N offsets is the order in which they are arranged in the first set of offsets.
  • For the first data shown in Table 2, the data in the first area is the first row of data (abcde), the data in the second area is the second row of data (abc), and the data in the third area is the third row of data (abcd) shown in Table 2.
  • The maximum row length L1 of the first data is 5.
  • With the offsets ordered as in Table 4 (0x06, 0x09, 0x01), in the first read the data (a) corresponding to the 1st byte is taken from the second area as the data in the 1st row and 1st column of the second data, from the third area as the data in the 1st row and 2nd column, and from the first area as the data in the 1st row and 3rd column.
  • In other words, the first read yields the first row of the second data, namely aaa.
  • In the second read, the data (b) corresponding to the 2nd byte is taken from the second area as the data in the 2nd row and 1st column of the second data, from the third area as the data in the 2nd row and 2nd column, and from the first area as the data in the 2nd row and 3rd column. That is, the second read yields the second row of the second data, namely bbb.
  • In the third read, the data (c) corresponding to the 3rd byte is taken from the second area as the data in the 3rd row and 1st column of the second data, from the third area as the data in the 3rd row and 2nd column, and from the first area as the data in the 3rd row and 3rd column. That is, the third read yields the third row of the second data, namely ccc.
  • In the same way, in the fifth read the data (e) corresponding to the 5th byte is taken from the first area as the data in the 5th row and 3rd column of the second data.
  • In other words, the fifth read yields the fifth row of the second data, namely **e.
  • In this way, byte-level row-column conversion of the first data shown in Table 2 also yields the second data shown in Table 5.
  • For a description of Table 5, refer to the relevant description above; details are not repeated here.
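  • The byte-level row-column conversion of S2125 can be sketched as below. The `*` placeholder for positions beyond the end of a shorter row is taken from Table 5; how the method actually encodes such positions is not stated here, so treating it as a literal padding byte is an assumption, and `rows_to_columns` is a hypothetical name.

```python
PAD = ord("*")  # assumption: '*' marks a position past the end of a shorter row

def rows_to_columns(areas, pad=PAD):
    """Row i of the result collects byte i of every area, in the given area order."""
    max_len = max(len(area) for area in areas)  # L1, the maximum row length
    return [
        bytes(area[i] if i < len(area) else pad for area in areas)
        for i in range(max_len)
    ]

# Table 1 with offsets in ascending order.
second_data = rows_to_columns([b"abc", b"abcd", b"abcde"])

# Table 2 with offsets in directory order (0x06, 0x09, 0x01, as in Table 4).
table2_areas = {0x01: b"abcde", 0x06: b"abc", 0x09: b"abcd"}
from_table2 = rows_to_columns([table2_areas[o] for o in (0x06, 0x09, 0x01)])

print(second_data)  # [b'aaa', b'bbb', b'ccc', b'*dd', b'**e']
```

  • Both inputs give the same five rows, which matches the statement that Table 1 and Table 2 both convert to Table 5. Grouping the i-th bytes of all rows together tends to produce long runs of similar bytes, which is what a general-purpose compressor can exploit.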
  • Optionally, it may also be determined whether the difference between the lengths of the rows of the first data is less than or equal to a first threshold, that is, S2126. S2125, S213, and S214 are executed only when the difference between the lengths of the rows of the first data is less than or equal to the first threshold. In this way, byte-level row-column conversion is performed on the first data only when the row lengths do not differ much, thereby avoiding a waste of resources.
  • the embodiment of the present application does not limit the specific value of the first threshold, which can be set according to the actual situation.
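  • The two threshold checks can be combined into one guard, sketched below. The text does not define how the difference between row lengths is measured; the sketch assumes it is the longest length minus the shortest. The threshold values in the example and the name `should_convert` are hypothetical.

```python
def should_convert(lengths, first_threshold, second_threshold):
    """Decide whether byte-level row-column conversion is worth performing.

    lengths: row lengths of the first data (one entry per valid offset).
    Assumption: the length difference is max(lengths) - min(lengths).
    """
    if not lengths:
        return False
    row_count_ok = len(lengths) <= second_threshold
    spread_ok = max(lengths) - min(lengths) <= first_threshold
    return row_count_ok and spread_ok

print(should_convert([3, 4, 5], first_threshold=2, second_threshold=100))   # True
print(should_convert([3, 4, 50], first_threshold=2, second_threshold=100))  # False
```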
  • S213 Preprocess the first set of offsets according to bytes to obtain the second set of offsets.
  • Byte-level row-column conversion is performed on the N offsets in the first set of offsets to obtain the second set of offsets.
  • Specifically, the N offsets are converted in units of bytes, following the order in which they are arranged in the first set of offsets, to obtain the second set of offsets.
  • Table 6 is an example of a second set of offsets obtained by performing byte-level row-column conversion on the N offsets in the first set of offsets shown in Table 3.
  • As shown in Table 6, the second set of offsets includes two rows of data. The first row occupies 3 bytes, which hold 0x00, 0x00, 0x00 in sequence; the second row occupies 3 bytes, which hold 0x01, 0x04, 0x08 in sequence.
  • Table 7 is an example of a second set of offsets obtained by performing byte-level row-column conversion on the N offsets in the first set of offsets shown in Table 4.
  • As shown in Table 7, the second set of offsets includes two rows of data. The first row occupies 3 bytes, which hold 0x00, 0x00, 0x00 in sequence; the second row occupies 3 bytes, which hold 0x06, 0x09, 0x01 in sequence.
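  • The same conversion applied to the directory can be sketched as follows, assuming each offset is stored in 2 bytes with the high byte first (as in the 0x00, 0x01 examples). The name `transpose_offsets` is hypothetical.

```python
def transpose_offsets(offsets, unit_len=2):
    """Row k of the result collects byte k of every encoded offset, in the given order."""
    encoded = [off.to_bytes(unit_len, "big") for off in offsets]
    return [bytes(e[k] for e in encoded) for k in range(unit_len)]

table6 = transpose_offsets([0x01, 0x04, 0x08])  # offsets as described for Table 3
table7 = transpose_offsets([0x06, 0x09, 0x01])  # offsets as described for Table 4
print([row.hex() for row in table6])  # ['000000', '010408']
print([row.hex() for row in table7])  # ['000000', '060901']
```

  • Because every offset in a small page shares the same high byte, the first row becomes a run of identical bytes.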
  • A data page may be newly created, and the second data and the second set of offsets may be stored in the newly created data page to form the second data page.
  • a data page can be newly created, the data page includes a second row data part and a second directory part, the second data is stored in the second row data part, and the second set of offsets is stored in the second directory part .
  • the newly created data page is the second data page.
  • the second data page can be obtained based on the original first data page.
  • the first data stored in the first row data part is updated to the second data
  • the first set of offsets stored in the first directory part is updated to the second set of offsets to obtain the second data page.
  • Case 2: preprocessing includes byte-level row-column conversion and byte-level differential processing.
  • In case 2, S210 specifically includes S211 to S214.
  • the specific processes of S211 and S214 in case 2 are the same as the specific processes of S211 and S214 in case 1, and will not be described again here.
  • the specific processes of S212 and S213 in case 2 are different from the specific processes of S212 and S213 in case 1.
  • the specific processes of S212 and S213 in case 2 are introduced in detail below.
  • S212 specifically includes S212A and S212B.
  • S212A and S212B are introduced in detail below.
  • S212A: Perform byte-level row-column conversion on the first data to obtain the third data.
  • S212A includes S2121 to S2124, and S2125A.
  • For S2121 to S2124, refer to the relevant descriptions above; details are not repeated here.
  • The following focuses on S2125A.
  • S2125A: According to the arrangement order of the N offsets, take the data corresponding to the i-th byte from each of the N areas in turn as the data in columns 1 to N of the i-th row of the third data, where i runs from 1 to L1, i is a positive integer, and L1 is the maximum row length of the first data. The arrangement order of the N offsets is either ascending order or the order in which the N offsets are arranged in the first set of offsets.
  • As can be seen from the above description, the process of S2125A is similar to that of S2125 described above. The only difference is that S2125A obtains the third data, whereas S2125 obtains the second data. Therefore, for a detailed description of S2125A, refer to the relevant description of S2125 above; details are not repeated here.
  • S212B: Perform byte-level differential processing on the data of adjacent columns in the a1-th row of the third data to obtain the second data, where 1 ≤ a1 ≤ a2, a1 and a2 are both positive integers, and a2 is equal to the maximum row length of the first data or to the minimum row length of the first data.
  • Table 8 is the second data obtained by differing the data of the adjacent columns on the a1th row of the third data shown in Table 5 in bytes.
  • Table 8 takes a2 equal to the minimum row length (3 bytes) of the first data (shown in Table 1 or Table 2) as an example.
  • Optionally, it may also be determined whether the difference between the lengths of the rows of the first data is less than or equal to the first threshold, that is, S2126A. S2125A, S213, and S214 are executed only when the difference between the lengths of the rows of the first data is less than or equal to the first threshold. In this way, byte-level row-column conversion is performed on the first data only when the row lengths do not differ much, thereby avoiding a waste of resources.
  • S213 specifically includes S213A and S213B.
  • S213A and S213B are introduced in detail below.
  • S213A: Perform byte-level row-column conversion on the first set of offsets to obtain the third set of offsets.
  • As can be seen from the above description, the process of S213A is similar to that of S213 described above. The only difference is that S213A obtains the third set of offsets, whereas S213 obtains the second set of offsets. Therefore, for a detailed description of S213A, refer to the relevant description of S213 above; details are not repeated here.
  • S213B: Perform byte-level differential processing on the data of adjacent columns in the b1-th row of the third set of offsets to obtain the second set of offsets, where 1 ≤ b1 ≤ b2, b1 and b2 are both positive integers, and b2 is equal to the maximum row length of the first set of offsets or to the minimum row length of the first set of offsets.
  • Table 9 shows the second set of offsets obtained by byte-level differential processing of the data of adjacent columns in the b1-th row of the third set of offsets shown in Table 6.
  • Table 10 shows the second set of offsets obtained by byte-level differential processing of the data of adjacent columns in the b1-th row of the third set of offsets shown in Table 7.
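  • One possible reading of the byte-level differential processing is sketched below. The exact convention behind Tables 8 to 10 is not recoverable from this text, so the sketch assumes that the first column of each row is kept unchanged and every later column is replaced by its difference from the previous column, modulo 256. The names `diff_row` and `undiff_row` are hypothetical.

```python
def diff_row(row: bytes) -> bytes:
    """Keep the first byte; replace each later byte by (byte - previous byte) mod 256."""
    if not row:
        return b""
    return bytes([row[0]] + [(row[j] - row[j - 1]) % 256 for j in range(1, len(row))])

def undiff_row(row: bytes) -> bytes:
    """Inverse of diff_row: a running sum modulo 256."""
    out = bytearray()
    for j, delta in enumerate(row):
        out.append(delta if j == 0 else (out[j - 1] + delta) % 256)
    return bytes(out)

# Third set of offsets as described for Table 6 (rows after row-column conversion).
third_offsets = [b"\x00\x00\x00", b"\x01\x04\x08"]
second_offsets = [diff_row(r) for r in third_offsets]
print([r.hex() for r in second_offsets])  # ['000000', '010304']
```

  • Under this assumption the processing is lossless, since `undiff_row(diff_row(r)) == r` for any row, and rows of equal bytes such as aaa collapse to one value followed by zeros.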
  • the second data page includes information indicating that the second data page has been preprocessed.
  • the information may indicate that the second data page has undergone byte-level row-column conversion processing.
  • For another example, this information may indicate not only that the second data page has undergone byte-level row-column conversion and byte-level differential processing, but also the order in which the byte-level row-column conversion and the byte-level differential processing were performed.
  • the embodiment of the present application does not limit the storage location of the information indicating that the second data page has been preprocessed in the second data page.
  • the information indicating that the second data page has been preprocessed may be stored at the head or the tail of the second data page.
  • Case 3: preprocessing includes only byte-level differential processing.
  • In case 3, S210 specifically includes S211 to S214.
  • the specific processes of S211 and S214 in case 3 are the same as the specific processes of S211 and S214 in case 2, and will not be described again here.
  • the specific processes of S212 and S213 in case 3 are different from the specific processes of S212 and S213 in case 2.
  • the specific processes of S212 and S213 in case 3 are introduced in detail below.
  • S212 specifically includes: performing byte-level differential processing on the data of adjacent columns in the e1-th row of the first data to obtain the second data, where 1 ≤ e1 ≤ a2 and e1 is a positive integer.
  • For a2, refer to the relevant description above.
  • S213 specifically includes: performing byte-level differential processing on the data of adjacent columns in the f1-th row of the first set of offsets to obtain the second set of offsets, where 1 ≤ f1 ≤ b2 and f1 is a positive integer. For b2, refer to the relevant description above.
  • the first data page described in S210 may be a first data page obtained by reorganizing multiple third data pages that are continuous and have the same structure. That is to say, before S210, the method 200 also includes:
  • S230 Reorganize multiple third data pages that are continuous and have the same structure to obtain the first data page.
  • Each third data page includes fourth data based on a row storage method and a fourth set of offsets, where the fourth set of offsets is used to indicate the offset of each row of the fourth data.
  • The first data includes the multiple fourth data corresponding to the multiple third data pages, the maximum row lengths of the multiple fourth data are the same, and the first set of offsets includes the multiple fourth sets of offsets corresponding to the multiple third data pages.
  • S230 includes S231 to S233.
  • S231: Acquire the multiple fourth data and the multiple fourth sets of offsets corresponding to the multiple third data pages, respectively.
  • S232: Arrange the multiple fourth data and the multiple fourth sets of offsets in a target order, respectively, to obtain the first data and the first set of offsets, where the target order is the sort order of the multiple third data pages.
  • S233: Store the first data and the first set of offsets in the first data page, respectively.
  • That is, the fourth data stored in the row data part of each of the acquired third data pages is stored together in sequence to obtain the first data, and the fourth set of offsets stored in the directory part of each of the acquired third data pages is stored together in sequence to obtain the first set of offsets.
  • In other words, the first data page is a data page that aggregates data from multiple third data pages.
  • the first data page can be considered to be a giant data page.
  • The embodiments of this application do not limit the execution order of obtaining the first data and obtaining the first set of offsets in S232.
  • For example, the first data may be obtained first and then the first set of offsets; the first set of offsets may be obtained first and then the first data; or the first data and the first set of offsets may be obtained at the same time.
  • When the data page includes a header and/or a tail in addition to the row data part and the directory part, the following steps also need to be performed. First, the data stored in the multiple headers and/or tails corresponding to the multiple third data pages is acquired. Second, that data is arranged in the target order to obtain the data to be stored in the first header and/or the first tail. Finally, that data is stored in the header and/or the tail of the first data page.
  • That is, the data stored in the header and/or tail of each of the acquired third data pages can be stored together as the data of the first header and/or the first tail, thereby forming a giant data page, that is, the first data page.
  • Byte-level row-column conversion can be performed on each of the multiple third data pages.
  • Condition 1: whether the number of valid offsets in the fourth set of offsets stored in each third data page is less than or equal to the fifth threshold.
  • Condition 2: whether the difference between the lengths of the rows of the fourth data stored in each third data page is less than or equal to the sixth threshold.
  • the embodiment of the present application does not limit the specific value of the fifth threshold, which can be set according to the actual situation.
  • the embodiment of the present application does not limit the relationship between the fifth threshold and the fourth threshold and the second threshold respectively.
  • the fifth threshold, the fourth threshold, and the second threshold may all be equal.
  • the embodiment of the present application does not limit the specific value of the sixth threshold, which can be set according to actual conditions.
  • the embodiment of the present application does not limit the relationship between the sixth threshold and the third threshold and the first threshold respectively.
  • the sixth threshold, the third threshold and the first threshold may all be equal.
  • FIG. 7 is a schematic diagram of an example of reorganization of multiple data pages provided by an embodiment of the present application.
  • The seven data pages are data page 10 to data page 70.
  • the data page 10 can be converted into rows and columns, and the maximum row length of the data stored in the row data part of the data page 10 is 40.
  • Both data page 20 and data page 30 can be converted between rows and columns, and the maximum row length of data stored in the row data portion of data page 20 and the maximum row length of data stored in the row data portion of data page 30 are both 50.
  • the data page 40 cannot be converted into rows and columns, and the maximum row length of data stored in the row data portion of the data page 40 is 50.
  • the data page 50 can be converted into rows and columns, and the maximum row length of data stored in the row data portion of the data page 50 is 50.
  • Both data page 60 and data page 70 can be converted between rows and columns, and the maximum row length of data stored in the row data portion of data page 60 and the maximum row length of data stored in the row data portion of data page 70 are both 60.
  • data page 20 and data page 30 can be reorganized into one data page
  • data page 60 and data page 70 can be reorganized into one data page
  • data page 10, data page 40, and data page 50 cannot be reorganized.
  • the five data pages include data page 10, data page 20-30, data page 40, data page 50, and data page 60-70.
  • data page 20-30 is the data page obtained after the reorganization of data page 20 and data page 30
  • data page 60-70 is the data page obtained after the reorganization of data page 60 and data page 70.
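  • The grouping in the FIG. 7 example can be reproduced with a short sketch. The rule used here (merge a page into the preceding group only when both pages support row-column conversion and have the same maximum row length) is inferred from the example itself and is an assumption; `PageInfo` and `group_for_reorganization` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageInfo:
    page_id: int
    convertible: bool      # whether row-column conversion is possible
    max_row_length: int


def group_for_reorganization(pages: List[PageInfo]) -> List[List[int]]:
    groups: List[List[int]] = []
    prev: Optional[PageInfo] = None
    for page in pages:
        # Assumed rule: only adjacent, convertible pages with equal
        # maximum row length are merged into one reorganized page.
        can_join = (
            prev is not None
            and prev.convertible
            and page.convertible
            and prev.max_row_length == page.max_row_length
        )
        if can_join:
            groups[-1].append(page.page_id)
        else:
            groups.append([page.page_id])
        prev = page
    return groups


pages = [
    PageInfo(10, True, 40), PageInfo(20, True, 50), PageInfo(30, True, 50),
    PageInfo(40, False, 50), PageInfo(50, True, 50),
    PageInfo(60, True, 60), PageInfo(70, True, 60),
]
print(group_for_reorganization(pages))
# [[10], [20, 30], [40], [50], [60, 70]]
```

  • Under this assumed rule, data page 50 stays alone because its neighbor, data page 40, does not support row-column conversion.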
  • the first data page includes information indicating that the first data page has been reorganized.
  • the embodiment of the present application does not limit the storage location of the information indicating that the first data page has been reorganized in the first data page.
  • the information indicating that the first data page has been reorganized may be stored at the head or the tail of the first data page.
  • S220 Compress the second data page to obtain a compressed data page.
  • the embodiment of the present application does not limit the compression algorithm used to compress the second data page.
  • a general compression algorithm (such as zlib, lz4, zstd, etc.) can be used to compress the second data page to obtain a compressed data page.
  • the user can first set the compression parameters, and the data page is then compressed using the above data page compression method 200 with the parameters set by the user.
  • the compression parameters may include at least one of the following: the number of data pages compressed at one time, the preprocessing method, and the type of compression algorithm involved in S220. Among them, the minimum number of data pages compressed at one time is 1. Preprocessing methods include row-column conversion and/or differential processing. The types of compression algorithms involved in S220 may include zlib, lz4, zstd, etc.
  • compression parameters can be set at the table space level, the file level, or the table level, or at a level defined by the user.
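  • One possible way to hold these compression parameters is sketched below. The class name, field names and default values are hypothetical; only the three kinds of parameter and the minimum of one page per compression come from the description above.

```python
from dataclasses import dataclass

# Algorithm names listed for S220 in the description.
SUPPORTED_ALGORITHMS = ("zlib", "lz4", "zstd")


@dataclass(frozen=True)
class CompressionParams:
    pages_per_compression: int = 1        # data pages compressed at one time
    row_column_conversion: bool = True    # preprocessing: row-column conversion
    differential: bool = False            # preprocessing: differential processing
    algorithm: str = "zstd"               # compression algorithm used in S220

    def __post_init__(self) -> None:
        if self.pages_per_compression < 1:
            raise ValueError("at least one data page must be compressed at a time")
        if self.algorithm not in SUPPORTED_ALGORITHMS:
            raise ValueError(f"unsupported algorithm: {self.algorithm}")


params = CompressionParams(pages_per_compression=4, differential=True, algorithm="lz4")
print(params)
```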
  • data stored in column form generally exhibits similarity, repetition and a certain regularity. Therefore, the compression ratio of column-stored data is higher than that of row-stored data.
  • the preprocessing described in method 200 includes byte-level row-column conversion
  • row-stored data is converted into column-stored form in an ordered and reversible manner, so that the data is updated in place in the data page and the data page is then compressed. This makes full use of the characteristics of the data structure and thereby improves the compression rate of the data page.
  • the compression time of the data compression method 200 provided by the embodiment of the present application is basically the same as that of existing compression methods.
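  • The effect of byte-level row-column conversion can be illustrated with a small sketch. It uses the three rows abc, abcd and abcde from the FIG. 13 example; the column-ordered result aaabbbcccdde is inferred from the read sequence in that walk-through (three a, three b, three c, dd, e). The function name is hypothetical, and zlib from the Python standard library stands in for the general compression algorithm of S220.

```python
import zlib
from typing import List


def rows_to_columns(rows: List[bytes]) -> bytes:
    """Emit byte q of every row that is long enough, for q = 0, 1, 2, ..."""
    out = bytearray()
    longest = max((len(row) for row in rows), default=0)
    for q in range(longest):
        for row in rows:
            if q < len(row):
                out.append(row[q])
    return bytes(out)


rows = [b"abc", b"abcd", b"abcde"]
column_bytes = rows_to_columns(rows)
print(column_bytes)  # b'aaabbbcccdde'

# Compress the column-ordered bytes and check that decompression is lossless.
compressed = zlib.compress(column_bytes, 9)
restored = zlib.decompress(compressed)
```

  • Bytes at the same position of different rows end up next to each other, which is what creates the longer runs of similar bytes that the compressor can exploit.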
  • after byte-level differential processing is applied to the column data, more duplicate data is created. This makes fuller use of the characteristics of the data structure to further increase the repetition and regularity of the data, thereby improving the compression rate of the data page.
  • the compression time of the data compression method 200 provided by the embodiment of the present application is basically the same as that of existing compression methods.
  • the embodiment of the present application conducted a TPCC test on multiple data based on the row storage method. See Table 11 to Table 16 for details.
  • the processing methods in Table 11 to Table 12 are: 1: compression using an existing general compression algorithm; 2: byte-level row-column conversion within the data page + compression using an existing general compression algorithm; 3: byte-level row-column conversion within the data page + differential processing + compression using an existing general compression algorithm; 4: data page reorganization + byte-level row-column conversion within the data page + compression using an existing general compression algorithm; 5: data page reorganization + byte-level row-column conversion within the data page + differential processing + compression using an existing general compression algorithm.
  • Tables 11 and 12 take the index data in the database GaussDB V3 as an example. Table 11 corresponds to a compression level of 9, and Table 12 corresponds to a compression level of 1.
  • Table 13 takes each index data in the database PG as an example, and Table 13 is an example of compressing a data page.
  • Tables 14 and 15 take the table data in the database GaussDB V3 as an example. Table 14 corresponds to a compression level of 9, and Table 15 corresponds to a compression level of 1.
  • Table 16 takes the table data in database PG as an example, and Table 16 is an example of compressing one data page.
  • Tables 11 to 15 all take the zstd general algorithm used in S220 of method 200 as an example.
  • Table 16 also takes the lz4 universal algorithm used in S220 of Method 200 as an example.
  • the more data pages compressed at a time the better the compression performance. For example, the more data pages that are compressed at a time, the higher the compression rate and the shorter the compression time.
  • Figures 8 to 11 are only for comparing the compression performance of processing methods 1 to 5, and the specific values are still based on Tables 11 to 16.
  • FIGS 8 to 11 are schematic diagrams of four examples of compression performance provided by embodiments of the present application.
  • the similarities between Figures 8 to 11 are: 1. They all use the zstd general algorithm in S220 of method 200 as an example. 2. They all take the compression of 1G (gigabyte) of data at compression level 9 as an example.
  • Figure 8 takes the index data idx_bmsql_oorder_pkey in the database GaussDB V3 as an example
  • Figure 9 takes the index data idx_bmsql_order_line_pkey in the database GaussDB V3 as an example
  • Figure 10 takes the table data tbl_bmsql_oorder in the database GaussDB V3 as an example
  • Figure 11 takes the table data tbl_bmsql_stock in the database GaussDB V3 as an example.
  • some data page compression methods 200 are used to compress the data.
  • the size of the data page obtained is larger than the size of the data page obtained after compressing the data using existing compression methods.
  • FIG. 12 is a schematic flow chart of an example of a data page decompression method 300 provided by an embodiment of the present application.
  • the method 300 includes S310 and S320, and S320 is executed after S310.
  • S310 and S320 are described in detail below.
  • the embodiment of the present application does not limit the decompression method used to decompress the compressed data page.
  • a general decompression algorithm (such as zlib, lz4, zstd, etc.) can be used to decompress the compressed data page to obtain the second data page.
  • S320 Obtain the first data page based on the second data page.
  • the second data page may include a second row data part and a second directory part, wherein the second row data part is used to store the second data, and the second directory part is used to store the second set of offsets.
  • the first data in the first data page is data obtained by preprocessing the second data in the second data page.
  • the first set of offsets in the first data page is the set of offsets obtained by preprocessing the second set of offsets in the second data page.
  • preprocessing includes only byte-level based row-column conversion.
  • preprocessing includes not only byte-level row-column conversion but also byte-level accumulation processing. The accumulation processing includes accumulation between column data.
  • byte-level accumulation processing can be understood as accumulation in bytes.
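  • A minimal sketch of byte-level accumulation, and of the differential processing it reverses, is given below. Keeping the first byte unchanged and wrapping the arithmetic modulo 256 are assumptions made for illustration, and the function names are hypothetical.

```python
def byte_diff(values: bytes) -> bytes:
    """Differential processing: keep the first byte, then store differences."""
    out = bytearray(values[:1])
    for prev, cur in zip(values, values[1:]):
        out.append((cur - prev) % 256)   # assumed wrap-around within one byte
    return bytes(out)


def byte_accumulate(deltas: bytes) -> bytes:
    """Accumulation processing: running sum, the inverse of byte_diff."""
    out = bytearray()
    total = 0
    for delta in deltas:
        total = (total + delta) % 256
        out.append(total)
    return bytes(out)


original = bytes([0x01, 0x04, 0x08])
deltas = byte_diff(original)
print(list(deltas))                   # [1, 3, 4]
print(list(byte_accumulate(deltas)))  # [1, 4, 8]
```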
  • preprocessing includes only accumulation processing.
  • preprocessing only includes byte-level row-column conversion
  • S320 specifically includes S321 to S324.
  • the starting point and the end point of the second data, and the starting point and the end point of the offsets corresponding to the second data, can each be obtained from the second header. Then, the second data is obtained from the second row data part of the second data page according to the starting point and end point of the second data, and the second set of offsets is obtained from the second directory part according to the starting point and end point of the offsets corresponding to the second data.
  • S322 Preprocess the second set of offsets according to bytes to obtain the first set of offsets.
  • the second set of offsets is preprocessed in bytes to obtain the first set of offsets.
  • the embodiment of the present application does not limit the size of the unit offset length of the second set of offsets.
  • the following description takes the example that the unit offset length of the second set of offsets is 2 bytes.
  • performing byte-level row-column conversion on the second set of offsets shown in Table 7 yields the first set of offsets shown in Table 4.
  • for Table 4 and Table 7, please refer to the relevant descriptions above; they are not repeated here.
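  • For a unit offset length of 2 bytes, the inverse byte-level row-column conversion of a set of offsets can be sketched as follows. The sample bytes are illustrative and are not copied from Table 4 or Table 7; the function name is hypothetical, and no particular byte order inside an offset is assumed beyond keeping the bytes in their stored order.

```python
from typing import List


def columns_to_offsets(column_bytes: bytes, unit: int = 2) -> List[bytes]:
    """Regroup column-ordered bytes into offsets of `unit` bytes each."""
    count, remainder = divmod(len(column_bytes), unit)
    if remainder:
        raise ValueError("length must be a multiple of the unit offset length")
    # Byte k of offset i sits at position k * count + i in the column layout.
    return [
        bytes(column_bytes[k * count + i] for k in range(unit))
        for i in range(count)
    ]


# Column layout: all first bytes, then all second bytes.
restored = columns_to_offsets(b"\x00\x00\x00\x01\x04\x08")
print([offset.hex() for offset in restored])  # ['0001', '0004', '0008']
```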
  • S323 Preprocess the second data in bytes according to the first set of offsets to obtain the first data.
  • this S323 includes:
  • S3231 Remove invalid offsets from the first set of offsets to obtain a fifth set of offsets.
  • the fifth set of offsets includes P offsets.
  • S3231 may be implemented based on information indicating an invalid offset.
  • the embodiment of the present application does not limit the storage location of the information indicating the invalid offset.
  • information indicating an invalid offset may be stored in the directory portion, header, or trailer of the second data page.
  • S3231 can also be implemented based on each offset in the first set of offsets and the starting point and end point of the second data.
  • the starting point and the ending point of the second data can be obtained according to the head or the tail of the second data page.
  • (N-P) offsets are not between the starting point and the end point of the second data, that is, P offsets among N offsets are valid, and (N-P) offsets are invalid.
  • (N-P) offsets that are not between the starting point and the end point of the second data need to be removed from the first set of offsets to obtain effective P offsets.
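  • The range-based variant of S3231 can be sketched as a simple filter. Treating both the starting point and the end point as inclusive is an assumption, and the sample values (including the end point 0x0C) are illustrative.

```python
from typing import List


def remove_invalid_offsets(offsets: List[int], data_start: int, data_end: int) -> List[int]:
    """Keep only offsets that fall between the start and end of the data.

    The original order of the surviving offsets is preserved.
    """
    return [offset for offset in offsets if data_start <= offset <= data_end]


first_set = [0x06, 0x00, 0x09, 0x01, 0xFF]          # N = 5 offsets
fifth_set = remove_invalid_offsets(first_set, 0x01, 0x0C)
print([hex(offset) for offset in fifth_set])          # ['0x6', '0x9', '0x1']
```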
  • S3232 Arrange the P offsets in ascending order to obtain the sorted P offsets.
  • the P offsets are 0x01, 0x04, and 0x08 respectively.
  • arranging the P offsets in ascending order yields the sorted P offsets 0x01, 0x04, and 0x08.
  • the P offsets are 0x06, 0x09, and 0x01 respectively.
  • arranging the P offsets in ascending order yields the sorted P offsets 0x01, 0x06, and 0x09.
  • S3233 Create P regions based on the sorted P offsets, and obtain the length of each row of the first data.
  • the P areas correspond one-to-one to the P offsets.
  • the offset corresponding to the area ranked k-th among the P areas is the k-th offset, where the k-th offset is the offset at the k-th position when the P offsets are arranged in ascending order.
  • the offset corresponding to the area ranked first among the three areas is the offset ranked first among the three offsets shown in Table 3, which is 0x01;
  • the offset corresponding to the area ranked second among the three areas is the offset ranked second among the three offsets shown in Table 3, that is, 0x04;
  • the offset corresponding to the area ranked third among the three areas is the offset at the third position among the three offsets shown in Table 3, that is, 0x08.
  • the offset corresponding to the area ranked first among the three areas is the offset ranked third among the three offsets shown in Table 4, that is, 0x01;
  • the offset corresponding to the area ranked second among the three areas is the offset ranked first among the three offsets shown in Table 4, that is, 0x06;
  • the offset corresponding to the area ranked third among the three areas is the offset at the second position among the three offsets shown in Table 4, that is, 0x09.
  • the difference between the (d+1)th offset and the dth offset is taken as the length of the dth row of the first data.
  • d takes the values 1 to (P-1) in sequence. The difference between the end point of the first data and the P-th offset, plus 1, is taken as the length of the P-th row of the first data.
  • the three offsets are sorted as 0x01, 0x04, and 0x08, and the difference between the second offset (0x04) and the first offset (0x01) is used as the length of the first row of the first data.
  • the three offsets are sorted as 0x01, 0x06, and 0x09, and the difference between the second offset (0x06) and the first offset (0x01) is used as the length of the first row of the first data.
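  • The row-length rule of S3233 can be written out as a short sketch. The end point 0x0C used below is an assumption, chosen so that 12 bytes of data start at offset 0x01; the function name is hypothetical.

```python
from typing import List


def row_lengths(sorted_offsets: List[int], data_end: int) -> List[int]:
    """Length of row d is offset[d+1] - offset[d]; the last row runs to data_end."""
    lengths = [nxt - cur for cur, nxt in zip(sorted_offsets, sorted_offsets[1:])]
    if sorted_offsets:
        lengths.append(data_end - sorted_offsets[-1] + 1)
    return lengths


print(row_lengths(sorted([0x01, 0x04, 0x08]), 0x0C))  # [3, 4, 5]
print(row_lengths(sorted([0x06, 0x09, 0x01]), 0x0C))  # [5, 3, 4]
```

  • The two results agree with the row lengths used in the FIG. 13 example (0x03, 0x04, 5 bytes) and in the FIG. 14 example (0x05, 0x03, 0x04).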
  • the embodiment of the present application does not limit the locations of the P areas, for example whether the P areas are located on the second data page or on the first data page.
  • before performing S3233, it may be determined whether P is less than or equal to the fourth threshold, and S3233, S3234 and S324 are executed only when P is less than or equal to the fourth threshold. In this way, byte-level row-column conversion is performed on the second data only when the first data does not have many rows, thereby avoiding waste of resources.
  • the embodiment of the present application does not limit the specific value of the fourth threshold, which can be set according to the actual situation.
  • the embodiment of the present application does not limit the relationship between the fourth threshold and the second threshold.
  • the fourth threshold may be equal to the second threshold.
  • S3234 Read the data corresponding to R bytes from the second row data part in sequence, and store the data corresponding to the p-th byte among the R bytes as the data corresponding to the q-th byte of the s-th area among the P areas, thereby completing the q-th round of data reading and writing.
  • p is a positive integer, and p ranges from 1 to R.
  • R is the number of areas that are not filled with data in the P areas.
  • the number of data in the sth area is the length of the sth row of data in the first data.
  • s is a positive integer.
  • the offset corresponding to the s-th area is the s-th offset.
  • the s-th offset is the p-th offset among the offsets in the fifth set of offsets other than the offsets corresponding to areas already filled with data.
  • q takes values from 1 to L2, where L2 is the maximum row length of the first data.
  • FIG. 13 is a schematic diagram of a second data reading and writing process provided by an embodiment of the present application.
  • the second data described in Figure 13 is shown in Table 5, and the fifth set of offsets corresponding to the second data is shown in Table 3.
  • the area ranked first among the three areas is area 401, and the offset corresponding to area 401 is the offset at the first position among the three offsets shown in Table 3, that is, 0x01; the area ranked second among the three areas is area 402, and the offset corresponding to area 402 is the offset at the second position among the three offsets shown in Table 3, that is, 0x04; the area ranked third among the three areas is area 403, and the offset corresponding to area 403 is the offset at the third position among the three offsets shown in Table 3, that is, 0x08. It can be seen that in the example of Figure 13, the arrangement order of the offsets corresponding to the three areas is the arrangement order of the three offsets shown in Table 3.
  • the data (a) is stored as the data corresponding to the first byte of area 402 (the area ranked second among the three areas), and the data (a) corresponding to the third of the three bytes is stored as the data corresponding to the first byte of area 403 (the area ranked third among the three areas). In this way, the first round of data reading and writing is completed.
  • the data (b) is stored as the data corresponding to the second byte of area 402 (the area corresponding to the offset ranked second among the three offsets shown in Table 3), and the data (b) corresponding to the third of the three bytes is stored as the data corresponding to the second byte of area 403 (the area corresponding to the offset ranked third among the three offsets shown in Table 3). In this way, the second round of data reading and writing is completed.
  • the data (c) is stored as the data corresponding to the third byte of area 402 (the area corresponding to the offset ranked second among the three offsets shown in Table 3), and the data (c) corresponding to the third of the three bytes is stored as the data corresponding to the third byte of area 403 (the area corresponding to the offset ranked third among the three offsets shown in Table 3). In this way, the third round of data reading and writing is completed.
  • the amount of data (3 bytes) in area 401, which is ranked first among the three areas (the area corresponding to the offset ranked first among the three offsets shown in Table 3), has reached the length of the first row of the first data (0x03). At this time, area 401 can be considered to be full of data.
  • the two areas that are not filled with data are: the area corresponding to the offset at the first position among the offsets that remain after excluding the first-ranked of the three offsets shown in Table 3 (area 402), and the area corresponding to the offset at the second position among those remaining offsets (area 403).
  • the data (dd) corresponding to 2 bytes is read in sequence from the second data shown in Table 5; the data (d) corresponding to the first of the 2 bytes is stored as the data corresponding to the 4th byte of area 402, and the data (d) corresponding to the second of the 2 bytes is stored as the data corresponding to the 4th byte of area 403. In this way, the fourth round of data reading and writing is completed.
  • the amount of data (4 bytes) in area 402, which is ranked second among the three areas (the area corresponding to the offset ranked second among the three offsets shown in Table 3), has reached the length of the second row of the first data (0x04).
  • the area 402 can be considered to be filled with data.
  • the area that is not filled with data is the area corresponding to the offset at the first position among the offsets that remain after excluding the first-ranked and second-ranked of the three offsets shown in Table 3 (area 403).
  • the data (e) corresponding to 1 byte is read in sequence from the second data shown in Table 5 and stored as the data corresponding to the 5th byte of area 403 (the area not yet filled with data). In this way, the fifth round of data reading and writing is completed.
  • the second data can be read from the second data page, and the read second data can be written into three areas (area 401, area 402 and area 403).
  • the data written in the area ranked first among the three areas is abc, which occupies 3 bytes; the data written in the area ranked second among the three areas is abcd, which occupies 4 bytes; and the data written in the area ranked third among the three areas is abcde, which occupies 5 bytes.
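  • The round-by-round reading and writing of the FIG. 13 example can be sketched as follows. The second data is taken to be aaabbbcccdde, as implied by the five rounds described above, and the row lengths 3, 4 and 5 are listed in the order of the offsets in the fifth set of offsets. The function name is hypothetical.

```python
from typing import List


def columns_to_rows(column_bytes: bytes, lengths: List[int]) -> List[bytes]:
    """Distribute column-ordered bytes over one area per row.

    In round q, one byte is handed to every area that is not yet full,
    visiting the areas in the order of their offsets in the directory.
    """
    if sum(lengths) != len(column_bytes):
        raise ValueError("row lengths do not add up to the data length")
    areas = [bytearray() for _ in lengths]
    pos = 0
    for q in range(max(lengths, default=0)):
        for area, length in zip(areas, lengths):
            if q < length:               # this area still needs byte q
                area.append(column_bytes[pos])
                pos += 1
    return [bytes(area) for area in areas]


areas = columns_to_rows(b"aaabbbcccdde", [3, 4, 5])
print(areas)  # [b'abc', b'abcd', b'abcde']
```

  • The number of areas served per round shrinks from 3 to 2 to 1, which corresponds to R in S3234.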
  • FIG. 14 is a schematic diagram of another example of a second data reading and writing process provided by an embodiment of the present application.
  • the second data described in Figure 14 is shown in Table 5.
  • the fifth set of offsets corresponding to the second data is shown in Table 4.
  • in the fifth set of offsets, the offset at the first position is 0x06, the offset at the second position is 0x09, and the offset at the third position is 0x01.
  • P = 3 as described in S3234
  • the area ranked first among the three areas is area 501
  • the offset corresponding to area 501 is the offset ranked third among the three offsets shown in Table 4, that is, 0x01.
  • the length of the first row of the first data is 0x05 (bytes), the length of the second row of the first data is 0x03 (bytes), and the length of the third row of the first data is 0x04 (bytes).
  • the data (a) is stored as the data corresponding to the first byte of area 503 (the area corresponding to the offset ranked second among the three offsets shown in Table 4), and the data (a) corresponding to the third of the three bytes is stored as the data corresponding to the first byte of area 501 (the area corresponding to the offset ranked third among the three offsets shown in Table 4). In this way, the first round of data reading and writing is completed.
  • the data (b) is stored as the data corresponding to the second byte of area 503 (the area corresponding to the offset ranked second among the three offsets shown in Table 4), and the data (b) corresponding to the third of the three bytes is stored as the data corresponding to the second byte of area 501 (the area corresponding to the offset ranked third among the three offsets shown in Table 4). In this way, the second round of data reading and writing is completed.
  • the data (c) is stored as the data corresponding to the third byte of area 503 (the area corresponding to the offset ranked second among the three offsets shown in Table 4), and the data (c) corresponding to the third of the three bytes is stored as the data corresponding to the third byte of area 501 (the area corresponding to the offset ranked third among the three offsets shown in Table 4). In this way, the third round of data reading and writing is completed.
  • the amount of data (3 bytes) in area 502, which is ranked second among the three areas (the area corresponding to the offset ranked first among the three offsets shown in Table 4), has reached the length of the second row of the first data (0x03).
  • the area 502 can be considered to be filled with data.
  • the data (dd) corresponding to 2 bytes is read in sequence from the second data shown in Table 5; the data (d) corresponding to the first of the 2 bytes is stored as the data corresponding to the 4th byte of area 503, and the data (d) corresponding to the second of the 2 bytes is stored as the data corresponding to the 4th byte of area 501. In this way, the fourth round of data reading and writing is completed.
  • the amount of data (4 bytes) in area 503, which is ranked third among the three areas (the area corresponding to the offset ranked second among the three offsets shown in Table 4), has reached the length of the third row of the first data (0x04).
  • area 503 can be considered to be filled with data.
  • the second data can be read from the second data page, and the read second data can be written into three areas (area 501, area 502 and area 503).
  • the data written in the area ranked first among the three areas is abcde, which occupies 5 bytes; the data written in the area ranked second among the three areas is abc, which occupies 3 bytes; and the data written in the area ranked third among the three areas is abcd, which occupies 4 bytes.
  • before performing S3234, it may be determined whether the difference between the row lengths of the first data is less than or equal to the third threshold, that is, S3235. S3234 and S324 are executed only when the difference between the row lengths of the first data is less than or equal to the third threshold. In this way, byte-level row-column conversion is performed only when the row lengths of the first data do not differ greatly, thereby avoiding waste of resources.
  • the embodiment of the present application does not limit the specific value of the third threshold, which can be set according to the actual situation.
  • the embodiment of the present application does not limit the relationship between the third threshold and the first threshold.
  • the third threshold may be equal to the first threshold.
  • S324 Obtain the first data page based on the first data and the first set of offsets.
  • a data page may be newly created, and the first data and the first set of offsets may be stored in the newly created data page to form the first data page.
  • the newly created data page is the first data page.
  • the first data page can be obtained based on the original second data page.
  • the second data stored in the second row data part is updated to the first data
  • the second set of offsets stored in the second directory part is updated to the first set of offsets to obtain the first data page.
  • updating the second data stored in the second row data part to the first data specifically includes sequentially overwriting the second data stored in the second row data part with the data in the P areas obtained in S3234.
  • in the example of Figure 13, the data in the three areas is abcabcdabcde in sequence, that is, the first data is abcabcdabcde.
  • in the example of Figure 14, the data in the three areas is abcdeabcabcd in sequence, that is, the first data is abcdeabcabcd.
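  • The in-place update of the row data part can be sketched for the abcdeabcabcd case. The 16-byte page, the placement of the row data at byte 1, and the interpretation of each offset as a position counted from the start of the page are assumptions made for illustration; the function name is hypothetical.

```python
from typing import Sequence


def write_rows_in_place(page: bytearray, offsets: Sequence[int], rows: Sequence[bytes]) -> None:
    """Overwrite the row data part with each staged row at its own offset."""
    for offset, row in zip(offsets, rows):
        end = offset + len(row)
        if end > len(page):
            raise ValueError("row does not fit in the page")
        page[offset:end] = row


page = bytearray(16)
page[1:13] = b"aaabbbcccdde"                 # column-ordered data before S324

# Offsets in directory order with the rows staged for them.
write_rows_in_place(page, [0x06, 0x09, 0x01], [b"abc", b"abcd", b"abcde"])
print(bytes(page[1:13]))                      # b'abcdeabcabcd'
```

  • Because the rows are staged in separate areas first, overwriting the row data part does not destroy bytes that still need to be read.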
  • preprocessing includes byte-level row-column conversion and byte-level accumulation processing
  • S320 specifically includes S321 to S324.
  • the specific processes of S321 and S324 in case 3 are the same as the specific processes of S321 and S324 in case 1, and will not be described again here.
  • the specific processes of S322 and S323 in case 3 are different from the specific processes of S322 and S323 in case 1.
  • the specific processes of S322 and S323 in case 3 are introduced in detail below.
  • S322 specifically includes S322A and S322B.
  • S322A Accumulate the data of adjacent columns on the c1th row of the second set of offsets in bytes to obtain the third set of offsets.
  • 1 ≤ c1 ≤ c2, c1 and c2 are both positive integers, and c2 is equal to the maximum row length of the second set of offsets or c2 is equal to the minimum row length of the second set of offsets.
  • Table 6 is the third set of offsets obtained by accumulating, in bytes, the data of adjacent columns on the c1th row of the second set of offsets shown in Table 9.
  • Table 6 is to convert the data of adjacent columns on the c1th row of the second set of offsets shown in Table 7 by bytes.
  • S322B Convert the third set of offsets into rows and columns according to bytes to obtain the first set of offsets.
  • the third group of offsets are preprocessed in bytes to obtain the first group of offsets.
  • the embodiment of the present application does not limit the size of the unit offset length of the third set of offsets.
  • the following description takes the example that the unit offset length of the third group of offsets is 2 bytes.
  • performing byte-level row-column conversion on the third set of offsets shown in Table 6 yields the first set of offsets shown in Table 3.
  • for Table 3 and Table 6, please refer to the relevant descriptions above; they are not repeated here.
  • performing byte-level row-column conversion on the third set of offsets shown in Table 7 yields the first set of offsets shown in Table 4.
  • for Table 4 and Table 7, please refer to the relevant descriptions above; they are not repeated here.
  • S323 specifically includes S323A and S323B.
  • S323A Accumulate the data of adjacent columns on the d1th row of the second data by bytes to obtain the third data.
  • 1 ≤ d1 ≤ d2, d1 and d2 are both positive integers, and d2 is equal to the maximum row length of the second data or d2 is equal to the minimum row length of the second data.
  • Table 5 is the third data obtained by accumulating, in bytes, the data of adjacent columns on the d1th row of the second data shown in Table 8.
  • here, d2 being equal to the minimum row length of the second data (3 bytes) is taken as an example.
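  • S323A with d2 equal to the minimum row length can be sketched as a running sum over the leading rows of the column-ordered data. The sample input is illustrative and is not copied from Table 8; the wrap-around modulo 256, the choice to leave the first column of each row unchanged, and the function name are assumptions.

```python
def accumulate_leading_rows(column_bytes: bytes, width: int, rows: int) -> bytes:
    """Accumulate adjacent columns on the first `rows` rows of `width` bytes.

    `width` is the number of columns (P); only rows 1..d2 are processed,
    and the remaining bytes are copied through unchanged.
    """
    if rows * width > len(column_bytes):
        raise ValueError("not enough data for the requested rows")
    out = bytearray(column_bytes)
    for r in range(rows):
        start = r * width
        for j in range(start + 1, start + width):
            out[j] = (out[j - 1] + out[j]) % 256   # assumed byte wrap-around
    return bytes(out)


second_data = b"a\x00\x00b\x00\x00c\x00\x00dde"      # differenced leading rows
third_data = accumulate_leading_rows(second_data, width=3, rows=3)
print(third_data)                                     # b'aaabbbcccdde'
```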
  • S323B According to the first set of offsets, convert the third data into rows and columns according to bytes to obtain the first data.
  • S323B includes S3231 to S3233 and S3234A.
  • for S3231 to S3233, please refer to the relevant descriptions above; they are not repeated here.
  • the following focuses on S3234A.
  • S3234A Read the data corresponding to R bytes from the third data in sequence, and store the data corresponding to the p-th byte among the R bytes as the data corresponding to the q-th byte of the s-th area among the P areas, thereby completing the q-th round of data reading and writing.
  • p is a positive integer, and p ranges from 1 to R.
  • R is the number of areas that are not filled with data in the P areas.
  • the number of data in the sth area is the length of the sth row of data in the first data.
  • s is a positive integer.
  • the offset corresponding to the s-th area is the s-th offset.
  • the s-th offset is the p-th offset among the offsets in the fifth set of offsets other than the offsets corresponding to areas already filled with data.
  • q takes values from 1 to L2, where L2 is the maximum row length of the first data.
  • from the above description, it can be seen that the process of S3234A is similar to the process of S3234 described above. The only difference is that S3234A reads the data corresponding to R bytes from the third data, whereas S3234 reads the data corresponding to R bytes from the second row data part, that is, from the second data. Therefore, for a detailed description of S3234A, please refer to the relevant description of S3234 above; it is not repeated here.
  • S320 specifically includes S321 to S324.
  • the specific processes of S321 and S324 in case 4 are the same as the specific processes of S321 and S324 in case 3, and will not be described again here.
  • the specific processes of S322 and S323 in case 4 are different from the specific processes of S322 and S323 in case 3.
  • the specific processes of S322 and S323 in case 4 are introduced in detail below.
  • S322 specifically includes: accumulating the data of adjacent columns on the g1th row of the second set of offsets in bytes to obtain the first set of offsets.
  • g1 is a positive integer, where c2 can refer to the relevant description above.
  • S323 specifically includes: accumulating the data of adjacent columns on the h1th row of the second data in bytes to obtain the first data.
  • h1 is a positive integer, where d2 can refer to the relevant description above.
  • if the first data page was reorganized from multiple data pages, the first data page needs to be split to obtain the multiple data pages.
  • the first data page includes information indicating that the first data page has been reorganized. In this way, whether the first data page has been reorganized can be known through the first data page.
  • the method 300 may also include:
  • S330 includes: S331, according to the header of the first data page, obtain the starting point and end point of the fourth data of each of the plurality of third data pages, and the starting point and end point of each fourth set of offsets.
  • in addition to the row data part and the directory part, the data page also includes a header and/or a trailer.
  • the following steps need to be performed. First, the data stored in the multiple headers and/or trailers corresponding to the multiple third data pages is obtained. Second, that data is split according to the target order to obtain the data stored in the header and/or trailer of each third data page, and the data is stored in the header and/or trailer of the corresponding third data page. Finally, based on the data stored in the header and/or trailer of each third data page, the starting point and end point of the fourth data corresponding to that third data page are obtained, as well as the starting point and end point of its fourth set of offsets.
  • S332: Obtain the multiple fourth data from the first data page according to the starting points and end points of the multiple fourth data; and obtain the multiple fourth sets of offsets from the first data page according to the starting points and end points of the multiple fourth sets of offsets.
  • the embodiments of this application do not limit the execution order of obtaining the fourth data and obtaining the fourth set of offsets described in S332 above.
  • the fourth data can be obtained first and then the fourth set of offsets, or the fourth set of offsets can be obtained first and then the fourth data, or the fourth data and the fourth set of offsets can be obtained at the same time.
  • S333 Store multiple fourth data and multiple fourth sets of offsets in multiple third data pages respectively.
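The splitting steps S331 to S333 can be sketched as follows. This is an illustration under assumptions: the `Span` record and the function `split_merged_page` are hypothetical, start and end points are treated as inclusive byte positions within the merged page, and each resulting third data page is modelled as a (row data, directory) pair.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Start/end points (inclusive) recorded for one original page."""
    data_begin: int
    data_end: int
    dirs_begin: int
    dirs_end: int


def split_merged_page(merged: bytes, spans: list[Span]) -> list[tuple[bytes, bytes]]:
    """Cut the row data and the offsets of each original page out of the merged page."""
    pages = []
    for s in spans:
        rows = merged[s.data_begin:s.data_end + 1]
        dirs = merged[s.dirs_begin:s.dirs_end + 1]
        pages.append((rows, dirs))
    return pages
```

Because the two slices are independent, the order in which the row data and the offsets are extracted does not matter, which matches the statement that the execution order is not limited.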
  • the decompression rate of the second data page is relatively high, which improves the decompression rate of data pages.
  • the decompression time of the embodiment of the present application is roughly on par with that of existing decompression methods.
  • Figure 15 is a schematic block diagram of a data page processing device provided by an embodiment of the present application.
  • the device 600 includes: a processing unit 610.
  • the processing unit 610 is used to implement each step described in the above method 200, which will not be described again here.
  • processing unit 610 is used to implement each step described in the above method 300, which will not be described again here.
  • Figure 16 shows a schematic structural diagram of another example of a data page processing device provided by an embodiment of the present application.
  • the data page processing device 700 includes: one or more processors 710 and one or more memories 720; the one or more memories 720 store one or more computer programs, and the one or more computer programs include instructions.
  • When the instructions are executed by the one or more processors 710, the data page processing device is caused to perform each step described in the above method 200 or method 300.
  • An embodiment of the present application provides a computer program product.
  • When the computer program product is run on a data page processing device, the data page processing device performs each step described in the above method 200 or method 300.
  • the implementation principles and technical effects are similar to the above-mentioned method-related embodiments, and will not be described again here.
  • An embodiment of the present application provides a readable storage medium.
  • the readable storage medium contains instructions.
  • When the instructions are run on a data page processing device, the data page processing device is caused to perform each step described in the above method 200 or method 300. The implementation principles and technical effects are similar and will not be described again here.
  • An embodiment of the present application provides a chip system, including: a processor, configured to call and run a computer program from a memory, so that a device installed with the chip system executes each step described in the above method 200 or method 300.
  • the implementation principles and technical effects are similar and will not be described again here.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Abstract

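A data page processing method and an apparatus thereof. The method includes: obtaining a second data page according to a first data page (S210); and compressing the second data page to obtain a compressed data page (S220). The first data page includes first data stored in row-based format and a first set of offsets, where the first set of offsets indicates the offset of each row of the first data. The second data page includes second data stored in row-based format and a second set of offsets, where the second set of offsets indicates the offset of each row of the second data. The second data is obtained by preprocessing the first data, the second set of offsets is obtained by applying the same preprocessing to the first set of offsets, and the preprocessing includes byte-level row-column conversion. This method not only achieves a relatively high compression rate but also keeps the compression time roughly on par with that of existing compression methods.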

Description

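Data page processing method and apparatus thereof
This application claims priority to Russian Federation patent application No. 2022112514, filed with the Russian Federation Patent Office on May 11, 2022 and entitled "Data page processing method and apparatus thereof", which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of the present application relate to the field of information technology, and more specifically, to a data page processing method and an apparatus thereof.
Background
Data compression technology not only saves storage space but also increases the data transmission rate, so it has been widely used in the field of information technology.
At present, data compression is usually implemented with dictionary compression algorithms, prefix compression algorithms, or general-purpose compression algorithms (for example, zlib, lz4, and zstd). However, none of these compression algorithms considers the characteristics of the data distribution, which results in a relatively low compression rate.
Summary
Embodiments of the present application provide a data page processing method and an apparatus thereof. The method not only achieves a relatively high compression rate (or decompression rate) but also keeps the compression time (or decompression time) roughly on par with that of existing compression (or decompression) methods.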
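In a first aspect, a data page processing method is provided, including: obtaining a second data page according to a first data page; and compressing the second data page to obtain a compressed data page; where the first data page includes first data stored in row-based format and a first set of offsets, the first set of offsets indicating the offset of each row of the first data; the second data page includes second data stored in row-based format and a second set of offsets, the second set of offsets indicating the offset of each row of the second data; the second data is obtained by preprocessing the first data, the second set of offsets is obtained by applying the preprocessing to the first set of offsets, and the preprocessing includes byte-level row-column conversion.
In the embodiments of the present application, byte-level row-column conversion is performed on the data stored in the row-based first data page; that is, the row-stored data is converted, in an ordered and reversible manner, into a column-stored form, and the data is updated in place within the data page. The converted second data page is then compressed. Because the rows of data stored in the resulting second data page exhibit similarity, repetition, and a certain regularity, compressing the second data page yields a higher compression rate than compressing the first data page directly, which improves the compression rate of data pages. In addition, the compression time of the embodiments of the present application is roughly on par with that of existing compression methods.
With reference to the first aspect, in some implementations of the first aspect, the obtaining a second data page according to the first data page includes: obtaining the first data and the first set of offsets separately from the first data page; preprocessing the first data byte by byte to obtain the second data; preprocessing the first set of offsets byte by byte to obtain the second set of offsets; and obtaining the second data page according to the second data and the second set of offsets.
With reference to the first aspect, in some implementations of the first aspect, the first data page includes a first row data part and a first directory part, the first row data part storing the first data and the first directory part storing the first set of offsets; the obtaining the second data page according to the second data and the second set of offsets includes: updating the first data stored in the first row data part to the second data, and updating the first set of offsets stored in the first directory part to the second set of offsets, to obtain the second data page.
With reference to the first aspect, in some implementations of the first aspect, the preprocessing the first data byte by byte to obtain the second data includes: obtaining the starting point and end point of the offsets of the first data; obtaining, according to the starting point and end point of the offsets of the first data and the unit offset length of the first set of offsets, the number M of offsets included in the first set of offsets; removing invalid offsets from the M offsets to obtain N offsets, where N is less than or equal to M, and N and M are both positive integers; arranging the N offsets in ascending order to obtain N sorted offsets; dividing, according to the N sorted offsets, the first row data part into N regions and obtaining the length of each row of the first data, where the number of data items in the n-th region of the N regions is the length of the n-th row of the first data; and, following the order of the N offsets, taking in turn from the N regions the data corresponding to the i-th byte as the data in the i-th row of the second data, across its N columns, where i takes values from 1 to L1 in turn, i is a positive integer, L1 is the maximum row length of the first data, and the order of the N offsets is either their ascending order or their order in the first set of offsets. The preprocessing the first set of offsets byte by byte to obtain the second set of offsets includes: preprocessing the N offsets in the first set of offsets byte by byte to obtain the second set of offsets.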
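The byte-level row-column conversion described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `transpose_rows` is hypothetical, the row data part is assumed to have been divided into N regions already (one `bytes` object per row, in offset order), and regions shorter than the current index i are simply skipped.

```python
def transpose_rows(regions: list[bytes]) -> bytes:
    """Emit the i-th byte of every region, for i = 1 .. L1 (the longest row)."""
    out = bytearray()
    max_len = max((len(r) for r in regions), default=0)  # plays the role of L1
    for i in range(max_len):
        for region in regions:
            if i < len(region):  # assumption: exhausted rows contribute nothing
                out.append(region[i])
    return bytes(out)
```

For three rows `abc`, `abcd`, and `abcde`, the sketch yields `aaabbbcccdde`: equal bytes from the same column end up adjacent, which is the repetition a general-purpose compressor can then exploit.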
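With reference to the first aspect, in some implementations of the first aspect, before the taking in turn from the N regions, following the order of the N offsets, the data corresponding to the i-th byte as the data in the i-th row of the second data, the method further includes: determining that the difference between the lengths of the rows of the first data is less than or equal to a first threshold.
In the embodiments of the present application, the data corresponding to the i-th byte is taken from the N regions, following the order of the N offsets, as the data in the i-th row of the second data only after it is determined that the difference between the lengths of the rows of the first data is less than or equal to the first threshold. In this way, byte-level row-column conversion is performed on the first data only when the row lengths of the first data do not differ much, which avoids wasting resources.
With reference to the first aspect, in some implementations of the first aspect, the preprocessing further includes byte-level differential processing, and the differential processing includes taking differences between column data.
In the embodiments of the present application, byte-level differential processing between column data creates more repeated data. This makes full use of the structure of the data, further increases its repetition and regularity, and thus improves the compression rate of data pages.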
结合第一方面,在第一方面的某些实现方式中,所述按照字节对所述第一数据进行所述预处理得到所述第二数据包括:按照字节对所述第一数据进行行列转换得到所述第三数据;将所述第三数据的第a1行上的相邻列的数据按照字节进行差分,得到所述第二数据,所述1≤a1≤a2,所述a1和a2均为正整数,所述a2等于所述第一数据的最大行长度或所述a2等于所述第一数据的最小行长度;所述按照字节对所述第一组偏移量进行所述预处理 得到所述第二组偏移量包括:按照字节对所述第一组偏移量进行行列转换得到所述第三组偏移量;将所述第三组偏移量的第b1行上的相邻列的数据按照字节进行差分,得到所述第二组偏移量,所述1≤b1≤b2,所述b1和b2均为正整数,所述b2等于所述第一组偏移量的最大行长度或所述b2等于所述第一组偏移量的最小行长度。
结合第一方面,在第一方面的某些实现方式中,所述按照字节对所述第一数据进行行列转换得到所述第三数据,包括:获取所述第一数据的偏移量的起始点和结束点;根据所述第一数据的偏移量的起始点和结束点,以及所述第一组偏移量的单位偏移量长度,得到所述第一组偏移量包括的偏移量的数量M;从所述M个偏移量中去除无效的偏移量,得到N个偏移量,所述N小于或等于所述M,所述N和所述M均为正整数;将所述N个偏移量按照从小到大的顺序进行排列,得到排序后的N个偏移量;根据所述排序后的N个偏移量,将所述第一行数据部划分为N个区域,并得到所述第一数据的每行数据的长度,其中所述N个区域中的第n个区域中的数据的个数为所述第一数据的第n行数据的长度;按照所述N个偏移量的排列顺序,依次从所述N个区域中,取第i个字节对应的数据作为所述第三数据第i行第N列的数据,所述i依次从1取至L1,所述i为正整数,所述L1为所述第一数据的最大行长度,所述N个偏移量的排列顺序为所述N个偏移量按照从小到大的排列顺序或所述N个偏移量在所述第一组偏移量中的排列顺序;所述按照字节对所述第一组偏移量进行行列转换得到所述第三组偏移量包括:按照字节对所述第一组偏移量中的所述N个偏移量进行行列转换得到所述第三组偏移量。
结合第一方面,在第一方面的某些实现方式中,在所述按照所述N个偏移量的排列顺序,依次从所述N个区域中,取第i个字节对应的数据作为所述第三数据第i行第N列的数据之前,所述方法还包括:确定所述第一数据的每行数据的长度之间的差异小于或等于第一阈值。
在本申请实施例中,在确定第一数据的每行数据的长度之间的差异小于或等于第一阈值的情况下,再按照N个偏移量的排列顺序,依次从N个区域中,取第i个字节对应的数据作为第三数据第i行第N列的数据。这样,只有在第一数据的每行数据的长度之间的差异不大的情况下,才去对第一数据进行基于字节级的行列转换,进而可以避免资源的浪费。
结合第一方面,在第一方面的某些实现方式中,在所述根据所述N个偏移量,将所述第一行数据部划分为N个区域之前,所述方法还包括:确定所述N小于或等于第二阈值。
在本申请实施例中,在确定N小于或等于第二阈值的情况下,再根据N个偏移量,将第一行数据部划分为N个区域。这样,只有在第一数据行数不多的情况下,才去对第一数据进行基于字节级的行列转换,进而可以避免资源的浪费。
结合第一方面,在第一方面的某些实现方式中,所述方法还包括:将连续的且结构相同的多个第三数据页进行重组,得到所述第一数据页;其中,所述第三数据页包括基于行存储方式的第四数据和第四组偏移量,所述第四组偏移量用于指示所述第四数据的每行数据的偏移量,所述第一数据包括多个所述第三数据页对应的多个所述第四数据,且多个所述第四数据的最大行长度相同,第一组偏移量包括多个所述第三数据页对应的多个所述第四组偏移量。
在本申请实施例中,在对数据页进行基于字节级的行列转换之前,可以将连续的且结构相同的多个数据页进行重组得到一个数据页。这样,可以充分利用数据页结构特点,将相似度较高的多个数据页重组成一个数据页,进而可以进一步提高数据页的压缩率。此外, 压缩耗时基本和现有压缩方法的压缩耗时也是基本持平。
结合第一方面,在第一方面的某些实现方式中,所述将连续的且结构相同的多个第三数据页进行重组,得到所述第一数据页,包括:分别获取与多个所述第三数据页对应的多个所述第四数据和多个所述第四组偏移量;分别将多个所述第四数据按照目标顺序进行排列,得到所述第一数据;以及,分别将多个所述第四组偏移量按照所述目标顺序进行排列,得到所述第一组偏移量,所述目标顺序为多个所述第三数据页的排列顺序;将所述第一数据和所述第一组偏移量分别存储至所述第一数据页。
结合第一方面,在第一方面的某些实现方式中,所述第一数据页包括用于指示所述第一数据页进行过重组的信息。
结合第一方面,在第一方面的某些实现方式中,所述第二数据页包括用于指示所述第二数据页进行过所述预处理的信息。
结合第一方面,在第一方面的某些实现方式中,所述方法还包括:对所述压缩后的数据页进行解压缩,得到所述第二数据页;根据所述第二数据页,得到所述第一数据页,所述第一数据是对第二数据进行所述预处理后得到的数据,所述第一组偏移量是对所述第二组偏移量进行所述预处理后得到的组偏移量。
在本申请实施例中,由于第二数据页中存储的每行数据具有相似性、重复度和一定规律性,这样对第二数据页进行解压缩的解压缩率就比较高,进而提高了数据页的解压缩率。此外,本申请实施例和现有解压缩方法的解压缩率耗时基本持平。
结合第一方面,在第一方面的某些实现方式中,所述根据所述第二数据页,得到所述第一数据页,包括:从所述第二数据页中分别获取所述第二数据和所述第二组偏移量;按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量;根据所述第一组偏移量,按照字节对所述第二数据进行所述预处理得到所述第一数据;根据所述第一数据和所述第一组偏移量,得到所述第一数据页。
结合第一方面,在第一方面的某些实现方式中,所述第二数据页包括第二行数据部和第二目录部,所述第二行数据部用于存储所述第二数据,所述第二目录部用于存储所述第二组偏移量;所述根据所述第一数据和所述第一组偏移量,得到所述第一数据页,包括:将所述第二行数据部中存储的所述第二数据更新为所述第一数据,并将所述第二目录部中存储的所述第二组偏移量更新为第一组偏移量,得到所述第一数据页。
结合第一方面,在第一方面的某些实现方式中,所述按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量,包括:根据所述第二组偏移量的单位偏移量长度,按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量;所述根据所述第一组偏移量,按照字节对所述第二数据进行所述预处理得到所述第一数据,包括:从所述第一组偏移量中去除无效的偏移量,得到第五组偏移量,所述第五组偏移量包括P个偏移量;将所述P个偏移量按照从小到大的顺序进行排列,得到排序后的P个偏移量;根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度,所述P个区域与所述P个偏移量一一对应;依次按顺序从所述第二行数据部中读取R个字节对应的数据,并依次将R个字节中第p个字节对应的数据存储至所述P个区域中第s个区域的第q个字节对应的数据,完成第q次数据的读写,其中,所述p为正整数,且所述p从1取至R,所述R为所述P个区域中未被写满数据的区域的数量,在所述第s个区域被写满数据的情况下,所述第s个区域中的数据的数量为所述第一数据的第s行数据的长度,所述 s为正整数,所述第s个区域对应的偏移量为第s个偏移量,所述第s个偏移量位于所述第五组偏移量中除被写满数据的区域对应的偏移量之外的偏移量中的第p个偏移量;所述q从1取值L2,所述L2为所述第一数据的最大行长度;将所述第二行数据部中存储的所述第二数据更新为所述第一数据包括:依次将所述P个区域中的数据覆盖所述第二行数据部中存储的所述第二数据。
结合第一方面,在第一方面的某些实现方式中,在所述依次按顺序从所述第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写之前,所述方法还包括:确定所述第一数据的每行数据的长度之间的差异小于或等于第三阈值。
在本申请实施例中,在确定第一数据的每行数据的长度之间的差异小于或等于第三阈值的情况下,再依次按顺序从第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写。这样,只有在第一数据的每行数据的长度之间的差异不大的情况下,才去对第二数据进行基于字节级的行列转换,进而可以避免资源的浪费。
结合第一方面,在第一方面的某些实现方式中,所述预处理还包括基于字节级的累加处理,所述累加处理包括列数据之间进行累加。
结合第一方面,在第一方面的某些实现方式中,所述按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量,包括:将所述第二组偏移量的第c1行上的相邻列的数据按照字节进行累加,得到第三组偏移量,所述1≤c1≤c2,所述c1和c2均为正整数,所述c2等于所述第二组偏移量的最大行长度或所述c2等于所述第二组偏移量的最小行长度;按照字节对所述第三组偏移量进行行列转换,得到所述第一组偏移量;所述根据所述第一组偏移量,按照字节对所述第二数据进行所述预处理得到所述第一数据,包括:将所述第二数据的第d1行上的相邻列的数据进行按照字节累加,得到第三数据,所述1≤d1≤d2,所述d1和d2均为正整数,所述d2等于所述第二数据的最大行长度或所述d2等于所述第二数据的最小行长度;根据所述第一组偏移量,按照字节对所述第三数据进行行列转换得到所述第一数据。
结合第一方面,在第一方面的某些实现方式中,所述按照字节对所述第三组偏移量进行行列转换,得到所述第一组偏移量,包括:根据所述第三组偏移量的单位偏移量长度,按照字节对所述第三组偏移量进行行列转换处理得到所述第一组偏移量;所述根据所述第一组偏移量,按照字节对所述第三数据进行行列转换得到所述第一数据,包括:从所述第一组偏移量中去除无效的偏移量,得到第五组偏移量,所述第五组偏移量包括P个偏移量;将所述P个偏移量按照从小到大的顺序进行排列,得到排序后的P个偏移量;根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度,所述P个区域与所述P个偏移量一一对应;依次按顺序从所述第三数据中读取R个字节对应的数据,并依次将R个字节中第p个字节对应的数据存储至所述P个区域中第s个区域的第q个字节对应的数据,完成第q次数据的读写,其中,所述p为正整数,且所述p从1取至R,所述R为所述P个区域中未被写满数据的区域的数量,在所述第s个区域被写满数据的情况下,所述第s个区域中的数据的数量为所述第一数据的第s行数据的长度,所述s为正整数,所述第s个区域对应的偏移量为第s个偏移量,所述第s个偏移量位于所述第五组偏移量中除被写满数据的区域对应的偏移量之外的偏移量中的第p个偏移量;所述q从1 取值L2,所述L2为所述第一数据的最大行长度;将所述第二行数据部中存储的所述第二数据更新为所述第一数据包括:依次将所述P个区域中的数据覆盖所述第二行数据部中存储的所述第二数据。
结合第一方面,在第一方面的某些实现方式中,在所述依次按顺序从所述第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写之前,所述方法还包括:确定所述第一数据的每行数据的长度之间的差异小于或等于第三阈值。
在本申请实施例中,在确定第一数据的每行数据的长度之间的差异小于或等于第三阈值的情况下,再依次按顺序从第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写。这样,只有在第一数据的每行数据的长度之间的差异不大的情况下,才去对第三数据进行基于字节级的行列转换,进而可以避免资源的浪费。
结合第一方面,在第一方面的某些实现方式中,在所述根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度之前,所述方法还包括:确定所述P小于或等于第四阈值。
在本申请实施例中,在确定P小于或等于第四阈值的情况下,再根据排序后的P个偏移量,创建P个区域。这样,只有在第一数据行数不多的情况下,才去对第三数据进行基于字节级的行列转换,进而可以避免资源的浪费。
结合第一方面,在第一方面的某些实现方式中,所述方法还包括:将所述第一数据页进行拆分,得到所述多个第三数据页。
结合第一方面,在第一方面的某些实现方式中,所述将所述第一数据页进行拆分,得到所述多个第三数据页,包括:获取多个所述第三数据页的第四数据的起始点和结束点,以及第四组偏移量的起始点和结束点;根据多个所述第四数据的起始点和结束点,从所述第一数据页中得到多个所述第四数据;以及,根据多个所述第四组偏移量的起始点和结束点,从所述第一数据页中得到多个所述第四组偏移量;分别将多个所述第四数据和多个所述第四组偏移量分别存储至多个所述第三数据页。
结合第一方面,在第一方面的某些实现方式中,所述第一数据页包括用于指示所述第一数据页进行过重组的信息。
第二方面,提供了一种数据页处理的方法,包括:对所述压缩后的数据页进行解压缩,得到所述第二数据页;根据所述第二数据页,得到所述第一数据页;其中,所述第二数据页包括基于行存储方式的第二数据和第二组偏移量,所述第二组偏移量用于指示所述第二数据的每行数据的偏移量;所述第一数据页包括基于行存储方式的第一数据和第一组偏移量,所述第一组偏移量用于指示所述第一数据的每行数据的偏移量;所述第一数据是对第二数据进行预处理后得到的数据,所述第一组偏移量是对所述第二组偏移量进行所述预处理后得到的组偏移量。
在本申请实施例中,由于第二数据页中存储的每行数据具有相似性、重复度和一定规律性,这样对第二数据页进行解压缩的解压缩率就比较高,进而提高了数据页的解压缩率。此外,本申请实施例和现有解压缩方法的解压缩率耗时基本持平。
结合第二方面,在第二方面的某些实现方式中,所述根据所述第二数据页,得到所述第一数据页,包括:从所述第二数据页中分别获取所述第二数据和所述第二组偏移量;按 照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量;根据所述第一组偏移量,按照字节对所述第二数据进行所述预处理得到所述第一数据;根据所述第一数据和所述第一组偏移量,得到所述第一数据页。
结合第二方面,在第二方面的某些实现方式中,所述第二数据页包括第二行数据部和第二目录部,所述第二行数据部用于存储所述第二数据,所述第二目录部用于存储所述第二组偏移量;所述根据所述第一数据和所述第一组偏移量,得到所述第一数据页,包括:将所述第二行数据部中存储的所述第二数据更新为所述第一数据,并将所述第二目录部中存储的所述第二组偏移量更新为第一组偏移量,得到所述第一数据页。
结合第二方面,在第二方面的某些实现方式中,所述按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量,包括:根据所述第二组偏移量的单位偏移量长度,按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量;所述根据所述第一组偏移量,按照字节对所述第二数据进行所述预处理得到所述第一数据,包括:从所述第一组偏移量中去除无效的偏移量,得到第五组偏移量,所述第五组偏移量包括P个偏移量;将所述P个偏移量按照从小到大的顺序进行排列,得到排序后的P个偏移量;根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度,所述P个区域与所述P个偏移量一一对应;依次按顺序从所述第二行数据部中读取R个字节对应的数据,并依次将R个字节中第p个字节对应的数据存储至所述P个区域中第s个区域的第q个字节对应的数据,完成第q次数据的读写,其中,所述p为正整数,且所述p从1取至R,所述R为所述P个区域中未被写满数据的区域的数量,在所述第s个区域被写满数据的情况下,所述第s个区域中的数据的数量为所述第一数据的第s行数据的长度,所述s为正整数,所述第s个区域对应的偏移量为第s个偏移量,所述第s个偏移量位于所述第五组偏移量中除被写满数据的区域对应的偏移量之外的偏移量中的第p个偏移量;所述q从1取值L2,所述L2为所述第一数据的最大行长度;将所述第二行数据部中存储的所述第二数据更新为所述第一数据包括:依次将所述P个区域中的数据覆盖所述第二行数据部中存储的所述第二数据。
结合第二方面,在第二方面的某些实现方式中,在所述依次按顺序从所述第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写之前,所述方法还包括:确定所述第一数据的每行数据的长度之间的差异小于或等于第三阈值。
在本申请实施例中,在确定第一数据的每行数据的长度之间的差异小于或等于第三阈值的情况下,再依次按顺序从第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写。这样,只有在第一数据的每行数据的长度之间的差异不大的情况下,才去对第二数据进行基于字节级的行列转换,进而可以避免资源的浪费。
结合第二方面,在第二方面的某些实现方式中,所述预处理还包括基于字节级的累加处理,所述累加处理包括列数据之间进行累加。
结合第二方面,在第二方面的某些实现方式中,所述按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量,包括:将所述第二组偏移量的第c1行上的相邻列的数据按照字节进行累加,得到第三组偏移量,所述1≤c1≤c2,所述c1和c2均为正整数,所述c2等于所述第二组偏移量的最大行长度或所述c2等于所述第二组偏移量的最小行长 度;按照字节对所述第三组偏移量进行行列转换,得到所述第一组偏移量;所述根据所述第一组偏移量,按照字节对所述第二数据进行所述预处理得到所述第一数据,包括:将所述第二数据的第d1行上的相邻列的数据进行按照字节累加,得到第三数据,所述1≤d1≤d2,所述d1和d2均为正整数,所述d2等于所述第二数据的最大行长度或所述d2等于所述第二数据的最小行长度;根据所述第一组偏移量,按照字节对所述第三数据进行行列转换得到所述第一数据。
结合第二方面,在第二方面的某些实现方式中,所述按照字节对所述第三组偏移量进行行列转换,得到所述第一组偏移量,包括:根据所述第三组偏移量的单位偏移量长度,按照字节对所述第三组偏移量进行行列转换处理得到所述第一组偏移量;所述根据所述第一组偏移量,按照字节对所述第三数据进行行列转换得到所述第一数据,包括:从所述第一组偏移量中去除无效的偏移量,得到第五组偏移量,所述第五组偏移量包括P个偏移量;将所述P个偏移量按照从小到大的顺序进行排列,得到排序后的P个偏移量;根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度,所述P个区域与所述P个偏移量一一对应;依次按顺序从所述第三数据中读取R个字节对应的数据,并依次将R个字节中第p个字节对应的数据存储至所述P个区域中第s个区域的第q个字节对应的数据,完成第q次数据的读写,其中,所述p为正整数,且所述p从1取至R,所述R为所述P个区域中未被写满数据的区域的数量,在所述第s个区域被写满数据的情况下,所述第s个区域中的数据的数量为所述第一数据的第s行数据的长度,所述s为正整数,所述第s个区域对应的偏移量为第s个偏移量,所述第s个偏移量位于所述第五组偏移量中除被写满数据的区域对应的偏移量之外的偏移量中的第p个偏移量;所述q从1取值L2,所述L2为所述第一数据的最大行长度;将所述第二行数据部中存储的所述第二数据更新为所述第一数据包括:依次将所述P个区域中的数据覆盖所述第二行数据部中存储的所述第二数据。
结合第二方面,在第二方面的某些实现方式中,在所述依次按顺序从所述第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写之前,所述方法还包括:确定所述第一数据的每行数据的长度之间的差异小于或等于第三阈值。
在本申请实施例中,在确定第一数据的每行数据的长度之间的差异小于或等于第三阈值的情况下,再依次按顺序从第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写。这样,只有在第一数据的每行数据的长度之间的差异不大的情况下,才去对第三数据进行基于字节级的行列转换,进而可以避免资源的浪费。
结合第二方面,在第二方面的某些实现方式中,在所述根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度之前,所述方法还包括:确定所述P小于或等于第四阈值。
在本申请实施例中,在确定P小于或等于第四阈值的情况下,再根据排序后的P个偏移量,创建P个区域。这样,只有在第一数据行数不多的情况下,才去对第三数据进行基于字节级的行列转换,进而可以避免资源的浪费。
结合第二方面,在第二方面的某些实现方式中,所述方法还包括:将所述第一数据页进行拆分,得到所述多个第三数据页。
结合第二方面,在第二方面的某些实现方式中,所述将所述第一数据页进行拆分,得到所述多个第三数据页,包括:获取多个所述第三数据页的第四数据的起始点和结束点,以及第四组偏移量的起始点和结束点;根据多个所述第四数据的起始点和结束点,从所述第一数据页中得到多个所述第四数据;以及,根据多个所述第四组偏移量的起始点和结束点,从所述第一数据页中得到多个所述第四组偏移量;分别将多个所述第四数据和多个所述第四组偏移量分别存储至多个所述第三数据页。
结合第二方面,在第二方面的某些实现方式中,所述第一数据页包括用于指示所述第一数据页进行过重组的信息。
第三方面,提供了一种数据页处理的装置,所述装置包括处理单元,所述处理单元用于:根据第一数据页,得到第二数据页;对所述第二数据页进行压缩,得到压缩后的数据页;其中,所述第一数据页包括基于行存储方式的第一数据和第一组偏移量,所述第一组偏移量用于指示所述第一数据的每行数据的偏移量;所述第二数据页包括基于行存储方式的第二数据和第二组偏移量,所述第二组偏移量用于指示所述第二数据的每行数据的偏移量,所述第二数据是对第一数据进行预处理后得到的数据,所述第二组偏移量是对所述第一组偏移量进行所述预处理后得到的组偏移量,所述预处理包括基于字节级的行列转换。
在本申请实施例中,该数据页处理的装置的处理单元对基于行存储方式的第一数据页中存储的数据进行基于字节级的行列转换,也就是说,将基于行存储方式的数据以一种有序可逆的方式转换为基于列存储方式的数据的形式,使数据在数据页内原地更新。然后再对转换后的第二数据页进行压缩。而由于得到的第二数据页中存储的每行数据具有相似性、重复度和一定规律性,这样对第二数据页进行压缩的压缩率比直接对第一数据页进行压缩的压缩率要高,进而提高了数据页的压缩率。此外,该该数据页处理的装置和现有压缩装置的压缩耗时基本持平。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:从所述第一数据页中分别获取所述第一数据和所述第一组偏移量;按照字节对所述第一数据进行所述预处理得到所述第二数据;按照字节对所述第一组偏移量进行所述预处理得到所述第二组偏移量;根据所述第二数据和所述第二组偏移量,得到所述第二数据页。
结合第三方面,在第三方面的某些实现方式中,所述第一数据页包括第一行数据部和第一目录部,所述第一行数据部用于存储所述第一数据,所述第一目录部用于存储所述第一组偏移量;所述处理单元还具体用于:将所述第一行数据部中存储的所述第一数据更新为所述第二数据,并将所述第一目录部中存储的所述第一组偏移量更新为第二组偏移量,得到所述第二数据页。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:获取所述第一数据的偏移量的起始点和结束点;根据所述第一数据的偏移量的起始点和结束点,以及所述第一组偏移量的单位偏移量长度,得到所述第一组偏移量包括的偏移量的数量M;从所述M个偏移量中去除无效的偏移量,得到N个偏移量,所述N小于或等于所述M,所述N和所述M均为正整数;将所述N个偏移量按照从小到大的顺序进行排列,得到排序后的N个偏移量;根据所述排序后的N个偏移量,将所述第一行数据部划分为N个区域,并得到所述第一数据的每行数据的长度,其中所述N个区域中的第n个区域中的数据的个数为所述第一数据的第n行数据的长度;按照所述N个偏移量的排列顺序,依次从所 述N个区域中,取第i个字节对应的数据作为所述第二数据第i行第N列的数据,所述i依次从1取至L1,所述i为正整数,所述L1为所述第一数据的最大行长度,所述N个偏移量的排列顺序为所述N个偏移量按照从小到大的排列顺序或所述N个偏移量在所述第一组偏移量中的排列顺序;所述处理单元还具体用于:按照字节对所述第一组偏移量中的所述N个偏移量进行所述预处理得到所述第二组偏移量。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还用于:在所述按照所述N个偏移量的排列顺序,依次从所述N个区域中,取第i个字节对应的数据作为第二数据第i行第N列的数据之前,确定所述第一数据的每行数据的长度之间的差异小于或等于第一阈值。
结合第三方面,在第三方面的某些实现方式中,所述预处理还包括基于字节级的差分处理,所述差分处理包括列数据之间进行差分。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:按照字节对所述第一数据进行行列转换得到所述第三数据;将所述第三数据的第a1行上的相邻列的数据按照字节进行差分,得到所述第二数据,所述1≤a1≤a2,所述a1和a2均为正整数,所述a2等于所述第一数据的最大行长度或所述a2等于所述第一数据的最小行长度;所述处理单元还具体用于:按照字节对所述第一组偏移量进行行列转换得到所述第三组偏移量;将所述第三组偏移量的第b1行上的相邻列的数据按照字节进行差分,得到所述第二组偏移量,所述1≤b1≤b2,所述b1和b2均为正整数,所述b2等于所述第一组偏移量的最大行长度或所述b2等于所述第一组偏移量的最小行长度。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:获取所述第一数据的偏移量的起始点和结束点;根据所述第一数据的偏移量的起始点和结束点,以及所述第一组偏移量的单位偏移量长度,得到所述第一组偏移量包括的偏移量的数量M;从所述M个偏移量中去除无效的偏移量,得到N个偏移量,所述N小于或等于所述M,所述N和所述M均为正整数;将所述N个偏移量按照从小到大的顺序进行排列,得到排序后的N个偏移量;根据所述排序后的N个偏移量,将所述第一行数据部划分为N个区域,并得到所述第一数据的每行数据的长度,其中所述N个区域中的第n个区域中的数据的个数为所述第一数据的第n行数据的长度;按照所述N个偏移量的排列顺序,依次从所述N个区域中,取第i个字节对应的数据作为所述第三数据第i行第N列的数据,所述i依次从1取至L1,所述i为正整数,所述L1为所述第一数据的最大行长度,所述N个偏移量的排列顺序为所述N个偏移量按照从小到大的排列顺序或所述N个偏移量在所述第一组偏移量中的排列顺序;所述处理单元还具体用于:按照字节对所述第一组偏移量中的所述N个偏移量进行行列转换得到所述第三组偏移量。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还用于:在所述按照所述N个偏移量的排列顺序,依次从所述N个区域中,取第i个字节对应的数据作为所述第三数据第i行第N列的数据之前,确定所述第一数据的每行数据的长度之间的差异小于或等于第一阈值。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还用于:在所述根据所述N个偏移量,将所述第一行数据部划分为N个区域之前,确定所述N小于或等于第二阈值。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还用于:将连续的且结 构相同的多个第三数据页进行重组,得到所述第一数据页;其中,所述第三数据页包括基于行存储方式的第四数据和第四组偏移量,所述第四组偏移量用于指示所述第四数据的每行数据的偏移量,所述第一数据包括多个所述第三数据页对应的多个所述第四数据,且多个所述第四数据的最大行长度相同,第一组偏移量包括多个所述第三数据页对应的多个所述第四组偏移量。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:分别获取与多个所述第三数据页对应的多个所述第四数据和多个所述第四组偏移量;分别将多个所述第四数据按照目标顺序进行排列,得到所述第一数据;以及,分别将多个所述第四组偏移量按照所述目标顺序进行排列,得到所述第一组偏移量,所述目标顺序为多个所述第三数据页的排列顺序;将所述第一数据和所述第一组偏移量分别存储至所述第一数据页。
结合第三方面,在第三方面的某些实现方式中,所述第一数据页包括用于指示所述第一数据页进行过重组的信息。
结合第三方面,在第三方面的某些实现方式中,所述第二数据页包括用于指示所述第二数据页进行过所述预处理的信息。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还用于:对所述压缩后的数据页进行解压缩,得到所述第二数据页;根据所述第二数据页,得到所述第一数据页,所述第一数据是对第二数据进行所述预处理后得到的数据,所述第一组偏移量是对所述第二组偏移量进行所述预处理后得到的组偏移量。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:从所述第二数据页中分别获取所述第二数据和所述第二组偏移量;按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量;根据所述第一组偏移量,按照字节对所述第二数据进行所述预处理得到所述第一数据;根据所述第一数据和所述第一组偏移量,得到所述第一数据页。
结合第三方面,在第三方面的某些实现方式中,所述第二数据页包括第二行数据部和第二目录部,所述第二行数据部用于存储所述第二数据,所述第二目录部用于存储所述第二组偏移量;所述处理单元还具体用于:将所述第二行数据部中存储的所述第二数据更新为所述第一数据,并将所述第二目录部中存储的所述第二组偏移量更新为第一组偏移量,得到所述第一数据页。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:根据所述第二组偏移量的单位偏移量长度,按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量;所述处理单元还具体用于:从所述第一组偏移量中去除无效的偏移量,得到第五组偏移量,所述第五组偏移量包括P个偏移量;将所述P个偏移量按照从小到大的顺序进行排列,得到排序后的P个偏移量;根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度,所述P个区域与所述P个偏移量一一对应;依次按顺序从所述第二行数据部中读取R个字节对应的数据,并依次将R个字节中第p个字节对应的数据存储至所述P个区域中第s个区域的第q个字节对应的数据,完成第q次数据的读写,其中,所述p为正整数,且所述p从1取至R,所述R为所述P个区域中未被写满数据的区域的数量,在所述第s个区域被写满数据的情况下,所述第s个区域中的数据的数量为所述第一数据的第s行数据的长度,所述s为正整数,所述第s个区域对应的偏移量为第s个偏移量,所述第s个偏移量位于所述第五组偏移量中除被写满数据的 区域对应的偏移量之外的偏移量中的第p个偏移量;所述q从1取值L2,所述L2为所述第一数据的最大行长度;所述处理单元还具体用于:依次将所述P个区域中的数据覆盖所述第二行数据部中存储的所述第二数据。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:在所述依次按顺序从所述第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写之前,确定所述第一数据的每行数据的长度之间的差异小于或等于第三阈值。
结合第三方面,在第三方面的某些实现方式中,所述预处理还包括基于字节级的累加处理,所述累加处理包括列数据之间进行累加。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:将所述第二组偏移量的第c1行上的相邻列的数据按照字节进行累加,得到第三组偏移量,所述1≤c1≤c2,所述c1和c2均为正整数,所述c2等于所述第二组偏移量的最大行长度或所述c2等于所述第二组偏移量的最小行长度;按照字节对所述第三组偏移量进行行列转换,得到所述第一组偏移量;所述处理单元还具体用于:将所述第二数据的第d1行上的相邻列的数据进行按照字节累加,得到第三数据,所述1≤d1≤d2,所述d1和d2均为正整数,所述d2等于所述第二数据的最大行长度或所述d2等于所述第二数据的最小行长度;根据所述第一组偏移量,按照字节对所述第三数据进行行列转换得到所述第一数据。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:根据所述第三组偏移量的单位偏移量长度,按照字节对所述第三组偏移量进行行列转换处理得到所述第一组偏移量;所述处理单元还具体用于:从所述第一组偏移量中去除无效的偏移量,得到第五组偏移量,所述第五组偏移量包括P个偏移量;将所述P个偏移量按照从小到大的顺序进行排列,得到排序后的P个偏移量;根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度,所述P个区域与所述P个偏移量一一对应;依次按顺序从所述第三数据中读取R个字节对应的数据,并依次将R个字节中第p个字节对应的数据存储至所述P个区域中第s个区域的第q个字节对应的数据,完成第q次数据的读写,其中,所述p为正整数,且所述p从1取至R,所述R为所述P个区域中未被写满数据的区域的数量,在所述第s个区域被写满数据的情况下,所述第s个区域中的数据的数量为所述第一数据的第s行数据的长度,所述s为正整数,所述第s个区域对应的偏移量为第s个偏移量,所述第s个偏移量位于所述第五组偏移量中除被写满数据的区域对应的偏移量之外的偏移量中的第p个偏移量;所述q从1取值L2,所述L2为所述第一数据的最大行长度;所述处理单元还具体用于:依次将所述P个区域中的数据覆盖所述第二行数据部中存储的所述第二数据。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还用于:在所述依次按顺序从所述第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写之前,确定所述第一数据的每行数据的长度之间的差异小于或等于第三阈值。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还用于:在所述根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度之前,确定所述P小于或等于第四阈值。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还用于:将所述第一数 据页进行拆分,得到所述多个第三数据页。
结合第三方面,在第三方面的某些实现方式中,所述处理单元还具体用于:获取多个所述第三数据页的第四数据的起始点和结束点,以及第四组偏移量的起始点和结束点;根据多个所述第四数据的起始点和结束点,从所述第一数据页中得到多个所述第四数据;以及,根据多个所述第四组偏移量的起始点和结束点,从所述第一数据页中得到多个所述第四组偏移量;分别将多个所述第四数据和多个所述第四组偏移量分别存储至多个所述第三数据页。
结合第三方面,在第三方面的某些实现方式中,所述第一数据页包括用于指示所述第一数据页进行过重组的信息。
第四方面,提供了一种数据页处理的装置,所述装置包括处理单元,所述处理单元用于:对所述压缩后的数据页进行解压缩,得到所述第二数据页;根据所述第二数据页,得到所述第一数据页;其中,所述第二数据页包括基于行存储方式的第二数据和第二组偏移量,所述第二组偏移量用于指示所述第二数据的每行数据的偏移量;所述第一数据页包括基于行存储方式的第一数据和第一组偏移量,所述第一组偏移量用于指示所述第一数据的每行数据的偏移量;所述第一数据是对第二数据进行预处理后得到的数据,所述第一组偏移量是对所述第二组偏移量进行所述预处理后得到的组偏移量。
在本申请实施例中,由于第二数据页中存储的每行数据具有相似性、重复度和一定规律性,这样该数据页处理的装置的处理单元对第二数据页进行解压缩的解压缩率就比较高,进而提高了数据页的解压缩率。此外,该该数据页处理的装置和现有解压缩装置的压缩耗时基本持平。
结合第四方面,在第四方面的某些实现方式中,所述处理单元还具体用于:从所述第二数据页中分别获取所述第二数据和所述第二组偏移量;按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量;根据所述第一组偏移量,按照字节对所述第二数据进行所述预处理得到所述第一数据;根据所述第一数据和所述第一组偏移量,得到所述第一数据页。
结合第四方面,在第四方面的某些实现方式中,所述第二数据页包括第二行数据部和第二目录部,所述第二行数据部用于存储所述第二数据,所述第二目录部用于存储所述第二组偏移量;所述处理单元还具体用于:将所述第二行数据部中存储的所述第二数据更新为所述第一数据,并将所述第二目录部中存储的所述第二组偏移量更新为第一组偏移量,得到所述第一数据页。
结合第四方面,在第四方面的某些实现方式中,所述处理单元还具体用于:根据所述第二组偏移量的单位偏移量长度,按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量;所述处理单元还具体用于:从所述第一组偏移量中去除无效的偏移量,得到第五组偏移量,所述第五组偏移量包括P个偏移量;将所述P个偏移量按照从小到大的顺序进行排列,得到排序后的P个偏移量;根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度,所述P个区域与所述P个偏移量一一对应;依次按顺序从所述第二行数据部中读取R个字节对应的数据,并依次将R个字节中第p个字节对应的数据存储至所述P个区域中第s个区域的第q个字节对应的数据,完成第q次数据的读写,其中,所述p为正整数,且所述p从1取至R,所述R为所述P个区域中未被写满数据的区域的数量,在所述第s个区域被写满数据的情况下,所述第s个区域中 的数据的数量为所述第一数据的第s行数据的长度,所述s为正整数,所述第s个区域对应的偏移量为第s个偏移量,所述第s个偏移量位于所述第五组偏移量中除被写满数据的区域对应的偏移量之外的偏移量中的第p个偏移量;所述q从1取值L2,所述L2为所述第一数据的最大行长度;所述处理单元还具体用于:依次将所述P个区域中的数据覆盖所述第二行数据部中存储的所述第二数据。
结合第四方面,在第四方面的某些实现方式中,所述处理单元还具体用于:在所述依次按顺序从所述第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写之前,确定所述第一数据的每行数据的长度之间的差异小于或等于第三阈值。
结合第四方面,在第四方面的某些实现方式中,所述预处理还包括基于字节级的累加处理,所述累加处理包括列数据之间进行累加。
结合第四方面,在第四方面的某些实现方式中,所述处理单元还具体用于:将所述第二组偏移量的第c1行上的相邻列的数据按照字节进行累加,得到第三组偏移量,所述1≤c1≤c2,所述c1和c2均为正整数,所述c2等于所述第二组偏移量的最大行长度或所述c2等于所述第二组偏移量的最小行长度;按照字节对所述第三组偏移量进行行列转换,得到所述第一组偏移量;所述处理单元还具体用于:将所述第二数据的第d1行上的相邻列的数据进行按照字节累加,得到第三数据,所述1≤d1≤d2,所述d1和d2均为正整数,所述d2等于所述第二数据的最大行长度或所述d2等于所述第二数据的最小行长度;根据所述第一组偏移量,按照字节对所述第三数据进行行列转换得到所述第一数据。
结合第四方面,在第四方面的某些实现方式中,所述处理单元还具体用于:根据所述第三组偏移量的单位偏移量长度,按照字节对所述第三组偏移量进行行列转换处理得到所述第一组偏移量;所述处理单元还具体用于:从所述第一组偏移量中去除无效的偏移量,得到第五组偏移量,所述第五组偏移量包括P个偏移量;将所述P个偏移量按照从小到大的顺序进行排列,得到排序后的P个偏移量;根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度,所述P个区域与所述P个偏移量一一对应;依次按顺序从所述第三数据中读取R个字节对应的数据,并依次将R个字节中第p个字节对应的数据存储至所述P个区域中第s个区域的第q个字节对应的数据,完成第q次数据的读写,其中,所述p为正整数,且所述p从1取至R,所述R为所述P个区域中未被写满数据的区域的数量,在所述第s个区域被写满数据的情况下,所述第s个区域中的数据的数量为所述第一数据的第s行数据的长度,所述s为正整数,所述第s个区域对应的偏移量为第s个偏移量,所述第s个偏移量位于所述第五组偏移量中除被写满数据的区域对应的偏移量之外的偏移量中的第p个偏移量;所述q从1取值L2,所述L2为所述第一数据的最大行长度;所述处理单元还具体用于:依次将所述P个区域中的数据覆盖所述第二行数据部中存储的所述第二数据。
结合第四方面,在第四方面的某些实现方式中,所述处理单元还用于:在所述依次按顺序从所述第二行数据部中读取P个字节对应的数据,并依次将P个字节中第p个字节对应的数据存储至第s个区域的第q个字节对应的数据,完成第q次数据的读写之前,确定所述第一数据的每行数据的长度之间的差异小于或等于第三阈值。
结合第四方面,在第四方面的某些实现方式中,所述处理单元还用于:在所述根据所述排序后的P个偏移量,创建P个区域,并得到所述第一数据的每行数据的长度之前,确 定所述P小于或等于第四阈值。
结合第四方面,在第四方面的某些实现方式中,所述处理单元还用于:将所述第一数据页进行拆分,得到所述多个第三数据页。
结合第四方面,在第四方面的某些实现方式中,所述处理单元还具体用于:获取多个所述第三数据页的第四数据的起始点和结束点,以及第四组偏移量的起始点和结束点;根据多个所述第四数据的起始点和结束点,从所述第一数据页中得到多个所述第四数据;以及,根据多个所述第四组偏移量的起始点和结束点,从所述第一数据页中得到多个所述第四组偏移量;分别将多个所述第四数据和多个所述第四组偏移量分别存储至多个所述第三数据页。
结合第四方面,在第四方面的某些实现方式中,所述第一数据页包括用于指示所述第一数据页进行过重组的信息。
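In a fifth aspect, a data page processing apparatus is provided, including a processor and a memory; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the apparatus performs the method in any possible implementation of the first aspect or the second aspect.
In a sixth aspect, a computer-readable storage medium is provided; the computer-readable storage medium stores a computer program which, when run on a computer, causes the computer to perform the method in any possible implementation of the first aspect or the second aspect.
In a seventh aspect, a chip system is provided, including a processor configured to call and run a computer program from a memory, so that an apparatus in which the chip system is installed performs the method in any possible implementation of the first aspect or the second aspect.
In an eighth aspect, a computer program product containing instructions is provided; when the computer program product is run on an apparatus, the apparatus is caused to perform the method in any possible implementation of the first aspect or the second aspect.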
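Brief Description of the Drawings
Figure 1 is a schematic flowchart of an example of a data page compression method 200 provided by an embodiment of the present application.
Figure 2 is a schematic diagram of an example of a data page provided by an embodiment of the present application.
Figures 3 to 6 are schematic flowcharts of obtaining the second data page provided by embodiments of the present application.
Figure 7 is a schematic diagram of an example of reorganizing multiple data pages provided by an embodiment of the present application.
Figure 8 is a schematic diagram of an example of the compression performance of the data page compression method provided by an embodiment of the present application and of existing compression methods.
Figure 9 is a schematic diagram of another example of the compression performance of the data page compression method provided by an embodiment of the present application and of existing compression methods.
Figure 10 is a schematic diagram of yet another example of the compression performance of the data page compression method provided by an embodiment of the present application and of existing compression methods.
Figure 11 is a schematic diagram of still another example of the compression performance of the data page compression method provided by an embodiment of the present application and of existing compression methods.
Figure 12 is a schematic flowchart of another example of a data page decompression method provided by an embodiment of the present application.
Figure 13 is a schematic diagram of an example process of reading and writing the second data provided by an embodiment of the present application.
Figure 14 is a schematic diagram of another example process of reading and writing the second data provided by an embodiment of the present application.
Figure 15 is a schematic block diagram of an example of a data page processing device provided by an embodiment of the present application.
Figure 16 is a schematic structural diagram of an example of a data page processing device provided by an embodiment of the present application.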
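Detailed Description
Data compression technology not only saves storage space but also increases the data transmission rate, so it has been widely used in the field of information technology.
For example, in the transparent compression used in MySQL, before all the data of a single data page is written to disk, it is first handed to a data compression library, which compresses the single data page using a general-purpose compression algorithm (for example, zlib, lz4, or zstd) and then stores the compressed data at the original address. The hole-punching technique of the file system (a file system feature; the hole-punching unit is 4K) is then used to punch holes in the free space. However, on Linux the hole-punching unit is 4K, and data pages of 32K and 64K do not support this compression; the largest data page that supports it is 16K. Therefore, the compression ratio can reach at most 4:1.
As another example, in the dictionary compression used in Oracle, a dictionary of row fields is created at the block level (the data page concept), and fields store references to dictionary elements. When update, insert, or delete operations occur on the dictionary, a threshold controls whether compression is performed. However, this dictionary compression is a storage-block-based dictionary compression algorithm that maintains a symbol table within the page; all operations are transformed and inverse-transformed based on this symbol table, which is complex to implement. Moreover, dictionary compression is easily affected by the characteristics of the data itself: for data with little repetition, the compression rate is very low.
As another example, DB2 supports a page-level dictionary compression algorithm and a table-level dictionary compression algorithm. The page-level dictionary and the table-level dictionary are stored in hidden rows of the table. Once a dictionary has been built, it is not updated unless the field table is rebuilt. The compression rate of this algorithm depends strongly on the characteristics of the data. If the initial data has poor, unrepresentative characteristics, the algorithm will not produce good compression results.
Therefore, the embodiments of the present application provide a data page processing method, where data page processing may include data page compression and/or data page decompression. The method not only achieves a relatively high compression rate (or decompression rate) but also keeps the compression time (or decompression time) roughly on par with that of existing compression (or decompression) methods.
The embodiments of the present application do not limit the application scenarios of the data page processing method provided herein.
For example, the data page processing method provided by the embodiments of the present application can be applied to, but is not limited to, online production environments (such as on-line transaction processing, OLTP), compressed backup storage of database files, and primary/standby logistics file replication.
Therefore, the embodiments of the present application provide a data page compression method that not only achieves a relatively high compression rate but also keeps the compression time roughly on par with that of existing compression methods.
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.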
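Figure 1 is a schematic flowchart of an example of a data page compression method 200 provided by an embodiment of the present application.
For example, as shown in Figure 1, the method 200 includes S210 and S220, and S220 is performed after S210. S210 and S220 are described in detail below.
S210: Obtain a second data page according to a first data page.
The first data page includes first data stored in row-based format and a first set of offsets, where the first set of offsets indicates the offset of each row of the first data.
The embodiments of the present application do not limit the form of the offset. For example, an offset may be a position relative to the header of the data page (for example, a number of bytes).
Figure 2 is a schematic diagram of an example of a data page provided by an embodiment of the present application. For example, as shown in Figure 2, the data page may include a row data part and a directory part. The row data part stores data; the directory part stores the offset of each row of the data stored in the row data part.
Optionally, in some embodiments, as shown in Figure 2, the data page may further include a header, a free part, and/or a trailer. The header and/or trailer store information related to the data page. For example, such information may include, but is not limited to: the number of the data page, the type of the data page, the number of rows of the data stored in the row data part, the starting point and end point of the data stored in the row data part, the number of offsets stored in the directory part, and the starting point and end point of the offsets stored in the directory part. The free part is mainly used for expansion of the row data part and/or the trailer.
For example, the first data page may include a first header, a first row data part, and a first directory part. The first header stores information related to the first data page; the first row data part stores the first data, and the first directory part stores the first set of offsets.
Table 1 is an example of the first data. As shown in Table 1, the first data includes three rows: the first row occupies 3 bytes, holding a, b, c in order; the second row occupies 4 bytes, holding a, b, c, d in order; the third row occupies 5 bytes, holding a, b, c, d, e in order.
Table 1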
Figure PCTCN2022137287-appb-000001
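Table 2 is another example of the first data. As shown in Table 2, the first data includes three rows: the first row occupies 5 bytes, holding a, b, c, d, e in order; the second row occupies 3 bytes, holding a, b, c in order; the third row occupies 4 bytes, holding a, b, c, d in order.
Table 2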
Figure PCTCN2022137287-appb-000002
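The embodiments of the present application do not limit the order of the offsets in the first set of offsets.
In one example, the offsets in the first set of offsets are arranged in ascending order.
Table 3 is an example of the first set of offsets. The offsets in Table 3 are arranged in ascending order and correspond to the first data shown in Table 1.
For example, the first data shown in Table 1 includes three rows, so the first set of offsets shown in Table 3 includes three offsets: the first offset (the offset of the first row of the first data) occupies 2 bytes, holding 0x00 and 0x01 in order; the second offset (the offset of the second row of the first data) occupies 2 bytes, holding 0x00 and 0x04 in order; the third offset (the offset of the third row of the first data) occupies 2 bytes, holding 0x00 and 0x08 in order.
Table 3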
Figure PCTCN2022137287-appb-000003
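In another example, the offsets in the first set of offsets are arranged out of order.
Table 4 is another example of the first set of offsets. The offsets in Table 4 are arranged out of order and correspond to the first data shown in Table 2.
For example, the first data shown in Table 2 includes three rows, so the first set of offsets shown in Table 4 includes three offsets: the first offset (the offset of the second row of the first data) occupies 2 bytes, holding 0x00 and 0x06 in order; the second offset (the offset of the third row of the first data) occupies 2 bytes, holding 0x00 and 0x09 in order; the third offset (the offset of the first row of the first data) occupies 2 bytes, holding 0x00 and 0x01 in order.
Table 4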
Figure PCTCN2022137287-appb-000004
需要说明的是,本申请实施例中都是以偏移量用16进制表示为例进行描述,其不应对本申请实施例构成限制。
第二数据页包括基于行存储方式的第二数据和第二组偏移量,第二组偏移量用于指示第二数据的每行数据的偏移量。
其中,第二数据是对第一数据进行预处理后得到的数据,第二组偏移量是对第一组偏移量进行预处理后得到的组偏移量。也就是说,在S210中,对第一数据页包括的各部分数据进行预处理,得到第二数据页。
在一个示例中,预处理仅包括基于字节级的行列转换。
在另一个示例中,预处理不仅包括基于字节级的行列转换,预处理还包括基于字节级的差分处理。其中,差分处理包括列数据之间进行差分。
需要说明的是,基于字节级的行列转换可以理解为以字节为单位进行的行列转换。基于字节级的差分处理可以理解为以字节为单位进行差分。
在又一个示例中,预处理仅包括基于字节级的差分处理。
为了方便描述,将预处理仅包括基于字节级的行列转换的情况记为情况1,将预处理不仅包括基于字节级的行列转换,预处理还包括基于字节级的差分处理的情况记为情况2。将预处理仅包括基于字节级的差分处理的情况记为情况3。
下面,结合图3至图13,分别以预处理为情况1、情况2和情况3为例,对S210进行详细描述。其中,图3为预处理为情况1时本申请实施例提供的一例得到第二数据页的 示意性流程图。图4为预处理为情况1时本申请实施例提供的一例得到第二数据的示意性流程图。图5为预处理为情况2时本申请实施例提供的一例得到第二数据的示意性流程图。图6为预处理为情况2时本申请实施例提供的一例得到第二组偏移量的示意性流程图。
情况1,预处理仅包括基于字节级的行列转换
在情况1中,如图3所示,S210具体包括S211至S214。
S211,从第一数据页中分别获取第一数据和第一组偏移量。
具体地,可以先从第一头部分别获取第一数据的起始点(rows_begin)和结束点(rows_end)以及第一数据对应的偏移量的起始点(dirs_begin)和结束点(dirs_end)。然后,根据第一数据的起始点(rows_begin)和结束点(rows_end)从第一数据页的第一行数据部获取第一数据,以及根据第一数据对应的偏移量的起始点(dirs_begin)和结束点(dirs_end)从第一数据页的第一目录部获取第一组偏移量。
本申请实施例对起始点和/或结束点的形式不作限定。示例性地,起始点和/或结束点可以是相对于数据页的头部的位置(例如字节数)。
本申请实施例对获取第一数据和第一组偏移量的执行顺序不作限定。例如,可以先获取第一数据,后获取第一组偏移量;也可以先获取第一组偏移量,后获取第一数据;也可以同时获取第一数据和第一组偏移量。
S212,按照字节对第一数据进行预处理得到第二数据。
例如,如图4所示,该S212包括S2121至S2125。下面详细介绍S2121至S2125。
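S2121: obtain the number of offsets contained in the first group of offsets (total_dir_cnt), denoted M, from the start point (dirs_begin) and end point (dirs_end) of the offsets corresponding to the first data and from the unit offset length of the first group of offsets. The start point (dirs_begin) and end point (dirs_end) of the offsets may be obtained from the first header.
Specifically, M satisfies the following formula:
$$M = \frac{\mathrm{dirs\_end} - \mathrm{dirs\_begin} + 1}{\text{unit offset length}}$$
where the unit offset length is that of the first group of offsets.
The embodiments of the present application do not limit the unit offset length of the first group of offsets. The description below takes a unit offset length of 2 bytes as an example.
For example, if the start point (dirs_begin) of the offsets of the first data is byte a and the end point (dirs_end) is byte a+5, then M = ((a+5) - a + 1) ÷ 2 = 3, that is, the first group of offsets corresponding to the first data contains 3 offsets.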
需要说明的是,第一组偏移量包括的偏移量的数量M即可以理解为是第一数据的行数。
S2122,从M个偏移量中去除无效的偏移量,得到N个偏移量,N小于或等于M,N和M均为正整数。
在一个示例中,可以根据指示无效的偏移量的信息即可实现S2122。
若指示无效的偏移量的信息中指示有(M-N)个偏移量是无效的,那么此时需要从S2121得到的偏移量的数量M中去除该(M-N)个偏移量,得到有效的N个偏移量。
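The embodiments of the present application do not limit where the information indicating invalid offsets is stored. For example, this information may be stored in the directory part, the header, or the trailer.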
在另一个示例中,也可以根据第一组偏移量中每个偏移量,以及S211中获取的第一数据的起始点(rows_begin)和结束点(rows_end),即可实现S2122。
具体的,若M个偏移量中有N个偏移量在第一数据的起始点(rows_begin)和结束点(rows_end)之间,(M-N)个偏移量不在第一数据的起始点(rows_begin)和结束点(rows_end)之间,即M个偏移量中有N个偏移量是有效的,(M-N)个偏移量是无效的。
在该示例下,需要从S2121得到的偏移量的数量M中去除不在第一数据的起始点(rows_begin)和结束点(rows_end)之间的(M-N)个偏移量,得到有效的N个偏移量。
S2123,将N个偏移量按照从小到大的顺序进行排列,得到排序后的N个偏移量。
本申请实施例对N个偏移量在目录部中的排列顺序不作限定。
在一个示例中,该N个偏移量在目录部中按照从小到大或者从大到小的顺序进行排列。在另一个示例中,该N个偏移量在目录部中可以是乱序排列。
需要说明的是,若N个偏移量已经是按照从小到大的顺序排列,此时,可以无需再执行对该N个偏移量按照从小到大的顺序进行排列的步骤,即可得到排序后的N个偏移量。
例如,如表3所示的3个偏移量按照从小到大的顺序进行排列,得到的排序后的3个偏移量依次为0x01、0x04、0x08。
又例如,如表4所示的3个偏移量按照从小到大的顺序进行排列,得到的排序后的3个偏移量依次为0x01、0x06、0x09。
本申请实施例对将N个偏移量进行排列的方式不作限定。例如,可以通过插入排序的方式对N个偏移量进行排序。
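Steps S2121 to S2123 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the application: the function names are hypothetical, offsets are assumed to be stored as 2-byte big-endian values (inferred from byte pairs such as 0x00, 0x06 in Tables 3 and 4), and validity is decided by the range check described for S2122.

```python
def read_offsets(directory: bytes, unit_len: int = 2) -> list[int]:
    """S2121: split the directory part into M offsets of unit_len bytes each."""
    if len(directory) % unit_len:
        raise ValueError("directory length is not a multiple of the unit offset length")
    # Big-endian is an assumption read off Tables 3 and 4 (0x00, 0x06 -> 0x0006).
    return [int.from_bytes(directory[i:i + unit_len], "big")
            for i in range(0, len(directory), unit_len)]


def valid_sorted_offsets(offsets: list[int], rows_begin: int, rows_end: int) -> list[int]:
    """S2122 + S2123: drop offsets outside [rows_begin, rows_end], then sort ascending."""
    return sorted(o for o in offsets if rows_begin <= o <= rows_end)


# Directory bytes of Table 4; the row data is assumed to span bytes 0x01..0x0C.
table4 = read_offsets(bytes([0x00, 0x06, 0x00, 0x09, 0x00, 0x01]))
print(table4)                               # [6, 9, 1]
print(valid_sorted_offsets(table4, 1, 12))  # [1, 6, 9]
```

The source mentions insertion sort as one option; the built-in sort is used here purely for brevity.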
S2124,根据排序后的N个偏移量,将第一行数据部划分为N个区域,并得到第一数据的每行数据的长度。
其中,N个区域中的第n个区域中的数据的个数为第一数据的第n行数据的长度,N个区域中的第n个区域中的数据为第一数据的第n行的数据。
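Specifically, the data from the byte addressed by the n-th offset up to the byte immediately before the one addressed by the (n+1)-th offset in the first row data part forms the n-th of the N regions, that is, the n-th region holds the n-th row of the first data, where n runs from 1 to (N-1). The data from the byte addressed by the N-th offset up to the end point of the first data (rows_end) forms the N-th region, that is, the N-th region holds the N-th row of the first data. In this way, the first row data part is divided into N regions.
Specifically, the difference between the (n+1)-th offset and the n-th offset is taken as the length of the data in the n-th region, where n runs from 1 to (N-1); the difference between the end point of the first data (rows_end) and the N-th offset, plus 1, is taken as the length of the data in the N-th region. In this way, the length of each row of the first data is obtained.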
例如,如表3所示的3个偏移量排序后依次为0x01、0x04、0x08,首先,将第一行数据部中第1个偏移量(0x01)对应的数据(表1所示的第一行的a)至第2个偏移量(0x04)对应的数据(表1所示的第二行的a)的前一个数据(表1所示的第一行的c)作为第1个区域中的数据。此时,如表1所示的第一数据的第1个区域中的数据即为表1中所示的第一行的数据(abc)。并将第2个偏移量(0x04)和第1个偏移量(0x01)的差值作为第1 个区域中的数据的长度,即该第1个区域中的数据的长度为(0x04-0x01)=0x03(字节)。
其次,将第一行数据部中第2个偏移量(0x04)对应的数据(表1所示的第二行的a)至第3个偏移量(0x08)对应的数据(表1所示的第三行的a)的前一个数据(表1所示的第二行的d)作为第2个区域中的数据。此时,如表1所示的第一数据的第2个区域中的数据即为表1中所示的第二行的数据(abcd)。并将第3个偏移量(0x08)和第2个偏移量(0x04)的差值作为第2个区域中的数据的长度,即该第2个区域中的数据的长度为(0x08-0x04)=0x04(字节)。
最后,将第一行数据部中第3个偏移量(0x08)对应的数据(表1所示的第三行的a)以及第一行数据部中第3个偏移量(0x08)至第一数据的偏移量的结束点(例如为0x0C)的数据作为第3个区域中的数据(表1所示的第三行的b至e)。此时,如表1所示的第一数据的第3个区域中的数据即为表1中所示的第三行的数据(abcde)。并将第一数据的偏移量的结束点和第3个偏移量(0x08)的差值加1作为第3个区域中的数据的长度,即该第3个区域中的数据的长度为(0x0C-0x08+0x01)=0x05(字节)。
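As another example, the 3 offsets shown in Table 4 are, after sorting, 0x01, 0x06 and 0x09. First, the data from the byte addressed by the 1st offset (0x01) (the a of the first row in Table 2) up to the byte before the one addressed by the 2nd offset (0x06) (the a of the second row in Table 2), namely the e of the first row in Table 2, forms region 1. Region 1 of the first data shown in Table 2 is therefore the first row in Table 2 (abcde). The difference between the 2nd offset (0x06) and the 1st offset (0x01) is taken as the length of the data in region 1, that is, (0x06-0x01)=0x05 (bytes).
Next, the data from the byte addressed by the 2nd offset (0x06) (the a of the second row in Table 2) up to the byte before the one addressed by the 3rd offset (0x09) (the a of the third row in Table 2), namely the c of the second row in Table 2, forms region 2. Region 2 of the first data shown in Table 2 is therefore the second row in Table 2 (abc). The difference between the 3rd offset (0x09) and the 2nd offset (0x06) is taken as the length of the data in region 2, that is, (0x09-0x06)=0x03 (bytes).
Finally, the byte addressed by the 3rd offset (0x09) (the a of the third row in Table 2), together with the data from the 3rd offset (0x09) to the end point of the first data (for example, 0x0C) (b to d of the third row in Table 2), forms region 3. Region 3 of the first data shown in Table 2 is therefore the third row in Table 2 (abcd). The difference between the end point of the first data and the 3rd offset (0x09), plus 1, is taken as the length of the data in region 3, that is, (0x0C-0x09+0x01)=0x04 (bytes).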
本申请实施例对将第一行数据部划分为N个区域和得到第一数据的每行数据的长度之间的执行顺序不作限定。例如,可以是先将第一行数据部划分为N个区域,然后得到第一数据的每行数据的长度;或者,可以是先得到第一数据的每行数据的长度,然后将第一行数据部划分为N个区域;或者,可以是同时将第一行数据部划分为N个区域,并得到第一数据的每行数据的长度。
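The region split of S2124 can be sketched as below, assuming (as in the worked examples) that offsets and the end point are byte positions counted on the same scale as rows_begin and that the end point is inclusive. The function name and the dictionary return type are illustrative choices, not part of the application.

```python
def split_regions(row_data: bytes, sorted_offsets: list[int],
                  rows_begin: int, rows_end: int) -> dict[int, bytes]:
    """Map each row's start offset to that row's bytes (S2124)."""
    regions = {}
    for n, start in enumerate(sorted_offsets):
        if n + 1 < len(sorted_offsets):
            # Rows 1..N-1: length is the gap to the next offset.
            length = sorted_offsets[n + 1] - start
        else:
            # Last row: runs to the inclusive end point, hence the +1.
            length = rows_end - start + 1
        lo = start - rows_begin  # position inside row_data
        regions[start] = row_data[lo:lo + length]
    return regions


# Table 1 with the sorted offsets of Table 3 (0x01, 0x04, 0x08), end point 0x0C.
print(split_regions(b"abcabcdabcde", [1, 4, 8], 1, 12))
# Table 2 with the sorted offsets of Table 4 (0x01, 0x06, 0x09), end point 0x0C.
print(split_regions(b"abcdeabcabcd", [1, 6, 9], 1, 12))
```

Keying the regions by offset makes it easy to visit them later either in ascending order or in directory order.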
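Optionally, in some embodiments, before S2124 is performed, it may first be determined whether N is less than or equal to a second threshold. S2124, as well as S2125, S213 and S214, are performed only if N is less than or equal to the second threshold. In this way, byte-level row/column conversion is applied to the first data only when the first data does not have many rows, which avoids wasting resources.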
本申请实施例对第二阈值的具体取值不作限定,其可以根据实际情况进行设置。
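S2125: following the order of the N offsets, take the i-th byte from each of the N regions in turn; the i-th byte taken from the n-th region in that order becomes the data at row i, column n of the second data. Here i runs from 1 to L1, i is a positive integer, L1 is the maximum row length of the first data, and the order of the N offsets is either their ascending order or the order in which they appear in the first group of offsets.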
在一个示例中,若第一数据页中存储的数据是表数据,那么,N个偏移量的排列顺序为N个偏移量按照从小到大的排列顺序。
例如,若第1个区域中的数据为表1中所示的第一行的数据(abc),第2个区域中的数据为表1中所示的第二行的数据(abcd),第3个区域中的数据为表1中所示的第三行的数据(abcde),第一数据的最大行长度L1为5,S2125具体包括:
在i=1时,按照N(此时N=3)个偏移量按照从小到大的排列顺序,即按照第1个区域(第1个偏移量0x01对应的区域)、第2个区域(第2个偏移量0x04对应的区域)和第3个区域(第3个偏移量0x08对应的区域)的顺序,依次从第1个区域取第1个字节对应的数据(a)作为第二数据第1行第1列的数据、从第2个区域取第1个字节对应的数据(a)作为第二数据第1行第2列的数据、从第3个区域取第1个字节对应的数据(a)作为第二数据第1行第3列的数据。也就是说,通过第一次数据的读取,即可从3个区域中取出第二数据第一行的数据即aaa。
在i=2时,按照N(此时N=3)个偏移量按照从小到大的排列顺序,即按照第1个区域(第1个偏移量0x01对应的区域)、第2个区域(第2个偏移量0x04对应的区域)和第3个区域(第3个偏移量0x08对应的区域)的顺序,依次从第1个区域取第2个字节对应的数据(b)作为第二数据第2行第1列的数据、从第2个区域取第2个字节对应的数据(b)作为第二数据第2行第2列的数据、从第3个区域取第2个字节对应的数据(b)作为第二数据第2行第3列的数据。也就是说,通过第二次数据的读取,即可从3个区域中取出第二数据第二行的数据即bbb。
在i=3时,按照N(此时N=3)个偏移量按照从小到大的排列顺序,即按照第1个区域(第1个偏移量0x01对应的区域)、第2个区域(第2个偏移量0x04对应的区域)和第3个区域(第3个偏移量0x08对应的区域)的顺序,依次从第1个区域取第3个字节对应的数据(c)作为第二数据第3行第1列的数据、从第2个区域取第3个字节对应的数据(c)作为第二数据第3行第2列的数据、从第3个区域取第3个字节对应的数据(c)作为第二数据第3行第3列的数据。也就是说,通过第三次数据的读取,即可从3个区域中取出第二数据第三行的数据即ccc。
在i=4时,按照N(此时N=3)个偏移量按照从小到大的排列顺序,即按照第1个区域(第1个偏移量0x01对应的区域)、第2个区域(第2个偏移量0x04对应的区域)和第3个区域(第3个偏移量0x08对应的区域)的顺序,依次从第1个区域取第4个字节对应的数据(没有数据)作为第二数据第4行第1列的数据、从第2个区域取第4个字节对应的数据(d)作为第二数据第4行第2列的数据、从第3个区域取第4个字节对应的数据(d)作为第二数据第4行第3列的数据。也就是说,通过第四次数据的读取,即可从3个区域中取出第二数据第四行的数据即*dd。
在i=5时,按照N(此时N=3)个偏移量按照从小到大的排列顺序,即按照第1个区 域(第1个偏移量0x01对应的区域)、第2个区域(第2个偏移量0x04对应的区域)和第3个区域(第3个偏移量0x08对应的区域)的顺序,依次从第1个区域取第5个字节对应的数据(没有数据)作为第二数据第5行第1列的数据、从第2个区域取第5个字节对应的数据(没有数据)作为第二数据第5行第2列的数据、从第3个区域取第5个字节对应的数据(e)作为第二数据第5行第3列的数据。也就是说,通过第五次数据的读取,即可从3个区域中取出第二数据第五行的数据即**e。
在该示例中,经过五次数据的读取,便可完成为对如表1所示的第一数据进行基于字节级的行列转换,得到如表5所示的第二数据。例如,如表5所示,该第二数据包括五行数据,其中,该五行数据中的每行数据都占用了3个字节,且第一行的3字节上的数据依次为a、a、a;第二行的3字节上的数据依次为b、b、b;第三行的3字节上的数据依次为c、c、c;第四行的3字节上的数据依次为*、d、d;第五行的3字节上的数据依次为*、*、e。
需要说明的是,本申请实施例中所述的*即为该字节上没有数据。
表5
Figure PCTCN2022137287-appb-000006
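In another example, if the data stored in the first data page is index data, the order of the N offsets is the order in which they appear in the first group of offsets.
For example, if region 1 holds the first row in Table 2 (abcde), region 2 holds the second row in Table 2 (abc), region 3 holds the third row in Table 2 (abcd), and the maximum row length L1 of the first data is 5, S2125 specifically includes: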
在i=1时,按照N(此时N=3)个偏移量在第一组偏移量中的排列顺序,即按照第2个区域(第1个偏移量0x06对应的区域)、第3个区域(第2个偏移量0x09对应的区域)和第1个区域(第3个偏移量0x01对应的区域)的顺序,依次从第2个区域取第1个字节对应的数据(a)作为第二数据第1行第1列的数据、从第3个区域取第1个字节对应的数据(a)作为第二数据第1行第2列的数据、从第1个区域取第1个字节对应的数据(a)作为第二数据第1行第3列的数据。也就是说,通过第一次数据的读取,即可从3个区域中取出第二数据第一行的数据即aaa。
在i=2时,按照N(此时N=3)个偏移量在第一组偏移量中的排列顺序,即按照第2个区域(第1个偏移量0x06对应的区域)、第3个区域(第2个偏移量0x09对应的区域)和第1个区域(第3个偏移量0x01对应的区域)的顺序,依次从第2个区域取第2个字节对应的数据(b)作为第二数据第2行第1列的数据、从第3个区域取第2个字节对应的数据(b)作为第二数据第2行第2列的数据、从第1个区域取第2个字节对应的数据(b)作为第二数据第2行第3列的数据。也就是说,通过第二次数据的读取,即可 从3个区域中取出第二数据第二行的数据即bbb。
在i=3时,按照N(此时N=3)个偏移量在第一组偏移量中的排列顺序,即按照第2个区域(第1个偏移量0x06对应的区域)、第3个区域(第2个偏移量0x09对应的区域)和第1个区域(第3个偏移量0x01对应的区域)的顺序,依次从第2个区域取第3个字节对应的数据(c)作为第二数据第3行第1列的数据、从第3个区域取第3个字节对应的数据(c)作为第二数据第3行第2列的数据、从第1个区域取第3个字节对应的数据(c)作为第二数据第3行第3列的数据。也就是说,通过第三次数据的读取,即可从3个区域中取出第二数据第三行的数据即ccc。
在i=4时,按照N(此时N=3)个偏移量在第一组偏移量中的排列顺序,即按照第2个区域(第1个偏移量0x06对应的区域)、第3个区域(第2个偏移量0x09对应的区域)和第1个区域(第3个偏移量0x01对应的区域)的顺序,依次从第2个区域取第4个字节对应的数据(没有数据)作为第二数据第4行第1列的数据、从第3个区域取第4个字节对应的数据(d)作为第二数据第4行第2列的数据、从第1个区域取第4个字节对应的数据(d)作为第二数据第4行第3列的数据。也就是说,通过第四次数据的读取,即可从3个区域中取出第二数据第四行的数据即*dd。
在i=5时,按照N(此时N=3)个偏移量在第一组偏移量中的排列顺序,即按照第2个区域(第1个偏移量0x06对应的区域)、第3个区域(第2个偏移量0x09对应的区域)和第1个区域(第3个偏移量0x01对应的区域)的顺序,依次从第2个区域取第5个字节对应的数据(没有数据)作为第二数据第5行第1列的数据、从第3个区域取第5个字节对应的数据(没有数据)作为第二数据第5行第2列的数据、从第1个区域取第5个字节对应的数据(e)作为第二数据第5行第3列的数据。也就是说,通过第五次数据的读取,即可从3个区域中取出第二数据第五行的数据即**e。
在该示例中,经过五次数据的读取,便可完成为对如表2所示的第一数据进行基于字节级的行列转换,得到如表5所示的第二数据。关于表5的描述可以参见上文的相关描述,这里不再赘述。
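The five read passes walked through above amount to a byte-level transposition of variable-length rows. A minimal sketch follows; the function name is hypothetical, and it assumes that positions marked * in Table 5 are simply not emitted, which is how the reverse procedure S3234 reads them back.

```python
def transpose_rows(rows: list[bytes]) -> bytes:
    """Emit byte i of every row (in the given order) for i = 1..L1 (S2125)."""
    max_len = max((len(r) for r in rows), default=0)  # L1
    out = bytearray()
    for i in range(max_len):
        for row in rows:
            if i < len(row):  # rows that are too short contribute nothing ("*")
                out.append(row[i])
    return bytes(out)


# Table data: regions visited in ascending offset order (0x01, 0x04, 0x08).
print(transpose_rows([b"abc", b"abcd", b"abcde"]))  # b'aaabbbcccdde'

# Index data: regions visited in directory order of Table 4 (0x06, 0x09, 0x01),
# i.e. rows abc, abcd, abcde of Table 2, which yields the same Table 5 content.
print(transpose_rows([b"abc", b"abcd", b"abcde"]))  # b'aaabbbcccdde'
```

The caller decides the row order, so the same routine covers both the table-data and index-data examples.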
可选地,在一些实施例中,在执行S2125之前,可以先确定第一数据的每行数据的长度之间的差异是否小于或等于第一阈值,即S2126。并在第一数据的每行数据的长度之间的差异小于或等于第一阈值的情况下,才执行S2125,以及S213和S214。这样,只有在第一数据的每行数据的长度之间的差异不大的情况下,才去对第一数据进行基于字节级的行列转换,进而可以避免资源的浪费。
本申请实施例对第一阈值的具体取值不作限定,其可以根据实际情况进行设置。
S213,按照字节对第一组偏移量进行预处理得到第二组偏移量。
具体的,按照字节对第一组偏移量中的N个偏移量进行基于字节级的行列转换得到第二组偏移量。
需要说明的是,按照字节对第一组偏移量中的N个偏移量进行基于字节级的行列转换得到第二组偏移量可以理解为:按照字节对该N个偏移量在第一组偏移量中的排列顺序进行基于字节级的行列转换得到第二组偏移量。
例如,表6为对如表3所示的第一组偏移量中的N个偏移量进行基于字节级的行列转换得到的第二组偏移量的一个示例。例如,如表6所示,第二组偏移量包括两行数据,其中,第一行占用3个字节,且该3个字节上的数据依次为0x00、0x00、0x00;第二行占 用3个字节,且该3个字节上的数据依次为0x01、0x04、0x08。
表6
Figure PCTCN2022137287-appb-000007
例如,表7为对如表4所示的第一组偏移量中的N个偏移量进行基于字节级的行列转换得到的第二组偏移量的一个示例。例如,如表7所示,第二组偏移量包括两行数据,其中,第一行占用3个字节,且该3个字节上的数据依次为0x00、0x00、0x00;第二行占用3个字节,且该3个字节上的数据依次为0x06、0x09、0x01。
表7
Figure PCTCN2022137287-appb-000008
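For the offsets, the same conversion works on fixed-width entries: all first bytes are emitted, then all second bytes. The sketch below assumes 2-byte offsets as in the running example; the function name is illustrative.

```python
def transpose_offsets(directory: bytes, unit_len: int = 2) -> bytes:
    """Byte-level row/column conversion of fixed-width offsets (S213)."""
    if len(directory) % unit_len:
        raise ValueError("directory length is not a multiple of the unit offset length")
    count = len(directory) // unit_len
    # Row i of the result holds byte i of every offset, in directory order.
    return bytes(directory[k * unit_len + i]
                 for i in range(unit_len)
                 for k in range(count))


# Table 3 -> Table 6
print(list(transpose_offsets(bytes([0x00, 0x01, 0x00, 0x04, 0x00, 0x08]))))  # [0, 0, 0, 1, 4, 8]
# Table 4 -> Table 7
print(list(transpose_offsets(bytes([0x00, 0x06, 0x00, 0x09, 0x00, 0x01]))))  # [0, 0, 0, 6, 9, 1]
```

Grouping the high bytes together produces a run of identical values, which is the kind of repetition a general-purpose compressor can exploit.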
S214,根据第二数据和第二组偏移量,得到第二数据页。
在一个示例中,可以新创建一个数据页,将该第二数据和第二组偏移量存储至新重建的数据页中,以形成第二数据页。
例如,可以新创建一个数据页,该数据页包括第二行数据部和第二目录部,将第二数据存储至第二行数据部,并将第二组偏移量存储至第二目录部。这样,该新创建的数据页即为第二数据页。
在另一个示例中,可以在原有的第一数据页的基础上,得到第二数据页。
例如,将第一行数据部中存储的第一数据更新为第二数据,并将第一目录部中存储的第一组偏移量更新为第二组偏移量,得到第二数据页。
情况2,预处理包括基于字节级的行列转换和基于字节级的差分处理
在情况2中,S210具体包括S211至S214。情况2中S211和S214的具体过程,与情况1中S211和S214的具体过程是相同的,这里不再赘述。情况2中S212和S213的具体过程,与情况1中S212和S213的具体过程是不同的,下面详细介绍情况2中S212和S213的具体过程。
在该情况2中,如图5所示,该S212具体包括S212A和S212B。下面详细介绍S212A和S212B。
S212A,按照字节对第一数据进行行列转换得到第三数据。
具体的,S212A包括S2121至S2124、以及S2125A。其中,关于S2121至S2124的描述可以参见上文的相关描述,这里不再赘述。这里着重介绍S2125A。
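S2125A: following the order of the N offsets, take the i-th byte from each of the N regions in turn; the i-th byte taken from the n-th region in that order becomes the data at row i, column n of the third data. Here i runs from 1 to L1, i is a positive integer, L1 is the maximum row length of the first data, and the order of the N offsets is either their ascending order or the order in which they appear in the first group of offsets.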
根据上文对该S2125A的描述,可以看出该S2125A和上文所述的S2125的过程类似, 两者的区别仅在于:S2125A得到的是第三数据,S2125得到的是第二数据。故关于该S2125A的详细描述可以参考上文S2125的相关描述,这里不再赘述。
S212B,将第三数据的第a1行上的相邻列的数据按照字节进行差分,得到第二数据,1≤a1≤a2,a1和a2均为正整数,a2等于第一数据的最大行长度或a2等于第一数据的最小行长度。
例如,若根据S212A得到的第三数据如表5所示,表8为将表5所示的第三数据的第a1行上的相邻列的数据按照字节进行差分得到的第二数据的一个示例。其中,表8以a2等于第一数据(表1或表2所示)的最小行长度(3个字节)为例。
表8
Figure PCTCN2022137287-appb-000009
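The differencing of S212B can be sketched as follows. Because the content of Table 8 is not reproduced in the text, the exact output convention here is an assumption: the first column of each row is kept as is, every other column stores the difference from its left neighbour, and differences wrap modulo 256. The function name and parameters are hypothetical.

```python
def delta_rows(packed: bytes, n_cols: int, n_rows: int) -> bytes:
    """Difference adjacent columns on the first n_rows full rows (n_rows plays the role of a2)."""
    if n_rows * n_cols > len(packed):
        raise ValueError("not enough data for n_rows full rows")
    out = bytearray(packed)
    for r in range(n_rows):
        base = r * n_cols
        for c in range(1, n_cols):
            # Always subtract the original left neighbour, not the already-differenced one.
            out[base + c] = (packed[base + c] - packed[base + c - 1]) % 256
    return bytes(out)


# Third data of Table 5 in packed form, N = 3 columns, a2 = minimum row length = 3.
print(delta_rows(b"aaabbbcccdde", 3, 3))  # b'a\x00\x00b\x00\x00c\x00\x00dde'
```

With a2 equal to the minimum row length, only rows in which every column is present are differenced, so the trailing partial rows (dd, e) pass through unchanged.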
可选地,在一些实施例中,在执行S2125A之前,可以先确定第一数据的每行数据的长度之间的差异是否小于或等于第一阈值,即S2126A。并在第一数据的每行数据的长度之间的差异小于或等于第一阈值的情况下,才执行S2125A,以及S213和S214。这样,只有在第一数据的每行数据的长度之间的差异不大的情况下,才去对第一数据进行基于字节级的行列转换,进而可以避免资源的浪费。
在该情况2中,如图6所示,S213具体包括S213A和S213B。下面详细介绍S213A和S213B。
S213A,按照字节对第一组偏移量进行行列转换得到第三组偏移量。
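As can be seen from the description of S213A above, S213A is similar to S213 described earlier; the only difference is that S213A yields the third group of offsets, whereas S213 yields the second group of offsets. For details of S213A, refer to the description of S213 above, which is not repeated here.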
S213B,将第三组偏移量的第b1行上的相邻列的数据按照字节进行差分,得到第二组偏移量,1≤b1≤b2,b1和b2均为正整数,b2等于第一组偏移量的最大行长度或b2等于第一组偏移量的最小行长度。
例如,若根据S213A得到的第三组偏移量如表6所示,表9为将表6所示的第三组偏移量的第b1行上的相邻列的数据按照字节进行差分得到的第二组偏移量的一个示例。其中,以b2等于第一组偏移量(表3所示)的最小行长度(2个字节)为例。
表9
Figure PCTCN2022137287-appb-000010
例如,若根据S213A得到的第三组偏移量如表7所示,表10为将表7所示的第三组 偏移量的第b1行上的相邻列的数据按照字节进行差分得到的第二组偏移量的一个示例。其中,以b2等于第一组偏移量(表4所示)的最小行长度(2个字节)为例。
表10
Figure PCTCN2022137287-appb-000011
可选地,在一些实施例中,第二数据页包括用于指示第二数据页进行过预处理的信息。例如,在预处理仅包括基于字节级的行列转换的情况下,该信息可以指示第二数据页进行过基于字节级的行列转换处理。又例如,在预处理仅包括基于字节级的行列转换和基于字节级的差分处理的情况下,该信息不仅可以指示第二数据页进行过基于字节级的行列转换处理和基于字节级的差分处理,还可以指示基于字节级的行列转换处理和基于字节级的差分处理的先后顺序。
本申请实施例对用于指示第二数据页进行过预处理的信息在第二数据页的存储位置不作限定。例如,用于指示第二数据页进行过预处理的信息可以是存储在第二数据页的头部或尾部。
情况3,预处理仅包括基于字节级的差分处理
在该情况3中,S210具体包括S211至S214。情况3中S211和S214的具体过程,与情况2中S211和S214的具体过程是相同的,这里不再赘述。情况3中S212和S213的具体过程,与情况2中S212和S213的具体过程是不同的,下面详细介绍情况3中S212和S213的具体过程。
在该情况3中,S212具体包括:将第一数据的第e1行上的相邻列的数据按照字节进行差分,得到第二数据,1≤e1≤a2,e1为正整数。其中,a2可以参考上文的相关描述。
在该情况3中,S212的具体过程和情况2中S212B的过程类似,关于情况3中S212的具体过程可以参考情况2中S212B的相应的描述,这里不再详细描述。
在该情况3中,S213具体包括:将第一组偏移量的第f1行上的相邻列的数据按照字节进行差分,得到第二组偏移量,1≤f1≤b2,f1为正整数,其中b2可以参考上文的相关描述。
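In Case 3, the specific process of S213 is similar to that of S213B in Case 2; for the specific process of S213 in Case 3, refer to the corresponding description of S213B in Case 2, which is not repeated here.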
可选地,在一些实施例中,S210中所述的第一数据页可以是将连续的且结构相同的多个第三数据页进行重组后得到的第一数据页。也就是说,在S210之前,所述方法200还包括:
S230,将连续的且结构相同的多个第三数据页进行重组,得到所述第一数据页。
其中,第三数据页包括基于行存储方式的第四数据和第四组偏移量,第四组偏移量用于指示第四数据的每行数据的偏移量,第一数据包括多个第三数据页对应的多个第四数据,且多个第四数据的最大行长度相同,第一组偏移量包括多个第三数据页对应的多个第四组偏移量。
需要说明的是,数据页结构相同可以理解为数据页的组成部分是一致的。
具体的,S230包括:S231,分别获取与多个第三数据页对应的多个第四数据和多个第四组偏移量。S232,分别将多个第四数据按照目标顺序进行排列,得到第一数据;以及,分别将多个第四组偏移量按照目标顺序进行排列,得到第一组偏移量,目标顺序为多个第三数据页的排列顺序。S233,将第一数据和第一组偏移量分别存储至第一数据页。换句话说,按照多个第三数据页的排列顺序,依次将获取的多个第三数据页中的每个第三数据页的行数据部存储的第四数据存放在一起即可得到第一数据,以及依次将获取的多个第三数据页中的每个第三数据页的目录部存储的第四组偏移量存放在一起即可得到第一组偏移量,这样,第一数据页就是汇聚了多个第三数据页的数据的数据页,此时可认为第一数据页是一个巨型数据页。
本申请实施对上文S232中所述的得到第一数据和得到第一组偏移量的步骤的执行顺序不作限定,例如,可以先得到第一数据后得到第一组偏移量,或者,可以先得到第一组偏移量后得到第一数据,或者,可以同时得到第一数据和第一组偏移量。
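A possible shape for the regrouping in S231 to S233 is sketched below. The `Page` and `MergedPage` types and the recorded spans are hypothetical: the application only states that the fourth data and the fourth groups of offsets are concatenated in page order, so the bookkeeping needed to split the pages again later is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    rows: bytes        # row data part (fourth data)
    directory: bytes   # directory part (fourth group of offsets)


@dataclass
class MergedPage:
    rows: bytes = b""
    directory: bytes = b""
    row_spans: list = field(default_factory=list)  # half-open (start, end) per source page
    dir_spans: list = field(default_factory=list)


def merge_pages(pages: list[Page]) -> MergedPage:
    """Concatenate row data and directories in page order (the 'target order')."""
    rows, directory = bytearray(), bytearray()
    merged = MergedPage()
    for page in pages:
        merged.row_spans.append((len(rows), len(rows) + len(page.rows)))
        merged.dir_spans.append((len(directory), len(directory) + len(page.directory)))
        rows += page.rows
        directory += page.directory
    merged.rows, merged.directory = bytes(rows), bytes(directory)
    return merged


m = merge_pages([Page(b"abc", b"\x00\x01"), Page(b"de", b"\x00\x01")])
print(m.rows, m.row_spans)  # b'abcde' [(0, 3), (3, 5)]
```

The directory bytes are copied unchanged here; whether offsets are rebased during regrouping is not specified in the source.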
可选地,在一些实施例中,若数据页除了包括行数据部和目录部外,数据页还包括:头部和/或尾部,在执行S230的过程中,还需要执行以下步骤:首先,分别获取与多个第三数据页对应的多个头部和/或尾部中存储的数据。其次,分别将多个头部和/或尾部中存储的数据按照目标顺序进行排列,得到第一头部和/或第一尾部中存储的数据。最后,将第一头部和/或第一尾部中存储的数据分别存储至第一数据页的头部和/或尾部。换句话说,按照多个第三数据页的排列顺序,依次将获取的多个第三数据页中的每个第三数据页的头部和/或尾部中存储的数据存放在一起即可得到第一头部和/或第一尾部中存储的数据,这样就形成了一个巨型数据页即第一数据页。
可选地,在一些实施例中,该多个第三数据页是可以进行基于字节的行列转换的。
示例性地,可以通过以下两个条件判断每个第三数据页是否可以进行基于字节的行列转换。条件1:每个第三数据页中存储的第四组偏移量中有效的偏移量的个数是否小于或等于第五阈值;条件2:每个第三数据页中存储的第四数据的每行数据的长度之间的差异是否小于或等于第六阈值。
本申请实施例对第五阈值的具体取值不作限定,其可以根据实际情况进行设置。
本申请实施例对第五阈值分别与第四阈值和第二阈值的关系不作限定。例如,第五阈值、第四阈值和第二阈值可以均相等。
本申请实施例对第六阈值的具体取值不作限定,其可以根据实际情况进行设置。
本申请实施例对第六阈值分别与第三阈值和第一阈值的关系不作限定。例如,第六阈值、第三阈值和第一阈值可以均相等。
图7为本申请实施例提供的一例多个数据页重组的示意图。
例如,如图7中上方图所示,7个数据页包括数据页10至数据页70。其中,数据页10可以被行列转换,数据页10中行数据部所存储的数据的最大行长度为40。数据页20和数据页30都可以被行列转换,且数据页20中行数据部所存储的数据的最大行长度和数据页30中行数据部所存储的数据的最大行长度都为50。数据页40不可以被行列转换,数据页40中行数据部所存储的数据的最大行长度为50。数据页50可以被行列转换,数据页50中行数据部所存储的数据的最大行长度为50。数据页60和数据页70都可以被行列转换,且数据页60中行数据部所存储的数据的最大行长度和数据页70中行数据部所存储的数据的最大行长度都为60。
根据如图7中上方图所示的数据页10至数据页70可知,数据页20和数据页30可以重组成一个数据页,数据页60和数据页70可以重组成一个数据页,数据页10、数据页40、数据页50均不能重组。
进一步,将如图7中上方图所示的数据页10至数据页70进行重组可得到如图7中下方图所示的5个数据页,该5个数据页包括数据页10、数据页20-30、数据页40、数据页50、和数据页60-70。其中,数据页20-30是数据页20和数据页30重组后得到的数据页,数据页60-70是数据页60和数据页70重组后得到的数据页。
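The grouping decision illustrated by Fig. 7 can be expressed as a single pass over consecutive pages. In this sketch the `PageInfo` type and its fields are hypothetical, and all pages are assumed to have the same structure; two neighbours are merged only if both can be row/column converted and they share the same maximum row length.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    page_id: int
    convertible: bool   # whether byte-level row/column conversion is allowed
    max_row_len: int    # maximum row length of the row data part


def group_pages(pages: list[PageInfo]) -> list[list[PageInfo]]:
    """Group runs of consecutive, convertible pages with equal max_row_len."""
    groups: list[list[PageInfo]] = []
    for page in pages:
        prev = groups[-1][-1] if groups else None
        if (prev is not None and page.convertible and prev.convertible
                and prev.max_row_len == page.max_row_len):
            groups[-1].append(page)
        else:
            groups.append([page])
    return groups


# The seven pages described for the upper part of Fig. 7.
FIG7_PAGES = [
    PageInfo(10, True, 40), PageInfo(20, True, 50), PageInfo(30, True, 50),
    PageInfo(40, False, 50), PageInfo(50, True, 50),
    PageInfo(60, True, 60), PageInfo(70, True, 60),
]
print([[p.page_id for p in g] for g in group_pages(FIG7_PAGES)])
# [[10], [20, 30], [40], [50], [60, 70]]
```

Page 50 stays alone even though it matches pages 20 and 30 in row length, because the non-convertible page 40 breaks the run of consecutive pages.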
可选地,在一些实施例中,第一数据页包括用于指示第一数据页进行过重组的信息。
本申请实施例对用于指示第一数据页进行过重组的信息在第一数据页的存储位置不作限定。例如,用于指示第一数据页进行过重组的信息可以是存储在第一数据页的头部或尾部。
S220,对第二数据页进行压缩,得到压缩后的数据页。
本申请实施例对第二数据页进行压缩所使用的压缩算法不作限定。
示例性地,可以使用通用压缩算法(如,zlib、lz4、zstd等)对第二数据页进行压缩,得到压缩后的数据页。
需要说明的是,用户在采用上文所述的数据页压缩的方法200对数据页压缩前,可以自己先设置压缩参数,进而通过用户设置的压缩参数,并基于采用上文所述的数据页压缩的方法200对数据页压缩来完成数据页的压缩。
本申请实施例对压缩参数具体包括的内容不作限定。例如,压缩参数可以包括以下至少一项:一次压缩的数据页的页数、预处理的方式、S220中涉及的压缩算法的类型。其中,一次压缩的数据页的页数的最小值为1。预处理的方式包括行列转换和/或差分处理。S220中涉及的压缩算法的类型可以包括zlib、lz4、zstd等。
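As a rough end-to-end illustration of "preprocess, then compress" (S210 followed by S220), the sketch below transposes the rows and hands the result to zlib, one of the general-purpose algorithms named above. Python's standard `zlib` module is used only because it needs no extra packages; the function names and the `level` parameter are illustrative, and no claim about the resulting compression ratio is made.

```python
import zlib


def transpose_rows(rows: list[bytes]) -> bytes:
    """Byte-level row/column conversion of variable-length rows."""
    max_len = max((len(r) for r in rows), default=0)
    return bytes(row[i] for i in range(max_len) for row in rows if i < len(row))


def compress_rows(rows: list[bytes], level: int = 9) -> bytes:
    """Preprocess (transpose) the rows, then compress with a general-purpose codec."""
    return zlib.compress(transpose_rows(rows), level=level)


blob = compress_rows([b"abc", b"abcd", b"abcde"])
print(zlib.decompress(blob))  # b'aaabbbcccdde'
```

Decompressing returns the preprocessed (second) data, not the original rows; restoring the rows additionally requires the offsets, as described for method 300.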
本申请实施例对压缩参数具体表现形式不作限定。例如,压缩参数可设计成表空间级别、文件级别、表级别、或者用户可自行设计。
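On the one hand, column-stored data generally exhibits similarity, repetition and a certain regularity, so data in column-store form generally achieves a higher compression ratio than data in row-store form. In the embodiments of method 200 in which the preprocessing includes byte-level row/column conversion, row-stored data is converted, in an ordered and reversible manner, into the form of column-stored data and is updated in place within the data page before the data page is compressed. This makes full use of the structure of the data and thereby improves the compression ratio of the data page. In addition, the compression time of method 200 provided by the embodiments of the present application is roughly on par with that of existing compression methods.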
另一方面,一般同一列的数据大概率是有很高的重复度以及规律性,因此,在方法200中所述的预处理包括基于字节级的差分处理的实施例中,将列数据之间进行基于字节级的差分处理后,能够制造出更多的重复数据,这样可以充分利用数据结构特点,进一步地提升数据重复度和规律性,进而可以提高数据页的压缩率。此外,本申请实施例提供的数据压缩的方法200和现有压缩方法的压缩耗时基本持平。
又一方面,一般数据页也是有一定规律性,因此,在对数据页进行基于字节级的行列转换之前,可以将连续的且结构相同的多个数据页进行重组得到一个数据页。这样,可以充分利用数据页结构特点,将相似度较高的多个数据页重组成一个数据页,进而可以提高数据页的压缩率。此外,本申请实施例提供的数据压缩的方法200和现有压缩方法的压缩耗时基本持平。
下面结合表11至表16,对本申请实施例提供的数据压缩的方法200的压缩性能(例如,压缩率或压缩耗时)进行具体详细的介绍。
基于本申请实施例提供的数据压缩的方法200和现有的压缩方法,本申请实施例对基于行存储方式的多个数据进行了TPCC测试,详见表11至表16。
其中,表11至表12中处理方式中:①:采用现有通用压缩算法进行压缩的处理方式;②:数据页内进行基于字节级的行列转换+采用现有通用压缩算法进行压缩的处理方式;③:数据页内进行基于字节级的行列转换+差分处理+采用现有通用压缩算法进行压缩的处理方式;④:数据页重组+数据页内进行基于字节级的行列转换+采用现有通用压缩算法进行压缩的处理方式;⑤:数据页重组+数据页内进行基于字节级的行列转换+差分处理+采用现有通用压缩算法进行压缩的处理方式。
表11和表12是以数据库GaussDB V3中的各个索引数据为例,表11对应的压缩等级为9,表12对应的压缩等级为1。
表13是以数据库PG中的各个索引数据为例,且表13是对一个数据页进行压缩的示例。
表14和表15是以数据库GaussDB V3中的各个表数据为例,表14对应的压缩等级为9,表15对应的压缩等级为1。
表16是以数据库PG中的各个表数据为例,且表16是对一个数据页进行压缩的示例。
表11至表15均以方法200的S220中采用zstd通用算法为例。表16除了以方法200的S220中采用zstd通用算法为例,还以方法200的S220中采用lz4通用算法为例。
表11
Figure PCTCN2022137287-appb-000012
Figure PCTCN2022137287-appb-000013
表12
Figure PCTCN2022137287-appb-000014
Figure PCTCN2022137287-appb-000015
表13
Figure PCTCN2022137287-appb-000016
Figure PCTCN2022137287-appb-000017
表14
Figure PCTCN2022137287-appb-000018
Figure PCTCN2022137287-appb-000019
表15
Figure PCTCN2022137287-appb-000020
Figure PCTCN2022137287-appb-000021
表16
Figure PCTCN2022137287-appb-000022
Figure PCTCN2022137287-appb-000023
由表11至表16可知:
1、在方法200的S220中采用zstd通用算法的情况下,压缩等级越低,压缩耗时越短,压缩率相对越低。
2、单次压缩的数据页越多,压缩性能越好。如单次压缩的数据页越多,压缩率越高,压缩耗时越短。
由表16可知,在本申请实施例中,方法200的S220中采用lz4通用算法和方法200的S220中采用zstd通用算法相比,采用lz4通用算法的实施例的总体压缩性能(压缩率和压缩耗时)不如采用zstd通用算法的实施例的压缩性能。
下面,结合图8至图11,详细描述表11和表14中分别采用本申请实施例提供的数据页压缩的方法200和现有的压缩方法对数据库GaussDB V3中的数据进行压缩的过程中两者各自的压缩性能。关于表11至表16中未描述的部分具体可以参见表中所示,这里不再多述。
需要说明的是,图8至图11仅是为了对比①~⑤处理方式的压缩性能,其中具体数值仍以表11至表16中为准。
图8至图11分别为本申请实施例提供的四例压缩性能的示意图。
在图8中的(a)、图9中的(a)、图10中的(a)、和图11中的(a)所示的图中,横坐标表示数据页的个数,纵坐标表示压缩后的数据页的大小(单位:M(兆))。
在图8中的(b)、图9中的(b)、图10中的(b)、和图11中的(b)所示的图中,横坐标表示数据页的个数,纵坐标表示压缩耗时(单位:s(秒))。
图8至图11的相同之处在于:1、均以方法200的S220中采用zstd通用算法为例。2、均是以对1G(千兆字节)的数据进行压缩为例,且均是以压缩等级为9为例。
图8至图11的不同之处在于:图8中是以数据库GaussDB V3中的索引数据idx_bmsql_oorder_pkey为例,图9中是以数据库GaussDB V3中的索引数据idx_bmsql_order_line_pkey为例,图10中是以数据库GaussDB V3中的表数据tbl_bmsql_oorder为例,图11中是以数据库GaussDB V3中的表数据tbl_bmsql_stock为例。
一方面,由图8至图10可知,在压缩数据之前,对数据做的处理不同,压缩后得到的数据页的大小和压缩耗时也不同。但总体上来说,无论是索引数据还是表数据,采用本申请实施例提供的数据页压缩的方法200对数据进行压缩后得到的数据页的大小均比采用现有的压缩方法对数据进行压缩后得到的数据页的大小要小。换句话说,本申请实施例提供的数据页压缩的方法200的压缩率均比采用现有的压缩方法的压缩率要高。另一方面,由图8至图11可知,本申请实施例提供的数据页压缩的方法200压缩过程的耗时均和采 用现有的压缩方法对数据进行压缩的耗时差不多。由此可见,本申请实施例提供的数据页压缩的方法200不仅压缩率较高,而且和现有压缩方法的压缩耗时基本持平。
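In addition, because the table data bmsql_order contains many random values, when the preprocessing applied to the table data bmsql_order includes differencing, in some cases the data pages obtained by compressing the data with method 200 provided by the embodiments of the present application are larger than those obtained by compressing the data with the existing compression method.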
上面对数据页压缩的过程进行了介绍。下面对数据页解压的过程进行介绍。
应理解,数据页压缩的过程和数据页解压的过程可以分开实施也可以结合实施,本申请实施例对此不作限定。
需要说明的是,下文以数据页解压的过程和数据页压缩的过程是结合实施为例进行描述,其不应对本申请构成限制。
图12为本申请实施例提供的一例数据页解压的方法300的示意性流程图。
例如,如图12所示,该方法300包括S310和S320,S320在S310之后执行。下面对S310和S320进行详细介绍。
S310,对压缩后的数据页进行解压缩,得到第二数据页。
本申请实施例对压缩后的数据页进行解压缩所使用的解压缩方法不作限定。
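For example, a general-purpose decompression algorithm (such as zlib, lz4 or zstd) may be used to decompress the compressed data page to obtain the second data page.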
S320,根据第二数据页,得到第一数据页。
第一数据页的结构可以参见上文的描述,这里不再赘述。
第二数据页可以包括第二行数据部和第二目录部,其中,第二行数据部用于存储第二数据,第二目录部用于存储第二组偏移量。
在S320中,第一数据页中的第一数据是对第二数据页中的第二数据进行预处理后得到的数据。第一数据页中的第一组偏移量是对第二数据页中的第二组偏移量进行预处理后得到的组偏移量。
在一个示例中,预处理仅包括基于字节级的行列转换。
在另一个示例中,预处理不仅包括基于字节级的行列转换,预处理还包括基于字节级的累加处理。其中,累加处理包括列数据之间进行累加。
需要说明的是,基于字节级的累加处理可以理解为以字节为单位进行累加。
在又一个示例中,预处理仅包括累加处理。
为了方便描述,将预处理仅包括基于字节级的行列转换的情况记为情况1,将预处理不仅包括基于字节级的行列转换,预处理还包括基于字节级的累加处理的情况记为情况3。将预处理仅包括累加处理记为情况4。
下面,分别以预处理为情况1、情况3和情况4为例,对S320进行详细描述。
情况1,预处理仅包括基于字节级的行列转换
在情况1中,S320具体包括S321至S324。
S321,从第二数据页中分别获取第二数据和第二组偏移量。
示例性地,可以先从第二头部分别获取第二数据的起始点和结束点以及第二数据对应的偏移量的起始点和结束点。然后,根据第二数据的起始点和结束点从第二数据页的第二行数据部获取第二数据,以及根据第二数据对应的偏移量的起始点和结束点从第二数据页 的第二目录部获取第二组偏移量。
关于起始点和结束点的相关描述可以参见上文的相关描述,这里不再赘述。
S322,按照字节对第二组偏移量进行预处理得到第一组偏移量。
具体地,根据第二组偏移量的单位偏移量长度,按照字节对第二组偏移量进行预处理得到第一组偏移量。
本申请实施例对第二组偏移量的单位偏移量长度的大小不作限定。下文均以第二组偏移量的单位偏移量长度为2字节为例进行描述。
例如,对表6所示的第二组偏移量进行基于字节级的行列转换可得到如表3所示的第一组偏移量。关于表3和表6的描述可以参考上文的相关描述,这里不再赘述。
又例如,对表7所示的第二组偏移量进行基于字节级的行列转换可得到如表4所示的第一组偏移量。关于表4和表7的描述可以参考上文的相关描述,这里不再赘述。
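The reverse conversion of S322 can be sketched as below, assuming 2-byte offsets and the layout of Tables 6 and 7 (all first bytes, then all second bytes). The function name is hypothetical.

```python
def untranspose_offsets(planes: bytes, unit_len: int = 2) -> bytes:
    """Rebuild fixed-width offsets from their byte planes (S322)."""
    if len(planes) % unit_len:
        raise ValueError("input length is not a multiple of the unit offset length")
    count = len(planes) // unit_len
    # Offset k is made of byte k of plane 0, byte k of plane 1, and so on.
    return bytes(planes[i * count + k]
                 for k in range(count)
                 for i in range(unit_len))


# Table 6 -> Table 3
print(list(untranspose_offsets(bytes([0x00, 0x00, 0x00, 0x01, 0x04, 0x08]))))  # [0, 1, 0, 4, 0, 8]
# Table 7 -> Table 4
print(list(untranspose_offsets(bytes([0x00, 0x00, 0x00, 0x06, 0x09, 0x01]))))  # [0, 6, 0, 9, 0, 1]
```

Because every offset has the same width, this direction needs no extra length information, unlike the restoration of the row data.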
S323,根据第一组偏移量,按照字节对第二数据进行预处理得到第一数据。
具体地,该S323包括:
S3231,从第一组偏移量中去除无效的偏移量,得到第五组偏移量,第五组偏移量包括P个偏移量。
在一个示例中,可以根据指示无效的偏移量的信息即可实现S3231。
若指示无效的偏移量的信息中指示有(N-P)个偏移量是无效的,那么此时需要从第一组偏移量中去除该(N-P)个偏移量,得到有效的P个偏移量,即第五组偏移量。
本申请实施例对指示无效的偏移量的信息的存储位置不做限定。例如,指示无效的偏移量的信息可以存储在第二数据页的目录部、头部或尾部。
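In another example, S3231 may also be implemented based on each offset in the first group of offsets and on the start point and end point of the second data.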
其中,第二数据的起始点和结束点可以根据第二数据页的头部或尾部获取。
具体的,若N个偏移量中有P个偏移量在第二数据的起始点和结束点之间,(N-P)个偏移量不在第二数据的起始点和结束点之间,即N个偏移量中有P个偏移量是有效的,(N-P)个偏移量是无效的。
在该示例下,需要从第一组偏移量中去除不在第二数据的起始点和结束点之间的(N-P)个偏移量,得到有效的P个偏移量。
S3232,将P个偏移量按照从小到大的顺序进行排列,得到排序后的P个偏移量。
例如,若第五组偏移量为如表3所示,该P个偏移量分别为0x01、0x04、0x08,该P个偏移量按照从小到大的顺序进行排列得到排序后的P个偏移量依次为0x01、0x04、0x08。
又例如,若第五组偏移量为如表4所示,该P个偏移量分别为0x06、0x09、0x01,该P个偏移量按照从小到大的顺序进行排列得到排序后的P个偏移量依次为0x01、0x06、0x09。
S3233,根据排序后的P个偏移量,创建P个区域,并得到第一数据的每行数据的长度,P个区域与P个偏移量一一对应。
该P个偏移量和P个区域具有一一对应的关系。例如,P个区域中排在第k个位置的区域对应的偏移量为第k个偏移量,所述第k个偏移量为P个偏移量按照从小到大的排列顺序中排在第k个位置的偏移量。
例如,若P=3,3个区域中排在第1个位置的区域对应的偏移量为表3所示的3个偏 移量中排在第1个位置的偏移量,即0x01;3个区域中排在第2个位置的区域对应的偏移量为表3所示的3个偏移量中排在第2个位置的偏移量,即0x04;3个区域中排在第3个位置的区域对应的偏移量为表3所示的3个偏移量中排在第3个位置的偏移量,即0x08。
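As another example, if P=3, the offset corresponding to the region in the 1st position among the 3 regions is the offset in the 3rd position among the 3 offsets shown in Table 4, namely 0x01; the offset corresponding to the region in the 2nd position is the offset in the 1st position among the 3 offsets shown in Table 4, namely 0x06; and the offset corresponding to the region in the 3rd position is the offset in the 2nd position among the 3 offsets shown in Table 4, namely 0x09.
The difference between the (d+1)-th offset and the d-th offset is taken as the length of row d of the first data, where d runs from 1 to (P-1). The difference between the end point of the first data and the P-th offset, plus 1, is taken as the length of row P of the first data.
For example, the 3 offsets shown in Table 3 are, after sorting, 0x01, 0x04 and 0x08. The difference between the 2nd offset (0x04) and the 1st offset (0x01) is taken as the length of row 1 of the first data, that is, (0x04-0x01)=0x03 (bytes); the difference between the 3rd offset (0x08) and the 2nd offset (0x04) is taken as the length of row 2 of the first data, that is, (0x08-0x04)=0x04 (bytes); and the difference between the end point of the first data (for example, 0x0C) and the 3rd offset (0x08), plus 1, is taken as the length of row 3 of the first data, that is, (0x0C-0x08+0x01)=0x05 (bytes).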
又例如,如表4所示的3个偏移量排序后依次为0x01、0x06、0x09,将第2个偏移量(0x06)和第1个偏移量(0x01)的差值作为第一数据第1行数据的长度,即该第一数据第1行数据的长度为(0x06-0x01)=0x05(字节);将第3个偏移量(0x09)和第2个偏移量(0x06)的差值作为第一数据第2行数据的长度,即该第一数据第2行数据的长度为(0x09-0x06)=0x03(字节);以及将第一数据的偏移量的结束点(例如,0x0C)和第3个偏移量(0x09)的差值加1作为第一数据第3行数据的长度,即该第一数据第3行数据的长度为(0x0C-0x09+0x01)=0x04(字节)。
本申请实施对P个区域的位置不作限定以及该P个区域的位置是否在第二数据页或第一数据页也不作限定。
可选地,在一些实施例中,在执行S3233之前,可以先确定P是否小于或等于第四阈值。并在P小于或等于第四阈值的情况下,才执行S3233,以及S3234和S324。这样,只有在第一数据行数不多的情况下,才去对第二数据进行基于字节级的行列转换,进而可以避免资源的浪费。
本申请实施例对第四阈值的具体取值不作限定,其可以根据实际情况进行设置。
本申请实施例对第四阈值和第二阈值的关系不作限定。例如,第四阈值可以等于第二阈值。
S3234,依次按顺序从第二行数据部中读取R个字节对应的数据,并依次将R个字节中第p个字节对应的数据存储至P个区域中第s个区域的第q个字节对应的数据,完成第q次数据的读写。
其中,p为正整数,且p从1取至R。
R为P个区域中未被写满数据的区域的数量,在第s个区域被写满数据的情况下,第s个区域中的数据的数量为第一数据的第s行数据的长度,s为正整数,第s个区域对应的偏移量为第s个偏移量,第s个偏移量位于第五组偏移量中除被写满数据的区域对应的偏 移量之外的偏移量中的第p个偏移量。
需要说明的是,第p个偏移量可以理解为排在第五组偏移量中除被写满数据的区域对应的偏移量之外的偏移量中第p个位置的偏移量。
q runs from 1 to L2, where L2 is the maximum row length of the first data.
下面,结合图13和图14为S3234进行详细描述。
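Fig. 13 is a schematic diagram of an example process of reading and writing the second data according to an embodiment of the present application. The second data in Fig. 13 is as shown in Table 5, and the fifth group of offsets corresponding to the second data is as shown in Table 3. Then P=3 in S3234. The region in the 1st position among the 3 regions is region 401, whose offset is the one in the 1st position among the 3 offsets shown in Table 3, namely 0x01; the region in the 2nd position is region 402, whose offset is the one in the 2nd position among the 3 offsets shown in Table 3, namely 0x04; the region in the 3rd position is region 403, whose offset is the one in the 3rd position among the 3 offsets shown in Table 3, namely 0x08. Thus, in the example of Fig. 13, the order of the offsets corresponding to the 3 regions is the order of the 3 offsets shown in Table 3. The length of row 1 of the first data is 0x03 (bytes), the length of row 2 is 0x04 (bytes), and the length of row 3 is 0x05 (bytes).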
如图13所示,在q=1时,由于3个区域中未被写满数据的区域为3个,则此时R=3。这样,依次按顺序从如表5所示的第二数据中读取3个字节对应的数据(aaa),并依次将3个字节中第1个字节对应的数据(a)存储至区域401(表3所示的3个偏移量中排在第1个位置的偏移量对应的区域)的第1个字节对应的数据,将3个字节中第2个字节对应的数据(a)存储至区域402(3个区域中排在第2个位置的区域)的第1个字节对应的数据,以及将3个字节中第3个字节对应的数据(a)存储至区域403(3个区域中排在第3个位置的区域)的第1个字节对应的数据,这样,便完成第1次数据的读写。
在q=2时,由于3个区域中未被写满数据的区域为3个,则此时R=3。这样,依次按顺序从如表5所示的第二数据中读取3个字节对应的数据(bbb),并依次将3个字节中第1个字节对应的数据(b)存储至区域401(表3所示的3个偏移量中排在第1个位置的偏移量对应的区域)的第2个字节对应的数据,将3个字节中第2个字节对应的数据(b)存储至区域402(表3所示的3个偏移量中排在第2个位置的偏移量对应的区域)的第2个字节对应的数据,以及将3个字节中第3个字节对应的数据(b)存储至区域403(表3所示的3个偏移量中排在第3个位置的偏移量对应的区域)的第2个字节对应的数据,这样,便完成第2次数据的读写。
在q=3时,由于3个区域中未被写满数据的区域为3个,则此时R=3。这样,依次按顺序从如表5所示的第二数据中读取3个字节对应的数据(ccc),并依次将3个字节中第1个字节对应的数据(c)存储至区域401(表3所示的3个偏移量中排在第1个位置的偏移量对应的区域)的第3个字节对应的数据,将3个字节中第2个字节对应的数据(c)存储至区域402(表3所示的3个偏移量中排在第2个位置的偏移量对应的区域)的第3个字节对应的数据,以及将3个字节中第3个字节对应的数据(c)存储至区域403(表3所示的3个偏移量中排在第3个位置的偏移量对应的区域)的第3个字节对应的数据,这样,便完成第3次数据的读写。
在完成第3次数据的读写后,3个区域中的3个区域中排在第1个位置的区域401(表3所示的3个偏移量中排在第1个位置的偏移量对应的区域)中的数据的数量(3个字节) 已达到第一数据的第1行数据的长度(0x03),此时,区域401即可认为已被写满数据。
在q=4时,由于3个区域中未被写满数据的区域为2个,则此时R=2。此外,该2个未被写满数据的区域包括:3个区域中除表3所示的3个偏移量中排在第1个位置的偏移量之外的偏移量中排在第1个位置的偏移量对应的区域(区域402),以及3个区域中除表3所示的3个偏移量中排在第1个位置的偏移量之外的偏移量中排在第2个位置的偏移量对应的区域(区域403)。这样,依次按顺序从如表5所示的第二数据中读取2个字节对应的数据(dd),并依次将2个字节中第1个字节对应的数据(d)存储至区域402的第4个字节对应的数据,并将2个字节中第2个字节对应的数据(d)存储至区域403的第4个字节对应的数据,这样,便完成第4次数据的读写。
在完成第4次数据的读写后,3个区域中的3个区域中排在第2个位置的区域402(表3所示的3个偏移量中排在第2个位置的偏移量对应的区域)中的数据的数量(4个字节)已达到第一数据的第2行数据的长度(0x04),此时,区域402即可认为已被写满数据。
在q=5时,由于3个区域中未被写满数据的区域为1个,则此时R=1。此外,该1个未被写满数据的区域为除表3所示的3个偏移量中排在第1个位置的偏移量和第2个位置的偏移量之外的偏移量中排在第1个位置的偏移量对应的区域(区域403)。这样,按顺序从如表5所示的第二数据中读取1个字节对应的数据(e),并依次将1个字节对应的数据(e)存储至区域403(未被写满数据区域)的第5个字节对应的数据,这样,便完成第5次数据的读写。
经过如图13所示的上述5次数据的读写过程,便可实现从第二数据页中读取第二数据,并将读取的第二数据写入3个区域(区域401、区域402和区域403)。其中,3个区域中的3个区域中排在第1个位置的区域写入的数据为abc,其占用了3个字节;3个区域中的3个区域中排在第2个位置的区域写入的数据为abcd,其占用了4个字节;3个区域中的3个区域中排在第3个位置的区域写入的数据为abcde,其占用了5个字节。
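Fig. 14 is a schematic diagram of another example process of reading and writing the second data according to an embodiment of the present application. The second data in Fig. 14 is as shown in Table 5, and the fifth group of offsets corresponding to the second data is as shown in Table 4: the offset in the 1st position is 0x06, the offset in the 2nd position is 0x09, and the offset in the 3rd position is 0x01. Then P=3 in S3234. The region in the 1st position among the 3 regions is region 501, whose offset is the one in the 3rd position among the 3 offsets shown in Table 4, namely 0x01; the region in the 2nd position is region 502, whose offset is the one in the 1st position among the 3 offsets shown in Table 4, namely 0x06; the region in the 3rd position is region 503, whose offset is the one in the 2nd position among the 3 offsets shown in Table 4, namely 0x09. Thus, in the example of Fig. 14, the order of the offsets corresponding to the 3 regions is the ascending order of the 3 offsets shown in Table 4. The length of row 1 of the first data is 0x05 (bytes), the length of row 2 is 0x03 (bytes), and the length of row 3 is 0x04 (bytes).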
如图14所示,在q=1时,由于3个区域中未被写满数据的区域为3个,则此时R=3。这样,依次按顺序从如表5所示的第二数据中读取3个字节对应的数据(aaa),并依次将3个字节中第1个字节对应的数据(a)存储至区域502(表4所示的3个偏移量中排在第1个位置的偏移量对应的区域)的第1个字节对应的数据,将3个字节中第2个字节对应的数据(a)存储至区域503(表4所示的3个偏移量中排在第2个位置的偏移量对应的区域)的第1个字节对应的数据,以及将3个字节中第3个字节对应的数据(a)存储至区 域501(表4所示的3个偏移量中排在第3个位置的偏移量对应的区域)的第1个字节对应的数据,这样,便完成第1次数据的读写。
在q=2时,由于3个区域中未被写满数据的区域为3个,则此时R=3。这样,依次按顺序从如表5所示的第二数据中读取3个字节对应的数据(bbb),并依次将3个字节中第1个字节对应的数据(b)存储至区域502(表4所示的3个偏移量中排在第1个位置的偏移量对应的区域)的第2个字节对应的数据,将3个字节中第2个字节对应的数据(b)存储至区域503(表4所示的3个偏移量中排在第2个位置的偏移量对应的区域)的第2个字节对应的数据,以及将3个字节中第3个字节对应的数据(b)存储至区域501(表4所示的3个偏移量中排在第3个位置的偏移量对应的区域)的第2个字节对应的数据,这样,便完成第2次数据的读写。
在q=3时,由于3个区域中未被写满数据的区域为3个,则此时R=3。这样,依次按顺序从如表5所示的第二数据中读取3个字节对应的数据(ccc),并依次将3个字节中第1个字节对应的数据(c)存储至区域502(表4所示的3个偏移量中排在第1个位置的偏移量对应的区域)的第3个字节对应的数据,将3个字节中第2个字节对应的数据(c)存储至区域503(表4所示的3个偏移量中排在第2个位置的偏移量对应的区域)的第3个字节对应的数据,以及将3个字节中第3个字节对应的数据(c)存储至区域501(表4所示的3个偏移量中排在第3个位置的偏移量对应的区域)的第3个字节对应的数据,这样,便完成第3次数据的读写。
在完成第3次数据的读写后,3个区域中的3个区域中排在第2个位置的区域502(表4所示的3个偏移量中排在第1个位置的偏移量对应的区域)中的数据的数量(3个字节)已达到第一数据的第2行数据的长度(0x03),此时,区域502即可认为已被写满数据。
在q=4时,由于3个区域中未被写满数据的区域为2个,则此时R=2。此外,该2个未被写满数据的区域:3个区域中除表4所示的3个偏移量中排在第1个位置的偏移量之外的偏移量中排在第1个位置的偏移量对应的区域(区域503),以及3个区域中除表4所示的3个偏移量中排在第1个位置的偏移量之外的偏移量中排在第2个位置的偏移量对应的区域(区域501)。这样,依次按顺序从如表5所示的第二数据中读取2个字节对应的数据(dd),并依次将2个字节中第1个字节对应的数据(d)存储至区域503的第4个字节对应的数据,并将2个字节中第2个字节对应的数据(d)存储至区域501的第4个字节对应的数据,这样,便完成第4次数据的读写。
在完成第4次数据的读写后,3个区域中的3个区域中排在第3个位置的区域503(表4所示的3个偏移量中排在第2个位置的偏移量对应的区域)中的数据的数量(4个字节)已达到第一数据的第3行数据的长度(0x04),此时,区域503即可认为已被写满数据。
在q=5时,由于3个区域中未被写满数据的区域为1个,则此时R=1。此外,该1个未被写满数据的区域为3个区域中除表4所示的3个偏移量中排在第1个位置的偏移量和第2个位置的偏移量之外的偏移量中排在第1个位置的偏移量对应的区域(区域501)。这样,按顺序从如表5所示的第二数据中读取1个字节对应的数据(e),并依次将1个字节对应的数据(e)存储至区域501的第5个字节对应的数据,这样,便完成第5次数据的读写。
经过如图14所示的5次数据的读写过程,便可实现从第二数据页中读取第二数据,并将读取的第二数据写入3个区域(区域501、区域502和区域503)。其中,3个区域中 的3个区域中排在第1个位置的区域写入的数据为abcde,其占用了5个字节;3个区域中的3个区域中排在第2个位置的区域写入的数据为abc,其占用了3个字节;3个区域中的3个区域中排在第3个位置的区域写入的数据为abcd,其占用了4个字节。
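The read/write passes of Fig. 13 and Fig. 14 can be sketched as one routine. It is an illustration under assumptions: the name `restore_row_data` and the `column_order` parameter are hypothetical, offsets and the end point are 1-based inclusive byte positions as in the examples, and `column_order` must be the same offset order that was used when the page was preprocessed (ascending order for table data, directory order for index data).

```python
def restore_row_data(packed: bytes, column_order: list[int],
                     rows_begin: int, rows_end: int) -> bytes:
    """Write packed bytes back into per-row regions (S3233 + S3234)."""
    ordered = sorted(column_order)
    length = {}
    for k, start in enumerate(ordered):
        nxt = ordered[k + 1] if k + 1 < len(ordered) else rows_end + 1
        length[start] = nxt - start            # row length from neighbouring offsets
    out = bytearray(rows_end - rows_begin + 1)
    pos = 0
    for q in range(max(length.values(), default=0)):   # q-th read/write pass
        for start in column_order:
            if q < length[start]:              # skip regions that are already full
                out[start - rows_begin + q] = packed[pos]
                pos += 1
    if pos != len(packed):
        raise ValueError("packed data does not match the row lengths")
    return bytes(out)


# Fig. 13: offsets of Table 3.
print(restore_row_data(b"aaabbbcccdde", [1, 4, 8], 1, 12))  # b'abcabcdabcde'
# Fig. 14: offsets of Table 4 in directory order.
print(restore_row_data(b"aaabbbcccdde", [6, 9, 1], 1, 12))  # b'abcdeabcabcd'
```

Writing each byte straight to its final position corresponds to the in-place variant of S324; separate region buffers would work equally well.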
可选地,在一些实施例中,在执行S3234之前,可以先确定第一数据的每行数据的长度之间的差异是否小于或等于第三阈值,即S3235。并在第一数据的每行数据的长度之间的差异小于或等于第三阈值的情况下,才执行S3234,以及S324。这样,只有在第一数据的每行数据的长度之间的差异不大的情况下,才去对第一数据进行基于字节级的行列转换,进而可以避免资源的浪费。
本申请实施例对第三阈值的具体取值不作限定,其可以根据实际情况进行设置。
本申请实施例对第三阈值和第一阈值的关系不作限定。例如,第三阈值可以等于第一阈值。
S324,根据第一数据和第一组偏移量,得到第一数据页。
在一个示例中,可以新创建一个数据页,将该第一数据和第一组偏移量存储至新重建的数据页中,以形成第一数据页。
例如,可以新创建一个数据页,该数据页包括第一行数据部和第一目录部,将第一数据存储至第一行数据部,并将第一组偏移量存储至第一目录部。这样,该新创建的数据页即为第一数据页。
在另一个示例中,可以在原有的第二数据页的基础上,得到第一数据页。
例如,将第二行数据部中存储的第二数据更新为第一数据,并将第二目录部中存储的第二组偏移量更新为第一组偏移量,得到第一数据页。
其中,将第二行数据部中存储的第二数据更新为第一数据具体包括依次将S3234中得到的P个区域中的数据覆盖第二行数据部中存储的第二数据。
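For example, the data in the 3 regions described in Fig. 13 is, in order, abcabcdabcde, that is, the first data is abcabcdabcde. As another example, the data in the 3 regions described in Fig. 14 is, in order, abcdeabcabcd, that is, the first data is abcdeabcabcd.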
情况3,预处理包括基于字节级的行列转换和基于字节级的累加处理
在情况3中,S320具体包括S321至S324。情况3中S321和S324的具体过程,与情况1中S321和S324的具体过程是相同的,这里不再赘述。情况3中S322和S323的具体过程,与情况1中S322和S323具体过程是不同的,下面详细介绍情况3中S322和S323的具体过程。
在该情况3中,S322具体包括S322A和S322B。
S322A,将第二组偏移量的第c1行上的相邻列的数据按照字节进行累加,得到第三组偏移量。
其中,1≤c1≤c2,c1和c2均为正整数,c2等于第二组偏移量的最大行长度或c2等于第二组偏移量的最小行长度。
例如,若根据S321得到的第二组偏移量如表9所示,表6为将表9所示的第二组偏移量的第c1行上的相邻列的数据按照字节进行累加得到的第三组偏移量的一个示例。其中,以c2等于第二组偏移量的最小行长度(3个字节)为例。
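As another example, if the second group of offsets obtained in S321 is as shown in Table 10, Table 7 is an example of the third group of offsets obtained by accumulating, byte by byte, the data in adjacent columns on row c1 of the second group of offsets shown in Table 10. Here, c2 equal to the minimum row length of the second group of offsets (3 bytes) is taken as an example.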
S322B,按照字节对第三组偏移量进行行列转换,得到第一组偏移量。
具体地,根据第三组偏移量的单位偏移量长度,按照字节对第三组偏移量进行预处理得到第一组偏移量。
本申请实施例对第三组偏移量的单位偏移量长度的大小不作限定。下文均以第三组偏移量的单位偏移量长度为2字节为例进行描述。
例如,对表6所示的第三组偏移量进行基于字节级的行列转换可得到如表3所示的第一组偏移量。关于表3和表6的描述可以参考上文的相关描述,这里不再赘述。
又例如,对表7所示的第三组偏移量进行基于字节级的行列转换可得到如表4所示的第一组偏移量。关于表4和表7的描述可以参考上文的相关描述,这里不再赘述。
在该情况3中,S323具体包括S323A和S323B。
S323A,将第二数据的第d1行上的相邻列的数据进行按照字节累加,得到第三数据。
其中,1≤d1≤d2,d1和d2均为正整数,d2等于第二数据的最大行长度或d2等于第二数据的最小行长度。
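For example, if the second data obtained in S321 is as shown in Table 8, Table 5 is an example of the third data obtained by accumulating, byte by byte, the data in adjacent columns on row d1 of the second data shown in Table 8. Here, d2 equal to the minimum row length of the second data (3 bytes) is taken as an example.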
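The accumulation of S323A undoes the differencing applied during compression. Since Tables 8 to 10 are not reproduced in the text, this sketch assumes the convention that the first column of each row is stored as is and every later column stores a difference modulo 256; the function name and parameters are hypothetical.

```python
def accumulate_rows(packed: bytes, n_cols: int, n_rows: int) -> bytes:
    """Running sum across adjacent columns on the first n_rows full rows."""
    if n_rows * n_cols > len(packed):
        raise ValueError("not enough data for n_rows full rows")
    out = bytearray(packed)
    for r in range(n_rows):
        base = r * n_cols
        for c in range(1, n_cols):
            # Add the stored difference to the already restored left neighbour.
            out[base + c] = (out[base + c - 1] + packed[base + c]) % 256
    return bytes(out)


# Differenced form of Table 5 under the assumed convention, 3 columns, d2 = 3.
print(accumulate_rows(b"a\x00\x00b\x00\x00c\x00\x00dde", 3, 3))  # b'aaabbbcccdde'
```

The same routine applies to the offsets in S322A, with the number of columns equal to the number of offsets.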
S323B,根据第一组偏移量,按照字节对第三数据进行行列转换得到第一数据。
具体地,S323B包括S3231至S3233和S3234A。其中,关于S3231至S3233的描述可以参见上文的相关描述,这里不再赘述。这里着重介绍S3234A。
S3234A,依次按顺序从第三数据中读取R个字节对应的数据,并依次将R个字节中第p个字节对应的数据存储至P个区域中第s个区域的第q个字节对应的数据,完成第q次数据的读写。
其中,p为正整数,且p从1取至R。
R为P个区域中未被写满数据的区域的数量,在第s个区域被写满数据的情况下,第s个区域中的数据的数量为第一数据的第s行数据的长度,s为正整数,第s个区域对应的偏移量为第s个偏移量,第s个偏移量位于第五组偏移量中除被写满数据的区域对应的偏移量之外的偏移量中的第p个偏移量。
q runs from 1 to L2, where L2 is the maximum row length of the first data.
根据上文对该S3234A的描述,可以看出该S3234A和上文所述的S3234的过程类似,两者的区别仅在于:S3234A是从第三数据中读取R个字节对应的数据,S3234是从第二行数据部即第二数据中读取R个字节对应的数据。故关于该S3234A的详细描述可以参考上文S3234的相关描述,这里不再赘述。
情况4,预处理仅包括基于字节级的累加处理
在该情况4中,S320具体包括S321至S324。情况4中S321和S324的具体过程,与情况3中S321和S324的具体过程是相同的,这里不再赘述。情况4中S322和S323的具体过程,与情况3中S322和S323具体过程是不同的,下面详细介绍情况4中S322和S323的具体过程。
在该情况4中,S322具体包括:将第二组偏移量的第g1行上的相邻列的数据按照字节进行累加,得到第一组偏移量。
其中,1≤g1≤c2,g1为正整数。其中,c2可以参考上文的相关描述。
在该情况4中,S322的具体过程和情况3中S322A的过程类似,关于情况4中S322的具体过程可以参考情况3中S322A的相应的描述,这里不再详细描述。
在该情况4中,S323具体包括:将第二数据的第h1行上的相邻列的数据进行按照字节累加,得到第一数据。
其中,1≤h1≤d2,h1为正整数,其中d2可以参考上文的相关描述。
在该情况4中,S323的具体过程和情况3中S323A的过程类似,关于情况4中S323的具体过程可以参考情况3中S323A的相应的描述,这里不再详细描述。
可选地,在一些实施例中,若第一数据页是由多个数据页重组而成的,还需将第一数据页进行拆分以得到多个数据页。
可选地,在一些实施例中,第一数据页包括用于指示第一数据页进行过重组的信息。这样,通过第一数据页便可获知该第一数据页是否重组过。
关于用于指示第一数据页进行过重组的信息的描述可以参见上文的相关描述,这里不再赘述。
也就是说,在S320之后,所述方法300还可以包括:
S330,将第一数据页进行拆分,得到多个第三数据页。
S330中所述的第一数据页和第三数据页的描述可以参见上文中的相关描述,这里不再赘述。
具体的,S330包括:S331,根据第一数据页的头部,获取多个第三数据页的第四数据的起始点和结束点,以及第四组偏移量的起始点和结束点。
可选地,在一些实施例中,若数据页除了包括行数据部和目录部外,数据页还包括:头部和/或尾部,在执行S331的过程中,还需要执行以下步骤:首先,分别获取与多个第三数据页对应的多个头部和/或尾部中存储的数据。其次,分别将多个头部和/或尾部中存储的数据按照目标顺序进行拆分,得到多个第三数据页中的头部和/或尾部中存储的数据,并将多个第三数据页中的头部和/或尾部中存储的数据分别存储至多个第三数据页的头部和/或尾部。最后,根据每个第三数据页中的头部和/或尾部中存储的数据,得到每个第三数据页对应的第四数据的起始点和结束点,以及第四组偏移量的起始点和结束点。
S332,根据多个第四数据的起始点和结束点,从第一数据页中得到多个第四数据;以及,根据多个第四组偏移量的起始点和结束点,从第一数据页中得到多个第四组偏移量。
本申请实施对上文S332中所述的得到第四数据和得到第四组偏移量的步骤的执行顺序不作限定,例如,可以先得到第四数据后得到第四组偏移量,或者,可以先得到第四组偏移量后得到第四数据,或者,可以同时得到第四数据和第四组偏移量。
S333,分别将多个第四数据和多个第四组偏移量分别存储至多个第三数据页。
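Steps S331 to S333 can be sketched as slicing the regrouped page by recorded start and end points. How those points are stored in the header is not specified in the source, so the half-open `(start, end)` span lists and the function name used here are assumptions.

```python
def split_page(rows: bytes, directory: bytes,
               row_spans: list[tuple[int, int]],
               dir_spans: list[tuple[int, int]]) -> list[tuple[bytes, bytes]]:
    """Return (fourth data, fourth group of offsets) for each original page."""
    if len(row_spans) != len(dir_spans):
        raise ValueError("row and directory spans must describe the same pages")
    return [(rows[a:b], directory[c:d])
            for (a, b), (c, d) in zip(row_spans, dir_spans)]


# A regrouped page built from two pages: rows 'abc' and 'de', one 2-byte offset each.
pages = split_page(b"abcde", b"\x00\x01\x00\x01", [(0, 3), (3, 5)], [(0, 2), (2, 4)])
print(pages)  # [(b'abc', b'\x00\x01'), (b'de', b'\x00\x01')]
```

Each returned pair would then be stored in its own third data page, as in S333.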
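In method 300 described above, each row of data stored in the second data page exhibits similarity, repetition and a certain regularity, so the second data page can be decompressed at a relatively high decompression ratio, which in turn improves the decompression ratio of the data page. In addition, the decompression time of the embodiments of the present application is roughly on par with that of existing decompression methods.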
下面,结合图15和图16,对本申请实施例提供的数据页处理的装置进行描述。
图15是本申请实施例提供的数据页处理的装置的示意性框图。
如图15所示,该装置600包括:处理单元610。
在一种可实现的方式中,该处理单元610用于实现上文方法200中所述的各个步骤, 这里不再赘述。
在另一种可实现的方式中,该处理单元610用于实现上文方法300中所述的各个步骤,这里不再赘述。
图16示出了本申请实施例提供的另一例数据页处理的装置的示意性结构图。
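As shown in Fig. 16, the data page processing apparatus 700 includes one or more processors 710 and one or more memories 720. The one or more memories 720 store one or more computer programs, and the one or more computer programs include instructions. When the instructions are run by the one or more processors 710, the data page processing apparatus is caused to perform the steps of method 200 or method 300 described above.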
本申请实施例提供一种计算机程序产品,当所述计算机程序产品在数据页处理的装置运行时,使得数据页处理的装置执行上述方法200或方法300中所述的各个步骤。其实现原理和技术效果与上述方法相关实施例类似,此处不再赘述。
本申请实施例提供一种可读存储介质,所述可读存储介质包含指令,当所述指令在数据页处理的装置运行时,使得所述数据页处理的装置执行上述方法200或方法300中所述的各个步骤。其实现原理和技术效果类似,此处不再赘述。
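An embodiment of the present application provides a chip system, including a processor configured to call and run a computer program from a memory, so that an apparatus in which the chip system is installed performs the steps of method 200 or method 300 described above. The implementation principle and technical effect are similar and are not repeated here.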
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现 有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (26)

  1. 一种数据页处理的方法,其特征在于,包括:
    根据第一数据页,得到第二数据页;
    对所述第二数据页进行压缩,得到压缩后的数据页;
    其中,所述第一数据页包括基于行存储方式的第一数据和第一组偏移量,所述第一组偏移量用于指示所述第一数据的每行数据的偏移量;
    所述第二数据页包括基于行存储方式的第二数据和第二组偏移量,所述第二组偏移量用于指示所述第二数据的每行数据的偏移量,所述第二数据是对所述第一数据进行预处理后得到的数据,所述第二组偏移量是对所述第一组偏移量进行所述预处理后得到的组偏移量,所述预处理包括基于字节级的行列转换。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述第一数据页,得到第二数据页,包括:
    从所述第一数据页中分别获取所述第一数据和所述第一组偏移量;
    按照字节对所述第一数据进行所述预处理得到所述第二数据;
    按照字节对所述第一组偏移量进行所述预处理得到所述第二组偏移量;
    根据所述第二数据和所述第二组偏移量,得到所述第二数据页。
  3. 根据权利要求2所述的方法,其特征在于,所述第一数据页包括第一行数据部和第一目录部,所述第一行数据部用于存储所述第一数据,所述第一目录部用于存储所述第一组偏移量;
    所述根据所述第二数据和所述第二组偏移量,得到所述第二数据页,包括:
    将所述第一行数据部中存储的所述第一数据更新为所述第二数据,并将所述第一目录部中存储的所述第一组偏移量更新为第二组偏移量,得到所述第二数据页。
  4. 根据权利要求2或3所述的方法,其特征在于,所述预处理还包括基于字节级的差分处理,所述差分处理包括列数据之间进行差分。
  5. 根据权利要求4所述的方法,其特征在于,
    所述按照字节对所述第一数据进行所述预处理得到所述第二数据包括:
    按照字节对所述第一数据进行行列转换得到所述第三数据;
    将所述第三数据的第a1行上的相邻列的数据按照字节进行差分,得到所述第二数据,所述1≤a1≤a2,所述a1和a2均为正整数,所述a2等于所述第一数据的最大行长度或所述a2等于所述第一数据的最小行长度;
    所述按照字节对所述第一组偏移量进行所述预处理得到所述第二组偏移量包括:
    按照字节对所述第一组偏移量进行行列转换得到所述第三组偏移量;
    将所述第三组偏移量的第b1行上的相邻列的数据按照字节进行差分,得到所述第二组偏移量,所述1≤b1≤b2,所述b1和b2均为正整数,所述b2等于所述第一组偏移量的最大行长度或所述b2等于所述第一组偏移量的最小行长度。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述方法还包括:
    将连续的且结构相同的多个第三数据页进行重组,得到所述第一数据页;
    其中,所述第三数据页包括基于行存储方式的第四数据和第四组偏移量,所述第四组 偏移量用于指示所述第四数据的每行数据的偏移量,所述第一数据包括多个所述第三数据页对应的多个所述第四数据,且多个所述第四数据的最大行长度相同,第一组偏移量包括多个所述第三数据页对应的多个所述第四组偏移量。
  7. 根据权利要求6所述的方法,其特征在于,
    所述将连续的且结构相同的多个第三数据页进行重组,得到所述第一数据页,包括:
    分别获取与多个所述第三数据页对应的多个所述第四数据和多个所述第四组偏移量;
    分别将多个所述第四数据按照目标顺序进行排列,得到所述第一数据;以及,分别将多个所述第四组偏移量按照所述目标顺序进行排列,得到所述第一组偏移量,所述目标顺序为多个所述第三数据页的排列顺序;
    将所述第一数据和所述第一组偏移量分别存储至所述第一数据页。
  8. 根据权利要求6或7所述的方法,其特征在于,所述第一数据页包括用于指示所述第一数据页进行过重组的信息。
  9. 根据权利要求1至8中任一项所述的方法,其特征在于,所述第二数据页包括用于指示所述第二数据页进行过所述预处理的信息。
  10. 根据权利要求1至9中任一项所述的方法,其特征在于,所述方法还包括:
    对所述压缩后的数据页进行解压缩,得到所述第二数据页;
    根据所述第二数据页,得到所述第一数据页,所述第一数据是对第二数据进行所述预处理后得到的数据,所述第一组偏移量是对所述第二组偏移量进行所述预处理后得到的组偏移量。
  11. 根据权利要求10所述的方法,其特征在于,所述根据所述第二数据页,得到所述第一数据页,包括:
    从所述第二数据页中分别获取所述第二数据和所述第二组偏移量;
    按照字节对所述第二组偏移量进行所述预处理得到所述第一组偏移量;
    根据所述第一组偏移量,按照字节对所述第二数据进行所述预处理得到所述第一数据;
    根据所述第一数据和所述第一组偏移量,得到所述第一数据页。
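  12. The method according to claim 11, wherein the second data page comprises a second row data portion and a second directory portion, the second row data portion being used to store the second data and the second directory portion being used to store the second group of offsets; and
    obtaining the first data page according to the first data and the first group of offsets comprises:
    updating the second data stored in the second row data portion to the first data, and updating the second group of offsets stored in the second directory portion to the first group of offsets, to obtain the first data page.
  13. The method according to claim 11 or 12, wherein the preprocessing further comprises byte-level accumulation, and the accumulation comprises accumulating column data.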
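  14. The method according to claim 13, wherein
    performing the preprocessing on the second group of offsets byte by byte to obtain the first group of offsets comprises:
    accumulating, byte by byte, data in adjacent columns on row c1 of the second group of offsets to obtain a third group of offsets, where 1≤c1≤c2, c1 and c2 are both positive integers, and c2 is equal to the maximum row length of the second group of offsets or to the minimum row length of the second group of offsets; and
    performing row-column conversion on the third group of offsets byte by byte to obtain the first group of offsets; and
    performing, according to the first group of offsets, the preprocessing on the second data byte by byte to obtain the first data comprises:
    accumulating, byte by byte, data in adjacent columns on row d1 of the second data to obtain third data, where 1≤d1≤d2, d1 and d2 are both positive integers, and d2 is equal to the maximum row length of the second data or to the minimum row length of the second data; and
    performing, according to the first group of offsets, row-column conversion on the third data byte by byte to obtain the first data.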
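The restore path of claim 14 (accumulate first, then convert back) can be sketched as follows. It assumes fixed-width original rows, so the row width is passed in directly rather than derived from the restored offsets as in the claim, and it assumes the differences were taken modulo 256. The function names are hypothetical.

```python
def undelta_rows(buf: bytes, row_len: int) -> bytes:
    """Running sum along each row, modulo 256 (inverse of byte differencing)."""
    if row_len <= 0 or len(buf) % row_len != 0:
        raise ValueError("buffer length must be a multiple of the row length")
    out = bytearray(buf)
    for start in range(0, len(out), row_len):
        for i in range(start + 1, start + row_len):
            out[i] = (out[i] + out[i - 1]) & 0xFF
    return bytes(out)


def untranspose_bytes(buf: bytes, width: int) -> bytes:
    """Turn column-major bytes back into rows of `width` bytes."""
    if width <= 0 or len(buf) % width != 0:
        raise ValueError("buffer length must be a multiple of the row width")
    n_rows = len(buf) // width
    return bytes(buf[col * n_rows + row]
                 for row in range(n_rows) for col in range(width))


def restore_data(second_data: bytes, row_width: int) -> bytes:
    """Accumulate along each converted row, then convert back to row order."""
    if not second_data:
        return b""
    if row_width <= 0 or len(second_data) % row_width != 0:
        raise ValueError("data length must be a multiple of the row width")
    n_rows = len(second_data) // row_width
    # Each row of the converted data holds one byte position of all n_rows rows.
    third_data = undelta_rows(second_data, n_rows)
    return untranspose_bytes(third_data, row_width)
```

With variable-length rows, the restored first group of offsets would be needed to recover each row's length, which is why the claim restores the offsets before the data.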
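  15. The method according to any one of claims 1 to 14, wherein the method further comprises:
    splitting the first data page to obtain the plurality of third data pages.
  16. The method according to claim 15, wherein
    splitting the first data page to obtain the plurality of third data pages comprises:
    obtaining start and end points of the fourth data of the plurality of third data pages, and start and end points of the fourth groups of offsets;
    obtaining the plurality of pieces of fourth data from the first data page according to the start and end points of the plurality of pieces of fourth data, and obtaining the plurality of fourth groups of offsets from the first data page according to the start and end points of the plurality of fourth groups of offsets; and
    storing the plurality of pieces of fourth data and the plurality of fourth groups of offsets in the plurality of third data pages, respectively.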
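The split in claim 16 can be sketched as slicing by recorded boundaries. How the start and end points are recorded is not specified in the claim; here they are assumed to be available as half-open `(start, end)` index pairs, and `split_page` is a hypothetical name.

```python
def split_page(data: bytes, offsets: list, data_bounds: list, offset_bounds: list) -> list:
    """Cut a merged page back into per-page (data, offsets) pairs.

    data_bounds and offset_bounds hold one half-open (start, end) pair
    per original page, in page order.
    """
    if len(data_bounds) != len(offset_bounds):
        raise ValueError("need one data range and one offset range per page")
    return [
        (data[d_start:d_end], offsets[o_start:o_end])
        for (d_start, d_end), (o_start, o_end) in zip(data_bounds, offset_bounds)
    ]


pages = split_page(b"aabbccdd", [0, 2, 0, 2],
                   [(0, 4), (4, 8)], [(0, 2), (2, 4)])
```

In this example the merged page is cut back into two pages, each holding four bytes of row data and two offsets.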
  17. The method according to claim 15 or 16, wherein the first data page comprises information indicating that the first data page has been reorganized.
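  18. A data page processing apparatus, wherein the apparatus comprises a processing unit, the processing unit being configured to:
    obtain a second data page according to a first data page; and
    compress the second data page to obtain a compressed data page;
    wherein the first data page comprises first data stored in a row-based storage format and a first group of offsets, the first group of offsets indicating the offset of each row of the first data; and
    the second data page comprises second data stored in a row-based storage format and a second group of offsets, the second group of offsets indicating the offset of each row of the second data, wherein the second data is obtained by performing preprocessing on the first data, the second group of offsets is obtained by performing the preprocessing on the first group of offsets, and the preprocessing comprises byte-level row-column conversion.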
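  19. The apparatus according to claim 18, wherein the processing unit is further specifically configured to:
    obtain the first data and the first group of offsets separately from the first data page;
    perform the preprocessing on the first data byte by byte to obtain the second data;
    perform the preprocessing on the first group of offsets byte by byte to obtain the second group of offsets; and
    obtain the second data page according to the second data and the second group of offsets.
  20. The apparatus according to claim 18 or 19, wherein the preprocessing further comprises byte-level differencing, and the differencing comprises computing differences between column data.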
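  21. The apparatus according to any one of claims 18 to 20, wherein the processing unit is further configured to:
    reorganize a plurality of consecutive third data pages having the same structure to obtain the first data page;
    wherein each third data page comprises fourth data stored in a row-based storage format and a fourth group of offsets, the fourth group of offsets indicating the offset of each row of the fourth data; the first data comprises the plurality of pieces of fourth data corresponding to the plurality of third data pages, the plurality of pieces of fourth data having the same maximum row length; and the first group of offsets comprises the plurality of fourth groups of offsets corresponding to the plurality of third data pages.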
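  22. The apparatus according to any one of claims 18 to 21, wherein the processing unit is further configured to:
    decompress the compressed data page to obtain the second data page; and
    obtain the first data page according to the second data page, wherein the first data is obtained by performing the preprocessing on the second data, and the first group of offsets is obtained by performing the preprocessing on the second group of offsets.
  23. The apparatus according to any one of claims 18 to 22, wherein the processing unit is further configured to:
    split the first data page to obtain the plurality of third data pages.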
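  24. A data page processing apparatus, wherein the apparatus comprises a processor and a memory; the memory is configured to store a computer program; and the processor is configured to execute the computer program stored in the memory, so that the apparatus performs the method according to any one of claims 1 to 17.
  25. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program runs on a computer, the computer is caused to perform the method according to any one of claims 1 to 17.
  26. A chip system, comprising a processor configured to invoke a computer program from a memory and run the computer program, so that an apparatus on which the chip system is installed performs the method according to any one of claims 1 to 17.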
PCT/CN2022/137287 2022-05-11 2022-12-07 Data page processing method and apparatus thereof WO2023216575A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2022112514 2022-05-11
RU2022112514 2022-05-11

Publications (1)

Publication Number Publication Date
WO2023216575A1 (zh) 2023-11-16

Family

ID=88729587

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137287 WO2023216575A1 (zh) 2022-05-11 2022-12-07 Data page processing method and apparatus thereof

Country Status (1)

Country Link
WO (1) WO2023216575A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012004636A (ja) * 2010-06-14 2012-01-05 Yokogawa Electric Corp Data compression device and data restoration device
US8239421B1 (en) * 2010-08-30 2012-08-07 Oracle International Corporation Techniques for compression and processing optimizations by using data transformations
US20150178305A1 (en) * 2013-12-23 2015-06-25 Ingo Mueller Adaptive dictionary compression/decompression for column-store databases
CN110990402A (zh) * 2019-11-26 2020-04-10 中科驭数(北京)科技有限公司 Format conversion method from row storage to column storage, query method and apparatus
CN113220651A (zh) * 2021-04-25 2021-08-06 暨南大学 Operating data compression method, apparatus, terminal device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIAO YIFAN; CHEN XUBIN; HAO JINGPENG; LI JIANGPENG; WU QI; WANG JINGQIANG; LIU YANG; ZHANG TONG: "Improving Relational Database Upon the Arrival of Storage Hardware with Built-in Transparent Compression", 2021 IEEE INTERNATIONAL CONFERENCE ON NETWORKING, ARCHITECTURE AND STORAGE (NAS), IEEE, 24 October 2021 (2021-10-24), pages 1 - 9, XP034027217, DOI: 10.1109/NAS51552.2021.9605481 *

Similar Documents

Publication Publication Date Title
US11210318B1 (en) Partitioned distributed database systems, devices, and methods
US20200117510A1 (en) Data set compression within a database system
US8645337B2 (en) Storing compression units in relational tables
US7454403B2 (en) Method and mechanism of improving performance of database query language statements using data duplication information
US9507811B2 (en) Compressed data page with uncompressed data fields
CN103326732A (zh) Method for compressing data, method for decompressing data, encoder and decoder
JP7426907B2 (ja) Advanced database decompression
US20100278446A1 (en) Structure of hierarchical compressed data structure for tabular data
US11050436B2 (en) Advanced database compression
US10719450B2 (en) Storage of run-length encoded database column data in non-volatile memory
CN111046034A (zh) Method and system for managing in-memory data and maintaining data in memory
CA2103445A1 (en) Data compression using multiple levels
CN101271478B (zh) Compressed storage method for a read-only point-of-interest database based on cluster partitioning
EP1504377A2 (en) Storing and querying relational data in compressed storage format
CN107729406B (zh) Data classification storage method and apparatus
US20160233880A1 (en) Data compression apparatus and data decompression apparatus
TWI720086B (zh) Reduction of audio data and data stored on a block processing storage system
US20240126762A1 (en) Creating compressed data slabs that each include compressed data and compression information for storage in a database system
US10601442B2 (en) Memory compression method and apparatus
US20210326320A1 (en) Data segment storing in a database system
WO2023216575A1 (zh) Data page processing method and apparatus thereof
CN115774699B (zh) Database shared dictionary compression method, apparatus, electronic device and storage medium
US9424293B2 (en) Row, table, and index compression
CN109271463B (zh) Method for recovering InnoDB compressed data of a MySQL database
CN103049388B (zh) Compression management method and apparatus for a paged storage device

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22941510

Country of ref document: EP

Kind code of ref document: A1