CN117118453A - Data compression storage method based on big data - Google Patents

Data compression storage method based on big data

Info

Publication number
CN117118453A
CN117118453A (application CN202311161037.8A)
Authority
CN
China
Prior art keywords
data
compression
compressed
character
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202311161037.8A
Other languages
Chinese (zh)
Inventor
经伯涛
刘飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Guowei Yixin Technology Co., Ltd.
Original Assignee
Shenzhen Guowei Yixin Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Guowei Yixin Technology Co., Ltd.
Priority to CN202311161037.8A
Publication of CN117118453A
Legal status: Withdrawn

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70Type of the data to be coded, other than image and sound
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the application provides a data compression storage method based on big data, and belongs to the technical field of computers. The method comprises the following steps: receiving a data compression signal, wherein the data compression signal comprises a data page to be compressed, and the data page comprises at least one piece of data to be compressed and a compression identifier of the data to be compressed; performing data compression on data to be compressed according to a pre-trained compression model to obtain a data compression set, wherein the data compression set comprises a plurality of character type subsets, and the character type subsets are used for storing character information obtained after the data to be compressed is subjected to data compression; constructing a target entropy coder according to the compression identifier and the character type subset; and encoding the data compression set according to the target entropy encoder to obtain a compression result. The embodiment of the application can improve the compression rate of data compression.

Description

Data compression storage method based on big data
Technical Field
The application relates to the field of big data, in particular to a data compression storage method based on big data.
Background
At present, in a big data system, data is compressed and then stored on a storage medium, so that the cost of the storage medium can be saved, and the performance of a database can be improved. However, the existing data compression method has a low compression rate for data. Therefore, how to increase the compression rate of data compression becomes a technical problem to be solved.
Disclosure of Invention
The embodiment of the application mainly aims to provide a data compression storage method based on big data, which can improve the compression rate of data compression.
To achieve the above object, a first aspect of an embodiment of the present application provides a data compression storage method based on big data, the method comprising the following steps:
receiving a data compression signal, wherein the data compression signal comprises a data page to be compressed, and the data page comprises at least one piece of data to be compressed and a compression identifier of the data to be compressed;
performing data compression on the data to be compressed according to a pre-trained compression model to obtain a data compression set, wherein the data compression set comprises a plurality of character type subsets, and the character type subsets are used for storing character information obtained after the data to be compressed is subjected to data compression;
constructing a target entropy coder according to the compression identifier and the character type subset;
and encoding the data compression set according to the target entropy encoder to obtain a compression result.
In some embodiments, before the data to be compressed is compressed according to the pre-trained compression model to obtain a data compression set, the method further includes constructing a compression model, which specifically comprises the following steps: constructing a training data set according to the data page, wherein the training data set comprises a plurality of pieces of data to be compressed; performing character division on the data to be compressed according to a preset character length to obtain a plurality of candidate character strings; carrying out hash mapping on the data to be compressed according to each candidate character string to obtain a mapping result and a mapping frequency of the mapping result; performing numerical comparison on all the mapping frequencies to determine a reference character string; and constructing the compression model according to the reference character string.
In some embodiments, the constructing a target entropy encoder according to the compression identifier and the character type subset comprises: if the compression identifier is not the ending identifier, repeatedly performing data compression on the data to be compressed according to the pre-trained compression model to obtain a data compression set.
In some embodiments, the constructing a target entropy encoder according to the compression identifier and the character type subset further comprises: if the compression identifier is identified as the ending identifier, constructing a target entropy encoder according to the character type subset.
In some embodiments, the character type subset comprises an uncompressed character subset and a matched character subset, and if the compression identifier is identified as the ending identifier, constructing a target entropy encoder according to the character type subset comprises the following steps: if the compression identifier is identified as the ending identifier, performing frequency statistics on character information in the uncompressed character subset to obtain a first frequency statistics result; constructing a first entropy encoder according to the first frequency statistics result; performing frequency statistics on character information in the matched character subset to obtain a second frequency statistics result; constructing a second entropy encoder according to the second frequency statistics result;
and obtaining the target entropy coder according to the first entropy coder and the second entropy coder.
In some embodiments,
the encoding the data compression set according to the target entropy encoder to obtain a compression result, including: encoding the uncompressed character subset according to the first entropy encoder to obtain a first encoded data stream; encoding the matched character subset according to the second entropy encoder to obtain a second encoded data stream; and obtaining a compression result according to the first coded data stream and the second coded data stream.
In some embodiments, the obtaining a compression result according to the first encoded data stream and the second encoded data stream comprises the following steps:
acquiring encoder metadata of the target entropy encoder and a reference character string of the compression model;
obtaining a target compressed data stream according to the first coded data stream and the second coded data stream;
and obtaining a compression result according to the encoder metadata, the reference character string and the target compressed data stream.
In order to achieve the above object, a second aspect of an embodiment of the present application provides a data compression storage device based on big data, the device comprising:
a compressed signal receiving module, configured to receive a data compression signal, wherein the data compression signal comprises a data page to be compressed, and the data page to be compressed comprises at least one piece of data to be compressed and a compression identifier of the data to be compressed;
a data compression module, configured to perform data compression on the data to be compressed according to a pre-trained compression model to obtain a data compression set, wherein the data compression set comprises a plurality of character type subsets, and the character type subsets are used for storing character information obtained after the data to be compressed is subjected to data compression;
an encoder construction module, configured to construct a target entropy encoder according to the compression identifier and the character type subset;
an encoding module, configured to encode the data compression set according to the target entropy encoder to obtain a compression result.
To achieve the above object, a third aspect of embodiments of the present application proposes a computer device, comprising:
at least one memory;
at least one processor;
at least one computer program;
wherein the at least one computer program is stored in the at least one memory, and the at least one processor executes the at least one computer program to implement the data compression method of the first aspect described above.
In order to achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program, the computer program being for causing a computer to execute the data compression method described in the first aspect.
The embodiment of the application provides a data compression storage method based on big data, a data compression apparatus, a computer device and a storage medium. A data compression signal comprising a data page to be compressed is received, wherein the data page to be compressed comprises at least one piece of data to be compressed and a compression identifier of the data to be compressed. Data compression is performed on the data to be compressed according to the pre-trained compression model to obtain a data compression set, wherein the data compression set comprises a plurality of character type subsets, and each character type subset is used for storing character information obtained after the data to be compressed is subjected to data compression. A target entropy encoder is constructed according to the compression identifier and the character type subset, and the data compression set is encoded according to the obtained target entropy encoder to obtain a compression result. The embodiment of the application can improve the compression rate of data compression.
Drawings
FIG. 1 is a first flowchart of a data compression method according to an embodiment of the present application;
FIG. 2 is a second flowchart of a data compression method provided by an embodiment of the present application;
FIG. 3 is a third flowchart of a data compression method according to an embodiment of the present application;
Fig. 4 is a flowchart of step S140 in fig. 1;
fig. 5 is a flowchart of step S430 in fig. 4;
FIG. 6 is a structural flow chart of a data compression method provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a compression dictionary structure according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data compression device according to an embodiment of the present application;
Fig. 9 is a schematic hardware structure of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
First, several nouns involved in the present application are parsed:
Page (page): the most basic data unit handled by the storage management component of the software; that is, the smallest unit of data that the disk space manager processes when operating on external storage.
Metadata (Metadata): also called intermediate data or relay data; data that describes other data, mainly describing data attribute information, and used to support functions such as indicating storage locations, recording history, resource searching and file recording.
Entropy encoder: an encoder constructed according to an entropy coding method, a technique for lossless data compression in which each symbol in a piece of text is replaced by a bit sequence of varying length, with more frequent symbols receiving shorter codes.
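As a concrete illustration of this replacement of symbols by variable-length bits, the following is a minimal Huffman-style code construction in Python. It is a generic sketch of entropy coding, not the specific coder of this application; the function name and example string are assumptions.

```python
# Minimal Huffman-code construction, illustrating how an entropy coder
# assigns shorter bit strings to more frequent symbols. Generic sketch
# only; not the specific coder described in this application.
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    freq = Counter(text)
    # Each heap entry: (cumulative frequency, tie-breaker, symbol -> code map).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prefix "0" onto the left subtree's codes and "1" onto the right's.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")          # frequent 'a' gets the shortest code
encoded = "".join(codes[ch] for ch in "aaaabbc")
```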
At present, data is compressed and then stored on the storage medium, which can save storage-medium cost and improve database performance. In big data systems, data is typically written to a storage medium at a page (8 KB) granularity after being organized within memory. Before a page is written to the storage medium, its data can be compressed in roughly two ways: first, the data of the whole page is compressed as a whole, and the compressed data is written to the storage medium; second, each piece of data in the page is compressed separately, and the compressed data is then written to the storage medium in a compact arrangement. In addition, existing big data systems mostly use off-the-shelf compression algorithms to compress the data of the entire page; for example, the openGauss database supports compression algorithms such as lz4 and zstd, and PostgreSQL databases may also support compression algorithms such as lz4 and zstd. When the zstd compression algorithm is adopted to compress the data of a database, the existing compression approach based on an entropy coder has certain disadvantages that lower the compression rate of the data: (1) a corresponding entropy coder is first constructed according to the entropy coder metadata already present in the zstd dictionary, and data compression is then performed according to the compression algorithm and that entropy coder; (2) the method needs to store entropy coder metadata in each compressed data stream, and corresponding metadata is also needed in the database, which reduces the compression rate of the data. Therefore, how to increase the compression rate of data compression becomes a technical problem to be solved.
Based on the above, the embodiment of the application provides a data compression storage method based on big data, which can improve the compression rate of data compression.
The data compression method provided by the embodiment of the application can be applied to a terminal, can be applied to a server side, and can also be software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, notebook, desktop, etc.; the server side can be configured as an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms; the software may be an application or the like that implements the data compression method, but is not limited to the above forms.
The application is operational with numerous general purpose or special purpose computer system environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network personal computers (Personal Computer, PC), minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Referring to fig. 1, fig. 1 is an optional flowchart of a data compression method according to an embodiment of the present application, where the method in fig. 1 may specifically include, but is not limited to, steps S110 to S140, and these four steps are described in detail below in connection with fig. 1.
Step S110, receiving a data compression signal, wherein the data compression signal comprises a data page to be compressed, and the data page comprises at least one piece of data to be compressed and a compression identifier of the data to be compressed;
step S120, data compression is carried out on data to be compressed according to a pre-trained compression model, a data compression set is obtained, the data compression set comprises a plurality of character type subsets, and the character type subsets are used for storing character information obtained after the data to be compressed are subjected to data compression;
step S130, constructing a target entropy coder according to the compression identifier and the character type subset;
and step S140, encoding the data compression set according to the target entropy encoder to obtain a compression result.
It can be appreciated that in steps S110 to S140 of some embodiments, embodiments of the present application receive a data compression signal including a data page to be compressed, where the data page to be compressed includes at least one piece of data to be compressed and a compression identifier of the data to be compressed. And carrying out data compression on the data to be compressed according to the pre-trained compression model to obtain a data compression set, wherein the data compression set comprises a plurality of character type subsets, and each character type subset is used for storing character information obtained after the data to be compressed is subjected to data compression. And constructing a target entropy coder according to the compression identifier and the character type subset, and coding the data compression set according to the obtained target entropy coder to obtain a compression result. According to the embodiment of the application, the more accurate entropy coder is obtained by counting the character information of all data to be compressed, so that the compression rate of data compression is improved.
It should be noted that, the data compression method provided in the embodiment of the present application may be applied to a client or a server, and the calling end may also be a client or a server, which is not limited herein.
In step S110 of some embodiments, when the calling end needs to perform data compression, a data compression signal needs to be sent to a client or a server that performs the data compression method. When the data compression method is executed as the server, a data compression signal is received by the server, the data compression signal comprises data pages to be compressed, the data pages to be compressed comprise at least one piece of data to be compressed and compression identifiers of the data to be compressed, and the compression identifiers are used for marking the data state of each piece of data to be compressed.
It should be noted that, the calling end may input the data page to be compressed by calling a preset data compression interface to generate a corresponding data compression signal.
It should be noted that the data to be compressed may be in the form of a character string to be compressed.
In step S120 of some embodiments, each piece of data to be compressed of a data page to be compressed is input into a pre-trained compression model for data compression, so as to obtain a plurality of relevant character information of the data to be compressed, and the character information is correspondingly stored in a preset character type subset according to the type of the character information, so as to obtain a data compression set after data compression, thereby facilitating statistics storage of different character information.
It should be noted that after data compression is performed on a piece of data to be compressed, the content of the data compression set is correspondingly updated, and meanwhile, the data compression set is cached in the memory.
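As a concrete illustration of how the character type subsets might be organized and updated in memory, the following Python sketch models the data compression set; all names, types and example values are assumptions made for illustration, not structures taken from this application.

```python
# Hypothetical in-memory layout for the data compression set: character
# information is routed into a subset keyed by its type, and the whole
# set is updated (and cached) after each piece of data is compressed.
from dataclasses import dataclass, field

@dataclass
class DataCompressionSet:
    # Maps a character-information type (e.g. "unmatched", "match_length")
    # to the character information collected so far for that type.
    subsets: dict = field(default_factory=dict)

    def add(self, info_type: str, info) -> None:
        self.subsets.setdefault(info_type, []).append(info)

compression_set = DataCompressionSet()
compression_set.add("unmatched", "x")      # a literal that could not be matched
compression_set.add("match_length", 8)     # the length of a dictionary match
```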
In step S130 of some embodiments, the compression identifier is used to mark a data state of each piece of data to be compressed, and when it is determined that the compression identifier of the data to be compressed currently performs data compression meets a preset data state, a target entropy encoder is constructed according to the character type subset of the obtained data compression set, and the target entropy encoder is used to perform entropy encoding compression on the character type subset obtained by compression so as to improve the compression rate of data compression.
In step S140 of some embodiments, the character information corresponding to the subset of character types in the data compression set is compression-encoded again according to the obtained target entropy encoder, so as to obtain a better compression rate.
Referring to fig. 2, fig. 2 is another optional flowchart of a data compression method according to an embodiment of the present application. In some embodiments of the present application, before step S120, the method according to the embodiment of the present application further includes: constructing the compression model, specifically including but not limited to steps S210 to S250, which are described in detail below in conjunction with fig. 2.
Step S210, constructing a training data set according to the data page, wherein the training data set comprises a plurality of pieces of data to be compressed;
step S220, performing character division on the data to be compressed according to a preset character length to obtain a plurality of candidate character strings;
step S230, hash mapping is carried out on the data to be compressed according to each candidate character string, and a mapping result and the mapping frequency of the mapping result are obtained;
step S240, comparing the values of all the mapping frequencies to determine a reference character string;
step S250, constructing a compression model according to the reference character string.
In step S210 of some embodiments, to better implement finding and compressing the duplicate data, first, a training data set is constructed from the data page to be compressed, the training data set including a plurality of pieces of data to be compressed.
It should be noted that, in some embodiments, all data to be compressed may be divided into two parts: the data to be compressed in the first part is training data, used to construct a training data set from which a compression model is trained; the data to be compressed in the second part is verification data, used to verify the validity of the generated compression model. The specific division between the two parts can be set according to actual requirements and is not limited herein; alternatively, no verification data may be set, and all of the data to be compressed is used to construct the training data set.
In step S220 of some embodiments, in order to improve the compression rate of the repeated data to a greater extent, the data to be compressed is subjected to character division according to a preset character length, so as to obtain a plurality of candidate character strings. For example, when the preset character length is eight bytes, the data to be compressed may be character-divided in units of eight bytes, and space filling may be performed on character strings of less than eight bytes to obtain a plurality of candidate character strings having the same character length. The embodiment of the application is not limited to the preset character length of eight bytes, and can be flexibly adjusted according to actual requirements.
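A minimal Python sketch of this division step follows; the function name and example input are assumptions, and the space filling of the final short chunk matches the behavior described above.

```python
# Split data to be compressed into fixed-length candidate strings,
# space-filling the last chunk so all candidates share one length.
def divide_into_candidates(data: str, length: int = 8) -> list:
    chunks = [data[i:i + length] for i in range(0, len(data), length)]
    if chunks and len(chunks[-1]) < length:
        chunks[-1] = chunks[-1].ljust(length)  # pad with spaces to full length
    return chunks

candidates = divide_into_candidates("SELECT id FROM users", 8)
# ['SELECT i', 'd FROM u', 'sers    ']
```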
In order to obtain the character length with the best dividing effect, the character length may be set to four bytes, six bytes, eight bytes, etc., and corresponding candidate compression models may be constructed according to each character length, each candidate compression model may be verified according to the data to be compressed in the training data set, and the candidate compression model with the highest compression rate may be determined to be the required compression model, and then the character length corresponding to the candidate compression model may be the required character length.
In step S230 of some embodiments, hash-mapping the divided candidate strings with the data to be compressed of the training data set sequentially, specifically, a corresponding counter may be set for each candidate string, when one candidate string hashes one data to be compressed, when the mapping result is successful, the counter is correspondingly increased by 1, and when the mapping result is unsuccessful, hash-mapping is continued backward until the mapping of the data to be compressed is completed. And then, continuing to carry out hash mapping on the next data to be compressed, and when mapping is completed on all the data to be compressed in the training data set, setting the value of the counter corresponding to the candidate character string as the mapping frequency of the mapping result.
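The per-candidate counting can be sketched as follows in Python; plain substring occurrence counting stands in for the hash probing described above, and all names are illustrative assumptions.

```python
# For each candidate string, count how often it is found across all
# pieces of training data; the final counts are the mapping frequencies.
from collections import Counter

def mapping_frequencies(candidates: list, training_data: list) -> Counter:
    freq = Counter()
    for candidate in candidates:
        for record in training_data:
            freq[candidate] += record.count(candidate)  # counter += per hit
    return freq
```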
In step S240 of some embodiments, the mapping frequency corresponding to each candidate character string is compared in numerical value, and the candidate character string with the highest mapping frequency is selected as the reference character string. Specifically, data compression is performed on data to be compressed, and data compression needs to be performed with reference to a compression dictionary, wherein the reference character string obtained through training is dictionary content of the compression dictionary, and the dictionary content is used for representing character strings with higher occurrence frequency in the data to be compressed.
In step S250 of some embodiments, after the reference string is determined, a compression model is constructed from the reference string and a preset data compression algorithm. For example, a data compression algorithm such as deflate, gzip, brotli may be used, and will not be described in detail herein.
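Putting steps S240 and S250 together, a sketch under stated assumptions: the highest-frequency candidate is chosen by numerical comparison, and zlib's deflate with a preset dictionary stands in for a compression model built around the reference character string (the text above only names deflate, gzip and brotli as possible algorithms).

```python
# Pick the reference string (S240) and use it as a preset dictionary for
# a deflate compressor (S250). zlib is an illustrative stand-in here.
import zlib

def build_reference_string(freq: dict) -> bytes:
    return max(freq, key=freq.get).encode()   # numerical comparison, S240

def compress_with_model(data: bytes, dictionary: bytes) -> bytes:
    comp = zlib.compressobj(level=9, zdict=dictionary)  # deflate + preset dict
    return comp.compress(data) + comp.flush()

freq = {"SELECT i": 12, "d FROM u": 3}                  # assumed frequencies
dictionary = build_reference_string(freq)               # b'SELECT i'
compressed = compress_with_model(b"SELECT id FROM users;", dictionary)
```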
In some embodiments, step S130 may include: and if the compression identifier is not the ending identifier, repeatedly executing data compression on the data to be compressed according to the pre-trained compression model to obtain a data compression set.
Specifically, according to the embodiment of the application, a data compression set of data to be compressed is cached in a memory, the next piece of data to be compressed is continuously subjected to data compression, when a compression model recognizes that the compression identifier of the current data to be compressed is not an ending identifier, the current data to be compressed is continuously subjected to data compression according to a pre-trained compression model, and the data compression set is updated according to the content of the data after the data compression.
It should be noted that, the compression identifier may be a boolean flag, that is, the compression identifier may be false or true, where the compression identifier is false, which indicates that the data page to be compressed is not yet transmitted, that is, there is data to be compressed behind the data to be compressed, and the compression identifier is true, which indicates that the data in the data page to be compressed is already transmitted to the compression model.
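The control flow around the Boolean compression identifier can be sketched as follows; the helper bodies are placeholders, and only the loop structure mirrors the behavior described above.

```python
# Fold records into the cached compression set until the end identifier
# (True) arrives; only then is the target entropy encoder built.
def update_compression_set(compression_set: dict, data: str) -> None:
    compression_set.setdefault("records", []).append(data)   # placeholder body

def build_target_entropy_encoder(compression_set: dict) -> str:
    n = len(compression_set.get("records", []))
    return f"encoder built over {n} records"                  # placeholder body

def compress_page(records: list):
    compression_set: dict = {}               # cached in memory per the text
    for data, is_end in records:             # is_end: the Boolean identifier
        if is_end:                           # end identifier: stop and build
            return build_target_entropy_encoder(compression_set)
        update_compression_set(compression_set, data)
    return None                              # page not yet fully transmitted

compress_page([("row1", False), ("row2", False), ("", True)])
```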
In some embodiments, step S130 may further include: if the compression identifier is identified as the end identifier, a target entropy encoder is constructed according to the character type subset.
Specifically, in order to improve the compression rate of the data to be compressed by the entropy encoder, when the compression model recognizes that the compression identifier of the current data to be compressed is an end identifier, and indicates that the data input of the data page to be compressed is completed, the target entropy encoder is constructed according to the obtained character type subset. According to the embodiment of the application, the target entropy encoder with higher compression rate can be obtained according to the character type subset obtained by counting all the data to be compressed, so that the compression rate of data compression is improved.
It should be noted that the data to be compressed whose compression identifier is the end identifier may be set to be empty, and when the compression identifier is identified as the end identifier, the subsequent operation is directly performed.
Referring to fig. 3, fig. 3 is another optional flowchart of a data compression method provided in an embodiment of the present application, in some embodiments, the character type subset includes an uncompressed character subset and a matched character subset, and step S130 provided in the embodiment of the present application may include, but is not limited to, steps S310 to S350 when the compressed identifier is identified as the end identifier, and the following description will describe these five steps in detail with reference to fig. 3.
Step S310, if the compression identifier is identified as an end identifier, carrying out frequency statistics on character information in the uncompressed character subset to obtain a first frequency statistics result;
step S320, constructing a first entropy coder according to the first frequency statistic result;
step S330, obtaining a second frequency statistics result by performing frequency statistics on character information in the matched character subset;
step S340, constructing a second entropy coder according to the second frequency statistic result;
step S350, obtaining a target entropy encoder according to the first entropy encoder and the second entropy encoder.
Specifically, if the compression identifier is identified as the end identifier, frequency statistics is performed on all character information in the uncompressed character subset to obtain a first frequency statistics result, and a first entropy encoder is constructed according to the first frequency statistics result. Wherein the uncompressed character subset is used to represent a set of character information that does not match the reference character string, the first entropy encoder corresponding to the character information in the uncompressed character subset. And meanwhile, carrying out frequency statistics on character information in the matched character subset to obtain a second frequency statistics result, and constructing a second entropy encoder according to the second frequency statistics result. Wherein the matched character subset is used to represent a set of character information matched to the reference character string, the second entropy encoder corresponding to the character information in the matched character subset. Finally, the target entropy encoder is constructed according to the first entropy encoder and the second entropy encoder, and a plurality of input ports can be arranged on the target entropy encoder, and each input port corresponds to one entropy encoder, so that the data compression rate of one data page to be compressed is better improved.
It should be noted that, in some embodiments, when the compression model is constructed by using the deflate data compression algorithm, the character information in the uncompressed character subset may include the unmatched character information of the uncompressed character subset and the first character length of the uncompressed character subset, and the character information in the matched character subset may include the second character length of the matched character set and the relative character length of the matched character subset. The unmatched character information is used for representing specific character information which cannot be compressed after the character information passes through the compression model, the first character length is used for representing the length of specific characters which cannot be compressed after the character information passes through the compression model, the second character length is used for representing the length of characters which can be completely matched after the character information passes through the compression model, and the relative character length is used for representing the distance between the character length of a reference character string which can be completely matched after the character information passes through the compression model and the current character.
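A minimal Python sketch of steps S310 to S350 follows: frequency statistics are taken separately over the two subsets, and each statistics result parameterizes its own coder. The EntropyCoder class and the example subset contents are assumptions for illustration; a real coder would derive variable-length codes from the frequencies.

```python
# Separate frequency statistics feed two coders; the target entropy
# encoder is modeled as the pair, one coder per input port.
from collections import Counter

class EntropyCoder:
    def __init__(self, stats: Counter):
        self.stats = stats
        total = sum(stats.values())
        # Symbol probabilities; code lengths would be derived from these.
        self.prob = {s: f / total for s, f in stats.items()}

uncompressed_subset = list("raw literal chars")     # unmatched character info
matched_subset = [8, 8, 16, 8]                      # e.g. matched lengths

first_coder = EntropyCoder(Counter(uncompressed_subset))   # S310-S320
second_coder = EntropyCoder(Counter(matched_subset))       # S330-S340
target_entropy_encoder = (first_coder, second_coder)       # S350
```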
Referring to fig. 4, fig. 4 is an optional flowchart of step S140 provided in an embodiment of the present application, and in some embodiments, step S140 may include, but is not limited to, steps S410 to S430, and these three steps are described in detail below in connection with fig. 4.
Step S410, encoding the uncompressed character subset according to a first entropy encoder to obtain a first encoded data stream;
step S420, encoding the matched character subset according to a second entropy encoder to obtain a second encoded data stream;
step S430, obtaining a compression result according to the first coded data stream and the second coded data stream.
Specifically, in order to obtain a more accurate compression result, character information in each character type subset is encoded through a corresponding entropy encoder, namely, an uncompressed character subset is encoded according to a first entropy encoder to obtain a first encoded data stream, and a matched character subset is encoded according to a second entropy encoder to obtain a second encoded data stream. And finally, obtaining a compression result according to the first coded data stream and the second coded data stream.
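Steps S410 to S430 can be sketched as follows, with toy code tables standing in for the first and second entropy encoders; the tables and inputs are assumptions.

```python
# Encode each subset with its own code table, then combine the two
# encoded streams into one result.
def encode_subset(subset, codes: dict) -> str:
    return "".join(codes[s] for s in subset)

literal_codes = {"a": "0", "b": "10", "c": "11"}    # assumed first encoder
length_codes = {8: "0", 16: "1"}                    # assumed second encoder

first_stream = encode_subset("abac", literal_codes)       # "010011"
second_stream = encode_subset([8, 16, 8], length_codes)   # "010"
compression_result = first_stream + second_stream
```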
Referring to fig. 5, fig. 5 is an optional flowchart of step S430 provided in an embodiment of the present application. In some embodiments, step S430 may include, but is not limited to, steps S510 to S530, which are described in detail below in conjunction with fig. 5.
Step S510, obtaining encoder metadata of a target entropy encoder and a reference character string of a compression model;
Step S520, obtaining a target compressed data stream according to the first coded data stream and the second coded data stream;
in step S530, a compression result is obtained according to the encoder metadata, the reference string, and the target compressed data stream.
In steps S510 to S530 of some embodiments, the corresponding encoded data stream is obtained by using the entropy encoder, and the encoder metadata of the target entropy encoder is obtained at the same time, where the encoder metadata is used to obtain the encoded information corresponding to each character information when decoding the compression result, so as to quickly complete decoding. Specifically, the encoder metadata of the target entropy encoder and the reference character string of the compression model are obtained, and the first encoded data stream and the second encoded data stream are subjected to character stitching to obtain the target compressed data stream. And combining the reference character string with the metadata of the encoder according to a preset compression dictionary structure to obtain the target compression dictionary because the reference character string obtained through training is the dictionary content of the compression dictionary. And obtaining a compression result according to the target compression dictionary and the target compressed data stream. And returning the obtained compression result to the calling port to finish data compression.
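A sketch of steps S510 to S530 under assumptions: the two encoded streams are spliced into the target compressed data stream, and the encoder metadata and reference character string are packaged together with it. The length-prefixed byte layout is an illustrative choice, not one specified here.

```python
# Splice the encoded streams, then length-prefix each component so a
# decoder can split the compression result back apart.
def assemble_result(encoder_metadata: bytes, reference_string: bytes,
                    first_stream: bytes, second_stream: bytes) -> bytes:
    target_stream = first_stream + second_stream    # character stitching
    parts = [encoder_metadata, reference_string, target_stream]
    return b"".join(len(p).to_bytes(4, "big") + p for p in parts)
```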
As shown in fig. 6, the data to be compressed in the data page to be compressed is a string to be compressed, and when the strings to be compressed 1 to n are compressed by using a compression model constructed by the deflate algorithm, specifically, the data to be compressed is first compressed according to a reference string in the compression model constructed by the deflate algorithm, so as to obtain a data compression set. The data compression set comprises four character type subsets, namely a first character subset, a second character subset, a third character subset and a fourth character subset, wherein the first character subset corresponds to the set of unmatched character information in the character strings 1 to n to be compressed, the second character subset corresponds to the set of first character lengths in the character strings 1 to n to be compressed, the third character subset corresponds to the set of second character lengths in the character strings 1 to n to be compressed, and the fourth character subset corresponds to the set of relative character lengths in the character strings 1 to n to be compressed. Accordingly, the target entropy encoder corresponding to the compression model is constructed as follows: if the compression identifier is identified as the end identifier, a first sub-result of the first frequency statistics result is obtained by performing frequency statistics on the information of the first character subset; a second sub-result of the first frequency statistics result is obtained by performing frequency statistics on the information of the second character subset; a third sub-result of the second frequency statistics result is obtained by performing frequency statistics on the information of the third character subset; and a fourth sub-result of the second frequency statistics result is obtained by performing frequency statistics on the information of the fourth character subset. Then, corresponding entropy encoders are respectively constructed according to the different sub-results, namely a third entropy encoder is constructed according to the first sub-result, a fourth entropy encoder is constructed according to the second sub-result, a fifth entropy encoder is constructed according to the third sub-result, and a sixth entropy encoder is constructed according to the fourth sub-result; finally, the target entropy encoder is obtained according to the third entropy encoder, the fourth entropy encoder, the fifth entropy encoder and the sixth entropy encoder. A third encoded data stream is obtained through the third entropy encoder according to the unmatched character information of the first character subset; a fourth encoded data stream is obtained through the fourth entropy encoder according to the information of the first character length of the second character subset; a fifth encoded data stream is obtained through the fifth entropy encoder according to the information of the second character length of the third character subset; and a sixth encoded data stream is obtained through the sixth entropy encoder according to the information of the relative character length of the fourth character subset. Finally, data stream splicing is performed on the third, fourth, fifth and sixth encoded data streams to obtain the target compressed data stream.
It should be noted that, as shown in fig. 7, the preset compression dictionary structure includes a dictionary header portion, a dictionary content portion, and an entropy encoder portion, where the dictionary content portion is used to store the content of the reference string, and the entropy encoder portion is used to store the encoder metadata of the target entropy encoder.
It should be noted that the dictionary header part comprises an enteropySize variable and an enteropyOffset variable, where the enteropySize variable represents the size of the entropy coding and the enteropyOffset variable represents the position of the entropy coding within the dictionary.
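A possible byte layout for this compression dictionary is sketched below, keeping the header field names as spelled in the text; the field widths and ordering are assumptions for illustration.

```python
# Pack the dictionary of fig. 7: an 8-byte header (enteropySize,
# enteropyOffset), the dictionary content, then the entropy coder part.
import struct

def pack_dictionary(reference_string: bytes, encoder_metadata: bytes) -> bytes:
    enteropy_size = len(encoder_metadata)          # size of the entropy coding
    enteropy_offset = 8 + len(reference_string)    # coder position in the dict
    header = struct.pack(">II", enteropy_size, enteropy_offset)
    return header + reference_string + encoder_metadata
```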
After receiving the compression result, the calling end may store the compression result in a disk or other memory, and when the compression result needs to be decompressed, specifically, the calling end obtains a target compression dictionary and a target compression data stream according to the obtained compression content, can obtain a reference character string and encoder metadata according to the target compression dictionary, and decompresses the target compression data stream according to the reference character string and the encoder metadata to obtain a data page to be compressed.
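The mirror-image decompression path can be sketched as follows, matching the assumed layout above; decode_stream is a hypothetical helper standing in for the entropy decoding itself.

```python
# Recover the reference string and encoder metadata from the dictionary,
# then hand both to a decoder for the target compressed data stream.
import struct

def unpack_dictionary(dictionary: bytes):
    enteropy_size, enteropy_offset = struct.unpack(">II", dictionary[:8])
    reference_string = dictionary[8:enteropy_offset]
    metadata = dictionary[enteropy_offset:enteropy_offset + enteropy_size]
    return reference_string, metadata

def decompress(dictionary: bytes, target_stream: bytes, decode_stream):
    reference_string, metadata = unpack_dictionary(dictionary)
    return decode_stream(target_stream, metadata, reference_string)
```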
Referring to fig. 8, fig. 8 is a schematic structural diagram of a data compression apparatus according to an embodiment of the present application, which can implement the data compression method according to the above embodiment, and the apparatus includes a compressed signal receiving module 810, a data compression module 820, an encoder constructing module 830 and an encoding module 840.
The compressed signal receiving module 810 is configured to receive a data compressed signal, where the data compressed signal includes a data page to be compressed, and the data page to be compressed includes at least one piece of data to be compressed and a compression identifier of the data to be compressed;
the data compression module 820 is configured to perform data compression on data to be compressed according to a pre-trained compression model to obtain a data compression set, where the data compression set includes a plurality of character type subsets, and the character type subsets are used to store character information obtained by data compression on the data to be compressed;
an encoder construction module 830 for constructing a target entropy encoder from the compression identity and the subset of character types;
and the encoding module 840 is configured to encode the data compression set according to the target entropy encoder, so as to obtain a compression result.
It should be noted that, the data compression device according to the embodiment of the present application is used to implement the data compression method according to the foregoing embodiment, and the specific processing procedure is referred to the foregoing data compression method and will not be described herein.
The embodiment of the application also provides a computer device, which comprises: at least one memory, at least one processor, at least one computer program stored in the at least one memory, the at least one processor executing the at least one computer program to implement the data compression method of any of the above embodiments. The computer equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of a computer device according to another embodiment, the computer device includes:
the processor 910 may be implemented by a general purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application; the memory 920 may be implemented in the form of read-only memory (Read Only Memory, ROM), static storage, dynamic storage, or random access memory (Random Access Memory, RAM). The memory 920 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present disclosure are implemented in software or firmware, the relevant program code is stored in the memory 920 and invoked by the processor 910 to perform the data compression method according to the embodiments of the present disclosure;
an input/output interface 930 for inputting and outputting information;
the communication interface 940 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g., USB, network cable, etc.), or may implement communication in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
A bus 950 for transferring information between components of the device (e.g., processor 910, memory 920, input/output interface 930, and communication interface 940);
wherein processor 910, memory 920, input/output interface 930, and communication interface 940 implement communication connections among each other within the device via a bus 950.
The embodiment of the application also provides a computer readable storage medium storing a computer program for causing a computer to execute the data compression method in the above embodiment.
The embodiment of the application provides a data compression storage method based on big data. First, a data compression signal comprising a data page to be compressed is received, where the data page to be compressed comprises at least one piece of data to be compressed and a compression identifier of the data to be compressed. A training data set is constructed from the data page, the training data set comprising a plurality of pieces of data to be compressed. Character division is performed on the data to be compressed according to a preset character length to obtain a plurality of candidate character strings, and hash mapping is performed on the data to be compressed according to each candidate character string to obtain a mapping result and the mapping frequency of the mapping result. All the mapping frequencies are compared numerically to determine a reference character string, from which a compression model is constructed. Data compression is performed on the data to be compressed according to the compression model to obtain a data compression set, wherein the data compression set comprises a plurality of character type subsets, and each character type subset is used for storing character information obtained after the data to be compressed is subjected to data compression. If the compression identifier is not the ending identifier, data compression of the data to be compressed according to the pre-trained compression model is repeatedly executed to obtain a data compression set; if the compression identifier is identified as the end identifier, frequency statistics is performed on character information in the uncompressed character subset to obtain a first frequency statistics result, and a first entropy encoder is constructed according to the first frequency statistics result. Frequency statistics is performed on character information in the matched character subset to obtain a second frequency statistics result, a second entropy encoder is constructed according to the second frequency statistics result, and the target entropy encoder is then obtained according to the first entropy encoder and the second entropy encoder. Then, the uncompressed character subset is encoded according to the first entropy encoder to obtain a first encoded data stream, and the matched character subset is encoded according to the second entropy encoder to obtain a second encoded data stream. Encoder metadata of the target entropy encoder and the reference character string of the compression model are obtained, a target compressed data stream is obtained according to the first encoded data stream and the second encoded data stream, and a compression result is obtained according to the encoder metadata, the reference character string and the target compressed data stream. In the embodiment of the application, the entropy encoder is not constructed during data compression until the end identifier is recognized; only after the end identifier is recognized is the entropy encoder generated from all of the collected statistics, which ensures that the generated entropy encoder has the best encoding effect and a higher compression rate. Meanwhile, the embodiment of the application retains only one copy of entropy encoder metadata for each data page to be compressed, which can better improve the compression rate of data compression.
In addition, the embodiment of the application outputs the compression result to the calling end only after compressing all the data to be compressed in one data page, thereby reducing CPU load and effectively improving the data compression performance of the device.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1 to 7 do not constitute a limitation of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
The foregoing description of the preferred embodiments of the present application has been presented with reference to the drawings and is not intended to limit the scope of the claims. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (9)

1. A data compression storage method based on big data, the method comprising:
receiving a data compression signal, wherein the data compression signal comprises a data page to be compressed, the data page comprises at least one piece of data to be compressed and a compression identifier of the data to be compressed, and the compression identifier is used for marking the data state of each piece of data to be compressed;
performing data compression on the data to be compressed according to a pre-trained compression model to obtain a data compression set, wherein the data compression set comprises a plurality of character type subsets, the character type subsets are used for storing character information obtained by data compression of the data to be compressed, and the compression model comprises a reference character string;
constructing a target entropy encoder according to the compression identifier and the character type subset;
encoding the data compression set according to the target entropy encoder to obtain a compression result;
wherein said constructing a target entropy encoder according to the compression identifier and the character type subset comprises:
if the compression identifier is identified as an end identifier, the target entropy encoder is constructed according to the character type subset, wherein the character type subset comprises an uncompressed character subset and a matched character subset, the uncompressed character subset is used for representing a set of character information which is not matched with the reference character string, and the matched character subset is used for representing a set of character information which is matched with the reference character string.
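For illustration only, a minimal Python sketch of the split recited in claim 1, assuming a greedy left-to-right scan: characters of the data to be compressed that reproduce the reference character string of the compression model go into the matched character subset, and every remaining character goes into the uncompressed character subset. All names in the sketch are illustrative and are not taken from this application.

def split_against_reference(data: str, reference: str):
    matched, uncompressed = [], []
    i = 0
    while i < len(data):
        # Record a match token whenever the upcoming characters
        # reproduce the reference string of the compression model.
        if data.startswith(reference, i):
            matched.append(reference)
            i += len(reference)
        else:
            # Characters that do not match stay as literals in the
            # uncompressed character subset.
            uncompressed.append(data[i])
            i += 1
    return matched, uncompressed

matched, literals = split_against_reference("abcabcxabc", "abc")
print(matched)   # ['abc', 'abc', 'abc']
print(literals)  # ['x']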
2. The method of claim 1, wherein, before the data compression is performed on the data to be compressed according to the pre-trained compression model to obtain the data compression set, the method further comprises constructing the compression model, which specifically comprises:
constructing a training data set according to the data page, wherein the training data set comprises a plurality of pieces of data to be compressed;
performing character division on the data to be compressed according to a preset character length to obtain a plurality of candidate character strings;
carrying out hash mapping on the data to be compressed according to each candidate character string to obtain a mapping result and mapping frequency of the mapping result;
performing numerical comparison on all the mapping frequencies to determine a reference character string;
and constructing the compression model according to the reference character string.
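A minimal sketch of this model construction under stated assumptions: the candidate character strings are taken with a sliding window of the preset character length (the claim does not say whether the division overlaps), a hash-based Counter stands in for the hash mapping, and the numerical comparison over the mapping frequencies is a simple maximum.

from collections import Counter

def choose_reference_string(training_set, char_length=3):
    # Character division: every substring of the preset character
    # length is treated as a candidate character string.
    frequencies = Counter()
    for data in training_set:
        for i in range(len(data) - char_length + 1):
            frequencies[data[i:i + char_length]] += 1
    # Numerical comparison of all mapping frequencies: keep the maximum.
    reference, _ = frequencies.most_common(1)[0]
    return reference

print(choose_reference_string(["abcabcab", "xxabcabc"]))  # 'abc'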
3. The method of claim 1, wherein said constructing a target entropy encoder according to the compression identifier and the character type subset further comprises:
and if the compression identifier is not identified as the end identifier, repeating the step of performing data compression on the data to be compressed according to the pre-trained compression model to obtain the data compression set.
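Reusing split_against_reference from the sketch after claim 1, the loop below illustrates the control flow of claims 1 and 3: compression repeats piece by piece, and only when a piece carries the end identifier are the accumulated subsets handed on to encoder construction. The identifier value and the page structure are assumptions.

END_IDENTIFIER = "end"  # assumed marker value

def compress_page(data_page, reference):
    matched_all, uncompressed_all = [], []
    for piece, identifier in data_page:  # (data to be compressed, compression identifier)
        m, u = split_against_reference(piece, reference)
        matched_all += m
        uncompressed_all += u
        if identifier == END_IDENTIFIER:  # claim 1: go on to build the encoder
            break                         # claim 3: otherwise repeat compression
    return matched_all, uncompressed_all

page = [("abcab", "pending"), ("cxabc", "end")]
print(compress_page(page, "abc"))  # (['abc', 'abc'], ['a', 'b', 'c', 'x'])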
4. The method of claim 1, wherein the character type subset comprises an uncompressed character subset and a matched character subset, and wherein, if the compression identifier is identified as the end identifier, constructing the target entropy encoder according to the character type subset comprises:
if the compression identifier is identified as the end identifier, carrying out frequency statistics on character information in the uncompressed character subset to obtain a first frequency statistics result;
constructing a first entropy encoder according to the first frequency statistics result;
carrying out frequency statistics on character information in the matched character subset to obtain a second frequency statistics result;
constructing a second entropy encoder according to the second frequency statistics result;
and obtaining the target entropy encoder according to the first entropy encoder and the second entropy encoder.
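The claims do not name the entropy-coding scheme, so the sketch below uses Huffman coding purely as a stand-in: a code table is derived from the frequency statistics of one character-type subset, and running the builder once per subset yields the first and second entropy encoders.

import heapq
from collections import Counter

def build_entropy_encoder(frequencies: Counter) -> dict:
    # Derive a Huffman code table from the frequency statistics of one
    # character-type subset; the unique integer in each heap entry
    # breaks frequency ties so that dicts are never compared.
    heap = [[freq, n, {symbol: ""}]
            for n, (symbol, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate one-symbol subset
        return {next(iter(frequencies)): "0"}
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, n2, codes2 = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in codes1.items()}
        merged.update({s: "1" + bits for s, bits in codes2.items()})
        heapq.heappush(heap, [f1 + f2, n2, merged])
    return heap[0][2]

# One encoder per subset, e.g. for an uncompressed character subset:
first_encoder = build_entropy_encoder(Counter("mississippi"))
print(first_encoder)  # e.g. {'s': '0', 'm': '100', 'p': '101', 'i': '11'}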
5. The method of claim 4, wherein said encoding the data compression set according to the target entropy encoder to obtain a compression result comprises:
encoding the uncompressed character subset according to the first entropy encoder to obtain a first encoded data stream;
encoding the matched character subset according to the second entropy encoder to obtain a second encoded data stream;
and obtaining the compression result according to the first encoded data stream and the second encoded data stream.
6. The method of claim 5, wherein said obtaining the compression result according to the first encoded data stream and the second encoded data stream comprises:
acquiring encoder metadata of the target entropy encoder and a reference character string of the compression model;
obtaining a target compressed data stream according to the first encoded data stream and the second encoded data stream;
and obtaining a compression result according to the encoder metadata, the reference character string and the target compressed data stream.
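A hypothetical container for the packaging recited above: each subset is encoded with its own code table, the two encoded streams are joined into the target compressed data stream, and the encoder metadata plus the reference character string travel with it so that a decoder can be rebuilt. Field names are illustrative, and a real container would also have to record where the first stream ends.

def package_result(first_encoder, second_encoder,
                   uncompressed_subset, matched_subset, reference):
    # Encode each character-type subset with its own entropy encoder.
    first_stream = "".join(first_encoder[c] for c in uncompressed_subset)
    second_stream = "".join(second_encoder[t] for t in matched_subset)
    return {
        "encoder_metadata": {"first": first_encoder, "second": second_encoder},
        "reference_string": reference,
        # Target compressed data stream: the two encoded streams joined.
        "target_stream": first_stream + second_stream,
    }

result = package_result({"x": "0"}, {"abc": "0"},
                        uncompressed_subset=["x"],
                        matched_subset=["abc", "abc", "abc"],
                        reference="abc")
print(result["target_stream"])  # '0000'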
7. A big data based data compression storage device, the device comprising:
a compression signal receiving module, used for receiving a data compression signal, wherein the data compression signal comprises a data page to be compressed, the data page to be compressed comprises at least one piece of data to be compressed and a compression identifier of the data to be compressed, and the compression identifier is used for marking the data state of each piece of data to be compressed;
the data compression module is used for carrying out data compression on the data to be compressed according to a pre-trained compression model to obtain a data compression set, wherein the data compression set comprises a plurality of character type subsets, the character type subsets are used for storing character information obtained after the data to be compressed is subjected to data compression, and the compression model comprises a reference character string;
an encoder construction module, configured to construct a target entropy encoder according to the compression identifier and the character type subset, where if the compression identifier is identified as an end identifier, the target entropy encoder is constructed according to the character type subset, where the character type subset includes an uncompressed character subset and a matched character subset, the uncompressed character subset is used to represent a set of character information that is not matched with the reference character string, and the matched character subset is used to represent a set of character information that is matched with the reference character string;
and the encoding module is used for encoding the data compression set according to the target entropy encoder to obtain a compression result.
8. A computer device, comprising:
at least one memory;
at least one processor;
at least one computer program;
wherein the at least one computer program is stored in the at least one memory, and the at least one processor executes the at least one computer program to implement:
the method of any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program for causing a computer to execute:
the method of any one of claims 1 to 6.
Priority Applications (1)

Application Number: CN202311161037.8A
Priority Date: 2023-09-07
Filing Date: 2023-09-07
Title: Data compression storage method based on big data

Publications (1)

Publication Number: CN117118453A
Publication Date: 2023-11-24
Legal Status: Withdrawn after publication

Family ID: 88812714
Country Status (1): CN


Legal Events

PB01: Publication
WW01: Invention patent application withdrawn after publication (application publication date: 2023-11-24)