CN114780502B - Database method, system, device and medium based on compressed data direct computation - Google Patents


Info

Publication number
CN114780502B
Authority
CN
China
Prior art keywords
data
data block
written
compressed
rule
Prior art date
Legal status
Active
Application number
CN202210535252.9A
Other languages
Chinese (zh)
Other versions
CN114780502A
Inventor
张峰
万韦涛
杜小勇
Current Assignee
Renmin University of China
Original Assignee
Renmin University of China
Priority date
Filing date
Publication date
Application filed by Renmin University of China
Priority to CN202210535252.9A
Publication of CN114780502A
Application granted
Publication of CN114780502B
Active legal status (current)
Anticipated expiration


Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F16/00 Information retrieval; Database structures therefor; File system structures therefor › G06F16/10 File systems; File servers
        • G06F16/17 Details of further file system functions › G06F16/174 Redundancy elimination performed by the file system › G06F16/1744 Redundancy elimination using compression, e.g. sparse files
        • G06F16/14 Details of searching files based on file metadata › G06F16/148 File search processing
        • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
        • G06F16/17 Details of further file system functions › G06F16/172 Caching, prefetching or hoarding of files
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS › Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE › Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE › Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present invention relates to a database method, system, device and medium based on direct computation on compressed data. The method comprises the following steps: partitioning a file to be compressed according to the data granularity of a storage system, compressing each resulting data block, and storing the data blocks in the storage system; and processing the compressed data in the storage system, without decompression, using a bottom-up compressed-data processing method. When a file is compressed, an algorithm that directly processes compressed data based on grammar-rule parsing is adopted, and the compressed data is processed directly in the storage layer, which reduces both the number of data transfers and the volume of data transferred, greatly improving the direct-processing performance on compressed data. The invention can therefore be widely applied in the field of data processing.

Description

Database method, system, device and medium based on compressed data direct computation
Technical Field
The invention relates to a database method, system, device and medium based on direct computation on compressed data, belonging to the technical field of big data processing.
Background
Data processing is important for many applications, from web search to system diagnostics, security, and so on. In the big data era, data processing faces two main challenges: first, large-scale data incurs substantial storage overhead; second, processing large-scale data is time-consuming. Especially as the volume of processed data continues to grow rapidly, data analysis becomes very time-consuming and often requires large amounts of storage and memory. One common approach to alleviating the space problem is data compression. Existing direct-processing techniques for compressed data can compress data using a compression method based on grammar-rule description and, by parsing the grammar rules, process the compressed data directly without decompressing it.
In the prior art, the compressed structure is generally divided into three layers: from bottom to top, an element layer, a rule layer and a DAG (directed acyclic graph) layer. The element layer contains the smallest grammatical units, typically words; the rule layer contains sequences composed of elements and/or other rules; the DAG layer is the complete grammar structure, comprising a sequence of rules and elements in which each rule is composed of several rules or elements, the whole forming a directed acyclic graph. On this basis, the prior art typically traverses the entire structure in top-down or bottom-up order while performing data analysis.
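A minimal sketch of this three-layer structure, with invented class names for illustration (not code from the prior art):

```python
# Sketch of the element/rule/DAG layers of a Sequitur-style grammar.
# Elements are the smallest grammatical units; rules are sequences of
# elements and/or other rules; the whole grammar forms a DAG because
# a rule may be referenced more than once.

class Element:
    def __init__(self, word):
        self.word = word  # element layer: smallest unit (e.g., a word)

class Rule:
    def __init__(self, symbols):
        self.symbols = symbols  # rule layer: sequence of Elements/Rules

def expand(symbol):
    """Recursively expand a symbol back into its word sequence."""
    if isinstance(symbol, Element):
        return [symbol.word]
    return [w for child in symbol.symbols for w in expand(child)]

# "a b a b" compresses to R1 -> a b and R0 -> R1 R1;
# R1 is shared by two parents, so the structure is a DAG, not a tree.
a, b = Element("a"), Element("b")
r1 = Rule([a, b])
r0 = Rule([r1, r1])
print(expand(r0))  # ['a', 'b', 'a', 'b']
```

Decompression is exactly a traversal of this structure, which is why queries can instead walk the grammar directly without materializing the expanded text.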
While existing solutions show great potential in read-only query processing, a fully functional big data system must support both data queries and data manipulation. In particular, it must support updating records at random positions and inserting and deleting records. Existing solutions do not support these functions by themselves; to modify a compressed file, a considerable amount of data must be decompressed and recompressed on each modification, resulting in significant performance overhead.
Disclosure of Invention
In view of the foregoing problems, it is an object of the present invention to provide a database method, system, device and medium based on compressed data direct computation, which can apply compressed data direct processing technology in a storage system in a big data environment to support extensive data management and analysis.
To achieve this object, the invention adopts the following technical solutions:
in a first aspect, the present invention provides a database method based on compressed data direct computation, which comprises the following steps:
partitioning a file to be compressed according to the data granularity of a storage system, compressing each obtained data block, and storing the data blocks into the storage system;
and processing the compressed data in the storage system by adopting a bottom-up compressed data processing method under the condition of not decompressing.
Further, partitioning the file to be compressed according to the data granularity of the storage system, compressing each resulting data block, and storing it in the storage system comprises the following steps:
1.1) partitioning a file to be compressed to obtain a plurality of data blocks;
1.2) searching the hash table for the data to be written to the data block; entering step 1.3) if a duplicate data block already exists, or entering step 1.4) if it does not;
1.3) judging whether the data block to be written is referenced only once; if so, increasing the reference count of the duplicate data block, pointing the pointer that pointed to the data block to be written at the duplicate data block, releasing the data block to be written, and deleting its record in the hash table; if the data block to be written is referenced more than once, it cannot be released, and its reference count is decremented instead;
1.4) judging whether the data block to be written is referenced only once; if so, deleting the old record in the hash table so the data block can be overwritten, and updating the corresponding record in the hash table; if the data block to be written is referenced more than once, decrementing the reference count of the content prior to modification, allocating a new data block to store the data to be written, and pointing the pointer of the data block to be written at the new data block;
1.5) compressing each data block with an improved Sequitur compression method, limiting the structure and depth of the generated DAG according to a preset rule to obtain a DAG graph, and storing the DAG graph in the storage system.
Further, the preset rule is as follows: except for leaf nodes, other nodes can only have one parent node.
Further, processing the compressed data in the storage system includes the following operations: insert, delete, extract, update, search, append, and count.
Further, the inserting operation includes:
judging, according to the insert instruction, whether the operation is block-aligned, i.e., whether the insertion position and the length of the inserted data are both multiples of the data block size in the storage system;
if the data blocks are aligned, determining the corresponding rule positions of the data blocks in the DAG graph according to the insertion positions; if the rule at the insertion position is not full, directly adding the pointer corresponding to the new element to be inserted into the rule; if the rule at the insertion position is full, splitting the rule at the corresponding position or a parent rule thereof, and adding a pointer of a new element to be inserted into the new rule;
if the data blocks are not aligned, introducing a hole structure and combining it with the new element to be inserted so that their combined size is an integral multiple of the data block size; then inserting the hole structure and the new element, as a whole, into the appropriate rule in the DAG using the same method as above.
Further, the deleting operation includes:
judging, according to the delete instruction, whether the deletion position and the length of the data to be deleted are both multiples of the data block size in the storage system;
when the data blocks are aligned, determining the corresponding rule positions of the data blocks in the DAG graph according to the deletion positions, and then deleting corresponding data at the corresponding rule positions according to the deleted data length;
when the data blocks are not aligned, introducing a hole structure and combining it with the data to be deleted so that their combined size is a multiple of the data block size, and deleting the hole structure and the data to be deleted, as a whole, from the corresponding rule in the DAG graph.
Further, the search operation includes three phases:
in the intra-block searching stage, searching is carried out in each data block of the file according to a target data segment to be searched, and the positions of the target data segment in all the data blocks are counted;
in the cross-block search stage, searching every pair of adjacent data blocks of the file for the target data segment and recording the positions where it appears;
in the merging stage, merging the results of the intra-block and cross-block searches and returning the final search result.
In a second aspect, the present invention provides a database system for direct computation based on compressed data, comprising:
the data compression module is used for partitioning the data to be processed according to the data granularity of the storage system, compressing each data block and storing the data block into the storage system;
and the data processing module is used for processing the compressed data in the storage system by adopting a bottom-up compressed data processing method under the condition of not decompressing.
In a third aspect, the present invention provides a processing device comprising at least a processor and a memory, the memory having stored thereon a computer program, the processor executing the computer program to perform the steps of the database method based on compressed data direct computation.
In a fourth aspect, the present invention provides a computer storage medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement the steps of the compressed data direct computation based database method.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. The invention provides a direct-processing algorithm for compressed data suited to storage systems; compression and the corresponding processing are pushed down to the storage layer, giving strong applicability and high efficiency, and the method suits various upper-layer database applications;
2. The processing operations provided by the invention include not only read-only operations but also direct updates of compressed data, making the method more flexible; upper-layer database applications can use these operations to design more complex operators and obtain improved performance;
3. The direct-processing algorithm for compressed data in a storage system processes the compressed data directly in the storage layer, which reduces both the number of data transfers and the volume of data transferred, greatly improving the direct-processing performance on compressed data.
Therefore, the invention can be widely applied to the technical field of big data processing.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Like reference numerals refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a database method for direct computation based on compressed data according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of writing a data block into a storage system according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a DAG layer according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention, are within the scope of the invention.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The database method based on direct computation on compressed data provided by the invention enables direct-processing techniques for compressed data to support updating, inserting and deleting data while working in the storage layer, thereby realizing a space-efficient storage system that supports both data queries and data operations.
Making direct processing of compressed data work in the storage tier while supporting updates, insertions and deletions faces challenges at three levels. First, at the element level: the data granularity in a storage system is large, which increases complexity. While traditional methods process data at word granularity, storage systems typically organize data at a larger block granularity, such as 1 KB or 4 KB. Simply increasing the processing granularity weakens compression, because two large data blocks may share only part of their data; worse, when two identical pieces of partial data are not aligned, they cannot be represented by the same rule. Second, at the rule level: update operations must be performed at random positions on the rules. Random updates are hard to process, especially over large numbers of rules; a random update of hierarchically compressed data requires recursive rule splitting, which is inefficient when the DAG is deep. Third, at the DAG level: when the storage tier operates on the DAG, the performance (efficiency) of the operation must be guaranteed.
To achieve our core goal of "maximizing reuse while minimizing overhead," the present invention proposes solutions to the three challenges above. The solution to the first challenge is to allow holes in the rules: the block granularity used by the storage system is fixed, unlike the variable granularity of the prior art, so data that was aligned with the block size may no longer be aligned after a random update. The invention therefore proposes a hole structure to pad unaligned blocks, enabling the storage system to support flexible random updates. The solution to the second challenge is a new rule-level organizational design: all nodes except the leaf nodes are organized into a tree structure, and only leaf nodes may contain data blocks. In the prior art the DAG's rule organization is too complex: one node can have multiple parent nodes, so recursive rule splitting makes updates very inefficient and real-time updates hard to achieve. The solution to the third challenge is to limit the depth of the DAG. Prior-art DAGs are very deep, mainly to reduce storage space; depth helps reduce redundancy among rules but significantly hurts performance. By limiting the DAG depth, the compression ratio and the performance can both be kept within an acceptable range. With these schemes, the invention is very effective at reducing processing overhead while maintaining the compression ratio.
This embodiment systematically explores applying direct processing of compressed data to storage systems, studying the hierarchical structure of compression results based on the compression algorithm named Sequitur. The approach comprises two parts: compression-algorithm adaptation and compressed-data processing (including query and update operations). When a file is compressed, an algorithm that directly processes compressed data based on grammar-rule parsing is adopted, and the compressed data is processed directly in the storage layer, which reduces both the number of data transfers and the volume of data transferred, greatly improving the direct-processing performance on compressed data.
Example 1
As shown in fig. 1, the database method based on compressed data direct computation provided by this embodiment includes the following steps:
1) Partition the data to be processed according to the data granularity of the storage system, compress each data block with an improved Sequitur compression method, limit the structure and depth of the generated DAG according to a preset rule, and store the result in the storage system.
2) Process the compressed data in the storage system, without decompression, using a bottom-up compressed-data processing method.
Preferably, as shown in fig. 2, the process of writing the data block into the storage system includes the following steps:
1.1) partitioning a file to be compressed to obtain a plurality of data blocks, wherein each data block stores a section of data.
1.2) searching in a hash table according to the data to be written in the data block, and entering step 1.3) if the data to be written in the data block already exists, namely the repeated data block exists, or entering step 1.4).
The hash table stores a mapping from data content to data blocks. Since the content of the data block to be written is a piece of data, the hash table can be searched, keyed on that data, for an existing data block in the storage system holding the same data. In other words, looking up the data block to be written means checking whether the content about to be written to the current block is already present somewhere in the system, i.e., already stored in some existing data block.
1.3) judging whether the data block to be written is referenced only once; if so, increasing the reference count of the duplicate data block, pointing the pointer that pointed to the data block to be written at the duplicate data block, releasing the data block to be written, and deleting its record in the hash table; if the data block to be written is referenced more than once, it cannot be released, and its reference count is decremented instead.
1.4) judging whether the data block to be written is referenced only once; if so, deleting the old record in the hash table so the data block can be overwritten, and updating the corresponding record in the hash table; if the data block to be written is referenced more than once, decrementing the reference count of the content prior to modification, allocating a new data block to store the data to be written, and pointing the pointer of the data block to be written at the new data block;
1.5) compressing each data block with the improved Sequitur compression method, limiting the structure and depth of the generated DAG according to a preset rule, and storing the result in the storage system.
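The write path in steps 1.2) to 1.4) amounts to reference-counted block deduplication. A hedged sketch of the common case (class and method names are illustrative, and the in-place overwrite branch of step 1.4) is simplified away):

```python
# Illustrative sketch of the deduplicating write path: a hash table maps
# block content -> block id, and each block carries a reference count.

BLOCK_SIZE = 4096  # data granularity of the storage system (illustrative)

class BlockStore:
    def __init__(self):
        self.blocks = {}      # block id -> content bytes
        self.refcount = {}    # block id -> number of pointers referencing it
        self.by_content = {}  # hash table: content -> block id
        self.next_id = 0

    def write(self, data):
        """Write one block's content, deduplicating against existing blocks.
        Returns the id of the block the caller's pointer should reference."""
        dup = self.by_content.get(data)
        if dup is not None:                # duplicate exists (step 1.3):
            self.refcount[dup] += 1        # reuse it and bump its refcount
            return dup
        bid = self.next_id                 # no duplicate (step 1.4):
        self.next_id += 1                  # allocate a fresh block
        self.blocks[bid] = data
        self.refcount[bid] = 1
        self.by_content[data] = bid
        return bid

    def release(self, bid):
        """Drop one reference; free the block when no references remain."""
        self.refcount[bid] -= 1
        if self.refcount[bid] == 0:
            del self.by_content[self.blocks[bid]]
            del self.blocks[bid], self.refcount[bid]

store = BlockStore()
x = store.write(b"A" * BLOCK_SIZE)
y = store.write(b"A" * BLOCK_SIZE)   # identical content -> same block
print(x == y, store.refcount[x])     # True 2
```

Deduplicating at write time is what later lets the search stage report all occurrence positions of a matched block cheaply: identical content lives in exactly one physical block.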
Preferably, in step 1.5), the improved Sequitur compression method compresses data blocks under the following assumptions: the smallest operational element in the storage system is the data block, which stores data; a rule is a special data block that stores pointers to sub-rules or data blocks and corresponds, logically, to the pre-compression sequence formed by those sub-rules and elements; and the entire file is abstracted as a sequence of pointers to rules or data blocks, with each rule in the sequence composed of pointers to sub-rules or data blocks, forming a DAG graph structure. That is, the entire file is abstracted into one rule of indefinite length.
In this embodiment, the Sequitur compression method is modified so that the compression granularity matches the data granularity of the storage system, such as 1 KB or 4 KB, and the structure and depth of the generated DAG graph are limited: every node (rule) other than the leaf nodes (elements) may have only one parent, so the resulting DAG has a tree structure.
Fig. 3 shows an example of data compressed by the improved Sequitur compression method. In the figure, the element layer consists of three data blocks b1, b2 and b3; two rules R0 and R1, each holding pointers to elements, form the rule layer; and all rules and elements together constitute the DAG layer. If node R0 is regarded as the root, then starting from R0 the structure may include several nodes such as R01, R02, R03 and R04, and node R1 may be composed of several leaf nodes. Except for leaf nodes, every node may have only one parent.
Preferably, when the structure and depth of the generated DAG are limited according to a preset rule, the rule is: except for leaf nodes (elements), other nodes (rules) can only have one parent.
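Under this preset rule the DAG degenerates into a tree with a bounded depth, which is what makes random updates tractable. A sketch of the constraint (fan-out and depth values are illustrative, not taken from the patent):

```python
# Sketch: rule nodes with exactly one parent form a tree; only leaf
# nodes hold data blocks, and the tree's depth is capped by a preset
# rule so update and lookup costs stay bounded.

FANOUT = 4      # max pointers per rule (illustrative)
MAX_DEPTH = 3   # preset depth limit (illustrative)

class Node:
    def __init__(self, is_leaf=False):
        self.is_leaf = is_leaf
        self.children = []   # leaf: data-block ids; internal: child Nodes
        self.parent = None   # preset rule: at most one parent per node

def add_child(parent, child, depth):
    assert depth < MAX_DEPTH, "preset rule: DAG depth is limited"
    assert len(parent.children) < FANOUT, "rule is full; must split"
    parent.children.append(child)
    if isinstance(child, Node):
        assert child.parent is None, "preset rule: only one parent per node"
        child.parent = parent

root = Node()
leaf = Node(is_leaf=True)
add_child(root, leaf, depth=1)
leaf.children.extend(["b1", "b2", "b3"])  # leaf nodes hold data blocks
print(len(root.children))  # 1
```

Compared with the general Sequitur DAG sketched earlier, forbidding shared parents trades some compression ratio for update locality: splitting one rule never cascades into unrelated parents.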
Preferably, in step 2), the operation of processing the compressed data in the storage system includes: insert, delete, extract, update, search, append, count, etc.
Preferably, in step 2), the insert operation inserts arbitrary data at an arbitrary position. Conventional storage systems either do not provide such an operation or support insertion only in units of whole data blocks, so the function is very limited. By introducing the hole structure, the invention supports insertion at arbitrary positions, solving the problem that conventional storage systems cannot provide efficient insertion. Specifically, the insert operation comprises the following steps:
firstly, judging, according to the insert instruction, whether the operation is block-aligned, i.e., whether the insertion position and the length of the inserted data are both multiples of the size of the minimum operational element, the data block, in the storage system;
Here, block alignment means that the inserted data is aligned with the data block size. The insert instruction carries an insertion position (an offset, e.g., at byte 100 of the file) and an inserted data length (e.g., 100 bytes); the operation is block-aligned when both are multiples of the data block size (typically 1024, 2048, 4096, etc.). In that case the insertion is equivalent to inserting several whole data blocks.
If the operation is block-aligned, the insertion is equivalent to inserting a new element into the whole DAG, as follows: determine the corresponding rule position in the DAG graph from the insertion position; if the rule at the insertion position is not full (i.e., the number of pointers it contains has not reached the preset value), add the pointer for the new element directly to that rule; if the rule is full, split the rule at that position or its parent rule and add the pointer for the new element to the new rule. Moreover, per the compression algorithm, if the new element already exists, only the pointer to the existing element needs to be reused, so the overall compressed state is not broken.
In practice, when the rule at the insertion position is full, the method checks whether its parent rule is full; if the parent rule is not full, a new rule is created, a free pointer of the parent rule is pointed at the new rule, and the new element to be inserted is stored in the new rule. If the parent rule is also full, the search continues recursively up through its parent rules until the pointer of the new element can be added to a new rule.
If the operation is not block-aligned, a preset hole structure is introduced and combined with the new element to be inserted so that their combined size is aligned with the data block size; the hole structure and the new element are then inserted together into the appropriate rule in the DAG using the same method as above, so the data remains in a compressed state after insertion.
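The alignment test and the hole padding can be sketched as follows; this is a hedged illustration (the hole filler byte and function names are this sketch's own assumptions, and the rule-splitting machinery is elided):

```python
# Sketch of the alignment check and hole padding used by insert/delete.

BLOCK_SIZE = 4096
HOLE = b"\x00"  # hole filler (illustrative; the patent does not specify one)

def is_aligned(offset, length, block_size=BLOCK_SIZE):
    """Block-aligned iff both the position and the length of the
    affected data are multiples of the storage system's block size."""
    return offset % block_size == 0 and length % block_size == 0

def pad_with_hole(data, block_size=BLOCK_SIZE):
    """Combine a hole structure with the data so that the combined size
    is a whole multiple of the block size; returns (padded, hole_len)."""
    hole_len = (-len(data)) % block_size
    return data + HOLE * hole_len, hole_len

print(is_aligned(4096, 8192))          # True: insert as whole blocks
padded, hole = pad_with_hole(b"x" * 100)
print(len(padded) % BLOCK_SIZE, hole)  # 0 3996
```

Once padded, the hole plus payload occupies whole blocks, so it can be spliced into a rule exactly like an aligned insertion.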
Preferably, in the step 2), the deleting operation refers to deleting any data at any position of the data. Similar to the insert operation, the conventional storage system does not provide such an operation or only supports a delete operation in units of data blocks, and thus the function is very limited. Specifically, the deletion operation includes the steps of:
first, according to the delete instruction, judging whether the operation is block-aligned, i.e., whether the deletion position (an offset, e.g., at byte 100 of the file) and the length of the data to be deleted (e.g., 100 bytes) are both multiples of the size of the minimum operational element, the data block, in the storage system;
when the data blocks are aligned, determining the corresponding rule positions of the data blocks in the DAG graph according to the deletion positions, and then deleting corresponding data at the corresponding rule positions according to the deleted data length;
when the data blocks are not aligned, deletion at arbitrary positions is supported by introducing a hole structure: the hole structure is combined with the data to be deleted so that their combined size is block-aligned, and the hole structure and the data to be deleted are then removed from the rule as a single element. The rest of the DAG structure is not modified, and the data remains compressed after deletion.
Preferably, in step 2), the extract operation reads compressed data and retrieves arbitrary data from an arbitrary pre-compression position. The method is: locate, via the DAG structure, the data block holding the requested data according to the extraction position in the extract instruction, and read data of the corresponding length. Since each rule contains a constant number of pointers and only leaf nodes contain elements, the node corresponding to any position can be computed and the corresponding data block read without decompressing the data.
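With a constant fan-out per rule and data blocks only at the leaves, locating the block for a byte offset is pure radix arithmetic. A sketch under assumed parameters (fan-out, depth and block size are illustrative):

```python
# Sketch: compute the path from the root rule to the data block holding
# byte `offset`, without touching (let alone decompressing) any data.

BLOCK_SIZE = 4096
FANOUT = 4   # constant number of pointers per rule (illustrative)
DEPTH = 3    # levels of rules above the leaf data blocks (illustrative)

def locate(offset):
    """Return (block_index, byte_in_block, path), where `path` lists the
    child index chosen at each rule level, root first."""
    block_index, byte_in_block = divmod(offset, BLOCK_SIZE)
    path, idx = [], block_index
    for _ in range(DEPTH):
        path.append(idx % FANOUT)  # build the path leaf-up
        idx //= FANOUT
    return block_index, byte_in_block, path[::-1]

print(locate(5000))  # (1, 904, [0, 0, 1])
```

An extract of length L then reads the blocks along at most ceil(L / BLOCK_SIZE) + 1 such paths, each O(DEPTH) pointer hops.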
Preferably, in step 2), the update operation updates data at an arbitrary position (overwriting existing data). As long as the affected data block is written through the compression process described above, the operation can modify a data block without decompression. The modified data need not be block-aligned: before writing, the block's content is read into a cache, the cached content is modified to form a new data block, and the block is written back at data-block granularity.
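The read-modify-write step can be sketched as follows; `read_block` and `write_block` stand in for the storage system's block I/O and are this sketch's own names, not an interface from the patent:

```python
# Sketch of the update path: the modified range need not be block-aligned;
# each touched block is read into a cache, patched, and written back whole.

BLOCK_SIZE = 4096

def update(read_block, write_block, offset, new_data):
    first = offset // BLOCK_SIZE
    last = (offset + len(new_data) - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        buf = bytearray(read_block(b))        # cache the block's content
        lo = max(offset, b * BLOCK_SIZE)      # patched range inside block b
        hi = min(offset + len(new_data), (b + 1) * BLOCK_SIZE)
        buf[lo - b * BLOCK_SIZE:hi - b * BLOCK_SIZE] = \
            new_data[lo - offset:hi - offset]
        write_block(b, bytes(buf))            # rewrite at block granularity

disk = {0: bytes(BLOCK_SIZE), 1: bytes(BLOCK_SIZE)}
update(disk.__getitem__, disk.__setitem__, 4090, b"HELLOWORLD")
print(disk[0][4090:], disk[1][:4])  # b'HELLOW' b'ORLD'
```

Because the patched blocks re-enter the deduplicating write path, an update that makes a block identical to an existing one collapses back into a shared reference.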
Preferably, in step 2), the search operation locates the positions where a specific sequence appears, working directly on the compressed data. It comprises three stages:
The first stage is intra-block search: each data block of the file is searched for the target data segment, and the positions at which the target data segment appears in all data blocks are recorded.
A KMP matching algorithm may be adopted when searching each data block for the target data segment. In this embodiment, a repeated data block is stored only once, so once the searched content is found in a stored block, all other occurrences in identical blocks are found immediately, and every occurrence position within the data blocks can be counted quickly.
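The intra-block stage can be sketched with a standard KMP scan that returns every occurrence position within one block (overlapping matches included):

```python
def kmp_occurrences(block: bytes, pattern: bytes):
    """All (possibly overlapping) positions of `pattern` in one data
    block, via Knuth-Morris-Pratt."""
    # build the failure function of the pattern
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # linear scan of the block
    hits, k = [], 0
    for i, ch in enumerate(block):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)  # start index of this match
            k = fail[k - 1]
    return hits
```

Because identical blocks are stored once, positions computed for one stored block apply to every reference to it in the DAG.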
The second stage is cross-block search: every pair of adjacent data blocks of the file is searched for the target data segment, i.e., the content straddling the boundaries of consecutive data blocks is checked in parallel.
The third stage is the merge stage, which merges the intra-block and cross-block search results and returns the final search result.
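The three stages can be sketched together as follows; `blocks` stands for the file's block sequence, and a real implementation would walk the DAG and reuse results for deduplicated blocks (an assumed simplification, not the patent's code):

```python
def find_all(data: bytes, pattern: bytes):
    """All (possibly overlapping) match positions of pattern in data."""
    out, i = [], data.find(pattern)
    while i != -1:
        out.append(i)
        i = data.find(pattern, i + 1)
    return out

def search_file(blocks, pattern: bytes):
    """Three-phase search: intra-block, cross-block, then merge."""
    n = len(pattern)
    intra, cross, base = [], [], 0
    for i, blk in enumerate(blocks):
        # phase 1: matches lying entirely inside one block
        intra += [base + p for p in find_all(blk, pattern)]
        # phase 2: matches straddling the boundary to the next block;
        # an (n-1)-byte window per side cannot re-find intra matches
        if i + 1 < len(blocks) and n > 1:
            tail = blk[-(n - 1):]
            window = tail + blocks[i + 1][:n - 1]
            start = base + len(blk) - len(tail)
            cross += [start + p for p in find_all(window, pattern)]
        base += len(blk)
    # phase 3: merge both result sets into the final answer
    return sorted(intra + cross)
```

The window of n-1 bytes on each side of a boundary guarantees that every match it contains crosses the boundary, so intra-block and cross-block results never overlap and can simply be merged.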
Preferably, in step 2), the append operation refers to inserting data at the end of the data. It is a special case of the insert operation: because the insertion always occurs at the tail of the data, the step of locating the insertion position is eliminated, and all other steps are the same as for insertion.
Preferably, in step 2), the count operation refers to counting the number of occurrences of a specific sequence directly on the compressed data. It is similar to the search operation, but saves part of the time in the intra-block stage, because only the number of occurrences needs to be computed, not their positions in the whole data.
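A counting sketch along these lines; identical blocks are scanned once and weighted by their multiplicity, a stand-in for the DAG's reference counts (hypothetical simplification):

```python
from collections import Counter

def find_all(data: bytes, pattern: bytes):
    """All (possibly overlapping) match positions of pattern in data."""
    out, i = [], data.find(pattern)
    while i != -1:
        out.append(i)
        i = data.find(pattern, i + 1)
    return out

def count_occurrences(blocks, pattern: bytes) -> int:
    """Count matches without recording positions: each unique block is
    scanned once and weighted by how often it occurs; only the boundary
    windows still need the actual block order."""
    total = 0
    for blk, refs in Counter(blocks).items():
        total += refs * len(find_all(blk, pattern))
    n = len(pattern)
    if n > 1:
        for left, right in zip(blocks, blocks[1:]):
            window = left[-(n - 1):] + right[:n - 1]
            total += len(find_all(window, pattern))
    return total
```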
Example 2
The foregoing embodiment 1 provides a database method based on compressed data direct computation; correspondingly, this embodiment provides a database system based on compressed data direct computation. The system provided in this embodiment can implement the method of embodiment 1 and may be realized by software, hardware, or a combination of the two. For example, the system may comprise integrated or separate functional modules or functional units that perform the corresponding steps of the method of embodiment 1. Since the system of this embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, reference may be made to the description of embodiment 1.
The database system provided by the embodiment and based on compressed data direct calculation comprises:
the data compression module is used for partitioning the data to be processed according to the data granularity of the storage system, compressing each data block by adopting an improved Sequitur compression method, and storing the generated DAG into the storage system after limiting the structure and the depth of the DAG according to a preset rule;
and the data processing module is used for processing the compressed data in the storage system by adopting a bottom-up compressed data processing method under the condition of not decompressing.
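A hypothetical skeleton of the two modules (illustrative names only; the Sequitur-style rule construction is elided, and compression is reduced here to fixed-size blocking):

```python
class CompressedDatabase:
    """Sketch of the two-module system: a compression module that
    partitions data at the storage system's granularity, and a
    processing module that serves operations block-locally."""

    def __init__(self, block_size: int):
        self.block_size = block_size

    def compress(self, data: bytes) -> list:
        """Data compression module: partition input into data blocks
        (DAG construction omitted in this sketch)."""
        bs = self.block_size
        return [data[i:i + bs] for i in range(0, len(data), bs)]

    def extract(self, blocks: list, pos: int, length: int) -> bytes:
        """Data processing module: serve a read by touching only the
        blocks that overlap the requested range, not the whole file."""
        bs = self.block_size
        first, last = pos // bs, (pos + length - 1) // bs
        touched = b"".join(blocks[first:last + 1])
        return touched[pos - first * bs:][:length]
```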
Example 3
This embodiment provides a processing device corresponding to the database method based on compressed data direct computation provided in embodiment 1. The processing device may be a client device, such as a mobile phone, a notebook computer, a tablet computer, or a desktop computer, that executes the method of embodiment 1.
The processing device comprises a processor, a memory, a communication interface and a bus; the processor, the memory and the communication interface are connected through the bus to communicate with one another. The memory stores a computer program executable on the processor, and when executing the computer program, the processor performs the database method based on compressed data direct computation provided by embodiment 1.
In some embodiments, the memory may be a high-speed Random Access Memory (RAM) and may also include non-volatile memory, such as at least one disk storage.
In other embodiments, the processor may be various general-purpose processors such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), and the like, and is not limited herein.
Example 4
The database method based on compressed data direct computation of embodiment 1 can be embodied as a computer program product, which may include a computer-readable storage medium carrying computer-readable program instructions for executing the database method based on compressed data direct computation of embodiment 1.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any combination of the foregoing.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A database method based on compressed data direct computation is characterized by comprising the following steps:
partitioning a file to be compressed according to the data granularity of a storage system, compressing each obtained data block, and storing the data blocks into the storage system;
processing the compressed data in the storage system by adopting a bottom-up compressed data processing method under the condition of not decompressing;
the method for blocking the file to be compressed according to the data granularity of the storage system, compressing each obtained data block and storing the data block into the storage system comprises the following steps:
1.1) partitioning a file to be compressed to obtain a plurality of data blocks;
1.2) searching in a hash table according to the data to be written in the data block, and entering step 1.3) if a repeated data block exists for the data to be written, or entering step 1.4) if no repeated data block exists;
1.3) judging whether the data block to be written is referenced only once; if so, increasing the reference count of the repeated data block, pointing the pointer that pointed to the data block to be written to the repeated data block, releasing the data block to be written, and deleting its record in the hash table; if the data block to be written is referenced more than once, the data block to be written cannot be released, and its reference count is decreased;
1.4) judging whether the data block to be written is referenced only once; if so, deleting the record in the hash table to release the data block to be written, and modifying the corresponding record in the hash table; if the data block to be written is referenced more than once, decreasing the reference count of the content before modification, allocating a new data block to store the data to be written, and pointing the pointer of the data block to be written to the new data block;
1.5) compressing each data block by adopting an improved Sequitur compression method, limiting the structure and the depth of the generated directed acyclic graph according to a preset rule to obtain a directed acyclic graph, and storing the directed acyclic graph in a storage system;
when the improved Sequitur compression method is adopted to compress the data block, the assumed conditions are as follows: the minimum operation element in the storage system is a data block for storing data; the rule is stored with pointers to other sub-rules or data blocks; the whole file is abstracted into a sequence of pointers to rules or data blocks, and each rule in the sequence is composed of a plurality of pointers to sub-rules or data blocks to form a directed acyclic graph.
2. The database method based on compressed data direct computation of claim 1, wherein the preset rule is: except for leaf nodes, other nodes can only have one parent node.
3. The database method based on compressed data direct computation of claim 1, wherein the operation of processing the compressed data in the storage system comprises: insert, delete, extract, update, search, append, and count.
4. A database method based on compressed data direct computation according to claim 3, characterized in that the insertion operation comprises:
judging, according to the insertion operation instruction, whether the data blocks are aligned, namely whether the insertion position and the length of the inserted data are both multiples of the data-block size in the storage system;
if the data blocks are aligned, determining the corresponding regular positions of the data blocks in the directed acyclic graph according to the insertion positions; if the rule at the insertion position is not full, directly adding the pointer corresponding to the new element to be inserted into the rule; if the rule at the insertion position is full, splitting the rule at the corresponding position or a parent rule thereof, and adding a pointer of a new element to be inserted into the new rule;
if the data blocks are not aligned, introducing a hole structure, and combining the hole structure and the new element to be inserted to enable the whole size formed by the hole structure and the new element to be inserted to be integral multiple of the size of the data blocks; and then inserting the hole structure and the new element to be inserted into a certain rule in the directed acyclic graph as a whole by adopting the same method in the steps.
5. The database method based on compressed data direct computation of claim 3, characterized in that the delete operation comprises:
judging, according to the deletion operation instruction, whether the deletion position and the length of the data to be deleted are both multiples of the data-block size in the storage system;
when the data blocks are aligned, determining the corresponding rule positions of the data blocks in the directed acyclic graph according to the deletion positions, and then deleting corresponding data at the corresponding rule positions according to the deleted data length;
when the data blocks are not aligned, combining the hole structure and the data to be deleted by introducing the hole structure to enable the whole size formed by the hole structure and the data to be deleted to be multiple of the size of the data blocks, and deleting the hole structure and the data to be deleted as a whole from a corresponding rule in the directed acyclic graph.
6. A database method based on compressed data direct computation according to claim 3, characterized in that the search operation comprises three phases:
searching in blocks, namely searching in each data block of the file according to a target data segment to be searched, and counting the positions of the target data segment in all the data blocks;
searching in a cross-block manner, namely searching in every two adjacent data blocks of the file according to a target data segment to be searched, and counting the occurrence position of the target data segment;
and a merging stage, merging the results of the intra-block search and the cross-block search, and returning the final search result.
7. A database system for direct computation based on compressed data, comprising:
the data compression module is used for partitioning data to be processed according to the data granularity of the storage system, compressing each data block by adopting an improved Sequitur compression method, limiting the structure and the depth of the generated directed acyclic graph according to a preset rule, and storing the directed acyclic graph into the storage system;
the data processing module is used for processing the compressed data in the storage system by adopting a bottom-up compressed data processing method under the condition of not decompressing;
wherein partitioning the data to be processed according to the data granularity of the storage system, compressing each obtained data block, and storing the data block in the storage system includes:
partitioning a file to be compressed to obtain a plurality of data blocks;
searching in a hash table according to the data to be written in the data block, and according to whether the data to be written in the data block has repeated data blocks:
if a repeated data block exists, judging whether the data block to be written is referenced only once; if so, increasing the reference count of the repeated data block, pointing the pointer that pointed to the data block to be written to the repeated data block, releasing the data block to be written, and deleting its record in the hash table; if the data block to be written is referenced more than once, the data block to be written cannot be released, and its reference count is decreased;
if no repeated data block exists, judging whether the data block to be written is referenced only once; if so, deleting the record in the hash table to release the data block to be written, and modifying the corresponding record in the hash table; if the data block to be written is referenced more than once, decreasing the reference count of the content before modification, allocating a new data block to store the data to be written, and pointing the pointer of the data block to be written to the new data block;
compressing each data block by adopting an improved Sequitur compression method, limiting the structure and the depth of the generated directed acyclic graph according to a preset rule to obtain a directed acyclic graph, and storing the directed acyclic graph in a storage system;
when the improved Sequitur compression method is adopted to compress the data block, the assumed conditions are as follows: the minimum operation element in the storage system is a data block for storing data; the rule is stored with pointers to other sub-rules or data blocks; the whole file is abstracted into a sequence of pointers to rules or data blocks, and each rule in the sequence is composed of a plurality of pointers to sub-rules or data blocks to form a directed acyclic graph.
8. A processing device comprising at least a processor and a memory, the memory having stored thereon a computer program, characterized in that the processor, when executing the computer program, performs the steps of implementing the database method based on compressed data direct computation according to any of claims 1 to 6.
9. A computer storage medium having computer readable instructions stored thereon which are executable by a processor to perform the steps of the database method for direct computation based on compressed data according to any one of claims 1 to 6.
CN202210535252.9A 2022-05-17 2022-05-17 Database method, system, device and medium based on compressed data direct computation Active CN114780502B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210535252.9A CN114780502B (en) 2022-05-17 2022-05-17 Database method, system, device and medium based on compressed data direct computation


Publications (2)

Publication Number Publication Date
CN114780502A CN114780502A (en) 2022-07-22
CN114780502B true CN114780502B (en) 2022-09-16

Family

ID=82436666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210535252.9A Active CN114780502B (en) 2022-05-17 2022-05-17 Database method, system, device and medium based on compressed data direct computation

Country Status (1)

Country Link
CN (1) CN114780502B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482147B (en) * 2022-09-14 2023-04-28 中国人民大学 Efficient parallel graph processing method and system based on compressed data direct calculation
CN115811317A (en) * 2022-10-17 2023-03-17 中国人民大学 Stream processing method and system based on self-adaptive non-decompression direct calculation

Citations (5)

Publication number Priority date Publication date Assignee Title
US6327699B1 (en) * 1999-04-30 2001-12-04 Microsoft Corporation Whole program path profiling
CN107612886A (en) * 2017-08-15 2018-01-19 University of Chinese Academy of Sciences Compression algorithm decision method for the Shuffle process of the Spark platform
CN108427539A (en) * 2018-03-15 2018-08-21 Sangfor Technologies Inc. Offline deduplication compression method and device for cache device data, and readable storage medium
CN112764663A (en) * 2019-10-21 2021-05-07 阿里巴巴集团控股有限公司 Space management method, device and system of cloud storage space, electronic equipment and computer readable storage medium
CN113064870A (en) * 2021-03-22 2021-07-02 中国人民大学 Big data processing method based on compressed data direct calculation

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9495427B2 (en) * 2010-06-04 2016-11-15 Yale University Processing of data using a database system in communication with a data processing framework
US9798727B2 (en) * 2014-05-27 2017-10-24 International Business Machines Corporation Reordering of database records for improved compression


Non-Patent Citations (2)

Title
"Lossless Image Compression Based on Variable-Bit-Rate-Coded Acyclic Graphs"; Chen Da et al.; Control Engineering of China; 31 May 2020; Vol. 27, No. 5; pp. 812-818 *
"Research and Implementation of Performance Optimization for Similarity-Based Deduplication of Adjacent Duplicate Data Blocks"; Tan Jiahao; China Master's Theses Full-text Database, Information Science and Technology; 15 June 2019; No. 06; I137-52 *

Also Published As

Publication number Publication date
CN114780502A (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN114780502B (en) Database method, system, device and medium based on compressed data direct computation
US20160335299A1 (en) Hierarchical Data Storage
CN109947791B (en) Database statement optimization method, device, equipment and storage medium
US10671586B2 (en) Optimal sort key compression and index rebuilding
US8954407B2 (en) System and method for partially deferred index maintenance
JP7426907B2 (en) Advanced database decompression
KR101549220B1 (en) Method and System for Managing Database, and Tree Structure for Database
CN104063384A (en) Data retrieval method and device
CN111104377A (en) File management method, electronic device and computer-readable storage medium
CN114416670A (en) Index creating method and device suitable for network disk document, network disk and storage medium
US10558636B2 (en) Index page with latch-free access
US8812523B2 (en) Predicate result cache
US10719494B2 (en) Accelerating operations in B+-tree
US10078647B2 (en) Allocating free space in a database
CN111026736B (en) Data blood margin management method and device and data blood margin analysis method and device
US20180275961A1 (en) Method and system for fast data comparison using accelerated and incrementally synchronized cyclic data traversal algorithm
CN115469810A (en) Data acquisition method, device, equipment and storage medium
KR101089722B1 (en) Method and apparatus for prefix tree based indexing, and recording medium thereof
CN113495901A (en) Variable-length data block oriented quick retrieval method
KR102013839B1 (en) Method and System for Managing Database, and Tree Structure for Database
CN107315806B (en) Embedded storage method and device based on file system
CN113448957A (en) Data query method and device
CN115905246B (en) KV caching method and device based on dynamic compression prefix tree
CN109815225B (en) Parallel prefix data retrieval method and system based on prefix tree structure
CN106776772A (en) A kind of method and device of data retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant