CN1425986A

CN1425986A - Automatic compressing/decompressing file system and its compressing algorithm

Info

Publication number: CN1425986A
Application number: CN03100603A
Authority: CN
Inventors: 张跃; 甄成
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2003-01-17
Filing date: 2003-01-17
Publication date: 2003-06-25
Anticipated expiration: 2023-01-17
Also published as: CN1200354C

Abstract

An automatic compression/decompression file system and its compression algorithm, characterized in that a disk block abstraction layer, which can be integrated into the hardware of the physical storage, is added between an ordinary file system and the physical storage that holds the compressed data. The layer separates the ordinary file system from the compressed on-disk data and allows multiple file system modules to be supported. The disk block abstraction layer contains a virtual disk logical data block storage space, a log-structured physical storage data block mapping layer, and a data compression/decompression layer. In the data compression algorithm of the system, each new dictionary entry is created from two entries already in the dictionary. The invention is especially suitable for embedded systems and can greatly raise the utilization of embedded device resources, storage in particular, and thereby improve performance.

Description

Automatic compression/decompression file system and compression algorithm thereof

Technical field:

The automatic compression/decompression file system and its compression algorithm belong to the field of file systems with automatic compression/decompression, and in particular to file systems used by embedded operating systems.

Background technology:

With the rapid development of microelectronics, and of microprocessors in particular, many products and systems that used to be implemented purely in hardware are increasingly implemented jointly in software and hardware. This not only makes such systems more flexible but also makes it easy to introduce more advanced techniques such as artificial intelligence and adaptive control. Products and systems that combine software and hardware will therefore be the main direction of future development. The operating system, as the foundation of the software, undoubtedly occupies a key position in such systems. At the same time, the file system, which is closely tied to the operating system kernel and long established in desktop systems, is becoming increasingly important.

In the past, because of hardware limitations, many combined software and hardware systems (for example control systems based on single-chip microcomputers) had no operating system at all, or had only an operating system kernel and no file system. As microprocessor performance keeps rising and memory prices keep falling, handheld PC-like devices have developed rapidly and have begun to catch up with laptop and desktop systems. This trend places higher demands on embedded operating systems and makes the need for a file system increasingly urgent. Embedded systems, however, are generally small: system programs and applications are usually held in FLASH or ROM, and there is no large-capacity device such as a hard disk. Storage media such as FLASH or ROM are also very expensive compared with the hard disks and optical discs widely used in desktop systems. Simply copying a large desktop file system into an embedded device is therefore not feasible. Instead, either a new file system should be developed for the characteristics of embedded systems, or a desktop file system should be adapted and trimmed down to suit embedded devices. In both approaches, data compression plays a very important role. It can substantially increase the utilization of embedded device resources, storage in particular, and thereby improve performance. This has theoretical significance as well as considerable practical and economic value, and can promote the further development of embedded devices.

Summary of the invention:

The object of the present invention is to provide a file system with automatic compression/decompression and a corresponding data compression algorithm. It differs from the prior art in that it introduces an abstraction layer for file systems, the disk block abstraction layer. The disk block abstraction layer is a new module added between the ordinary file system and the physical storage; through this abstraction, the file system is isolated from the compressed on-disk data. By modifying the disk block abstraction layer, the system can support multiple file systems. In addition, the compression algorithm builds a new dictionary construction method on top of the LZW dictionary compression algorithm, improving the performance of the original algorithm. The invention is particularly suitable for embedded systems.

The automatic compression/decompression file system proposed by the invention is characterized in that:

A disk block abstraction layer, which can be integrated into the hardware of the physical storage, is added between the ordinary file system and the physical storage used to hold the compressed data. The disk block abstraction layer is a module that separates the ordinary file system from the compressed on-disk data and that can be modified to support multiple file systems. It contains, in order:

1) Virtual disk logical data block storage space, i.e. the virtual block space: a contiguous logical data block storage area through which the disk block abstraction layer presents an interface to the upper-layer file system. It is an array indexed by virtual logical block number, and each element is a pointer to the physical storage location;

2) Log-structured physical storage data block mapping layer: it uses a log-structured algorithm to map the virtual block space onto the physical data blocks of the disk, so that the upper-layer file system writes data from the virtual block space to the disk, or reads it from the disk, through this mapping layer;

3) Data compression/decompression layer: it stores the data compressed by the LZW algorithm.

Each element of the virtual block space carries a 32-bit pointer that contains the logical data block number. With this pointer pointing to the physical storage location, the virtual block space provides the following interfaces to the upper-layer file system on behalf of the disk block abstraction layer:

1) Reserve blocks: a function that reserves a range of virtual block space for the upper-layer file system for later use;

2) Release blocks: a function by which the upper-layer file system releases a range of virtual block space so that other file systems can use it later;

3) Read blocks: a function that writes the content of the data blocks the upper-layer file system needs to access into a specified buffer;

4) Write blocks: a function that writes the content of the data blocks supplied by the upper-layer file system in a specified buffer into the specified virtual blocks;

5) Delete blocks: called to notify the disk block abstraction layer when the upper-layer file system has determined that the data in some virtual blocks is no longer valid, for example when it carries out a file deletion.

The log-structured physical storage data block mapping layer is a logical reclamation segment whose content is a contiguous storage area of compressed data blocks. A compressed data block is the data unit formed when the logical data blocks are compressed at a certain compression ratio, and a logical reclamation segment, after compression at a certain ratio, forms the physical reclamation segment that holds the compressed data on the physical disk. The logical data block number, the compressed data block number and the address of the logical reclamation segment together make up, in that order, the 32-bit pointer of the virtual block space.

The data compression/decompression layer is the physical disk that stores, in compressed form, the data passed down by the log-structured physical storage data block mapping layer. It consists, in order, of a space management block and a contiguous storage area of physical reclamation blocks; the correspondence between a physical reclamation block and a logical reclamation block is determined by the 32-bit pointer described above. The space management block contains:

1) Free compressed data block map: each bit in the table indicates the free state of the corresponding compressed data block, and whether the segment is reclaimed is decided by whether the number of free compressed data blocks exceeds a set threshold;

2) Free logical data block map: each compressed data block in the logical reclamation segment corresponds to one 2-byte array element, and each bit of that element indicates the free state of one logical data block in the compressed data block; whether the compressed data block should be marked in the free compressed data block map is decided by whether the number of free logical data blocks in it exceeds a set threshold;

3) Compressed data block address table: an array with as many elements as there are compressed data blocks in the logical reclamation segment, one element per compressed data block. To optimize data reads and writes, the start address of each compressed data block after compression is n/2 times the number of logical data blocks contained in each compressed data block, where n is an integer greater than 1, so that the difference between the start addresses of two consecutive compressed data blocks equals the length of the earlier block.

The compression algorithm designed for the above automatic compression/decompression file system is characterized in that it is a dynamic dictionary compression algorithm in which each new word (dictionary entry) is created from two words already in the dictionary. During compression, the two existing words are both longest matching strings found successively in the input. If the new word formed from the two is not in the dictionary, it is added to the dictionary; otherwise the algorithm checks whether a second longest match follows and, if so, repeats the above test, until there is no more input. During decompression, if the new word obtained is not in the dictionary, the new word is output; otherwise the next two existing words are examined and the process repeats.

Experiments show that the automatic compression/decompression file system and the compression algorithm proposed by the invention achieve the intended purpose.

Description of drawings:

Fig. 1: structure of the file system proposed by the invention;

Fig. 2: schematic diagram of the virtual disk logical data block storage space;

Fig. 3: data structure of the logical reclamation block and of the 32-bit pointer of the virtual block space;

Fig. 4: correspondence between logical reclamation blocks and physical reclamation blocks;

Fig. 5: structure of the space management block;

Fig. 6: flow chart of the compression algorithm;

Fig. 7: flow chart of the decompression algorithm;

Fig. 8: flow diagram of read and write operations;

Fig. 9: flow chart for creating an automatic compression/decompression disk partition;

Fig. 10: flow chart for deleting an automatic compression/decompression disk partition.

Embodiment:

Specific embodiments of the invention are described below with reference to the accompanying drawings.

The compressed file system of the invention is based on a log-structured file system. The disk block abstraction layer is a new technique proposed by the invention and is its central part. It is divided into three layers, see Fig. 1: the virtual block space presented by the disk block abstraction layer (①), the log-structured physical storage data block mapping layer (②), and the data compression/decompression layer (③).

The disk block abstraction layer is a new module added between the ordinary file system and the physical storage, as shown in Fig. 1. It provides the ordinary file system above it with a virtual disk logical data block storage space (the virtual block space), as shown in Fig. 2. A logical data block is the smallest storage unit of the upper-layer file system, typically 1 kB, 2 kB, 4 kB or more; Fig. 2 uses 1 kB as the example. As the figure shows, the virtual block space is a contiguous logical data block storage area, divided and numbered according to the logical block size required by the upper-layer file system. The log-structured physical storage data block mapping layer is responsible for mapping logical blocks to physical data blocks on the disk, and the mapping uses a log-structured algorithm. The upper-layer file system can thus write data from the virtual block space to the disk, or read it back from the disk, through the mapping layer. Every write or read also passes through the compression layer, so the data actually stored on the disk is compressed data.

Because the upper-layer file system accesses the physical disk only through the virtual block space interface, the disk block abstraction layer completely isolates the upper-layer file system from the way the lower layer stores data. The design of the abstraction layer can therefore concentrate on the log structure and on compressed storage, while the upper layer can be any existing file system, which then becomes an automatically compressing and decompressing file system. Several file system instances can coexist on the disk block abstraction layer, so the benefits of the log structure and of compressed storage are obtained without modifying any file system code. Because this layer is independent of the upper-layer file system, its design can further be built into the hardware of the physical storage, which greatly improves system performance.

The following description, with reference to the drawings, takes as its example a computer running the Linux operating system with the Ext2 file system.

The physical storage hardware here is the hard disk installed in the computer (or a floppy disk, USB flash disk, RAM disk, and so on).

1. Implementation of the virtual disk logical data block storage space (①); its internal structure is shown in Fig. 2:

We implement the virtual block space as an array. The array index is the virtual logical block number (①), and each element is a 32-bit pointer to the physical storage location (the details of the pointer's data structure are given below). The abstraction layer also provides the following interfaces to the upper-layer file system:

1) Reserve blocks: int blk_reserve(int start_blk, int num_blk)

This function reserves a range of virtual block space for the upper-layer file system for later use. The first parameter, start_blk, is the starting block number and the second, num_blk, is the number of blocks. The function returns the number of blocks actually reserved.

2) Release blocks: int blk_free(int start_blk, int num_blk)

This function releases a range of virtual block space on behalf of the upper-layer file system so that other file systems can use it. The first parameter, start_blk, is the starting block number and the second, num_blk, is the number of blocks. The function returns the number of blocks actually released.

3) Read blocks: int blk_read(int start_blk, int num_blk, char *buf)

This function writes the content of the data blocks that the upper-layer file system needs to access into the specified buffer buf. The function returns the number of blocks actually read.

4) Write blocks: int blk_write(int start_blk, int num_blk, char *buf)

This function writes the content of the data blocks that the upper-layer file system needs to save (given in buf) into the specified virtual blocks. The function returns the number of blocks actually written.

5) Delete blocks: int blk_delete(int start_blk, int num_blk)

When the upper-layer file system has determined that the data in some virtual blocks is no longer valid, for example during a file deletion, it should call this interface to notify the abstraction layer, so that physical storage blocks can be reclaimed more efficiently.
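To make the calling convention of these interfaces concrete, the following is a toy in-memory sketch. Only the three signatures blk_read, blk_write and blk_delete come from the description; everything else (the constants, the array names blk_map and log_store, the append-only log without reclamation, the absence of compression, and the return value chosen for blk_delete) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLK_SIZE  1024          /* logical block size, 1 kB as in the example */
#define NUM_VBLK  64            /* size of the toy virtual block space (assumed) */
#define LOG_SLOTS 128           /* capacity of the toy append-only log (assumed) */

/* blk_map[v] holds (log slot + 1) for virtual block v; 0 means "not mapped".
 * It stands in for the 32-bit pointer array of the virtual block space. */
static int  blk_map[NUM_VBLK];
static char log_store[LOG_SLOTS][BLK_SIZE];
static int  log_used;           /* next free slot at the head of the log */

static int valid_blk(int blk) { return blk >= 0 && blk < NUM_VBLK; }

/* Append each block to the log and repoint the virtual block at the new copy.
 * Returns the number of blocks actually written. */
int blk_write(int start_blk, int num_blk, char *buf)
{
    int done = 0;
    while (done < num_blk && valid_blk(start_blk + done) && log_used < LOG_SLOTS) {
        memcpy(log_store[log_used], buf + (size_t)done * BLK_SIZE, BLK_SIZE);
        blk_map[start_blk + done] = ++log_used;   /* store slot index + 1 */
        done++;
    }
    return done;
}

/* Copy mapped blocks into buf; stops at the first unmapped block.
 * Returns the number of blocks actually read. */
int blk_read(int start_blk, int num_blk, char *buf)
{
    int done = 0;
    while (done < num_blk && valid_blk(start_blk + done)
           && blk_map[start_blk + done] != 0) {
        memcpy(buf + (size_t)done * BLK_SIZE,
               log_store[blk_map[start_blk + done] - 1], BLK_SIZE);
        done++;
    }
    return done;
}

/* Drop the mappings. Returning the number of mappings cleared is an
 * assumption; the description does not specify the return value. */
int blk_delete(int start_blk, int num_blk)
{
    int cleared = 0;
    for (int i = 0; i < num_blk && valid_blk(start_blk + i); i++) {
        if (blk_map[start_blk + i] != 0) {
            blk_map[start_blk + i] = 0;
            cleared++;
        }
    }
    return cleared;
}
```

In this sketch an overwrite never touches the old copy: it appends a new one and repoints the map entry, which is the log-structured behaviour the description relies on. A real layer would also compress the data and reclaim stale slots.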

The array that implements the virtual block space is stored in the .blockmap file. A file is used to store the virtual block space mapping table in order to follow the log-structured approach. The system queries and modifies this index table on every data access, so placing it in a fixed range of physical sectors would cause unnecessary mechanical head movement whenever data is accessed, which would defeat the purpose of a log-structured design. We therefore implement the table as a file: its inode (information node) is kept in the superblock of the abstraction layer, and its content is written to the log together with the data but is not compressed, to avoid adding to the system load.

2. Implementation of the log-structured physical storage data block mapping layer (②), see Fig. 4:

Log-structured storage is the idea we follow throughout the design.

The first thing to decide is the size of a reclamation segment. Weighing reclamation efficiency against flexibility in the use of disk data blocks, 512 kB is a suitable size. This reclamation segment is the one on the physical disk, which we call the physical reclamation segment (①); it holds data that has already been compressed. From this size we determine the size of the corresponding segment in the mapping layer, which we call the logical reclamation segment (②): 2 MB. This size is chosen so that a logical reclamation segment can still be larger than a physical reclamation segment after compression; otherwise some physical storage space would be wasted at the end of each segment.

To obtain a higher compression ratio, we compress data in units of 16 kB, called compressed data blocks (③). A logical reclamation segment therefore contains 128 compressed data blocks. Because the logical data block size of the upper-layer file system is 1 kB, each compressed data block contains 16 logical data blocks.

From this we obtain the data structure of the 32-bit pointer of the virtual block space, as shown in Fig. 3.
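The sizes above fix how many bits each part of the pointer needs: 16 logical blocks per compressed block take 4 bits, 128 compressed blocks per segment take 7 bits, and 21 bits remain for the logical reclamation segment address. Fig. 3 is not reproduced here, so the sketch below is only one possible layout; the field order within the word and all names (vblk_ptr, ptr_pack, and so on) are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define LBLK_SIZE      1024u                      /* logical data block: 1 kB      */
#define CBLK_SIZE      (16u * 1024u)              /* compressed data block: 16 kB  */
#define LSEG_SIZE      (2u * 1024u * 1024u)       /* logical reclamation segment   */

#define LBLK_PER_CBLK  (CBLK_SIZE / LBLK_SIZE)    /* 16  */
#define CBLK_PER_LSEG  (LSEG_SIZE / CBLK_SIZE)    /* 128 */

#define LBLK_BITS 4u    /* log2(16)                          */
#define CBLK_BITS 7u    /* log2(128)                         */
#define SEG_BITS  21u   /* 32 - 7 - 4, derived, not from the text */

typedef uint32_t vblk_ptr;

/* Assumed layout: [ segment : 21 | compressed block : 7 | logical block : 4 ] */
static vblk_ptr ptr_pack(uint32_t seg, uint32_t cblk, uint32_t lblk)
{
    return (seg << (CBLK_BITS + LBLK_BITS)) | (cblk << LBLK_BITS) | lblk;
}

static uint32_t ptr_seg(vblk_ptr p)  { return p >> (CBLK_BITS + LBLK_BITS); }
static uint32_t ptr_cblk(vblk_ptr p) { return (p >> LBLK_BITS) & ((1u << CBLK_BITS) - 1u); }
static uint32_t ptr_lblk(vblk_ptr p) { return p & ((1u << LBLK_BITS) - 1u); }
```

Under this assumed split, looking up a virtual block means reading its pointer, locating the segment, decompressing the addressed 16 kB compressed block, and picking out the 1 kB logical block inside it.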

3. Implementation of the data compression/decompression layer (③):

This layer is the real physical storage layer; it stores, in compressed form, the data passed down by the log layer. The correspondence between physical reclamation blocks and logical reclamation blocks is shown in Fig. 4. To optimize data reads and writes, the start address of every compressed data block after compression is an integer multiple of 8. As a result, 16 bits are enough to represent the address of a compressed data block.
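The reason 16 bits suffice can be checked directly: a 512 kB physical reclamation segment divided into 8-byte units gives 65536 possible start positions, which is exactly the range of a 16-bit value. The helper names below are hypothetical and serve only to illustrate that arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define PSEG_SIZE (512u * 1024u)   /* physical reclamation segment: 512 kB */
#define ADDR_UNIT 8u               /* start addresses are multiples of 8   */

/* Round a byte offset up to the next multiple of 8. */
static uint32_t align_up8(uint32_t offset)
{
    return (offset + (ADDR_UNIT - 1u)) & ~(ADDR_UNIT - 1u);
}

/* Store an aligned byte offset as a 16-bit count of 8-byte units. */
static uint16_t addr_encode(uint32_t aligned_offset)
{
    return (uint16_t)(aligned_offset / ADDR_UNIT);
}

/* Recover the byte offset from the stored 16-bit value. */
static uint32_t addr_decode(uint16_t stored)
{
    return (uint32_t)stored * ADDR_UNIT;
}
```

The cost of the alignment is at most 7 padding bytes per compressed data block.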

The space management block records the start address of each compressed data block and the space usage within the reclamation segment. Its structure is shown in Fig. 5. Everything other than the space management block is data blocks (marked in the figure). Each field is described in detail below.

1) Free compressed data block map (①): each bit in the table indicates the free state of one compressed data block. '1' means in use and '0' means free. The space reclamation process relies on this table to decide whether reclamation is needed: when the number of free blocks exceeds a threshold, the segment needs to be reclaimed.

2) Free logical data block map (②): this table is an array of 128 2-byte elements, one per compressed data block in the segment. Each element consists of 16 bits, and each bit indicates the free state of one logical data block in the compressed data block. '1' means in use and '0' means free. Whenever the upper-layer file system updates a logical data block, or deletes it by calling blk_delete, the change is recorded here. If, while recording, the number of free logical data blocks inside the compressed data block is found to exceed a threshold, the block is marked in the free compressed data block map.

3) Compressed data block address table (③): this table is an array of 128 2-byte elements, one per compressed data block in the segment. Multiplying an element's value by 8 gives the start address offset of the corresponding compressed data block. The offset is relative to the start address of the data blocks. The difference between the start addresses of two consecutive compressed data blocks is the length of the earlier block.

4) Next free segment pointer (④): the physical reclamation segments form one large linked list on the disk, and this field is the pointer to the next free reclamation segment. It is 32 bits long. The reclamation process is responsible for maintaining and updating it.
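The four fields can be written down as a C structure to show how the two bitmaps and the address table interact. This is a sketch under assumptions: Fig. 5 is not reproduced here, so the field order, the bit numbering inside each bitmap, the threshold parameters and all identifiers are hypothetical; only the field sizes and the '1 = in use, 0 = free' convention come from the description.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CBLK_PER_SEG 128    /* compressed data blocks per segment */
#define ADDR_UNIT    8u     /* address table counts 8-byte units  */

typedef struct {
    uint8_t  cblk_used[CBLK_PER_SEG / 8]; /* 1 bit per compressed block */
    uint16_t lblk_used[CBLK_PER_SEG];     /* 16 bits per compressed block, 1 per logical block */
    uint16_t cblk_addr[CBLK_PER_SEG];     /* start offset / 8 of each compressed block */
    uint32_t next_free_seg;               /* link to the next free reclamation segment */
} seg_mgmt;

/* Number of zero (free) bits in one 16-bit element. */
static int count_free16(uint16_t bits)
{
    int n = 0;
    for (int i = 0; i < 16; i++)
        if (((bits >> i) & 1u) == 0) n++;
    return n;
}

/* Record that one logical block became free; once more than `threshold`
 * logical blocks of the compressed block are free, mark the whole
 * compressed block free in the first map. */
static void mark_lblk_free(seg_mgmt *m, int cblk, int lblk, int threshold)
{
    m->lblk_used[cblk] &= (uint16_t)~(1u << lblk);
    if (count_free16(m->lblk_used[cblk]) > threshold)
        m->cblk_used[cblk / 8] &= (uint8_t)~(1u << (cblk % 8));
}

/* Number of compressed blocks currently marked free. */
static int count_free_cblks(const seg_mgmt *m)
{
    int n = 0;
    for (int i = 0; i < CBLK_PER_SEG; i++)
        if (((m->cblk_used[i / 8] >> (i % 8)) & 1u) == 0) n++;
    return n;
}

/* The segment is due for reclamation when free blocks exceed the threshold. */
static int seg_should_reclaim(const seg_mgmt *m, int threshold)
{
    return count_free_cblks(m) > threshold;
}

/* Byte offset of a compressed block relative to the start of the data area. */
static uint32_t cblk_offset(const seg_mgmt *m, int cblk)
{
    return (uint32_t)m->cblk_addr[cblk] * ADDR_UNIT;
}

/* Length as the difference of consecutive start offsets; valid only for
 * cblk < CBLK_PER_SEG - 1, since the text does not say how the last
 * block's length is found. */
static uint32_t cblk_length(const seg_mgmt *m, int cblk)
{
    return cblk_offset(m, cblk + 1) - cblk_offset(m, cblk);
}
```

Keeping both maps lets the layer record stale logical blocks cheaply at write time and defer the expensive decision (rewriting a compressed block, or reclaiming a whole segment) until a threshold is crossed.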

4. Implementation of the compression algorithm

The algorithm uses dynamic dictionary compression, in which a new word is created from two words already in the dictionary. The details are shown in Fig. 6. As the flow chart shows, the algorithm makes full use of the codes already in the dictionary, and a string can be added to the dictionary even when it has occurred only a few times. The decompression algorithm is shown in Fig. 7.

A real file system contains a large number of files in different formats and of different sizes, and a given compression algorithm cannot achieve a uniform compression ratio on all of them. To keep the compression ratio at about the 50% level, the implementation of the compression algorithm offers a configurable interface to the upper layer: when the file system calls the compression algorithm, it can specify the number of bits per code. When the compression ratio of a file is high, longer codes can then be used, which avoids dictionary overflow. Because the code length is not a multiple of 8, so that codes cannot be stored in a whole number of bytes, we use two buffers to buffer the input and output data. Taking a code length of 12 bits as the example, the functions and their behaviour are as follows:

1) input_code(char *input)

input holds the data block to be decompressed. The basic code length in the data block is 12 bits, so the function has to convert bytes into 12-bit values. The input buffer is a 32-bit storage unit. Initially, four bytes are read from the input data block; the leading 12 bits are then moved into a temporary variable, whose value is returned. On every later call, the number of valid bits in the buffer is checked first; if it is less than 24, new data is read from input until the valid length exceeds 24. This converts the 8-bit input data into 12-bit data.

2) output_code(char *output, unsigned int code)

This function writes a compressed code to the output data block. The code length is 12 bits and the basic output unit is a byte, so this function is the inverse of input_code: it converts 12-bit codes into 8-bit bytes. The algorithm again uses a 32-bit buffer. The buffer is first filled until its valid length exceeds 20 bits. Bytes are then written out from the head of the buffer until the valid length is less than 8. When new codes arrive later, they are first added until the valid length again reaches more than 20 bits, and then data is written out. The data left in the buffer at the end must be flushed by one extra call to this function.
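The conversion between 12-bit codes and bytes described for input_code and output_code can be sketched as follows. This is an illustration rather than the patent's code: it keeps the 32-bit buffer, but it drains and refills the buffer eagerly instead of using the thresholds of 20 and 24 bits mentioned above, it packs the most significant bits first (an assumption), and the bit_writer and bit_reader types and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CODE_BITS 12

typedef struct {
    unsigned char *data;  /* output byte array            */
    size_t   pos;         /* next byte to write           */
    uint32_t bits;        /* pending bits, left-aligned   */
    int      nbits;       /* number of valid pending bits */
} bit_writer;

typedef struct {
    const unsigned char *data;
    size_t   len, pos;
    uint32_t bits;        /* buffered bits, left-aligned  */
    int      nbits;
} bit_reader;

/* Append one 12-bit code, then write out every complete byte. */
static void put_code(bit_writer *w, unsigned code)
{
    w->bits |= (uint32_t)(code & 0xFFFu) << (32 - CODE_BITS - w->nbits);
    w->nbits += CODE_BITS;
    while (w->nbits >= 8) {
        w->data[w->pos++] = (unsigned char)(w->bits >> 24);
        w->bits <<= 8;
        w->nbits -= 8;
    }
}

/* Write out the last partial byte, padded with zero bits. */
static void flush_bits(bit_writer *w)
{
    if (w->nbits > 0) {
        w->data[w->pos++] = (unsigned char)(w->bits >> 24);
        w->bits = 0;
        w->nbits = 0;
    }
}

/* Refill the 32-bit buffer from the input, then take the top 12 bits.
 * The caller must know how many codes were written. */
static unsigned get_code(bit_reader *r)
{
    while (r->nbits <= 24 && r->pos < r->len) {
        r->bits |= (uint32_t)r->data[r->pos++] << (24 - r->nbits);
        r->nbits += 8;
    }
    unsigned code = r->bits >> (32 - CODE_BITS);
    r->bits <<= CODE_BITS;
    r->nbits -= CODE_BITS;
    return code;
}
```

Two 12-bit codes occupy exactly three bytes, so an odd number of codes leaves four padding bits in the final byte; this is why a final flush step is needed.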

3) find_match(int hash_prefix, unsigned int hash_character)

The inputs are a code and an ordinary character (whose value is less than 256). The function looks up in the hash table whether the string hash_prefix + hash_character is in the dictionary. The hash function shifts hash_prefix right by 4 bits and then combines it with hash_character by a bitwise OR. Starting from the value produced by the hash function, the corresponding hash chain is searched sequentially for the string. If the string is found, its index in the table is returned; if not, the index of the last node in the chain is returned. This function can only test whether a new word made up of a dictionary code plus one ordinary character is in the dictionary.

4) longest_match(int hash_prefix, unsigned int hash_character)

This function outputs the code of the longest matching string. It first uses find_match() to find a longest string A, then uses find_match() again to find the next longest matching string B, and tests whether A+B is in the dictionary. If it is, A is replaced by the code of A+B and the process repeats; when A+B is not in the dictionary, A is returned. Because a new word is formed from two words, finding the longest matching string is a recursive process.

5) add_dictionary_item(int prefix, int sufix, int index)

This function adds prefix + sufix to the dictionary.

6) char *decode_string(unsigned char *buffer, unsigned int code)

code is a code taken from the block being decompressed, and the function expands the corresponding string into buffer. The function uses a stack to hold the codes being decoded. On initialization, code is pushed onto the stack, and the following steps are then repeated: look at the code on top of the stack; if it is an ordinary character, pop it and append it to buffer; if it is a compressed code, look up its two component words in the dictionary, pop the top of the stack, and push the two words. This is repeated until the stack is empty, at which point the string stored in buffer is the decoding of code.

7) compress(char *input, int *size, char *output)

The inputs are the data block to be compressed and its size in bytes; the output is the compressed data block, and the size of the compressed block is returned in size. The function uses longest_match() to find the longest matching string; longest_match() also returns the following string that could not be joined to the match (itself a word in the dictionary). The function calls output_code() to output the code of the matched string and adds the new word formed from the two words returned by longest_match() to the dictionary. At the end, the function outputs the maximum code value (4k-1) to mark the end of the compressed data block.

8) expand(char *input, int *size, char *output)

The inputs are the data block to be decompressed and its size; the output is the decompressed data block, and the size of the decompressed block is returned in size. The function first calls input_code() to obtain a compressed code, then calls decode_string() to output its decoding, and adds the pair formed by this code and the previous code to the dictionary.
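The following is a small round-trip sketch of the pair-based dictionary idea behind compress() and expand(): every dictionary entry is the concatenation of two earlier codes, and a code is expanded with an explicit stack as described for decode_string(). It is not the patent's implementation, and Fig. 6 and Fig. 7 are not reproduced here. The assumptions are: a naive linear search replaces the hash-based find_match()/longest_match(), each new entry joins the previous match with the current one, the duplicate check before insertion is omitted, codes are kept in an int array instead of being packed to 12 bits, no end marker is written, and the input is at most BLOCK_MAX bytes. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIRST_CODE 256      /* codes 0..255 are literal bytes              */
#define MAX_CODES  4095     /* 12-bit codes; 4095 left for the end marker  */
#define BLOCK_MAX  16384    /* largest input handled by this sketch        */

typedef struct { int left, right; size_t len; } pair_t;

static pair_t dict[MAX_CODES];   /* entries 0..255 unused */
static int next_code;

static size_t code_len(int code) { return code < FIRST_CODE ? 1 : dict[code].len; }

static void add_pair(int left, int right)
{
    dict[next_code].left  = left;
    dict[next_code].right = right;
    dict[next_code].len   = code_len(left) + code_len(right);
    next_code++;
}

/* Expand one code into out with an explicit stack; returns bytes written. */
static size_t expand_code(int code, unsigned char *out)
{
    int stack[MAX_CODES + 1];
    int top = 0;
    size_t n = 0;
    stack[top++] = code;
    while (top > 0) {
        int c = stack[--top];
        if (c < FIRST_CODE) {
            out[n++] = (unsigned char)c;      /* literal: emit it */
        } else {
            stack[top++] = dict[c].right;     /* push right first so that */
            stack[top++] = dict[c].left;      /* left is expanded first   */
        }
    }
    return n;
}

/* Does the expansion of `code` equal the bytes at p? Caller checks length. */
static int matches(int code, const unsigned char *p)
{
    static unsigned char tmp[BLOCK_MAX];
    size_t len = expand_code(code, tmp);
    return memcmp(tmp, p, len) == 0;
}

/* Compress n bytes (n <= BLOCK_MAX) into codes[]; returns the code count. */
static size_t pair_compress(const unsigned char *in, size_t n, int *codes)
{
    size_t pos = 0, count = 0;
    int prev = -1;
    next_code = FIRST_CODE;
    while (pos < n) {
        int best = in[pos];                   /* fall back to a literal */
        size_t best_len = 1;
        for (int c = FIRST_CODE; c < next_code; c++) {
            size_t len = dict[c].len;
            if (len > best_len && len <= n - pos && matches(c, in + pos)) {
                best = c;
                best_len = len;
            }
        }
        codes[count++] = best;
        if (prev >= 0 && next_code < MAX_CODES)
            add_pair(prev, best);             /* new word = two existing words */
        prev = best;
        pos += best_len;
    }
    return count;
}

/* Rebuild the same dictionary while decoding; returns bytes written. */
static size_t pair_expand(const int *codes, size_t count, unsigned char *out)
{
    size_t n = 0;
    int prev = -1;
    next_code = FIRST_CODE;
    for (size_t i = 0; i < count; i++) {
        n += expand_code(codes[i], out + n);
        if (prev >= 0 && next_code < MAX_CODES)
            add_pair(prev, codes[i]);
        prev = codes[i];
    }
    return n;
}
```

Because an entry is only added after both of its halves have been emitted, the decoder never meets a code it has not yet defined, so it can rebuild the dictionary from the code stream alone. Entries also grow by whole matches rather than by one character at a time, which is the intended advantage over plain LZW on repetitive data.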

Fig. 8, Fig. 9 and Fig. 10 are, respectively, the flow diagram of read and write operations, the flow chart for creating an automatic compression/decompression disk partition, and the flow chart for deleting an automatic compression/decompression disk partition. The figures show the operation steps explicitly, so they are not described further here.

The invention substantially increases the utilization of embedded device resources, storage in particular, and thereby improves performance. It has theoretical significance as well as considerable practical and economic value.

Claims

1. An automatic compression/decompression file system, characterized in that: a disk block abstraction layer, which can be integrated into the hardware of the physical storage, is added between the ordinary file system and the physical storage used to hold the compressed data; the disk block abstraction layer is a module that separates the ordinary file system from the compressed on-disk data and that can be modified to support multiple file systems; it contains, in order:

2) a log-structured physical storage data block mapping layer: it uses a log-structured algorithm and is responsible for mapping the virtual block space onto the physical data blocks of the disk, so that the upper-layer file system writes data from the virtual block space to the disk, or reads it from the disk, through this mapping layer;

2. The automatic compression/decompression file system as claimed in claim 1, characterized in that: each element of the virtual block space carries a 32-bit pointer that contains the logical data block number; with this pointer pointing to the physical storage location, the virtual block space provides the following interfaces to the upper-layer file system on behalf of the disk block abstraction layer:

3. The automatic compression/decompression file system as claimed in claim 1, characterized in that: the log-structured physical storage data block mapping layer is a logical reclamation segment whose content is a contiguous storage area of compressed data blocks; a compressed data block is the data unit formed when the logical data blocks are compressed at a certain compression ratio, and the logical reclamation segment, after compression at a certain ratio, forms the physical reclamation segment that holds the compressed data on the physical disk; the logical data block number, the compressed data block number and the address of the logical reclamation segment together make up, in that order, the 32-bit pointer of the virtual block space.

4. The automatic compression/decompression file system as claimed in claim 1, characterized in that: the data compression/decompression layer is the physical disk that stores, in compressed form, the data passed down by the log-structured physical storage data block mapping layer; it consists, in order, of a space management block and a contiguous storage area of physical reclamation blocks, wherein the correspondence between a physical reclamation block and a logical reclamation block is determined by the 32-bit pointer; the space management block contains:

5. The compression algorithm designed for the automatic compression/decompression file system according to claim 1, characterized in that: it is a dynamic dictionary compression algorithm in which each new word is created from two words already in the dictionary; during compression, the two existing words are both longest matching strings found successively in the input; if the new word formed from the two is not in the dictionary, it is added to the dictionary; otherwise the algorithm checks whether a second longest match follows and, if so, repeats the above test, until there is no more input; during decompression, if the new word obtained is not in the dictionary, the new word is output; otherwise the next two existing words are examined and the process repeats.