CN106294548A

CN106294548A - Method and system for compressing provenance data

Info

Publication number: CN106294548A
Application number: CN201610588856.4A
Authority: CN
Inventors: 谢雨来; 荣震; 陈俭喜; 冯丹; 秦磊华
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2016-07-25
Filing date: 2016-07-25
Publication date: 2017-01-04

Abstract

The invention discloses a method for compressing provenance data and belongs to the technical field of computer data storage. The method divides provenance data into ancestry data and identity data. The ancestry data are compressed in four steps: listing the ancestry data, which arranges the nodes in the ancestry data into lists according to their dependency relationships; reference encoding, which finds the ancestor list that shares the most common ancestors with the current list; run-length encoding, which finds ancestor nodes with consecutive node numbers and represents each run by its starting node number and its length; and delta encoding, which represents each remaining ancestor by the difference between successive ancestor node numbers. The identity data are compressed with dictionary encoding, which replaces frequently occurring character strings with integers. The invention also provides a system that implements the method. The invention improves compression performance on provenance data and ensures that the provenance data can still be used normally after compression.

Description

Method and system for compressing provenance data

Technical field

The invention belongs to the technical field of computer data storage and, more particularly, relates to a method for compressing provenance data.

Background art

Provenance data are metadata that describe the history of a data object. Provenance data enable many new functions, including experiment documentation, security, search, and program debugging. Many academic research institutions have therefore built provenance collection systems. Most of these systems concentrate on how to collect provenance data, and some also build applications that make use of it, but all of them neglect one important aspect: storing provenance data efficiently over the long term.

Although provenance data can be used in areas such as security and search and have given rise to a variety of provenance systems, they can only be widely adopted if they are stored efficiently. Unoptimized provenance data occupy a large amount of space, which is a major obstacle to their use.

Existing general-purpose data compression algorithms achieve good compression performance, but they do not take the characteristics of provenance data into account. After compression with these algorithms, some characteristics of the provenance data are lost entirely, so the provenance data can no longer be used. Some methods applicable to provenance data compression have also appeared, such as the factorization and inheritance compression techniques for provenance proposed by Chapman et al., but these do not fully exploit the characteristics of provenance data, and their compression performance is unsatisfactory. A method that compresses provenance data efficiently is therefore needed.

Summary of the invention

In view of the above shortcomings of the prior art and the need for improvement, the invention provides a method and system for compressing provenance data. The aim is to exploit the fact that provenance data contain many similar entries and many consecutive entries. The provenance information is divided into ancestry data and identity data. The ancestry data are arranged into lists and then compressed with reference encoding, run-length encoding, and delta encoding; the identity data are compressed with dictionary encoding. This compression method greatly reduces the size of the provenance data while ensuring that the compressed provenance data can still be used normally.

To achieve the technical purpose of the invention, the invention provides a method for compressing provenance data, characterized in that the method comprises the following steps:

(1) Data classification step: the provenance data are divided into identity data and ancestry data. Identity data describe the attributes of the data themselves; ancestry data represent the dependency relationships between data objects;

(2) Ancestry data compression step: similar data and consecutive data in the provenance data are located according to the dependency relationships in the ancestry data, and the similar data, the consecutive data, and the remaining data are then compressed in turn;

(3) Identity data compression step: the identity data in the provenance data are compressed with dictionary encoding. A database or text file is scanned for frequently occurring character strings, which are replaced with integer codes, and the mapping between integer codes and character strings is stored in a database.

Further, the ancestry data compression step comprises the following sub-steps:

(11) Listing the ancestry data:

The ancestry data are represented by a series of provenance nodes, each having one or more ancestors. Each node is assigned a node number in the order in which its provenance was generated, and the dependency relationships between nodes are represented as node lists. Out(x) denotes the ancestor list of node x, that is, the set of ancestor nodes of data node x;

(12) Reference encoding:

The ancestor lists of the W nodes preceding data node x are examined, where W is a window parameter, to find the ancestor list that has the most ancestor nodes in common with Out(x). Let Out(y) be this ancestor list; y is then the reference node. Out(x) is encoded in three parts: a reference number, a bit sequence, and the remaining ancestor nodes. The reference number is the difference between the node numbers of x and y. The bit sequence has one bit for each entry of Out(y): the bit is 1 if that entry is also an ancestor in Out(x), and 0 otherwise. The remaining ancestor nodes are the ancestors in Out(x) minus the common ancestor nodes;

(13) Run-length encoding:

Runs of consecutive node numbers are found among the remaining ancestor nodes of Out(x), and each run is encoded in two parts: its starting node number and its length. The remaining ancestor nodes are then the remaining ancestor nodes from the previous step minus the nodes in these runs;

(14) Delta encoding:

Let the ancestor nodes remaining after the previous step be $x_1, x_2, x_3, \ldots, x_k$, with node numbers $x_1 \le x_2 \le \cdots \le x_k$. The differences $x_1 - x,\ x_2 - x_1,\ x_3 - x_2,\ \ldots,\ x_k - x_{k-1}$ are encoded as the difference list.
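The difference list of sub-step (14) and its inverse can be written compactly as follows. The symbols $d_i$ are introduced here only for illustration and do not appear in the original text, and the ordering constraint is assumed to run up to $x_k$.

```latex
\[
d_1 = x_1 - x, \qquad d_i = x_i - x_{i-1} \quad (2 \le i \le k),
\]
\[
x_i = x + \sum_{j=1}^{i} d_j \quad (1 \le i \le k).
\]
```

Under the ordering assumption, every $d_i$ with $i \ge 2$ is non-negative, whereas $d_1$ may be negative when the smallest remaining ancestor number is below the node number $x$. The second identity is what allows the ancestor numbers to be recovered from the stored differences.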

To achieve the technical purpose of the invention, the invention also provides a system for compressing provenance data, characterized in that the system comprises the following modules:

A data classification module, configured to divide the provenance data into identity data and ancestry data. Identity data describe the attributes of the data themselves; ancestry data represent the dependency relationships between data objects;

An ancestry data compression module, configured to locate similar data and consecutive data in the provenance data according to the dependency relationships in the ancestry data, and then to compress the similar data, the consecutive data, and the remaining data in turn;

An identity data compression module, configured to compress the identity data in the provenance data with dictionary encoding. A database or text file is scanned for frequently occurring character strings, which are replaced with integer codes, and the mapping between integer codes and character strings is stored in a database.

Further, the ancestry data compression module comprises the following sub-units:

An ancestry data listing sub-unit, configured to arrange the ancestry data into lists. The ancestry data are represented by a series of provenance nodes, each having one or more ancestors; each node is assigned a node number in the order in which its provenance was generated; the dependency relationships between nodes are represented as node lists; and Out(x) denotes the ancestor list of node x, that is, the set of ancestor nodes of data node x;

A reference encoding sub-unit, configured to examine the ancestor lists of the W nodes preceding data node x, where W is a window parameter, and to find the ancestor list that has the most ancestor nodes in common with Out(x). Let Out(y) be this ancestor list; y is then the reference node. Out(x) is encoded in three parts: a reference number, a bit sequence, and the remaining ancestor nodes. The reference number is the difference between the node numbers of x and y. The bit sequence has one bit for each entry of Out(y): the bit is 1 if that entry is also an ancestor in Out(x), and 0 otherwise. The remaining ancestor nodes are the ancestors in Out(x) minus the common ancestor nodes;

A run-length encoding sub-unit, configured to find runs of consecutive node numbers among the remaining ancestor nodes of Out(x) and to encode each run in two parts: its starting node number and its length. The remaining ancestor nodes are then those from the previous step minus the nodes in these runs;

A delta encoding sub-unit, configured to take the ancestor nodes remaining after the previous step, $x_1, x_2, x_3, \ldots, x_k$, with node numbers $x_1 \le x_2 \le \cdots \le x_k$, and to encode the differences $x_1 - x,\ x_2 - x_1,\ x_3 - x_2,\ \ldots,\ x_k - x_{k-1}$ as the difference list.

In general, compared with the prior art, the above technical solution conceived by the invention has the following features and beneficial effects:

(1) Classical compression algorithms do not preserve the characteristics of provenance data during compression. They treat the data as an ordinary file, so those characteristics are lost and the compressed provenance data cannot be used. The invention solves this problem.

(2) Compared with existing compression algorithms that are based on the characteristics of provenance data, the invention makes better use of those characteristics and gives full consideration to compressing the ancestry data within the provenance data, which further improves compression.

Brief description of the drawings

Fig. 1 is a flow chart of the method for compressing provenance data according to the invention;

Fig. 2 is a schematic diagram of listing the ancestry data;

Fig. 3 shows an example of ancestry data compression.

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

Fig. 1 is a flow chart of the method for compressing provenance data according to the invention. The method comprises the following steps:

(1) Data classification step:

The provenance data are divided into two parts, ancestry data and identity data. Ancestry data represent the dependency relationships between data objects; identity data describe the attributes of the data themselves, such as file name, file ID, process name, and process ID;
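The classification step can be sketched as follows. This is only an illustration under stated assumptions: the record layout and the field names (`node`, `name`, `type`, `ancestors`) are hypothetical, since the text names only file name, file ID, process name, and process ID as examples of identity attributes.

```python
from typing import Any

# Hypothetical field name holding the dependency information of a record.
ANCESTRY_FIELD = "ancestors"


def classify(record: dict[str, Any]) -> tuple[dict[str, Any], list[int]]:
    """Split one provenance record into (identity data, ancestry data)."""
    # Everything that is not dependency information is treated as identity data.
    identity = {k: v for k, v in record.items() if k != ANCESTRY_FIELD}
    # Ancestry data: the sorted node numbers this record depends on.
    ancestry = sorted(record.get(ANCESTRY_FIELD, []))
    return identity, ancestry


record = {"node": 16, "name": "report.txt", "type": "file", "ancestors": [14, 11, 19]}
identity, ancestry = classify(record)
print(identity)  # {'node': 16, 'name': 'report.txt', 'type': 'file'}
print(ancestry)  # [11, 14, 19]
```

Keeping the two parts separate lets each be handled by the compressor suited to it: the integer lists go to the ancestry pipeline and the strings to dictionary encoding.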

(2) Ancestry data compression step:

The ancestry data compression step is further divided into the following sub-steps:

(11) Listing the ancestry data:

Fig. 2 shows an embodiment in which the dependency relationships between data nodes are represented by ancestor lists.

The figure contains two provenance nodes, each with multiple ancestor nodes. Because each node is assigned a node number in the order in which its provenance was generated, the dependency relationships between data nodes in the figure can be represented as lists of ancestor nodes;
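A minimal sketch of building the ancestor lists Out(x) is given below. Fig. 2 is not reproduced in this text, so the node numbers are taken from the Out(15) and Out(16) values used in the Fig. 3 walkthrough; the function name and the (node, ancestor) edge format are assumptions made for illustration.

```python
from collections import defaultdict


def build_ancestor_lists(edges):
    """Build Out(x) for every node x from (node, ancestor) pairs."""
    out = defaultdict(set)
    for node, ancestor in edges:
        out[node].add(ancestor)
    # Sorted lists make the later run-length and delta steps straightforward.
    return {node: sorted(ancestors) for node, ancestors in out.items()}


edges = [(15, a) for a in (3, 11, 13, 14, 17)]
edges += [(16, a) for a in (11, 14, 19, 20, 21, 31, 33)]
out = build_ancestor_lists(edges)
print(out[15])  # [3, 11, 13, 14, 17]
print(out[16])  # [11, 14, 19, 20, 21, 31, 33]
```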

(12) Reference encoding:

The first step in Fig. 3 shows an embodiment of reference encoding:

In the figure, Out(15) is the ancestor list that shares the most common ancestors with Out(16), so Out(15) is the reference list for Out(16). The reference number is the difference between node numbers 16 and 15, which is 1. The common ancestor nodes of Out(15) and Out(16) are 11 and 14. A bit sequence is generated for the reference list Out(15): the bit for each ancestor node in Out(15) is 1 if that node is a common ancestor and 0 otherwise. Out(16) is thus encoded in three parts, as shown in the figure: reference number 1, bit sequence 01010, and remaining ancestor nodes 19, 20, 21, 31, 33;

The reference list Out(15) itself has reference number 0 and an empty bit sequence, so it is encoded in three parts, as shown in the figure: reference number 0, an empty bit sequence, and remaining ancestor nodes 3, 11, 13, 14, 17;
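The reference encoding of this embodiment can be sketched as below. The function name, the tie-breaking rule (the nearest preceding node wins), and the use of reference number 0 to mean "no reference found in the window" are assumptions of this sketch; the text fixes only the three output parts and the window parameter W.

```python
def reference_encode(x, out, window):
    """Encode Out(x) against the best reference among the W preceding nodes.

    Returns (reference number, bit sequence, remaining ancestors).
    """
    ancestors = out[x]
    best_y, best_common = None, set()
    # Scan the W preceding node numbers, nearest first.
    for y in range(x - 1, x - window - 1, -1):
        common = set(out.get(y, [])) & set(ancestors)
        if len(common) > len(best_common):
            best_y, best_common = y, common
    if best_y is None:
        # No list in the window shares an ancestor: store Out(x) unreferenced.
        return 0, "", list(ancestors)
    # One bit per entry of Out(y): 1 if that entry is also in Out(x).
    bits = "".join("1" if a in best_common else "0" for a in out[best_y])
    remaining = [a for a in ancestors if a not in best_common]
    return x - best_y, bits, remaining


out = {15: [3, 11, 13, 14, 17], 16: [11, 14, 19, 20, 21, 31, 33]}
print(reference_encode(16, out, window=4))  # (1, '01010', [19, 20, 21, 31, 33])
print(reference_encode(15, out, window=4))  # (0, '', [3, 11, 13, 14, 17])
```

A larger window gives more candidate reference lists at the cost of more comparisons per node, which is the usual trade-off for this parameter.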

(13) Run-length encoding:

The second step in Fig. 3 shows an embodiment of run-length encoding:

In the figure, the consecutive node numbers 13, 14 are found among the remaining ancestor nodes of list Out(15). This run is represented by its starting node number 13 and its length 2, and the remaining ancestor nodes are 3, 11, 17;

The consecutive node numbers 19, 20, 21 are found among the remaining ancestor nodes of list Out(16). This run is represented by its starting node number 19 and its length 3, and the remaining ancestor nodes are 31, 33;
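The run-length step of this embodiment can be sketched as follows. The function name and the `min_run` parameter are assumptions; a minimum run length of 2 is chosen because the embodiment encodes the two-element run 13, 14.

```python
def run_length_encode(nodes, min_run=2):
    """Split a sorted node list into (runs, remaining).

    Each run is (start node number, length); nodes outside any run
    are returned in `remaining` for the delta encoding step.
    """
    runs, remaining = [], []
    i = 0
    while i < len(nodes):
        j = i
        # Extend j while the next node number is exactly one larger.
        while j + 1 < len(nodes) and nodes[j + 1] == nodes[j] + 1:
            j += 1
        length = j - i + 1
        if length >= min_run:
            runs.append((nodes[i], length))
        else:
            remaining.extend(nodes[i:j + 1])
        i = j + 1
    return runs, remaining


print(run_length_encode([3, 11, 13, 14, 17]))    # ([(13, 2)], [3, 11, 17])
print(run_length_encode([19, 20, 21, 31, 33]))   # ([(19, 3)], [31, 33])
```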

(14) Delta encoding:

Let the ancestor nodes remaining after the previous step be $x_1, x_2, x_3, \ldots, x_k$, with node numbers $x_1 \le x_2 \le \cdots \le x_k$. The differences $x_1 - x,\ x_2 - x_1,\ x_3 - x_2,\ \ldots,\ x_k - x_{k-1}$ are encoded as the difference list;

The third step in Fig. 3 shows an embodiment of delta encoding:

For list Out(15), the ancestor nodes remaining after the above steps are 3, 11, 17, so the differences 3 - 15, 11 - 3, 17 - 11 are encoded, giving -12, 8, 6;

For list Out(16), the ancestor nodes remaining after the above steps are 31, 33, so the differences 31 - 16, 33 - 31 are encoded, giving 15, 2;
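The delta step, together with the inverse that recovers the node numbers, can be sketched as follows. The function names are assumptions, and the decoder is not described in the text; it is included only to illustrate that the differences are sufficient to restore the original ancestor numbers.

```python
def delta_encode(x, remaining):
    """Encode sorted remaining ancestors of node x as a difference list."""
    if not remaining:
        return []
    # First difference is taken against the node number x itself.
    deltas = [remaining[0] - x]
    # Later differences are taken between successive ancestors.
    deltas += [b - a for a, b in zip(remaining, remaining[1:])]
    return deltas


def delta_decode(x, deltas):
    """Recover the ancestor node numbers from a difference list."""
    nodes, prev = [], x
    for d in deltas:
        prev += d
        nodes.append(prev)
    return nodes


print(delta_encode(15, [3, 11, 17]))   # [-12, 8, 6]
print(delta_encode(16, [31, 33]))      # [15, 2]
print(delta_decode(15, [-12, 8, 6]))   # [3, 11, 17]
```

Only the first difference can be negative, so a practical encoder would need a signed representation for that one value; how the patent stores it is not stated.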

(3) Identity data compression step:

The identity data in the provenance data are compressed with dictionary encoding. A database or text file is scanned for frequently occurring character strings, which are replaced with integer codes, and the mapping between integer codes and character strings is stored in a database.
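Dictionary encoding of identity strings can be sketched as follows. The frequency threshold `min_count`, the function names, and the use of an in-memory dict in place of the database mentioned in the text are all assumptions of this sketch.

```python
from collections import Counter


def build_dictionary(strings, min_count=2):
    """Map each frequently occurring string to an integer code."""
    counts = Counter(strings)
    frequent = [s for s, c in counts.most_common() if c >= min_count]
    return {s: code for code, s in enumerate(frequent)}


def dictionary_encode(strings, mapping):
    """Replace frequent strings with their integer codes; keep the rest."""
    return [mapping.get(s, s) for s in strings]


def dictionary_decode(values, mapping):
    """Restore the original strings from codes using the stored mapping."""
    reverse = {code: s for s, code in mapping.items()}
    return [reverse[v] if isinstance(v, int) else v for v in values]


names = ["/usr/bin/gcc", "/usr/bin/gcc", "main.c", "/usr/bin/gcc", "/bin/sh", "/bin/sh"]
mapping = build_dictionary(names)
encoded = dictionary_encode(names, mapping)
print(mapping)   # {'/usr/bin/gcc': 0, '/bin/sh': 1}
print(encoded)   # [0, 0, 'main.c', 0, 1, 1]
```

Because the mapping is kept alongside the encoded values, a lookup by code restores the original string, which is what keeps the identity data usable after compression.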

The foregoing describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for compressing provenance data, characterized by comprising the following steps:

(1) Data classification step: the provenance data are divided into identity data and ancestry data. Identity data describe the attributes of the data themselves; ancestry data represent the dependency relationships between data objects;

(2) Ancestry data compression step: similar data and consecutive data in the provenance data are located according to the dependency relationships in the ancestry data, and the similar data, the consecutive data, and the remaining data are then compressed in turn;

(3) Identity data compression step: the identity data in the provenance data are compressed with dictionary encoding.

2. The method for compressing provenance data according to claim 1, characterized in that the ancestry data compression step comprises the following sub-steps:

(11) Listing the ancestry data:

The ancestry data are represented by a series of provenance nodes, each having one or more ancestors. Each node is assigned a node number in the order in which its provenance was generated, and the dependency relationships between nodes are represented as node lists. Out(x) denotes the ancestor list of node x, that is, the set of ancestor nodes of data node x;

(12) Reference encoding:

The ancestor lists of the W nodes preceding data node x are examined, where W is a window parameter, to find the ancestor list that has the most ancestor nodes in common with Out(x). Let Out(y) be this ancestor list; y is then the reference node. Out(x) is encoded in three parts: a reference number, a bit sequence, and the remaining ancestor nodes. The reference number is the difference between the node numbers of x and y. The bit sequence has one bit for each entry of Out(y): the bit is 1 if that entry is also an ancestor in Out(x), and 0 otherwise. The remaining ancestor nodes are the ancestors in Out(x) minus the common ancestor nodes;

(13) Run-length encoding:

(14) Delta encoding:

Let the ancestor nodes remaining after the previous step be $x_1, x_2, x_3, \ldots, x_k$, with node numbers $x_1 \le x_2 \le \cdots \le x_k$. The differences $x_1 - x,\ x_2 - x_1,\ x_3 - x_2,\ \ldots,\ x_k - x_{k-1}$ are encoded as the difference list.

3. A system for compressing provenance data, characterized by comprising the following modules:

A data classification module, configured to divide the provenance data into identity data and ancestry data. Identity data describe the attributes of the data themselves; ancestry data represent the dependency relationships between data objects;

An ancestry data compression module, configured to locate similar data and consecutive data in the provenance data according to the dependency relationships in the ancestry data, and then to compress the similar data, the consecutive data, and the remaining data in turn;

An identity data compression module, configured to compress the identity data in the provenance data with dictionary encoding.

4. The system for compressing provenance data according to claim 3, characterized in that the ancestry data compression module comprises the following sub-units:

An ancestry data listing sub-unit, configured to arrange the ancestry data into lists. The ancestry data are represented by a series of provenance nodes, each having one or more ancestors; each node is assigned a node number in the order in which its provenance was generated; the dependency relationships between nodes are represented as node lists; and Out(x) denotes the ancestor list of node x, that is, the set of ancestor nodes of data node x;

A reference encoding sub-unit, configured to examine the ancestor lists of the W nodes preceding data node x, where W is a window parameter, and to find the ancestor list that has the most ancestor nodes in common with Out(x). Let Out(y) be this ancestor list; y is then the reference node. Out(x) is encoded in three parts: a reference number, a bit sequence, and the remaining ancestor nodes. The reference number is the difference between the node numbers of x and y. The bit sequence has one bit for each entry of Out(y): the bit is 1 if that entry is also an ancestor in Out(x), and 0 otherwise. The remaining ancestor nodes are the ancestors in Out(x) minus the common ancestor nodes;

A run-length encoding sub-unit, configured to find runs of consecutive node numbers among the remaining ancestor nodes of Out(x) and to encode each run in two parts: its starting node number and its length. The remaining ancestor nodes are then those from the previous step minus the nodes in these runs;

A delta encoding sub-unit, configured to take the ancestor nodes remaining after the previous step, $x_1, x_2, x_3, \ldots, x_k$, with node numbers $x_1 \le x_2 \le \cdots \le x_k$, and to encode the differences $x_1 - x,\ x_2 - x_1,\ x_3 - x_2,\ \ldots,\ x_k - x_{k-1}$ as the difference list.