CN106294548A - Method and system for compressing provenance data - Google Patents


Info

Publication number
CN106294548A
Authority
CN
China
Prior art keywords
data
node
ancestor
list
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610588856.4A
Other languages
Chinese (zh)
Inventor
谢雨来
荣震
陈俭喜
冯丹
秦磊华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201610588856.4A
Publication of CN106294548A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3088 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing the use of a dictionary, e.g. LZ78

Abstract

The invention discloses a method for compressing provenance data, belonging to the technical field of computer data storage. The method divides provenance data into ancestry data and identity data. Ancestry data are compressed in four steps: listing the ancestry data, in which the nodes in the ancestry data are organized into lists according to their dependencies; reference encoding, which finds the ancestor list that shares the most common ancestors with the list being encoded, i.e. the most similar list; run-length encoding, which finds ancestor nodes with consecutive numbers and represents each run by its starting node number and its length; and delta encoding, which represents each remaining ancestor by the difference between successive ancestor node numbers. Identity data are compressed by dictionary encoding, in which integers stand for frequently occurring character strings. The invention also provides a system implementing the method. The invention improves compression performance on provenance data and ensures that the data remain usable after compression.

Description

Method and system for compressing provenance data
Technical field
The invention belongs to the technical field of computer data storage and, more particularly, relates to a method for compressing provenance data.
Background
Provenance data are metadata that describe the history of an object. They enable many new capabilities, including experiment documentation, security, search, and program debugging. Many academic research institutions have therefore built provenance collection systems. Most of these systems concentrate on how to collect provenance data, and some also build applications that use provenance data, but all of them neglect one important aspect: storing provenance data efficiently over the long term.
Although provenance data can be used in areas such as security and search, and have given rise to a variety of provenance systems, storing them efficiently is a key prerequisite for their wide adoption. Unoptimized provenance data occupy a large amount of space, which is a major obstacle to using them.
Existing general-purpose data compression algorithms perform well in terms of compression, but they do not take the characteristics of provenance data into account. After compression by these algorithms, some characteristics of the provenance data are lost entirely, so that the data can no longer be used. Some methods applicable to provenance compression have also appeared, such as the decomposition and inheritance compression techniques proposed by Chapman et al. specifically for compressing provenance, but this approach does not fully exploit the characteristics of provenance data, and its compression performance is unsatisfactory. A method for compressing provenance data efficiently is therefore needed.
Summary of the invention
In view of the above shortcomings of the prior art and the need for improvement, the invention provides a method and system for compressing provenance data. The aim is to exploit two characteristics of provenance data, namely that they contain many similar items and many consecutive items. Existing provenance information is divided into ancestry data and identity data. The ancestry data are organized into lists and then compressed with reference encoding, run-length encoding, and delta encoding; the identity data are compressed with dictionary encoding. This method can greatly reduce the size of provenance data while ensuring that the data remain usable after compression.
To achieve this technical purpose, the invention provides a provenance data compression method, characterized in that the method comprises the following steps:
(1) Data classification step: divide the provenance data into identity data and ancestry data. Identity data describe the attributes of the data themselves; ancestry data represent the dependencies between data objects;
(2) Ancestry data compression step: according to the dependencies in the ancestry data, find the similar data and the consecutive data in the provenance data, then compress the similar data, the consecutive data, and the remaining data in turn;
(3) Identity data compression step: compress the identity data in the provenance data using dictionary encoding: scan the database or text file to find frequently occurring character strings, replace them with integer codes, and store the mapping between integer codes and character strings in the database.
Further, the ancestry data compression step comprises the following sub-steps:
(11) Listing the ancestry data:
Ancestry data are represented by a series of provenance nodes, each having one or more ancestors. Each node is assigned a node number in the order in which its provenance was generated, and the dependencies between nodes are represented as node lists. Out(x) denotes the ancestor list of node x, i.e. the set of ancestor nodes of data node x;
(12) Reference encoding:
Examine the ancestor lists of the W nodes preceding data node x, where W is a window parameter, and find the ancestor list that shares the most ancestor nodes with Out(x). Call this list Out(y); y is then the reference node. Out(x) is encoded in three parts: a reference number, a bit sequence, and the remaining ancestor nodes. The reference number is the difference between the node numbers of x and y. The bit sequence has one bit per entry of Out(y): a bit is 1 if the corresponding ancestor node in Out(y) also appears in Out(x), and 0 otherwise. The remaining ancestor nodes are the ancestor nodes of Out(x) minus the common ancestor nodes;
(13) Run-length encoding:
Find runs of consecutive node numbers among the remaining ancestor nodes of Out(x) and encode each run in two parts: its starting node number and its length. The remaining ancestor nodes are then those left from the previous step minus the consecutive nodes;
(14) Delta encoding:
Let the ancestor nodes remaining after the previous step be $x_1, x_2, x_3, \dots, x_k$, with node numbers $x_1 \le x_2 \le \dots \le x_k$. Encode the differences $x_1 - x,\ x_2 - x_1,\ x_3 - x_2,\ \dots,\ x_k - x_{k-1}$ as the difference list.
To achieve this technical purpose, the invention also provides a provenance data compression system, characterized in that the system comprises the following modules:
A data classification module, for dividing the provenance data into identity data and ancestry data. Identity data describe the attributes of the data themselves; ancestry data represent the dependencies between data objects;
An ancestry data compression module, for finding, according to the dependencies in the ancestry data, the similar data and the consecutive data in the provenance data, and then compressing the similar data, the consecutive data, and the remaining data in turn;
An identity data compression module, for compressing the identity data in the provenance data using dictionary encoding: scanning the database or text file to find frequently occurring character strings, replacing them with integer codes, and storing the mapping between integer codes and character strings in the database.
Further, the ancestry data compression module comprises the following subunits:
An ancestry data listing subunit, for organizing the ancestry data into lists. Ancestry data are represented by a series of provenance nodes, each having one or more ancestors. Each node is assigned a node number in the order in which its provenance was generated, and the dependencies between nodes are represented as node lists. Out(x) denotes the ancestor list of node x, i.e. the set of ancestor nodes of data node x;
A reference encoding subunit, for examining the ancestor lists of the W nodes preceding data node x, where W is a window parameter, and finding the ancestor list that shares the most ancestor nodes with Out(x). Call this list Out(y); y is then the reference node. Out(x) is encoded in three parts: a reference number, a bit sequence, and the remaining ancestor nodes. The reference number is the difference between the node numbers of x and y. The bit sequence has one bit per entry of Out(y): a bit is 1 if the corresponding ancestor node in Out(y) also appears in Out(x), and 0 otherwise. The remaining ancestor nodes are the ancestor nodes of Out(x) minus the common ancestor nodes;
A run-length encoding subunit, for finding runs of consecutive node numbers among the remaining ancestor nodes of Out(x) and encoding each run in two parts: its starting node number and its length. The remaining ancestor nodes are then those left from the previous step minus the consecutive nodes;
A delta encoding subunit, for taking the ancestor nodes remaining after the previous step, $x_1, x_2, x_3, \dots, x_k$ with node numbers $x_1 \le x_2 \le \dots \le x_k$, and encoding the differences $x_1 - x,\ x_2 - x_1,\ x_3 - x_2,\ \dots,\ x_k - x_{k-1}$ as the difference list.
In general, compared with the prior art, the above technical solution conceived by the invention has the following features and beneficial effects:
(1) Classical compression algorithms do not preserve the characteristics of provenance data. They compress provenance data as ordinary files, so the characteristics of the data are lost after compression and the data can no longer be used. The invention solves this problem.
(2) Compared with existing compression algorithms that are based on the characteristics of provenance data, the invention makes better use of those characteristics and gives full consideration to compressing the ancestry data within the provenance data, so that compression performance is further improved.
Brief description of the drawings
Fig. 1 is a flowchart of the method for compressing provenance data according to the invention;
Fig. 2 is a schematic diagram of ancestry data organized as lists;
Fig. 3 illustrates an embodiment of ancestry data compression.
Detailed description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features of the embodiments described below may be combined with one another as long as they do not conflict.
Fig. 1 is a flowchart of the method for compressing provenance data according to the invention, which comprises the following steps:
(1) Data classification step:
Provenance data are divided into two parts, ancestry data and identity data. Ancestry data represent the dependencies between data objects; identity data describe the attributes of the data themselves, such as file name, file ID, process name, and process ID;
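The split described in step (1) can be sketched as follows. This is only an illustration under stated assumptions: the document does not specify a record layout, so the dictionary format, the field names `node` and `inputs`, and the function `classify` are all hypothetical.

```python
# Hypothetical record layout: each record is a dict with a node number
# ("node"), an optional list of ancestor node numbers ("inputs"), and
# arbitrary attribute fields. None of these names come from the patent.
def classify(records):
    """Split records into identity data and ancestry data."""
    identity = {}  # node number -> attributes (file name, process ID, ...)
    ancestry = {}  # node number -> sorted list of ancestor node numbers
    for rec in records:
        node = rec["node"]
        identity[node] = {
            key: value for key, value in rec.items()
            if key not in ("node", "inputs")
        }
        ancestry[node] = sorted(rec.get("inputs", []))
    return identity, ancestry


records = [
    {"node": 1, "type": "file", "name": "a.txt"},
    {"node": 2, "type": "process", "name": "sort", "pid": 4021, "inputs": [1]},
]
identity, ancestry = classify(records)
print(ancestry)     # {1: [], 2: [1]}
print(identity[2])  # {'type': 'process', 'name': 'sort', 'pid': 4021}
```

Keeping the two parts in separate structures lets each be handed to a different encoder: the ancestry lists to steps (11) to (14) and the attribute strings to the dictionary encoding of step (3).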
(2) Ancestry data compression step:
The ancestry data compression step is divided into the following sub-steps:
(11) Listing the ancestry data:
Ancestry data are represented by a series of provenance nodes, each having one or more ancestors. Each node is assigned a node number in the order in which its provenance was generated, and the dependencies between nodes are represented as node lists. Out(x) denotes the ancestor list of node x, i.e. the set of ancestor nodes of data node x;
Fig. 2 shows an embodiment in which the dependencies between data nodes are represented by ancestor lists.
The figure shows two provenance nodes, each with multiple ancestor nodes. Because each node is assigned a node number in the order in which its provenance was generated, the dependencies between the data nodes in the figure can be represented as lists of ancestor nodes;
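A minimal sketch of building the ancestor lists Out(x) from dependency edges is given below. The edge-pair input format and the function name `build_ancestor_lists` are assumptions made for illustration; the node numbers are the Out(15) and Out(16) values used in the Fig. 3 example of this description, since the contents of Fig. 2 are not reproduced in the text.

```python
from collections import defaultdict


def build_ancestor_lists(edges):
    """Group (node, ancestor) pairs into sorted ancestor lists.

    The pair format is an assumption; the patent only states that
    dependencies are represented as node lists.
    """
    out = defaultdict(set)
    for node, ancestor in edges:
        out[node].add(ancestor)
    return {node: sorted(ancestors) for node, ancestors in out.items()}


edges = [(15, a) for a in (3, 11, 13, 14, 17)]
edges += [(16, a) for a in (11, 14, 19, 20, 21, 31, 33)]
out = build_ancestor_lists(edges)
print(out[15])  # [3, 11, 13, 14, 17]
print(out[16])  # [11, 14, 19, 20, 21, 31, 33]
```

Sorting each list by node number is what makes the later run-length and delta steps straightforward.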
(12) Reference encoding:
Examine the ancestor lists of the W nodes preceding data node x, where W is a window parameter, and find the ancestor list that shares the most ancestor nodes with Out(x). Call this list Out(y); y is then the reference node. Out(x) is encoded in three parts: a reference number, a bit sequence, and the remaining ancestor nodes. The reference number is the difference between the node numbers of x and y. The bit sequence has one bit per entry of Out(y): a bit is 1 if the corresponding ancestor node in Out(y) also appears in Out(x), and 0 otherwise. The remaining ancestor nodes are the ancestor nodes of Out(x) minus the common ancestor nodes;
The first step of Fig. 3 shows an embodiment of reference encoding:
In the figure, Out(15) is the ancestor list that shares the most ancestors with Out(16), i.e. the most similar list, so Out(15) is the reference list of Out(16). The reference number is the difference between node numbers 16 and 15, which is 1. The common ancestor nodes of Out(15) and Out(16) are 11 and 14. A bit sequence is generated over the reference list Out(15): the bit for an ancestor node in Out(15) is 1 if that node is also an ancestor in Out(16), and 0 otherwise. Out(16) is thus encoded in three parts, as shown in the figure: reference number 1, bit sequence 01010, and remaining ancestor nodes 19, 20, 21, 31, 33;
The reference list Out(15) itself has reference number 0 and an empty bit sequence, so it is encoded in three parts, as shown in the figure: reference number 0, an empty bit sequence, and remaining ancestor nodes 3, 11, 13, 14, 17;
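The Out(15)/Out(16) example above can be reproduced with a short sketch. The function name `reference_encode`, the window value W = 3, and the tie-breaking rule (the nearest preceding node wins when several lists share equally many ancestors) are assumptions; the document does not state them.

```python
def reference_encode(x, out, window):
    """Encode Out(x) against the most similar of the `window` preceding lists.

    Returns (reference_number, bit_sequence, remaining_ancestors).
    A reference number of 0 means that no reference list was used.
    """
    target = set(out[x])
    best_y, best_common = None, 0
    # Scan nodes x-1, x-2, ..., x-window; a strict ">" keeps the nearest
    # candidate on ties (an assumed rule, not stated in the patent).
    for y in range(x - 1, x - window - 1, -1):
        if y in out:
            common = len(target & set(out[y]))
            if common > best_common:
                best_y, best_common = y, common
    if best_y is None:
        return 0, [], sorted(target)
    # One bit per entry of the reference list Out(y).
    bits = [1 if ancestor in target else 0 for ancestor in out[best_y]]
    remaining = sorted(target - set(out[best_y]))
    return x - best_y, bits, remaining


out = {15: [3, 11, 13, 14, 17], 16: [11, 14, 19, 20, 21, 31, 33]}
enc16 = reference_encode(16, out, window=3)
enc15 = reference_encode(15, out, window=3)
print(enc16)  # (1, [0, 1, 0, 1, 0], [19, 20, 21, 31, 33])
print(enc15)  # (0, [], [3, 11, 13, 14, 17])
```

A larger window gives the encoder more candidate lists to copy from, at the cost of more comparisons per node.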
(13) Run-length encoding:
Find runs of consecutive node numbers among the remaining ancestor nodes of Out(x) and encode each run in two parts: its starting node number and its length. The remaining ancestor nodes are then those left from the previous step minus the consecutive nodes;
The second step of Fig. 3 shows an embodiment of run-length encoding:
In the figure, the consecutive node numbers 13, 14 are found among the remaining ancestor nodes of list Out(15). This run is represented by starting node number 13 and run length 2, and the remaining ancestor nodes are 3, 11, 17;
The consecutive node numbers 19, 20, 21 are found among the remaining ancestor nodes of list Out(16). This run is represented by starting node number 19 and run length 3, and the remaining ancestor nodes are 31, 33;
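The run extraction in this step can be sketched as follows. The function name `run_length_encode` and the `min_run` parameter are assumptions; a minimum run length of 2 is inferred from the example, where the two-element run 13, 14 is encoded.

```python
def run_length_encode(nodes, min_run=2):
    """Split a sorted list of node numbers into runs and leftover nodes.

    Returns (runs, remaining), where runs is a list of (start, length)
    pairs. min_run=2 is an assumption inferred from the worked example.
    """
    runs, remaining = [], []
    i = 0
    while i < len(nodes):
        j = i
        # Extend j while the next node number is exactly one larger.
        while j + 1 < len(nodes) and nodes[j + 1] == nodes[j] + 1:
            j += 1
        length = j - i + 1
        if length >= min_run:
            runs.append((nodes[i], length))
        else:
            remaining.extend(nodes[i:j + 1])
        i = j + 1
    return runs, remaining


rle15 = run_length_encode([3, 11, 13, 14, 17])
rle16 = run_length_encode([19, 20, 21, 31, 33])
print(rle15)  # ([(13, 2)], [3, 11, 17])
print(rle16)  # ([(19, 3)], [31, 33])
```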
(14) Delta encoding:
Let the ancestor nodes remaining after the previous step be $x_1, x_2, x_3, \dots, x_k$, with node numbers $x_1 \le x_2 \le \dots \le x_k$. Encode the differences $x_1 - x,\ x_2 - x_1,\ x_3 - x_2,\ \dots,\ x_k - x_{k-1}$ as the difference list;
The third step of Fig. 3 shows an embodiment of delta encoding:
In the figure, after the above steps only ancestor nodes 3, 11, and 17 remain in list Out(15), so the differences 3 - 15, 11 - 3, 17 - 11 are encoded, giving -12, 8, 6;
In the figure, after the above steps only ancestor nodes 31 and 33 remain in list Out(16), so the differences 31 - 16, 33 - 31 are encoded, giving 15, 2;
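The difference lists above can be checked with a small sketch. The names `delta_encode` and `delta_decode` are hypothetical, and the decoder is an added illustration of how the original node numbers could be recovered; the document itself does not describe decoding.

```python
def delta_encode(x, remaining):
    """Encode sorted ancestor numbers as x1 - x, x2 - x1, ..., xk - x(k-1)."""
    deltas = []
    previous = x
    for node in remaining:
        deltas.append(node - previous)
        previous = node
    return deltas


def delta_decode(x, deltas):
    """Inverse of delta_encode (illustrative; not described in the patent)."""
    nodes = []
    current = x
    for delta in deltas:
        current += delta
        nodes.append(current)
    return nodes


d15 = delta_encode(15, [3, 11, 17])
d16 = delta_encode(16, [31, 33])
print(d15)  # [-12, 8, 6]
print(d16)  # [15, 2]
print(delta_decode(15, d15))  # [3, 11, 17]
```

Only the first difference can be negative, because it is taken relative to the node x itself while the later ones are taken between sorted ancestors, so a stored format would need a signed representation for that first value.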
(3) Identity data compression step:
The identity data in the provenance data are compressed using dictionary encoding: the database or text file is scanned to find frequently occurring character strings, these strings are replaced with integer codes, and the mapping between integer codes and character strings is stored in the database.
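A minimal sketch of the dictionary encoding in step (3) follows. Several details are assumptions: the threshold `min_count` for what counts as "frequently occurring", the in-memory dict standing in for the mapping that the patent stores in a database, and the sample strings themselves.

```python
from collections import Counter


def build_dictionary(strings, min_count=2):
    """Assign an integer code to every string seen at least min_count times.

    min_count is an assumed threshold; the patent only says "frequently
    occurring". Codes follow first-appearance order.
    """
    counts = Counter(strings)
    frequent = [s for s, count in counts.items() if count >= min_count]
    return {s: code for code, s in enumerate(frequent)}


def dictionary_encode(strings, mapping):
    """Replace frequent strings with their integer codes; keep the rest."""
    return [mapping.get(s, s) for s in strings]


def dictionary_decode(encoded, mapping):
    """Restore the original strings from integer codes."""
    reverse = {code: s for s, code in mapping.items()}
    return [reverse[item] if isinstance(item, int) else item for item in encoded]


# Hypothetical identity values, e.g. process and file names.
values = ["/usr/bin/gcc", "main.c", "/usr/bin/gcc", "util.c", "/usr/bin/gcc"]
mapping = build_dictionary(values)
encoded = dictionary_encode(values, mapping)
print(mapping)  # {'/usr/bin/gcc': 0}
print(encoded)  # [0, 'main.c', 0, 'util.c', 0]
```

This sketch distinguishes codes from literals by Python type, which works only because every identity value here is a string; a stored format would need an explicit marker instead.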
The foregoing are merely preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (4)

1. A method for compressing provenance data, characterized by comprising the following steps:
(1) Data classification step: divide the provenance data into identity data and ancestry data. Identity data describe the attributes of the data themselves; ancestry data represent the dependencies between data objects;
(2) Ancestry data compression step: according to the dependencies in the ancestry data, find the similar data and the consecutive data in the provenance data, then compress the similar data, the consecutive data, and the remaining data in turn;
(3) Identity data compression step: compress the identity data in the provenance data using dictionary encoding.
2. The method for compressing provenance data according to claim 1, characterized in that the ancestry data compression step comprises the following sub-steps:
(11) Listing the ancestry data:
Ancestry data are represented by a series of provenance nodes, each having one or more ancestors. Each node is assigned a node number in the order in which its provenance was generated, and the dependencies between nodes are represented as node lists. Out(x) denotes the ancestor list of node x, i.e. the set of ancestor nodes of data node x;
(12) Reference encoding:
Examine the ancestor lists of the W nodes preceding data node x, where W is a window parameter, and find the ancestor list that shares the most ancestor nodes with Out(x). Call this list Out(y); y is then the reference node. Out(x) is encoded in three parts: a reference number, a bit sequence, and the remaining ancestor nodes. The reference number is the difference between the node numbers of x and y. The bit sequence has one bit per entry of Out(y): a bit is 1 if the corresponding ancestor node in Out(y) also appears in Out(x), and 0 otherwise. The remaining ancestor nodes are the ancestor nodes of Out(x) minus the common ancestor nodes;
(13) Run-length encoding:
Find runs of consecutive node numbers among the remaining ancestor nodes of Out(x) and encode each run in two parts: its starting node number and its length. The remaining ancestor nodes are then those left from the previous step minus the consecutive nodes;
(14) Delta encoding:
Let the ancestor nodes remaining after the previous step be $x_1, x_2, x_3, \dots, x_k$, with node numbers $x_1 \le x_2 \le \dots \le x_k$. Encode the differences $x_1 - x,\ x_2 - x_1,\ x_3 - x_2,\ \dots,\ x_k - x_{k-1}$ as the difference list.
3. A system for compressing provenance data, characterized by comprising the following modules:
A data classification module, for dividing the provenance data into identity data and ancestry data. Identity data describe the attributes of the data themselves; ancestry data represent the dependencies between data objects;
An ancestry data compression module, for finding, according to the dependencies in the ancestry data, the similar data and the consecutive data in the provenance data, and then compressing the similar data, the consecutive data, and the remaining data in turn;
An identity data compression module, for compressing the identity data in the provenance data using dictionary encoding.
4. The system for compressing provenance data according to claim 3, characterized in that the ancestry data compression module comprises the following subunits:
An ancestry data listing subunit, for organizing the ancestry data into lists. Ancestry data are represented by a series of provenance nodes, each having one or more ancestors. Each node is assigned a node number in the order in which its provenance was generated, and the dependencies between nodes are represented as node lists. Out(x) denotes the ancestor list of node x, i.e. the set of ancestor nodes of data node x;
A reference encoding subunit, for examining the ancestor lists of the W nodes preceding data node x, where W is a window parameter, and finding the ancestor list that shares the most ancestor nodes with Out(x). Call this list Out(y); y is then the reference node. Out(x) is encoded in three parts: a reference number, a bit sequence, and the remaining ancestor nodes. The reference number is the difference between the node numbers of x and y. The bit sequence has one bit per entry of Out(y): a bit is 1 if the corresponding ancestor node in Out(y) also appears in Out(x), and 0 otherwise. The remaining ancestor nodes are the ancestor nodes of Out(x) minus the common ancestor nodes;
A run-length encoding subunit, for finding runs of consecutive node numbers among the remaining ancestor nodes of Out(x) and encoding each run in two parts: its starting node number and its length. The remaining ancestor nodes are then those left from the previous step minus the consecutive nodes;
A delta encoding subunit, for taking the ancestor nodes remaining after the previous step, $x_1, x_2, x_3, \dots, x_k$ with node numbers $x_1 \le x_2 \le \dots \le x_k$, and encoding the differences $x_1 - x,\ x_2 - x_1,\ x_3 - x_2,\ \dots,\ x_k - x_{k-1}$ as the difference list.
CN201610588856.4A 2016-07-25 2016-07-25 Method and system for compressing provenance data Pending CN106294548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610588856.4A CN106294548A (en) Method and system for compressing provenance data

Publications (1)

Publication Number Publication Date
CN106294548A 2017-01-04

Family

ID=57652118

Country Status (1)

Country Link
CN (1) CN106294548A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111917745A (en) * 2020-07-16 2020-11-10 上海星秒光电科技有限公司 Data compression method
WO2021112907A1 (en) * 2019-12-03 2021-06-10 Western Digital Technologies, Inc. Replication barriers for dependent data transfers between data stores
US11409711B2 (en) 2019-12-03 2022-08-09 Western Digital Technologies, Inc. Barriers for dependent operations among sharded data stores
US11567899B2 (en) 2019-12-03 2023-01-31 Western Digital Technologies, Inc. Managing dependent delete operations among data stores

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130308831A1 (en) * 2012-05-18 2013-11-21 Ingrain, Inc. Method And System For Estimating Rock Properties From Rock Samples Using Digital Rock Physics Imaging
CN104809168A (en) * 2015-04-06 2015-07-29 华中科技大学 Partitioning and parallel distribution processing method of super-large scale RDF graph data
CN105721883A (en) * 2014-12-05 2016-06-29 华中科技大学 Video sharing method and system in cloud storage system based on source tracing information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yulai Xie et al.: "A Hybrid Approach for Efficient Provenance Storage", ACM *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021112907A1 (en) * 2019-12-03 2021-06-10 Western Digital Technologies, Inc. Replication barriers for dependent data transfers between data stores
US11409711B2 (en) 2019-12-03 2022-08-09 Western Digital Technologies, Inc. Barriers for dependent operations among sharded data stores
US11567899B2 (en) 2019-12-03 2023-01-31 Western Digital Technologies, Inc. Managing dependent delete operations among data stores
CN111917745A (en) * 2020-07-16 2020-11-10 上海星秒光电科技有限公司 Data compression method
CN111917745B (en) * 2020-07-16 2022-07-22 上海星秒光电科技有限公司 Data compression method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170104