
CN103347047B - Lossless data compression method based on online dictionaries

Info

Publication number: CN103347047B
Application number: CN201310225049.2A
Authority: CN
Inventors: 吴昊; 曾杰杰; 高水娟; 李莉; 宫鼎
Original assignee: Nanjing Communications Institute of Technology
Current assignee: Nanjing Communications Institute of Technology
Priority date: 2013-06-07
Filing date: 2013-06-07
Publication date: 2017-02-08
Anticipated expiration: 2033-06-07
Also published as: CN103347047A

Abstract

The invention relates to a lossless data compression method based on online dictionaries, wherein the online dictionaries are dictionaries stored on a server and comprise a standard dictionary, an extending dictionary and a private dictionary. Through the standard dictionary and the private dictionary, a client can conduct conversion in an original file and an object file, compression and decompression processes can be completed, and the extending dictionary facilitates optimization of the standard dictionary. According to the lossless data compression method, the dictionaries are stored through the remote server, space of the local dictionaries is saved, compression efficiency is improved, time and space complexity are taken into consideration comprehensively through a compression algorithm and a dictionary maintenance algorithm, and the high compression efficiency is achieved through simple operation. Meanwhile, according to difference of file types, different dictionaries can be adopted for the compression algorithm, pertinence of the dictionaries is enhanced, and the compression efficiency is improved.

Description

A lossless data compression method based on online dictionaries

Technical field

The present invention relates to a lossless data compression method, and in particular to a lossless data compression method based on online dictionaries.

Background technology

Compression techniques can be divided into lossy compression and lossless compression. Lossy compression is generally used for multimedia data, while lossless compression is generally used for general-purpose data. Lossless compression can in turn be divided into methods based on statistical models and methods based on dictionary models; the former is represented by Huffman coding and arithmetic coding, the latter by LZ77, LZ78, LZW, and so on. In existing dictionary-based compression algorithms, the dictionary is generated locally from the source file.

Content of the invention

The object of the present invention is to provide a lossless data compression method based on online dictionaries. A remote server stores the dictionaries, which saves the space of local dictionaries and improves compression efficiency. The compression algorithm and the dictionary maintenance algorithm take both time and space complexity into account and achieve high compression efficiency with simple operations. In addition, the compression algorithm can use different dictionaries for different file types, which makes the dictionaries more targeted and improves compression efficiency.

The technical solution adopted to achieve the above object is as follows. The method comprises dictionary maintenance, a compression method, and a decompression method. The compression method comprises:

1) Create a private dictionary on the server; the target file records the dictionary version number and the private dictionary number;

2) Traverse the original file sequentially according to the way string lengths are constituted in the dictionary, replace matching strings with the corresponding "code + repetition count" from the standard dictionary to generate a process file, and update the reference counts in the standard dictionary according to how many times each of its strings was used;

3) Build the private dictionary from the strings in the process file that could not be replaced with codes, generate their codes with the Huffman algorithm according to their reference counts, and save the private dictionary on the server;

4) Replace the strings in the process file that have not yet been replaced with the "code + repetition count" from the private dictionary, generating the target file;

5) When a string is converted into "code + repetition count", omit the repetition count if it is 0; if the repetition count is greater than 0, determine whether the "code + code + ... + code" form or the "code + repetition count" form takes less data, and use the smaller one;

The decompression method comprises:

1) Read the version number and the private dictionary number from the target file, connect to the server, and obtain the standard dictionary and the private dictionary corresponding to that version number and private dictionary number;

2) Using the correspondence between codes and strings in the standard dictionary and the private dictionary, restore each "code + repetition count" in the target file to strings, generating the original file;

The maintenance method for the standard dictionary and the extension dictionary comprises:

1) The entry length of the standard dictionary depends on the type of the original file;

2) After each compression, the repetition counts of all strings in the compression process are put, as entries, into the standard dictionary and the extension dictionary respectively;

3) Periodically merge the entries of the extension dictionary with those of the standard dictionary, select the portion with the highest reference counts, and re-encode it with the Huffman algorithm to generate a new standard dictionary; put the entries not included in the standard dictionary into a new extension dictionary; create a new version number, yielding a new version of the standard dictionary and the extension dictionary;

The standard dictionary, the private dictionary, and the extension dictionary are defined as follows:

Standard dictionary: the dictionary used by the current version number, shared by all original files compressed with this dictionary version. It includes three fields: string, code, and reference count.

Private dictionary: each original file corresponds to one private dictionary, named by the dictionary version number plus the number of the original file. The private dictionary consists of the strings in the original file for which no corresponding code can be found in the standard dictionary, together with the codes assigned to those strings. It includes two fields: string and code.

Extension dictionary: contains the strings in all private dictionaries of this version and the reference count of each string. It includes two fields: string and reference count.

Beneficial effect

Compared with the prior art, the present invention has the following advantages.

Because a remote server stores the dictionaries, the space of local dictionaries is saved and compression efficiency is improved. The compression algorithm and the dictionary maintenance algorithm take both time and space complexity into account and achieve high compression efficiency with simple operations. In addition, the compression algorithm can use different dictionaries for different file types, which makes the dictionaries more targeted and improves compression efficiency.

Specific embodiment

A lossless data compression method based on online dictionaries comprises the online dictionaries, dictionary maintenance, a compression method, and a decompression method.

Standard dictionary: the dictionary used by the current version number, shared by all original files compressed with this dictionary version. It includes three fields: string, code, and reference count.

Private dictionary: each original file corresponds to one private dictionary, named by the dictionary version number plus the number of the original file. The private dictionary consists of the strings in the original file for which no corresponding code can be found in the standard dictionary, together with the codes assigned to those strings. It includes two fields: string and code.

The online dictionaries are stored on the server and include the standard dictionary, the extension dictionary, and the private dictionary. The standard dictionary comprises three fields (string, code, reference count) and is shared by all files compressed with the current version. The extension dictionary comprises two fields (string, reference count); it aggregates the reference counts of the strings in all private dictionaries of the current dictionary version and is used to construct the next dictionary version. The private dictionary comprises two fields (string, code) and is private to each compressed file.

Through the correspondence between strings and codes in the standard dictionary and the private dictionary, the original file can be described with the codes in those two dictionaries.

Entries of the standard dictionary are selected by weight: entries with higher weight are chosen, where weight = string length × reference count.

Regardless of whether the original file is stored bitwise, bytewise, in double bytes, or in some other way, the target file can always be stored bitwise.

Glossary:

Original file: the file before compression.

Target file: the file after compression.

Process file: a file produced during compression.

Server: a computer on the network that stores the dictionaries.

String: a sequence of binary digits; the original file is a combination of different strings. The string length may vary within a range of positive integers or be a positive integer multiple of a fixed positive length, depending on the type of the original file.

Code: a sequence of binary digits that maps one-to-one to a string and is used to represent that string in the target file.

Reference count: the number of times the code corresponding to a string is referenced.

Repetition count: the number of times a code is used consecutively in the target file. The repetition count is a field in the standard dictionary and the extension dictionary.

Weight: the product of string length and reference count in a dictionary.

Dictionary: stores the correspondence among fields such as string, code, reference count, repetition count, and weight. A dictionary's name consists of a library number and a version number. Dictionaries include the standard dictionary, the private dictionary, and the extension dictionary, and are stored on the server.

Standard dictionary: the dictionary used by the current version number, shared by all original files compressed with this dictionary version. It includes three fields: string, code, and reference count.

Private dictionary: each original file corresponds to one private dictionary, named by the dictionary version number + the number of the original file. The private dictionary consists of the strings in the original file for which no corresponding code can be found in the standard dictionary, together with the codes assigned to those strings. It includes two fields: string and code.

Extension dictionary: contains the strings in all private dictionaries of this version and the reference count of each string. It includes two fields: string and reference count.
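
The three dictionary kinds defined above map naturally onto simple record types. The following Python sketch is illustrative only: the class names, the field names (`ref_count`, `private`, and so on), and the `"version+file"` naming format are assumptions made for this example and do not come from the patent text.

```python
from dataclasses import dataclass, field


@dataclass
class StandardEntry:
    """One row of the standard dictionary: string, code, reference count."""
    string: str
    code: str          # bit string such as "010"
    ref_count: int = 0


@dataclass
class OnlineDictionaries:
    """Hypothetical server-side container for one dictionary version."""
    version: int
    # standard dictionary: string -> (string, code, reference count)
    standard: dict[str, StandardEntry] = field(default_factory=dict)
    # extension dictionary: string -> reference count
    extension: dict[str, int] = field(default_factory=dict)
    # private dictionaries, keyed by file number: string -> code
    private: dict[int, dict[str, str]] = field(default_factory=dict)

    def private_name(self, file_number: int) -> str:
        # Named by version number + file number; the "+" format is assumed.
        return f"{self.version}+{file_number}"


def weight(string: str, ref_count: int) -> int:
    """Weight used to rank entries: string length x reference count."""
    return len(string) * ref_count
```

For instance, `weight("ABCD", 6)` evaluates to 24, which is the weight listed for ABCD in the merged table of the example.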

Algorithm

Compression process：

1. Create a private dictionary on the server; the target file records the dictionary version number and the private dictionary number;

2. Traverse the original file sequentially according to the way string lengths are constituted in the dictionary, replace matching strings with the corresponding "code + repetition count" from the standard dictionary to generate a process file, and update the reference counts in the standard dictionary according to how many times each of its strings was used;

3. Build the private dictionary from the strings in the process file that could not be replaced with codes, generate their codes with the Huffman algorithm according to their reference counts, and save the private dictionary on the server;

4. Replace the strings in the process file that have not yet been replaced with the "code + repetition count" from the private dictionary, generating the target file.

5. When a string is converted into "code + repetition count", omit the repetition count if it is 0; if the repetition count is greater than 0, determine whether the "code + code + ... + code" form or the "code + repetition count" form takes less data, and use the smaller one.
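
The choice described in step 5 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `encode_run` and the `repeat_codes` table are hypothetical, and the repetition count is taken to mean the number of extra occurrences after the first, as the example's repetition codes suggest. Runs longer than the largest available repetition code simply fall back to the plain form here, which is a simplification.

```python
def encode_run(code: str, count: int, repeat_codes: dict[int, str]) -> str:
    """Encode `count` consecutive occurrences of one string.

    `repeat_codes` maps a repetition count (extra occurrences after the
    first) to its bit code. Returns whichever form uses fewer bits.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    repeats = count - 1
    plain = code * count                       # "code + code + ... + code"
    if repeats == 0 or repeats not in repeat_codes:
        return plain                           # repetition count omitted
    compact = code + repeat_codes[repeats]     # "code + repetition count"
    return compact if len(compact) < len(plain) else plain


# Repetition codes as listed in the example's initial standard dictionary.
REPEAT_CODES = {1: "0000", 2: "0001", 3: "0010", 4: "0011"}

two_in_a_row = encode_run("010", 2, REPEAT_CODES)    # 6 bits vs 7 bits
three_in_a_row = encode_run("010", 3, REPEAT_CODES)  # 9 bits vs 7 bits
```

With the 3-bit code 010 for ABCD, two consecutive occurrences stay as `010010`, while three become `0100001`; both forms appear in process file 1 of the example.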

Decompression procedure：

1. Read the version number and the private dictionary number from the target file, connect to the server, and obtain the standard dictionary and the private dictionary corresponding to that version number and private dictionary number;

2. Using the correspondence between codes and strings in the standard dictionary and the private dictionary, restore each "code + repetition count" in the target file to strings, generating the original file.
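
Decompression step 2 can be illustrated with a small bitwise decoder. The sketch below assumes the dictionaries have already been fetched from the server and that the combined code table is prefix-free; the function name `decode` and the table layout are hypothetical. The code tables, target bit string, and original file are copied from the worked example in this document.

```python
def decode(bits: str, codes: dict[str, str], repeats: dict[str, int]) -> str:
    """Decode a bit string using prefix-free code tables.

    `codes` maps bit code -> string (standard and private entries merged);
    `repeats` maps bit code -> number of extra repetitions of the string
    decoded immediately before it.
    """
    out: list[str] = []
    buf = ""
    for bit in bits:
        buf += bit
        if buf in repeats:
            if not out:
                raise ValueError("repetition code with nothing to repeat")
            out.extend([out[-1]] * repeats[buf])
            buf = ""
        elif buf in codes:
            out.append(codes[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


# Tables taken from the worked example.
REPEATS = {"0000": 1, "0001": 2, "0010": 3, "0011": 4}
STANDARD = {"010": "ABCD", "011": "ACBD", "1000": "BABC",
            "1001": "CBDB", "1010": "CDBA", "1011": "DBAC"}
PRIVATE = {"1111": "D", "1100": "BACD", "11101": "DC",
           "11100": "A", "11011": "B", "11010": "C"}

TARGET = ("010111010100101110111100101111111101111010111001111111101000011100110"
          "01111110010111000")
ORIGINAL = "ABCDDCABCDABCDDCADBACDBCADDABCDABCDABCDBACDBACDDBACDDBACBABC"

restored = decode(TARGET, {**STANDARD, **PRIVATE}, REPEATS)
```

Decoding the 86-bit target file of the example with these tables reproduces the 60-character original file, which is a useful sanity check on the example's tables.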

Dictionary maintenance:

1. The entry length of the standard dictionary depends on the type of the original file;

2. After each compression, the repetition counts of all strings in the compression process are put, as entries, into the standard dictionary and the extension dictionary respectively;

3. Periodically merge the entries of the extension dictionary with those of the standard dictionary, select the portion with the highest weight, and re-encode it with the Huffman algorithm to generate a new standard dictionary; put the entries not included in the standard dictionary into a new extension dictionary; create a new version number, yielding a new dictionary version.
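
Both compression step 3 and maintenance step 3 rely on Huffman coding driven by reference counts. The following Python sketch shows the textbook Huffman construction using `heapq`; the function name `huffman_codes` and its `prefix` parameter are assumptions for illustration, with `prefix` modelling a reserved code range such as the 11X range of the example. Because ties can be broken in several ways, the exact bit patterns may differ from the tables in this document even when the code lengths agree.

```python
import heapq
from itertools import count


def huffman_codes(freqs: dict[str, int], prefix: str = "") -> dict[str, str]:
    """Return a prefix-free bit code for each symbol, built bottom-up by
    repeatedly merging the two least frequent nodes."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {next(iter(freqs)): prefix + "0"}
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes: dict[str, str] = {}

    def walk(node, path: str) -> None:
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                            # leaf: the symbol itself
            codes[node] = path

    walk(heap[0][2], prefix)
    return codes


# Reference counts of the private-dictionary strings in the example.
counts = {"D": 4, "BACD": 3, "DC": 2, "A": 2, "B": 1, "C": 1}
codes = huffman_codes(counts, prefix="11")
total_bits = sum(len(codes[s]) * n for s, n in counts.items())
```

For these counts the private strings cost 58 bits in total, the same total as the private-dictionary codes listed in the example.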

Other notes:

1. A separate dictionary can be built for each file type of the original file;

2. The length of the strings in a dictionary may vary freely within a length range or be an integer multiple of a fixed length.

Embodiment

Initial standard dictionary:

String	Code	Reference count
Repeat 1 time	0000
Repeat 2 times	0001
Repeat 3 times	0010
Repeat 4 times	0011
ABCD	010	6
ACBD	011	6
BABC	1000	4
CBDB	1001	4
CDBA	1010	4
DBAC	1011	4

Note: it is assumed here that the maximum repetition count in the initial standard dictionary is 4. Codes of the form 11X are reserved and left for allocation to the private dictionary.

Initial extension dictionary:

String	Reference count	Weight
DBCA	3	12
BACD	3	12
CADD	3	12
ACDB	3	12
CDDB	3	12
ACCD	3	12
DCA	3	9
DBC	3	9
DC	4	8
CA	3	6
DB	2	4
BC	2	4
D	4	4
A	3	3
B	2	2
C	2	2

Original file:

ABCDDCABCDABCDDCADBACDBCADDABCDABCDABCDBACDBACDDBACDDBACBABC

Compression process example：

1. Identify the strings in the original file that match entries of the standard dictionary (marked in bold italics in the original publication):

ABCDDCABCDABCDDCADBACDBCADDABCDABCDABCDBACDBACDDBACDDBACBABC

2. Replace the matching strings in the original file with the codes from the standard dictionary, generating process file 1:

010DC010010DCA1011DBCADD0100001BACDBACDDBACD10111000

3. Count how many times each standard-dictionary code was used in step 2 and add the counts to the standard dictionary, which is updated to:

String	Code	Reference count
Repeat 1 time	0000
Repeat 2 times	0001
Repeat 3 times	0010
Repeat 4 times	0011
ABCD	010	6
ACBD	011	0
BABC	1000	1
CBDB	1001	0
CDBA	1010	0
DBAC	1011	2

4. Build the private dictionary from the strings in the process file that could not be replaced with codes (assuming the string length in the dictionary ranges from 1 to 4 characters):

(1) Sequentially scan the characters in process file 1 that have not been replaced and obtain the reference counts of the strings of length 4, as shown in the following table:

String	Code	Reference count
DBCA		1
BCAD		1
CADD		1
BACD		3
ACDB		1
CDBA		1
DBAC		2
ACDD		1
CDDB		1
DDBA		1

(2) Sort the strings whose reference count > K (this value is adjustable; here K = 1) in descending order and assign temporary codes:

String	Code	Reference count
BACD	E	3
DBAC	F	2

(3) Replace the corresponding strings in process file 1 with the temporary codes from the table above. A string whose reference count drops to <= K because of replacements made for earlier entries is ignored and not replaced (here the string DBAC is ignored). This yields process file 2:

010DC010010DCA1011DBCADD0100001EEDE10111000

(4) Sequentially scan the characters in process file 2 that have not been replaced and obtain the reference counts of the strings of length 3, as shown in the following table:

String	Code	Reference count
DCA		1
DBC		1
BCA		1
CAD		1
ADD		1

(5) No string with reference count > K is found;

(6) Sequentially scan the characters in process file 2 that have not been replaced and obtain the reference counts of the strings of length 2, as shown in the following table:

String	Code	Reference count
DC		2
CA		2
DB		1
BC		1
AD		1
DD		1

(7) Sort the strings whose reference count > K in descending order and assign temporary codes:

String	Code	Reference count
DC	F	2
CA	G	2

(8) Replace the corresponding strings in process file 2 with the temporary codes from the table above. A string whose reference count drops to <= K because of replacements made for earlier entries is ignored and not replaced (here the string CA is ignored). This yields process file 3:

010F010010FA1011DBCADD0100001EEDE10111000

(9) Sequentially scan the characters in process file 3 that have not been replaced, obtain the strings of length 1, sort them by reference count in descending order, and assign temporary codes, as shown in the following table:

String	Code	Reference count
D	G	4
A	H	2
B	I	1
C	J	1

(10) Re-encode all temporary codes with the Huffman algorithm according to their reference counts, obtaining process file 4:

String	Code	Reference count
D	1111	4
BACD	1100	3
DC	11101	2
A	11100	2
B	11011	1
C	11010	1

(11) Extract the string and code fields from process file 4 to generate the private dictionary:

String	Code
D	1111
BACD	1100
DC	11101
A	11100
B	11011
C	11010

(12) Replace the temporary codes with the codes in the private dictionary to obtain the target file:

01011101010010111011110010111111110111101011100111111110100001110011001111110010111000

After compression, the original file of 60 bytes becomes a target file of 86 bits, about 11 bytes. In actual use, the fixed-length dictionary version number and private dictionary number must also be written into the header of the target file.

(13) Merge the strings in the private dictionary and their reference counts into the extension dictionary and sort by weight in descending order, where weight = string length × reference count, obtaining the new extension dictionary:

String	Reference count	Weight
BACD	6	24
DBCA	3	12
CADD	3	12
ACDB	3	12
CDDB	3	12
ACCD	3	12
DC	6	12
DCA	3	9
DBC	3	9
D	8	8
CA	3	6
A	5	5
DB	2	4
BC	2	4
B	3	3
C	3	3
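
Looking back at step 4 of the compression example, the candidate tables in sub-steps (1), (4), and (6) are consistent with counting every overlapping window of a given length inside each run of unreplaced characters. The Python sketch below reproduces those counts; the function names `count_ngrams` and `candidates` are hypothetical, and splitting a process file into its unreplaced runs is assumed to have been done by the caller.

```python
from collections import Counter


def count_ngrams(segments: list[str], n: int) -> Counter:
    """Count every length-n window inside each unreplaced run."""
    counts: Counter = Counter()
    for seg in segments:
        for i in range(len(seg) - n + 1):
            counts[seg[i:i + n]] += 1
    return counts


def candidates(counts: Counter, k: int) -> list[tuple[str, int]]:
    """Strings whose reference count exceeds k, highest count first."""
    return [(s, c) for s, c in counts.most_common() if c > k]


# Unreplaced runs of process file 1 and process file 2 in the example.
runs_1 = ["DC", "DCA", "DBCADD", "BACDBACDDBACD"]
runs_2 = ["DC", "DCA", "DBCADD", "D"]

four = count_ngrams(runs_1, 4)    # sub-step (1)
three = count_ngrams(runs_2, 3)   # sub-step (4)
two = count_ngrams(runs_2, 2)     # sub-step (6)
```

Because the windows overlap, DBAC is counted twice in process file 1 even though both occurrences share characters with BACD. Once BACD has been replaced, no DBAC remains, which is why sub-step (3) ignores it.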

Dictionary maintenance:

Periodically merge the entries of the extension dictionary with those of the standard dictionary, using the product of reference count and string length as the weight. Select the portion with the highest weight, re-encode it with the Huffman algorithm to generate a new standard dictionary, put the entries not included in the standard dictionary into a new extension dictionary, and create a new version number, yielding a new dictionary version.

Continuing the example above:

1. Merge the standard dictionary and the extension dictionary into a process dictionary, sorted by weight in descending order:

String	Reference count	Weight
ABCD	6	24
ACBD	6	24
BACD	6	24
BABC	4	16
CBDB	4	16
CDBA	4	16
DBAC	4	16
DBCA	3	12
CADD	3	12
ACDB	3	12
CDDB	3	12
ACCD	3	12
DC	6	12
DCA	3	9
DBC	3	9
D	8	8
CA	3	6
A	5	5
DB	2	4
BC	2	4
B	3	3
C	3	3

2. Select the top N entries by weight from the table (N is adjustable; here N = 7). While keeping the repetition codes and the codes reserved for the private dictionary, encode the N selected strings with the Huffman algorithm to obtain the new standard dictionary:

String	Code	Reference count
Repeat 1 time	0000
Repeat 2 times	0001
Repeat 3 times	0010
Repeat 4 times	0011
ABCD	010	6
ACBD	0110	6
BACD	0111	6
BABC	1000	4
CBDB	1001	4
CDBA	1010	4
DBAC	1011	4

Codes of the form 11X are reserved and left for allocation to the private dictionary.

3. The remaining entries of the process dictionary become the new extension dictionary:

String	Reference count	Weight
DBCA	3	12
CADD	3	12
ACDB	3	12
CDDB	3	12
ACCD	3	12
DC	6	12
DCA	3	9
DBC	3	9
D	8	8
CA	3	6
A	5	5
DB	2	4
BC	2	4
B	3	3
C	3	3
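
The dictionary bookkeeping in this example (folding the private-dictionary counts into the extension dictionary, then merging with the standard dictionary and keeping the top N entries by weight) can be sketched as follows. The function names are hypothetical, the input counts are copied from the tables above, and ties are assumed to keep their existing order, which Python's stable sort provides. Assigning the new codes is left out of this sketch.

```python
def add_counts(base: dict[str, int], extra: dict[str, int]) -> dict[str, int]:
    """Return a copy of `base` with the reference counts in `extra` added."""
    merged = dict(base)
    for s, n in extra.items():
        merged[s] = merged.get(s, 0) + n
    return merged


def by_weight(entries: dict[str, int]) -> list[tuple[str, int, int]]:
    """Rows of (string, reference count, weight), heaviest first.

    weight = string length x reference count; equal weights keep their
    original relative order because sorted() is stable.
    """
    rows = [(s, n, len(s) * n) for s, n in entries.items()]
    return sorted(rows, key=lambda r: r[2], reverse=True)


def rebuild(standard: dict[str, int], extension: dict[str, int], n: int):
    """Merge both dictionaries; the top n strings form the new standard
    dictionary and the remaining rows form the new extension dictionary."""
    merged = by_weight(add_counts(standard, extension))
    return [r[0] for r in merged[:n]], merged[n:]


initial_extension = {"DBCA": 3, "BACD": 3, "CADD": 3, "ACDB": 3, "CDDB": 3,
                     "ACCD": 3, "DCA": 3, "DBC": 3, "DC": 4, "CA": 3,
                     "DB": 2, "BC": 2, "D": 4, "A": 3, "B": 2, "C": 2}
private_counts = {"D": 4, "BACD": 3, "DC": 2, "A": 2, "B": 1, "C": 1}
standard = {"ABCD": 6, "ACBD": 6, "BABC": 4, "CBDB": 4, "CDBA": 4, "DBAC": 4}

extension = add_counts(initial_extension, private_counts)
new_standard, new_extension = rebuild(standard, extension, n=7)
```

With N = 7 the selected strings are ABCD, ACBD, BACD, BABC, CBDB, CDBA, and DBAC, and the remaining 15 rows match the new extension dictionary listed above.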

Claims

1. A lossless data compression method based on online dictionaries, comprising online dictionaries, characterized in that the method comprises dictionary maintenance, a compression method, and a decompression method, the compression method comprising:

2) Traverse the original file sequentially according to the way string lengths are constituted in the dictionary, replace matching strings with the corresponding "code + repetition count" from the standard dictionary to generate a process file, and update the reference counts in the standard dictionary according to how many times each of its strings was used;

3) Build the private dictionary from the strings in the process file that could not be replaced with codes, generate their codes with the Huffman algorithm according to their reference counts, and save the private dictionary on the server;

4) Replace the strings in the process file that have not yet been replaced with the "code + repetition count" from the private dictionary, generating the target file;

5) When a string is converted into "code + repetition count", omit the repetition count if it is 0; if the repetition count is greater than 0, determine whether the "code + code + ... + code" form or the "code + repetition count" form takes less data, and use the smaller one;

The decompression method comprises:

1) Read the version number and the private dictionary number from the target file, connect to the server, and obtain the standard dictionary and the private dictionary corresponding to that version number and private dictionary number;

2) Using the correspondence between codes and strings in the standard dictionary and the private dictionary, restore each "code + repetition count" in the target file to strings, generating the original file;

2) After each compression, the repetition counts of all strings in the compression process are put, as entries, into the standard dictionary and the extension dictionary respectively;

3) Periodically merge the entries of the extension dictionary with those of the standard dictionary, select the portion with the highest reference counts, and re-encode it with the Huffman algorithm to generate a new standard dictionary; put the entries not included in the standard dictionary into a new extension dictionary; create a new version number, yielding a new version of the standard dictionary and the extension dictionary;

Standard dictionary: the dictionary used by the current version number, shared by all original files compressed with this dictionary version; it includes three fields: string, code, and reference count;

Private dictionary: each original file corresponds to one private dictionary, named by the dictionary version number plus the number of the original file; the private dictionary consists of the strings in the original file for which no corresponding code can be found in the standard dictionary, together with the codes assigned to those strings; it includes two fields: string and code;

Extension dictionary: contains the strings in all private dictionaries of this version and the reference count of each string; it includes two fields: string and reference count.

2. The lossless data compression method based on online dictionaries according to claim 1, characterized in that the online dictionaries are stored on the server and include the standard dictionary, the extension dictionary, and the private dictionary; the standard dictionary comprises three fields (string, code, reference count) and is shared by all files compressed with the current version; the extension dictionary comprises two fields (string, reference count), aggregates the reference counts of the strings in all private dictionaries of the current dictionary version, and is used to construct the next dictionary version; the private dictionary comprises two fields (string, code) and is private to each compressed file.