CN101783788B - File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device - Google Patents

Publication number
CN101783788B
Authority
CN
China
Prior art keywords
word string
file
text
encoded
standard word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200910076795.3A
Other languages
Chinese (zh)
Other versions
CN101783788A (en)
Inventor
范昂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN200910076795.3A
Publication of CN101783788A
Application granted
Publication of CN101783788B
Legal status: Active
Anticipated expiration

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiment of the invention provides a file compression method, a file compression device, a file decompression method, a file decompression device, a compressed file searching method and a compressed file searching device. The file compression device comprises a first storage module, a first acquiring module, a first word segmentation module, and a first coding module. The first storage module stores a coding table which records the correspondence between standard character strings and coding identifiers, each standard character string having a unique coding identifier. The first acquiring module acquires part or all of the text in a file to be compressed to form a text to be coded. The first word segmentation module segments the text to be coded according to the standard character strings, decomposing it into at least one character string to be coded. The first coding module obtains a first coding sequence corresponding to the text to be coded by replacing each character string to be coded with the coding identifier of the corresponding standard character string, according to the correspondence recorded in the coding table. The invention improves the compression ratio of the text compression algorithm and the convenience of searching.

Description

File compression and decompression methods and devices, and compressed file searching method and device
Technical field
The present invention relates to the technical field of file compression, and in particular to file compression and decompression methods and devices and to a compressed file searching method and device.
Background art
As computer technology advances, data files of all types keep growing. They occupy more and more storage space and require more and more bandwidth to transmit, so compressing data files is increasingly important in computing.
Data file compression is currently divided into lossy compression and lossless compression. Commonly used tools such as WinRAR and WinZip perform lossless compression. Their basic principle is the same: repeated data in a file is represented in a more concise way, that is, data redundancy is removed.
Existing text compression algorithms include a class of statistical compression algorithms, such as the Huffman algorithm, which is described below.
The Huffman algorithm is a statistics-based compression method. In essence it re-encodes the characters of a text so that characters used more frequently receive shorter codes.
The encoded text consists of two parts: the Huffman code table and the compressed content. During decompression, the Huffman code table is read first, and then the compressed content is decoded character by character to reconstruct the source file.
The key to using the Huffman algorithm is therefore forming the Huffman code table, which relies on a data structure called the Huffman tree. Once a Huffman tree has been generated, the code table follows directly.
The following example assumes that the original text is "abcbbcccc".
The Huffman tree is generated by the following steps:
Step A1: scan the source file and count the frequency of each character.
For the sample, the counts are: a occurs 1 time, b occurs 3 times, and c occurs 5 times. These are recorded as the queue shown in Fig. 1: a:1 b:3 c:5.
Step A2: take the two nodes with the lowest frequencies out of the queue and merge them into a branch node X whose frequency is the sum of the two nodes' frequencies. Add X to the original queue and keep the queue sorted by frequency in ascending order.
For the sample, this yields the queue shown in Fig. 2.
Step A3: repeat step A2 until only one node remains in the queue.
Step A4: the above steps produce the Huffman tree shown in Fig. 3. Each leaf node is a character, and the path from the root node to a leaf node is the Huffman code of that character. Moving from a node to its left child contributes a 0 to the path, and moving to its right child contributes a 1.
As Fig. 3 shows, character a is encoded as 00, character b as 01, and character c as 1. Once the Huffman code table has been generated, the original text "abcbbcccc" becomes the bit string 0001101011111. Assuming each character occupies 2 bytes, the size drops from the original 18 bytes (9*2, 144 bits in total) to 13 bits, which fit in 2 bytes. The text has thus been compressed.
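Steps A1 to A4 can be sketched in Python as follows. This is a minimal illustration and not part of the patent: the function name `build_huffman_codes` is hypothetical, and breaking ties between equal frequencies by insertion order is an assumption, chosen because it reproduces the code table stated above for the sample text.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_codes(text):
    """Return a dict mapping each character of text to its Huffman code."""
    freq = Counter(text)                       # Step A1: count character frequencies
    tie = count()                              # tie-breaker so nodes are never compared
    heap = [(f, next(tie), ch) for ch, f in sorted(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                         # degenerate case: one distinct character
        return {heap[0][2]: "0"}
    while len(heap) > 1:                       # Steps A2 and A3: merge the two rarest nodes
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, prefix):                    # Step A4: left edge is 0, right edge is 1
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


codes = build_huffman_codes("abcbbcccc")
encoded = "".join(codes[ch] for ch in "abcbbcccc")
print(codes)
print(encoded, len(encoded))
```

For the sample text this gives a = 00, b = 01, c = 1 and the 13-bit string 0001101011111. Note that the code table itself still has to be stored next to the bit string, which is the overhead the invention aims to remove.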
Decompression proceeds as follows: first a Huffman tree is rebuilt from the Huffman code table, and then the compressed content is decoded according to that tree.
For example, suppose the compressed content is the bit string 0001101011111 and the tree is the one shown in Fig. 3. Starting from the root node, the first bit is 0, so the decoder moves to the left subtree; the second bit is also 0, so it moves to the left subtree again and reaches leaf node a. The first decoded character is therefore a. Every character is decoded starting from the root node: the decoder follows the bit stream, turning left or right, until it reaches a leaf node, which is the decoded character. This process is repeated until all characters have been decoded.
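The decoding walk can be sketched as follows. As a simplification, this sketch looks up accumulated bits in an inverted code table instead of walking an explicit tree; the two are equivalent because Huffman codes are prefix-free. The function name `huffman_decode` is hypothetical.

```python
def huffman_decode(bits, codes):
    """Decode a string of '0'/'1' characters using a prefix-free code table."""
    by_code = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit                 # descend one edge of the (implicit) tree
        if current in by_code:         # reached a leaf: emit it, restart at the root
            out.append(by_code[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code")
    return "".join(out)


codes = {"a": "00", "b": "01", "c": "1"}
decoded = huffman_decode("0001101011111", codes)
print(decoded)
```

Applied to the bit string from the example, this returns the original text "abcbbcccc".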
However, in the course of making the present invention, the inventor found that the prior art has at least the following shortcoming:
In the prior art, every compressed text document must contain two parts: the code table used for encoding and the coded sequence obtained by compressing the text. Because both parts are stored in the same compressed document, the compression ratio is not ideal. A new compression scheme is therefore needed to further improve the compression ratio of text compression algorithms.
Summary of the invention
The object of the embodiments of the present invention is to provide file compression and decompression methods and devices and a compressed file searching method and device, so as to improve the compression ratio of text compression algorithms.
To achieve the above object, an embodiment of the present invention provides a file compression device, comprising:
a first storage module, configured to store a coding table, wherein the coding table records the correspondence between standard word strings and coding identifiers, and each standard word string has a unique coding identifier;
a first acquisition module, configured to acquire part or all of the text in a file to be compressed to form a text to be encoded;
a first word segmentation module, configured to segment the text to be encoded according to the standard word strings, decomposing the text to be encoded into at least one word string to be encoded;
a first coding module, configured to replace each of the at least one word string to be encoded with the coding identifier of the corresponding standard word string, according to the correspondence between the standard word strings and the coding identifiers recorded in the coding table, to obtain a first coded sequence corresponding to the text to be encoded.
In the above file compression device, the coding identifier is represented by a number, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its coding identifier.
The above file compression device further comprises:
a statistics module, configured to perform word frequency statistics on the texts forming the corpus to obtain the frequency with which each standard word string occurs in those texts.
In the above file compression device, the coding table provides a search field for each standard word string. The search field records file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string to which that search field corresponds. The file compression device further comprises:
a modification module, configured to add the file identifier of the file to be compressed to the search field corresponding to each of the at least one word string to be encoded.
To achieve the above object, an embodiment of the present invention also provides a file compression method, comprising:
acquiring part or all of the text in a file to be compressed to form a text to be encoded;
segmenting the text to be encoded according to standard word strings, decomposing the text to be encoded into at least one word string to be encoded;
replacing each of the at least one word string to be encoded with the coding identifier of the corresponding standard word string, according to the correspondence between the standard word strings and coding identifiers recorded in a coding table stored in advance, to obtain a first coded sequence corresponding to the text to be encoded, wherein each standard word string has a unique coding identifier.
In the above method, the coding identifier is represented by a number, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its coding identifier.
In the above method, the coding table provides a search field for each standard word string. The search field records file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string to which that search field corresponds. The method further comprises:
adding the file identifier of the file to be compressed to the search field corresponding to each of the at least one word string to be encoded.
To achieve the above object, an embodiment of the present invention also provides a file decompression device, comprising:
a third acquisition module, configured to acquire a first sequence to be decoded;
a first decoding module, configured to replace each coding identifier in the first sequence to be decoded with the corresponding standard word string, according to the correspondence between standard word strings and coding identifiers recorded in a coding table stored in advance, to obtain the text corresponding to the first sequence to be decoded, wherein each standard word string has a unique coding identifier.
In the above device, the coding identifier is represented by a number, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its coding identifier.
The above device further comprises:
a statistics module, configured to perform word frequency statistics on the texts forming the corpus to obtain the frequency with which each standard word string occurs in those texts.
The above device further comprises:
a second decoding module, configured to decompress a second sequence to be decoded using a preset numerical decompression algorithm to obtain the first sequence to be decoded.
To achieve the above object, an embodiment of the present invention also provides a file decompression method, comprising:
acquiring a first sequence to be decoded;
replacing each coding identifier in the first sequence to be decoded with the corresponding standard word string, according to the correspondence between standard word strings and coding identifiers recorded in a coding table stored in advance, to obtain the text corresponding to the first sequence to be decoded, wherein each standard word string has a unique coding identifier.
In the above method, the coding identifier is represented by a number, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its coding identifier.
The above method further comprises:
decompressing a second sequence to be decoded using a preset numerical decompression algorithm to obtain the first sequence to be decoded.
To achieve the above object, an embodiment of the present invention also provides a compressed file searching device, comprising:
a first storage module, configured to store a coding table in advance, wherein the coding table records the correspondence between standard word strings and coding identifiers represented by numbers, each standard word string has a unique coding identifier, the coding table provides a search field for each standard word string, the search field records file identifiers, and each file indicated by a file identifier contains the standard word string to which the search field corresponds;
a second acquisition module, configured to acquire a search string entered by a user;
a second word segmentation module, configured to segment the search string according to the standard word strings to obtain at least one word string to be searched;
a file identifier extraction module, configured to obtain from the coding table the set of file identifiers corresponding to each of the at least one word string to be searched;
a search result output module, configured to output the intersection of the file identifier sets as the search result.
To achieve the above object, an embodiment of the present invention also provides a compressed file searching method, comprising:
acquiring a search string entered by a user;
segmenting the search string according to standard word strings to obtain at least one word string to be searched;
obtaining, from a coding table stored in advance, the set of file identifiers corresponding to each of the at least one word string to be searched, wherein the coding table records the correspondence between the standard word strings and coding identifiers represented by numbers, each standard word string has a unique coding identifier, the coding table provides a search field for each standard word string, the search field records file identifiers, and each file indicated by a file identifier contains the standard word string to which the search field corresponds;
outputting the intersection of the file identifier sets as the search result.
To achieve the above object, an embodiment of the present invention also provides a file compression and transmission method, comprising:
acquiring part or all of the text in a file to be compressed to form a text to be encoded;
segmenting the text to be encoded according to standard word strings, decomposing the text to be encoded into at least one word string to be encoded;
replacing each of the at least one word string to be encoded with the coding identifier of the corresponding standard word string, according to the correspondence between the standard word strings and coding identifiers recorded in a coding table stored in advance, to obtain a first coded sequence corresponding to the text to be encoded, wherein each standard word string has a unique coding identifier;
sending the first coded sequence to a network storage server.
In the above method, when only part of the text in the file to be compressed is acquired, the method further comprises:
repeating the steps from acquiring text to sending the coded sequence until all the text in the file to be compressed has been compressed and transmitted.
To achieve the above object, an embodiment of the present invention also provides a file compression and transmission device, comprising:
a first storage module, configured to store a coding table, wherein the coding table records the correspondence between standard word strings and coding identifiers, and each standard word string has a unique coding identifier;
a first acquisition module, configured to acquire part or all of the text in a file to be compressed to form a text to be encoded;
a first word segmentation module, configured to segment the text to be encoded according to the standard word strings, decomposing the text to be encoded into at least one word string to be encoded;
a first coding module, configured to replace each of the at least one word string to be encoded with the coding identifier of the corresponding standard word string, according to the correspondence between the standard word strings and the coding identifiers recorded in the coding table, to obtain a first coded sequence corresponding to the text to be encoded;
a transmission module, configured to send the first coded sequence to a network storage server.
The embodiments of the present invention have the following beneficial effects:
First, a single coding table that applies to all text compression is stored in advance, so individual compressed files do not contain a code table. This greatly reduces the amount of data in the compressed text and improves the compression ratio.
Second, the coding table in the embodiments of the present invention is global: it holds the coding identifiers of word strings obtained from a large corpus, and can therefore provide a higher compression ratio.
Third, compared with prior-art schemes that transmit a file to a network storage server after compression, the same coding table is stored on the network storage server in advance, so the compressed coded sequence does not include the coding table, which reduces the network load. Moreover, the coding table applies to all compressed texts, so storage space is saved when many texts are stored on the network.
Finally, because a coding table obtained in advance is used, the sender can divide the text to be compressed into several parts, process them separately, and transmit each part as soon as it has been processed, which reduces the demand for temporary storage.
Brief description of the drawings
Fig. 1 to Fig. 3 are schematic diagrams of the text compression process of the Huffman algorithm;
Fig. 4 is a structural diagram of the file compression device of an embodiment of the present invention;
Fig. 5 is a flow chart of the file compression method of an embodiment of the present invention;
Fig. 6 is a flow chart of the compressed file searching method of an embodiment of the present invention.
Detailed description of the embodiments
In the methods and devices of the embodiments of the present invention, a database is stored in advance. The database records numeric codes for the characters or words used to form text. Text is compressed by encoding it with this database, which improves the compression ratio. In addition, a search field is added to the coding table so that searches can be performed using the coding table, which saves the resources consumed by searching.
As shown in Fig. 4, the file compression device of the embodiment of the present invention comprises:
a first storage module, configured to store a coding table, wherein the coding table records the coding identifier corresponding to each standard word string, the coding identifier is represented by a number, each standard word string has a unique coding identifier (that is, the coding identifiers of different standard word strings differ, and standard word strings and coding identifiers are in one-to-one correspondence), and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its coding identifier;
a first acquisition module, configured to acquire part or all of the text in a file to be compressed to form a text to be encoded;
a first word segmentation module, configured to segment the text to be encoded according to the standard word strings in the coding table, decomposing the text to be encoded into at least one word string to be encoded;
a first coding module, configured to replace each of the at least one word string to be encoded with the coding identifier of the corresponding standard word string, to obtain a first coded sequence corresponding to the text to be encoded.
As described above, the coding identifier of a standard word string depends on how frequently it occurs. The file compression device of the embodiment of the present invention therefore further comprises:
a statistics module, configured to perform word frequency statistics on the texts forming the corpus to obtain the frequency with which each standard word string occurs in those texts.
Existing word segmentation algorithms fall into three major classes: segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics. The specific embodiments of the present invention do not restrict which one is used.
The word strings and coding identifiers recorded in the above coding table satisfy the following conditions:
1. each coding identifier is unique;
2. standard word strings and coding identifiers are in one-to-one correspondence;
3. the more times a standard word string occurs in the texts forming the corpus, the smaller the number representing its coding identifier.
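A coding table satisfying these three conditions could be derived from word frequency statistics roughly as follows. This is a hedged sketch: the function name `build_coding_table` is hypothetical, the corpus is assumed to be already segmented into word strings, identifiers are assumed to start at 1, and ties between equally frequent word strings are broken alphabetically, none of which is specified by the source.

```python
from collections import Counter


def build_coding_table(segmented_texts):
    """Map each word string to a numeric identifier; more frequent means smaller."""
    counts = Counter(word for text in segmented_texts for word in text)
    # Sort by descending frequency; the alphabetical tie-break is an assumption.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


# Toy corpus (illustrative only), already segmented into word strings.
corpus = [
    ["adopt", "suitable", "'s", "word"],
    ["describe", "'s", "word"],
    ["carry out", "'s", "describe"],
]
table = build_coding_table(corpus)
print(table)
```

Because every word string receives its own rank, the identifiers are unique and in one-to-one correspondence with the word strings, and the most frequent word string in the toy corpus ("'s", three occurrences) receives identifier 1.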
The embodiment of the present invention is described in detail below with a concrete example.
Suppose that, after word frequency statistics have been performed on a number of texts, the correspondence shown in the following table has been stored in the coding table. The table is for illustration only, and the coding identifiers do not reflect actual values:
Standard word string    Coding identifier
……                      ……
's                      ID1
……                      ……
word                    ID2
carry out               ID3
……                      ……
suitable                ID4
describe                ID5
……                      ……
adopt                   ID6
……                      ……
Suppose the text to be encoded obtained by the acquisition module is "adopting suitable word". The word segmentation module decomposes it into the following word strings to be encoded: adopt, suitable, 's (the possessive particle), word.
Looking up the coding table then gives the coded sequence of the text to be encoded: ID6 ID4 ID1 ID2.
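The segmentation and coding steps of this example can be sketched as follows. The sketch uses forward maximum matching, one simple string-matching segmentation method, which is an assumption since the source does not prescribe a segmentation algorithm. The names `segment` and `encode` are hypothetical, the English strings stand in for the word strings of the original example and are concatenated without spaces to mimic unsegmented text, and raising an error on text not covered by the table is also an assumption.

```python
# Coding table from the example above: standard word string -> numeric identifier.
CODING_TABLE = {"'s": 1, "word": 2, "carry out": 3,
                "suitable": 4, "describe": 5, "adopt": 6}


def segment(text, vocabulary):
    """Forward maximum matching: always take the longest standard word string."""
    max_len = max(len(w) for w in vocabulary)
    pieces, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocabulary:
                pieces.append(candidate)
                i += length
                break
        else:
            # The source does not say how uncovered text is handled.
            raise ValueError(f"no standard word string matches at position {i}")
    return pieces


def encode(text, coding_table):
    """Replace every word string to be encoded with its coding identifier."""
    return [coding_table[w] for w in segment(text, coding_table)]


pieces = segment("adoptsuitable'sword", CODING_TABLE)
sequence = encode("adoptsuitable'sword", CODING_TABLE)
print(pieces, sequence)
```

With this table the sketch yields the identifiers 6, 4, 1, 2, matching the sequence ID6 ID4 ID1 ID2 of the example.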
Compared with existing statistics-based compression methods, the embodiment of the present invention has the following beneficial effects:
A single coding table that applies to all text compression is stored in advance, so individual compressed files do not contain a code table. This greatly reduces the amount of data in the compressed text and improves the compression ratio.
The coding table in the embodiment of the present invention is global: it holds the coding identifiers of word strings obtained from a large corpus, and can therefore provide a higher compression ratio.
In addition, in the prior art a search service can only be provided after the compressed text has been decompressed. To provide a search service as well, the coding table in the embodiment of the present invention also provides a search field for each standard word string. The search field records which files the corresponding standard word string appears in. The file compression device therefore further comprises:
a database update module, configured to add the file identifier of the file to be compressed to the search field corresponding to each of the at least one word string to be encoded.
The compressed file searching device comprises:
a second acquisition module, configured to acquire a search string entered by a user;
a second word segmentation module, configured to segment the search string according to the standard word strings to obtain at least one word string to be searched;
a file identifier extraction module, configured to obtain from the coding table the set of file identifiers corresponding to each of the at least one word string to be searched;
a search result output module, configured to output the intersection of the file identifier sets obtained by the file identifier extraction module as the search result.
With the above processing, the compression device of the embodiment of the present invention can provide a search service using the coding table alone, without decompressing the compressed files, which saves system resources.
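The search field works like an inverted index. A minimal sketch, under stated assumptions: the search fields are kept in a separate dictionary keyed by word string rather than as a column of the coding table, the query is assumed to be already segmented, and the names `register_file` and `search` and the file identifiers are hypothetical.

```python
def register_file(search_fields, file_id, word_strings):
    """Database update step: record file_id under every word string it contains."""
    for word in set(word_strings):
        search_fields.setdefault(word, set()).add(file_id)


def search(search_fields, query_word_strings):
    """Return identifiers of files containing every word string of the query."""
    id_sets = [search_fields.get(word, set()) for word in query_word_strings]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


search_fields = {}
register_file(search_fields, "file-1", ["adopt", "suitable", "'s", "word"])
register_file(search_fields, "file-2", ["describe", "'s", "word"])
result = search(search_fields, ["suitable", "word"])
print(result)
```

Here only "file-1" contains both "suitable" and "word", so it is the only identifier in the intersection. No compressed file is opened during the search.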
Furthermore, the output of the first coding module is a sequence of numbers. To further improve the compression ratio, the file compression device of the embodiment of the present invention therefore also comprises:
a second compression module, configured to use a preset numerical coding compression algorithm to compress each of the coding identifiers corresponding to the at least one word string to be encoded in the coded sequence obtained by the first coding module, to obtain a second coded sequence corresponding to the text to be encoded.
The preset numerical coding compression algorithm may be a numerical coding compression algorithm such as run-length block coding or run-length variable-length coding.
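To illustrate why assigning small numbers to frequent word strings helps this second stage, the sketch below packs an identifier sequence with a variable-length byte encoding (LEB128-style varint). This is a swapped-in technique chosen for brevity and is not one of the run-length algorithms named above; the function name `encode_varint_sequence` is hypothetical.

```python
def encode_varint_sequence(ids):
    """Pack non-negative integers into bytes, 7 payload bits per byte."""
    out = bytearray()
    for n in ids:
        if n < 0:
            raise ValueError("identifiers must be non-negative")
        while True:
            low7 = n & 0x7F
            n >>= 7
            if n:
                out.append(low7 | 0x80)   # high bit set: more bytes follow
            else:
                out.append(low7)          # high bit clear: last byte of this number
                break
    return bytes(out)


packed = encode_varint_sequence([6, 4, 1, 2, 300])
print(packed.hex())
```

Identifiers below 128 take a single byte each, while 300 takes two, so a table that gives the most frequent word strings the smallest numbers keeps the packed sequence short.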
Moreover, because the embodiment of the present invention uses a coding table stored in advance instead of deriving coding identifiers from the text of the file to be compressed, the file compression device can, when used for network transmission, divide the text of a file into several parts and process them one after another without waiting for the whole file to be read, which saves processing time.
The text compression method for data files according to the embodiment of the present invention, as shown in Fig. 5, comprises:
Step 51: acquire part or all of the text in a file to be compressed to form a text to be encoded;
Step 52: segment the text to be encoded according to the standard word strings in the coding table, decomposing the text to be encoded into at least one word string to be encoded, wherein the coding table records the coding identifier corresponding to each standard word string, the coding identifier is represented by a number, each standard word string has a unique coding identifier, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its coding identifier;
Step 53: replace each of the at least one word string to be encoded with the coding identifier of the corresponding standard word string to obtain a first coded sequence corresponding to the text to be encoded;
Step 54: use a preset numerical coding compression algorithm to compress each of the coding identifiers corresponding to the at least one word string to be encoded in the first coded sequence, to obtain a second coded sequence corresponding to the text to be encoded.
The embodiment of the present invention also provides a method for searching the compressed files produced by the compression method shown in Fig. 5. As shown in Fig. 6, it comprises:
Step 61: acquire a search string entered by a user;
Step 62: segment the search string according to the standard word strings to obtain at least one word string to be searched;
Step 63: obtain from the coding table the set of file identifiers corresponding to each of the at least one word string to be searched;
Step 64: output the intersection of the file identifier sets as the search result.
The file decompression device of the embodiment of the present invention comprises:
a first storage module, configured to store a coding table, wherein the coding table records the coding identifier corresponding to each standard word string, the coding identifier is represented by a number, each standard word string has a unique coding identifier, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its coding identifier;
a third acquisition module, configured to acquire a first sequence to be decoded;
a first decoding module, configured to replace each coding identifier in the first sequence to be decoded with the corresponding standard word string, according to the correspondence between the standard word strings and the coding identifiers recorded in the coding table, to obtain the text corresponding to the first sequence to be decoded.
Of course, if the number sequence was compressed during compression, the file decompression device of the embodiment of the present invention further comprises:
A second decoding module, configured to decompress a second sequence to be decoded using a preset numeric decompression algorithm, obtaining the first sequence to be decoded;
The processing procedure then comprises the following steps:
Using the preset numeric decompression algorithm, decompress the second sequence to be decoded to obtain the first sequence to be decoded;
According to the correspondence between standard word strings and code identifiers recorded in the coding table, replace each code identifier in the first sequence to be decoded with the corresponding standard word string, obtaining the text corresponding to the first sequence to be decoded.
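The two decoding steps above can be sketched as a pair of functions. As an assumption, the "preset numeric decompression algorithm" is taken to be the inverse of a variable-length byte encoding (7 data bits per byte, high bit as continuation flag); the patent does not name the algorithm, and `decode_varint` and `decompress` are hypothetical names.

```python
def decode_varint(data):
    """Unpack variable-length bytes into a list of code identifiers."""
    code_ids, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            # Continuation flag set: the next byte holds higher bits.
            shift += 7
        else:
            code_ids.append(value)
            value, shift = 0, 0
    if shift:
        raise ValueError("truncated input: last number is incomplete")
    return code_ids


def decompress(data, code_to_word):
    """Second sequence -> first sequence -> text."""
    first_sequence = decode_varint(data)
    return "".join(code_to_word[code] for code in first_sequence)


# Hypothetical reverse coding table (code identifier -> standard word string).
reverse_table = {1: "file", 2: "compression"}
restored = decompress(bytes([0x01, 0x02]), reverse_table)
```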
An embodiment of the present invention also provides a file compression and transmission method, comprising:
Obtaining all or part of the text in a file to be compressed to form the text to be encoded;
Segmenting the text to be encoded according to the standard word strings, decomposing it into at least one word string to be encoded;
According to the correspondence between standard word strings and code identifiers recorded in a coding table stored in advance, replacing each of the at least one word string to be encoded with the code identifier of the matching standard word string, obtaining a first coded sequence corresponding to the text to be encoded, where each standard word string has a unique code identifier;
Sending the first coded sequence to a network storage server.
When only part of the text in the file to be compressed is obtained at a time, the above steps are of course repeated until all of the text in the file has been processed.
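The compression and transmission steps above, including the repetition over partial texts, can be sketched as follows. The sketch assumes greedy forward maximum matching for segmentation, a coding table that covers every character of the input, and a caller-supplied `send` callable standing in for the transfer to the network storage server; all names are hypothetical and the transport itself is not modelled.

```python
def segment(text, vocabulary):
    """Greedy forward maximum matching against the standard word strings."""
    max_len = max(len(word) for word in vocabulary)
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocabulary:
                tokens.append(piece)
                pos += size
                break
        else:
            # Lossless coding needs every character to be covered.
            raise ValueError(f"no standard word string at position {pos}")
    return tokens


def encode(text, coding_table):
    """Replace each word string with its code identifier."""
    return [coding_table[token] for token in segment(text, coding_table)]


def compress_and_send(parts, coding_table, send):
    """Encode and transmit the file one part at a time."""
    for part in parts:
        send(encode(part, coding_table))


coding_table = {"file": 1, "compression": 2}
sent = []  # stands in for the network storage server
compress_and_send(["filecompression", "file"], coding_table, sent.append)
```

Each part is sent as soon as it is encoded, so only one part is held in memory at a time. A real implementation would also have to choose part boundaries that do not split a word string in two.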
The corresponding file compression and transmission device comprises:
A first storage module, configured to store a coding table, where the coding table records the correspondence between standard word strings and code identifiers, and each standard word string has a unique code identifier;
A first acquisition module, configured to obtain part or all of the text of a file to be compressed to form the text to be encoded;
A first word segmentation module, configured to segment the text to be encoded according to the standard word strings, decomposing it into at least one word string to be encoded;
A first encoding module, configured to replace each of the at least one word string to be encoded with the code identifier of the matching standard word string, according to the correspondence between standard word strings and code identifiers recorded in the coding table, obtaining a first coded sequence corresponding to the text to be encoded;
A transmission module, configured to send the first coded sequence to a network storage server.
Compared with prior-art schemes in which a file is compressed and then transferred to a network storage server, the same coding table is stored on the network storage server in advance, so the compressed coded sequence does not need to include the coding table, which reduces the network load. Moreover, this single coding table applies to all compressed texts, so the more texts are stored on the network, the more storage space is saved.
At the same time, because the coding table is obtained in advance, the sending end can divide the text to be compressed into several parts, process each part separately, and transmit each part as soon as it has been processed, which reduces the need for temporary storage.
The above is only the preferred embodiment of the present invention. It should be pointed out that those skilled in the art can make various improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (19)

1. A file compression device, characterized by comprising:
A first storage module, configured to store a coding table, where the coding table records the correspondence between standard word strings and code identifiers, and each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string to which that search field corresponds;
A first acquisition module, configured to obtain part or all of the text of a file to be compressed to form the text to be encoded;
A first word segmentation module, configured to segment the text to be encoded according to the standard word strings, decomposing it into at least one word string to be encoded;
A first encoding module, configured to replace each of the at least one word string to be encoded with the code identifier of the matching standard word string, according to the correspondence between the standard word strings and the code identifiers recorded in the coding table, obtaining a first coded sequence corresponding to the text to be encoded.
2. The file compression device according to claim 1, characterized in that the code identifier is represented by a number, and the more frequently a standard word string occurs in the texts that make up the corpus, the smaller the number that represents the code identifier of that word string.
3. The file compression device according to claim 2, characterized by further comprising:
A statistics module, configured to perform word frequency statistics on the texts that make up the corpus, obtaining the frequency with which each standard word string occurs in those texts.
4. The file compression device according to claim 1, 2 or 3, characterized by further comprising:
A modification module, configured to add the file identifier of the file to be compressed to the search field corresponding to each of the at least one word string to be encoded.
5. The file compression device according to claim 1, 2 or 3, characterized by further comprising:
A second compression module, configured to use a preset numeric encoding compression algorithm to compress-encode each code identifier, corresponding to the at least one word string to be encoded, in the coded sequence obtained by the first encoding module, obtaining a second coded sequence corresponding to the text to be encoded.
6. A file compression method, characterized by comprising:
Obtaining part or all of the text in a file to be compressed to form the text to be encoded;
Segmenting the text to be encoded according to standard word strings, decomposing it into at least one word string to be encoded;
According to the correspondence between the standard word strings and code identifiers recorded in a coding table stored in advance, replacing each of the at least one word string to be encoded with the code identifier of the matching standard word string, obtaining a first coded sequence corresponding to the text to be encoded, where each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string to which that search field corresponds.
7. The method according to claim 6, characterized in that the code identifier is represented by a number, and the more frequently a standard word string occurs in the texts that make up the corpus, the smaller the number that represents the code identifier of that word string.
8. The method according to claim 6 or 7, characterized in that, in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string to which that search field corresponds, the method further comprising:
Adding the file identifier of the file to be compressed to the search field corresponding to each of the at least one word string to be encoded.
9. The method according to claim 6 or 7, characterized by further comprising:
Using a preset numeric encoding compression algorithm to compress-encode each code identifier in the first coded sequence that corresponds to the at least one word string to be encoded, obtaining a second coded sequence corresponding to the text to be encoded.
10. A file decompression device, characterized by comprising:
A third acquisition module, configured to obtain a first sequence to be decoded;
A first decoding module, configured to replace each code identifier in the first sequence to be decoded with the corresponding standard word string, according to the correspondence between standard word strings and code identifiers recorded in a coding table stored in advance, obtaining the text corresponding to the first sequence to be decoded, where each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string to which that search field corresponds.
11. The file decompression device according to claim 10, characterized in that the code identifier is represented by a number, and the more frequently a standard word string occurs in the texts that make up the corpus, the smaller the number that represents the code identifier of that word string.
12. The file decompression device according to claim 11, characterized by further comprising:
A statistics module, configured to perform word frequency statistics on the texts that make up the corpus, obtaining the frequency with which each standard word string occurs in those texts.
13. The file decompression device according to claim 10, 11 or 12, characterized by further comprising:
A second decoding module, configured to decompress a second sequence to be decoded using a preset numeric decompression algorithm, obtaining the first sequence to be decoded.
14. A file decompression method, characterized by comprising:
Obtaining a first sequence to be decoded;
According to the correspondence between standard word strings and code identifiers recorded in a coding table stored in advance, replacing each code identifier in the first sequence to be decoded with the corresponding standard word string, obtaining the text corresponding to the first sequence to be decoded, where each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string to which that search field corresponds.
15. The method according to claim 14, characterized in that the code identifier is represented by a number, and the more frequently a standard word string occurs in the texts that make up the corpus, the smaller the number that represents the code identifier of that word string.
16. The method according to claim 14 or 15, characterized by further comprising:
Using a preset numeric decompression algorithm to decompress a second sequence to be decoded, obtaining the first sequence to be decoded.
17. A file compression and transmission method, characterized by comprising:
Obtaining part or all of the text in a file to be compressed to form the text to be encoded;
Segmenting the text to be encoded according to standard word strings, decomposing it into at least one word string to be encoded;
According to the correspondence between the standard word strings and code identifiers recorded in a coding table stored in advance, replacing each of the at least one word string to be encoded with the code identifier of the matching standard word string, obtaining a first coded sequence corresponding to the text to be encoded, where each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string to which that search field corresponds;
Sending the first coded sequence to a network storage server.
18. The method according to claim 17, characterized in that, when part of the text in the file to be compressed is obtained, the method further comprises:
Repeating the steps from obtaining text through sending the coded sequence until all of the text in the file to be compressed has been compressed and transmitted.
19. A file compression and transmission device, characterized by comprising:
A first storage module, configured to store a coding table, where the coding table records the correspondence between standard word strings and code identifiers, and each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string to which that search field corresponds;
A first acquisition module, configured to obtain part or all of the text of a file to be compressed to form the text to be encoded;
A first word segmentation module, configured to segment the text to be encoded according to the standard word strings, decomposing it into at least one word string to be encoded;
A first encoding module, configured to replace each of the at least one word string to be encoded with the code identifier of the matching standard word string, according to the correspondence between standard word strings and code identifiers recorded in the coding table, obtaining a first coded sequence corresponding to the text to be encoded;
A transmission module, configured to send the first coded sequence to a network storage server.
CN200910076795.3A 2009-01-21 2009-01-21 File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device Active CN101783788B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910076795.3A CN101783788B (en) 2009-01-21 2009-01-21 File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device


Publications (2)

Publication Number Publication Date
CN101783788A (en) 2010-07-21
CN101783788B (en) 2014-09-03

Family

ID=42523607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910076795.3A Active CN101783788B (en) 2009-01-21 2009-01-21 File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device

Country Status (1)

Country Link
CN (1) CN101783788B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567322B (en) * 2010-12-09 2015-04-01 北京大学 Text compression method and text compression device
CN104021121B (en) * 2013-02-28 2018-01-26 北京四维图新科技股份有限公司 A kind of text data compression method, apparatus and server
CN104933063B (en) * 2014-03-19 2018-08-24 重庆新媒农信科技有限公司 Data processing method, searching method and device
SE538512C2 (en) * 2014-11-26 2016-08-30 Kelicomp Ab Improved compression and encryption of a file
CN105447393B (en) * 2015-11-18 2018-06-01 国网北京市电力公司 For the file transmitting method and device of electric system
CN106202172B (en) * 2016-06-24 2019-07-30 中国农业银行股份有限公司 Text compression methods and device
CN107818121B (en) * 2016-09-14 2022-05-10 阿里巴巴集团控股有限公司 HTML file compression method and device and electronic equipment
CN107918654B (en) * 2017-11-16 2020-07-24 联想(北京)有限公司 File decompression method and device and electronic equipment
CN108021541A (en) * 2017-12-15 2018-05-11 安徽长泰信息安全服务有限公司 A kind of method and its system for reducing text stored memory
CN108256017B (en) * 2018-01-08 2020-12-15 武汉斗鱼网络科技有限公司 Method and device for data storage and computer equipment
CN108829872B (en) * 2018-06-22 2021-03-09 武汉轻工大学 Method, device, system and storage medium for rapidly processing lossless compressed file
CN110032432B (en) * 2018-12-03 2023-09-26 创新先进技术有限公司 Example compression method and device and example decompression method and device
CN110309376A (en) * 2019-07-10 2019-10-08 深圳市友华软件科技有限公司 The configuration entry management method of embedded platform
CN111431537A (en) * 2020-03-06 2020-07-17 平安科技(深圳)有限公司 Data compression method and device and computer readable storage medium
CN112685044A (en) * 2020-12-20 2021-04-20 武汉迈威通信股份有限公司 Webpage compression and storage method and system based on ROM
CN112684325B (en) * 2020-12-30 2022-03-18 杭州加速科技有限公司 Compression method and device for test vector instruction in ATE (automatic test equipment)
CN112799672A (en) * 2020-12-31 2021-05-14 杭州广立微电子股份有限公司 Test data processing method based on keywords
CN113641434A (en) * 2021-08-12 2021-11-12 上海酷栈科技有限公司 Cloud desktop data compression self-adaptive encoding method and system and storage device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1630370A (en) * 2003-12-15 2005-06-22 联想(北京)有限公司 A method of coding and decoding for data compression
CN101086749A (en) * 2006-06-08 2007-12-12 杭州掌幄科技有限公司 A data compression algorithm of electronic case history

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100448111B1 (en) * 1998-06-09 2004-09-10 마츠시타 덴끼 산교 가부시키가이샤 Image encoder, image decoder, character checker, and data storage medium




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant