CN101783788B - File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device - Google Patents
- Publication number: CN101783788B
- Authority: CN (China)
- Prior art keywords: word string, file, text, encoded, standard word
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Document Processing Apparatus (AREA)
Abstract
The embodiments of the invention provide a file compression method and device, a file decompression method and device, and a compressed file searching method and device. The file compression device comprises a first storage module, a first acquiring module, a first word segmentation module and a first coding module. The first storage module stores a coding table that records the correspondence between standard character strings and coding identifiers, each standard character string having a unique coding identifier. The first acquiring module acquires part or all of the text in a file to be compressed to form a text to be coded. The first word segmentation module segments the text to be coded according to the standard character strings, decomposing it into at least one character string to be coded. The first coding module obtains a first coding sequence corresponding to the text to be coded by replacing each character string to be coded with the coding identifier of the corresponding standard character string, according to the correspondence recorded in the coding table. The invention improves the compression ratio of the text compression algorithm and makes searching more convenient.
Description
Technical field
The present invention relates to the technical field of file compression, and in particular to a file compression method and device, a file decompression method and device, and a compressed file searching method and device.
Background art
With the continuous advance of computer technology, data files of all types are becoming larger and larger. Storing them occupies more and more storage space, and transmitting them occupies more and more bandwidth. Compressing data files is therefore becoming increasingly important in computer technology.
At present, data file compression is divided into lossy compression and lossless compression. Commonly used tools such as WinRAR and WinZip perform lossless compression. Their basic principle is the same: in short, repeated data in a file is represented in a more concise way, that is, data redundancy is removed.
Existing text compression algorithms include a class of statistical compression algorithms, such as the Huffman algorithm, described below.
The Huffman algorithm is a compression method based on statistics. In essence, it re-encodes the characters in a text so that the more frequently a character is used, the shorter its code.
The encoded text mainly comprises two parts: the Huffman code table and the compressed content. During decompression, the Huffman code table is extracted first, and then the compressed content is decoded character by character to reconstruct the source file.
Thus the key to using the Huffman algorithm is constructing the Huffman code table, which relies on the Huffman tree data structure. Once a Huffman tree has been generated, the code table has been generated as well.
An example follows. Suppose the original text is "abcbbcccc".
Generating the Huffman tree comprises the following steps:
Step A1: scan the source file and count character frequencies.
For the sample, the statistics are: a occurs 1 time, b occurs 3 times and c occurs 5 times, recorded as the queue shown in Fig. 1: a:1 b:3 c:5.
Step A2: take the two nodes with the lowest frequencies out of the queue and merge them into a branch node X whose frequency is the sum of the two nodes' frequencies; add X to the original queue, and keep the queue sorted in ascending order of frequency after the addition.
For the sample, the queue shown in Fig. 2 is obtained.
Step A3: repeat step A2 until only one node remains in the queue.
Step A4: the above steps yield the Huffman tree shown in Fig. 3. Each leaf node is a character, and the path from the root node to a leaf node is the Huffman code of that character. Moving from a node to its left child contributes a 0 to the path; moving to its right child contributes a 1.
As shown in Fig. 3, character a is encoded as 00, character b as 01 and character c as 1. Once the Huffman code table has been generated, the original text "abcbbcccc" becomes the bit string 0001101011111. Assuming each character occupies 2 bytes, the size falls from the original 18 bytes (9 × 2), or 144 bits, to 13 bits, which fit in 2 bytes. The goal of compression is thus achieved.
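Steps A1 to A4 can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the patent: the function name `build_huffman_codes`, the use of a binary heap in place of the sorted queue, and the tie-breaking counter are all assumptions, and the input is assumed to be non-empty.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_codes(text):
    """Return {character: bit string} for a non-empty text (steps A1-A4)."""
    # Step A1: count character frequencies.
    freq = Counter(text)
    tie = count()  # unique tie-breaker so tree nodes are never compared
    # A heap entry is (frequency, tie, node); a node is a character
    # or a (left, right) pair for a branch node.
    heap = [(f, next(tie), ch) for ch, f in sorted(freq.items())]
    heapq.heapify(heap)
    # Steps A2-A3: merge the two least frequent nodes until one remains.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    root = heap[0][2]

    # Step A4: a left edge contributes "0", a right edge contributes "1".
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # a lone symbol still needs one bit

    walk(root, "")
    return codes


codes = build_huffman_codes("abcbbcccc")
bits = "".join(codes[ch] for ch in "abcbbcccc")
print(codes)            # {'a': '00', 'b': '01', 'c': '1'}
print(bits, len(bits))  # 0001101011111 13
```

For the sample text this reproduces the code table and the 13-bit string given above. With other inputs, equal frequencies may be merged in a different order than in a hand-drawn tree, so individual codes can differ while the total length stays the same.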
The decompression process is as follows: first a Huffman tree is generated from the Huffman code table, and then the compressed content is decompressed according to the Huffman tree.
For example, if the compressed content is the bit string 0001101011111 and the tree is as shown in Fig. 3, decoding starts from the root node. The first bit is 0, so the left subtree is entered; the second bit is 0, so the left subtree is entered again, reaching leaf node a. The first decoded character is therefore a. Every character is decoded starting from the root node, turning left or right according to the bit stream until a leaf node is reached, which gives the decoded character. This process is repeated until all characters have been decompressed.
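The decoding walk described above can be sketched as follows. The sketch assumes the code table from the example (a = 00, b = 01, c = 1) and represents the tree as nested dictionaries keyed by "0" and "1"; the function name `decode` and that representation are illustrative choices, not part of the patent.

```python
def decode(bits, codes):
    """Decode a bit string using a prefix-free code table {char: bit string}."""
    # Rebuild the tree from the code table: inner nodes are dicts, leaves are characters.
    root = {}
    for ch, code in codes.items():
        node = root
        for b in code[:-1]:
            node = node.setdefault(b, {})
        node[code[-1]] = ch

    out = []
    node = root
    for b in bits:
        node = node[b]             # turn left on "0", right on "1"
        if isinstance(node, str):  # reached a leaf: emit it and restart at the root
            out.append(node)
            node = root
    return "".join(out)


table = {"a": "00", "b": "01", "c": "1"}
text = decode("0001101011111", table)
print(text)  # abcbbcccc
```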
However, in the course of realizing the present invention, the inventor found that the prior art has at least the following shortcoming:
In the prior art, every compressed text document must contain two parts: the code table used for encoding and the coded sequence obtained by compressing the text. Because both parts are stored in the same compressed document, the compression ratio is not ideal. It is therefore necessary to propose a new compression scheme to further improve the compression ratio of text compression algorithms.
Summary of the invention
The object of the embodiments of the present invention is to provide a file compression method and device, a file decompression method and device, and a compressed file searching method and device, so as to improve the compression ratio of text compression algorithms.
To achieve the above object, an embodiment of the present invention provides a file compression device, comprising:
a first storage module, configured to store a coding table that records the correspondence between standard word strings and code identifiers, each standard word string having a unique code identifier;
a first acquisition module, configured to obtain part or all of the text in a file to be compressed to form a text to be encoded;
a first word segmentation module, configured to perform word segmentation on the text to be encoded according to the standard word strings, decomposing the text to be encoded into at least one word string to be encoded;
a first coding module, configured to replace the at least one word string to be encoded with the code identifier of the corresponding standard word string, according to the correspondence between the standard word strings and the code identifiers recorded in the coding table, to obtain a first coded sequence corresponding to the text to be encoded.
In the above file compression device, the code identifier is represented by a number, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its code identifier.
The above file compression device further comprises:
a statistics module, configured to perform word frequency statistics on the texts forming the corpus, to obtain the frequency with which each standard word string occurs in those texts.
In the above file compression device, the coding table provides a search field for each standard word string. The search field records file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string corresponding to that search field. The file compression device further comprises:
a modification module, configured to add the file identifier of the file to be compressed to the search field corresponding to each of the at least one word string to be encoded.
To achieve the above object, an embodiment of the present invention also provides a file compression method, comprising:
obtaining part or all of the text in a file to be compressed to form a text to be encoded;
performing word segmentation on the text to be encoded according to standard word strings, decomposing the text to be encoded into at least one word string to be encoded;
replacing the at least one word string to be encoded with the code identifier of the corresponding standard word string, according to the correspondence between the standard word strings and code identifiers recorded in a coding table stored in advance, to obtain a first coded sequence corresponding to the text to be encoded, each standard word string having a unique code identifier.
In the above method, the code identifier is represented by a number, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its code identifier.
In the above method, the coding table provides a search field for each standard word string. The search field records file identifiers, and each file indicated by a file identifier recorded in a search field contains the standard word string corresponding to that search field. The method further comprises:
adding the file identifier of the file to be compressed to the search field corresponding to each of the at least one word string to be encoded.
To achieve the above object, an embodiment of the present invention also provides a file decompression device, comprising:
a third acquisition module, configured to obtain a first sequence to be decoded;
a first decoding module, configured to replace the code identifiers in the first sequence to be decoded with the corresponding standard word strings, according to the correspondence between standard word strings and code identifiers recorded in a coding table stored in advance, to obtain the text corresponding to the first sequence to be decoded, each standard word string having a unique code identifier.
In the above device, the code identifier is represented by a number, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its code identifier.
The above device further comprises:
a statistics module, configured to perform word frequency statistics on the texts forming the corpus, to obtain the frequency with which each standard word string occurs in those texts.
The above device further comprises:
a second decoding module, configured to decompress a second sequence to be decoded using a preset numeric decompression algorithm, to obtain the first sequence to be decoded.
To achieve the above object, an embodiment of the present invention also provides a file decompression method, comprising:
obtaining a first sequence to be decoded;
replacing the code identifiers in the first sequence to be decoded with the corresponding standard word strings, according to the correspondence between standard word strings and code identifiers recorded in a coding table stored in advance, to obtain the text corresponding to the first sequence to be decoded, each standard word string having a unique code identifier.
In the above method, the code identifier is represented by a number, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its code identifier.
The above method further comprises:
decompressing a second sequence to be decoded using a preset numeric decompression algorithm, to obtain the first sequence to be decoded.
To achieve the above object, an embodiment of the present invention also provides a compressed file searching device, comprising:
a first storage module, configured to store in advance a coding table that records the correspondence between standard word strings and code identifiers represented by numbers, each standard word string having a unique code identifier; the coding table provides a search field for each standard word string, the search field records file identifiers, and each file indicated by a file identifier contains the standard word string corresponding to that search field;
a second acquisition module, configured to obtain a search string input by a user;
a second word segmentation module, configured to perform word segmentation on the search string according to the standard word strings, to obtain at least one word string to be searched;
a file identifier extraction module, configured to obtain from the coding table the set of file identifiers corresponding to each of the at least one word string to be searched;
a search result output module, configured to output the intersection of the file identifier sets as the search result.
To achieve the above object, an embodiment of the present invention also provides a compressed file searching method, comprising:
obtaining a search string input by a user;
performing word segmentation on the search string according to standard word strings, to obtain at least one word string to be searched;
obtaining, from a coding table stored in advance, the set of file identifiers corresponding to each of the at least one word string to be searched; the coding table records the correspondence between standard word strings and code identifiers represented by numbers, each standard word string having a unique code identifier; the coding table provides a search field for each standard word string, the search field records file identifiers, and each file indicated by a file identifier contains the standard word string corresponding to that search field;
outputting the intersection of the file identifier sets as the search result.
To achieve the above object, an embodiment of the present invention also provides a file compression and transmission method, comprising:
obtaining part or all of the text in a file to be compressed to form a text to be encoded;
performing word segmentation on the text to be encoded according to standard word strings, decomposing the text to be encoded into at least one word string to be encoded;
replacing the at least one word string to be encoded with the code identifier of the corresponding standard word string, according to the correspondence between standard word strings and code identifiers recorded in a coding table stored in advance, to obtain a first coded sequence corresponding to the text to be encoded, each standard word string having a unique code identifier;
sending the first coded sequence to a network storage server.
In the above method, when only part of the text in the file to be compressed is obtained, the method further comprises:
repeating the steps from obtaining text to sending the coded sequence until all the text in the file to be compressed has been compressed and transmitted.
To achieve the above object, an embodiment of the present invention also provides a file compression and transmission device, comprising:
a first storage module, configured to store a coding table that records the correspondence between standard word strings and code identifiers, each standard word string having a unique code identifier;
a first acquisition module, configured to obtain part or all of the text in a file to be compressed to form a text to be encoded;
a first word segmentation module, configured to perform word segmentation on the text to be encoded according to the standard word strings, decomposing the text to be encoded into at least one word string to be encoded;
a first coding module, configured to replace the at least one word string to be encoded with the code identifier of the corresponding standard word string, according to the correspondence between the standard word strings and code identifiers recorded in the coding table, to obtain a first coded sequence corresponding to the text to be encoded;
a transmission module, configured to send the first coded sequence to a network storage server.
The embodiments of the present invention have the following beneficial effects:
First, a code table applicable to all text compression is stored in advance, so no code table is included in each compressed file. This greatly reduces the amount of data in the compressed text and improves the compression ratio;
Second, the code table is global: it holds the code identifiers of global word strings obtained from a large corpus, and can therefore provide a higher compression ratio;
Third, compared with the prior-art scheme of transmitting to a network storage server after compression, the same coding table is stored on the network storage server in advance, so the compressed coded sequence does not include the coding table, which reduces the network load. Moreover, the coding table applies to all compressed texts, so storage space is saved when many texts are stored on the network;
Finally, because a coding table obtained in advance is used, the text to be compressed can be divided into multiple parts at the sending end and processed separately, with each part transmitted as soon as it has been processed, which reduces the need for temporary storage.
Brief description of the drawings
Fig. 1 to Fig. 3 are schematic diagrams of the text compression process of the Huffman algorithm;
Fig. 4 is a structural schematic diagram of the file compression device of an embodiment of the present invention;
Fig. 5 is a schematic flowchart of the file compression method of an embodiment of the present invention;
Fig. 6 is a schematic flowchart of the compressed file searching method of an embodiment of the present invention.
Detailed description of the embodiments
In the method and device of the embodiments of the present invention, a database is stored in advance. The database records numeric codes for the characters or words used to form text. When text is compressed, this database is used for encoding, which improves the compression ratio. In addition, a search field is added to the coding table so that the coding table can be used for searching, which reduces the resources consumed by search.
As shown in Fig. 4, the file compression device for data files of the embodiment of the present invention comprises:
a first storage module, configured to store a coding table that records the code identifier corresponding to each standard word string; the code identifier is represented by a number, and each standard word string has a unique code identifier (that is, the code identifiers of different standard word strings differ, and standard word strings and code identifiers are in one-to-one correspondence); the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its code identifier;
a first acquisition module, configured to obtain part or all of the text in a file to be compressed to form a text to be encoded;
a first word segmentation module, configured to perform word segmentation on the text to be encoded according to the standard word strings in the coding table, decomposing the text to be encoded into at least one word string to be encoded;
a first coding module, configured to replace the at least one word string to be encoded with the code identifier of the corresponding standard word string, to obtain a first coded sequence corresponding to the text to be encoded.
As can be seen from the above, the code identifier of a standard word string is related to its frequency of occurrence. The file compression device of the embodiment therefore further comprises:
a statistics module, configured to perform word frequency statistics on the texts forming the corpus, to obtain the frequency with which the standard word strings forming the texts occur in those texts.
Existing word segmentation algorithms fall into three major categories: methods based on string matching, methods based on understanding, and methods based on statistics. The specific embodiments of the present invention do not limit which one is used.
The word strings and code identifiers recorded in the above coding table satisfy the following conditions:
1. each code identifier is unique;
2. standard word strings and code identifiers are in one-to-one correspondence;
3. the more times a standard word string occurs in the texts forming the corpus, the smaller the number representing its code identifier.
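One way to build a table that satisfies these three conditions is to rank the word strings by frequency and use the rank as the identifier. The sketch below is only an assumption about how this could be done: the function name `build_coding_table`, the toy English corpus, and the alphabetical tie-break are illustrative, and the corpus is assumed to be segmented into word strings already.

```python
from collections import Counter


def build_coding_table(corpus_tokens):
    """Map each word string to a unique number; more frequent means smaller."""
    counts = Counter(corpus_tokens)
    # Sort by descending frequency; break ties alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


corpus = ["the", "file", "the", "text", "the", "file"]
coding_table = build_coding_table(corpus)
print(coding_table)  # {'the': 1, 'file': 2, 'text': 3}
```

Because every word string receives a distinct rank, uniqueness and one-to-one correspondence follow directly, and the ordering gives the most frequent word string the smallest number.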
The embodiment is described in detail below with a concrete example.
Suppose that, after word frequency statistics have been performed on a number of texts, the correspondence shown in the following table is stored in the coding table. It should be understood that this is only an illustration, and the code identifiers do not represent an actual situation:
| Standard word string | Code identification |
| --- | --- |
| …… | …… |
| 's | ID1 |
| …… | …… |
| Word | ID2 |
| Carry out | ID3 |
| …… | …… |
| Suitably | ID4 |
| Describe | ID5 |
| …… | …… |
| Adopt | ID6 |
| …… | …… |
Suppose the text to be encoded obtained by the acquisition module is "adopting suitable word" (the original phrase contains a particle, rendered as "'s" in the table above). The word segmentation module yields the following word strings to be encoded: "adopt", "suitably", "'s", "word".
Looking these up in the coding table gives the coded sequence of the text to be encoded: ID6 ID4 ID1 ID2.
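The example can be sketched in Python as segmentation followed by table lookup. Several things here are assumptions: the translated table gives only English renderings, so the Chinese strings below are hypothetical stand-ins for the six entries; forward maximum matching is just one string-matching segmentation method, chosen for brevity, and the patent does not prescribe it; the names `CODING_TABLE`, `segment` and `encode` are illustrative.

```python
# Hypothetical stand-ins for the table entries (adopt, suitably, 's, word, ...).
CODING_TABLE = {"的": 1, "词语": 2, "进行": 3, "合适": 4, "描述": 5, "采用": 6}


def segment(text, vocab):
    """Forward maximum matching: always take the longest standard word string."""
    max_len = max(len(w) for w in vocab)
    pieces, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in vocab:
                pieces.append(candidate)
                i += n
                break
        else:
            # No standard word string matches here; a real table would need a fallback.
            raise ValueError(f"no standard word string at position {i}")
    return pieces


def encode(text, coding_table):
    """Replace each word string to be encoded with its code identifier."""
    return [coding_table[w] for w in segment(text, coding_table)]


pieces = segment("采用合适的词语", CODING_TABLE)
sequence = encode("采用合适的词语", CODING_TABLE)
print(pieces)    # ['采用', '合适', '的', '词语']
print(sequence)  # [6, 4, 1, 2]
```

The output corresponds to ID6 ID4 ID1 ID2 in the example. The sketch simply fails on text that the table does not cover; how such text is handled is left open here.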
Compared with existing statistics-based compression methods, the embodiment of the present invention has the following beneficial effects:
A code table applicable to all text compression is stored in advance, so no code table is included in each compressed file. This greatly reduces the amount of data in the compressed text and improves the compression ratio;
The code table is global: it holds the code identifiers of global word strings obtained from a large corpus, and can therefore provide a higher compression ratio.
Moreover, in the prior art, compressed text must be decompressed before a search service can be provided. To further provide a search service, in the embodiment of the present invention the coding table also provides a search field for each standard word string. The search field records which files the corresponding standard word string appears in. The file compression device therefore further comprises:
a database update module, configured to add the file identifier of the file to be compressed to the search field corresponding to each of the at least one word string to be encoded.
The compressed file searching device comprises:
a second acquisition module, configured to obtain a search string input by a user;
a second word segmentation module, configured to perform word segmentation on the search string according to the standard word strings, to obtain at least one word string to be searched;
a file identifier extraction module, configured to obtain from the coding table the set of file identifiers corresponding to each of the at least one word string to be searched;
a search result output module, configured to output, as the search result, the intersection of the file identifier sets obtained by the file identifier extraction module.
Through the above processing, when a search service is provided using the compression device of the embodiment, the search can be carried out using the coding table without decompressing the compressed files, which saves system resources.
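The search field behaves like an inverted index from word string to file identifiers. A minimal sketch, assuming the search fields are held in a dictionary of sets and that both the file text and the query have already been segmented into word strings (the names `search_field`, `register_file` and `search` are hypothetical):

```python
search_field = {}  # standard word string -> set of file identifiers


def register_file(file_id, word_strings):
    """On compression, record the file identifier under every word string it contains."""
    for w in set(word_strings):
        search_field.setdefault(w, set()).add(file_id)


def search(query_word_strings):
    """Return the files containing every word string of the query (set intersection)."""
    id_sets = [search_field.get(w, set()) for w in query_word_strings]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


register_file("doc1", ["adopt", "suitable", "word"])
register_file("doc2", ["adopt", "describe"])
print(search(["adopt", "word"]))  # {'doc1'}
print(search(["adopt"]))          # both documents
```

Only the table is consulted at query time, so none of the compressed files has to be decompressed to answer a search.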
Meanwhile, the output of the first coding module is a sequence of numbers. To further improve the compression ratio, the file compression device of the embodiment therefore further comprises:
a second compression module, configured to use a preset numeric coding compression algorithm to compress and encode each code identifier corresponding to the at least one word string to be encoded in the coded sequence obtained by the first coding module, to obtain a second coded sequence corresponding to the text to be encoded.
The preset numeric coding compression algorithm may be a numeric coding compression algorithm such as run-length block coding or run-length variable-length coding.
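To illustrate why assigning small numbers to frequent word strings helps a second numeric stage, the sketch below uses a generic variable-length byte encoding (7 payload bits per byte, with the high bit marking continuation). This is a stand-in chosen for brevity, not the run-length algorithms named above, and the function names `encode_ids` and `decode_ids` are hypothetical.

```python
def encode_ids(ids):
    """Encode non-negative integers so that small values take fewer bytes."""
    out = bytearray()
    for n in ids:
        if n < 0:
            raise ValueError("code identifiers must be non-negative")
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            n >>= 7
        out.append(n)                      # final byte, continuation bit clear
    return bytes(out)


def decode_ids(data):
    """Inverse of encode_ids."""
    ids, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            ids.append(n)
            n, shift = 0, 0
    return ids


packed = encode_ids([6, 4, 1, 2, 300])
print(len(packed))         # 6: one byte each for 6, 4, 1, 2 and two bytes for 300
print(decode_ids(packed))  # [6, 4, 1, 2, 300]
```

Under this scheme identifiers below 128 occupy a single byte, so the most frequent word strings cost the least space.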
Furthermore, because the embodiment uses a coding table stored in advance rather than deriving code identifiers from the text of the file to be compressed, when the file compression device is used for network transmission the text of a file can be divided into multiple parts and processed serially, without waiting for the whole file to be read. Processing time is thereby saved.
The text compression method for data files of the embodiment of the present invention, as shown in Fig. 5, comprises:
Step 51: obtain part or all of the text in a file to be compressed to form a text to be encoded;
Step 52: perform word segmentation on the text to be encoded according to the standard word strings in a coding table, decomposing the text to be encoded into at least one word string to be encoded. The coding table records the code identifier corresponding to each standard word string; the code identifier is represented by a number, each standard word string has a unique code identifier, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its code identifier;
Step 53: replace the at least one word string to be encoded with the code identifier of the corresponding standard word string, to obtain a first coded sequence corresponding to the text to be encoded;
Step 54: use a preset numeric coding compression algorithm to compress and encode each code identifier corresponding to the at least one word string to be encoded in the first coded sequence, to obtain a second coded sequence corresponding to the text to be encoded.
The embodiment of the present invention also provides a method for searching the compressed files obtained by the compression method shown in Fig. 5. As shown in Fig. 6, it comprises:
Step 61: obtain a search string input by a user;
Step 62: perform word segmentation on the search string according to the standard word strings, to obtain at least one word string to be searched;
Step 63: obtain from the coding table the set of file identifiers corresponding to each of the at least one word string to be searched;
Step 64: output the intersection of the file identifier sets as the search result.
The file decompression device of the embodiment of the present invention comprises:
a first storage module, configured to store a coding table that records the code identifier corresponding to each standard word string; the code identifier is represented by a number, each standard word string has a unique code identifier, and the more frequently a standard word string occurs in the texts forming a corpus, the smaller the number representing its code identifier;
a third acquisition module, configured to obtain a first sequence to be decoded;
a first decoding module, configured to replace the code identifiers in the first sequence to be decoded with the corresponding standard word strings, according to the correspondence between the standard word strings and the code identifiers recorded in the coding table, to obtain the text corresponding to the first sequence to be decoded.
Of course, if the number sequence was compressed during compression, the file decompression device of the embodiment further comprises:
a second decoding module, configured to decompress a second sequence to be decoded using a preset numeric decompression algorithm, to obtain the first sequence to be decoded.
Its processing comprises the following steps:
decompressing the second sequence to be decoded using the preset numeric decompression algorithm, to obtain the first sequence to be decoded;
replacing the code identifiers in the first sequence to be decoded with the corresponding standard word strings, according to the correspondence between standard word strings and code identifiers recorded in the coding table, to obtain the text corresponding to the first sequence to be decoded.
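Because standard word strings and code identifiers are in one-to-one correspondence, the final replacement step amounts to inverting the coding table. A minimal sketch, assuming the first sequence to be decoded is already available as a list of integers and reusing hypothetical Chinese stand-ins for the translated table entries (the name `decode_sequence` is illustrative):

```python
# Hypothetical stand-ins for the example table (adopt, suitably, 's, word, ...).
CODING_TABLE = {"的": 1, "词语": 2, "进行": 3, "合适": 4, "描述": 5, "采用": 6}


def decode_sequence(ids, coding_table):
    """Replace each code identifier with its standard word string."""
    # The mapping is one-to-one, so inverting it loses nothing.
    id_to_word = {code: word for word, code in coding_table.items()}
    return "".join(id_to_word[code] for code in ids)


restored = decode_sequence([6, 4, 1, 2], CODING_TABLE)
print(restored)  # 采用合适的词语
```

The same table must be available on the decompressing side, since the compressed data carries only the identifiers.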
The embodiment of the present invention also provides a file compression and transmission method, comprising:
Obtaining all or part of the text in a file to be compressed, to form a text to be encoded;
Performing word segmentation on the text to be encoded according to standard word strings, to decompose the text to be encoded into at least one word string to be encoded;
Replacing the at least one word string to be encoded with the code identifiers of the corresponding standard word strings, according to the correspondence between standard word strings and code identifiers recorded in a coding table stored in advance, to obtain a first coded sequence corresponding to the text to be encoded, wherein each standard word string has a unique code identifier;
Sending the first coded sequence to a network storage server.
When only part of the text in the file to be compressed is obtained, the above steps should of course be repeated until all of the text in the file to be compressed has been processed.
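A minimal sketch of the method above is given below. It assumes forward maximum matching as the word segmentation strategy and a caller-supplied `send` callback standing in for transmission to the network storage server; neither choice is specified in this passage. The sketch also assumes that the parts of the text are split at word-string boundaries and that every part can be fully covered by standard word strings.

```python
def segment(text, coding_table, max_len):
    """Forward maximum matching (an assumed strategy): at each position,
    take the longest standard word string found in the coding table."""
    pieces, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in coding_table:
                pieces.append(candidate)
                i += size
                break
        else:
            # Handling of unmatched text is outside the scope of this sketch.
            raise KeyError(f"no standard word string matches at position {i}")
    return pieces

def compress_and_send(parts, coding_table, send):
    """Encode each part of the text and send it as soon as it is ready."""
    max_len = max(len(word) for word in coding_table)
    for part in parts:
        first_coded_sequence = [coding_table[word]
                                for word in segment(part, coding_table, max_len)]
        send(first_coded_sequence)  # hypothetical transport callback

coding_table = {"data": 0, "file": 1, "text": 2}
sent = []  # stands in for the network storage server
compress_and_send(["datafile", "textdata"], coding_table, sent.append)
```

Each part is encoded and handed to `send` before the next part is read, so only one part needs to be held in memory at a time.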
The corresponding file compression and transmission device comprises:
A first storage module, configured to store a coding table, wherein the coding table records the correspondence between standard word strings and code identifiers, and each standard word string has a unique code identifier;
A first acquisition module, configured to obtain part or all of the text in a file to be compressed, to form a text to be encoded;
A first word segmentation module, configured to perform word segmentation on the text to be encoded according to the standard word strings, to decompose the text to be encoded into at least one word string to be encoded;
A first coding module, configured to replace the at least one word string to be encoded with the code identifiers of the corresponding standard word strings, according to the correspondence between the standard word strings and the code identifiers recorded in the coding table, to obtain a first coded sequence corresponding to the text to be encoded;
A transmission module, configured to send the first coded sequence to a network storage server.
Compared with prior-art schemes in which a file is compressed and then transmitted to a network storage server, the same coding table is stored on the network storage server in advance, so the compressed coded sequence does not need to include the coding table, which reduces the network load. Moreover, because this coding table applies to all compressed texts, storage space is saved when a large number of texts are stored on the network.
At the same time, because a coding table obtained in advance is used, the sending end can divide the text to be compressed into multiple parts, process them separately, and transmit each part as soon as it has been processed, which reduces the demand for temporary storage.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (19)
1. A file compression device, characterized by comprising:
A first storage module, configured to store a coding table, wherein the coding table records the correspondence between standard word strings and code identifiers, and each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in the search field contains the standard word string corresponding to that search field;
A first acquisition module, configured to obtain part or all of the text in a file to be compressed, to form a text to be encoded;
A first word segmentation module, configured to perform word segmentation on the text to be encoded according to the standard word strings, to decompose the text to be encoded into at least one word string to be encoded;
A first coding module, configured to replace the at least one word string to be encoded with the code identifiers of the corresponding standard word strings, according to the correspondence between the standard word strings and the code identifiers recorded in the coding table, to obtain a first coded sequence corresponding to the text to be encoded.
2. The file compression device according to claim 1, characterized in that the code identifiers are represented by numbers, and the more frequently a standard word string occurs in the texts that make up a corpus, the smaller the number that represents the code identifier of that word string.
3. The file compression device according to claim 2, characterized by further comprising:
A statistical module, configured to perform word frequency statistics on the texts that make up the corpus, to obtain the frequency with which each standard word string occurs in the texts.
4. The file compression device according to claim 1, 2 or 3, characterized by further comprising:
A modification module, configured to add the file identifier of the file to be compressed to the search field corresponding to each of the at least one word string to be encoded.
5. The file compression device according to claim 1, 2 or 3, characterized by further comprising:
A second compression module, configured to use a preset numeric coding compression algorithm to compress and encode each of the code identifiers that correspond to the at least one word string to be encoded in the coded sequence obtained by the first coding module, to obtain a second coded sequence corresponding to the text to be encoded.
6. A file compression method, characterized by comprising:
Obtaining part or all of the text in a file to be compressed, to form a text to be encoded;
Performing word segmentation on the text to be encoded according to standard word strings, to decompose the text to be encoded into at least one word string to be encoded;
Replacing the at least one word string to be encoded with the code identifiers of the corresponding standard word strings, according to the correspondence between the standard word strings and code identifiers recorded in a coding table stored in advance, to obtain a first coded sequence corresponding to the text to be encoded, wherein each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in the search field contains the standard word string corresponding to that search field.
7. The method according to claim 6, characterized in that the code identifiers are represented by numbers, and the more frequently a standard word string occurs in the texts that make up a corpus, the smaller the number that represents the code identifier of that word string.
8. The method according to claim 6 or 7, characterized in that, in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in the search field contains the standard word string corresponding to that search field, the method further comprising:
Adding the file identifier of the file to be compressed to the search field corresponding to each of the at least one word string to be encoded.
9. The method according to claim 6 or 7, characterized by further comprising:
Using a preset numeric coding compression algorithm to compress and encode each of the code identifiers that correspond to the at least one word string to be encoded in the first coded sequence, to obtain a second coded sequence corresponding to the text to be encoded.
10. A file decompression device, characterized by comprising:
A third acquisition module, configured to obtain a first sequence to be decoded;
A first decoding module, configured to replace the code identifiers in the first sequence to be decoded with the corresponding standard word strings, according to the correspondence between standard word strings and code identifiers recorded in a coding table stored in advance, to obtain the text corresponding to the first sequence to be decoded, wherein each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in the search field contains the standard word string corresponding to that search field.
11. The file decompression device according to claim 10, characterized in that the code identifiers are represented by numbers, and the more frequently a standard word string occurs in the texts that make up a corpus, the smaller the number that represents the code identifier of that word string.
12. The file decompression device according to claim 11, characterized by further comprising:
A statistical module, configured to perform word frequency statistics on the texts that make up the corpus, to obtain the frequency with which each standard word string occurs in the texts.
13. The file decompression device according to claim 10, 11 or 12, characterized by further comprising:
A second decoding module, configured to decompress a second sequence to be decoded using a preset numeric decompression algorithm, to obtain the first sequence to be decoded.
14. A file decompression method, characterized by comprising:
Obtaining a first sequence to be decoded;
Replacing the code identifiers in the first sequence to be decoded with the corresponding standard word strings, according to the correspondence between standard word strings and code identifiers recorded in a coding table stored in advance, to obtain the text corresponding to the first sequence to be decoded, wherein each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in the search field contains the standard word string corresponding to that search field.
15. The method according to claim 14, characterized in that the code identifiers are represented by numbers, and the more frequently a standard word string occurs in the texts that make up a corpus, the smaller the number that represents the code identifier of that word string.
16. The method according to claim 14 or 15, characterized by further comprising:
Decompressing a second sequence to be decoded using a preset numeric decompression algorithm, to obtain the first sequence to be decoded.
17. A file compression and transmission method, characterized by comprising:
Obtaining part or all of the text in a file to be compressed, to form a text to be encoded;
Performing word segmentation on the text to be encoded according to standard word strings, to decompose the text to be encoded into at least one word string to be encoded;
Replacing the at least one word string to be encoded with the code identifiers of the corresponding standard word strings, according to the correspondence between standard word strings and code identifiers recorded in a coding table stored in advance, to obtain a first coded sequence corresponding to the text to be encoded, wherein each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in the search field contains the standard word string corresponding to that search field;
Sending the first coded sequence to a network storage server.
18. The method according to claim 17, characterized in that, when part of the text in the file to be compressed is obtained, the method further comprises:
Repeating the steps from obtaining the text through sending the coded sequence, until all of the text in the file to be compressed has been compressed and transmitted.
19. A file compression and transmission device, characterized by comprising:
A first storage module, configured to store a coding table, wherein the coding table records the correspondence between standard word strings and code identifiers, and each standard word string has a unique code identifier; in the coding table, a search field is provided for each standard word string, the search field is used to record file identifiers, and each file indicated by a file identifier recorded in the search field contains the standard word string corresponding to that search field;
A first acquisition module, configured to obtain part or all of the text in a file to be compressed, to form a text to be encoded;
A first word segmentation module, configured to perform word segmentation on the text to be encoded according to the standard word strings, to decompose the text to be encoded into at least one word string to be encoded;
A first coding module, configured to replace the at least one word string to be encoded with the code identifiers of the corresponding standard word strings, according to the correspondence between the standard word strings and the code identifiers recorded in the coding table, to obtain a first coded sequence corresponding to the text to be encoded;
A transmission module, configured to send the first coded sequence to a network storage server.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910076795.3A CN101783788B (en) | 2009-01-21 | 2009-01-21 | File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910076795.3A CN101783788B (en) | 2009-01-21 | 2009-01-21 | File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101783788A CN101783788A (en) | 2010-07-21 |
CN101783788B true CN101783788B (en) | 2014-09-03 |
Family
ID=42523607
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200910076795.3A Active CN101783788B (en) | 2009-01-21 | 2009-01-21 | File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101783788B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102567322B (en) * | 2010-12-09 | 2015-04-01 | 北京大学 | Text compression method and text compression device |
CN104021121B (en) * | 2013-02-28 | 2018-01-26 | 北京四维图新科技股份有限公司 | A kind of text data compression method, apparatus and server |
CN104933063B (en) * | 2014-03-19 | 2018-08-24 | 重庆新媒农信科技有限公司 | Data processing method, searching method and device |
SE538512C2 (en) * | 2014-11-26 | 2016-08-30 | Kelicomp Ab | Improved compression and encryption of a file |
CN105447393B (en) * | 2015-11-18 | 2018-06-01 | 国网北京市电力公司 | For the file transmitting method and device of electric system |
CN106202172B (en) * | 2016-06-24 | 2019-07-30 | 中国农业银行股份有限公司 | Text compression methods and device |
CN107818121B (en) * | 2016-09-14 | 2022-05-10 | 阿里巴巴集团控股有限公司 | HTML file compression method and device and electronic equipment |
CN107918654B (en) * | 2017-11-16 | 2020-07-24 | 联想(北京)有限公司 | File decompression method and device and electronic equipment |
CN108021541A (en) * | 2017-12-15 | 2018-05-11 | 安徽长泰信息安全服务有限公司 | A kind of method and its system for reducing text stored memory |
CN108256017B (en) * | 2018-01-08 | 2020-12-15 | 武汉斗鱼网络科技有限公司 | Method and device for data storage and computer equipment |
CN108829872B (en) * | 2018-06-22 | 2021-03-09 | 武汉轻工大学 | Method, device, system and storage medium for rapidly processing lossless compressed file |
CN110032432B (en) * | 2018-12-03 | 2023-09-26 | 创新先进技术有限公司 | Example compression method and device and example decompression method and device |
CN110309376A (en) * | 2019-07-10 | 2019-10-08 | 深圳市友华软件科技有限公司 | The configuration entry management method of embedded platform |
CN111431537A (en) * | 2020-03-06 | 2020-07-17 | 平安科技(深圳)有限公司 | Data compression method and device and computer readable storage medium |
CN112685044A (en) * | 2020-12-20 | 2021-04-20 | 武汉迈威通信股份有限公司 | Webpage compression and storage method and system based on ROM |
CN112684325B (en) * | 2020-12-30 | 2022-03-18 | 杭州加速科技有限公司 | Compression method and device for test vector instruction in ATE (automatic test equipment) |
CN112799672A (en) * | 2020-12-31 | 2021-05-14 | 杭州广立微电子股份有限公司 | Test data processing method based on keywords |
CN113641434A (en) * | 2021-08-12 | 2021-11-12 | 上海酷栈科技有限公司 | Cloud desktop data compression self-adaptive encoding method and system and storage device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1630370A (en) * | 2003-12-15 | 2005-06-22 | 联想(北京)有限公司 | A method of coding and decoding for data compression |
CN101086749A (en) * | 2006-06-08 | 2007-12-12 | 杭州掌幄科技有限公司 | A data compression algorithm of electronic case history |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100448111B1 (en) * | 1998-06-09 | 2004-09-10 | 마츠시타 덴끼 산교 가부시키가이샤 | Image encoder, image decoder, character checker, and data storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN101783788A (en) | 2010-07-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |