CN104467868A - Chinese text compression method - Google Patents
Chinese text compression method
- Publication number
- CN104467868A (application CN201410614796.XA)
- Authority
- CN
- China
- Prior art keywords
- chinese
- phrase
- compression method
- word frequency
- chinese text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention relates to a Chinese text compression method. Exploiting the characteristics of Chinese text, the method combines a dictionary compression algorithm with Huffman coding: the Chinese text is segmented into a plurality of Chinese phrases, the frequency of each phrase is counted, and high-frequency phrases are represented with fewer bits while low-frequency phrases are represented with more bits. The Chinese text is thereby compressed at a high compression ratio while processor capability and memory overhead are both taken into account, which lowers hardware cost.
Description
Technical field
The present invention relates to the field of data storage, and in particular to a Chinese text compression method.
Background art
When processing text, the Chinese text to be stored is often very large, and the external flash memory is too small to hold it; replacing the hardware directly increases cost. To solve this problem without changing the hardware, the Chinese text must be compressed. Conventional large-scale compression algorithms demand considerable processing power and are therefore not suitable in every situation. Chinese text contains many repeated Chinese phrases and hence a large amount of redundancy, yet the existing Huffman compression algorithm achieves only a modest compression ratio, compressing the text by only about one third. A Chinese text compression method is therefore needed that achieves a high compression ratio while taking processor capability and memory overhead into account.
Huffman coding is based on the theory in D. A. Huffman's 1952 paper "A Method for the Construction of Minimum-Redundancy Codes" and is a lossless compression coding built on a probability model.
The Huffman coding algorithm generally proceeds as follows: count the frequency with which each character occurs in the original data; sort all characters in descending order of frequency; build the Huffman tree; store the Huffman tree in the result data; and build the dictionary table for Huffman encoding and decoding from this binary tree, obtaining the code corresponding to each source symbol from the structure of the tree.
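The procedure just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_huffman_codes` is hypothetical, a priority queue stands in for the explicit descending sort, and the tree is kept in memory rather than stored in the result data.

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_codes(data):
    """Return {symbol: bit string} for the symbols in data (illustrative helper)."""
    freq = Counter(data)                       # count how often each symbol occurs
    if not freq:
        return {}
    tie = count()                              # tie-breaker so trees are never compared
    # Heap entries are (frequency, tie, tree); a tree is a symbol or a (left, right) pair.
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                         # a single distinct symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:                       # repeatedly merge the two rarest trees
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(tree, prefix):                    # read each code off the tree structure
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes

codes = build_huffman_codes("aaaabbc")
```

In this small input the most frequent symbol `a` ends up with the shortest code, and no code is a prefix of another, which is what allows unambiguous decoding.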
The dictionary algorithm, in turn, is one of the simplest compression algorithms. It collects the words or word combinations that occur most frequently in a text into a dictionary list and represents each such word or phrase with a special code. For example:
Dictionary list: 00=Chinese, 01=people, 02=China.
Source text: I am a Chinese people, I am from China.
Encoding after compression: I am a 00 01, I am from 02.
The encoded text is significantly shorter. This kind of coding is easiest to apply when compressing documents that contain many proper nouns.
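The dictionary example above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original disclosure: the names `DICTIONARY`, `dict_encode` and `dict_decode` are hypothetical, the second dictionary entry is written in lowercase so that it matches the source text, and decoding assumes the source text itself contains no two-digit numbers.

```python
import re

# Dictionary list from the example: code -> word.
DICTIONARY = {"00": "Chinese", "01": "people", "02": "China"}

def dict_encode(text, dictionary=DICTIONARY):
    """Replace every whole word found in the dictionary with its code."""
    word_to_code = {word: code for code, word in dictionary.items()}
    return re.sub(r"[A-Za-z]+",
                  lambda m: word_to_code.get(m.group(0), m.group(0)), text)

def dict_decode(text, dictionary=DICTIONARY):
    """Replace every two-digit code with its word (assumes no digits in the source)."""
    return re.sub(r"\d{2}",
                  lambda m: dictionary.get(m.group(0), m.group(0)), text)

source = "I am a Chinese people, I am from China."
encoded = dict_encode(source)
decoded = dict_decode(encoded)
```

Matching on whole words is what makes this easy for English; the substitution relies on spaces and punctuation to delimit the words to look up.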
Chinese, however, differs from English in that it has no markers such as spaces to separate words.
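Because there are no spaces, the text has to be segmented before any phrase dictionary can be applied. The source does not say which segmentation technique is used; the sketch below assumes forward maximum matching against a known vocabulary, a common simple approach, and the names `segment`, `vocabulary` and `max_len` are hypothetical.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: at each position take the longest known phrase.

    Characters that start no known phrase are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens

vocab = {"中文", "文本", "压缩", "方法"}
tokens = segment("中文文本压缩方法", vocab)
```

Joining the tokens back together always reproduces the input, so segmentation itself loses no information.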
Summary of the invention
The object of the present invention is therefore to provide a Chinese text compression method that achieves a high compression ratio while placing low demands on processor capability and memory.
The invention provides a Chinese text compression method comprising the following steps:
Step 10: build an encoding dictionary, comprising:
Step 101: segment the Chinese text into words;
Step 102: count word frequencies;
Step 103: uniformly encode the phrases of different word frequencies together with the other characters;
Step 104: export the encoding dictionary;
Step 20: compress the Chinese source file according to the encoding dictionary to obtain a compressed binary file;
Step 30: decompress the compressed binary file according to the encoding dictionary.
In step 101, the Chinese text is divided into a plurality of Chinese phrases, each consisting of multiple Chinese characters.
In step 103, the phrases of different word frequencies and the other characters are encoded with Huffman codes.
In step 103, after the phrases of different word frequencies and the other characters have been encoded with Huffman codes, a Huffman binary tree is obtained, and the encoding dictionary is built from this Huffman binary tree.
In step 103, high-frequency phrases are represented with fewer bits and low-frequency phrases with more bits.
In step 30, a microcontroller (single-chip microcomputer) decompresses the compressed binary file according to the encoding dictionary.
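Decompression only needs the exported dictionary, not the word segmenter or the frequency statistics, which is one plausible reason it is light enough for a microcontroller. The sketch below illustrates the idea in Python; on real hardware it would more likely be written in C. The dictionary contents and the name `decode_bits` are made up for illustration.

```python
def decode_bits(bits, code_to_phrase):
    """Decode a string of '0'/'1' characters using a prefix-free code table."""
    out = []
    current = ""
    for bit in bits:
        current += bit
        phrase = code_to_phrase.get(current)
        if phrase is not None:          # a complete code has been read
            out.append(phrase)
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)

# Hypothetical exported dictionary: the most frequent phrase has the shortest code.
ENCODING_DICTIONARY = {"0": "压缩", "10": "文本", "110": "中文", "111": "。"}
text = decode_bits("110100111", ENCODING_DICTIONARY)
```

Because no code is a prefix of another, the decoder can emit a phrase as soon as the bits read so far match a table entry, without any lookahead.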
In summary, the Chinese text compression method of the present invention builds on a dictionary compression algorithm and combines it with Huffman coding according to the characteristics of Chinese text. The Chinese text is segmented into a plurality of Chinese phrases and the word frequencies are counted; high-frequency phrases are represented with fewer bits and low-frequency phrases with more bits. The method thus achieves a high compression ratio while taking processor capability and memory overhead into account, compressing the Chinese text and thereby reducing hardware cost.
Brief description of the drawings
The technical solution and other beneficial effects of the present invention will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawing.
In the drawings,
Fig. 1 is a flow chart of the Chinese text compression method of the present invention.
Detailed description of the embodiments
To further explain the technical means adopted by the present invention and their effects, a detailed description follows with reference to a preferred embodiment of the invention and the accompanying drawing.
Referring to Fig. 1, the invention provides a Chinese text compression method comprising the following steps:
Step 10: build an encoding dictionary, comprising:
Step 101: segment the Chinese text into words, dividing it into a plurality of Chinese phrases, each consisting of multiple Chinese characters;
Step 102: count word frequencies;
Step 103: uniformly encode the phrases of different word frequencies and the other characters with Huffman codes, obtain a Huffman binary tree, and build the encoding dictionary from this Huffman binary tree;
During encoding, high-frequency phrases are represented with fewer bits and low-frequency phrases with more bits;
Step 104: export the encoding dictionary;
Step 20: compress the Chinese source file according to the encoding dictionary to obtain a compressed binary file;
Step 30: a microcontroller (single-chip microcomputer) decompresses the compressed binary file according to the encoding dictionary.
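The steps above can be sketched end to end as follows, starting from text that has already been segmented (step 101). Everything beyond the step structure is an assumption made for illustration: the function names, the JSON format of the exported dictionary, and the one-byte padding header of the binary file are not specified in the source.

```python
import heapq
import json
from collections import Counter

def build_dictionary(tokens):
    """Steps 102-103: count word frequencies, then assign Huffman codes."""
    freq = Counter(tokens)
    if not freq:
        return {}
    # Heap entries are (frequency, unique index, {token: code so far}).
    heap = [(f, i, {tok: ""}) for i, (tok, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {tok: "0" for tok in heap[0][2]}
    n = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in a.items()}
        merged.update({t: "1" + c for t, c in b.items()})
        heapq.heappush(heap, (f1 + f2, n, merged))
        n += 1
    return heap[0][2]

def compress(tokens, codes):
    """Step 20: pack the codes into bytes; byte 0 stores the padding length."""
    bits = "".join(codes[t] for t in tokens)
    padding = (-len(bits)) % 8
    bits += "0" * padding
    body = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return bytes([padding]) + body

def decompress(blob, codes):
    """Step 30: unpack the bits and walk them through the dictionary."""
    decode = {c: t for t, c in codes.items()}
    bits = "".join(f"{byte:08b}" for byte in blob[1:])
    if blob[0]:
        bits = bits[:-blob[0]]
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in decode:
            out.append(decode[current])
            current = ""
    return out

tokens = ["中文", "文本", "压缩", "，", "中文", "文本", "中文", "。"]
codes = build_dictionary(tokens)
exported = json.dumps(codes, ensure_ascii=False)   # step 104: export the dictionary
blob = compress(tokens, codes)
restored = decompress(blob, json.loads(exported))
```

Note that punctuation is encoded alongside the phrases, corresponding to the "other characters" of step 103, and that the decompressor needs only the exported dictionary and the binary file.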
The Chinese text compression method of the present invention is an application of the dictionary compression algorithm, designed around the characteristics of Chinese text. Chinese differs from English in that it has no markers such as spaces to separate words, so building the encoding dictionary first requires segmenting the Chinese text into a plurality of Chinese phrases. The frequencies of the different phrases are then counted and the phrases are encoded with Huffman codes, with high-frequency phrases represented by fewer bits, which raises the compression ratio for Chinese text. The method also takes processor capability and memory overhead into account, so that Chinese text is compressed at a high compression ratio and hardware cost is reduced.
In light of the above, a person of ordinary skill in the art can make various other corresponding changes and modifications according to the technical solution and technical concept of the present invention, and all such changes and modifications shall fall within the scope of protection of the appended claims.
Claims (6)
1. A Chinese text compression method, characterized in that it comprises the following steps:
Step 10: build an encoding dictionary, comprising:
Step 101: segment the Chinese text into words;
Step 102: count word frequencies;
Step 103: uniformly encode the phrases of different word frequencies together with the other characters;
Step 104: export the encoding dictionary;
Step 20: compress the Chinese source file according to the encoding dictionary to obtain a compressed binary file;
Step 30: decompress the compressed binary file according to the encoding dictionary.
2. The Chinese text compression method of claim 1, characterized in that, in step 101, the Chinese text is divided into a plurality of Chinese phrases, each consisting of multiple Chinese characters.
3. The Chinese text compression method of claim 1, characterized in that, in step 103, the phrases of different word frequencies and the other characters are encoded with Huffman codes.
4. The Chinese text compression method of claim 3, characterized in that, in step 103, after the phrases of different word frequencies and the other characters have been encoded with Huffman codes, a Huffman binary tree is obtained, and the encoding dictionary is built from this Huffman binary tree.
5. The Chinese text compression method of claim 3, characterized in that, in step 103, high-frequency phrases are represented with fewer bits and low-frequency phrases with more bits.
6. The Chinese text compression method of claim 1, characterized in that, in step 30, a microcontroller (single-chip microcomputer) decompresses the compressed binary file according to the encoding dictionary.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410614796.XA CN104467868A (en) | 2014-11-04 | 2014-11-04 | Chinese text compression method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104467868A true CN104467868A (en) | 2015-03-25 |
Family
ID=52913333
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410614796.XA Pending CN104467868A (en) | 2014-11-04 | 2014-11-04 | Chinese text compression method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104467868A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105528420A (en) * | 2015-12-07 | 2016-04-27 | 北京金山安全软件有限公司 | Character encoding and decoding method and device and electronic equipment |
CN106549674A (en) * | 2016-10-28 | 2017-03-29 | 银江股份有限公司 | A kind of data compression and decompressing method towards electronic health record |
CN107786712A (en) * | 2016-08-30 | 2018-03-09 | 北京神州泰岳软件股份有限公司 | A kind of compression and storage method and device of contact person in address list information |
CN108563796A (en) * | 2018-05-04 | 2018-09-21 | 蔷薇信息技术有限公司 | Data compressing method, device and the electronic equipment of block chain |
CN109697277A (en) * | 2017-10-20 | 2019-04-30 | 北京京东尚科信息技术有限公司 | The method and apparatus of Text compression |
CN111030702A (en) * | 2019-12-27 | 2020-04-17 | 哈尔滨理工大学 | Text compression method |
CN112507665A (en) * | 2021-02-01 | 2021-03-16 | 北京江融信科技有限公司 | Chinese data compression and synchronous encryption method and system based on circumference ratio PI |
CN112818057A (en) * | 2021-01-07 | 2021-05-18 | 杭州链城数字科技有限公司 | Data exchange method and device based on block chain |
CN117651076A (en) * | 2023-11-29 | 2024-03-05 | 哈尔滨工程大学 | Adaptive cross-domain multichannel secret source coding compression and decompression method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101350624A (en) * | 2008-09-11 | 2009-01-21 | 中国科学院计算技术研究所 | Method for compressing Chinese text supporting ANSI encode |
CN101751451A (en) * | 2008-12-11 | 2010-06-23 | 高德软件有限公司 | Chinese data compression method and Chinese data decompression method and related devices |
CN102263560A (en) * | 2010-05-28 | 2011-11-30 | 富士通株式会社 | Differential encoding method and system |
CN102567322A (en) * | 2010-12-09 | 2012-07-11 | 北京大学 | Text compression method and text compression device |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105528420A (en) * | 2015-12-07 | 2016-04-27 | 北京金山安全软件有限公司 | Character encoding and decoding method and device and electronic equipment |
CN107786712A (en) * | 2016-08-30 | 2018-03-09 | 北京神州泰岳软件股份有限公司 | A kind of compression and storage method and device of contact person in address list information |
CN106549674A (en) * | 2016-10-28 | 2017-03-29 | 银江股份有限公司 | A kind of data compression and decompressing method towards electronic health record |
CN106549674B (en) * | 2016-10-28 | 2019-07-23 | 银江股份有限公司 | A kind of data compression and decompressing method towards electronic health record |
CN109697277B (en) * | 2017-10-20 | 2024-02-13 | 北京京东尚科信息技术有限公司 | Text compression method and device |
CN109697277A (en) * | 2017-10-20 | 2019-04-30 | 北京京东尚科信息技术有限公司 | The method and apparatus of Text compression |
CN108563796A (en) * | 2018-05-04 | 2018-09-21 | 蔷薇信息技术有限公司 | Data compressing method, device and the electronic equipment of block chain |
CN111030702A (en) * | 2019-12-27 | 2020-04-17 | 哈尔滨理工大学 | Text compression method |
CN112818057A (en) * | 2021-01-07 | 2021-05-18 | 杭州链城数字科技有限公司 | Data exchange method and device based on block chain |
CN112818057B (en) * | 2021-01-07 | 2022-08-19 | 杭州链城数字科技有限公司 | Data exchange method and device based on block chain |
CN112507665B (en) * | 2021-02-01 | 2021-06-01 | 北京江融信科技有限公司 | Chinese data compression and synchronous encryption method and system based on circumference ratio PI |
CN112507665A (en) * | 2021-02-01 | 2021-03-16 | 北京江融信科技有限公司 | Chinese data compression and synchronous encryption method and system based on circumference ratio PI |
CN117651076A (en) * | 2023-11-29 | 2024-03-05 | 哈尔滨工程大学 | Adaptive cross-domain multichannel secret source coding compression and decompression method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20150325 |