CN104467868A - Chinese text compression method - Google Patents


Info

Publication number
CN104467868A
Authority
CN
China
Prior art keywords
chinese
phrase
compression method
word frequency
chinese text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410614796.XA
Other languages
Chinese (zh)
Inventor
刘均
杨向辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Launch Technology Co Ltd
Original Assignee
Shenzhen Launch Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Launch Technology Co Ltd
Priority to CN201410614796.XA
Publication of CN104467868A
Legal status: Pending


Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a Chinese text compression method. Based on the characteristics of Chinese text, the method combines a dictionary compression algorithm with Huffman coding: the Chinese text is segmented into a plurality of Chinese phrases, word frequencies are counted, and phrases with high word frequencies are represented with fewer bits while phrases with low word frequencies are represented with more bits. The Chinese text is thereby compressed at a high compression ratio while both processor capability and memory overhead are taken into account, which lowers hardware cost.

Description

Chinese text compression method
Technical field
The present invention relates to the field of data storage, and in particular to a Chinese text compression method.
Background art
When processing text, the Chinese text is often so large that the external flash memory can no longer hold it, and simply replacing the hardware increases cost. Solving this problem without changing the hardware requires compressing the Chinese text. Conventional large-scale compression algorithms demand considerable processing power and are therefore not suitable in all situations. Chinese text contains a large number of repeated Chinese phrases and hence a great deal of redundancy, yet existing Huffman compression achieves only a low compression ratio, compressing the text by only about one third. A Chinese text compression method is therefore needed that achieves a high compression ratio while taking both processor capability and memory overhead into account.
Huffman coding is based mainly on the paper "A Method for the Construction of Minimum-Redundancy Codes", published by D. A. Huffman in 1952, and is a lossless compression coding built on a probability model.
The Huffman coding algorithm generally proceeds as follows: count how often each character occurs in the original data; sort all characters by frequency in descending order; build the Huffman tree; store the Huffman tree in the result data; build the dictionary table for Huffman encoding and decoding from this binary tree; and obtain the code for each source symbol from the structure of the tree.
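The procedure above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `huffman_codes` is hypothetical, and a priority queue stands in for the explicit descending sort, which is a common way to pick the two least frequent nodes at each step.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(frequencies):
    """Return {symbol: bit string} for a {symbol: count} mapping."""
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        # A lone symbol still needs one bit so that it can be written out.
        return {symbol: "0" for symbol in frequencies}

    tie_breaker = count()  # unique index, so tuples never compare past it
    # Node layout: (frequency, index, symbol, left child, right child).
    heap = [(freq, next(tie_breaker), symbol, None, None)
            for symbol, freq in frequencies.items()]
    heapq.heapify(heap)

    # Repeatedly merge the two least frequent nodes into one parent node.
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        parent = (left[0] + right[0], next(tie_breaker), None, left, right)
        heapq.heappush(heap, parent)

    codes = {}

    def walk(node, prefix):
        _, _, symbol, left, right = node
        if left is None:  # leaf node: record the path taken from the root
            codes[symbol] = prefix
            return
        walk(left, prefix + "0")
        walk(right, prefix + "1")

    walk(heap[0], "")
    return codes


# Usage: count character frequencies, then derive the code table.
table = huffman_codes(Counter("abracadabra"))
```

In the resulting table the most frequent symbol receives the shortest code, and no code is a prefix of another, which is what allows the bit stream to be decoded without separators.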
The dictionary algorithm is one of the simplest compression algorithms. It collects the words or word combinations that occur most frequently in a text into a dictionary list and represents each such word or phrase with a special code. For example:
Dictionary list: 00=Chinese, 01=People, 02=China.
Source text: I am a Chinese people, I am from China.
Encoded text after compression: I am a 00 01, I am from 02.
The encoded text is significantly shorter. This kind of coding lends itself well to compressing documents that contain many proper nouns.
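The substitution in this example can be reproduced with a short Python sketch. The names `DICTIONARY`, `compress` and `decompress` are illustrative, and matching words case-insensitively is an assumption made so that "people" in the source text maps to the dictionary entry "People".

```python
import re

# Code table from the example: code -> word.
DICTIONARY = {"00": "Chinese", "01": "People", "02": "China"}


def compress(text, dictionary):
    """Replace every dictionary word in the text with its code."""
    word_to_code = {word.lower(): code for code, word in dictionary.items()}
    return re.sub(
        r"[A-Za-z]+",
        lambda m: word_to_code.get(m.group().lower(), m.group()),
        text,
    )


def decompress(text, dictionary):
    """Replace every two-digit code with its dictionary word."""
    return re.sub(
        r"\b\d{2}\b",
        lambda m: dictionary.get(m.group(), m.group()),
        text,
    )


packed = compress("I am a Chinese people, I am from China.", DICTIONARY)
restored = decompress(packed, DICTIONARY)
```

A plain-text code such as `00` is ambiguous if the source itself contains two-digit numbers, which is one reason a practical scheme writes the codes as bits in a binary file instead.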
Chinese, however, differs from English in that it has no marks such as spaces to separate words.
Summary of the invention
The object of the present invention is therefore to provide a Chinese text compression method that achieves a high compression ratio while placing low demands on processor capability and memory.
The invention provides a Chinese text compression method comprising the following steps:
Step 10: build an encoding dictionary, comprising:
Step 101: segment the Chinese text into words;
Step 102: count word frequencies;
Step 103: uniformly encode the phrases of different word frequencies together with the other characters;
Step 104: export the encoding dictionary;
Step 20: compress the Chinese source file according to the encoding dictionary to obtain a compressed binary file;
Step 30: decompress the compressed binary file according to the encoding dictionary.
In step 101, the Chinese text is divided into multiple Chinese phrases, each consisting of multiple Chinese characters.
In step 103, the phrases of different word frequencies and the other characters are encoded with Huffman codes.
In step 103, encoding the phrases of different word frequencies and the other characters with Huffman codes yields a Huffman binary tree, from which the encoding dictionary is built.
In step 103, phrases with a high word frequency are represented with fewer bits, and phrases with a low word frequency are represented with more bits.
In step 30, a microcontroller decompresses the compressed binary file according to the encoding dictionary.
In summary, the Chinese text compression method of the present invention combines Huffman coding with a dictionary compression algorithm according to the characteristics of Chinese text. The Chinese text is segmented into multiple Chinese phrases and word frequencies are counted; phrases with a high word frequency are represented with fewer bits, and phrases with a low word frequency with more bits. The method thereby achieves a high compression ratio while taking processor capability and memory overhead into account, which makes it possible to compress Chinese text and so reduce hardware cost.
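Steps 101 and 102 can be illustrated with the sketch below. The text does not name a segmentation algorithm, so forward maximum matching against a word list is used here purely as an assumption; `LEXICON`, `segment` and `word_frequencies` are hypothetical names, and the tiny lexicon stands in for a real word list.

```python
from collections import Counter

# Hypothetical lexicon; a real system would load a much larger word list.
LEXICON = {"中文", "文本", "压缩", "方法", "中文文本"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Step 101: split text into phrases by forward maximum matching."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in lexicon:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # No multi-character phrase matched: keep the single character,
            # which then counts among the "other characters".
            tokens.append(text[i])
            i += 1
    return tokens


def word_frequencies(text):
    """Step 102: count how often each phrase or character occurs."""
    return Counter(segment(text))


frequencies = word_frequencies("压缩中文，压缩文本。")
```

The resulting counts are the input to the Huffman coding of step 103: every distinct phrase and every leftover character becomes one symbol of the code.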
Brief description of the drawings
The technical solution and other beneficial effects of the present invention will become apparent from the following detailed description of specific embodiments, given with reference to the accompanying drawing.
In the drawing,
Fig. 1 is a flow chart of the Chinese text compression method of the present invention.
Detailed description
To further explain the technical means adopted by the present invention and their effects, the invention is described in detail below with reference to its preferred embodiment and the accompanying drawing.
Referring to Fig. 1, the invention provides a Chinese text compression method comprising the following steps:
Step 10: build an encoding dictionary, comprising:
Step 101: segment the Chinese text into words, dividing it into multiple Chinese phrases, each consisting of multiple Chinese characters;
Step 102: count word frequencies;
Step 103: uniformly encode the phrases of different word frequencies and the other characters with Huffman codes to obtain a Huffman binary tree, and build the encoding dictionary from this Huffman binary tree;
during encoding, phrases with a high word frequency are represented with fewer bits, and phrases with a low word frequency are represented with more bits;
Step 104: export the encoding dictionary;
Step 20: compress the Chinese source file according to the encoding dictionary to obtain a compressed binary file;
Step 30: a microcontroller decompresses the compressed binary file according to the encoding dictionary.
The Chinese text compression method of the present invention is an application of the dictionary compression algorithm, shaped by the characteristics of Chinese text. Because Chinese, unlike English, has no marks such as spaces to separate words, the Chinese text must first be segmented into multiple Chinese phrases when the encoding dictionary is built. The frequencies of the phrases are then counted, and the phrases are encoded with Huffman codes so that high-frequency phrases are represented with fewer bits, which raises the compression ratio of the Chinese text. The method also takes processor capability and memory overhead into account, so Chinese text is compressed at a high compression ratio and hardware cost is reduced.
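The remaining steps (103, 20 and 30) can be sketched end to end as below, assuming the text has already been segmented into a token list. Everything here is illustrative: the function names are hypothetical, and the one-byte padding header is an assumed file layout, since the description does not specify the format of the compressed binary file or of the exported dictionary.

```python
import heapq
from collections import Counter
from itertools import count


def build_codes(tokens):
    """Steps 102-103: count token frequencies and assign Huffman codes."""
    freq = Counter(tokens)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    order = count()  # unique index keeps heap entries comparable
    heap = [(n, next(order), {tok: ""}) for tok, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, low = heapq.heappop(heap)
        n2, _, high = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {tok: "0" + code for tok, code in low.items()}
        merged.update({tok: "1" + code for tok, code in high.items()})
        heapq.heappush(heap, (n1 + n2, next(order), merged))
    return heap[0][2]


def compress(tokens, codes):
    """Step 20: concatenate the codes and pack them into bytes."""
    bits = "".join(codes[tok] for tok in tokens)
    padding = (8 - len(bits) % 8) % 8
    bits += "0" * padding
    body = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return bytes([padding]) + body  # assumed layout: padding length first


def decompress(blob, codes):
    """Step 30: scan the bit stream, emitting a token at each code match."""
    decode = {code: tok for tok, code in codes.items()}
    padding, body = blob[0], blob[1:]
    bits = "".join(f"{byte:08b}" for byte in body)
    if padding:
        bits = bits[:-padding]
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in decode:  # prefix-free codes make this unambiguous
            out.append(decode[current])
            current = ""
    return "".join(out)


tokens = ["压缩", "中文", "，", "压缩", "文本", "。", "压缩"]
codes = build_codes(tokens)      # plays the role of the encoding dictionary
blob = compress(tokens, codes)
text = decompress(blob, codes)
```

Decoding needs only a table lookup per bit and no knowledge of the segmenter, which fits the idea of building the dictionary on a more capable machine and leaving only decompression to the microcontroller.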
Those of ordinary skill in the art may make various other corresponding changes and modifications based on the technical solution and technical concept of the present invention, and all such changes and modifications shall fall within the scope of protection of the claims appended to the present invention.

Claims (6)

1. A Chinese text compression method, characterized in that it comprises the following steps:
Step 10: build an encoding dictionary, comprising:
Step 101: segment the Chinese text into words;
Step 102: count word frequencies;
Step 103: uniformly encode the phrases of different word frequencies together with the other characters;
Step 104: export the encoding dictionary;
Step 20: compress the Chinese source file according to the encoding dictionary to obtain a compressed binary file;
Step 30: decompress the compressed binary file according to the encoding dictionary.
2. The Chinese text compression method of claim 1, characterized in that, in step 101, the Chinese text is divided into multiple Chinese phrases, each consisting of multiple Chinese characters.
3. The Chinese text compression method of claim 1, characterized in that, in step 103, the phrases of different word frequencies and the other characters are encoded with Huffman codes.
4. The Chinese text compression method of claim 3, characterized in that, in step 103, after the phrases of different word frequencies and the other characters have been encoded with Huffman codes, a Huffman binary tree is obtained, and the encoding dictionary is built from this Huffman binary tree.
5. The Chinese text compression method of claim 3, characterized in that, in step 103, phrases with a high word frequency are represented with fewer bits, and phrases with a low word frequency are represented with more bits.
6. The Chinese text compression method of claim 1, characterized in that, in step 30, a microcontroller decompresses the compressed binary file according to the encoding dictionary.
CN201410614796.XA 2014-11-04 2014-11-04 Chinese text compression method Pending CN104467868A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410614796.XA CN104467868A (en) 2014-11-04 2014-11-04 Chinese text compression method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410614796.XA CN104467868A (en) 2014-11-04 2014-11-04 Chinese text compression method

Publications (1)

Publication Number Publication Date
CN104467868A true CN104467868A (en) 2015-03-25

Family

ID=52913333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410614796.XA Pending CN104467868A (en) 2014-11-04 2014-11-04 Chinese text compression method

Country Status (1)

Country Link
CN (1) CN104467868A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350624A (en) * 2008-09-11 2009-01-21 中国科学院计算技术研究所 Method for compressing Chinese text supporting ANSI encode
CN101751451A (en) * 2008-12-11 2010-06-23 高德软件有限公司 Chinese data compression method and Chinese data decompression method and related devices
CN102263560A (en) * 2010-05-28 2011-11-30 富士通株式会社 Differential encoding method and system
CN102567322A (en) * 2010-12-09 2012-07-11 北京大学 Text compression method and text compression device

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105528420A (en) * 2015-12-07 2016-04-27 北京金山安全软件有限公司 Character encoding and decoding method and device and electronic equipment
CN107786712A (en) * 2016-08-30 2018-03-09 北京神州泰岳软件股份有限公司 A kind of compression and storage method and device of contact person in address list information
CN106549674A (en) * 2016-10-28 2017-03-29 银江股份有限公司 A kind of data compression and decompressing method towards electronic health record
CN106549674B (en) * 2016-10-28 2019-07-23 银江股份有限公司 A kind of data compression and decompressing method towards electronic health record
CN109697277B (en) * 2017-10-20 2024-02-13 北京京东尚科信息技术有限公司 Text compression method and device
CN109697277A (en) * 2017-10-20 2019-04-30 北京京东尚科信息技术有限公司 The method and apparatus of Text compression
CN108563796A (en) * 2018-05-04 2018-09-21 蔷薇信息技术有限公司 Data compressing method, device and the electronic equipment of block chain
CN111030702A (en) * 2019-12-27 2020-04-17 哈尔滨理工大学 Text compression method
CN112818057A (en) * 2021-01-07 2021-05-18 杭州链城数字科技有限公司 Data exchange method and device based on block chain
CN112818057B (en) * 2021-01-07 2022-08-19 杭州链城数字科技有限公司 Data exchange method and device based on block chain
CN112507665B (en) * 2021-02-01 2021-06-01 北京江融信科技有限公司 Chinese data compression and synchronous encryption method and system based on circumference ratio PI
CN112507665A (en) * 2021-02-01 2021-03-16 北京江融信科技有限公司 Chinese data compression and synchronous encryption method and system based on circumference ratio PI
CN117651076A (en) * 2023-11-29 2024-03-05 哈尔滨工程大学 Adaptive cross-domain multichannel secret source coding compression and decompression method

Similar Documents

Publication Publication Date Title
CN104467868A (en) Chinese text compression method
CN104283568B (en) Data compressed encoding method based on part Hoffman tree
CN110518917B (en) LZW data compression method and system based on Huffman coding
CN103236847A (en) Multilayer Hash structure and run coding-based lossless compression method for data
WO2010044100A1 (en) Lossless compression
EP2455853A2 (en) Data compression method
US8933826B2 (en) Encoder apparatus, decoder apparatus and method
CN101783788A (en) File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device
CN103258030A (en) Mobile device memory compression method based on dictionary encoding and run-length encoding
CN101562455A (en) Context-based adaptive binary arithmetic coding (cabac) decoding apparatus and decoding method thereof
CN102880703B (en) Chinese web page data encoding, coding/decoding method and system
CN101534124B (en) Compression algorithm for short natural language
CN103731154B (en) Data compression algorithm based on semantic analysis
CN104125475B (en) Multi-dimensional quantum data compressing and uncompressing method and apparatus
CN104156990B (en) A kind of lossless compression-encoding method and system for supporting super-huge data window
CN103347047A (en) Lossless data compression method based on online dictionaries
KR20100009032A (en) Lossless data compression method
CN104682966B (en) The lossless compression method of table data
Mahmood et al. A feasible 6 bit text database compression scheme with character encoding (6BC)
CN109587499A (en) A kind of method of ultrahigh resolution computer desktop compressed encoding
CN106559085A (en) A kind of normal form Hafman decoding method and its device
CN101741392A (en) Huffman decoding method for fast resolving code length
CN102664634B (en) A kind of for data compression method during dipper system transmitting-receiving Chinese-character text message
KR20160106229A (en) IMPROVED HUFFMAN CODING METHOD AND APPARATUS THEREOF BY CREATING CONTEXT-BASED INNER-BLOCK AND GROUP BASED ON VARIANCE IN GROUP's SYMBOL FREQUENCY DATA
CN110739974B (en) Data compression method and device and computer readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150325