TWI287362B - Compressing method for statistical data characteristics by finite exhaustive optimization - Google Patents

Compressing method for statistical data characteristics by finite exhaustive optimization

Info

Publication number
TWI287362B
Authority
TW
Taiwan
Prior art keywords
data
information
code
length
block
Prior art date
Application number
TW93141518A
Other languages
Chinese (zh)
Other versions
TW200623657A (en)
Inventor
Fred Chen
White Zhang
Harley Yan
Original Assignee
Inventec Besta Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2004-12-30
Filing date
2004-12-30
Publication date
2007-09-21
Application filed by Inventec Besta Co Ltd
Priority to TW93141518A
Publication of TW200623657A
Application granted
Publication of TWI287362B

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A compression method that applies finite exhaustive optimization to the statistical characteristics of data is disclosed. First, statistics are gathered on how often identical character strings of a specific length repeat in the data. Next, a finite exhaustive search finds a substitution length range, and the repeated language units within that range are replaced by serial numbers. The non-repeated units are encoded with the Huffman compression algorithm. When the data are dictionary data, the block structure inherent in the dictionary database is used to divide the large data set into small blocks, which are then compressed separately. This speeds up data lookup and improves compression efficiency without increasing the time complexity of decompression.

Description

IX. Description of the Invention

[Technical Field of the Invention]

The present invention relates to a data compression method, and more particularly to a compression method that performs finite exhaustive optimization on the statistical characteristics of data.

[Prior Art]

The electronics industry is developing rapidly, and high-tech products such as computers, mobile phones, and personal digital assistants (PDAs) change with each passing day. As handheld consumer electronics come into wide use, users demand more and more of them: future products must not only hold a very large body of knowledge but also offer diversified service functions.

However, current handheld consumer electronics, and embedded devices in particular, are constrained by their physical size and therefore have limited resources, namely random-access memory (RAM) and central processing unit (CPU) capacity. This makes storing and quickly reading large volumes of data a problem.

This is especially true in data processing such as data compression. When the Huffman compression algorithm is applied, that is, uniform encoding combined with substitution of fixed-length language units, storage space is wasted. For data with a high repetition frequency, such as dictionary data, a compression scheme that is not designed around the data's own characteristics will not be the best one.

[Summary of the Invention]

In view of the above problems, the main object of the present invention is to provide a compression method that performs finite exhaustive optimization on the statistical characteristics of data, so as to improve the efficiency of compression and decompression.

The disclosed compression method uses a finite exhaustive search to compress the repeated language units of the data thoroughly, which raises the compression ratio and ensures the adaptability of the compression algorithm.
The disclosed compression method processes the data in blocks so as to eliminate the correlation between block units, which allows individual blocks to be decompressed at random.

The disclosed compression method correctly compresses and decompresses large volumes of data while needing little time and space, which makes it suitable for embedded systems with limited resources.

To achieve the above object, the disclosed compression method comprises the following steps. First, the data to be compressed is preprocessed. Next, character statistics are gathered on the preprocessed data to obtain a table of language units and their frequencies. A finite exhaustive search is then run on that table to obtain a substitution length range. Repeated language units in the preprocessed data are then substituted according to the substitution length range, which yields the substitution information data. The frequencies of the non-substituted characters in the preprocessed data are then counted according to the substitution information data to obtain a Huffman tree (step 150). Finally, the preprocessed data is encoded according to the substitution information data and the Huffman tree, using the Lempel-Ziv-Storer-Szymanski (LZSS) compression algorithm and the Huffman compression algorithm, to obtain the compressed data.

When the data is a Unicode file, the preprocessing step performs long byte-code substitution: codes whose value is less than 0x80 first have their high byte removed; the remaining codes are then sorted by frequency of use; the predetermined number of most frequently used codes are replaced with the code values 0x80 to 0xFE; and every code left over is replaced by the marker 0xFF followed by its two-byte code. The predetermined number may be 127.
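The Huffman-tree step named in the summary (step 150) counts how often each non-substituted character occurs and derives a prefix code from those counts. The following is a minimal Python sketch of that idea, not the patented implementation. The function name `build_huffman_codes` and the sample input are hypothetical, and the source does not say how ties between equal frequencies are broken.

```python
import heapq
from collections import Counter


def build_huffman_codes(symbols):
    """Return a {symbol: bitstring} table built from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: partial_code}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical input: the bytes left over after repeated units were replaced.
literals = b"abracadabra"
codes = build_huffman_codes(literals)
encoded_bits = "".join(codes[b] for b in literals)
```

The most frequent symbol receives the shortest code, which is why the method counts only the characters that survive substitution before building the tree.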
When the data is a large-capacity database (for example, dictionary data), the preprocessing step ends by recording block markers for the data to obtain block information. In the final encoding and substitution, the data is first divided into blocks according to the block information, and the resulting small blocks are then compressed.

The features and implementation of the present invention are described in detail below through a preferred embodiment, with reference to the drawings.

[Embodiments]

Specific embodiments are given below to describe the invention in detail, with the drawings as an aid. Reference numerals mentioned in the description refer to those in the drawings.

The invention can be applied to embedded devices, that is, portable electronic devices such as e-books, portable global positioning devices, internet-enabled mobile phones, personal digital assistants (PDAs) with wireless transmission functions, and wearable computers. Its optimized compression algorithm provides high compression efficiency at very low space complexity (memory usage is low and can be as little as one part in tens of thousands, or even hundreds of millions, of the database size). It also has lower time complexity than algorithms of the same type (it is fast and can decode any data to be browsed in near real time on a low-speed processor).
Referring to Figure 1, the data to be compressed is first preprocessed (step 110). Character statistics are then gathered on the preprocessed data to obtain the table of language units and their frequencies (step 120). A finite exhaustive search is run on that table to obtain the substitution length range (step 130). Repeated language units in the preprocessed data are substituted according to the substitution length range to obtain the substitution information data (step 140). The frequencies of the non-substituted characters in the preprocessed data are counted according to the substitution information data to obtain a Huffman tree (step 150). Finally, the preprocessed data is encoded according to the substitution information data and the Huffman tree, using the LZSS compression algorithm and the Huffman compression algorithm, to obtain the compressed data (step 160).

Referring to Figure 2, the data preprocessing step (step 110) comprises the following steps. First, it is determined whether the data to be compressed is a file in the two-byte (16-bit) international standard encoding, that is, Unicode, or a file in some other encoding, such as a local code page (ANSI) file (step 111). In other words, the encoding type of the data is checked to see whether it is Unicode.

When the data to be compressed is a Unicode file, long byte-code substitution is performed as follows. Codes in the Unicode file whose value is less than 0x80 first have their high byte removed (step 112). The remaining codes are sorted by frequency of use (step 113). The predetermined number of most frequently used codes are replaced with the code values 0x80 to 0xFE (step 114). Every remaining code is replaced by the marker 0xFF followed by its two-byte code (step 115). After long byte-code substitution of the Unicode file is complete, it is determined whether the data to be compressed is dictionary data, that is, the data type is checked (step 116).
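The long byte-code substitution of steps 112 to 115 can be sketched in Python as follows. This is a rough illustration under stated assumptions rather than the patent's implementation: the function names are hypothetical, the two-byte code after the 0xFF marker is written high byte first, the frequency table is simply returned next to the output, and the decoder is added only to show that the mapping is reversible. The source specifies none of these details. The sketch handles 16-bit code units only.

```python
from collections import Counter

TOP_N = 127  # number of single-byte slots available in 0x80..0xFE


def long_bytecode_encode(text):
    """Encode 16-bit code units as 1-byte or 3-byte sequences."""
    units = [ord(ch) for ch in text]
    if any(u > 0xFFFF for u in units):
        raise ValueError("this sketch handles 16-bit code units only")
    # Step 113: rank the codes that are not below 0x80 by frequency of use.
    freq = Counter(u for u in units if u >= 0x80)
    table = [u for u, _ in freq.most_common(TOP_N)]
    slot = {u: 0x80 + i for i, u in enumerate(table)}
    out = bytearray()
    for u in units:
        if u < 0x80:
            out.append(u)                # step 112: drop the zero high byte
        elif u in slot:
            out.append(slot[u])          # step 114: frequent code, one byte
        else:
            out += bytes([0xFF, u >> 8, u & 0xFF])  # step 115: marker + code
    return bytes(out), table


def long_bytecode_decode(data, table):
    """Invert long_bytecode_encode (assumed decoder, not from the source)."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            chars.append(chr(b))
            i += 1
        elif b < 0xFF:
            chars.append(chr(table[b - 0x80]))
            i += 1
        else:
            chars.append(chr((data[i + 1] << 8) | data[i + 2]))
            i += 3
    return "".join(chars)


sample = "字典 dictionary 字典資料 data"
encoded, table = long_bytecode_encode(sample)
```

Because 0x80 to 0xFE offers exactly 127 values, the 127 most frequent codes shrink from two bytes to one, while rare codes grow from two bytes to three.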
The predetermined number may be 127.

When the data to be compressed is not a Unicode file, it is likewise determined whether the data is dictionary data, that is, the data type is checked (step 116).

When the data to be compressed is dictionary data, block markers are recorded for the data to obtain the block information (step 117), which completes the preprocessing step.

When the data to be compressed is not dictionary data, no further processing is performed, and the preprocessing step is complete.

Referring to Figure 3, step 120 first records the positions of all identical characters in the preprocessed data (step 122), then sorts the occurrences of each identical character by the characters that follow them, which yields the table of language units and their frequencies (step 124).

Referring to Figure 4, the finite exhaustive search (step 130) proceeds as follows. First, all repeated language units at or above a base length range are found in the table of language units and their frequencies and recorded by length range, giving a plurality of specific length ranges; the span from the base length range to the largest specific length range is the optimization range (step 132). For each specific length range in the optimization range, the language units within it are substituted into the sentences one after another, from longest to shortest, and the length saved is recorded (step 134). The largest saving is then found among the recorded savings (step 136). The specific length range that produced the largest saving is taken as the substitution length range (step 138). The optimization range may run from the base value to the length of the longest repeated language unit.

Step 140 generates the substitution information file for the repeated language units according to the substitution length range. That is, the repeated language units within the substitution length range are substituted and encoded in order from longest to shortest, the encoding results are written to the substitution information data, and the repeated language units are recorded.
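The search in steps 132 to 138 can be illustrated with a small Python sketch. It is a simplified reading, not the patented procedure, and several details are assumptions that the source does not state: each candidate range runs from a lower bound `lo` up to a fixed `max_len`, a serial-number reference costs `REF_COST` bytes, one copy of each substituted unit is kept in a dictionary, and a placeholder character marks substituted text. All names are hypothetical.

```python
from collections import Counter

REF_COST = 2   # assumed bytes per serial-number reference (not from the source)
MARK = "\x00"  # placeholder for substituted text; assumed absent from the data


def repeated_units(text, length):
    """Return the substrings of the given length that occur more than once."""
    counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
    return [u for u, c in counts.items() if c > 1 and MARK not in u]


def saving_for_range(text, lo, hi):
    """Length saved by substituting repeated units of length hi down to lo."""
    saved = 0
    for length in range(hi, lo - 1, -1):             # longest units first
        for unit in repeated_units(text, length):
            n = text.count(unit)                     # non-overlapping occurrences
            gain = n * (length - REF_COST) - length  # minus one dictionary copy
            if n > 1 and gain > 0:
                text = text.replace(unit, MARK)      # mark the text as consumed
                saved += gain
    return saved


def find_substitution_range(text, base_len, max_len):
    """Try every candidate range [lo, max_len] and keep the best one."""
    savings = {lo: saving_for_range(text, lo, max_len)
               for lo in range(base_len, max_len + 1)}
    best_lo = max(savings, key=savings.get)
    return (best_lo, max_len), savings[best_lo]


sample = "compress the dictionary; compress the data; " * 5
best_range, saved = find_substitution_range(sample, 3, 12)
```

The search is exhaustive but finite because only a bounded set of length ranges is tried, and each candidate is scored by the same long-to-short substitution pass that the encoder would later perform.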
The final data compression step (step 160) first applies mixed LZSS and Huffman encoding substitution to the preprocessed data according to the substitution information data and the Huffman tree. After encoding, several items of information are stored, which yields the compressed data. When the data is a large-capacity database, such as dictionary data, the data is divided into blocks according to the block information during encoding substitution to obtain block data, and compression is then performed on the small blocks. The stored information includes the repeated language units, the block information, and the Huffman tree.

A concrete example is described in detail below.

Consider a set of English-English, Chinese, Japanese, and Korean dictionary data (大英英中日韓辭典) whose original length is 45,776,158 bytes. The data is first preprocessed, after which its length is [illegible]. The repeated language units within the specific length range (that is, 0x7f), together with their frequencies, are then counted and stored in a [illegible] file. The finite exhaustive search then produces a table relating each specific length to the length saved, as shown in Figure 5. From that table, the largest saving is 16,287,[illegible] bytes, and the corresponding specific length range is obtained (its bounds are garbled in the source as "3535"). The repeated language units within this substitution length range are numbered from longest to shortest, giving a total repeated-language-unit length of 491,862 bytes; that is, the substitution information data occupies 491,862 bytes. In addition, to cope with a large-capacity dictionary database, the data is divided into blocks and an address index is built at the head of each block; the address index is stored in an .ldx file, whose length is [illegible]. Once this work is done, the data is compressed, giving a result of 12,115,479 bytes. When the same dictionary data is compressed with a conventional compression algorithm, the resulting length is [illegible].
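Compressing each block independently and keeping an address index is what makes random access possible: a single entry can be decoded without touching the rest of the file. The Python sketch below shows the idea only. It uses the standard `zlib` module as a stand-in codec, which is not the LZSS and Huffman hybrid of the patent, and a plain in-memory list of offsets in place of the index file, whose format the source does not describe. The function names and sample entries are hypothetical.

```python
import zlib


def compress_blocks(entries):
    """Compress each block on its own; return the blob and its offset index."""
    blob = bytearray()
    offsets = []
    for entry in entries:
        offsets.append(len(blob))      # where this block starts in the blob
        blob += zlib.compress(entry)   # stand-in codec, not the patent's
    offsets.append(len(blob))          # sentinel: end of the last block
    return bytes(blob), offsets


def read_block(blob, offsets, k):
    """Decompress block k alone, using the index to locate it."""
    return zlib.decompress(blob[offsets[k]:offsets[k + 1]])


entries = [
    b"apple: a round fruit",
    b"banana: a long yellow fruit",
    b"cherry: a small red fruit",
]
blob, offsets = compress_blocks(entries)
second = read_block(blob, offsets, 1)
```

Because no block refers to data in another block, lookup time depends on the size of one block rather than on the size of the whole dictionary.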
Figure 6 shows the comparison table of compression results for this dictionary.

The data is divided into 24,862 blocks and the compression ratio is 26.47%, whereas a conventional compression method gives a compression ratio of [illegible].5%. Compression with the present invention therefore improves the compression ratio markedly.

For the Oxford dictionary data installed in a handheld electronic product, the comparison table of compression results is shown in Figure 7. The original data length is 22,580,376 bytes and the compressed length is 4,505,792 bytes; compression with a conventional method gives 5,[illegible],223 bytes. Here the data is divided into 146,292 blocks and the compression ratio is 19.95%, against 22.54% for the conventional method. Applying the present invention to data compression therefore gives the better result.

As the above examples show, an embodiment of the invention is carried out as follows: the repetition frequency of identical character strings of a specific length in the data is first counted; a finite exhaustive search then finds a substitution length range; repeated language units within the substitution length range are replaced by serial numbers; and non-repeated language units are encoded with the Huffman compression algorithm.
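The compression ratios quoted for the two dictionaries can be checked from the legible byte counts, taking the ratio as compressed size divided by original size (so a smaller value is better). The helper name `ratio_percent` below is hypothetical; the figures are those stated in the text.

```python
def ratio_percent(compressed_bytes, original_bytes):
    """Compressed size as a percentage of the original, to two decimals."""
    return round(100 * compressed_bytes / original_bytes, 2)


# Multilingual dictionary: 12,115,479 bytes from 45,776,158 bytes.
multilingual = ratio_percent(12_115_479, 45_776_158)

# Oxford dictionary: 4,505,792 bytes from 22,580,376 bytes.
oxford = ratio_percent(4_505_792, 22_580_376)

print(multilingual, oxford)
```

Both results agree with the percentages given in the description (26.47% and 19.95%), which confirms that the garbled "2647%" in the scanned text stands for 26.47%.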
Moreover, the compressed data still retains the block characteristic information of the dictionary database, and the data in different blocks are independent of one another. This raises the data lookup speed and achieves the goal of improving compression efficiency without increasing the time complexity of decompression.

The method of the present invention thus improves on the efficiency of conventional compression algorithms. For very large data sets, especially dictionary data in which characters repeat frequently, it compresses data faster and more efficiently.

Although the present invention is disclosed above through the foregoing preferred embodiment, that embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of patent protection is defined by the claims appended to this specification.

[Brief Description of the Drawings]

Figure 1 is a flowchart of a compression method that performs finite exhaustive optimization on the statistical characteristics of data, according to an embodiment of the present invention;
Figure 2 is a detailed flowchart of step 110 in Figure 1;
Figure 3 is a detailed flowchart of step 120 in Figure 1;
Figure 4 is a detailed flowchart of step 130 in Figure 1;
Figure 5 is a table relating the specific length to the length saved when an embodiment of the invention compresses the English-English, Chinese, Japanese, and Korean dictionary data;
Figure 6 is a comparison table of the compression results obtained for the English-English, Chinese, Japanese, and Korean dictionary data with an embodiment of the invention and with a conventional compression method; and
Figure 7 is a comparison table of the compression results obtained for the Oxford dictionary data with an embodiment of the invention and with a conventional compression method.
[Description of Main Reference Numerals]

Step 110: preprocessing
Step 120: character statistics, to obtain the language units and their frequency table
Step 130: finite exhaustive search, to obtain the substitution length range
Step 140: substitution, to obtain the substitution information data
Step 150: statistics, to obtain the Huffman tree
Step 160: encoding, to obtain the compressed data
Step 111: is the data a Unicode file?
Step 112: codes with a value less than 0x80 have their high byte removed
Step 113: the remaining codes are sorted by frequency of use
Step 114: the predetermined number of most frequently used codes are replaced with the code values 0x80 to 0xFE
Step 115: the remaining codes are replaced by the marker 0xFF plus the two-byte code
Step 116: is the data dictionary data?
Step 117: record block markers
Step 122: record the positions of all identical characters
Step 124: sort by the following characters, to obtain the language units and their frequency table
Step 132: determine the optimization range
Step 134: substitute successively from longest to shortest and record the length saved
Step 136: find the largest length saved
Step 138: take the corresponding specific length range as the substitution length range

Claims (1)

1287362 十、申請專利範圍: h ,包括有下 對一資料進行預處理; 統計該預驗叙:㈣憎數财符 單位及該語言單位頻率之一關係表; t】複數個吕 圍;根據該_顧行纽雜_,物彳—替代長度範 之重賴理㈣_一個以上 之重稷π5早位替代,以得到一替代信息資料; 根據該替代m龜計該預餘後 符的頻率,以得到一哈夫曼(HUFFMAN)樹;^非曰代子 根據該替代H細哺哈歧樹彻料(L 縮算法和哈夫曼壓縮算法_爾理後 = 到-麼縮資料。 ^仃、、扁碼’以得 2.:申請專利細第丨項所述之對資料統計特徵進行有限窮兴 饭化的壓縮方法’其中當該資料係為—二字節國时 : =為-字典資料時,㈣料進行預處理之步驟包τ括下列步 將該資料中碼值小於0x80的編碼消除高字節; 將該資料中該消除高字節之編碼外之 排序; 編碼按照使用頻率 16 1287362 將該排序後之編碼中使用頻率較高之既定數量的碼值用 焉值0x80〜OxFE替代; 將該替代後之編碼以外之該排序後之編碼以標記加 上二字節編碼替代,以得到一長字節碼替代資料;以及 進行該長字節碼替代資料的分塊標記記錄,以得到該長字 節碼替代資料和-個以上之塊信息,其中該長字節碼替代資料 係為該預處理後之資料。 3. 4. 如申請專_圍第2項所述之對資料統計特徵進行有限窮舉 優化的壓縮方法,其中該既定數量係為127個。 如申請專利範圍第1項所述之對·統計特徵進行有限窮舉 優化的壓縮方法,其中當該資料係為一二字節國際標準編碼文 似為-字典詩時’於對—:祕進行預處理之麵中可得到 個以上之塊信息以及該預處理後之資料,其中該預處理後之 為-長字節碼替代資料,而於根據該替代信息資料和該 ^讀姻萊斯_算法和哈夫曼_算法職預處理後 之貝枓進行編碼,以得料之步驟中包括下列步驟: 對《亥長子卽碼替代資料根據 — 數個分塊資料; 據韻“進仃分塊,以得到複 萊斯代信編㈣哈夫_憤分«料進行該 t=r該哈妓_算权齡編碼·;以及 如申二專矛丨!·儲存複數個信息,以得到該1 縮資料。 利賴第4項所述之對資料統計特徵進行有限窮舉 5. 1287362 6. 7. 優化的壓縮方法,其中該信, 和該哈夫曼樹。 皂包括該重複語言單位 、該塊信息 t申請專·_1項職之對資魏計紐妨有限窮舉 餘預處奴㈣料進行„_分塊標記資料 ----------------- ^匕的_方法,其中當該資料係為-字典資料時, 以付到 個以上之塊信息和該預處理後之資料 如申請專利範圍第!項所述之對資料 優化_縮方法,w㈣^博雜仃有限窮舉 、, T田忒貝枓係為一字典資料時,於對一資 ;=處理之步驟中可得到—個以上之塊信息和該預處理 貝枓,而於根據該替代信息資料和該哈夫曼樹利用萊斯壓 縮算法何夫麵轉法·猶理狀簡進行編碼,以得 屡縮資料之步驟中包括下列步驟··1287362 X. The scope of application for patent: h, including the pre-processing of the next-to-one data; the statistical pre-test: (4) the relationship table between the financial unit and the frequency of the language unit; t] a plurality of Luwei; _Guxing New Miscellaneous _, 彳 彳 替代 替代 替代 替代 替代 替代 替代 替代 替代 替代 替代 替代 替代 ( 一个 一个 ( 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早 早Get a Huffman tree (HUFFMAN) tree; ^ non-曰 子 根据 according to the replacement H 细 哈 歧 ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( , flat code 'to get 2. 
The patent application details of the data to describe the statistical characteristics of the data for the limited poverty of the compression method 'When the data is - two bytes country: = for - dictionary data When the (four) material is pre-processed, the step τ includes the following steps: the code having the code value less than 0x80 is used to eliminate the high byte; the data is sorted out of the code of the high byte; the code is used according to the frequency of use 16 1287362 Use the frequency after the sorted encoding The predetermined number of code values are replaced by threshold values 0x80~OxFE; the sorted code other than the substitute code is replaced by a mark plus a two-byte code to obtain a long bytecode substitute data; The long bytecode substitutes the block mark record of the data to obtain the long byte code substitute data and more than one block information, wherein the long byte code substitute data is the preprocessed data. For example, the compression method for the finite exhaustive optimization of the statistical characteristics of the data described in the second paragraph of the application, wherein the predetermined number is 127. For example, the statistical characteristics described in item 1 of the patent application are limited. 
An optimized compression method, wherein when the data is a one-two-byte international standard coded text-like dictionary poems, more than one block information can be obtained in the face of pre-processing: and the pre-processing The data, wherein the pre-processed is a -long bytecode replacement material, and is encoded according to the alternative information material and the reading of the singularity algorithm and the Huffman _ algorithm pre-processing Come The steps include the following steps: Substituting the data for the replacement of the Haichangzi weight - several pieces of data; according to the rhyme, "to enter the block, to obtain the complex letter of the letter (four) Huff _ indignation] =r The Harmony_calculation age code·; and such as Shen 2 special spears! · Store a plurality of information to obtain the 1 contraction data. Lilai 4 described in the fourth item of limited statistical exhaustive data characteristics 5 1287362 6. 7. The optimized compression method, wherein the letter, and the Huffman tree. The soap includes the repeating language unit, the block information t application for the special _1 project The pre-slave (four) material is subject to the __block marking information----------------- ^匕 method, where when the data is - dictionary material, to pay The above block information and the pre-processed information are as claimed in the patent scope! According to the item, the data optimization _ shrink method, w (four) ^ Bo 仃 仃 finite exhaustive, T Tian 忒 枓 为 is a dictionary data, in the case of a capital; = processing step can get more than one block The information and the pre-processing shell are encoded according to the substitute information material and the Huffman tree using the Les compression algorithm, and the steps of the data reduction step include the following steps. 
… partitioning the preprocessed data into blocks according to the block information, to obtain a plurality of pieces of block data;
performing, on the block data, mixed-coding replacement using the LZSS compression algorithm and the Huffman compression algorithm according to the substitute information data and the Huffman tree, to obtain the compressed data; and
after encoding, storing a plurality of pieces of information to obtain the compressed data.

8. The compressing method for statistical data characteristics by finite exhaustive optimization as described in claim 7, wherein the information includes the repeated language units, the block information, and the Huffman tree.

9. The compressing method for statistical data characteristics by finite exhaustive optimization as described in claim 1, …

… finding, in the table, a plurality of repeated sentences at or above a reference length range, and grouping them by the length of the repeated language units to obtain a plurality of specific length ranges;
determining an optimization range according to the reference length range and the largest of the specific length ranges;
for each specific length range within the optimization range, successively replacing the plurality of sentences from long to short, and recording the reduced length;
finding a maximum reduced length among the recorded reduced lengths; and
taking the specific length range of the replacement that corresponds to the maximum reduced length, to obtain an alternative length range.

13. The compressing method for statistical data characteristics by finite exhaustive optimization as described in claim 1, wherein the step of performing one or more repeated-language-unit replacements on the data according to the alternative length range, to obtain substitute information data, … the repeated language units within the … length range, from long to short, … outputting the coding result to the substitute information data and recording …

14. The compressing method for statistical data characteristics by finite exhaustive optimization as described in claim 1, wherein the step of encoding the preprocessed data using the LZSS compression algorithm and the Huffman compression algorithm according to the substitute information data and the Huffman tree, to obtain compressed data, comprises the following steps:
performing, on the preprocessed data, mixed-coding replacement using the LZSS compression algorithm and the Huffman compression algorithm according to the substitute information data and the Huffman tree, to obtain the compressed data; and
after encoding, storing a plurality of pieces of information to obtain the compressed data.

15. The compressing method for statistical data characteristics by finite exhaustive optimization as described in claim 12, wherein the information includes the repeated language units and the Huffman tree.
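The search for the alternative length range recited in the steps above (replace repeated units from long to short for each candidate range, record the reduced length, keep the range with the largest reduction) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the function names, the cost model (each replaced occurrence costs `ref_cost` symbols, and each distinct unit is stored once in a table), and the use of a sentinel character to mark replaced positions. The LZSS and Huffman stages are not modelled.

```python
from collections import Counter

SENTINEL = "\x00"  # placeholder for a replaced unit; assumed absent from the input


def repeated_units(data, length):
    """Return the substrings of the given length that occur at least twice."""
    counts = Counter(data[i:i + length] for i in range(len(data) - length + 1))
    return [unit for unit, n in counts.items() if n >= 2]


def saved_length(data, lo, hi, ref_cost=2):
    """Simulate replacing repeated units of length hi down to lo (long to short).

    Hypothetical cost model: every replaced occurrence costs ref_cost symbols,
    and every distinct replaced unit is stored once in a table.
    Returns how many symbols the replacement saves.
    """
    work = data
    table_cost = 0
    for length in range(hi, lo - 1, -1):
        for unit in repeated_units(work, length):
            # Skip units that overlap an earlier replacement or no longer repeat.
            if SENTINEL in unit or work.count(unit) < 2:
                continue
            work = work.replace(unit, SENTINEL)
            table_cost += length
    refs = work.count(SENTINEL)
    encoded = (len(work) - refs) + refs * ref_cost + table_cost
    return len(data) - encoded


def best_replacement_range(data, lo, hi_max, ref_cost=2):
    """Try every upper bound from lo to hi_max and keep the one that saves most."""
    best_hi, best_saving = lo, saved_length(data, lo, lo, ref_cost)
    for hi in range(lo + 1, hi_max + 1):
        saving = saved_length(data, lo, hi, ref_cost)
        if saving > best_saving:
            best_hi, best_saving = hi, saving
    return (lo, best_hi), best_saving


if __name__ == "__main__":
    print(best_replacement_range("abcdefXabcdefYabcdef", 3, 6))  # ((3, 6), 6)
```

The search is exhaustive but finite because only the upper bounds between the reference length and the longest repeated unit are tried, so the cost grows with the number of candidate ranges rather than with every possible substitution order.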
TW93141518A 2004-12-30 2004-12-30 Compressing method for statistical data characteristics by finite exhaustive optimization TWI287362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW93141518A TWI287362B (en) 2004-12-30 2004-12-30 Compressing method for statistical data characteristics by finite exhaustive optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW93141518A TWI287362B (en) 2004-12-30 2004-12-30 Compressing method for statistical data characteristics by finite exhaustive optimization

Publications (2)

Publication Number Publication Date
TW200623657A (en) 2006-07-01
TWI287362B (en) 2007-09-21

Family

ID=39460298

Family Applications (1)

Application Number Title Priority Date Filing Date
TW93141518A TWI287362B (en) 2004-12-30 2004-12-30 Compressing method for statistical data characteristics by finite exhaustive optimization

Country Status (1)

Country Link
TW (1) TWI287362B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103702133B (en) * 2013-12-19 2017-10-24 Tcl集团股份有限公司 A kind of compression of images methods of exhibiting and its device
CN111371459B (en) * 2020-04-26 2023-04-18 宁夏隆基宁光仪表股份有限公司 Multi-operation high-frequency replacement type data compression method suitable for intelligent electric meter

Also Published As

Publication number Publication date
TW200623657A (en) 2006-07-01

Similar Documents

Publication Publication Date Title
US20130141259A1 (en) Method and system for data compression
CN102880726B (en) A kind of image filtering method and system
CN105144157B (en) System and method for the data in compressed data library
CN110008192A (en) A kind of data file compression method, apparatus, equipment and readable storage medium storing program for executing
CN116861041B (en) Electronic document processing method and system
CN115630343B (en) Electronic document information processing method, device and equipment
JPS6356726B2 (en)
CA2463324C (en) Method of compressing digital ink
CN115408350A (en) Log compression method, log recovery method, log compression device, log recovery device, computer equipment and storage medium
CN117648681B (en) OFD format electronic document hidden information extraction and embedding method
CN114528944A (en) Medical text encoding method, device and equipment and readable storage medium
TWI287362B (en) Compressing method for statistical data characteristics by finite exhaustive optimization
CN109831544B (en) Code storage method and system applied to email address
CA3162480A1 (en) Computerized data compression and analysis using potentially non-adjacent pairs
CN105205487A (en) Picture processing method and device
CN116975864A (en) Malicious code detection method and device, electronic equipment and storage medium
US6226411B1 (en) Method for data compression and restoration
CN109255090B (en) Index data compression method of web graph
CN111767280A (en) Data processing method, device and storage medium
CN114466082B (en) Data compression and data decompression method and system and artificial intelligent AI chip
CN110825747A (en) Information access method, device and medium
CN114490546A (en) Track data compression method and device, electronic equipment and storage medium
CN112765937A (en) Text regularization method and device, electronic equipment and storage medium
CN113268986A (en) Unit name matching and searching method and device based on fuzzy matching algorithm
Nagaprasad et al. Authorship attribution based on data compression for telugu text

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees