CN100474781C - Compression method of two-byte character data

Info

Publication number: CN100474781C
Application number: CNB2003101242211A
Authority: CN
Inventors: 赵畇衍
Original assignee: Pantech Co Ltd
Current assignee: Pan Thai Co ltd
Priority date: 2003-04-08
Filing date: 2003-12-31
Publication date: 2009-04-01
Anticipated expiration: 2023-12-31
Also published as: KR20040087503A; CN1536768A; KR100494876B1

Abstract

The present invention provides a method for compressing messages in units of 2-byte characters (Korean, Chinese) before storing them in the message-processing module of a terminal, thereby reducing the storage space needed for 2-byte character data. The method comprises: a step of generating a plurality of compressible code words according to character frequency, storing them in a basic dictionary table, and initializing the variable that indicates the next code word to be registered; an input step of identifying whether the input message data is a 2-byte character and receiving it; a step of checking whether the input data is among the compressible code words and, if so, looking up the matching code in the dictionary table through a mapping process and outputting it, and, if the dictionary holds no such matching code, registering it in the dictionary; a step of determining whether the end of the data has been reached and, if the data has not been fully input, returning to the input step in which the message data is input in sequence; and a step of performing a flush process when the end of the data has been reached. When the number of bits of the matching code obtained by encoding a compressible code word is smaller than the threshold at which the code word can be shortened, the code is output in log₂(C1+1)−1 bits; when it is larger than the threshold, it is output in log₂(C1+1) bits, where C1 is the number of code words currently assigned.

Description

Compression method for 2-byte character data

Technical field

The present invention relates to a method for compressing 2-byte character data and, more specifically, to a method that uses a 2-byte character compression algorithm to reduce the storage space needed for SMS (Short Message Service) and EMS (Enhanced Messaging Service) messages in a mobile communication terminal.

Background

Users generally exchange all kinds of messages through the message sending and receiving functions (SMS, EMS) of mobile communication terminals. Most terminals barely compress these messages, and those that compress them in part merely use compression algorithms suited to the English alphabet.

When such an algorithm is applied to languages such as Korean and Chinese, however, compression efficiency is relatively low because most text in these languages is highly redundant, more memory is required, and storage space cannot be reduced effectively.

[Patent Document 1] Japanese Patent Laid-Open No. 2-255977 (Japanese Patent Publication No. 1990-255977)

[Patent Document 2] Japanese Patent Laid-Open No. 9-069785 (Japanese Patent Publication No. 1997-069785)

Summary of the invention

The present invention overcomes the above shortcomings. Its object is to provide a method for compressing 2-byte character data that compresses and stores messages in units of 2-byte characters (Korean, Chinese) in the message-processing module of a terminal, thereby reducing storage space.

To achieve this object, the method for compressing 2-byte character data according to the present invention comprises: a step of generating a plurality of compressible code words according to character frequency, storing them in a basic dictionary table, and initializing the variable that indicates the next code word to be registered; a step of storing, with reference to the initialized variable, additional compressible code words in an additional dictionary table that includes the basic dictionary table, and re-initializing the variable that indicates the next code word to be registered; an input step of identifying whether the input message data is a 2-byte character and receiving it; a step of checking whether the input data is among the compressible code words and, if so, looking up the matching code in the dictionary table through a mapping process and outputting it, and, if the dictionary holds no such matching code, registering it in the dictionary; a step of determining whether the end of the data has been reached and, if the data has not been fully input, returning to the input step in which the message data is input in sequence; and a step of performing a flush process when the end of the data has been reached. Memory stores data in units of 8 or 16 bits, whereas the compressed data has a variable number of bits; the flush process is the process of filling the last remaining bits with 0 when the last stored data does not occupy 8 or 16 bits. When the number of bits of the matching code obtained by encoding a compressible code word is smaller than the threshold at which the code word can be shortened, the code is output in log₂(C1+1)−1 bits; when it is larger than the threshold, it is output in log₂(C1+1) bits, where C1 is the number of code words currently assigned.

The beneficial effect of the present invention is that storage space can be reduced by compressing and storing messages made of 2-byte characters (Korean, Chinese, and so on) in the message-processing module of a terminal. Specifically, when the method of the present invention is used to compress text files that mix English and Korean characters, the average compression ratio improves by about 22% compared with existing compression methods.

Brief description of the drawings

FIG. 1 is an operation flowchart of a method for compressing 2-byte character data according to one embodiment of the present invention.

FIG. 2 is an operation flowchart detailing the step (compression step) of looking up the matching code in the dictionary table through a mapping process and outputting it, in the method for compressing 2-byte character data according to one embodiment of the present invention.

FIG. 3 is an operation flowchart detailing the dictionary generation/management step that manages the matching-code dictionary, in the method for compressing 2-byte character data according to one embodiment of the present invention.

Detailed description

For convenience, the method for compressing 2-byte character data according to the present invention is described using Korean as an example. It applies equally to other languages written with 2-byte characters, such as Chinese and Japanese. Although this embodiment describes only the compression of Korean, it will be obvious to those skilled in the art that the present invention is not limited to Korean.

Embodiments of the present invention are described below with reference to the accompanying drawings.

FIG. 1 is an operation flowchart of a method for compressing 2-byte character data according to one embodiment of the present invention, and is described below.

First, the maximum string length (N7), the number of code words (N2), the initial dictionary entry number (N5), and so on are initialized; high-frequency characters are stored in the basic dictionary table; and the variable C1, which indicates the next code word to be registered, is initialized (S101). The code words used for character compression are organized as shown in the table below. To find the code words needed for character compression, the occurrence frequencies of the 2,350 complete-form Korean characters were measured in files of mixed Korean and English text and then ranked, and the 470 most frequently used characters (20% of the 2,350) were registered as code words. These 470 characters together account for more than 85% of all character occurrences. The initial value of the variable C1 can therefore be 471.
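The frequency analysis behind the basic dictionary can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of `collections.Counter`, and the choice of 256 as the first Korean code (taken from Table 1) are assumptions made for the example. The byte ranges B0 to C8 and A1 to FE are the ones the description gives for complete-form Korean characters.

```python
from collections import Counter

FIRST_KOREAN_CODE = 256   # assumption: codes 0 to 255 stay reserved for single bytes
BASE_KOREAN_COUNT = 470   # number of frequent characters kept, per the description


def is_complete_korean(b1, b2):
    """True if (b1, b2) falls in the 25 x 94 = 2350-character block."""
    return 0xB0 <= b1 <= 0xC8 and 0xA1 <= b2 <= 0xFE


def build_base_dictionary(corpus, keep=BASE_KOREAN_COUNT):
    """Count 2-byte characters in `corpus` (bytes) and give codes to the most frequent."""
    counts = Counter()
    i = 0
    while i < len(corpus):
        if i + 1 < len(corpus) and is_complete_korean(corpus[i], corpus[i + 1]):
            counts[corpus[i:i + 2]] += 1
            i += 2
        else:
            i += 1  # single-byte character: not a candidate for the dictionary
    ranked = [ch for ch, _ in counts.most_common(keep)]
    # The most frequent character gets the lowest code.
    return {ch: FIRST_KOREAN_CODE + rank for rank, ch in enumerate(ranked)}
```

Calling `build_base_dictionary(sample_bytes)` on a representative mixed Korean and English corpus would return a mapping from 2-byte sequences to codes starting at 256; how the original corpus was chosen is not stated in the source.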

Table 1:

| Code word range | Meaning |
|---|---|
| 0 to 255 | ASCII (American Standard Code for Information Interchange) |
| 256 to 725 | Korean character codes (470 characters) |
| 726 to 1023 | 10-bit codes |
| 1024 to 2047 | 11-bit codes |
| 2048 to 4095 | 12-bit codes |

Next, with reference to the initialized variable, additional compressible code words are stored in an additional dictionary table that includes the basic dictionary table, and the variable C1 indicating the next code word to be registered is re-initialized (S102). The number of bits of the matching code that encodes a compressible code word is determined by the following formulas.

Formula 1: $\mathrm{C1} + \mathrm{lim} \le 2^{\lceil \log_2(\mathrm{C1}+1) \rceil} - 1$

Formula 2: $\mathrm{lim} = \mathrm{C3} - \mathrm{C1} - 1$

Formula 3: $\mathrm{C3} = 2^{\lceil \log_2(\mathrm{C1}+1) \rceil}$

Here C1 is the number of code words currently assigned and lim is the threshold up to which a code word can be output with one bit fewer. The logarithm is taken to base 2 and rounded up to a whole number of bits, as the example below with C1 = 750 and C3 = 1024 implies. When a code word is converted into a bit string, it is therefore output in ⌈log₂(C1+1)⌉ − 1 bits if it does not exceed the threshold lim, and in ⌈log₂(C1+1)⌉ bits if it is larger than lim.

For example, when C1 is 750, lim = 1024 − 750 − 1 = 273. A code word between 0 and 273 is therefore output as a 9-bit code during compression, whereas a code word between 274 and 749 has 274 added to it and is output as a 10-bit code.
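The bit-length rule and the worked example can be reproduced with a short sketch. The function names are hypothetical, and the sketch assumes that the logarithm is rounded up to an integer, which for an integer C1 equals `C1.bit_length()`. Codes are returned as bit strings purely for readability.

```python
def threshold(c1):
    """lim = C3 - C1 - 1, with C3 the smallest power of two >= C1 + 1."""
    c3 = 1 << c1.bit_length()
    return c3 - c1 - 1


def encode_codeword(code, c1):
    """Return the bit string for `code` when `c1` code words (0 .. c1-1) are assigned."""
    if c1 < 2 or not 0 <= code < c1:
        raise ValueError("code must satisfy 0 <= code < c1, with c1 >= 2")
    width = c1.bit_length()          # rounded-up log2(C1 + 1)
    lim = threshold(c1)
    if code <= lim:
        # Short form: one bit fewer than the full width.
        return format(code, "0{}b".format(width - 1))
    # Long form: shift the value by lim + 1 so no short code is a prefix of it.
    return format(code + lim + 1, "0{}b".format(width))
```

With C1 = 750, `encode_codeword(273, 750)` gives the 9-bit string `100010001` and `encode_codeword(274, 750)` gives the 10-bit string `1000100100`, in line with the example.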

During decompression, 9 bits are read first. If the value read is smaller than 274, it is taken as the code word. If it is 274 or greater, the code is read again as 10 bits and the value minus 274 is taken as the code word. Table 2 below shows the structure of the dictionary table of the present invention under this scheme.

Table 2:

| Compressible code word | Encoded code | Decimal value |
|---|---|---|
| 0 | 000000000 | 0 |
| 1 | 000000001 | 1 |
| 2 | 000000010 | 2 |
| ... | ... | ... |
| 273 | 100010001 | 273 |
| 274 | 1000100100 | 548 (274+274) |
| 275 | 1000100101 | 549 (274+275) |
| ... | ... | ... |
| 749 | 1111111111 | 1023 (274+749) |
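The decompression rule for this code table can be sketched in the same spirit. The function name and the representation of the stream as a string of `0` and `1` characters are assumptions made for the example, and the sketch expects a well-formed stream. It reads the short width first and extends by one bit only when the short value is not a valid short code.

```python
def decode_codeword(bits, pos, c1):
    """Decode one code word from `bits` starting at `pos`.

    Returns (code_word, new_pos). `c1` is the number of code words assigned.
    """
    width = c1.bit_length()            # full width, rounded-up log2(C1 + 1)
    short_count = (1 << width) - c1    # number of short codes, equal to lim + 1
    value = int(bits[pos:pos + width - 1], 2)
    if value < short_count:
        return value, pos + width - 1
    # Not a short code: re-read with the full width and undo the offset.
    value = int(bits[pos:pos + width], 2)
    return value - short_count, pos + width
```

For C1 = 750 the stream `000000010` followed by `1000100101` decodes to code word 2 and then code word 275.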

The message data is then input in sequence. Each input is checked against the compressible code words; if it is among them, the matching code is looked up in the dictionary table through a mapping process and output (S103). It is then checked whether the matching code exists in the dictionary, and if it does not, a dictionary generation step registers it in the dictionary (S104).

Next, it is determined whether the end of the data has been reached; if not, processing returns to the step of inputting the message data in sequence (S105).

If the end of the data has been reached, a flush process is performed (S106). Memory stores data in units of 8 or 16 bits, but the compressed data has a variable number of bits. The flush process therefore fills the last remaining bits with 0 when the last stored data does not occupy a full 8-bit or 16-bit unit.
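The flush process can be illustrated with a small bit writer. The class below is a hypothetical sketch that assumes 8-bit storage units and most-significant-bit-first packing; the source states only that the leftover bits are filled with 0.

```python
class BitWriter:
    """Packs variable-width codes into bytes, most significant bit first."""

    def __init__(self):
        self.buffer = bytearray()
        self.acc = 0      # bits not yet written out
        self.nbits = 0    # how many bits `acc` currently holds

    def write(self, value, width):
        """Append the low `width` bits of `value` to the stream."""
        self.acc = (self.acc << width) | (value & ((1 << width) - 1))
        self.nbits += width
        while self.nbits >= 8:
            self.nbits -= 8
            self.buffer.append((self.acc >> self.nbits) & 0xFF)
        self.acc &= (1 << self.nbits) - 1

    def flush(self):
        """Pad the final partial byte with 0 bits and return all bytes."""
        if self.nbits:
            self.buffer.append((self.acc << (8 - self.nbits)) & 0xFF)
            self.acc = 0
            self.nbits = 0
        return bytes(self.buffer)
```

Writing the 9-bit code 273 and the 10-bit code 548 produces 19 bits, so `flush()` appends five 0 bits to complete the third byte.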

FIG. 2 is an operation flowchart detailing the step (compression step) of looking up the matching code in the dictionary table through a mapping process and outputting it, in the method for compressing 2-byte character data according to one embodiment of the present invention. It is described below.

First, the first byte of the input data is read (S201).

It is then determined whether the first byte lies within the first assigned range (S202). For complete-form Korean characters, the first byte takes one of the 25 values from hexadecimal B0 to C8, so the first assigned range can be hexadecimal B0 to C8.

If the first byte lies within the first assigned range, the second byte of the input data is read (S203).

If, on the other hand, the first byte is not within the first assigned range, the input is not a complete-form Korean character and is therefore treated as an ASCII character (S207).

It is then determined whether the second byte lies within the second assigned range (S204). For complete-form Korean characters, the second byte takes one of the 94 values from hexadecimal A1 to FE, so the second assigned range can be hexadecimal A1 to FE.

If the second byte lies within the second assigned range, it is determined whether the input data is included in the dictionary table (S205).

If, on the other hand, the second byte is not within the second assigned range, the input is not a complete-form Korean character and is therefore treated as an ASCII character (S207).

If the input data is included in the dictionary table, it is determined to be a matching code value (S206).

If, on the other hand, the input data is not included in the dictionary table, it is not a frequently occurring Korean character and is therefore treated as an ASCII character (S207).
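The decision sequence S201 to S207 can be sketched as a single function. The name `next_symbol` and the dictionary layout (2-byte sequences mapped to codes) are assumptions for illustration. The sketch also assumes that "treated as an ASCII character" means emitting the first byte as a single-byte code in the range 0 to 255, which the source does not spell out.

```python
def next_symbol(data, pos, dictionary):
    """Classify the input at `pos` and return (code, new_pos).

    `data` is a bytes object; `dictionary` maps 2-byte sequences to codes.
    """
    b1 = data[pos]                                    # S201
    if 0xB0 <= b1 <= 0xC8 and pos + 1 < len(data):    # S202
        b2 = data[pos + 1]                            # S203
        if 0xA1 <= b2 <= 0xFE:                        # S204
            code = dictionary.get(data[pos:pos + 2])  # S205
            if code is not None:
                return code, pos + 2                  # S206: matching code value
    # S207: not a frequent complete-form Korean character.
    return b1, pos + 1
```

Iterating `pos` with this function until it reaches `len(data)` yields the sequence of code words that the variable-width output stage would then write.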

FIG. 3 is an operation flowchart detailing the dictionary management step in the method for compressing 2-byte character data according to one embodiment of the present invention. This step checks whether the matching code exists in the dictionary, registers it if it does not, and removes infrequently used codes already registered in the dictionary. It is described below.

First, it is determined whether the string length of the code word exceeds the maximum string length (N7); if it does, the dictionary management step ends (S301).

If the string does not exceed the maximum string length (N7), it is determined whether it already exists in the dictionary table; if it does, the dictionary management step ends (S302).

If the string does not exist in the dictionary table, it is assigned to the new variable C1 (S303).

Next, the value of C1 is incremented so that it can be assigned to the code word of the string generated next (S304).

Next, it is determined whether the incremented variable C1 is greater than the number of code words (N2) (S305).

If the incremented variable C1 is greater than the number of code words (N2), the dictionary entry number (N5) is assigned to C1; if it is smaller than N2, N5 is not assigned (S306).

It is then determined whether the node assigned to the incremented variable C1 is a leaf node, that is, a node representing the last character of a string, or an unused node (C1 == NULL). If the node is neither a leaf node nor an unused node, processing returns to the step in which C1 is incremented for assignment to the code word of the next string (S307).

If the node assigned to the incremented variable C1 is a leaf node or an unused node, C1 is removed from the dictionary entries in preparation for assigning the code word of a new string (S308).
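One possible reading of steps S301 to S308 is sketched below. It is an interpretation under stated assumptions rather than the patented procedure: the class and attribute names are hypothetical, strings are Python `str` values, codes wrap from N2 back to N5, and a "leaf" is taken to mean an entry that no other dictionary string extends.

```python
class RecyclingDictionary:
    """Dictionary that reuses code slots once the code space is exhausted."""

    def __init__(self, base, first_free, max_code, max_len):
        self.code_of = dict(base)                        # string -> code
        self.string_of = {c: s for s, c in base.items()}
        self.n5 = first_free   # N5: first code that may be recycled
        self.n2 = max_code     # N2: largest usable code
        self.n7 = max_len      # N7: maximum string length
        self.c1 = first_free   # C1: next code to assign

    def _is_leaf(self, code):
        s = self.string_of[code]
        return not any(t != s and t.startswith(s) for t in self.code_of)

    def register(self, string):
        """Register `string`; return its new code, or None if nothing was added."""
        if len(string) > self.n7:          # S301
            return None
        if string in self.code_of:         # S302
            return None
        code = self.c1                     # S303
        self.code_of[string] = code
        self.string_of[code] = string
        while True:
            self.c1 += 1                   # S304
            if self.c1 > self.n2:          # S305, S306
                self.c1 = self.n5
            if self.c1 not in self.string_of:
                break                      # S307: unused slot
            if self._is_leaf(self.c1):     # S307: leaf entry
                old = self.string_of.pop(self.c1)
                del self.code_of[old]      # S308: free the slot
                break
        return code
```

Skipping non-leaf entries keeps every prefix of a longer registered string available, which is one plausible reason for the leaf test in S307; the source does not state the motivation.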

The present invention is not limited to the scope disclosed in the above embodiments. Various improvements and modifications can be made within the technical subject matter of the present invention, and such improvements and modifications also fall within its technical scope and are protected by it.

Claims

1. A method for compressing 2-byte character data, characterized by comprising:

a step of generating a plurality of compressible code words according to character frequency, storing them in a basic dictionary table, and initializing the variable that indicates the next code word to be registered;

a step of storing, with reference to the initialized variable, additional compressible code words in an additional dictionary table that includes the basic dictionary table, and re-initializing the variable that indicates the next code word to be registered;

an input step of identifying whether the input message data is a 2-byte character and receiving it;

a step of checking whether the input data is among the compressible code words and, if so, looking up the matching code in the dictionary table through a mapping process and outputting it, and, if the dictionary holds no such matching code, registering it in the dictionary;

a step of determining whether the end of the data has been reached and, if the data has not been fully input, returning to the input step in which the message data is input in sequence; and

a step of performing a flush process when the end of the data has been reached, the flush process being the process of filling the last remaining bits with 0 when the last stored data does not occupy 8 or 16 bits, given that memory stores data in units of 8 or 16 bits whereas the compressed data has a variable number of bits;

wherein, when the number of bits of the matching code obtained by encoding a compressible code word is smaller than the threshold at which the code word can be shortened, the code is output in log₂(C1+1)−1 bits, and when it is larger than the threshold, it is output in log₂(C1+1) bits, where C1 is the number of code words currently assigned.

2. The method for compressing 2-byte character data according to claim 1, characterized in that:

to find the compressible code words, the occurrence frequencies of complete-form 2-byte characters are measured in files of mixed 2-byte and 1-byte characters, ranked and analyzed, and the frequently used characters are registered as code words.

3. The method for compressing 2-byte character data according to claim 1, characterized in that:

frequencies are measured for characters represented by combinations of 2 or more bytes, and only the frequently used characters are registered in the dictionary as basic code words.

4. The method for compressing 2-byte character data according to claim 2, characterized in that:

the 2-byte characters are Chinese characters and the 1-byte characters are English characters.

5. The method for compressing 2-byte character data according to claim 2, characterized in that:

the 2-byte characters are Korean characters and the 1-byte characters are English characters.

6. The method for compressing 2-byte character data according to claim 1, characterized in that:

the step of looking up the matching code in the dictionary table through a mapping process and outputting it comprises:

a step of reading the first byte of the input data;

a step of determining whether the first byte lies within a first assigned range;

a step of reading the second byte of the input data when the first byte lies within the first assigned range;

a step of determining, when the first byte does not lie within the first assigned range, that the input is an ASCII character because it is not a complete-form Korean character;

a step of determining whether the second byte lies within a second assigned range;

a step of determining, when the second byte lies within the second assigned range, whether the input data is included in the dictionary table;

a step of determining, when the second byte does not lie within the second assigned range, that the input is an ASCII character because it is not a complete-form Korean character;

a step of determining that the input data is a matching code value when it is included in the dictionary table; and

a step of determining, when the input data is not included in the dictionary table, that the input is an ASCII character because it is not a frequently occurring Korean character.

7. The method for compressing 2-byte character data according to claim 4, characterized in that:

the first assigned range is from hexadecimal B0 to C8.

8. The method for compressing 2-byte character data according to claim 4, characterized in that:

the second assigned range is from hexadecimal A1 to FE.