CN102811113B

CN102811113B - Character-type message compression method

Info

Publication number: CN102811113B
Application number: CN201210241220.4A
Authority: CN
Inventors: 常传文; 李玮; 茅文深; 鉴福升; 林明; 夏宁; 吴杰; 姚浩
Original assignee: CETC 28 Research Institute
Current assignee: CETC 28 Research Institute
Priority date: 2012-07-12
Filing date: 2012-07-12
Publication date: 2014-12-10
Anticipated expiration: 2032-07-12
Also published as: CN102811113A

Abstract

The invention discloses a method for compressing character-type messages. The method provides an optimized adaptive scheme for updating the frequency table, with two modes. The first updates character by character during encoding: the frequency table is updated after each character is arithmetically coded. Updating the frequency table consumes computation, so the first mode cannot be used when computing resources are limited. In that case the invention can use the second mode, in which the frequency table is updated in units of several messages: each message is arithmetically coded character by character while only the number of occurrences of each character is recorded, and once the set number of messages has been encoded, the frequency table is updated from the recorded counts. The invention achieves lossless compression of messages and mitigates the high delay, excess bandwidth use, and large storage demand encountered when sharing, storing, and distributing messages, so that the compression ratio approaches or reaches the theoretical maximum of entropy coding.

Description

A Method for Compressing Character Type Messages

Technical field

The invention relates to an effective compression method for character-type messages. Exploiting the fact that such messages draw on a finite character set, it introduces a static frequency space that is updated adaptively and combines it with arithmetic coding and related techniques, achieving a good compression result.

The invention is applicable to any scenario, such as sharing, storage, or transmission, that involves compressing messages drawn from a finite character set, especially where the real-time requirements on message transmission are high. Practical verification shows that it meets these application requirements well.

Background art

Data compression methods fall into two types according to whether information is lost between the original and the compressed data: lossy compression and lossless compression. With lossy compression, the data reconstructed (restored, decompressed) from the compressed data differs from the original data. With lossless compression, the reconstructed data is identical to the original data. The method described in this patent is a lossless compression method.

By implementation technique, lossless data compression can be divided into three categories: predictive, dictionary, and statistical. Predictive coding exploits the correlation between discrete signals: it uses one or more preceding signals to predict the next signal and then encodes the difference between the actual value and the predicted value (the prediction error). Typical methods are DPCM and ADPCM, which are well suited to compressing audio and image data. Dictionary coding exploits the fact that data often contains many repeated strings: it continually extracts new strings from the character stream and replaces each string with a code, which achieves compression. A typical method is LZW coding, which dynamically builds a string table during encoding and replaces longer strings with shorter codes.

Statistical coding, also called entropy coding, compresses according to the distribution of character occurrence probabilities. Typical methods include run-length coding, Huffman coding, and arithmetic coding. Run-length coding replaces consecutive symbols of the same value with a single symbol value or string, so that the encoded length is less than the length of the original data; it suits cases where the same symbol occurs many times in a row. Huffman coding assigns short codewords to information symbols with a high probability of occurrence and long codewords to symbols with a low probability of occurrence.

The concept of arithmetic coding was proposed by Peter Elias in 1960, but although it was mathematically sound it could not be implemented on a computer and was not put to practical use at the time. In 1976, R. Pasco and J. Rissanen each implemented finite-precision arithmetic coding with fixed-length registers, making it realizable on a computer. Its basic principle is to represent the encoded message as an interval between the real numbers 0 and 1. The longer the message, the smaller the interval that represents it, and the more binary bits are needed to represent that interval. A symbol with a higher probability narrows the interval more slowly during encoding and therefore contributes fewer bits to the encoded result. The whole encoding process replaces a string of input symbols with a single floating-point number instead of replacing each input symbol with a specific codeword, which avoids the problem in Huffman coding that the number of bits per symbol must be an integer. Arithmetic coding is therefore more efficient, especially when the source contains few symbols. With only two symbols, for example, arithmetic coding has a clear advantage, while Huffman coding achieves almost no compression.
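The two-symbol comparison above can be illustrated with a short calculation. The following is a minimal sketch, not part of the patent: the probabilities and the message are hypothetical, and `ideal_code_length` is an illustrative helper that returns the number of bits needed to identify the final interval (minus the base-2 logarithm of its width) rather than running a real coder.

```python
import math

def ideal_code_length(message, probs):
    """Bits needed to identify the final interval: -log2 of its width."""
    width = 1.0
    for symbol in message:
        width *= probs[symbol]  # each symbol scales the interval by its probability
    return -math.log2(width)

# Hypothetical skewed two-symbol source (values chosen for illustration only).
probs = {"a": 0.9, "b": 0.1}
message = "aaaaaaaaab"

arith_bits = ideal_code_length(message, probs)
huffman_bits = len(message) * 1  # a Huffman code cannot use less than 1 bit per symbol

print(f"interval-based cost: {arith_bits:.2f} bits, Huffman cost: {huffman_bits} bits")
```

The frequent symbol `a` narrows the interval only slightly, so it costs a fraction of a bit, which is exactly what a whole-bit codeword cannot express.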

Communication messages (hereinafter "messages") are in very common use, for example for radar target information, position information, and time information, and consist mainly of characters. Characters are the letters, digits, and symbols used in computers; each takes one byte of storage (see the ASCII code table). With the arrival of the information age, the volume of stored messages has become massive, which puts considerable pressure on sharing, storage, and distribution. Consider a city-wide vehicle (bus and taxi) monitoring and dispatch system: each vehicle sends its attributes (such as position and status) to a center in a specific message format, and because the vehicles are mobile, communication must be wireless. The center also builds a history database for each vehicle, and the huge amount of vehicle information burdens both communication and storage. In practice, message formats with character features are widely used because they are easy to inspect and interact with. One example is the widely used NMEA-0183 format, a standard defined by the US National Marine Electronics Association for marine electronic equipment, which has become the common standard protocol for GPS navigation devices.

At present, character-type message formats are for the most part used (transmitted, stored) directly without compression. According to the existing literature and published material, the compression schemes that have been adopted are:

1. Compressing messages with BCD code

BCD code, also called binary-coded decimal, is a binary encoding of digits. It handles the ten digits 0 to 9 and always uses 4 binary bits to represent each digit.

This scheme has a limited scope: it can only compress numeric characters and does not apply to letters and other characters.
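As an illustration of the BCD scheme described above, the following minimal sketch packs two decimal digits into each byte. The function name `bcd_pack` and the use of the nibble `0xF` as padding for an odd number of digits are assumptions made for this example; the patent does not specify them.

```python
def bcd_pack(digits: str) -> bytes:
    """Pack a string of decimal digits into bytes, 4 bits per digit."""
    nibbles = []
    for ch in digits:
        if ch not in "0123456789":
            # BCD has no representation for letters or punctuation.
            raise ValueError(f"BCD cannot encode {ch!r}")
        nibbles.append(int(ch))
    if len(nibbles) % 2:
        nibbles.append(0xF)  # assumed filler nibble for an odd digit count
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

packed = bcd_pack("42505589")
print(packed.hex())  # 8 characters stored in 4 bytes
```

Calling `bcd_pack("S")` raises `ValueError`, which reflects the limitation noted above: the scheme cannot represent the letters and separators that character-type messages contain.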

2. Compressing messages with extended BCD code

All characters in the character set are binarized, and the binarized data is used to represent the characters, which achieves compression. For example, binarizing a set of 100 characters assigns 7 binary bits to each character.

This scheme is a typical equal-probability Huffman coding method: it treats all characters as equally probable and ignores their actual probability characteristics, so bits are wasted and the compression ratio is limited.
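The bit count in the 100-character example follows from the smallest width that can index every character. The sketch below shows that calculation and a fixed-width encoder; the function names and the sample alphabets are hypothetical and serve only as an illustration.

```python
def fixed_width_bits(alphabet_size: int) -> int:
    """Smallest number of bits that can index every character of the set."""
    return max(1, (alphabet_size - 1).bit_length())

def encode_fixed(message: str, alphabet: str) -> str:
    """Encode each character as its index in the alphabet, using a fixed width."""
    width = fixed_width_bits(len(alphabet))
    index = {ch: i for i, ch in enumerate(alphabet)}
    return "".join(format(index[ch], f"0{width}b") for ch in message)

print(fixed_width_bits(100))        # width for a 100-character set
print(encode_fixed("ab", "abc"))    # two characters, 2 bits each
```

Every character costs the same number of bits regardless of how often it occurs, which is the waste the paragraph above points out.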

3. Compressing messages with Huffman coding

Huffman coding encodes source symbols with a variable-length code table derived from an estimate of how often each source symbol occurs: frequent symbols get shorter codes and infrequent symbols get longer codes. Traditional Huffman coding is a static method. It counts the frequency of each character in the original data, builds a Huffman tree from those counts, and then encodes the original data. This approach is severely limited in practical systems, especially real-time transmission and processing systems such as communications, and has therefore not been widely used for message compression. Adaptive Huffman coding is a dynamic variant of this method and has been applied to message compression. It encodes data with a dynamically changing Huffman tree: character N+1 is encoded with the Huffman tree obtained from the first N characters of the original data. Each time a character is read, its count is adjusted and the Huffman tree is updated, which keeps coding efficiency as high as possible.

This scheme does not consider joint probabilities, and because the number of bits per codeword must be an integer, compression efficiency is reduced and the output code stream contains wasted bits.
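The integer-bit limitation can be made concrete by comparing Huffman code lengths with the entropy of the source. The following sketch builds static Huffman code lengths with the standard-library `heapq` module; the probability tables are hypothetical examples and the function names are illustrative.

```python
import heapq
import math

def huffman_lengths(probs):
    """Return the Huffman codeword length of each symbol."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every merge adds one bit to the merged symbols
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

skewed = {"a": 0.9, "b": 0.1}  # hypothetical source
lengths = huffman_lengths(skewed)
average = sum(skewed[s] * lengths[s] for s in skewed)
print(f"Huffman: {average:.3f} bits/symbol, entropy: {entropy(skewed):.3f} bits/symbol")
```

For the skewed source both symbols receive a 1-bit codeword, so the average stays at 1 bit per symbol even though the entropy is below half a bit.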

Summary of the invention

Purpose of the invention: in view of the problems described above in handling character-type message formats, the invention proposes a general lossless message compression method for such formats. The method is based on arithmetic coding and introduces a static frequency table and an adaptive frequency table. It achieves lossless compression of messages and mitigates the high delay, excess bandwidth use, and large storage demand encountered when sharing, storing, and distributing messages, so that the compression ratio approaches or reaches the theoretical maximum of entropy coding.

Technical solution: a method for compressing character-type messages, comprising the following steps:

Let the character set of the character-type message be A, the number of characters in it be n, and the probability of character a_i be P_i. Then

a_i ∈ A

$\sum_{i = 1}^{n} P_{i} = 1,$ where 1 ≤ i ≤ n
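For context, the "theoretical maximum of entropy coding" mentioned in the purpose statement is commonly understood as the Shannon bound. As a sketch, assuming the characters are independent draws with the probabilities $P_i$ and that each uncompressed character occupies 8 bits (the symbols $H(A)$ and $\bar{L}$ are introduced here for illustration and do not appear in the patent), the average code length $\bar{L}$ in bits per character and the achievable compression ratio satisfy:

```latex
H(A) = -\sum_{i=1}^{n} P_i \log_2 P_i \;\le\; \bar{L},
\qquad
\text{compression ratio} = \frac{8}{\bar{L}} \;\le\; \frac{8}{H(A)}
```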

(1) Preprocessing

When a character-type message format is encoded for the first time, the frequency table _Adapt_Table must be initialized. There are two ways to do this. The first is to assign specific values to P_i based on the characteristics of the message's character set and the specific operating environment, which creates the empirical-value static frequency table _Exper_Table of the character set, and then to copy its values into the frequency table _Adapt_Table. The second is to create the equal-probability static frequency table _EqualPro_Table, that is

$P_{i} = \frac{1}{n}$

and to assign it to the frequency table _Adapt_Table. In actual use, the initialization method can be chosen according to specific needs;

(2) Receive a message

Let the received message be Message, its character sequence be B, and the number of elements in the sequence be m, that is

b_j ∈ A, where 1 ≤ j ≤ m

(3) Read in a character

Read the characters of the received message Message one by one. Let the character read be b_j, 1 ≤ j ≤ m, with probability P_bj;

(4) Arithmetic coding

Arithmetically code the character according to the current frequency table _Adapt_Table and the current character frequency P_bj;

(5) Decide whether to update the frequency table

Depending on actual requirements, the frequency table _Adapt_Table can be updated character by character during encoding, that is, the frequency table is updated after each character of the message Message is arithmetically coded. Alternatively, it can be updated in units of several messages: each message is arithmetically coded character by character while only the number of occurrences of each character is recorded, and the frequency table is updated from the recorded counts once the set number of messages has been encoded. If the frequency table _Adapt_Table needs to be updated, go to the next step (6); otherwise jump to step (7);

(6) Update the frequency table

Update the frequency table _Adapt_Table by updating the frequency P_bj of character b_j;

(7) Check whether encoding of this message has finished

If encoding of the message Message has not finished, jump to step (3) and encode the next character; otherwise go to the next step (8);

(8) Check whether there is a next message

If so, go to step (9); otherwise go to step (11), which ends this encoding run;

(9) Decide whether to update the frequency table

When the frequency table _Adapt_Table is updated in units of several messages, then after the message Message has been encoded, go to the next step if the frequency table is to be updated; otherwise jump to step (2), read the next message, and continue encoding;

(10) Update the frequency table

Update the frequency table _Adapt_Table using the recorded character occurrence counts;

(11) End

End this encoding run.
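The control flow of steps (1) to (11) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the table holds integer occurrence counts rather than probabilities, the arithmetic coder of step (4) is replaced by the ideal cost of -log2(p) bits per character, and the names `equal_table`, `encode_cost`, and the `batch` parameter are hypothetical.

```python
import math
from collections import Counter

def equal_table(alphabet):
    """Step (1), second option: every character starts with the same count."""
    return {ch: 1 for ch in alphabet}

def encode_cost(messages, initial_table, batch=None):
    """Return the ideal total bit cost of encoding all messages.

    batch=None updates the table after every character;
    batch=k records counts and updates the table after every k messages.
    """
    table = dict(initial_table)        # plays the role of _Adapt_Table
    pending = Counter()                # occurrences recorded but not yet applied
    total_bits = 0.0
    for number, message in enumerate(messages, start=1):     # steps (2), (8)
        for ch in message:                                    # steps (3), (7)
            p = table[ch] / sum(table.values())
            total_bits += -math.log2(p)                       # stand-in for step (4)
            if batch is None:
                table[ch] += 1                                # steps (5)-(6)
            else:
                pending[ch] += 1                              # only record the occurrence
        if batch is not None and number % batch == 0:         # step (9)
            for ch, count in pending.items():                 # step (10)
                table[ch] += count
            pending.clear()
    return total_bits

messages = ["aaaa", "aaab"]            # hypothetical messages over the set {a, b}
start = equal_table("ab")
per_char = encode_cost(messages, start)
per_message = encode_cost(messages, start, batch=1)
never = encode_cost(messages, start, batch=10)   # batch larger than the message count
print(f"{per_char:.2f} {per_message:.2f} {never:.2f}")
```

An empirical-value table (the first option in step (1)) can be passed in place of `equal_table(...)` as any dictionary of positive counts. In this small example the per-character mode gives the lowest cost and the table that is never updated gives the highest, which matches the trade-off between compression ratio and update effort described in step (5).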

The arithmetic coding in step (4) comprises the following steps:

Let the initial coding interval used by the arithmetic coding be [0,Max], where Max is the maximum value of the interval and is generally set to 0xFFFF. During encoding the interval is [Low,High] and its width is Range, where Low is the lower bound of the interval, initially 0, and High is the upper bound, initially Max. The character read is b_j, its frequency is P_bj, and its cumulative frequency is CumP_bj, that is, the sum of the frequencies of all symbols whose symbol value is less than that of this symbol.

(41) Initialization

Initialize the coding interval [0,Max] and build the frequency table;

(42) Read in character b_j

Read the characters of the message Message one by one. Let the character read be b_j, 1 ≤ j ≤ m, with probability P_bj;

(43) Update the interval

Update the interval [Low,High] from the current frequency table, P_bj, and CumP_bj, using the following formulas:

Range = High - Low + 1

High = Low + Range * (CumP_bj + P_bj) - 1

Low = Low + Range * CumP_bj

(44) Normalization

Check whether the interval [Low,High] satisfies the condition for continuing to encode. If it does, continue encoding; otherwise normalize the interval [Low,High];

(45) Decide whether to update the frequency table

If so, go to the next step (46); otherwise jump to step (47);

(46) Update the frequency table

Update the frequency P_bj of the encoded character and the corresponding cumulative frequencies CumP_bj, that is, update the frequency table;

(47) Check whether encoding has finished

If so, end this encoding run; otherwise jump to step (42) and encode the next character.
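The interval update of step (43) can be sketched with integer arithmetic. This is a minimal illustration that assumes the frequency table stores integer counts, so that P_bj corresponds to `freq / total` and CumP_bj to `cum / total`; the helper names and the sample counts are hypothetical.

```python
MAX = 0xFFFF  # initial upper bound of the interval, as suggested in the text

def cumulative(counts):
    """Map each symbol to (count of all smaller symbols, own count)."""
    table, running = {}, 0
    for symbol in sorted(counts):
        table[symbol] = (running, counts[symbol])
        running += counts[symbol]
    return table, running

def update_interval(low, high, cum, freq, total):
    """Apply the Range/High/Low formulas of step (43) with integer division."""
    rng = high - low + 1
    new_high = low + rng * (cum + freq) // total - 1
    new_low = low + rng * cum // total
    return new_low, new_high

table, total = cumulative({"a": 3, "b": 1})   # hypothetical counts
after_a = update_interval(0, MAX, *table["a"], total)
after_b = update_interval(0, MAX, *table["b"], total)
print(after_a, after_b)
```

Starting from [0, 0xFFFF], the symbol `a` (three quarters of the counts) keeps the lower three quarters of the interval and `b` keeps the top quarter, so the two sub-intervals are adjacent and do not overlap.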

In step (44), normalization of the interval [Low,High] is divided into the following three cases:

Case 1: the highest bit of the upper bound is 1 and its second-highest bit is 0, while the highest bit of the lower bound is 0 and its second-highest bit is 1. The second-highest bit is shifted out, that is, discarded, and the number of times the second-highest bit has been discarded is recorded in Case1Num;

Case 2: the highest bits of both the upper and lower bounds are 0. Both bounds are shifted left by 1 bit, 1 is added to the upper bound, and the shifted-out bit is appended to the output code stream. Case1Num is then checked: if it is not 0, the inverse of the highest bit, called Case1Bit, is written to the output code stream Case1Num times;

Case 3: the highest bits of both the upper and lower bounds are 1. Both bounds are shifted left by 1 bit, 1 is added to the upper bound, and the shifted-out bit is appended to the output code stream. Case1Num is then checked: if it is not 0, the inverse of the highest bit, called Case1Bit, is written to the output code stream Case1Num times.

The purpose of normalization is to prevent the interval from becoming so narrow as encoding proceeds that encoding and decoding errors occur.

Under the technical solution in this section, decoding is the inverse of encoding and is not described further.
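The three normalization cases can be exercised end to end with a small 16-bit coder. The sketch below follows the widely used finite-precision arithmetic-coding scheme rather than code from the patent, and it makes several assumptions: the frequency table is static and holds integer counts, the counter `pending` plays the role of Case1Num and is reset after its bits are written, the encoder ends by flushing enough bits to identify the interval, and the decoder is told the message length. All function names are illustrative.

```python
from collections import Counter

MAX, TOP, SECOND = 0xFFFF, 0x8000, 0x4000  # 16-bit interval, highest and second-highest bit

def cum_table(counts):
    table, running = {}, 0
    for sym in sorted(counts):
        table[sym] = (running, counts[sym])
        running += counts[sym]
    return table, running

def encode(message, counts):
    table, total = cum_table(counts)
    low, high, pending, bits = 0, MAX, 0, []

    def emit(bit):
        nonlocal pending
        bits.append(bit)
        bits.extend([1 - bit] * pending)   # Case1Num copies of the inverted bit
        pending = 0

    for sym in message:
        cum, freq = table[sym]
        rng = high - low + 1
        high = low + rng * (cum + freq) // total - 1
        low = low + rng * cum // total
        while True:
            if (high & TOP) == (low & TOP):                 # cases 2 and 3
                emit(high >> 15)
            elif (low & SECOND) and not (high & SECOND):    # case 1
                pending += 1
                low &= SECOND - 1                           # discard the second-highest bit
                high |= SECOND
            else:
                break
            low = (low << 1) & MAX
            high = ((high << 1) & MAX) | 1
    pending += 1                                            # final flush
    emit(1 if low & SECOND else 0)
    return bits

def decode(bits, counts, length):
    table, total = cum_table(counts)
    stream = iter(bits)
    low, high, code = 0, MAX, 0
    for _ in range(16):
        code = (code << 1) | next(stream, 0)                # pad with zeros past the end
    out = []
    for _ in range(length):
        rng = high - low + 1
        target = ((code - low + 1) * total - 1) // rng
        sym = next(s for s, (c, f) in table.items() if c <= target < c + f)
        out.append(sym)
        cum, freq = table[sym]
        high = low + rng * (cum + freq) // total - 1
        low = low + rng * cum // total
        while True:
            if (high & TOP) == (low & TOP):
                pass
            elif (low & SECOND) and not (high & SECOND):
                code ^= SECOND
                low &= SECOND - 1
                high |= SECOND
            else:
                break
            low = (low << 1) & MAX
            high = ((high << 1) & MAX) | 1
            code = ((code << 1) & MAX) | next(stream, 0)
    return "".join(out)

message = "$GPGLL,4250.5589,S,14718.5084,E,092204.999,A*2D"
counts = Counter(message)                  # static table built from the message itself
bits = encode(message, counts)
restored = decode(bits, counts, len(message))
print(len(bits), "bits for", 8 * len(message), "bits of input; round trip ok:", restored == message)
```

Building the table from the message itself is only a convenience for this demonstration; in the method described above the encoder and decoder would instead start from the same _Exper_Table or _EqualPro_Table. With 16-bit bounds the total count must stay well below 2^14 so that every symbol keeps a non-empty sub-interval.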

Beneficial effects: practical application and verification show that the invention has the following beneficial effects:

(1) Based on the characteristics of messages drawn from a finite character set, arithmetic coding is applied to their lossless compression, which makes full use of the advantages of arithmetic coding and gives a higher compression ratio and efficiency than compression methods such as BCD code and Huffman coding;

(2) Building an empirical-value frequency table allows a message to be compressed well even at the very start of the compression process;

(3) Two methods of dynamically updating the frequency table are introduced. Updating character by character during encoding takes full account of character probabilities and increases the message compression ratio as far as possible, while updating in units of several messages suits environments with limited computing resources.

Description of the drawings

Fig. 1 is a flowchart of an embodiment of the invention;

Fig. 2 is a flowchart of the arithmetic coding in the embodiment of the invention.

Detailed description of the embodiments

The invention is further explained below with reference to the drawings and specific embodiments. These embodiments serve only to illustrate the invention and do not limit its scope. After reading this disclosure, modifications of the invention in various equivalent forms by those skilled in the art all fall within the scope defined by the claims appended to this application.

As shown in Fig. 1, the scheme initializes the frequency table in the preprocessing step, using either an equal-probability static frequency table or an empirical-value static frequency table. For updating the frequency table, an optimized adaptive update method is provided, embodied in steps (6) and (10), with two modes. The first updates character by character during encoding: the frequency table is updated after each character of the message is arithmetically coded. Updating the frequency table consumes computation, so the first mode cannot be used when computing resources are limited. In that case the invention can use the second mode, in which the frequency table is updated in units of several messages: each message is arithmetically coded character by character while only the number of occurrences of each character is recorded, and the frequency table is updated from the recorded counts once the set number of messages has been encoded.

The variables used in this scheme are as follows:

① _Exper_Table: the static frequency table built from empirical values;

② _EqualPro_Table: the equal-probability static frequency table, in which every character of the character set has the same probability;

③ _Adapt_Table: the adaptive frequency table used during encoding.

Let the set of all possible characters in a character-type message be A, the number of elements in the set be n, and the probability of occurrence of character a_i be P_i. Then:

a_i ∈ A

The steps of the technical solution adopted by the invention are as follows; the flowchart is shown in Fig. 1:

(1) Preprocessing

When this message format is encoded for the first time, the frequency table _Adapt_Table must be initialized. There are two ways to do this. The first is to assign specific values to P_i based on the characteristics of the message's character set and the specific operating environment, which creates the empirical-value static frequency table _Exper_Table of the character set, and then to copy its values into _Adapt_Table. The second is to create the equal-probability static frequency table _EqualPro_Table, that is

$P_{i} = \frac{1}{n}$

and to assign it to _Adapt_Table. In actual use, the initialization method can be chosen according to specific needs;

(2) Receive a message

Let the message be Message, its character sequence be B, and the number of elements in the sequence be m, that is

b_j ∈ B, where 1 ≤ j ≤ m

(3) Read in a character

Read the characters of the message Message one by one. Let the character read be b_j, 1 ≤ j ≤ m, with probability P_bj;

(4) Arithmetic coding

Arithmetically code the character according to the current frequency table and the current character frequency P_bj;

(5) Whether to update the frequency table

Depending on actual requirements, the frequency table _Adapt_Table can be updated character by character during encoding, that is, the frequency table is updated after each character of the message Message is arithmetically coded. Alternatively, it can be updated in units of several messages: each message is arithmetically coded character by character while only the number of occurrences of each character is recorded, and the frequency table is updated from the recorded counts once the set number of messages has been encoded.

Specifically, if the frequency table _Adapt_Table needs to be updated, go to the next step (6); otherwise jump to step (7);

(6) Update the frequency table

(8) Whether there is a next message

(9) Whether to update the frequency table

When the second method of updating the frequency table _Adapt_Table is used, then after the message Message has been encoded, go to the next step if the frequency table is to be updated; otherwise jump to step (2), read the next message, and continue encoding;

(10) Update the frequency table

Specifically, update the frequency table _Adapt_Table using the recorded character occurrence counts;

(11) End

End this encoding run.

The detailed process of step (4), arithmetic coding, in the technical solution adopted by the invention is as follows; the flowchart is shown in Fig. 2:

(41) Initialization

Initialize the coding interval [0,Max], build the frequency table, and so on;

(42) Read in character b_j

(43) Update the interval

Range = High - Low + 1

High = Low + Range * (CumP_bj + P_bj) - 1

Low = Low + Range * CumP_bj

(44) Normalization

Check whether the interval [Low,High] satisfies the condition for continuing to encode. If it does, continue encoding; otherwise normalize the interval [Low,High], which is divided into the following three cases:

The purpose of normalization is to prevent the interval from becoming so narrow as encoding proceeds that encoding and decoding errors occur;

(45) Whether to update the frequency table

If so, go to the next step (46); otherwise jump to step (47);

(46) Update the frequency table

Update the frequency P_bj of the encoded character and the corresponding cumulative frequencies CumP_bj, that is, update the frequency table;

(47) Whether encoding has finished

The technical solution of the present invention is described in detail below, taking the NMEA-0183 format widely used for positioning information as an example; the scope of protection of the present invention is not limited to this embodiment.

Specifically, take the NMEA-0183 message format for geographic position information as an example, and assume that the message Message is "$GPGLL,4250.5589,S,14718.5084,E,092204.999,A*2D". The fields of the message are separated by commas, and each field carries the following information:

Field 0: $GPGLL, sentence ID, indicating that the sentence carries Geographic Position (GLL) information;

Field 1: latitude ddmm.mmmm, in degrees-and-minutes format (padded with leading zeros if there are too few digits);

Field 2: latitude hemisphere, N (north) or S (south);

Field 3: longitude dddmm.mmmm, in degrees-and-minutes format (padded with leading zeros if there are too few digits);

Field 4: longitude hemisphere, E (east) or W (west);

Field 5: UTC time, in hhmmss.sss format;

Field 6: status, A = position fixed, V = position not fixed;

Field 7: check value.
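To make the field layout concrete, the sample message can be split as in the following sketch. The helper name `split_gll` is hypothetical; the only structural assumption is that the check value follows the `*` after field 6, as it does in the sample.

```python
def split_gll(sentence):
    """Split a GLL sentence into fields 0..7 as listed above."""
    body, _, check_value = sentence.partition("*")   # field 7 follows '*'
    return body.split(",") + [check_value]


message = "$GPGLL,4250.5589,S,14718.5084,E,092204.999,A*2D"
fields = split_gll(message)
for index, value in enumerate(fields):
    print(f"Field {index}: {value}")

# The whole message draws on a small, finite set of characters,
# which is the property the compression method relies on.
alphabet = sorted(set(message))
print(len(alphabet), "distinct characters in", len(message), "characters")
```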

The specific encoding steps for this message are as follows:

(1) Preprocessing

Given that digits and commas occur frequently in the message character set, and taking the specific operating environment into account, create an empirical static frequency table _Exper_Table for the character set and initialize _Adapt_Table to _Exper_Table;
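One possible way to obtain such empirical values is sketched below: count characters over representative sample messages and add 1 to every count so that no character of the alphabet has zero frequency. The patent does not say how the empirical values are derived, so the counting approach, the alphabet string, and the function name `build_exper_table` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical alphabet for NMEA-style messages (assumption of this sketch).
ALPHABET = "$*,.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def build_exper_table(samples, alphabet=ALPHABET):
    """Return character -> count, with a floor of 1 for every character."""
    counts = Counter(ch for msg in samples for ch in msg)
    return {ch: counts[ch] + 1 for ch in alphabet}


samples = ["$GPGLL,4250.5589,S,14718.5084,E,092204.999,A*2D"]
exper_table = build_exper_table(samples)   # static table _Exper_Table
adapt_table = dict(exper_table)            # _Adapt_Table starts as a copy

print(exper_table[","], exper_table["Z"])
```

Copying the table keeps _Exper_Table unchanged while _Adapt_Table is updated during encoding.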

(2) Read in a character

Read the characters of the message Message one by one;

(3) Arithmetic coding

Arithmetically encode the character according to the current frequency table and the character's current frequency;

(4) Determine whether to update the frequency table

Depending on actual needs, the frequency table _Adapt_Table can be updated during encoding in either of two ways. It can be updated character by character, that is, after each character of the message Message is arithmetically encoded. Alternatively, it can be updated in units of several messages: after a single message is arithmetically encoded character by character, only the number of occurrences of each character is recorded, and the frequency table is updated from these records once the set number of messages has been encoded.

Specifically, if the frequency table _Adapt_Table needs to be updated, execute the next step (5); otherwise jump to step (6);
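The two update strategies can be contrasted in a short sketch. The class name `AdaptiveUpdater`, the `batch_size` parameter (0 selects per-character updating), and the callback names are hypothetical; the table is assumed to hold integer counts.

```python
from collections import Counter


class AdaptiveUpdater:
    """Applies either per-character or per-batch updates to a count table."""

    def __init__(self, table, batch_size=0):
        self.table = table            # the adaptive table, updated in place
        self.batch_size = batch_size  # 0: update after every character
        self.recorded = Counter()     # occurrences recorded but not yet applied
        self.messages_done = 0

    def on_char(self, ch):
        if self.batch_size == 0:
            self.table[ch] += 1       # character-by-character update
        else:
            self.recorded[ch] += 1    # only record the occurrence

    def on_message_end(self):
        if self.batch_size == 0:
            return
        self.messages_done += 1
        if self.messages_done == self.batch_size:
            for ch, n in self.recorded.items():
                self.table[ch] += n   # apply the recorded counts in one pass
            self.recorded.clear()
            self.messages_done = 0


table = {"a": 1, "b": 1}
updater = AdaptiveUpdater(table, batch_size=2)
for message in ["ab", "aa"]:
    for ch in message:
        updater.on_char(ch)
    updater.on_message_end()
print(table)
```

In batch mode the per-character work is a single counter increment, which is the saving that matters when computing resources are limited; a decoder would have to apply the same schedule so that both sides use identical tables.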

(5) Update the frequency table

Update the frequency of the character just encoded, and thereby the frequency table _Adapt_Table;

(6) Determine whether encoding of this message is finished

If encoding of the message Message is not finished, jump to step (2) and continue with the next character; otherwise execute the next step (7);

(7) End

End this encoding.
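Steps (1) to (7) can be put together in a compact sketch that encodes the sample message with character-by-character updating and decodes it again. This is an illustration in the style of a conventional 16-bit integer arithmetic coder, not the patented implementation: the integer counts, the final flush bits, passing the message length to the decoder, and the names `narrow`, `encode`, and `decode` are all assumptions. The total count must stay well below 2^14 for the 16-bit interval to remain valid.

```python
TOP, SECOND, MASK = 0x8000, 0x4000, 0xFFFF  # highest bit, second-highest bit, Max


def narrow(low, high, table, ch):
    """Interval update for ch, using integer counts from table."""
    total = sum(table.values())
    cum = sum(n for s, n in table.items() if s < ch)   # cumulative count
    rng = high - low + 1
    return low + rng * cum // total, low + rng * (cum + table[ch]) // total - 1


def encode(message, table, adaptive=True):
    table = dict(table)                       # working copy of the table
    low, high, pending, out = 0, MASK, 0, []
    for ch in message:
        low, high = narrow(low, high, table, ch)
        while True:                           # normalization
            if (low & TOP) == (high & TOP):   # highest bits equal
                bit = high >> 15
                out.append(bit)
                out.extend([1 - bit] * pending)
                pending = 0
                low, high = (low << 1) & MASK, ((high << 1) & MASK) | 1
            elif (low & SECOND) and not (high & SECOND):   # low=01.., high=10..
                pending += 1
                low, high = (low - SECOND) << 1, ((high - SECOND) << 1) | 1
            else:
                break
        if adaptive:
            table[ch] += 1                    # character-by-character update
    pending += 1                              # flush (assumption of this sketch)
    bit = 1 if low & SECOND else 0
    out.append(bit)
    out.extend([1 - bit] * pending)
    return out


def decode(bits, length, table, adaptive=True):
    table = dict(table)
    symbols = sorted(table)
    stream = iter(bits)
    value = 0
    for _ in range(16):
        value = (value << 1) | next(stream, 0)   # pad with zeros past the end
    low, high, chars = 0, MASK, []
    for _ in range(length):
        total = sum(table.values())
        target = ((value - low + 1) * total - 1) // (high - low + 1)
        cum = 0
        for ch in symbols:                    # find the character owning target
            if cum + table[ch] > target:
                break
            cum += table[ch]
        chars.append(ch)
        low, high = narrow(low, high, table, ch)
        while True:                           # mirror the encoder's rescaling
            if (low & TOP) == (high & TOP):
                low, high = (low << 1) & MASK, ((high << 1) & MASK) | 1
                value = ((value << 1) & MASK) | next(stream, 0)
            elif (low & SECOND) and not (high & SECOND):
                low, high = (low - SECOND) << 1, ((high - SECOND) << 1) | 1
                value = ((value - SECOND) << 1) | next(stream, 0)
            else:
                break
        if adaptive:
            table[ch] += 1
    return "".join(chars)


message = "$GPGLL,4250.5589,S,14718.5084,E,092204.999,A*2D"
start_table = {ch: 1 for ch in set(message)}   # equal counts, for the demo only
bits = encode(message, start_table)
print(len(bits), "bits for", 8 * len(message), "input bits")
print(decode(bits, len(message), start_table) == message)
```

The decoder repeats the encoder's table updates in the same order, which is why both sides must start from the same initial table.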

Claims

1. A character-type message compression method, characterized in that it comprises the following steps:

Assume that the character set of the character-type message is A, the number of characters is n, and the character probability is P_i, where a_i ∈ A and 1 ≤ i ≤ n;

(1) Preprocessing

When the character-type message format is used for encoding for the first time, a frequency table must be initialized and assigned to the frequency table _Adapt_Table;

(2) Receive a message

Assume that a received message is Message, its character sequence is B, and the number of sequence elements is m, that is, b_j ∈ A, where 1 ≤ j ≤ m;

(3) Read characters

Read each character of the received message Message one by one, assuming that the character read is b_j, 1 ≤ j ≤ m, and its probability is P_bj;

(4) Arithmetic coding

Arithmetically encode the character according to the current frequency table _Adapt_Table and the character's current frequency P_bj;

The specific steps of arithmetic coding are:

Assume that the initial coding interval used in arithmetic coding is [0, Max], where Max is the maximum value of the interval and is set to 0xFFFF. The interval during encoding is [Low, High] and its width is Range, where Low is the lower edge of the interval, initially 0, and High is the upper edge of the interval, initially Max. The character read is b_j, its frequency is P_bj, and its cumulative frequency is CumP_bj, that is, the sum of the frequencies of all symbols whose value is less than that of b_j;

(41) Initialization

Initialize the coding interval [0,Max], and establish a frequency table;

(42) Read in character b_j

Read each character of the message Message one by one, assuming that the character read is b_j, 1 ≤ j ≤ m, and its probability is P_bj;

(43) Update interval

According to the current frequency table, P_bj, and CumP_bj, update the interval [Low, High] using the following formulas:

Range = High - Low + 1

High = Low + Range * (CumP_bj + P_bj) - 1

Low = Low + Range * CumP_bj

(44) Normalization

Check whether the interval [Low, High] satisfies the condition for continued encoding. If it does, continue encoding; otherwise normalize the interval [Low, High];

Normalization of the interval [Low, High] distinguishes the following three cases:

Case 1: The highest bit of the upper edge of the interval is 1 and its second-highest bit is 0, while the highest bit of the lower edge is 0 and its second-highest bit is 1. The second-highest bit is removed, that is, ignored, and the number of ignored second-highest bits is recorded in Case1Num;

Case 2: The highest bits of the upper and lower edges of the interval are both 0. Shift the upper and lower edges left by 1 bit, add 1 to the upper edge, and append the bit shifted out to the output code stream. Then check whether Case1Num is 0; if it is not, denote the inverse of the highest bit as Case1Bit and output Case1Num copies of Case1Bit to the output code stream;

Case 3: The highest bits of the upper and lower edges of the interval are both 1. Shift the upper and lower edges left by 1 bit, add 1 to the upper edge, and append the bit shifted out to the output code stream. Then check whether Case1Num is 0; if it is not, denote the inverse of the highest bit as Case1Bit and output Case1Num copies of Case1Bit to the output code stream;

(45) Determine whether to update the frequency table

If so, then execute the next step (46), otherwise jump to step (47);

(46) Update frequency table

Update the frequency P_bj of the encoded character and the corresponding cumulative frequency CumP_bj, that is, update the frequency table;

(47) Determine whether encoding is finished

If so, end this encoding; otherwise jump to step (42) and continue with the next character;

(5) Determine whether to update the frequency table

Depending on actual needs, the frequency table _Adapt_Table is updated during encoding either character by character, that is, after each character of the message Message is arithmetically encoded, or in units of several messages, that is, after a single message is arithmetically encoded character by character only the number of occurrences of each character is recorded, and the frequency table is updated from these records once the set number of messages has been encoded. If the frequency table _Adapt_Table needs to be updated, execute the next step (6); otherwise jump to step (7);

(6) Update frequency table

Update the frequency table _Adapt_Table by updating the frequency P_bj of the character b_j;

(7) Determine whether encoding of this message is finished

If encoding of the message Message is not finished, jump to step (3) and continue with the next character; otherwise execute the next step (8);

(8) Determine whether there is a next message

If so, execute step (9), otherwise execute step (11), that is, end this encoding;

(9) Determine whether to update the frequency table

Where the frequency table _Adapt_Table is updated in units of several messages, after the message Message has been encoded: if the frequency table is to be updated, execute the next step (10); otherwise jump to step (2), read the next message, and continue encoding;

(10) Update the frequency table

Use the recorded numbers of character occurrences to update the frequency table _Adapt_Table;

(11) End

End this encoding.

2. The character-type message compression method as claimed in claim 1, characterized in that the frequency table is initialized in one of two ways, by empirical values or by equal probability. In the empirical-value approach, specific values are assigned to P_i according to the character-set characteristics of the message and the specific operating environment, thereby creating a static frequency table _Exper_Table for the character set, which is assigned to the frequency table _Adapt_Table. In the equal-probability approach, an equal-probability static frequency table _EqualPro_Table is created, namely P_i = 1/n for 1 ≤ i ≤ n, and assigned to the frequency table _Adapt_Table.
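For illustration only, the equal-probability initialization can be sketched as follows, assuming that each of the n characters receives the same probability 1/n. The function name `build_equalpro_table` and the four-character alphabet are hypothetical, and exact fractions are used here merely to make the sum easy to check.

```python
from fractions import Fraction


def build_equalpro_table(alphabet):
    """Return character -> probability, with every character equally likely."""
    characters = sorted(set(alphabet))
    n = len(characters)
    return {ch: Fraction(1, n) for ch in characters}


equalpro_table = build_equalpro_table("ABCD")   # hypothetical 4-character set
adapt_table = dict(equalpro_table)              # assigned to the adaptive table
print(adapt_table["A"], sum(adapt_table.values()))
```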

3. The character-type message compression method as claimed in claim 1, characterized in that decoding is the inverse process of encoding.