WO2018107579A1 - Program counter compression method and hardware circuit thereof - Google Patents

Program counter compression method and hardware circuit thereof

Info

Publication number
WO2018107579A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
data
dictionary
program counter
compression
Prior art date
Application number
PCT/CN2017/073927
Other languages
English (en)
French (fr)
Inventor
张多利
张斌
宋宇鲲
卫灿
Original Assignee
合肥工业大学 (Hefei University of Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-12-13
Filing date
2017-02-17
Publication date
2018-06-21
Application filed by 合肥工业大学 (Hefei University of Technology)
Priority to US15/579,827 (patent US10277246B2)
Publication of WO2018107579A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3088 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing the use of a dictionary, e.g. LZ78

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/28 Error detection; Error correction; Monitoring by checking the correct order of processing

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/362 Software debugging
    • G06F11/3636 Software debugging by tracing the execution of the program

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3692 Test management for test results analysis

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • The present invention relates to the field of electronic circuits, and in particular to a program counter compression method and a hardware circuit thereof.
  • There are two common ways to collect program execution path information. The first simulates program execution in a simulator; it is simple to implement, but the slow simulation speed limits the efficiency of information collection.
  • The second embeds a hardware module in the processor that records path information in real time; it collects information quickly, but on-chip memory capacity and data transmission bandwidth limit the amount of information that can be recorded.
  • An efficient program counter compression method is therefore needed to reduce the amount of valid data that must be recorded.
  • Because the program counter compression unit is a non-core component of the processor, its hardware resource consumption and power consumption should be as low as possible so as not to affect the overall area and performance of the processor. It is therefore of considerable research value to explore an efficient program counter compression scheme and hardware circuit that improve the compression ratio of the program counter while reducing the resource consumption and power consumption of the hardware module.
  • Existing program counter compression methods approach the problem from two angles.
  • The first is architecture-based compression, for example recording only jump instruction information during program execution and leaving sequential instruction information unprocessed. The second is general-purpose lossless data compression, such as differential encoding and dictionary encoding. Most existing techniques do not combine these two angles effectively, so the compression effect still has room for improvement.
  • In architecture-based compression, existing research has paid little attention to blocking instructions, which limits compression in some specific domains.
  • Dictionary encoding is widely used to compress program counters, but most of the hardware implementations proposed by existing research need considerable hardware resources, in particular many registers, and also consume more power, which limits the practical use of this method.
  • To address these problems, the present invention proposes a new program counter compression method and a corresponding hardware implementation circuit, also called a compression device.
  • In one aspect, the present invention provides a program counter compression method, characterized in that the compression method comprises the following steps:
  • Step (1): acquiring the execution status of the instructions issued by the processor, and classifying and filtering the instructions based on their execution status;
  • Step (2): based on the classification result, performing differential processing on the program counter values and blocking periods of the target instruction types, and slicing the resulting difference values to obtain the corresponding valid data segments;
  • Step (3): performing dictionary encoding on the valid data segments of the differential slices obtained in step (2).
  • The program counter compression method divides instructions into three types: (1) sequential execution, (2) jump, and (3) blocking.
  • Step (1) comprises:
  • Step (1.1): obtaining the program counter value of each instruction;
  • Step (1.2): computing the difference PC_diff = PC - PC_pre between the program counter values of any two consecutive instructions, where PC is the program counter value of the current cycle and PC_pre that of the previous cycle;
  • Step (1.3): classifying the instruction based on PC_diff.
  • Step (2) comprises:
  • Step (2.1): differencing the recorded program counter values of the blocking and jump instructions and the corresponding instruction durations (in cycles) to obtain the corresponding difference values;
  • Step (2.2): dividing each difference value into several data segments in bit order, each segment containing the same number of bits;
  • Step (2.3): checking, from left to right, whether the bits in each data segment are all 0 or all 1;
  • Step (2.4): if the bits in a data segment are all 0 or all 1, discarding that segment; otherwise, passing that segment and all lower-order segments to step (3).
  • Step (3) comprises:
  • Step (3.1): receiving the data segments transmitted in step (2), each data segment containing several source data;
  • Step (3.2): constructing a dictionary in memory, the dictionary including several elements;
  • Step (3.3): for each source datum, searching the elements of the dictionary in turn. If the current source datum matches an element, the position of that element is recorded and the dictionary is then updated with the source datum; if no element matches, the dictionary is updated directly with the source datum. During an update the dictionary does not slide: the source data overwrite the dictionary elements in order from left to right, and after each update the search start position is incremented by 1.
  • Step (3) further comprises: splitting the dictionary window and reading N source data at a time, where N is a positive integer greater than or equal to 2; starting the search for one source datum per cycle, with each source datum's search start position one greater than that of the previous one; obtaining the search match result for each source datum; and determining the actual match result from the match results of all N source data.
  • In another aspect, the present invention provides a program counter compression apparatus, characterized in that the compression apparatus comprises:
  • an instruction classification module configured to acquire the execution status of the instructions issued by a processor and to classify and filter the instructions based on their execution status;
  • a data differential slicing module configured to perform differential processing on the program counter values and blocking periods of the target instruction types based on the classification result, and to slice the resulting difference values;
  • a dictionary encoding module configured to build the dictionary in a RAM of depth N and to perform LZ dictionary encoding on the valid differential slice data segments.
  • The compression apparatus is configured to perform the method described above.
  • The invention is mainly used for the control part
  • The invention mainly targets a "microcontroller + coprocessor" processor architecture: the software program runs on the microcontroller, which sends control commands to the coprocessor according to the execution results, and the coprocessor, as the main computing unit, receives the configuration and performs the specific operations. Software for processors of this architecture contains a large number of blocking instructions. If an instruction sends an operation control command to the coprocessor, the next instruction is a blocking instruction, and execution stays on that instruction until the coprocessor completes the operation. This ensures that the coprocessor follows the unified scheduling of the microcontroller and that the various operations execute in order.
  • The present invention effectively combines architecture-based and non-architecture-based compression
  • and proposes a three-stage compression scheme for the program counter that greatly increases its compression ratio;
  • The present invention proposes a memory-based hardware structure for dictionary encoding that significantly reduces the number of registers used, or even uses none, lowering the resource consumption and power consumption of the dictionary encoding hardware module.
  • Figure 1 is a schematic flow chart of the program counter compression method.
  • Figure 3 shows the hardware circuit for differential slicing according to the present invention.
  • Figure 5 is a hardware structure diagram of dictionary compression.
  • Figure 6 is a schematic diagram of the dictionary window update.
  • PC_classsify, Diff_encode and LZ_encode in Figure 1 denote the three compression steps of classification filtering, differential slicing and dictionary encoding, respectively.
  • PC and PC_pre denote the program counter values of the current cycle and the previous cycle.
  • Jump_PC and Stall_PC denote the program counter values of the jump instruction and the blocking instruction.
  • Stall_len denotes the number of blocking cycles of the blocking instruction.
  • Data_slice denotes the valid data segments after differential slicing.
  • The program counter compression method of the present invention consists of three steps: (1) classification filtering, (2) differential slicing, and (3) dictionary encoding.
  • The implementation of this embodiment is described in detail below for each of these three steps.
  • The present invention divides the instructions in the processor GFP into three types: (1) sequential execution, (2) jump, and (3) blocking.
  • The execution of instructions in GFP can be reconstructed by recording only the program counter values of the latter two types.
  • PC_diff = 0: blocking instruction; record its program counter value and the number of blocking cycles.
  • PC_diff ≠ 0 and PC_diff ≠ 1: jump instruction; record the branch address and the target address.
  • This classification method fully accounts for the three kinds of instructions in processors with a "microcontroller + coprocessor" structure (sequential, blocking and jump instructions) and minimizes the amount of valid data that must be recorded.
  • Most existing techniques do not consider blocking instructions, so there is still room to improve the compression ratio.
  • The difference value diff is then sliced: it is divided into several data segments in bit order, each segment containing the same number of bits. Figure 2 is a schematic diagram of differential slicing: a 32-bit difference is divided into 8 data segments of 4 bits each, and the 4 segments corresponding to the low 16 bits are the valid data segments.
  • The data segment selection rule is as follows: the segments are checked in order from the most significant to the least significant. If all bits in a segment are 0 (positive number) or all 1 (negative number), the next segment is checked; otherwise, this segment and every segment down to the lowest one are valid data segments and are passed to the next compression step.
  • For example, the 32-bit difference 0000_0000_0000_0000_0000_0011_0000_0000 contains the three valid data segments 0011, 0000 and 0000.
  • Compared with classic differential encoding (valid difference field + difference bit width), the main advantages are: (1) there is no need to compute the bit width of the difference, saving time and resources; (2) the slice width is fixed, saving hardware resources in the subsequent dictionary encoding.
  • Dictionary encoding requires a large number of comparators.
  • The comparator bit width equals the output bit width of the differential encoder.
  • The classic encoding method must support a valid bit width of 32 bits.
  • In the prior art, slicing is rarely used in differential encoding; most methods use the valid difference field + difference bit width representation.
  • The inventors of the present application have found through research that the slicing approach to differential encoding of the present invention effectively reduces computation and saves hardware resources.
  • LZ dictionary encoding is performed on the valid differential slice fields recorded in step 2.
  • LZ dictionary encoding is a classic lossless data compression algorithm. Its main principle is that a dynamically changing dictionary window is maintained during encoding; when a run of data read for compression is identical to some data segment in the dictionary, the run is represented by the starting position of that segment and the match length.
  • Figure 4 is an example of LZ dictionary encoding as used in the prior art.
  • The 8 small squares on the left form the sliding dictionary window (Dictionary).
  • The 7 small squares in the middle represent the source data (Src) to be compressed.
  • The value of variable M indicates whether the current source datum matches the dictionary: 1 means the match succeeded and 0 means it failed.
  • Variables MP and ML represent the match position and match length of the source data in the dictionary window. M, MP and ML change in real time as the source data and the dictionary change. Here the dictionary window is assumed to contain 8 data buffer units initialized to 0, 1, 2, ..., 7, and the data to be compressed are 1, 2, 3, 5, 6, 7.
  • Datum 1 is matched against the dictionary contents; the match succeeds at position 1, and the dictionary is then updated by shifting its data left by one buffer unit. Data 2 and 3 are likewise found to match the data in the buffer unit at position 1 of the dictionary, until 5 fails to match, at which point the first compression result of the dictionary encoding, (1, 3, 5), is output: match position 1, match length 3, and match-terminating datum 5.
  • The remaining source data are compressed by the same principle.
  • The invention improves on the LZ dictionary encoding principle to achieve higher execution efficiency and a larger compression ratio.
  • A RAM of depth N is used as the dictionary.
  • Counter cnt1 generates the dictionary address addr and the match position MP.
  • Counter cnt2 generates the match length ML.
  • Comparator CP completes the comparison over multiple cycles to obtain the match result M.
  • The compression process is explained taking Figure 6 as an example.
  • The dictionary window contains four storage units initialized to D0, D1, D2 and D3, and the source data to be compressed are s0, s1, s2 and s3.
  • When the dictionary is updated, D0, D1, D2 and D3 are replaced by s0, s1, s2 and s3 from left to right.
  • When searching for a match, successive searches start from the data in the 1st, 2nd, 3rd and 4th buffer units of the dictionary window in turn.
  • The dictionary update strategy is that the dictionary does not slide: the source data overwrite the dictionary elements from left to right, and the search start position does not return to 0 but is incremented by 1 each time. This shortens the number of cycles needed for a dictionary update from 2(N-1)+1 to 1 and greatly reduces power consumption.
  • The dictionary window can be split. As shown in Figure 7, it can be divided into two, allowing parallel search in the two dictionary windows.
  • A1, A2, A3 and A4 denote the 4 source data read in at once.
  • P0 denotes the initial match position.
  • P1, P2, P3 and P4 denote the match positions of the 4 source data in the dictionary window.
  • Mx_Py denotes the result of matching the x-th source datum against the y-th match position.
  • M1_P0_1 indicates that the first source datum matches the dictionary element at position P0.
  • M1_P0_0 indicates that the first source datum does not match the dictionary element at position P0.
  • eof indicates that the search over all elements in the dictionary has ended.
  • The output is produced after all full searches have finished: first, the actual match status of A1, A2, A3 and A4 is determined from the separate match values obtained during the full search.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

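A program counter compression method and a hardware circuit thereof. The method comprises the following steps: step (1), acquiring the execution status of instructions issued by a processor and classifying and filtering the instructions based on their execution status; step (2), based on the classification result, performing differential processing on the instruction counter values and blocking periods of the target classes, and slicing the resulting difference values; step (3), performing LZ dictionary encoding on the valid differential slice fields recorded in step (2). The method effectively combines architecture-based and non-architecture-based compression; by organizing and applying classification filtering, differential encoding and dictionary compression, it provides a three-stage compression scheme for the program counter that greatly improves the compression ratio of the program counter.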

Description

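Program counter compression method and hardware circuit thereof
Related Application
This application claims priority to Chinese invention patent application No. 201611143794.2, entitled "Program counter compression method and hardware circuit thereof" and filed on December 13, 2016.
Technical Field
The present invention relates to the field of electronic circuits, and in particular to a program counter compression method and a hardware circuit thereof.
Background
As software in processors grows more complex, debugging software and analyzing its execution become increasingly important, and collecting program execution path information is of great significance. There are two common methods for collecting path information. The first simulates program execution with a simulator; it is simple to implement, but the slow simulation speed limits the efficiency of information collection. The second embeds a hardware module in the processor to record path information in real time; it collects information quickly, but on-chip memory capacity and data transmission bandwidth limit the amount of information that can be recorded. To improve collection efficiency while relaxing the hardware requirements, an efficient program counter compression method is needed to reduce the amount of valid data that must be recorded.
Because the program counter compression unit is a non-core component of the processor, its hardware resource consumption and power consumption should be as low as possible so as not to affect the overall area and performance of the processor. It is therefore of considerable research value to explore an efficient program counter compression scheme and hardware circuit that improve the compression ratio of the program counter while reducing the resource consumption and power consumption of the hardware module.
Existing program counter compression methods are mainly proposed from two angles. The first is architecture-based compression, for example recording only jump instruction information during program execution and not processing sequential instruction information. The second is general-purpose lossless data compression, such as differential encoding and dictionary encoding. Most existing techniques do not combine these two angles effectively, so there is still room to improve compression. In architecture-based compression, existing research has paid little attention to blocking instructions, which affects compression in some specific domains. Dictionary encoding is widely used for program counter compression, but most of the hardware implementations proposed by existing research require considerable hardware resources, in particular many registers, and consume more power, which limits the practical use of this method.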
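Summary of the Invention
To address the above problems, the present invention proposes a new program counter compression method and a corresponding hardware implementation circuit, also called a compression device.
Specifically, in one aspect, the present invention provides a program counter compression method, characterized in that the compression method comprises the following steps:
Step (1): acquiring the execution status of instructions issued by a processor, and classifying and filtering the instructions based on their execution status;
Step (2): based on the classification result, performing differential processing on the program counter values and blocking periods of the target instruction types, and slicing the resulting difference values to obtain the corresponding valid data segments;
Step (3): performing dictionary encoding on the valid data segments of the differential slices obtained in step (2).
Preferably, the program counter compression method divides instructions into (1) sequential execution, (2) jump, and (3) blocking types;
step (1) comprises:
Step (1.1): obtaining the program counter value corresponding to each instruction;
Step (1.2): computing the difference between the program counter values of any two consecutive instructions, PC_diff = PC - PC_pre, where PC is the program counter value of the current cycle and PC_pre is the program counter value of the previous cycle;
Step (1.3): classifying the instruction based on the difference PC_diff between the program counter values of any two consecutive instructions:
(a) if PC_diff = 1, the instruction is classified as a sequential instruction and is not recorded;
(b) if PC_diff = 0, the instruction is classified as a blocking instruction, and its program counter value and number of blocking cycles are recorded;
(c) if PC_diff ≠ 0 and PC_diff ≠ 1, the instruction is classified as a jump instruction, and the program counter values of its branch address and target address are recorded.
Preferably, step (2) comprises:
Step (2.1): differencing the recorded program counter values of the blocking and jump instructions and the corresponding instruction durations in cycles to obtain the corresponding difference values;
Step (2.2): dividing each difference value into several data segments in bit order, each data segment containing the same number of bits;
Step (2.3): determining, from left to right, whether the bits in each data segment are all 0 or all 1;
Step (2.4): if the bits in a data segment are all 0 or all 1, discarding that data segment; otherwise, transmitting that data segment and all of its lower-order data segments to step (3).
Preferably, step (3) comprises:
Step (3.1): receiving the data segments transmitted in step (2), each data segment containing several source data;
Step (3.2): constructing a dictionary in memory, the dictionary including several elements;
Step (3.3): for each source datum, searching the elements of the dictionary in turn; if the current source datum matches an element, recording that element's position in the dictionary and then updating the dictionary with the source datum; if no element matches, updating the dictionary directly with the source datum. During the update the dictionary does not slide: the source data overwrite the dictionary elements in order from left to right, and after each update the search start position is incremented by 1.
Preferably, step (3) further comprises:
splitting the dictionary window and reading N source data at a time, N being a positive integer greater than or equal to 2; starting the search for one source datum per cycle, with the search start position of each source datum one greater than that of the previous one; obtaining the search match result for each source datum; and determining the actual match result from the match results of all N source data.
In another aspect, the present invention provides a program counter compression apparatus, characterized in that the compression apparatus comprises:
an instruction classification module, a data differential slicing module and a dictionary encoding module, wherein the instruction classification module is configured to acquire the execution status of instructions issued by a processor and to classify and filter the instructions based on their execution status;
the data differential slicing module is configured to perform differential processing on the program counter values and blocking periods of the target instruction types based on the classification result, and to slice the resulting difference values;
the dictionary encoding module is configured to build the dictionary in a RAM of depth N and to perform LZ dictionary encoding on the valid differential slice data segments.
Preferably, the compression apparatus is configured to perform the method described above.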
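The present invention is mainly used for the control part
The present invention mainly targets a "microcontroller + coprocessor" processor architecture: the software program runs on the microcontroller, which sends control commands to the coprocessor according to the execution results, and the coprocessor, as the main computing unit, receives the configuration and performs the specific operations. Software for processors based on this architecture contains a large number of blocking instructions. If an instruction sends an operation control command to the coprocessor, the next instruction is a blocking instruction, and execution stays on that instruction until the coprocessor finishes the operation. This ensures that the coprocessor follows the unified scheduling of the microcontroller and that the various operations execute in order.
Compared with existing research, the program counter compression scheme and hardware circuit of the present invention have the following advantages:
(1) The invention effectively combines architecture-based and non-architecture-based compression; by organizing and applying classification filtering, differential encoding and dictionary compression, it provides a three-stage compression scheme for the program counter that greatly improves its compression ratio;
(2) In the classification step, blocking instructions are handled explicitly, so the compression is especially effective for processors with a "microcontroller + coprocessor" structure;
(3) In the differential encoding step, slicing the difference values both reduces their effective bit width and fixes the bit width of each slice, saving resources in the subsequent dictionary encoding;
(4) The invention provides a memory-based hardware structure for dictionary encoding that significantly reduces the number of registers used, or even uses none, lowering the resource consumption and power consumption of the dictionary encoding hardware module.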
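Brief Description of the Drawings
Figure 1 is a schematic flow chart of the program counter compression method;
Figure 2 is a schematic diagram of differential slicing in the method of the present invention;
Figure 3 shows the hardware circuit for differential slicing of the present invention;
Figure 4 is an example of LZ dictionary encoding;
Figure 5 is a hardware structure diagram of dictionary compression;
Figure 6 is a schematic diagram of the dictionary window update;
Figure 7 is a hardware structure diagram of dual-RAM parallel dictionary compression;
Figure 8 shows the distribution of match cases in multi-source parallel compression.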
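Detailed Description of Embodiments
The present invention is described in detail below with reference to the drawings and embodiments, which do not thereby limit the scope of protection of the invention to the scope described in the embodiments.
Figure 1 is the flow chart of the three-stage compression scheme. In Figure 1, PC_classsify, Diff_encode and LZ_encode denote the three compression steps of classification filtering, differential slicing and dictionary encoding, respectively; PC and PC_pre denote the program counter values of the current and previous cycles; Jump_PC and Stall_PC denote the program counter values of jump and blocking instructions; Stall_len denotes the number of blocking cycles of a blocking instruction; and Data_slice denotes the valid data segments after differential slicing.
As shown in Figure 1, the program counter compression method of the present invention consists of 3 steps: (1) classification filtering, (2) differential slicing, and (3) dictionary encoding. The implementation of this embodiment is described in detail below for each of these three steps.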
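1. Classification filtering
The present invention divides the instructions in the processor GFP into 3 types: (1) sequential execution, (2) jump, and (3) blocking. In the online debugging design, the execution of the instructions in GFP can be reconstructed by recording only the program counter values of the latter two types. Instructions are classified by computing the difference between the program counter values of two consecutive instructions. Let PC be the program counter value of the current cycle and PC_pre that of the previous cycle, with difference PC_diff = PC - PC_pre:
(1) PC_diff = 1: sequential instruction, not recorded
(2) PC_diff = 0: blocking instruction; record its program counter value and the number of blocking cycles
(3) PC_diff ≠ 0 and PC_diff ≠ 1: jump instruction; record the branch address and the target address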
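The three-way rule can be modeled in a few lines of Python. This is a minimal sketch, not the hardware of Figure 1: it assumes the PC is sampled once per cycle and advances by exactly 1 per sequential instruction (as the PC_diff = 1 rule implies), and it groups consecutive PC_diff = 0 cycles into a single (Stall_PC, Stall_len) record, which is an assumption about how the blocking period is counted. The function name `classify_pcs` and the record tuples are hypothetical.

```python
def classify_pcs(pcs):
    """Classify a per-cycle PC trace into jump and stall records.

    Hypothetical sketch: sequential steps (PC_diff == 1) are dropped,
    a run of PC_diff == 0 becomes one ("stall", Stall_PC, Stall_len) record,
    and any other difference becomes ("jump", branch_pc, target_pc).
    """
    records = []
    stall_pc, stall_len = None, 0
    for pc_pre, pc in zip(pcs, pcs[1:]):
        pc_diff = pc - pc_pre
        if pc_diff == 0:
            # Blocking: the PC stays on the same instruction; count the cycle.
            if stall_len == 0:
                stall_pc = pc
            stall_len += 1
            continue
        if stall_len:
            # The stall has ended; emit it before handling the new step.
            records.append(("stall", stall_pc, stall_len))
            stall_len = 0
        if pc_diff != 1:
            # Jump (forward or backward): record branch and target addresses.
            records.append(("jump", pc_pre, pc))
        # pc_diff == 1: sequential instruction, nothing is recorded.
    if stall_len:
        records.append(("stall", stall_pc, stall_len))
    return records


print(classify_pcs([0, 1, 2, 2, 2, 3, 10, 11, 4]))
# [('stall', 2, 2), ('jump', 3, 10), ('jump', 11, 4)]
```

Only the stall and jump records survive this stage, which is what keeps the data handed to the differential step small.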
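This classification method fully accounts for the three kinds of instructions present in processors with a "microcontroller + coprocessor" structure, namely sequential, blocking and jump instructions, and minimizes the amount of valid data that must be recorded. Most existing techniques do not consider blocking instructions, leaving considerable room to improve the compression ratio.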
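2. Differential slicing
Differential encoding is applied to the program counter values of the two instruction types recorded in the first step and to the duration (in cycles) of the blocking instructions. Let D1 and D2 be two valid data recorded consecutively in the classification step; the difference is then diff = D1 - D2. The difference diff is then sliced: it is divided into several data segments in bit order, each segment containing the same number of bits. Figure 2 is a schematic diagram of differential slicing, in which a 32-bit difference is divided into 8 data segments of 4 bits each, and the 4 segments corresponding to the low 16 bits are the valid data segments. As shown in Figure 3, transmission to the next compression step (dictionary encoding) starts only at a data segment whose bits are neither all 0 nor all 1. The selection rule is as follows: the segments are checked in order from the most significant to the least significant; if all bits in a segment are 0 (positive number) or all 1 (negative number), the next segment is checked; otherwise, this segment and every segment down to the lowest one are valid data segments and must all be transmitted to the next compression step. For example, the 32-bit difference 0000_0000_0000_0000_0000_0011_0000_0000 contains the 3 valid data segments 0011, 0000 and 0000.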
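A software model of the segment selection rule is sketched below. It assumes a 32-bit two's-complement difference and 4-bit segments, as in the Figure 2 example, and reads "all 0 (positive number) or all 1 (negative number)" as dropping only the leading segments that repeat the sign of the difference. The name `slice_diff` is hypothetical, and how a decoder restores the dropped bits is not specified in the text.

```python
def slice_diff(diff, width=32, seg_bits=4):
    """Return the valid segments of a signed difference, most significant first.

    Hypothetical sketch of the selection rule: leading segments that are
    all 0 (non-negative diff) or all 1 (negative diff) are discarded; the
    first segment that differs, and every lower segment, are kept.
    """
    seg_mask = (1 << seg_bits) - 1
    value = diff & ((1 << width) - 1)          # two's-complement bit pattern
    fill = seg_mask if diff < 0 else 0         # 1111 for negative, 0000 otherwise
    n_segs = width // seg_bits
    segs = [(value >> (seg_bits * (n_segs - 1 - i))) & seg_mask
            for i in range(n_segs)]            # most significant segment first
    for i, seg in enumerate(segs):
        if seg != fill:
            return [format(s, f"0{seg_bits}b") for s in segs[i:]]
    return []                                  # e.g. diff == 0: nothing to send


print(slice_diff(0x300))   # ['0011', '0000', '0000']
print(slice_diff(-0x30))   # ['1101', '0000']
```

Because every kept segment is exactly `seg_bits` wide, the dictionary stage downstream only has to compare 4-bit values, which is the comparator saving the text attributes to slicing.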
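Compared with the classic differential encoding method (valid difference field + difference bit width), the main advantages are: (1) there is no need to compute the bit width of the difference, saving time and resources; (2) the slice width is fixed, saving hardware resources in the subsequent dictionary encoding. Dictionary encoding requires a large number of comparators whose bit width equals the output width of the differential encoder, and the classic encoding method must support a valid bit width of 32 bits.
In the prior art, slicing is currently rarely used in differential encoding; most methods still use the valid difference field + difference bit width representation. However, the inventors of the present application have found through research that the slicing approach to differential encoding of the present invention effectively reduces computation and saves hardware resources.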
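3. LZ dictionary encoding
LZ dictionary encoding is applied to the valid differential slice fields recorded in step 2.
(1) Principle of LZ dictionary encoding
LZ dictionary encoding is a classic lossless data compression algorithm. Its main principle is that a dynamically changing dictionary window is maintained during encoding; when a run of data read for compression is identical to some data segment in the dictionary, the run is represented by the starting position of that segment and the match length. Figure 4 is an example of LZ dictionary encoding as used in the prior art.
In Figure 4, the 8 small squares on the left form the sliding dictionary window (Dictionary), and the 7 small squares in the middle represent the source data to be compressed (Src). Variable M indicates whether the current source datum matches the dictionary: 1 means the match succeeded and 0 means it failed. Variables MP and ML represent the match position and match length of the source data in the dictionary window. M, MP and ML change in real time as the source data and the dictionary change. Here the dictionary window is assumed to contain 8 data buffer units initialized to 0, 1, 2, ..., 7, and the data to be compressed are 1, 2, 3, 5, 6, 7. First, datum 1 is matched against the dictionary contents; the match succeeds at position 1, and the dictionary is then updated by shifting its data left by one buffer unit. Data 2 and 3 are likewise found to match the data in the buffer unit at position 1 of the dictionary, until 5 fails to match; the first compression result of the dictionary encoding, (1, 3, 5), is then output: match position 1, match length 3, and match-terminating datum 5. The remaining source data are compressed by the same principle.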
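The prior-art scheme of Figure 4 can be modeled in Python as follows. This sketch assumes the window slides left by one unit after every source datum (the new datum entering on the right), that a match, once started, continues only while the next datum equals the window entry at the same position MP, and that a datum matching nothing is emitted as the hypothetical literal codeword (0, 0, datum). `lz_sliding_encode` is an illustrative name.

```python
from collections import deque


def lz_sliding_encode(init_window, src):
    """Encode src as (MP, ML, end datum) triples with a sliding window."""
    window = deque(init_window, maxlen=len(init_window))  # append drops the leftmost entry
    codes = []
    mp, ml = 0, 0
    for sym in src:
        if ml == 0:
            # No match in progress: search the whole window for the datum.
            hits = [i for i, v in enumerate(window) if v == sym]
            if hits:
                mp, ml = hits[0], 1                 # M = 1: start a match at MP
            else:
                codes.append((0, 0, sym))           # assumed literal codeword
        elif window[mp] == sym:
            ml += 1                                 # M = 1: extend the match
        else:
            codes.append((mp, ml, sym))             # M = 0: emit (MP, ML, end datum)
            ml = 0
        window.append(sym)                          # slide left by one unit
    if ml:
        codes.append((mp, ml, None))                # flush a match cut off by end of input
    return codes


print(lz_sliding_encode(list(range(8)), [1, 2, 3, 5, 6, 7]))
# [(1, 3, 5), (2, 2, None)]
```

The first triple reproduces the (1, 3, 5) result described for Figure 4. Every `append` moves all N window entries, which is the shifting cost that the non-sliding update of the present invention is designed to avoid.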
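The present invention improves on the LZ dictionary encoding principle to achieve higher execution efficiency and a larger compression ratio.
(2) Hardware implementation of LZ dictionary encoding
As shown in Figure 5, this embodiment uses a RAM of depth N as the dictionary (Dictionary); counter cnt1 generates the dictionary address addr and the match position MP, counter cnt2 generates the match length ML, and comparator CP performs the comparison over multiple cycles to obtain the match result M.
Taking Figure 6 as an example, the dictionary window contains 4 storage units initialized to D0, D1, D2 and D3, and the source data to be compressed are s0, s1, s2 and s3. When the dictionary is updated, D0, D1, D2 and D3 are replaced by s0, s1, s2 and s3 in order from left to right; when searching for a match, successive searches start from the data in the 1st, 2nd, 3rd and 4th buffer units of the dictionary window in turn.
The dictionary update strategy is that the dictionary does not slide: the source data overwrite the dictionary elements in order from left to right, and the search start position does not return to 0 each time but is incremented by 1. This shortens the number of cycles needed for a dictionary update from 2(N-1)+1 to 1 and greatly reduces power consumption.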
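The non-sliding update can be read as treating the depth-N RAM as a circular buffer: each update is a single write at the current start address, and advancing that start address by 1 makes the RAM, read in logical order, look exactly like the window that would have slid. The class below is a hypothetical software model of that idea (the names `RamDictionary`, `update`, `read` and `window` are illustrative), checked against a sliding `collections.deque` on the Figure 6 labels.

```python
from collections import deque


class RamDictionary:
    """Depth-N dictionary that is overwritten in place instead of shifted."""

    def __init__(self, init):
        self.ram = list(init)   # models the depth-N RAM
        self.n = len(self.ram)
        self.start = 0          # RAM address of the oldest entry = search start

    def update(self, sym):
        # One write replaces the oldest entry (s0 over D0, s1 over D1, ...),
        # then the search start moves up by 1, wrapping at depth N.
        self.ram[self.start] = sym
        self.start = (self.start + 1) % self.n

    def read(self, pos):
        # Logical window position pos -> physical RAM address.
        return self.ram[(self.start + pos) % self.n]

    def window(self):
        return [self.read(i) for i in range(self.n)]


ram = RamDictionary(["D0", "D1", "D2", "D3"])
slide = deque(["D0", "D1", "D2", "D3"], maxlen=4)
for s in ["s0", "s1", "s2"]:
    ram.update(s)
    slide.append(s)
    assert ram.window() == list(slide)   # same logical contents, no shifting
print(ram.ram)       # ['s0', 's1', 's2', 'D3']
print(ram.window())  # ['D3', 's0', 's1', 's2']
```

Each update touches one RAM word rather than every entry of a register window, which is consistent with the single-cycle update stated in the text.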
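When higher compression throughput is required, the dictionary window can be split; as shown in Figure 7, splitting the dictionary window into two allows parallel search in the two dictionary windows.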
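A rough cycle-count model of the split window is sketched below. It assumes each bank compares one entry per cycle and that the search stops in the first cycle in which any bank reports a hit, with the lower bank winning a tie; the function `split_search` and this stopping policy are assumptions for illustration, since the text does not state which hit is kept.

```python
def split_search(window, sym, banks=2):
    """Search `window` split into `banks` equal RAMs scanned in parallel.

    Returns (position, cycles); position is None if nothing matched.
    Assumes len(window) is a multiple of `banks`.
    """
    size = len(window) // banks
    for step in range(size):                  # one compare per bank per cycle
        for b in range(banks):                # lower bank wins a tie
            pos = b * size + step
            if window[pos] == sym:
                return pos, step + 1
    return None, size


window = list(range(8))
print(split_search(window, 5, banks=1))  # (5, 6): single RAM
print(split_search(window, 5, banks=2))  # (5, 2): two RAMs in parallel
```

Under these assumptions a full miss costs N/2 cycles instead of N, which is the speed-up that splitting the window is meant to provide.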
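In conventional LZ dictionary compression hardware, data dependencies allow only one source datum to be compressed at a time. In this design, 4 source data are read in at once and searched in parallel to increase the compression speed, and the compression process can terminate at any point depending on the matches.
Figure 8 shows the step-by-step distribution of the match cases during compression.
In the figure, A1, A2, A3 and A4 denote the 4 source data read in at once, P0 denotes the initial match position, P1, P2, P3 and P4 denote the match positions of the 4 source data in the dictionary window, and Mx_Py denotes the result of matching the x-th source datum against the y-th match position. For example, M1_P0_1 means that the first source datum matches the dictionary element at position P0, and M1_P0_0 means that it does not. eof indicates that the search over all elements in the dictionary has ended.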
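Encoding process
(1) Match against P0, obtaining the values of M1_P0, M2_P0, M3_P0 and M4_P0 in turn;
(2) Perform a parallel full search for A1, A2, A3 and A4, which yields the values of M1_P1, M2_P2, M3_P3 and M4_P4. The figure shows the conditions under which the searches for A1, A2, A3 and A4 end; when all 4 source data have finished searching, the search for the current batch ends. When M1_P1 goes high, M2_P1, M3_P1 and M4_P1 are obtained in turn over the following 3 cycles; when M2_P2 goes high, M3_P2 and M4_P2 are obtained in turn over the following 2 cycles; when M3_P3 goes high, M4_P3 is obtained in the following cycle.
Key points of encoding:
(1) When encoding ends (as shown in Figure 8, once the encoding of A1, A2, A3 and A4 has all finished);
(2) CL output, CP output and codeword output (handled in four stages: A1, A2, A3 and A4).
The output values are produced only after all full searches have finished: the actual match status of A1, A2, A3 and A4 is first determined from the separate match values obtained during the full search.
A1: matches P0, matches P1, or no match;
A2: matches P0, matches P1, matches P2, or no match;
A3: matches P0, matches P1, matches P2, matches P3, or no match;
A4: matches P0, matches P1, matches P2, matches P3, matches P4, or no match;
The match cases for An are as follows:
a. Matches Pn-1 (CLn-1 > 0): CL = CL + 1, CP unchanged
b. Does not match Pn-1 (CLn-1 > 0): output codeword, clear CL
c. Matches Pn (CLn-1 = 0): CL = CL + 1, CP loaded
d. Does not match Pn (CLn-1 = 0): output codeword, clear CL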
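Rules a to d amount to a small per-datum state update on the match length CL and the match position CP. The sketch below applies them to one batch, assuming the parallel search has already produced, for each datum An, a flag saying whether it continues the current match and, when no match is in progress, the position at which it would start a new one. The function name `resolve_batch`, these inputs, and the literal codeword (0, 0, datum) used for rule d are hypothetical.

```python
def resolve_batch(batch, continues, fresh_pos, cl=0, cp=0):
    """Apply rules a-d to one batch A1..An.

    batch[i]     -- source datum A(i+1)
    continues[i] -- True if A(i+1) extends the current match
    fresh_pos[i] -- dictionary position matching A(i+1) when CL == 0, else None
    Returns (codewords, cl, cp) so the state can carry into the next batch.
    """
    codewords = []
    for sym, cont, pos in zip(batch, continues, fresh_pos):
        if cl > 0:
            if cont:                      # a: CL = CL + 1, CP unchanged
                cl += 1
            else:                         # b: output codeword, clear CL
                codewords.append((cp, cl, sym))
                cl = 0
        elif pos is not None:             # c: CL = CL + 1, CP loaded
            cl, cp = cl + 1, pos
        else:                             # d: output (literal) codeword, CL stays 0
            codewords.append((0, 0, sym))
            cl = 0
    return codewords, cl, cp


# Flags for the Figure 4 data 1, 2, 3, 5 (datum 1 starts a match at position 1).
print(resolve_batch([1, 2, 3, 5], [False, True, True, False], [1, None, None, None]))
# ([(1, 3, 5)], 0, 1)
```

In this sketch, returning CL and CP lets a match that runs past A4 continue into the next batch of four source data.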
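The above are merely preferred embodiments of the present invention and do not limit the invention in any form. Any simple modification, equivalent change or adaptation made to the above embodiments according to the technical essence of the invention, within its spirit and principles, still falls within the scope of protection of the invention.
Although the principles of the present invention have been described in detail above with reference to its preferred embodiments, those skilled in the art should understand that these embodiments merely illustrate exemplary implementations of the invention and do not limit its scope. Details in the embodiments do not limit the scope of the invention; obvious changes such as equivalent transformations and simple substitutions based on the technical solution of the invention, made without departing from its spirit and scope, all fall within the scope of protection of the invention.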

Claims (5)

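  1. A program counter compression method, characterized in that the compression method comprises the following steps:
    Step (1): acquiring the execution status of instructions issued by a processor, and classifying and filtering the instructions based on their execution status;
    Step (2): based on the classification result, performing differential processing on the program counter values and blocking periods of the target instruction types, and slicing the resulting difference values to obtain the corresponding valid data segments;
    Step (3): performing dictionary encoding on the valid data segments of the differential slices obtained in step (2),
    wherein the program counter compression method divides instructions into (1) sequential execution, (2) jump, and (3) blocking types,
    step (1) comprises:
    Step (1.1): obtaining the program counter value corresponding to each instruction;
    Step (1.2): computing the difference between the program counter values of any two consecutive instructions, PC_diff = PC - PC_pre, where PC is the program counter value of the current cycle and PC_pre is the program counter value of the previous cycle;
    Step (1.3): classifying the instruction based on the difference PC_diff between the program counter values of any two consecutive instructions:
    (a) if PC_diff = 1, classifying the instruction as a sequential instruction and not recording it;
    (b) if PC_diff = 0, classifying the instruction as a blocking instruction and recording its program counter value and number of blocking cycles;
    (c) if PC_diff ≠ 0 and PC_diff ≠ 1, classifying the instruction as a jump instruction and recording the program counter values of its branch address and target address,
    and step (2) comprises:
    Step (2.1): differencing the recorded program counter values of the blocking and jump instructions and the corresponding instruction durations in cycles to obtain the corresponding difference values;
    Step (2.2): dividing each difference value into several data segments in bit order, each data segment containing the same number of bits;
    Step (2.3): determining, from left to right, whether the bits in each data segment are all 0 or all 1;
    Step (2.4): if the bits in a data segment are all 0 or all 1, discarding that data segment; otherwise, transmitting that data segment and all of its lower-order data segments to step (3).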
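  2. The program counter compression method according to claim 1, characterized in that step (3) comprises:
    Step (3.1): receiving the data segments transmitted in step (2), each data segment containing several source data;
    Step (3.2): constructing a dictionary in memory, the dictionary including several elements;
    Step (3.3): for each source datum, searching the elements of the dictionary in turn; if the current source datum matches an element, recording that element's position in the dictionary and then updating the dictionary with the source datum; if no element matches, updating the dictionary directly with the source datum; during the update the dictionary does not slide, the source data overwrite the dictionary elements in order from left to right, and after each update the search start position is incremented by 1.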
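  3. The program counter compression method according to claim 2, characterized in that step (3) further comprises:
    splitting the dictionary window and reading N source data at a time, N being a positive integer greater than or equal to 2; starting the search for one source datum per cycle, with the search start position of each source datum one greater than that of the previous one; obtaining the search match result for each source datum; and determining the actual match result from the match results of all N source data.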
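  4. A program counter compression apparatus, characterized in that
    the compression apparatus comprises an instruction classification module, a data differential slicing module and a dictionary encoding module,
    the instruction classification module being configured to acquire the execution status of instructions issued by a processor and to classify and filter the instructions based on their execution status;
    the data differential slicing module being configured to perform differential processing on the program counter values and blocking periods of the target instruction types based on the classification result, and to slice the resulting difference values;
    the dictionary encoding module being configured to build the dictionary in a RAM of depth N and to perform LZ dictionary encoding on the valid differential slice data segments.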
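  5. The program counter compression apparatus according to claim 4, characterized in that the compression apparatus is configured to perform the method according to any one of claims 1 to 3.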
PCT/CN2017/073927 2016-12-13 2017-02-17 Program counter compression method and hardware circuit thereof WO2018107579A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/579,827 US10277246B2 (en) 2016-12-13 2017-02-17 Program counter compression method and hardware circuit thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611143794.2A CN106656200B (zh) 2016-12-13 2016-12-13 Program counter compression method and hardware circuit thereof
CN201611143794.2 2016-12-13

Publications (1)

Publication Number Publication Date
WO2018107579A1 true WO2018107579A1 (zh) 2018-06-21

Family

ID=58824864

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/073927 WO2018107579A1 (zh) 2016-12-13 2017-02-17 Program counter compression method and hardware circuit thereof

Country Status (3)

Country Link
US (1) US10277246B2 (zh)
CN (1) CN106656200B (zh)
WO (1) WO2018107579A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10135461B2 (en) * 2015-09-25 2018-11-20 Intel Corporation Systems, methods, and apparatuses for decompression using hardware and software
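CN113157655A (zh) * 2020-01-22 2021-07-23 阿里巴巴集团控股有限公司 Data compression and decompression method, apparatus, electronic device and storage medium
CN113114266B (zh) * 2021-04-30 2022-12-13 上海智大电子有限公司 Real-time data simplification and compression method for an integrated monitoring system
CN113407363B (zh) * 2021-06-23 2024-05-17 京东科技控股股份有限公司 Sliding-window counting method and apparatus based on a remote dictionary service
CN117097810B (zh) * 2023-10-18 2024-01-02 深圳市互盟科技股份有限公司 Cloud-computing-based data center transmission optimization method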

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060132822A1 (en) * 2004-05-27 2006-06-22 Silverbrook Research Pty Ltd Storage of program code in arbitrary locations in memory
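CN101118486A (zh) * 2006-06-29 2008-02-06 英特尔公司 Method and apparatus for partitioned pipelined execution of multiple execution threads
CN101517530A (zh) * 2006-09-29 2009-08-26 Mips技术公司 Apparatus and method for tracing instructions using simplified instruction state descriptors
CN101739338A (zh) * 2009-12-21 2010-06-16 北京龙芯中科技术服务中心有限公司 Apparatus and method for tracing processor address data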

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2785695B1 (fr) * 1998-11-06 2003-01-31 Bull Cp8 Method for compacting an intermediate object-code program executable in an embedded system provided with data processing resources, and corresponding compacting system and multi-application embedded system
GB2362733B (en) * 2000-05-25 2002-02-27 Siroyan Ltd Processors having compressed instructions.
US7185234B1 (en) * 2001-04-30 2007-02-27 Mips Technologies, Inc. Trace control from hardware and software
US6907598B2 (en) 2002-06-05 2005-06-14 Microsoft Corporation Method and system for compressing program code and interpreting compressed program code
DE102004010179A1 (de) * 2004-03-02 2005-10-06 Siemens Ag Method and data processing device for updating computer programs via data transmission
US7804428B2 (en) * 2008-11-10 2010-09-28 Apple Inc. System and method for compressing a stream of integer-valued data
US9348593B2 (en) * 2012-08-28 2016-05-24 Avago Technologies General Ip (Singapore) Pte. Ltd. Instruction address encoding and decoding based on program construct groups
CN105099460B (zh) 2014-05-07 2018-05-04 瑞昱半导体股份有限公司 Dictionary compression method, dictionary decompression method and dictionary construction method

Also Published As

Publication number Publication date
CN106656200B (zh) 2019-11-08
US10277246B2 (en) 2019-04-30
US20190089370A1 (en) 2019-03-21
CN106656200A (zh) 2017-05-10

Similar Documents

Publication Publication Date Title
WO2018107579A1 (zh) Program counter compression method and hardware circuit thereof
US11836081B2 (en) Methods and systems for handling data received by a state machine engine
US10372653B2 (en) Apparatuses for providing data received by a state machine engine
US9886017B2 (en) Counter operation in a state machine lattice
US10509995B2 (en) Methods and devices for programming a state machine engine
US8680888B2 (en) Methods and systems for routing in a state machine
CA2939959A1 (en) Parallel decision tree processor architecture
US10254976B2 (en) Methods and systems for using state vector data in a state machine engine
US20150262062A1 (en) Decision tree threshold coding
WO2014035699A1 (en) Results generation for state machine engines
US9235564B2 (en) Offloading projection of fixed and variable length database columns
Zhang et al. cublastp: Fine-grained parallelization of protein sequence search on cpu+ gpu
US20170193351A1 (en) Methods and systems for vector length management
US20150262063A1 (en) Decision tree processors

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17881299

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17881299

Country of ref document: EP

Kind code of ref document: A1