WO2014169477A1 - Processor with a polymorphic instruction set architecture - Google Patents

Processor with a polymorphic instruction set architecture

Info

Publication number
WO2014169477A1
Authority
WO
WIPO (PCT)
Prior art keywords
polymorphic
instruction
processing unit
processor
microcode
Prior art date
Application number
PCT/CN2013/074426
Other languages
English (en)
French (fr)
Inventor
王东琳
谢少林
杨勇勇
尹磊祖
王磊
刘子君
汪涛
张星
Original Assignee
中国科学院自动化研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院自动化研究所
Priority to PCT/CN2013/074426 (WO2014169477A1)
Priority to US14/785,385 (US20160162290A1)
Publication of WO2014169477A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode

Definitions

  • the present invention generally relates to processor instruction set architecture, which is closely tied to processor instruction set definition, processor architecture design, and microarchitecture implementation methods, and in particular to a processor with a polymorphic instruction set architecture that can be dynamically reconfigured after tape-out.
  • RFID and wireless sensors generate information every minute, and Internet services for hundreds of millions of users generate huge amounts of information exchange.
  • users place high demands on the timeliness and effectiveness of information processing: users of an online video-on-demand system, for example, require not only high-definition pictures but also decoding and display speeds of at least 30 frames per second. We need to start with an analysis of algorithm characteristics to study how to process massive amounts of information efficiently and quickly.
  • the first feature is that the amount of data is huge, and the amount of data generated by high-definition video, broadband communication, and high-precision sensors is increasing by 5 to 10 times per year.
  • the second feature is that the computational complexity is huge.
  • the computational complexity of information processing is usually the K-th power of the data quantity n, that is, $O(n^K)$.
  • for example, the computational complexity of the bubble sort algorithm is $O(n^2)$
  • the complexity of the FFT algorithm is $O(n \log n)$
  • the algorithm of massive information processing is relatively regular.
  • the fourth feature of mass information processing is that it has strong data locality: there is no correlation between local data blocks, but local data itself has strong correlation.
  • the calculation result in a filtering algorithm depends only on the data within the range of the filter template, and the data in the template range must be processed multiple times to obtain the final result; in video encoding and decoding algorithms, the data of one macroblock or of adjacent macroblocks must undergo complex operations to yield the final result, and there is no data correlation between distant macroblocks.
  • the fifth feature of mass information processing is that the processing algorithm pattern is basically unchanged, but the algorithm details are constantly evolving, such as the evolution of video coding standards from H.263 to H.264 and of communication protocols from 2G to 3G and then to LTE.
  • Massive information processing has its own unique performance requirements and application characteristics. Because the amount of data and the amount of computation are huge, and most of it must be computed in real time, the computational power of traditional scalar and superscalar processors falls far short of this requirement. At the same time, because of power consumption and volume limits, it is also impossible to rely solely on stacking scalar processors to implement a massive information processing system.
  • the ASIC chip for mass information processing has a large design cost and a long period, and its update speed is far lower than the evolution speed of mass information processing algorithms, and cannot adapt to the development speed of massive information processing systems. Therefore, the transformation of traditional scalar and superscalar processors for mass information processing features, and even the design of new domain processors, is the current development trend of massive information processing chips.
  • An "instruction" is a symbol defined by the designer that the processor can understand. By sending a different sequence of instructions to the processor, the programmer specifies the actions of the processor at different times. The set of all instructions that the processor can understand is the instruction set of the processor. The programmer uses the instructions in the instruction set to implement various algorithms.
  • the instruction set of a general processor is fixed, and instruction behavior corresponds one-to-one with the processor implementation.
  • for example, the calculation instruction "ADD R0, R1, R2" in the ARMv4T instruction set indicates that the values in registers R1 and R2 are to be added and the result written to R0.
  • Once the processor instruction set is fixed, the programmer cannot add instructions to it or redefine the behavior of its instructions. The instructions in a general processor instruction set are therefore kept fairly general to ensure programming flexibility. However, a general processor instruction set has difficulty implementing some special applications efficiently. For example, video coding often requires 8-bit data calculation; if a 32-bit addition instruction such as "ADD R0, R1, R2" in the ARM processor is used to implement this type of algorithm, the efficiency is very low. As a result, processors typically extend the instruction set for specific applications, such as the MMX instructions for video and image processing in the X86 instruction set and the NEON instructions in the ARM instruction set.
  • This type of extended instruction is characterized by high execution efficiency for a certain type of application, but is very inefficient for other applications. Therefore, after the design of the processor is completed, the application field it has adapted has been determined, and it is difficult to adapt to other application fields. Programmers are also unable to fine tune the processor based on the algorithmic characteristics of other application areas.
  • US Patent Application No. 2004/0019765 A1 discusses a processor architecture consisting of a RISC processor plus configurable array processing units, in which multiple array processing units are logically divided into multiple pipeline stages. The behavior of each pipeline stage is dynamically configured by the RISC processor.
  • US Patent Application No. 2006/0211387 A1 (Multistandard SDR Architecture Using Context-Based Operation Reconfigurable Instruction Set Processor) defines a processor structure of a configuration unit plus coprocessors, wherein each coprocessor is composed of a state control unit and a data path and is responsible for certain similar processing tasks.
  • the technical problem to be solved by the present invention is to provide a processor having a polymorphic instruction set architecture, so as to solve the problem that a processor's instruction set cannot be redefined after tape-out.
  • the present invention provides a processor having a polymorphic instruction set architecture, including a scalar processing unit, at least one polymorphic instruction processing unit, at least one multi-granularity parallel memory, and a DMA controller;
  • the polymorphic instruction processing unit includes at least one functional unit; the polymorphic instruction processing unit is configured to interpret and execute a polymorphic instruction, and the functional unit is configured to perform a specific data operation task, wherein the polymorphic instruction refers to a sequence of multiple consecutively executed microcode records.
  • the microcode record indicates an action that each functional unit needs to perform in a certain clock cycle;
  • the scalar processing unit is configured to call a polymorphic instruction and query an execution state of the polymorphic instruction;
  • the DMA controller is configured to transfer the configuration information of the polymorphic instruction and to transfer the data required by the polymorphic instruction to the multi-granularity memory.
  • the polymorphic instruction processing unit passively receives a polymorphic instruction from the DMA controller and is called by the scalar processing unit.
  • the scalar processing unit controls the polymorphic instruction processing unit through a first control path, and the scalar processing unit controls the DMA controller through a second control path.
  • the polymorphic instruction processing unit further includes a microcode memory and a microcode control unit; the microcode memory is configured to store polymorphic instructions; and the microcode control unit is configured to receive control requests from the scalar processing unit through the first control path and perform the corresponding actions.
  • the microcode control unit includes a configuration register for storing the parameters and operating state required for the operation of the polymorphic instruction processing unit.
  • control request of the scalar processing unit includes starting or querying the polymorphic instruction processing unit, and reading and writing the configuration register of the polymorphic instruction processing unit.
  • the polymorphic instruction processing unit further includes a transfer control unit; the functional unit has a plurality of data input/output ports and exchanges data through the transfer control unit.
  • the functional unit is configured to perform data load/store operations and to read and write data from the multi-granularity parallel memory through a first internal bus; meanwhile, the microcode memory is coupled to the first internal bus as a slave device and passively receives microcode records from the outside.
  • the microcode control unit sequentially reads and executes microcode recording of the polymorphic instruction.
  • each row in the microcode memory stores a microcode record; when the scalar processing unit invokes a polymorphic instruction, it specifies only the line number, in the microcode memory, of the initial microcode record corresponding to that polymorphic instruction.
  • after the processor has been taped out and manufactured, the programmer can still redefine the processor instruction set according to the characteristics of the application algorithm. After redefinition, the processor instruction set architecture better fits the characteristics of the application algorithm, thereby improving the processor's performance in such applications.
  • the redefinition process does not modify the processor hardware or the corresponding software tool chain (assembler, linker, etc.), but for different instruction definitions the instruction set architecture takes different forms.
  • Figure 1 is a schematic illustration of the major components and interconnections of a processor having a polymorphic instruction set architecture of the present invention
  • Figure 2 is a schematic diagram showing the main components and interconnections of the polymorphic instruction execution unit of the present invention
  • FIG. 3 shows schematically the main components of the microcode recording of the present invention
  • Figure 4 shows a simplified diagram of how the behavior of a polymorphic instruction is defined and how the microcode memory holds the definition of a polymorphic instruction
  • FIG. 5 exemplarily shows a flow of defining and calling a polymorphic instruction of the present invention
  • FIG. 6 is a view schematically showing the functional units in a processor having a polymorphic instruction set architecture of the present invention
  • Figure 7 exemplarily shows the interface definition and internal structure of a computing unit employed by the processor of the present invention
  • Figure 8 exemplarily shows the interface definition and internal structure of a bus interface unit employed by the processor of the present invention
  • Figure 9 exemplarily shows the interface definition of a register file employed by the processor of the present invention.
  • Figure 10 exemplarily shows the definition of a data transfer path between functional components in the processor of the present invention
  • Figure 11 exemplarily shows an implementation structure of a data transfer unit of a computing unit in the processor of the present invention
  • Figure 12 exemplarily shows an implementation structure of a data transfer unit between functional components in the processor of the present invention.
  • Figure 13 exemplarily shows the encoding of the functional components in the processor of the present invention
  • Figure 14 exemplarily illustrates the logical behavior of a multiplexer in the processor of the present invention.
  • the present invention proposes a processor that dynamically reconstructs a polymorphic instruction set architecture after tape out.
  • the structure of the processor of the present invention is as shown in Fig. 1, and mainly includes the following components: a scalar processing unit 101, at least one polymorphic instruction processing unit 100, at least one multi-granularity parallel memory 102, and a DMA controller 103.
  • the polymorphic instruction processing unit 100 includes at least one functional unit.
  • the polymorphic instruction refers to a sequence of a plurality of consecutively executed microcode records.
  • the polymorphic instruction set is the set of polymorphic instructions, and a microcode record indicates the actions that each functional unit needs to perform in a certain clock cycle, such as performing an addition operation, performing a data loading operation, or doing nothing.
  • the polymorphic instruction processing unit 100 interprets and executes polymorphic instructions, and the functional units it contains execute specific data operation tasks; the scalar processing unit 101 calls a polymorphic instruction and queries the execution state of the polymorphic instruction.
  • the DMA controller 103 transmits configuration information of the polymorphic instructions and data required to transfer the polymorphic instructions to the multi-granularity memory 102.
  • the scalar processing unit 101 controls the polymorphic instruction processing unit 100 through a first control path 104 and controls the DMA controller 103 through a second control path 105; the DMA controller 103 transmits configuration information to the polymorphic instruction processing unit 100 through the first internal bus 106, transfers data to the multi-granularity parallel memory 102 via the second internal bus 107, and reads and writes external data via the bus 108; the polymorphic instruction processing unit 100 reads and writes data from the multi-granularity parallel memory 102 via the second internal bus 107.
  • the scalar processing unit 101 can be a RISC or DSP, but must have a first control path 104, which must provide the following functions: 1. starting the polymorphic instruction processing unit 100; 2. querying the execution state of the polymorphic instruction processing unit 100; 3. reading and writing the configuration registers of the polymorphic instruction processing unit 100 (described below).
  • the multi-granularity parallel memory 102 adopts the multi-granularity parallel memory of the Chinese patent publication with application number 201110460585.1 (entitled "Multi-granular Parallel Storage System and Memory"), which can simultaneously support parallel reading and writing of matrix row and column data of different data types.
  • the master device of the second internal bus 107 is the polymorphic instruction processing unit 100, and the slave device is the multi-granularity parallel memory 102.
  • the DMA controller 103 and the polymorphic instruction processing unit 100 can read and write data from the multi-granularity parallel memory 102 through the second internal bus 107.
  • the master device of the first internal bus 106 is the DMA controller 103, and the slave device is the polymorphic instruction processing unit 100.
  • the DMA controller 103 can write a polymorphic instruction to the polymorphic instruction processing unit 100 through the first internal bus 106.
  • the polymorphic instructions are stored in an external memory connected to the bus 108.
  • the polymorphic instruction processing unit 100 passively receives the polymorphic instruction from the DMA controller 103 and is called by the scalar processing unit 101.
  • Fig. 2 shows an internal structure diagram of the polymorphic instruction processing unit 100.
  • the polymorphic instruction processing unit 100 includes a microcode memory 200, a microcode control unit 201, at least one functional unit 202, and a transmission control unit 203.
  • the microcode memory 200 is responsible for storing polymorphic instructions
  • the microcode control unit 201 receives various types of control requests of the scalar processing unit 101 through the first control path 104 and performs corresponding actions.
  • the microcode control unit 201 includes a configuration register 207 for storing the parameters and operating state required for the operation of the polymorphic instruction processing unit 100, such as specifying the functional unit 202 that executes the current polymorphic instruction, specifying the start address and total length of the required data, and indicating whether the polymorphic instruction processing unit 100 is currently idle.
  • starting the polymorphic instruction processing unit 100: at this time, the microcode control unit 201 reads the microcode record 300 from the microcode memory 200 and generates the corresponding control information, which is sent to the functional unit 202 and the transfer control unit 203.
  • querying the execution state: the microcode control unit 201 returns the execution state of the current polymorphic instruction: complete or idle.
  • reading and writing the configuration register 207 of the polymorphic instruction processing unit 100: at this time, the microcode control unit 201 writes the specified data to the designated configuration register 207 or returns the data of the specified configuration register 207.
  • the polymorphic instruction processing unit 100 can design at least one different functional unit 202 depending on the application requirements.
  • the functional unit 202 is responsible for performing specific data manipulation tasks, such as performing an addition operation, or a data loading/storing operation.
  • the functional unit 202 generally has a plurality of data input/output ports and exchanges data through the transfer control unit 203. For example, after the addition unit completes an addition, it passes the result to the transfer control unit 203, which then sends the result to the multiplication unit for multiplication.
  • the transfer control unit 203 is connected to the data input/output ports of all the functional units 202, receives the source and destination information for the data at each moment from the microcode control unit 201 through the interface 206, and sends the source data to the destination.
  • the bus 107 is the first internal bus 107 of FIG. 1.
  • Some types of functional units 202 need to perform data load/store operations, and data needs to be read and written from the multi-granular parallel memory 102 through the first internal bus 107.
  • the microcode memory 200 is connected to the first internal bus 107 as a slave device, passively receiving the microcode record 300 from the outside.
  • FIG. 3 shows a block diagram of a microcode record 300.
  • the microcode record 300 is divided into a plurality of domains, and each functional unit has a corresponding domain in the microcode record 300, such as the functional unit domain 301 corresponding to the second functional unit.
  • as shown in Figure 4, a polymorphic instruction of the present invention is a sequence of multiple consecutively executed microcode records 300 that together perform a particular function.
  • the polymorphic instructions, i.e., the sequence of microcode records 300, are stored in the microcode memory 200 and are sequentially read and executed by the microcode control unit 201.
  • Each row in the microcode memory 200 stores a microcode record 300.
  • the programmer can use the microcode record 300 to flexibly define the behavior of the polymorphic instruction and the starting line number of the polymorphic instruction in the microcode memory according to the algorithm requirements.
  • Figure 5 exemplarily shows a flow of defining and invoking a polymorphic instruction.
  • write the scalar code: the code calls the programmer-defined polymorphic instruction; at this point the starting line number of the polymorphic instruction has not been determined and is represented by a label.
  • the polymorphic instruction records, written as text and represented by the label Instrl, are compiled and linked into a binary file that the microcode control unit 201 can understand; during compilation and linking the start address of each polymorphic instruction is determined, so the value of Instrl is now fixed at 10.
  • after the scalar code has been compiled and linked, it must also be cross-linked with the polymorphic instruction binary file, replacing the symbolic polymorphic instruction start addresses in the original scalar code with their actual values to generate the scalar binary file.
  • before calling a polymorphic instruction, the scalar code uses the DMA controller 103 to load the contents of the polymorphic instruction binary into the microcode memory, and then invokes the polymorphic instruction.
  • Embodiment of a Processor Having a Polymorphic Instruction Set Architecture: an exemplary embodiment of a polymorphic instruction set architecture is presented below. This embodiment is merely one embodiment of the present invention, and the present invention is not limited to this example.
  • This embodiment is a processor with a polymorphic instruction set architecture for data intensive applications.
  • Figure 6 shows the functional units in the processor.
  • the data bit width of all functional units is 512 bits.
  • 512 bits can be regarded as 64 8-bit or 32 16-bit or 16 32-bit data.
  • among the functional units, IALU is used for fixed-point logic calculation, FALU for floating-point logic calculation, IMAC for fixed-point multiply-accumulate calculation, and FMAC for floating-point multiply-accumulate calculation; SHU0 and SHU1 are used for data interleaving operations, that is, exchanging any two data elements of the 512-bit data.
  • M is a 512-bit-wide register file, and BIU0, BIU1, and BIU2 are bus interface units responsible for loading/storing data from the multi-granularity parallel memory 102.
  • IALU, FALU, IMAC, FMAC, SHU0, SHU1 have similar interfaces. In this embodiment, they are collectively referred to as computing unit 500.
  • the interface of the computing unit 500 is as shown in FIG. 7; it includes four data input ports 604 and four corresponding temporary registers 600.
  • the arithmetic logic 601 reads data from the temporary register for operation, writes the result of the operation to the temporary register 602, and then transfers the result of the operation to the transfer control unit 203 via the output port 603.
  • BIU0, BIU1, and BIU2 are collectively referred to as bus interface unit 501, whose internal structure is shown in FIG. 8. It has a data input port 702, which acquires data from the transfer control unit 203 and writes the obtained data to the temporary register 700; a data output port 703, through which the data in the temporary register 701 is transferred to the transfer control unit 203; an internal bus interface 107, through which data in the multi-granularity parallel memory 102 is read and written; and address calculation logic 704, which is responsible for calculating the addresses used on the second internal bus 107.
  • M is a 512-bit-wide register file with four write ports 800, four read ports 802, and corresponding banks 801.
  • Figure 9 illustrates the interface of this register file.
  • the calculation results of each functional unit can be directly transmitted to other functional units to implement cascading operations.
  • FMAC mainly performs floating point multiply and accumulate operations, and its operation result does not need to be directly transmitted to the fixed point calculation unit IALU or IMAC.
  • the benefit of reducing the data transfer path is that it reduces the number of connections between functional units, which in turn reduces chip area and reduces chip cost.
  • the data transfer path between the functional units in this embodiment is as shown in FIG. 10. The beginning of each column in the table indicates the data destination, the beginning of each row represents the data source, and the cell with the hook in the middle indicates the existence of the transport path.
  • some functional units can share a transmission path according to application needs. Sharing a transmission path between functional units further reduces on-chip wiring, but those functional units cannot transmit data at the same time. For example, if SHU0 to BIU0 and SHU1 to BIU1 share a transmission path, then while SHU0 transmits data to BIU0, data cannot be transmitted between SHU1 and BIU1.
  • the shading in Fig. 10 shows a partially shared transmission path.
  • the transmission control unit 203 corresponding to FIG. 10 is composed of 29 multiplexers.
  • the first level is composed of IALU, IMAC, FALU, and FMAC.
  • this level is referred to as the ACU, as shown in Figure 11.
  • this level exchanges data with the other functional units via three input ports ACU.I0, ACU.I1, ACU.I2 and one output port ACU.O.
  • the second level consists of ACU, M, SHU0, SHU1, and BIU0 to BIU2. As shown in Figure 12, it contains a total of 13 multiplexers, namely M0 to M12 in Figure 12; the data inputs of each multiplexer are marked in the figure.
  • each functional unit control field 301 in the microcode record 300, in addition to indicating the operation to be performed by the functional unit, also needs to indicate the destination of the operation result, which is specified by the encoding in Fig. 13; for example, a destination code used in the FALU control field is "1100".
  • the microcode control unit 201 transmits the destination information of all the functional units in the microcode record 300 to the transmission control unit 203, which generates control signals of 29 multiplexers based on the destination information.
  • Figure 14 depicts the logical behavior of multiplexer M0, where GroupID represents the group number of the destination in the corresponding functional unit control field 301.
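
The path rules described in the bullets above (only some source/destination pairs are wired, and pairs that share a physical path cannot be used in the same cycle) can be sketched as a small per-cycle check. This is a minimal illustration only: the function name `route`, the set-based representation, and every path other than the SHU0 to BIU0 / SHU1 to BIU1 sharing example from the text are assumptions, not the multiplexer logic of Figures 10 to 14.

```python
def route(transfers, paths, shared_groups):
    """Validate the (source, destination) transfers requested in one clock cycle.

    transfers:     list of (source, destination) pairs requested this cycle
    paths:         set of (source, destination) pairs that physically exist
    shared_groups: list of sets of pairs that share one physical path
    """
    for source, destination in transfers:
        if (source, destination) not in paths:
            raise ValueError(f"no transfer path {source} -> {destination}")
    for group in shared_groups:
        active = [pair for pair in transfers if pair in group]
        if len(active) > 1:
            # Units sharing a path cannot transmit in the same cycle.
            raise ValueError(f"shared path conflict: {active}")
    return list(transfers)


# Hypothetical path table; only the shared pair below comes from the text.
PATHS = {("SHU0", "BIU0"), ("SHU1", "BIU1"), ("IALU", "M")}
SHARED = [{("SHU0", "BIU0"), ("SHU1", "BIU1")}]

accepted = route([("SHU0", "BIU0"), ("IALU", "M")], PATHS, SHARED)
```

Requesting `("SHU0", "BIU0")` and `("SHU1", "BIU1")` in the same cycle raises an error in this sketch, which mirrors the restriction stated for shared paths.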

Abstract

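The present invention proposes a processor with a polymorphic instruction set architecture, comprising a scalar processing unit (101), at least one polymorphic instruction processing unit (100), at least one multi-granularity parallel memory (102), and a DMA controller (103). The polymorphic instruction processing unit (100) comprises at least one functional unit (202); it interprets and executes polymorphic instructions, and its functional units (202) perform specific data operation tasks. The scalar processing unit (101) invokes polymorphic instructions and queries their execution state. The DMA controller (103) transfers the configuration information of polymorphic instructions and transfers the data they require to the multi-granularity memory (102). After the processor of the invention has been taped out and manufactured, the programmer can still redefine the processor instruction set according to the characteristics of the application algorithm.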

Description

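Processor with a Polymorphic Instruction Set Architecture
Technical Field
The present invention relates mainly to processor instruction set architecture and is closely tied to the definition of processor instruction sets, processor architecture design, and microarchitecture implementation methods. In particular, it relates to a processor with a polymorphic instruction set architecture that can be dynamically reconfigured after tape-out.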
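Background
In recent years the Internet, cloud computing, and the Internet of Things have developed rapidly. Ubiquitous mobile devices, RFID, and wireless sensors generate information every minute and every second, and Internet services for hundreds of millions of users generate enormous volumes of information exchange. At the same time, users place high demands on the timeliness and effectiveness of information processing: users of an online video-on-demand system, for example, require not only high-definition pictures but also decoding and display at no less than 30 frames per second. We need to start from an analysis of algorithm characteristics to study how to process massive information efficiently and quickly.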
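In general, massive information processing shows the following characteristics. First, the volume of data is huge: the data produced by high-definition video, broadband communication, and high-precision sensors grows by a factor of 5 to 10 every year. Second, the amount of computation is huge: the computational complexity of information processing is usually the K-th power of the data size n, that is, $O(n^K)$. For example, the complexity of bubble sort is $O(n^2)$ and that of the FFT is $O(n \log n)$, so the computation required grows sharply as the data volume increases. Third, the algorithms are relatively regular: core algorithms such as one- and two-dimensional filtering, the FFT, and adaptive filtering can all be expressed with simple mathematical formulas and need no complex logical decisions. Fourth, there is strong data locality: local data blocks are not correlated with one another, but the data within a block is strongly correlated. In a filtering algorithm, for example, a result depends only on the data within the filter template, and that data must be processed several times to obtain the final result; in video encoding and decoding, the data of one macroblock or of adjacent macroblocks undergoes complex operations to produce the final result, while distant macroblocks have no data correlation. Fifth, the overall pattern of the processing algorithms is essentially stable while their details keep evolving, as in the evolution of video coding standards from H.263 to H.264 and of communication protocols from 2G to 3G and then to LTE.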
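Massive information processing has its own performance requirements and application characteristics. Because the data volume and the amount of computation are huge, and most of the computation must be done in real time, the computing power of traditional scalar and superscalar processors falls far short of what is required. At the same time, because of power and size constraints, a massive information processing system cannot be built simply by stacking scalar processors. ASIC chips for massive information processing are expensive to design and develop and have long development cycles, so they are updated far more slowly than the algorithms evolve and cannot keep pace with the development of massive information processing systems. Adapting traditional scalar and superscalar processors to the characteristics of massive information processing, or even designing entirely new domain-specific processors, is therefore the current trend for massive information processing chips.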
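An "instruction" is a symbol defined by the designer that the processor can understand. By sending different instruction sequences to the processor, the programmer specifies what the processor does at each moment. The set of all instructions that a processor can understand is that processor's instruction set. The programmer uses the instructions in the instruction set to implement algorithms.
The instruction set of a general processor is fixed, and instruction behavior corresponds one-to-one with the processor implementation. For example, the calculation instruction "ADD R0, R1, R2" in the ARMv4T instruction set adds the values in registers R1 and R2 and writes the result to R0.
Once the processor instruction set is fixed, the programmer cannot add instructions to it or redefine the behavior of its instructions, so the instructions in a general processor instruction set are kept fairly general to preserve programming flexibility. A general instruction set, however, has difficulty implementing certain special applications efficiently. Video coding, for example, often requires 8-bit data calculation; implementing such algorithms with a 32-bit addition instruction like "ADD R0, R1, R2" on an ARM processor is very inefficient. Processors of all kinds therefore usually extend their instruction sets for special applications, such as the MMX instructions for video and image processing in the X86 instruction set and the NEON instructions in the ARM instruction set.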
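
To make the efficiency gap concrete, the following minimal sketch contrasts adding four 8-bit values one lane at a time with a packed addition that treats a 32-bit word as four independent 8-bit lanes. It uses a generic "SIMD within a register" bit trick purely as an illustration; it is not the MMX or NEON implementation and not the mechanism of the present invention, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
LOW7 = 0x7F7F7F7F   # low 7 bits of every 8-bit lane
HIGH1 = 0x80808080  # top bit of every 8-bit lane


def add_bytes_scalar(a: int, b: int) -> int:
    """Add four packed 8-bit lanes one at a time (four separate additions)."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift  # wrap around within the lane
    return result


def add_bytes_packed(a: int, b: int) -> int:
    """Add four packed 8-bit lanes using one 32-bit addition plus masking."""
    # Add only the low 7 bits of each lane so no carry crosses a lane boundary.
    low = (a & LOW7) + (b & LOW7)
    # Fold in the top bit of each lane with XOR (an add without carry-out).
    return (low ^ ((a ^ b) & HIGH1)) & MASK32


packed = add_bytes_packed(0x01FF7F80, 0x01010180)
scalar = add_bytes_scalar(0x01FF7F80, 0x01010180)
```

Both functions produce the same lane-wise result; the packed version does the work of the four-iteration loop with a single wide addition, which is the kind of saving that extended instructions aim for.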
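Extended instructions of this kind execute very efficiently for one class of application but very inefficiently for others. Once a processor's design is complete, the application domain it suits is therefore already fixed, and it is hard to adapt to other domains. Nor can the programmer fine-tune the processor to the algorithm characteristics of other application domains.
Some existing patents discuss how to implement reconfigurable computing. For example, US patent US2005/0027970A1 (Reconfigurable Instruction Set Computing) and patent US2005/0169550 A1 (Video Processing System With Reconfigurable Instructions) adopt a CPU plus FPGA-like structure: the user develops in a unified high-level language, and the compiler partitions the program into a part that runs on the CPU and a part that runs on the FPGA. This approach can exploit the flexibility of the FPGA to speed up programs, but the FPGA's overly flexible configuration results in a low performance/cost ratio for the chip. US patent US2004/0019765 A1 (Pipelined Reconfigurable Dynamic Instruction Set Processor) discusses a processor structure consisting of a RISC processor plus configurable array processing units, in which multiple array processing units are logically divided into multiple pipeline stages and the behavior of each pipeline stage is dynamically configured by the RISC processor. US patent US2006/0211387 A1 (Multistandard SDR Architecture Using Context-Based Operation Reconfigurable Instruction Set Processor) defines a processor structure of a configuration unit plus coprocessors, in which each coprocessor consists of a state control unit and a data path and is responsible for certain similar processing tasks.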
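Summary of the Invention
The technical problem to be solved by the present invention is to provide a processor with a polymorphic instruction set architecture, so as to solve the problem that a processor's instruction set cannot be redefined after tape-out.
To solve this technical problem, the present invention proposes a processor with a polymorphic instruction set architecture, comprising a scalar processing unit, at least one polymorphic instruction processing unit, at least one multi-granularity parallel memory, and a DMA controller. The polymorphic instruction processing unit comprises at least one functional unit; it interprets and executes polymorphic instructions, and its functional units perform specific data operation tasks. A polymorphic instruction is a sequence of multiple consecutively executed microcode records, and a microcode record specifies the action that each functional unit must perform in a given clock cycle. The scalar processing unit invokes polymorphic instructions and queries their execution state. The DMA controller transfers the configuration information of polymorphic instructions and transfers the data they require to the multi-granularity memory.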
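According to a specific embodiment of the present invention, the polymorphic instruction processing unit passively receives polymorphic instructions from the DMA controller and is invoked by the scalar processing unit.
According to a specific embodiment of the present invention, the scalar processing unit controls the polymorphic instruction processing unit through a first control path and controls the DMA controller through a second control path.
According to a specific embodiment of the present invention, the polymorphic instruction processing unit further comprises a microcode memory and a microcode control unit; the microcode memory stores polymorphic instructions; the microcode control unit receives control requests from the scalar processing unit through the first control path and performs the corresponding actions.
According to a specific embodiment of the present invention, the microcode control unit comprises configuration registers, which store the parameters and running state required when the polymorphic instruction processing unit runs.
According to a specific embodiment of the present invention, the control requests of the scalar processing unit include starting or querying the polymorphic instruction processing unit and reading and writing its configuration registers.
According to a specific embodiment of the present invention, the polymorphic instruction processing unit further comprises a transfer control unit; the functional units have multiple data input/output ports and exchange data through the transfer control unit.
According to a specific embodiment of the present invention, the functional unit performs data load/store operations and reads and writes data from the multi-granularity parallel memory through a first internal bus; meanwhile, the microcode memory is connected to the first internal bus as a slave device and passively receives microcode records from outside.
According to a specific embodiment of the present invention, the microcode control unit reads and executes the microcode records of a polymorphic instruction in sequence.
According to a specific embodiment of the present invention, each row of the microcode memory stores one microcode record; when the scalar processing unit invokes a polymorphic instruction, it specifies only the row number, in the microcode memory, of the first microcode record of that polymorphic instruction.
After the processor with a polymorphic instruction set architecture of the present invention has been taped out and manufactured, the programmer can still redefine the processor instruction set according to the characteristics of the application algorithm. After redefinition, the instruction set architecture fits the characteristics of the application algorithm more closely, which improves the processor's performance in that class of application. The redefinition process does not modify the processor hardware or the corresponding software tool chain (assembler, linker, and so on), but for different instruction definitions the instruction set architecture takes different forms.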
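Brief Description of the Drawings
Figure 1 schematically shows the main components and interconnections of the processor with a polymorphic instruction set architecture of the present invention;
Figure 2 schematically shows the main components and interconnections of the polymorphic instruction execution unit of the present invention;
Figure 3 schematically shows the main components of a microcode record of the present invention;
Figure 4 schematically shows how the behavior of a polymorphic instruction is defined and how the microcode memory stores the definition of a polymorphic instruction;
Figure 5 exemplarily shows a flow of defining and invoking a polymorphic instruction of the present invention;
Figure 6 schematically shows the functional units in a processor with a polymorphic instruction set architecture of the present invention;
Figure 7 exemplarily shows the interface definition and internal structure of a computing unit used by the processor of the present invention;
Figure 8 exemplarily shows the interface definition and internal structure of a bus interface unit used by the processor of the present invention;
Figure 9 exemplarily shows the interface definition of a register file used by the processor of the present invention;
Figure 10 exemplarily shows the definition of the data transfer paths between functional components in the processor of the present invention;
Figure 11 exemplarily shows an implementation structure of the data transfer unit inside the computing units in the processor of the present invention;
Figure 12 exemplarily shows an implementation structure of the data transfer unit between functional components in the processor of the present invention;
Figure 13 exemplarily shows the encoding of the functional components in the processor of the present invention;
Figure 14 exemplarily shows the logical behavior of a multiplexer in the processor of the present invention.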
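Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
The present invention proposes a processor whose polymorphic instruction set architecture can be dynamically reconfigured after tape-out (trial production).
The structure of the processor of the present invention is shown in Fig. 1. It mainly comprises the following components: one scalar processing unit 101, at least one polymorphic instruction processing unit 100, at least one multi-granularity parallel memory 102, and one DMA controller 103. The polymorphic instruction processing unit 100 includes at least one functional unit.
A polymorphic instruction is a sequence of multiple consecutively executed microcode records, and the polymorphic instruction set is the set of polymorphic instructions. A microcode record specifies the action each functional unit must perform in a given clock cycle, such as performing an addition, performing a data load, or doing nothing.
The polymorphic instruction processing unit 100 interprets and executes polymorphic instructions, and the functional units it contains perform the specific data operation tasks. The scalar processing unit 101 invokes polymorphic instructions and queries their execution status. The DMA controller 103 transfers the configuration information of polymorphic instructions and transfers the data they require to the multi-granularity parallel memory 102.
The scalar processing unit 101 controls the polymorphic instruction processing unit 100 through a first control path 104 and controls the DMA controller 103 through a second control path 105. The DMA controller 103 transfers configuration information to the polymorphic instruction processing unit 100 over a first internal bus 106, transfers data to the multi-granularity parallel memory 102 over a second internal bus 107, and reads and writes external data over a bus 108. The polymorphic instruction processing unit 100 reads and writes data in the multi-granularity parallel memory 102 over the second internal bus 107.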
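The Fig. 1 interconnect can be summarized as a small table, which makes it easy to check which link carries a given kind of traffic. The following is a minimal sketch in Python; the reference numerals come from the description, while the `Link` type, the unit names, and the helper `links_between` are assumptions made only for illustration.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    numeral: int   # reference numeral used in the description
    kind: str      # "control" or "bus"
    source: str    # unit that initiates the transfer or request
    target: str    # unit that receives it


# Hypothetical unit names; the connections themselves follow the text.
LINKS = [
    Link(104, "control", "scalar_unit", "poly_unit"),  # start/query/config access
    Link(105, "control", "scalar_unit", "dma"),
    Link(106, "bus", "dma", "poly_unit"),              # configuration information
    Link(107, "bus", "dma", "memory"),                 # data needed by instructions
    Link(107, "bus", "poly_unit", "memory"),           # loads/stores by functional units
    Link(108, "bus", "dma", "external"),               # off-chip reads and writes
]


def links_between(source: str, target: str) -> List[int]:
    """Return the reference numerals of all links from source to target."""
    return [l.numeral for l in LINKS if l.source == source and l.target == target]
```

For instance, `links_between("dma", "poly_unit")` yields the first internal bus 106, while the scalar unit has no direct link to the memory in this model.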
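The scalar processing unit 101 may be a RISC processor or a DSP, but it must have the first control path 104, and this control path 104 must provide the following functions:
1. starting the polymorphic instruction processing unit 100;
2. querying the execution status of the polymorphic instruction processing unit 100;
3. reading and writing the configuration registers of the polymorphic instruction processing unit 100 (described below).
The multi-granularity parallel memory 102 is the multi-granularity parallel memory disclosed in Chinese patent application No. 201110460585.1 (entitled "Multi-granularity Parallel Storage System and Memory"); this memory supports parallel row and column reads and writes of matrix data of different data types.
The master of the second internal bus 107 is the polymorphic instruction processing unit 100, and its slave is the multi-granularity parallel memory 102. The DMA controller 103 and the polymorphic instruction processing unit 100 can read and write data in the multi-granularity parallel memory 102 over this second internal bus 107.
The master of the first internal bus 106 is the DMA controller 103, and its slave is the polymorphic instruction processing unit 100. The DMA controller 103 can write polymorphic instructions to the polymorphic instruction processing unit 100 over this first internal bus 106. Polymorphic instructions are stored in an external memory connected to the bus 108.
Polymorphic Instruction Processing Unit
The polymorphic instruction processing unit 100 passively receives polymorphic instructions from the DMA controller 103 and is invoked by the scalar processing unit 101. Fig. 2 shows the internal structure of the polymorphic instruction processing unit 100.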
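The polymorphic instruction processing unit 100 includes a microcode memory 200, a microcode control unit 201, at least one functional unit 202, and a transfer control unit 203. The microcode memory 200 stores polymorphic instructions. The microcode control unit 201 receives the various control requests of the scalar processing unit 101 through the first control path 104 and performs the corresponding actions. The microcode control unit 201 includes configuration registers 207, which store the parameters the polymorphic instruction processing unit 100 needs at run time and its run status, for example which functional units 202 execute the current polymorphic instruction, the start address and total length of the required data, and whether the polymorphic instruction processing unit 100 is currently idle.
The control requests include:
1. Starting the polymorphic instruction processing unit 100: the microcode control unit 201 reads microcode records 300 from the microcode memory 200, generates the corresponding control information, and sends it to the functional units 202 and the transfer control unit 203.
2. Querying the polymorphic instruction processing unit 100: the microcode control unit 201 returns the execution status of the current polymorphic instruction: completed or idle.
3. Reading or writing a configuration register 207 of the polymorphic instruction processing unit 100: the microcode control unit 201 writes the specified data to the specified configuration register 207, or returns the data held in the specified configuration register 207.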
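The three request types can be pictured as a tiny software model of the microcode control unit 201. This is only a sketch under stated assumptions: the class and method names, the number of configuration registers, and the `finish` hook that stands in for the last microcode record completing are all hypothetical and not taken from the patent.

```python
class MicrocodeControlUnitModel:
    """Toy model of the requests accepted over the first control path 104."""

    def __init__(self, num_config_regs: int = 4):
        # Stand-in for configuration registers 207 (register count is assumed).
        self.config = [0] * num_config_regs
        self.idle = True
        self.start_row = None

    def start(self, start_row: int) -> None:
        """Request 1: begin reading microcode records at start_row."""
        self.start_row = start_row
        self.idle = False

    def is_idle(self) -> bool:
        """Request 2: report whether the unit is currently idle."""
        return self.idle

    def write_config(self, index: int, value: int) -> None:
        """Request 3 (write): store value in the selected configuration register."""
        self.config[index] = value

    def read_config(self, index: int) -> int:
        """Request 3 (read): return the selected configuration register."""
        return self.config[index]

    def finish(self) -> None:
        """Hypothetical hook: called when the last microcode record has executed."""
        self.idle = True
```

A scalar program in this model would write the data start address and length with `write_config`, call `start` with the starting row, and then poll `is_idle` until the instruction has finished.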
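Depending on application requirements, the polymorphic instruction processing unit 100 may be designed with one or more different functional units 202. A functional unit 202 performs specific data operation tasks, such as addition or data load/store operations. A functional unit 202 generally has multiple data input/output ports and exchanges data through the transfer control unit 203. For example, after completing an addition, an adder unit passes the sum to the transfer control unit 203, which then feeds it to a multiplier unit for multiplication.
The transfer control unit 203 is connected to the data input/output ports of all functional units 202. Through an interface 206 it receives from the microcode control unit 201 the source and destination of the data at each moment, and it delivers the data from the source to the destination.
Bus 107 is the first internal bus 107 of Fig. 1. Certain types of functional units 202 need to perform data load/store operations and therefore read and write data in the multi-granularity parallel memory 102 over the first internal bus 107. In addition, the microcode memory 200 is connected to the first internal bus 107 as a slave device and passively receives microcode records 300 from outside.
Definition and Invocation of Polymorphic Instructions
Fig. 3 shows the structure of a microcode record 300. The microcode record 300 is divided into multiple fields, and each functional unit has a corresponding field in the microcode record 300; for example, functional unit field 301 corresponds to the second functional unit. The microcode record 300 also contains a special microcode control field 302, which specifies which row of microcode records 300 the microcode control unit 201 must read on the next clock.
As described above, a polymorphic instruction of the present invention is a sequence of consecutively executed microcode records 300 that has a specific function, as shown in Fig. 4. The polymorphic instruction, that is, the sequence of microcode records 300, is stored in the microcode memory 200 and is read and executed in order by the microcode control unit 201. Each row of the microcode memory 200 holds one microcode record 300; when the scalar processing unit 101 invokes a polymorphic instruction, it only needs to specify the row number, in the microcode memory 200, of that instruction's starting record.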
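To make the row-based organization concrete, the following Python sketch models a microcode memory as a list of records and steps through one polymorphic instruction from its starting row. It is an illustration only: the names `MicrocodeRecord` and `run`, the action strings, and the use of `None` in the control field to mark the end of an instruction are assumptions, since the text does not say how termination is encoded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MicrocodeRecord:
    # One field per functional unit: unit name -> action for this clock cycle.
    # A unit that is absent from the dict does nothing in this cycle.
    unit_fields: Dict[str, str] = field(default_factory=dict)
    # Stand-in for control field 302: the row to read on the next clock.
    # None ends the instruction (an assumption of this sketch).
    next_row: Optional[int] = None


def run(memory: List[MicrocodeRecord], start_row: int,
        max_cycles: int = 1000) -> List[Tuple[int, Dict[str, str]]]:
    """Execute records from start_row and return a (row, actions) trace."""
    trace = []
    row: Optional[int] = start_row
    while row is not None and len(trace) < max_cycles:
        record = memory[row]
        trace.append((row, record.unit_fields))
        row = record.next_row
    return trace


# Two hypothetical instructions stored back to back: rows 0-1 and rows 2-4.
memory = [
    MicrocodeRecord({"BIU0": "load"}, 1),
    MicrocodeRecord({"IALU": "add"}, None),
    MicrocodeRecord({"BIU0": "load"}, 3),
    MicrocodeRecord({"IMAC": "mac"}, 4),
    MicrocodeRecord({"BIU1": "store"}, None),
]
```

Calling `run(memory, 2)` visits rows 2, 3, and 4 in that order, which mirrors the scalar unit invoking the second instruction by giving only its starting row number.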
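Using microcode records 300, the programmer can flexibly define, according to the needs of the algorithm, the behavior of a polymorphic instruction and its starting row number in the microcode memory. Fig. 5 shows, by way of example, a flow for defining and invoking polymorphic instructions. First, the programmer defines the behavior of one or more polymorphic instructions according to application requirements and converts that behavior into a sequence of microcode records 300. The sequence is generally expressed as text; for example, "ALU.T0 = T1 + T2 (U) || Repeat(10)" means that the ALU performs ten additions. The programmer also writes scalar code that invokes the programmer-defined polymorphic instructions; at this point the starting row number of each polymorphic instruction is not yet determined and is represented by a label, such as Instr1. After compilation and linking, the textual polymorphic instruction records become a binary file that the microcode control unit 201 can interpret, and the starting address of each polymorphic instruction is determined during compilation and linking; for example, the value of Instr1 is now fixed at 10. After the scalar code has been compiled and linked, it must additionally be cross-linked with the polymorphic instruction binary file, which replaces the symbolic polymorphic instruction start addresses in the original scalar code with actual values and produces the scalar binary file. Before invoking a polymorphic instruction, the scalar code uses the DMA controller 103 to load the contents of the polymorphic instruction binary file into the microcode memory, and then invokes the instruction.
Embodiment of a Processor with a Polymorphic Instruction Set Architecture
An exemplary embodiment of the polymorphic instruction set architecture is given below. This embodiment is only one implementation of the present invention, and the invention is not limited to this example.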
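As an aside before the embodiment details, the label resolution and cross-linking steps of the Fig. 5 flow can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the function names, the `call_poly` and `dma_load_microcode` opcodes, the `NOP` records, and the sequential layout policy are assumptions, and the real toolchain formats are not described in the text.

```python
from typing import Dict, List, Optional, Tuple, Union

Operand = Optional[Union[str, int]]


def layout_microcode(instructions: Dict[str, List[str]],
                     base_row: int = 0) -> Tuple[List[str], Dict[str, int]]:
    """Place each labeled instruction's records in consecutive rows.

    Returns the flat list of rows and a symbol table label -> starting row.
    """
    rows: List[str] = []
    symbols: Dict[str, int] = {}
    for label, records in instructions.items():
        symbols[label] = base_row + len(rows)
        rows.extend(records)
    return rows, symbols


def cross_link(scalar_code: List[Tuple[str, Operand]],
               symbols: Dict[str, int]) -> List[Tuple[str, Operand]]:
    """Replace symbolic operands of 'call_poly' with actual row numbers."""
    linked = []
    for op, operand in scalar_code:
        if op == "call_poly":
            operand = symbols[operand]
        linked.append((op, operand))
    return linked


# Hypothetical input: a 10-record instruction followed by Instr1.
instructions = {
    "Instr0": ["NOP"] * 10,
    "Instr1": ["ALU.T0 = T1 + T2 (U) || Repeat(10)"],
}
rows, symbols = layout_microcode(instructions)
scalar_code = [("dma_load_microcode", None), ("call_poly", "Instr1")]
linked = cross_link(scalar_code, symbols)
```

With this input, `Instr1` lands at row 10, matching the value used in the text's example, and the linked scalar code calls row 10 directly.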
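This embodiment is a processor with a polymorphic instruction set architecture aimed at data-intensive applications. Fig. 6 shows the functional units of this processor. As shown in Fig. 6, the data width of all functional units is 512 bits; during data operations, the 512 bits can be treated as 64 8-bit, 32 16-bit, or 16 32-bit data elements. Among the functional units, IALU performs fixed-point logic computation, FALU performs floating-point logic computation, IMAC performs fixed-point multiply-accumulate computation, and FMAC performs floating-point multiply-accumulate operations. SHU0 and SHU1 perform data interleaving operations, that is, exchanging the positions of any two 8-bit data elements within the 512-bit data. M is a 512-bit-wide register file, and BIU0, BIU1, and BIU2 are bus interface units responsible for loading data from and storing data to the multi-granularity parallel memory 102.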
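The three views of a 512-bit word, and the kind of 8-bit exchange attributed to SHU0 and SHU1, can be illustrated with plain Python integers. This is a sketch under assumptions: the function names are invented, and numbering lanes from the least significant end is a choice made here, not something the text specifies.

```python
from typing import List

WIDTH_BITS = 512


def split_lanes(value: int, lane_bits: int) -> List[int]:
    """View a 512-bit value as lanes of lane_bits each (lane 0 = least significant)."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane width must be 8, 16, or 32 bits")
    mask = (1 << lane_bits) - 1
    count = WIDTH_BITS // lane_bits
    return [(value >> (i * lane_bits)) & mask for i in range(count)]


def swap_bytes(value: int, i: int, j: int) -> int:
    """Exchange 8-bit lanes i and j of a 512-bit value."""
    byte_i = (value >> (8 * i)) & 0xFF
    byte_j = (value >> (8 * j)) & 0xFF
    cleared = value & ~((0xFF << (8 * i)) | (0xFF << (8 * j)))
    return cleared | (byte_j << (8 * i)) | (byte_i << (8 * j))


# Example word whose byte lane k holds the value k.
word = int.from_bytes(bytes(range(64)), "little")
```

Splitting `word` at 8, 16, and 32 bits gives 64, 32, and 16 lanes respectively, and `swap_bytes(word, 0, 63)` exchanges the first and last byte lanes while leaving the others untouched.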
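IALU, FALU, IMAC, FMAC, SHU0, and SHU1 have similar interfaces and are collectively referred to in this embodiment as compute units 500. The interface of a compute unit 500 is shown in Fig. 7. It includes four data input ports 604 and four corresponding temporary registers 600. The operation logic 601 reads data from the temporary registers and performs the operation; the result is written to a temporary register 602 and then sent through output port 603 to the transfer control unit 203.
BIU0, BIU1, and BIU2 are collectively referred to as bus interface units 501, whose internal structure is shown in Fig. 8. A bus interface unit has a data input port 702, through which it obtains data from the transfer control unit 203 and writes the data to temporary register 700; a data output port 703, through which the data in temporary register 701 is sent to the transfer control unit 203; an internal bus interface 107, through which it reads and writes data in the multi-granularity parallel memory 102; and address calculation logic 704, which computes the addresses sent to the second internal bus 107.
M is a 512-bit-wide register file with 4 write ports 800, 4 read ports 802, and the corresponding storage bank 801. Fig. 9 illustrates the interface of this register file.
In the polymorphic instruction set architecture, the result computed by one functional unit can be passed directly to other functional units, which enables cascaded operations. In this embodiment, direct data transfer paths are not needed between every pair of functional units. For example, FMAC mainly performs floating-point multiply-accumulate operations, so there is no need to pass its results directly to the fixed-point compute units IALU or IMAC. Reducing the number of data transfer paths reduces the wiring between functional units, which in turn reduces chip area and chip cost. The data transfer paths between functional units in this embodiment are shown in Fig. 10: in that table, the head of each column is a data destination, the head of each row is a data source, and a checked cell indicates that a transfer path exists. To reduce the number of transfer paths further, certain functional units may share a transfer path as the application requires. Sharing paths reduces on-chip wiring, but the functional units that share a path cannot all transfer data at the same time. For example, if SHU0-to-BIU0 and SHU1-to-BIU1 share one transfer path, then while SHU0 is transferring data to BIU0, no data can be transferred between SHU1 and BIU1. The shaded cells in Fig. 10 indicate some of the shared transfer paths.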
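The two constraints described here, missing paths and shared paths, can be checked mechanically for the transfers scheduled in one clock cycle. The sketch below uses a small hypothetical excerpt of a Fig. 10 style table: only the SHU0-to-BIU0 / SHU1-to-BIU1 sharing and the absence of an FMAC-to-IALU path come from the text, while the other entries, the path identifiers, and the function name are assumptions.

```python
from typing import Dict, List, Tuple

# (source, destination) -> identifier of the physical path that carries it.
# Two entries with the same identifier share one path.
PATHS: Dict[Tuple[str, str], str] = {
    ("SHU0", "BIU0"): "shared_shu_biu",
    ("SHU1", "BIU1"): "shared_shu_biu",
    ("IALU", "IMAC"): "ialu_imac",   # assumed entry, for illustration only
    ("FALU", "FMAC"): "falu_fmac",   # assumed entry, for illustration only
}


def check_cycle(transfers: List[Tuple[str, str]]) -> List[str]:
    """Return the problems found among transfers scheduled in the same cycle."""
    problems: List[str] = []
    in_use: Dict[str, Tuple[str, str]] = {}
    for src, dst in transfers:
        path = PATHS.get((src, dst))
        if path is None:
            problems.append(f"no direct path {src}->{dst}")
        elif path in in_use:
            other_src, other_dst = in_use[path]
            problems.append(
                f"{src}->{dst} conflicts with {other_src}->{other_dst}")
        else:
            in_use[path] = (src, dst)
    return problems
```

Scheduling `("SHU0", "BIU0")` together with `("SHU1", "BIU1")` reports a conflict, whereas either transfer on its own, or paired with a transfer on a different path, passes the check.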
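The transfer control unit 203 corresponding to Fig. 10 consists of 29 multiplexers. For ease of description, we divide the transfer control unit 203 into two levels. The first level consists of IALU, IMAC, FALU, and FMAC and is provisionally called the ACU, as shown in Fig. 11. This level exchanges data with the other functional units through three input ports, ACU.I0, ACU.I1, and ACU.I2, and one output port, ACU.O. The ACU contains 16 multiplexers in total, M13~M28 in Fig. 11; the data inputs of each multiplexer are marked in the figure.
The second level consists of the ACU, M, SHU0, SHU1, and BIU0~BIU2, as shown in Fig. 12. It contains 13 multiplexers in total, M0~M12 in Fig. 12; the data inputs of each multiplexer are marked in the figure.
To generate the control signals of the 29 multiplexers in the transfer control unit 203, we first group and encode all functional units, as shown in Fig. 13, where "X" means don't care: either "0" or "1" is acceptable. Each functional unit control field 301 in a microcode record 300 must specify not only the operation the functional unit is to perform but also the destination of the result, and the destination is specified with the encoding of Fig. 13. For example, a FALU control field expressed as text is "IALU.T0 = FALU.T1 + T2", where "FALU.T1 + T2" to the right of "=" indicates that FALU is to perform an addition, and "IALU" to the left of "=" is the destination of the result; the encoding of that destination is "1100".
The microcode control unit 201 sends the destination information of all functional units in the microcode record 300 to the transfer control unit 203, and the transfer control unit 203 generates the control signals of the 29 multiplexers from this destination information. Fig. 14 describes the logic behavior of multiplexer M0, where GroupID denotes the group number of the destination in the corresponding functional unit control field 301.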
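The idea of deriving a multiplexer select from destination codes can be sketched as pattern matching with don't-care bits. This is not the logic of Fig. 14, which is not reproduced in the text: the function names, the pattern "11XX", and the code "0001" are hypothetical, and only the code "1100" for IALU and the meaning of "X" come from the description.

```python
from typing import Dict, Optional


def matches(code: str, pattern: str) -> bool:
    """True if a binary code matches a pattern in which 'X' is don't-care."""
    return len(code) == len(pattern) and all(
        p in ("X", c) for c, p in zip(code, pattern))


def mux_select(mux_pattern: str, dest_codes: Dict[str, str]) -> Optional[str]:
    """Pick the source unit whose destination code targets this multiplexer.

    dest_codes maps a source unit name to the destination code found in its
    control field. Returns None when no unit targets the multiplexer and
    raises ValueError when more than one does.
    """
    hits = [src for src, code in dest_codes.items()
            if matches(code, mux_pattern)]
    if len(hits) > 1:
        raise ValueError(f"multiple sources target {mux_pattern}: {hits}")
    return hits[0] if hits else None


# FALU sends its result to IALU (code "1100" per the text); the SHU0 code
# "0001" is a made-up value for some other destination.
dest_codes = {"FALU": "1100", "SHU0": "0001"}
```

Here a multiplexer guarding the IALU input, modeled with pattern "1100", selects FALU, and a multiplexer whose pattern matches no code selects nothing.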
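The specific embodiments described above explain the objectives, technical solutions, and beneficial effects of the present invention in further detail. It should be understood that the foregoing is merely specific embodiments of the present invention and is not intended to limit the invention; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.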

Claims

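1. A processor with a polymorphic instruction set architecture, characterized in that it comprises one scalar processing unit (101), at least one polymorphic instruction processing unit (100), at least one multi-granularity parallel memory (102), and one DMA controller (103); the polymorphic instruction processing unit (100) includes at least one functional unit (202);
the polymorphic instruction processing unit (100) is used to interpret and execute polymorphic instructions, and its functional unit (202) is used to perform specific data operation tasks, wherein a polymorphic instruction is a sequence of multiple consecutively executed microcode records (300), and a microcode record specifies the actions the functional units (202) must perform in a given clock cycle;
the scalar processing unit (101) is used to invoke polymorphic instructions and to query the execution status of polymorphic instructions;
the DMA controller (103) is used to transfer the configuration information of polymorphic instructions and to transfer the data required by polymorphic instructions to the multi-granularity parallel memory (102).
2. The processor with a polymorphic instruction set architecture according to claim 1, characterized in that the polymorphic instruction processing unit (100) passively receives polymorphic instructions from the DMA controller (103) and is invoked by the scalar processing unit (101).
3. The processor with a polymorphic instruction set architecture according to claim 2, characterized in that the scalar processing unit (101) controls the polymorphic instruction processing unit (100) through a first control path (104), and the scalar processing unit (101) controls the DMA controller (103) through a second control path (105).
4. The processor with a polymorphic instruction set architecture according to claim 3, characterized in that the polymorphic instruction processing unit (100) further includes a microcode memory (200) and a microcode control unit (201);
the microcode memory (200) is used to store polymorphic instructions;
the microcode control unit (201) is used to receive control requests from the scalar processing unit (101) through the first control path (104) and to perform the corresponding actions.
5. The processor with a polymorphic instruction set architecture according to claim 4, characterized in that the microcode control unit (201) includes configuration registers (207), and the configuration registers (207) are used to store the parameters the polymorphic instruction processing unit (100) needs at run time and its run status.
6. The processor with a polymorphic instruction set architecture according to claim 5, characterized in that the control requests of the scalar processing unit (101) include starting or querying the polymorphic instruction processing unit (100), and reading or writing the configuration registers (207) of the polymorphic instruction processing unit (100).
7. The processor with a polymorphic instruction set architecture according to claim 5, characterized in that the polymorphic instruction processing unit (100) further includes a transfer control unit (203), and the functional unit (202) has multiple data input/output ports and exchanges data through the transfer control unit (203).
8. The processor with a polymorphic instruction set architecture according to claim 5, characterized in that the functional unit (202) is used to perform data load/store operations and reads and writes data in the multi-granularity parallel memory (102) over a first internal bus (107); in addition, the microcode memory (200) is connected to the first internal bus (107) as a slave device and passively receives microcode records (300) from outside.
9. The processor with a polymorphic instruction set architecture according to claim 4, characterized in that the microcode control unit (201) reads and executes the microcode records (300) of a polymorphic instruction in order.
10. The processor with a polymorphic instruction set architecture according to claim 9, characterized in that each row of the microcode memory (200) holds one microcode record (300), and when the scalar processing unit (101) invokes a polymorphic instruction, it specifies only the row number, in the microcode memory (200), of the starting microcode record of that polymorphic instruction.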
PCT/CN2013/074426 2013-04-19 2013-04-19 具有多态指令集体系结构的处理器 WO2014169477A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2013/074426 WO2014169477A1 (zh) 2013-04-19 2013-04-19 具有多态指令集体系结构的处理器
US14/785,385 US20160162290A1 (en) 2013-04-19 2013-04-19 Processor with Polymorphic Instruction Set Architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2013/074426 WO2014169477A1 (zh) 2013-04-19 2013-04-19 具有多态指令集体系结构的处理器

Publications (1)

Publication Number Publication Date
WO2014169477A1 true WO2014169477A1 (zh) 2014-10-23

Family

ID=51730708

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/074426 WO2014169477A1 (zh) 2013-04-19 2013-04-19 具有多态指令集体系结构的处理器

Country Status (2)

Country Link
US (1) US20160162290A1 (zh)
WO (1) WO2014169477A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709858A (zh) * 2016-12-12 2017-05-24 中国航空工业集团公司西安航空计算技术研究所 一种统一染色图形处理器单指令多线程染色处理单元结构
US10489358B2 (en) 2017-02-15 2019-11-26 Ca, Inc. Schemas to declare graph data models

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050169550A1 (en) * 2003-07-29 2005-08-04 Arnold Jeffrey M. Video processing system with reconfigurable instructions
US20060211387A1 (en) * 2005-02-17 2006-09-21 Samsung Electronics Co., Ltd. Multistandard SDR architecture using context-based operation reconfigurable instruction set processors
CN101133409A (zh) * 2005-03-03 2008-02-27 Clear-Speed科技公司 处理器中的可再配置逻辑
CN101908032B (zh) * 2010-08-30 2012-08-15 湖南大学 可重新配置处理器集合的处理器阵列

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5036453A (en) * 1985-12-12 1991-07-30 Texas Instruments Incorporated Master/slave sequencing processor
US6895452B1 (en) * 1997-06-04 2005-05-17 Marger Johnson & Mccollom, P.C. Tightly coupled and scalable memory and execution unit architecture
US8156362B2 (en) * 2008-03-11 2012-04-10 Globalfoundries Inc. Hardware monitoring and decision making for transitioning in and out of low-power state
US10140129B2 (en) * 2012-12-28 2018-11-27 Intel Corporation Processing core having shared front end unit

Also Published As

Publication number Publication date
US20160162290A1 (en) 2016-06-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13882227

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 14785385

Country of ref document: US

122 Ep: pct application non-entry in european phase

Ref document number: 13882227

Country of ref document: EP

Kind code of ref document: A1