CN111124999A - Dual-mode computer framework supporting in-memory computation - Google Patents

Dual-mode computer framework supporting in-memory computation

Info

Publication number
CN111124999A
CN111124999A (application CN201911258025.0A)
Authority
CN
China
Prior art keywords
memory
architecture
computation
instruction
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911258025.0A
Other languages
Chinese (zh)
Other versions
CN111124999B (en)
Inventor
Zhang Zhang
Jianmin Zeng
Yadong Wei
Guangjun Xie
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN201911258025.0A priority Critical patent/CN111124999B/en
Publication of CN111124999A publication Critical patent/CN111124999A/en
Application granted granted Critical
Publication of CN111124999B publication Critical patent/CN111124999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of computer architectures, and in particular to a dual-mode computer architecture supporting in-memory computation. The beneficial effect of the invention is that the proposed architecture can work as a non-von Neumann architecture while also preserving the traditional von Neumann architecture; the goal of this is to make maximum use of existing compilation toolchains and programming models. Simulations show that, compared with a baseline, the architecture of the present invention can accelerate specific applications by tens to hundreds of times, depending on the number of macro cells used in the IMC-SRAM. With one bank of macro cells, a binary neural network of size 32 x 32 and a hash algorithm over a 512-bit character input are accelerated by 6.7 and 12.2 times, respectively. Compared with the baseline, the architecture saves energy by a factor of 3 on average.

Description

Dual-mode computer framework supporting in-memory computation
Technical Field
The invention relates to the technical field of computer architectures, in particular to a dual-mode computer architecture supporting in-memory computing.
Background
Most computers today employ the traditional von Neumann architecture, in which the computing unit and the memory unit are separate. Such computers follow the paradigm of load (from memory), compute (in the CPU), write back (to memory). On the one hand, because chip or IP pins are limited, this separation of computing and storage restricts the bandwidth between the computing unit and the memory, so the bandwidth resources of the memory module cannot be fully utilized. On the other hand, the separation forces the CPU to continuously read and write a distant memory while processing data, which inevitably wastes a great deal of power. For example, a 64-bit double-precision floating-point operation consumes roughly 50 pJ, a CPU core consumes roughly 500 pJ to read data from a cache 20 mm away, and reading the corresponding data from off-chip DRAM requires roughly 6000 pJ, more than a hundred times the energy of the operation itself. With the development of big data, artificial intelligence and the Internet of Things, solving the bandwidth bottleneck between the CPU and the memory and improving the energy efficiency of the system are increasingly urgent, and this places greater demands on processor architecture design: for example, how to satisfy the bandwidth demands that the computing units in an artificial-intelligence chip place on the storage modules, and how to let Internet-of-Things terminals (mostly MCUs, i.e., microcontroller units) improve processing performance while maintaining or even reducing power consumption.
One solution to these problems is processing-in-memory (PIM), whose design principle is to place the computing unit and the memory module as close together as possible, or even to merge them, i.e., computing in memory.
In fact, the concept of integrating storage and computation was proposed as early as the 1970s; because the CPU's bandwidth demands on memory and its power consumption were not yet the main contradiction in the development of computer technology, the concept received little attention. It was not until the 1990s that a large number of such machines appeared, whose design concept can be summarized as placing the computing units and the memory units on a single chip fabricated in a logic or DRAM process. However, such designs were extremely expensive to produce, and the resulting processors had poor programmability and upgradability, so they were never popularized.
In recent years, with the development of artificial intelligence, in-memory computing based on static random access memory (SRAM) has attracted wide attention. The basic principle of the technique is to read several SRAM memory cells (Bitcells) on a pair of bit lines simultaneously: the discharge behavior on a bit line then expresses a logical AND over the stored values, and if a dual-input dual-output sense amplifier (SA) is used at the end of the bit lines, several logic results can be obtained at the SA outputs. Fig. 1 depicts the principle of SRAM-based in-memory computing.
However, most existing SRAM-based in-memory computing designs target artificial-intelligence acceleration; none has been designed for general-purpose computing.
Disclosure of Invention
It is an object of the present invention to overcome the deficiencies of the prior art by providing a dual-mode computer architecture supporting in-memory computing that addresses at least some of those deficiencies.
To achieve this technical purpose and effect, the invention is realized by the following technical scheme:
a dual-mode computer architecture supporting internal memory computation comprises a processor core, an instruction memory, a memory computation coprocessor and a plurality of computation type SRAM macro units, wherein the processor core adopts a reduced instruction set core and is provided with a six-stage pipeline, a pre-decoding stage pipeline is inserted in an instruction fetching stage and a decoding stage and is used for decoding memory computation special instructions, the instruction memory is generated before memory compilation, the memory computation coprocessor is used for decoding the memory computation special instructions transmitted by the processor core, switching the working mode of a system and controlling the read-write and memory computation of the SRAM macro units, the SRAM macro units are provided with three groups of row decoders and one group of column decoders, the SRAM macro units also comprise an IM-ALU unit, the IM-ALU unit contains a small number of logic gates and a plurality of selectors, and the logic gates are used for completing the memory computation operation outside SA, the selector is used for selecting the calculated result to write back to the memory array, the storage calculation coprocessor controls the selection of the architecture mode of the computer and is controlled by the type of the instruction taken by the CPU, in a common mode, the storage calculation coprocessor processes data according to the paradigm of reading data from the memory, processing the data and writing the data back to the memory, in an IMC mode, the architecture is equivalent to a non-Von Neumann architecture, and the CPU sends the storage calculation instruction to an SRAM macro cell of the memory, directly performs the data operation in the memory and directly writes back to Bitcell.
Further, the column decoders are configured by the in-memory computing coprocessor.
The invention has the beneficial effects that: compared with the traditional von Neumann architecture, the invention is dual-mode, i.e., the proposed architecture can work as a non-von Neumann architecture while the traditional von Neumann architecture is retained. The goal of this is to make maximum use of existing compilation toolchains and programming models. Simulations show that, compared with a baseline (identical to the present architecture but without the in-memory computation function), the architecture of the present invention can accelerate specific applications by tens to hundreds of times, depending on the number of macro cells used in the IMC-SRAM. With one bank of macro cells, a 32 x 32 binary neural network (BNN) and a Hash algorithm over a 512-bit character input can be accelerated by 6.7 and 12.2 times, respectively. Compared with the baseline, the architecture saves energy by a factor of 3 on average.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a prior art SRAM-based in-memory computation;
FIG. 2 is a diagram of an SRAM architecture supporting in-memory computation in accordance with the present invention;
FIG. 3 is a diagram of the dual-mode computer architecture according to the present invention;
FIG. 4 is a comparison of a conventional instruction and an IMC instruction;
FIG. 5 is a schematic diagram of adaptive parallel vector operation in SRAM according to the present invention.
Detailed Description
To make the technical means, creative features, objectives and effects of the invention easy to understand, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 shows the principle of computation based on SRAM bit-line discharge. The left diagram illustrates a logic operation on a column of Bitcells; the right diagram shows a conventional six-transistor (6T) cell (Bitcell). The data to be operated on are first written into the Bitcells (Bitcell j and Bitcell k in the figure serve as examples), and the bit lines BL and BLB are precharged to a high level. The word lines WLj and WLk are then opened simultaneously, and the bit lines on the two sides discharge under different conditions according to the data stored in Bitcell j and Bitcell k. For a given Bitcell, if it stores 0 (i.e., Q is 0 in the right diagram of Fig. 1), BL is discharged to ground through M5 and M3; conversely, BLB is discharged through M6 and M4. Thus, for a given bit line, if any Bitcell on its side stores 0, the bit line discharges, and after comparison and amplification the SA outputs logic 0, which naturally forms a logical AND relation; similarly, logic such as NAND, OR and NOR can be obtained. Adding a NAND gate after the SA yields exclusive-OR logic, and with a few more logic gates, results such as addition can be obtained. This is the basic principle of SRAM-based in-memory computing.
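To make the bit-line principle concrete, the following minimal C sketch (not part of the patent; all names are illustrative) tabulates what a dual-output SA would read from a column holding two operand bits, together with the NAND, OR and XOR values derived by a few extra gates; here XOR is formed as OR AND NAND, one of several equivalent gate choices.

    /* Model of Fig. 1: cells j and k on one column are activated together
     * and the dual-output sense amplifier reads both bit lines.
     * BL stays high only if both cells store 1  -> SA(BL)  = AND
     * BLB stays high only if both cells store 0 -> SA(BLB) = NOR */
    #include <stdio.h>

    static void bitline_read(int qj, int qk, int *bl, int *blb) {
        *bl  = qj & qk;      /* AND appears on the BL side  */
        *blb = !(qj | qk);   /* NOR appears on the BLB side */
    }

    int main(void) {
        for (int qj = 0; qj <= 1; qj++)
            for (int qk = 0; qk <= 1; qk++) {
                int bl, blb;
                bitline_read(qj, qk, &bl, &blb);
                int nand = !bl;          /* one inverter after SA     */
                int or_  = !blb;         /* one inverter after SA     */
                int xor_ = or_ & nand;   /* extra gate yielding XOR   */
                printf("qj=%d qk=%d AND=%d NOR=%d NAND=%d OR=%d XOR=%d\n",
                       qj, qk, bl, blb, nand, or_, xor_);
            }
        return 0;
    }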
FIG. 2 is a schematic diagram of a computational SRAM macro cell, called an IMC-SRAM, built according to the principle described for Fig. 1. Unlike a conventional SRAM, the IMC-SRAM uses three sets of row decoders and one configurable set of column decoders in order to enable in-memory computation. In addition, the IMC-SRAM contains a lightweight arithmetic logic unit (IM-ALU), consisting mainly of a few logic gates, which completes operations after the SA.
The working mode of the IMC-SRAM is configurable. When the system works in normal mode, the IMC-SRAM behaves like a conventional SRAM: only one set of row decoders and the column decoders are active, and the read/write behavior is the same as that of an ordinary SRAM. When the system works in IMC mode, the IMC-SRAM also switches to this mode; it then reads the data directly for computation and writes the result back into the memory array. In this mode, two row decoders, Ar1 and Ar2, are used to read the two source operands, and Ar3 is used to write the computed result back into the array. Since the two operands of a computation must be stored row-aligned, the operands can share one set of column decoders. When a two-operand read is performed in IMC mode, the columns to be read are configured through VL_cfg. The mode, VL_cfg, and the row and column addresses all come from the IMC-CP.
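As a reading aid, the per-macro-cell state driven by the IMC-CP, as just described, can be pictured with the following C struct; the type and field names are assumptions for exposition, not the patent's signal names.

    #include <stdint.h>

    enum imc_mode { MODE_NORMAL, MODE_IMC };

    struct imc_sram_cfg {
        enum imc_mode mode;  /* working mode, driven by the IMC-CP        */
        uint16_t ar1, ar2;   /* row addresses of the two source operands  */
        uint16_t ar3;        /* row address the result is written back to */
        uint32_t vl_cfg;     /* column-select configuration for the read  */
    };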
Fig. 3 gives an overview of the dual-mode computer architecture proposed by the present invention. The architecture comprises a reduced-instruction-set CPU core, an instruction memory, an in-memory computing coprocessor IMC-CP and several computational memory macro cells. The main difference from the traditional von Neumann architecture is the computational memory IMC-SRAM: the CPU can access the memory by address, but it can also send an instruction to the memory so that data are operated on inside the memory and the result is written directly back to the memory. Conventional instructions and store-compute instructions have equal bit widths and can be stored mixed in the same instruction memory. The CPU decodes each fetched instruction: if it is a conventional instruction, pipeline execution continues; if it is a store-compute instruction, it is handed to the IMC-CP for processing. At that point, if the next instruction in the CPU does not depend on the memory access, execution continues; otherwise the pipeline stalls until the in-memory computation completes. On receiving a store-compute instruction, the IMC-CP switches the mode to IMC mode. A store-compute instruction contains the opcode of the in-memory operation, the source operand addresses, the destination operand address and the vector length of the computation. The IMC-CP passes the operand addresses, opcode and related information to the IMC-SRAM and controls the IMC-SRAM to carry out the in-memory operation; when the whole operation is complete, it sends a Finish signal to the CPU to indicate that the in-memory computation has finished, and the CPU may continue to run. At the same time, the IMC-CP switches the system back to normal mode, and the IMC-SRAM is switched to this mode accordingly.
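A hedged sketch of the fields a store-compute instruction carries, as listed above (opcode, source operand addresses, destination address, vector length); the field widths and names are illustrative assumptions, since the text does not fix an encoding.

    #include <stdint.h>

    struct imc_instr {
        uint8_t  opcode;  /* which in-memory operation (e.g. ADD, AND, XOR) */
        uint16_t src1;    /* address of the first source operand            */
        uint16_t src2;    /* address of the second source operand           */
        uint16_t dst;     /* destination address for the write-back         */
        uint8_t  vlen;    /* vector length of the computation               */
    };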
FIG. 4 is a schematic diagram of the CPU pipeline employed by the architecture of the present invention. IF, PID, ID, EX, MEM and WB respectively denote the pipeline stages of instruction fetch, predecode, decode, execute, memory access and write-back. The PID stage inserted between IF and ID is the predecode stage proposed by the present invention; it decodes the mixed instruction stream from the instruction memory to determine whether an instruction is a conventional instruction or a store-compute instruction.
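The PID stage's decision can be pictured as a single field test on the fetched word, as in this hypothetical C sketch (the opcode field and marker value are invented for illustration; the patent does not specify the encoding):

    #include <stdbool.h>
    #include <stdint.h>

    #define OPCODE_FIELD(insn)  ((insn) & 0x7Fu)  /* assumed low 7 bits */
    #define IMC_OPCODE          0x0Bu             /* assumed IMC marker */

    /* Conventional instructions continue down the pipeline (ID stage);
     * store-compute instructions are handed to the IMC-CP instead.     */
    static bool is_imc_instruction(uint32_t insn) {
        return OPCODE_FIELD(insn) == IMC_OPCODE;
    }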
Fig. 5 shows an example of adaptive in-memory vector computation according to the present invention. The left diagram compares a program written in the conventional way with one using in-memory computation: A and B are vectors of 20 components each, the corresponding components of A and B are added, and the results are stored in the vector C. The figure gives the program written with the conventional paradigm and with store-compute instructions, respectively. The right diagram shows how the store-compute instruction maps onto the IMC-SRAM. One IMC-SRAM bank holds 8 words per row, so with 2 banks of IMC-SRAM the first 16 component additions of A and B can be computed in one cycle and the remaining four in a second cycle. This design allows the IMC-SRAM to perform a large amount of parallel vector computation within one cycle. If 3 banks of IMC-SRAM are used, up to 24 vector components can be computed per cycle, and the vector addition of A and B completes in a single cycle. For an in-memory vector operation of a given length, the total number of cycles required is calculated automatically by the IMC-CP according to the number of IMC-SRAM banks involved; this is the adaptive in-memory vector computation of the architecture of the present invention.
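The cycle count the IMC-CP derives follows directly from the numbers above: with 8 words per row in each bank, a vector operation of length len spread over a given number of banks needs ceil(len / (8 * banks)) compute cycles. A minimal sketch, assuming this organization:

    /* Adaptive in-memory vector scheduling: components processed per
     * cycle = 8 words per row times the number of banks involved.    */
    static unsigned imc_cycles(unsigned len, unsigned banks) {
        unsigned per_cycle = 8u * banks;
        return (len + per_cycle - 1u) / per_cycle;  /* ceiling division */
    }

    /* Example from the text (len = 20):
     *   imc_cycles(20, 2) == 2   (16 components, then the remaining 4)
     *   imc_cycles(20, 3) == 1   (24 >= 20, finished in a single cycle) */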
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (2)

1. A dual-mode computer architecture supporting in-memory computation, characterized by comprising a processor core, an instruction memory, an in-memory computing coprocessor and several computational SRAM macro cells, wherein: the processor core is a reduced-instruction-set core with a six-stage pipeline, a predecode pipeline stage being inserted between the instruction-fetch and decode stages to decode the dedicated in-memory computing instructions; the instruction memory is generated by a memory compiler; the in-memory computing coprocessor decodes the dedicated in-memory computing instructions forwarded by the processor core, switches the working mode of the system, and controls the reading, writing and in-memory computation of the SRAM macro cells; each SRAM macro cell has three sets of row decoders and one set of column decoders, and further comprises an IM-ALU having a small number of logic gates and several selectors, the logic gates completing the in-memory computing operations beyond what the SA provides and the selectors selecting the computed results to be written back into the memory array; the in-memory computing coprocessor controls the selection of the architecture mode of the computer, which is governed by the type of instruction fetched by the CPU; in normal mode, data are processed following the paradigm of reading data from memory, processing the data and writing the data back to memory; and in IMC mode, the architecture is equivalent to a non-von Neumann architecture, in which the CPU sends the store-compute instruction to an SRAM macro cell of the memory, the data operation is performed directly in the memory, and the result is written directly back to the Bitcells.
2. The dual-mode computer architecture supporting in-memory computation of claim 1, wherein the column decoders are configured by the in-memory computing coprocessor.
CN201911258025.0A 2019-12-10 2019-12-10 Dual-mode computer framework supporting in-memory computation Active CN111124999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911258025.0A CN111124999B (en) 2019-12-10 2019-12-10 Dual-mode computer framework supporting in-memory computation


Publications (2)

Publication Number Publication Date
CN111124999A 2020-05-08
CN111124999B CN111124999B (en) 2023-03-03

Family

ID=70498034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911258025.0A Active CN111124999B (en) 2019-12-10 2019-12-10 Dual-mode computer framework supporting in-memory computation

Country Status (1)

Country Link
CN (1) CN111124999B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6226738B1 (en) * 1997-08-01 2001-05-01 Micron Technology, Inc. Split embedded DRAM processor
US20160224098A1 (en) * 2015-01-30 2016-08-04 Alexander Gendler Communicating via a mailbox interface of a processor
CN108369513A (en) * 2015-12-22 2018-08-03 英特尔公司 For loading-indexing-and-collect instruction and the logic of operation
CN110414677A (en) * 2019-07-11 2019-11-05 东南大学 It is a kind of to deposit interior counting circuit suitable for connect binaryzation neural network entirely

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU, Shikai et al., "Reconfigurable acceleration architecture design based on in-memory computing", Computer Engineering and Design *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11947967B2 (en) 2020-07-17 2024-04-02 Lodestar Licensing Group Llc Reconfigurable processing-in-memory logic using look-up tables
CN114639407A (en) * 2020-12-16 2022-06-17 美光科技公司 Reconfigurable in-memory processing logic
US11887693B2 (en) 2020-12-16 2024-01-30 Lodestar Licensing Group Llc Reconfigurable processing-in-memory logic
CN113590195A (en) * 2021-07-22 2021-11-02 中国人民解放军国防科技大学 Storage-computation integrated DRAM (dynamic random Access memory) computation unit design supporting floating-point format multiply-add
CN113590195B (en) * 2021-07-22 2023-11-07 中国人民解放军国防科技大学 Memory calculation integrated DRAM computing unit supporting floating point format multiply-add

Also Published As

Publication number Publication date
CN111124999B (en) 2023-03-03

Similar Documents

Publication Publication Date Title
US20210278988A1 (en) Apparatuses and methods for data movement
US11106389B2 (en) Apparatuses and methods for data transfer from sensing circuitry to a controller
CN111124999B (en) Dual-mode computer framework supporting in-memory computation
US10725680B2 (en) Apparatuses and methods to change data category values
Akyel et al. DRC 2: Dynamically Reconfigurable Computing Circuit based on memory architecture
US20180122479A1 (en) Associative row decoder
CN102541774B (en) Multi-grain parallel storage system and storage
US20180357007A1 (en) Data transfer between subarrays in memory
US11126690B2 (en) Machine learning architecture support for block sparsity
CN102541749A (en) Multi-granularity parallel storage system
Zeng et al. DM-IMCA: A dual-mode in-memory computing architecture for general purpose processing
CN111045727B (en) Processing unit array based on nonvolatile memory calculation and calculation method thereof
CN116126779A (en) 9T memory operation circuit, multiply-accumulate operation circuit, memory operation circuit and chip
CN113378115B (en) Near-memory sparse vector multiplier based on magnetic random access memory
Ungethüm et al. Overview on hardware optimizations for database engines
CN112463717B (en) Conditional branch implementation method under coarse-grained reconfigurable architecture
CN111709872B (en) Spin memory computing architecture of graph triangle counting algorithm
CN117608519B (en) Signed multiplication and multiply-accumulate operation circuit based on 10T-SRAM
KR20190029270A (en) Processing in memory device with multiple cache and memory accessing method thereof
US20230028952A1 (en) Memory device performing in-memory operation and method thereof
US20210200538A1 (en) Dual write micro-op queue
Wu et al. Research on Array Circuit Design Based on In-Memory Computing
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
Saghir et al. Reducing Power of Memory Hierarchy in General Purpose Graphics Processing Units
WO2017010524A1 (en) Simd parallel computing device, simd parallel computing semiconductor chip, simd parallel computing method, apparatus including simd parallel computing device or semiconductor chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant