CN111124999A - Dual-mode computer framework supporting in-memory computation - Google Patents

Dual-mode computer framework supporting in-memory computation

Info

Publication number
CN111124999A
CN111124999A (application CN201911258025.0A)
Authority
CN
China
Prior art keywords
memory
architecture
computation
instruction
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911258025.0A
Other languages
Chinese (zh)
Other versions
CN111124999B (en)
Inventor
Zhang Zhang
Jianmin Zeng
Yadong Wei
Guangjun Xie
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN201911258025.0A priority Critical patent/CN111124999B/en
Publication of CN111124999A publication Critical patent/CN111124999A/en
Application granted granted Critical
Publication of CN111124999B publication Critical patent/CN111124999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of computer architectures, and in particular to a dual-mode computer architecture supporting in-memory computation. The beneficial effect of the invention is that the proposed architecture can work as a non-von Neumann architecture while also preserving the traditional von Neumann architecture; the goal of this is to make maximum use of existing compilation toolchains and programming models. Simulations show that, compared with a baseline, the architecture of the present invention can accelerate specific applications by tens to hundreds of times, depending on the number of macro cells used in the IMC-SRAM. With one bank of macro cells, a binary neural network of size 32 x 32 and a hash algorithm over a 512-bit character input are accelerated by 6.7 and 12.2 times, respectively. Compared with the baseline, the architecture saves energy by a factor of 3 on average.

Description

Dual-mode computer framework supporting in-memory computation
Technical Field
The invention relates to the technical field of computer architectures, in particular to a dual-mode computer architecture supporting in-memory computing.
Background
Most computers today employ the traditional von Neumann architecture, in which the computing unit and the memory unit are separate. Such computers follow the paradigm of load (from memory), compute (in the CPU), write back (to memory). On the one hand, because chip or IP pins are limited, this separation of computing and storage restricts the bandwidth between the computing unit and the memory, so the bandwidth resources of the memory module cannot be fully utilized. On the other hand, the separation forces the CPU to continuously read and write a distant memory while processing data, which inevitably wastes a great deal of power. For example, a 64-bit double-precision floating-point operation consumes roughly 50 pJ, a CPU core consumes roughly 500 pJ to read data from a cache 20 mm away, and reading the corresponding data from off-chip DRAM requires roughly 6000 pJ, more than a hundred times the energy of the operation itself. With the development of big data, artificial intelligence and the Internet of Things, solving the bandwidth bottleneck between the CPU and the memory and improving the energy efficiency of the system are increasingly urgent, and this places greater demands on processor architecture design: for example, how to satisfy the bandwidth demands that the computing units in an artificial-intelligence chip place on the storage modules, and how to let Internet-of-Things terminals (mostly MCUs, i.e., microcontroller units) improve processing performance while maintaining or even reducing power consumption.
One solution to these problems is processing-in-memory (PIM), whose design principle is to place the computing unit and the memory module as close together as possible, or even to merge them, i.e., computing in memory.
In fact, the concept of integrating storage and computation was proposed as early as the 1970s; because the CPU's bandwidth demands on memory and its power consumption were not yet the main contradiction in the development of computer technology, the concept received little attention. It was not until the 1990s that a large number of such machines appeared, whose design concept can be summarized as placing the computing units and the memory units on a single chip fabricated in a logic or DRAM process. However, such designs were extremely expensive to produce, and the resulting processors had poor programmability and upgradability, so they were never popularized.
In recent years, with the development of artificial intelligence, in-memory computing based on static random access memory (SRAM) has attracted wide attention. The basic principle of the technique is to read several SRAM memory cells (Bitcells) on a pair of bit lines simultaneously: the discharge behavior on a bit line then expresses a logical AND over the stored values, and if a dual-input dual-output sense amplifier (SA) is used at the end of the bit lines, several logic results can be obtained at the SA outputs. Fig. 1 depicts the principle of SRAM-based in-memory computing.
However, most existing SRAM-based in-memory computing designs target artificial-intelligence acceleration; none has been designed for general-purpose computing.
Disclosure of Invention
It is an object of the present invention to overcome the deficiencies of the prior art by providing a dual-mode computer architecture supporting in-memory computing that addresses at least some of those deficiencies.
To achieve this technical purpose and effect, the invention is realized by the following technical scheme:
a dual-mode computer architecture supporting internal memory computation comprises a processor core, an instruction memory, a memory computation coprocessor and a plurality of computation type SRAM macro units, wherein the processor core adopts a reduced instruction set core and is provided with a six-stage pipeline, a pre-decoding stage pipeline is inserted in an instruction fetching stage and a decoding stage and is used for decoding memory computation special instructions, the instruction memory is generated before memory compilation, the memory computation coprocessor is used for decoding the memory computation special instructions transmitted by the processor core, switching the working mode of a system and controlling the read-write and memory computation of the SRAM macro units, the SRAM macro units are provided with three groups of row decoders and one group of column decoders, the SRAM macro units also comprise an IM-ALU unit, the IM-ALU unit contains a small number of logic gates and a plurality of selectors, and the logic gates are used for completing the memory computation operation outside SA, the selector is used for selecting the calculated result to write back to the memory array, the storage calculation coprocessor controls the selection of the architecture mode of the computer and is controlled by the type of the instruction taken by the CPU, in a common mode, the storage calculation coprocessor processes data according to the paradigm of reading data from the memory, processing the data and writing the data back to the memory, in an IMC mode, the architecture is equivalent to a non-Von Neumann architecture, and the CPU sends the storage calculation instruction to an SRAM macro cell of the memory, directly performs the data operation in the memory and directly writes back to Bitcell.
Further, the column decoders are configured by the in-memory computing coprocessor.
The invention has the beneficial effects that: compared with the traditional von Neumann architecture, the invention is dual-mode, i.e., the proposed architecture can work as a non-von Neumann architecture while the traditional von Neumann architecture is retained. The goal of this is to make maximum use of existing compilation toolchains and programming models. Simulations show that, compared with a baseline (identical to the present architecture but without the in-memory computation function), the architecture of the present invention can accelerate specific applications by tens to hundreds of times, depending on the number of macro cells used in the IMC-SRAM. With one bank of macro cells, a 32 x 32 binary neural network (BNN) and a Hash algorithm over a 512-bit character input can be accelerated by 6.7 and 12.2 times, respectively. Compared with the baseline, the architecture saves energy by a factor of 3 on average.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a prior art SRAM-based in-memory computation;
FIG. 2 is a diagram of an SRAM architecture supporting in-memory computation in accordance with the present invention;
FIG. 3 is a diagram of the dual-mode computer architecture according to the present invention;
FIG. 4 is a comparison of a conventional instruction and an IMC instruction;
FIG. 5 is a schematic diagram of adaptive parallel vector operation in SRAM according to the present invention.
Detailed Description
To make the technical means, creative features, objectives and effects of the invention easy to understand, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 shows the principle of computation based on SRAM bit-line discharge. The left diagram illustrates a logic operation on a column of Bitcells; the right diagram shows a conventional six-transistor (6T) cell (Bitcell). The data to be operated on are first written into the Bitcells (Bitcell j and Bitcell k in the figure serve as examples), and the bit lines BL and BLB are precharged to a high level. The word lines WLj and WLk are then opened simultaneously, and the bit lines on the two sides discharge under different conditions according to the data stored in Bitcell j and Bitcell k. For a given Bitcell, if it stores 0 (i.e., Q is 0 in the right diagram of Fig. 1), BL is discharged to ground through M5 and M3; conversely, BLB is discharged through M6 and M4. Thus, for a given bit line, if any Bitcell on its side stores 0, the bit line discharges, and after comparison and amplification the SA outputs logic 0, which naturally forms a logical AND relation; similarly, logic such as NAND, OR and NOR can be obtained. Adding a NAND gate after the SA yields exclusive-OR logic, and with a few more logic gates, results such as addition can be obtained. This is the basic principle of SRAM-based in-memory computing.
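To make the bit-line principle concrete, the following minimal C sketch (not part of the patent; all names are illustrative) tabulates what a dual-output SA would read from a column holding two operand bits, together with the NAND, OR and XOR values derived by a few extra gates; here XOR is formed as OR AND NAND, one of several equivalent gate choices.

    /* Model of Fig. 1: cells j and k on one column are activated together
     * and the dual-output sense amplifier reads both bit lines.
     * BL stays high only if both cells store 1  -> SA(BL)  = AND
     * BLB stays high only if both cells store 0 -> SA(BLB) = NOR */
    #include <stdio.h>

    static void bitline_read(int qj, int qk, int *bl, int *blb) {
        *bl  = qj & qk;      /* AND appears on the BL side  */
        *blb = !(qj | qk);   /* NOR appears on the BLB side */
    }

    int main(void) {
        for (int qj = 0; qj <= 1; qj++)
            for (int qk = 0; qk <= 1; qk++) {
                int bl, blb;
                bitline_read(qj, qk, &bl, &blb);
                int nand = !bl;          /* one inverter after SA     */
                int or_  = !blb;         /* one inverter after SA     */
                int xor_ = or_ & nand;   /* extra gate yielding XOR   */
                printf("qj=%d qk=%d AND=%d NOR=%d NAND=%d OR=%d XOR=%d\n",
                       qj, qk, bl, blb, nand, or_, xor_);
            }
        return 0;
    }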
FIG. 2 is a schematic diagram of a computational SRAM macro cell, called an IMC-SRAM, built according to the principle described for Fig. 1. Unlike a conventional SRAM, the IMC-SRAM uses three sets of row decoders and one configurable set of column decoders in order to enable in-memory computation. In addition, the IMC-SRAM contains a lightweight arithmetic logic unit (IM-ALU), consisting mainly of a few logic gates, which completes operations after the SA.
The working mode of the IMC-SRAM is configurable. When the system works in normal mode, the IMC-SRAM behaves like a conventional SRAM: only one set of row decoders and the column decoders are active, and the read/write behavior is the same as that of an ordinary SRAM. When the system works in IMC mode, the IMC-SRAM also switches to this mode; it then reads the data directly for computation and writes the result back into the memory array. In this mode, two row decoders, Ar1 and Ar2, are used to read the two source operands, and Ar3 is used to write the computed result back into the array. Since the two operands of a computation must be stored row-aligned, the operands can share one set of column decoders. When a two-operand read is performed in IMC mode, the columns to be read are configured through VL_cfg. The mode, VL_cfg, and the row and column addresses all come from the IMC-CP.
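As a reading aid, the per-macro-cell state driven by the IMC-CP, as just described, can be pictured with the following C struct; the type and field names are assumptions for exposition, not the patent's signal names.

    #include <stdint.h>

    enum imc_mode { MODE_NORMAL, MODE_IMC };

    struct imc_sram_cfg {
        enum imc_mode mode;  /* working mode, driven by the IMC-CP        */
        uint16_t ar1, ar2;   /* row addresses of the two source operands  */
        uint16_t ar3;        /* row address the result is written back to */
        uint32_t vl_cfg;     /* column-select configuration for the read  */
    };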
Fig. 3 gives an overview of the dual-mode computer architecture proposed by the present invention. The architecture comprises a reduced-instruction-set CPU core, an instruction memory, an in-memory computing coprocessor IMC-CP and several computational memory macro cells. The main difference from the traditional von Neumann architecture is the computational memory IMC-SRAM: the CPU can access the memory by address, but it can also send an instruction to the memory so that data are operated on inside the memory and the result is written directly back to the memory. Conventional instructions and store-compute instructions have equal bit widths and can be stored mixed in the same instruction memory. The CPU decodes each fetched instruction: if it is a conventional instruction, pipeline execution continues; if it is a store-compute instruction, it is handed to the IMC-CP for processing. At that point, if the next instruction in the CPU does not depend on the memory access, execution continues; otherwise the pipeline stalls until the in-memory computation completes. On receiving a store-compute instruction, the IMC-CP switches the mode to IMC mode. A store-compute instruction contains the opcode of the in-memory operation, the source operand addresses, the destination operand address and the vector length of the computation. The IMC-CP passes the operand addresses, opcode and related information to the IMC-SRAM and controls the IMC-SRAM to carry out the in-memory operation; when the whole operation is complete, it sends a Finish signal to the CPU to indicate that the in-memory computation has finished, and the CPU may continue to run. At the same time, the IMC-CP switches the system back to normal mode, and the IMC-SRAM is switched to this mode accordingly.
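A hedged sketch of the fields a store-compute instruction carries, as listed above (opcode, source operand addresses, destination address, vector length); the field widths and names are illustrative assumptions, since the text does not fix an encoding.

    #include <stdint.h>

    struct imc_instr {
        uint8_t  opcode;  /* which in-memory operation (e.g. ADD, AND, XOR) */
        uint16_t src1;    /* address of the first source operand            */
        uint16_t src2;    /* address of the second source operand           */
        uint16_t dst;     /* destination address for the write-back         */
        uint8_t  vlen;    /* vector length of the computation               */
    };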
FIG. 4 is a schematic diagram of the CPU pipeline employed by the architecture of the present invention. IF, PID, ID, EX, MEM and WB respectively denote the pipeline stages of instruction fetch, predecode, decode, execute, memory access and write-back. The PID stage inserted between IF and ID is the predecode stage proposed by the present invention; it decodes the mixed instruction stream from the instruction memory to determine whether an instruction is a conventional instruction or a store-compute instruction.
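The PID stage's decision can be pictured as a single field test on the fetched word, as in this hypothetical C sketch (the opcode field and marker value are invented for illustration; the patent does not specify the encoding):

    #include <stdbool.h>
    #include <stdint.h>

    #define OPCODE_FIELD(insn)  ((insn) & 0x7Fu)  /* assumed low 7 bits */
    #define IMC_OPCODE          0x0Bu             /* assumed IMC marker */

    /* Conventional instructions continue down the pipeline (ID stage);
     * store-compute instructions are handed to the IMC-CP instead.     */
    static bool is_imc_instruction(uint32_t insn) {
        return OPCODE_FIELD(insn) == IMC_OPCODE;
    }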
Fig. 5 shows an example of adaptive in-memory vector computation according to the present invention. The left diagram compares a program written in the conventional way with one using in-memory computation: A and B are vectors of 20 components each, the corresponding components of A and B are added, and the results are stored in the vector C. The figure gives the program written with the conventional paradigm and with store-compute instructions, respectively. The right diagram shows how the store-compute instruction maps onto the IMC-SRAM. One IMC-SRAM bank holds 8 words per row, so with 2 banks of IMC-SRAM the first 16 component additions of A and B can be computed in one cycle and the remaining four in a second cycle. This design allows the IMC-SRAM to perform a large amount of parallel vector computation within one cycle. If 3 banks of IMC-SRAM are used, up to 24 vector components can be computed per cycle, and the vector addition of A and B completes in a single cycle. For an in-memory vector operation of a given length, the total number of cycles required is calculated automatically by the IMC-CP according to the number of IMC-SRAM banks involved; this is the adaptive in-memory vector computation of the architecture of the present invention.
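The cycle count the IMC-CP derives follows directly from the numbers above: with 8 words per row in each bank, a vector operation of length len spread over a given number of banks needs ceil(len / (8 * banks)) compute cycles. A minimal sketch, assuming this organization:

    /* Adaptive in-memory vector scheduling: components processed per
     * cycle = 8 words per row times the number of banks involved.    */
    static unsigned imc_cycles(unsigned len, unsigned banks) {
        unsigned per_cycle = 8u * banks;
        return (len + per_cycle - 1u) / per_cycle;  /* ceiling division */
    }

    /* Example from the text (len = 20):
     *   imc_cycles(20, 2) == 2   (16 components, then the remaining 4)
     *   imc_cycles(20, 3) == 1   (24 >= 20, finished in a single cycle) */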
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (2)

1. A dual-mode computer architecture supporting in-memory computation, characterized by comprising a processor core, an instruction memory, an in-memory computing coprocessor and several computational SRAM macro cells, wherein: the processor core is a reduced-instruction-set core with a six-stage pipeline, a predecode pipeline stage being inserted between the instruction-fetch and decode stages to decode the dedicated in-memory computing instructions; the instruction memory is generated by a memory compiler; the in-memory computing coprocessor decodes the dedicated in-memory computing instructions forwarded by the processor core, switches the working mode of the system, and controls the reading, writing and in-memory computation of the SRAM macro cells; each SRAM macro cell has three sets of row decoders and one set of column decoders, and further comprises an IM-ALU having a small number of logic gates and several selectors, the logic gates completing the in-memory computing operations beyond what the SA provides and the selectors selecting the computed results to be written back into the memory array; the in-memory computing coprocessor controls the selection of the architecture mode of the computer, which is governed by the type of instruction fetched by the CPU; in normal mode, data are processed following the paradigm of reading data from memory, processing the data and writing the data back to memory; and in IMC mode, the architecture is equivalent to a non-von Neumann architecture, in which the CPU sends the store-compute instruction to an SRAM macro cell of the memory, the data operation is performed directly in the memory, and the result is written directly back to the Bitcells.
2. The dual-mode computer architecture supporting in-memory computation of claim 1, wherein the column decoders are configured by the in-memory computing coprocessor.
CN201911258025.0A 2019-12-10 2019-12-10 Dual-mode computer framework supporting in-memory computation Active CN111124999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911258025.0A CN111124999B (en) 2019-12-10 2019-12-10 Dual-mode computer framework supporting in-memory computation


Publications (2)

Publication Number Publication Date
CN111124999A 2020-05-08
CN111124999B CN111124999B (en) 2023-03-03

Family

ID=70498034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911258025.0A Active CN111124999B (en) 2019-12-10 2019-12-10 Dual-mode computer framework supporting in-memory computation

Country Status (1)

Country Link
CN (1) CN111124999B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6226738B1 (en) * 1997-08-01 2001-05-01 Micron Technology, Inc. Split embedded DRAM processor
US20160224098A1 (en) * 2015-01-30 2016-08-04 Alexander Gendler Communicating via a mailbox interface of a processor
CN108369513A (en) * 2015-12-22 2018-08-03 英特尔公司 For loading-indexing-and-collect instruction and the logic of operation
CN110414677A (en) * 2019-07-11 2019-11-05 东南大学 It is a kind of to deposit interior counting circuit suitable for connect binaryzation neural network entirely

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU, Shikai et al., "Reconfigurable acceleration architecture design based on in-memory computing", Computer Engineering and Design *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11947967B2 (en) 2020-07-17 2024-04-02 Lodestar Licensing Group Llc Reconfigurable processing-in-memory logic using look-up tables
CN114639407A (en) * 2020-12-16 2022-06-17 美光科技公司 Reconfigurable in-memory processing logic
US11887693B2 (en) 2020-12-16 2024-01-30 Lodestar Licensing Group Llc Reconfigurable processing-in-memory logic
CN113590195A (en) * 2021-07-22 2021-11-02 中国人民解放军国防科技大学 Storage-computation integrated DRAM (dynamic random Access memory) computation unit design supporting floating-point format multiply-add
CN113590195B (en) * 2021-07-22 2023-11-07 中国人民解放军国防科技大学 Memory calculation integrated DRAM computing unit supporting floating point format multiply-add

Also Published As

Publication number Publication date
CN111124999B (en) 2023-03-03

Similar Documents

Publication Publication Date Title
US20210278988A1 (en) Apparatuses and methods for data movement
US11106389B2 (en) Apparatuses and methods for data transfer from sensing circuitry to a controller
CN111124999B (en) Dual-mode computer framework supporting in-memory computation
US10725680B2 (en) Apparatuses and methods to change data category values
Akyel et al. DRC 2: Dynamically Reconfigurable Computing Circuit based on memory architecture
US20180122479A1 (en) Associative row decoder
CN102541774B (en) Multi-grain parallel storage system and storage
US20180357007A1 (en) Data transfer between subarrays in memory
US11126690B2 (en) Machine learning architecture support for block sparsity
CN102541749A (en) Multi-granularity parallel storage system
Zeng et al. DM-IMCA: A dual-mode in-memory computing architecture for general purpose processing
CN111045727B (en) Processing unit array based on nonvolatile memory calculation and calculation method thereof
CN116126779A (en) 9T memory operation circuit, multiply-accumulate operation circuit, memory operation circuit and chip
CN113378115B (en) Near-memory sparse vector multiplier based on magnetic random access memory
Ungethüm et al. Overview on hardware optimizations for database engines
CN112463717B (en) Conditional branch implementation method under coarse-grained reconfigurable architecture
CN111709872B (en) Spin memory computing architecture of graph triangle counting algorithm
CN117608519B (en) Signed multiplication and multiply-accumulate operation circuit based on 10T-SRAM
KR20190029270A (en) Processing in memory device with multiple cache and memory accessing method thereof
US20230028952A1 (en) Memory device performing in-memory operation and method thereof
US20210200538A1 (en) Dual write micro-op queue
Wu et al. Research on Array Circuit Design Based on In-Memory Computing
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
Saghir et al. Reducing Power of Memory Hierarchy in General Purpose Graphics Processing Units
WO2017010524A1 (en) Simd parallel computing device, simd parallel computing semiconductor chip, simd parallel computing method, apparatus including simd parallel computing device or semiconductor chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant