CN111124999B - Dual-mode computer architecture supporting in-memory computation

Dual-mode computer architecture supporting in-memory computation

Info

Publication number
CN111124999B
CN111124999B CN201911258025.0A
Authority
CN
China
Prior art keywords
computation
memory
storage
architecture
sram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911258025.0A
Other languages
Chinese (zh)
Other versions
CN111124999A (en)
Inventor
张章
曾剑敏
魏亚东
解光军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN201911258025.0A
Publication of CN111124999A
Application granted
Publication of CN111124999B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System (AREA)
  • Static Random-Access Memory (AREA)

Abstract

The invention relates to the technical field of computer architectures, in particular to a dual-mode computer architecture supporting in-memory computing. The invention has the beneficial effects that: the proposed architecture can work as a non-von Neumann architecture while also preserving the traditional von Neumann architecture, with the goal of making maximum use of existing compilation toolchains and programming models. Simulations show that, compared with a baseline, the architecture of the present invention can accelerate particular applications by tens to hundreds of times, depending on the number of macro units used in the IMC-SRAM. With a single macro unit, a 32 × 32 scale binary neural network and a hash algorithm with 512-bit character input are accelerated by 6.7 and 12.2 times, respectively. Compared with the baseline, the architecture saves energy by 3 times on average.

Description

Dual-mode computer architecture supporting in-memory computation
Technical Field
The invention relates to the technical field of computer architectures, in particular to a dual-mode computer architecture supporting in-memory computing.
Background
Most computers today adopt the traditional von Neumann architecture, in which the computing unit and the memory unit are separate. Such computers follow the paradigm of load (from memory), compute (in the CPU), write back (to memory). On one hand, because computing and storage are separated and chip or IP pins are limited, the bandwidth between the computing unit and the memory is constrained, and the bandwidth resources of the memory module cannot be fully utilized. On the other hand, the separation of computing and storage forces the CPU to continuously read and write a distant memory while processing data, which inevitably wastes a great deal of power. For example, a 64-bit double-precision floating-point operation consumes approximately 50 pJ, a CPU core consumes approximately 500 pJ to read data from a cache 20 mm away, and reading the same data from off-chip DRAM requires approximately 6000 pJ. With the development of big data, artificial intelligence and the Internet of Things, it is increasingly urgent to resolve the bandwidth bottleneck between the CPU and the memory and to improve the energy efficiency of the system, and processor architecture design faces greater challenges: for example, how to meet the bandwidth demands that the computing units of an artificial intelligence chip place on the storage module, and how to let Internet-of-Things terminals (mostly MCUs, i.e., microcontroller units) improve processing performance while keeping or even reducing power consumption.
One solution to these problems is processing-in-memory (PIM, also called computing-in-memory) design, which brings the computing unit and the memory module as close together as possible, ideally performing the computation inside the memory itself.
In fact, the concept of integrating storage and computation was proposed as early as the 1970s, but because the bandwidth demand and power consumption of the CPU with respect to memory were not yet the main contradiction in the development of computer technology, the idea received little attention. In the 1990s a large number of such designs emerged; their common concept can be summarized as placing the computing units and the memory units in one chip, fabricated in logic or DRAM technology. However, such designs were extremely expensive to produce, and the resulting processors were poor in programmability and upgradability, so they were never popularized.
In recent years, with the development of artificial intelligence, in-memory computing technology based on static random-access memory (SRAM) has attracted much attention. Its basic principle is to read several SRAM memory cells (bitcells) on one pair of bit lines simultaneously; the discharge behavior of the bit lines then expresses a logical AND relation, and if a dual-input dual-output sense amplifier (SA) is placed at the end of the bit lines, several logic results can be obtained at the SA outputs. Fig. 1 depicts the principle of SRAM-based in-memory computing.
However, most existing SRAM-based in-memory computing designs target artificial intelligence acceleration; few are designed for general-purpose computing.
Disclosure of Invention
It is an object of the present invention to overcome the problems of the prior art and to provide a dual-mode computer architecture supporting in-memory computing that addresses at least some of those problems.
In order to achieve the technical purpose and achieve the technical effect, the invention is realized by the following technical scheme:
a dual-mode computer architecture supporting in-memory computation comprises a processor core, an instruction memory, a storage computation coprocessor and a plurality of computation type SRAM macro units, wherein the processor core adopts a reduced instruction set core and is provided with a six-stage pipeline, a pre-decoding stage pipeline is inserted in an instruction fetching stage and a decoding stage and is used for decoding storage computation special instructions, the instruction memory is generated before storage compilation, the storage computation coprocessor is used for decoding the storage computation special instructions transmitted by the processor core, switching the working mode of a system and controlling the reading and writing and storage computation of the SRAM macro units, the SRAM macro units are provided with three groups of row decoders and one group of column decoders, the SRAM macro units further comprise an IM-ALU unit, the IM-ALU unit comprises a small number of logic gates and a plurality of selectors, the logic gates are used for completing storage computation operation outside SA, the selectors are used for selecting the result after computation and writing back to a memory array, the storage computation coprocessor controls the mode selection of the computer architecture and is controlled by the type of the instruction fetched by the CPU, and in a normal mode, the storage computation coprocessor reads processing data from the memory, reads the storage computation data from the memory, directly writes back to the memory array, writes back to the IMtc-based on the equivalent Von-processor, and directly writes back to the IMtc architecture.
Further, the column decoders are configured by the storage-computation coprocessor.
The invention has the beneficial effects that: compared with the traditional von Neumann architecture, the invention has a dual-mode character, that is, the proposed architecture can work as a non-von Neumann architecture while the traditional von Neumann architecture is retained. The goal is to make maximum use of existing compilation toolchains and programming models. Simulations show that, compared with a baseline (identical to the present architecture but without the in-memory computation function), the architecture of the present invention can accelerate particular applications by tens to hundreds of times, depending on the number of macro units used in the IMC-SRAM. With a single macro unit, a 32 × 32 scale Binary Neural Network (BNN) and a Hash algorithm with 512-bit character input are accelerated by 6.7 and 12.2 times, respectively. Compared with the baseline, the architecture saves energy by 3 times on average.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a prior art SRAM-based in-memory computation;
FIG. 2 is a diagram of an SRAM architecture for supporting in-memory computation in accordance with the present invention;
FIG. 3 is a dual-mode computer architecture according to the present invention;
FIG. 4 is a comparative example of a conventional instruction and an IMC instruction;
FIG. 5 is a schematic diagram of adaptive parallel vector operation in SRAM according to the present invention.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
Fig. 1 shows the principle of computation based on SRAM bit-line discharge. The left diagram shows the logic operation on a column of bitcells; the right diagram shows a conventional six-transistor (6T) cell (bitcell). The data to be operated on logically is first written into the bitcells (Bitcell j and Bitcell k in the figure serve as examples), and the bit lines BL and BLB are precharged to a high level. The word lines WLj and WLk are then opened simultaneously, and the bit lines on the two sides discharge under different conditions according to the data stored in Bitcell j and Bitcell k. For a bitcell storing 0 (i.e., Q is 0 in the right diagram of Fig. 1), BL is discharged to ground through M5 and M3; conversely, BLB is discharged through M6 and M4. Thus, for a given bit line, if any bitcell on its side stores 0, the bit line discharges, and after the SA compares and amplifies, the SA outputs logic 0, so a logical AND relation is formed naturally; NAND, OR and NOR logic can be obtained in the same way. Adding a NAND gate after the SA yields exclusive-OR logic, and adding a few more logic gates yields addition and other operation results. This is the basic principle of SRAM-based in-memory computing.
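To make this bit-line behavior concrete, the following minimal C sketch (our illustration, not part of the patent) enumerates the truth table: the BL side yields the AND of the stored bits, the BLB side yields their NOR, and one extra NAND gate after the SA recovers XOR, exactly as described above.
```c
#include <stdio.h>

int main(void) {
    for (int qj = 0; qj <= 1; qj++) {
        for (int qk = 0; qk <= 1; qk++) {
            int bl  = qj & qk;      /* BL stays high only if no cell stores 0: AND  */
            int blb = !(qj | qk);   /* BLB stays high only if no cell stores 1: NOR */
            int xor_ = !bl & !blb;  /* NAND(qj,qk) AND OR(qj,qk) = XOR              */
            printf("qj=%d qk=%d -> AND=%d NOR=%d XOR=%d\n", qj, qk, bl, blb, xor_);
        }
    }
    return 0;
}
```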
FIG. 2 is a schematic diagram of a computational SRAM macro unit, called IMC-SRAM, built according to the principle described in FIG. 1. Unlike a conventional SRAM, the IMC-SRAM uses three sets of row decoders and one configurable set of column decoders in order to enable in-memory computation. In addition, the IMC-SRAM contains a lightweight arithmetic logic unit (IM-ALU), consisting mainly of logic gates, which completes part of the operation after the SA.
The working mode of the IMC-SRAM is configurable. When the system works in the normal mode, the IMC-SRAM behaves like a traditional SRAM: only one set of row and column decoders is active, and the read and write behavior is the same as that of an ordinary SRAM. When the system works in the IMC mode, the IMC-SRAM switches to this mode as well. The IMC-SRAM then reads the data directly for computation and writes the result back into the memory array. In this mode, the two row decoders Ar1 and Ar2 are used for reading the two source operands, and Ar3 is used to write the computation result back to the array. Since the two operands of a computation must be stored row-aligned, the operands can share one set of column decoders. When a two-operand read is performed in IMC mode, the columns to be read are configured through VL_cfg. The mode, VL_cfg, the row and column addresses and the other control data all come from the IMC-CP.
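As an illustration of this read-compute-write-back flow, here is a hedged C sketch of one IMC-SRAM bank operating in IMC mode. The array dimensions, the column-mask encoding of VL_cfg and the function name are assumptions made for this example, not definitions taken from the patent.
```c
#include <stdint.h>

#define ROWS          64
#define WORDS_PER_ROW 8

typedef struct {
    uint32_t array[ROWS][WORDS_PER_ROW];   /* the SRAM storage array */
} imc_sram_t;

/* One in-memory vector ADD in IMC mode: row decoders Ar1 and Ar2 select
 * the two (row-aligned) source operands, Ar3 selects the destination row,
 * and col_mask stands in for the columns enabled through VL_cfg.
 * No CPU load/store is involved; results stay in the array. */
void imc_vadd(imc_sram_t *m, int ar1, int ar2, int ar3, uint8_t col_mask) {
    for (int c = 0; c < WORDS_PER_ROW; c++) {
        if (col_mask & (1u << c))
            m->array[ar3][c] = m->array[ar1][c] + m->array[ar2][c];
    }
}
```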
FIG. 3 is an overview of the dual-mode computer architecture proposed by the present invention. The architecture comprises a reduced instruction set CPU core, an instruction memory, a storage-computation coprocessor IMC-CP and several computational memory macro units. The main difference from the traditional von Neumann architecture is the computational memory IMC-SRAM: the CPU can access the memory by address, but it can also send an instruction to the memory, so that data is operated on inside the memory and the result is written directly back to it. Conventional instructions and storage-computation instructions have the same bit width and can be stored mixed in the same instruction memory. The CPU decodes each fetched instruction: if it is a conventional instruction, pipeline execution continues; if it is a storage-computation instruction, it is sent to the IMC-CP for processing. At this point, if the next instruction in the CPU does not involve memory access, execution continues; otherwise, the pipeline stalls until the in-memory computation completes. On receiving a storage-computation instruction, the IMC-CP switches the mode to IMC mode. A storage-computation instruction comprises an opcode, source operand addresses, a destination operand address and the vector length of the computation. The IMC-CP passes the operand addresses, opcode and other information to the IMC-SRAM and controls the IMC-SRAM to carry out the in-memory operation; when the whole operation is complete, it sends a Finish signal to the CPU, which may then continue to run. At the same time, the IMC-CP switches the system back to the normal mode, and the IMC-SRAM follows accordingly.
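The dispatch and stall behavior just described can be sketched in C as follows. This is a simplified software model under assumed names (cpu_execute, imc_cp_issue, the imc_finish flag); it is not the patent's hardware implementation.
```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool is_imc;        /* classified by the pre-decode (PID) stage */
    bool accesses_mem;  /* does the instruction touch data memory?  */
} instr_t;

static bool imc_finish = true;            /* Finish signal from the IMC-CP */

static void cpu_execute(const instr_t *i) { (void)i; puts("CPU pipeline"); }

static void imc_cp_issue(const instr_t *i) {
    (void)i;
    imc_finish = false;                   /* system switches to IMC mode   */
    puts("IMC-CP: in-memory operation started");
    imc_finish = true;                    /* stub: completes immediately   */
}

/* Dispatch the current instruction; stall only when the next instruction
 * needs memory while an in-memory computation is still in flight. */
static void dispatch(const instr_t *cur, const instr_t *next) {
    if (!cur->is_imc) { cpu_execute(cur); return; }
    imc_cp_issue(cur);
    while (next->accesses_mem && !imc_finish)
        ;                                 /* pipeline stalled until Finish */
}

int main(void) {
    instr_t imc_add = { .is_imc = true,  .accesses_mem = false };
    instr_t load    = { .is_imc = false, .accesses_mem = true  };
    dispatch(&imc_add, &load);            /* would stall until Finish      */
    dispatch(&load, &imc_add);            /* runs in the normal pipeline   */
    return 0;
}
```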
FIG. 4 is a schematic diagram of the CPU pipeline employed by the architecture of the present invention. IF, PID, ID, EX, MEM and WB respectively denote the instruction fetch, pre-decode, decode, execute, memory access and write-back pipeline stages. The PID stage, inserted between IF and ID, is the pre-decode stage proposed by the present invention; it decodes the mixed instructions in the instruction memory to determine whether an instruction is a conventional instruction or a storage-computation instruction.
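A minimal sketch of what the PID stage has to do, assuming (purely for illustration) a RISC-V-style 7-bit major opcode field with one custom value reserved for storage-computation instructions; the patent does not specify this encoding.
```c
#include <stdint.h>

#define OPCODE_MASK 0x7Fu   /* assumed low 7-bit major opcode field        */
#define OPCODE_IMC  0x0Bu   /* assumed custom opcode for IMC instructions  */

/* PID stage: classify a fetched instruction word before the full decode. */
static inline int predecode_is_imc(uint32_t raw) {
    return (raw & OPCODE_MASK) == OPCODE_IMC;
}
```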
Fig. 5 shows an example of adaptive in-memory vector computation according to the present invention. The left part compares programs written in the conventional way and in the in-memory computation way: A and B are vectors of 20 components each, the corresponding components of A and B are added, and the result is stored in the vector C; the figure gives the same program written with the conventional paradigm and with storage-computation instructions. The right part shows how the storage-computation instruction maps onto the IMC-SRAM. One IMC-SRAM bank holds 8 words per row, so with 2 banks of IMC-SRAM the first 16 component additions of A and B are computed in one cycle and the remaining four in a second cycle. This design lets the IMC-SRAM perform a large amount of vector computation in parallel in one cycle. If 3 banks of IMC-SRAM are used, up to 24 vector components can be computed per cycle, and the vector addition of A and B completes in one cycle. For an in-memory vector operation of a given length, the number of cycles needed is computed automatically by the IMC-CP from the number of IMC-SRAM banks it can access; this is the adaptive in-memory vector computation of the proposed architecture.
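A minimal sketch of the cycle count the IMC-CP derives, assuming the bank geometry of the example above (8 words per row); the function name is ours, not the patent's.
```c
#include <stdio.h>

/* Components that fit per cycle: banks * words_per_row; the cycles needed
 * for a vlen-component vector is the ceiling of the division. */
static unsigned imc_cycles(unsigned vlen, unsigned banks, unsigned words_per_row) {
    unsigned per_cycle = banks * words_per_row;
    return (vlen + per_cycle - 1) / per_cycle;
}

int main(void) {
    printf("20 components, 2 banks: %u cycles\n", imc_cycles(20, 2, 8)); /* 2 */
    printf("20 components, 3 banks: %u cycles\n", imc_cycles(20, 3, 8)); /* 1 */
    return 0;
}
```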
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (2)

1. A dual-mode computer architecture supporting in-memory computing, comprising a processor core, an instruction memory, a storage-computation coprocessor and a plurality of computation-type SRAM macro units, wherein the processor core adopts a reduced instruction set core and has a six-stage pipeline, a pre-decoding stage being inserted between the instruction fetch stage and the decoding stage for decoding storage-computation-specific instructions; the instruction memory stores the mixed instruction stream generated by compilation; the storage-computation coprocessor is used for decoding the storage-computation-specific instructions transmitted by the processor core, switching the working mode of the system, and controlling the reading, writing and in-memory computation of the SRAM macro units; the SRAM macro units have three groups of row decoders and one group of column decoders and further comprise an IM-ALU unit, the IM-ALU unit containing a small number of logic gates and a plurality of selectors, the logic gates being used for completing storage-computation operations beyond the SA and the selectors being used for selecting the computed results to be written back to the memory array; the storage-computation coprocessor controls the mode selection of the computer architecture, controlled by the type of instruction fetched by the CPU: in the normal mode, the processor reads data from the memory, processes it, and writes the results back, the architecture being equivalent to a von Neumann architecture; in the IMC mode, the SRAM macro units read data directly from the memory array, compute on it, and write the results directly back to the array, the architecture working as a non-von Neumann architecture.
2. A dual-mode computer architecture supporting in-memory computing as claimed in claim 1, wherein said column decoders are configured by the storage-computation coprocessor.
CN201911258025.0A 2019-12-10 2019-12-10 Dual-mode computer architecture supporting in-memory computation Active CN111124999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911258025.0A CN111124999B (en) 2019-12-10 2019-12-10 Dual-mode computer architecture supporting in-memory computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911258025.0A CN111124999B (en) 2019-12-10 2019-12-10 Dual-mode computer architecture supporting in-memory computation

Publications (2)

Publication Number Publication Date
CN111124999A CN111124999A (en) 2020-05-08
CN111124999B (en) 2023-03-03

Family

ID=70498034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911258025.0A Active CN111124999B (en) 2019-12-10 2019-12-10 Dual-mode computer architecture supporting in-memory computation

Country Status (1)

Country Link
CN (1) CN111124999B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11403111B2 (en) 2020-07-17 2022-08-02 Micron Technology, Inc. Reconfigurable processing-in-memory logic using look-up tables
US11355170B1 (en) * 2020-12-16 2022-06-07 Micron Technology, Inc. Reconfigurable processing-in-memory logic
CN113590195B (en) * 2021-07-22 2023-11-07 中国人民解放军国防科技大学 Memory calculation integrated DRAM computing unit supporting floating point format multiply-add

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160224098A1 (en) * 2015-01-30 2016-08-04 Alexander Gendler Communicating via a mailbox interface of a processor

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6226738B1 (en) * 1997-08-01 2001-05-01 Micron Technology, Inc. Split embedded DRAM processor
CN108369513A (en) * 2015-12-22 2018-08-03 英特尔公司 For loading-indexing-and-collect instruction and the logic of operation
CN110414677A (en) * 2019-07-11 2019-11-05 东南大学 It is a kind of to deposit interior counting circuit suitable for connect binaryzation neural network entirely

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of a reconfigurable acceleration architecture based on in-memory computing; Zhu Shikai et al.; Computer Engineering and Design (No. 04); full text *

Also Published As

Publication number Publication date
CN111124999A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
US11726791B2 (en) Generating and executing a control flow
CN111124999B (en) Dual-mode computer architecture supporting in-memory computation
US11106389B2 (en) Apparatuses and methods for data transfer from sensing circuitry to a controller
US9971541B2 (en) Apparatuses and methods for data movement
US10725680B2 (en) Apparatuses and methods to change data category values
US10884957B2 (en) Pipeline circuit architecture to provide in-memory computation functionality
Akyel et al. DRC 2: Dynamically Reconfigurable Computing Circuit based on memory architecture
US10262701B2 (en) Data transfer between subarrays in memory
US20190057727A1 (en) Memory device to provide in-memory computation functionality for a pipeline circuit architecture
US9361101B2 (en) Extension of CPU context-state management for micro-architecture state
CN111752530A (en) Machine learning architecture support for block sparsity
Vokorokos et al. Innovative operating memory architecture for computers using the data driven computation model
CN116126779A (en) 9T memory operation circuit, multiply-accumulate operation circuit, memory operation circuit and chip
EP3570286B1 (en) Apparatus for simultaneous read and precharge of a memory
Zeng et al. DM-IMCA: A dual-mode in-memory computing architecture for general purpose processing
US20210117189A1 (en) Reduced instruction set processor based on memristor
US10249361B2 (en) SRAM write driver with improved drive strength
CN113378115B (en) Near-memory sparse vector multiplier based on magnetic random access memory
Ungethüm et al. Overview on hardware optimizations for database engines
CN112463717B (en) Conditional branch implementation method under coarse-grained reconfigurable architecture
KR20190029270A (en) Processing in memory device with multiple cache and memory accessing method thereof
CN111709872B (en) Spin memory computing architecture of graph triangle counting algorithm
US20230028952A1 (en) Memory device performing in-memory operation and method thereof
US20210200538A1 (en) Dual write micro-op queue
Wu et al. Research on Array Circuit Design Based on In-Memory Computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant