CN111045727A - Processing unit array based on nonvolatile memory calculation and calculation method thereof - Google Patents

Processing unit array based on nonvolatile memory calculation and calculation method thereof Download PDF

Info

Publication number
CN111045727A
CN111045727A CN201811193218.8A
Authority
CN
China
Prior art keywords
processing unit
array
transmission gate
data
instruction decoder
Prior art date
Legal status
Granted
Application number
CN201811193218.8A
Other languages
Chinese (zh)
Other versions
CN111045727B (en)
Inventor
马建国
刘鹏
周绍华
Current Assignee
Tianjin University Marine Technology Research Institute
Original Assignee
Tianjin University Marine Technology Research Institute
Priority date
Filing date
Publication date
Application filed by Tianjin University Marine Technology Research Institute filed Critical Tianjin University Marine Technology Research Institute
Priority to CN201811193218.8A
Publication of CN111045727A
Application granted
Publication of CN111045727B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/30098 Register arrangements
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3885 Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F 9/3887 Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A processing unit array based on nonvolatile memory calculation, and a calculation method thereof, belong to the field of nonvolatile storage and computation. A nonvolatile memory (NVM) serves as the memory: the whole NVM is logically divided into a large number of sub-blocks, each serving as the memory of one processing unit, and an operation unit is added to each sub-block to form a processing unit. All processing units are uniformly controlled by one group of control signals and complete large-scale parallel operations in SIMD fashion. Each processing unit can directly fetch data from its own memory and from the memories of adjacent processing units to complete local operations, and can fetch data from other locations on the NVM, through the NVM's original read/write function, to complete global operations. The invention increases storage capacity, reduces the overhead of reading data, lowers storage power consumption, and shrinks the area of each processing unit, thereby increasing the number of units and the parallelism of the array; it has a wide application range and supports both local and global operations.

Description

Processing unit array based on nonvolatile memory calculation and calculation method thereof
Technical Field
The invention belongs to the field of nonvolatile storage and calculation, and particularly relates to a processing unit array based on nonvolatile memory calculation and a calculation method thereof.
Background
A processing element array (PE array) is an important module for performing massively parallel computing; it is a core component of most image processors and vision processors, and the same or similar structures appear in GPUs and NPUs (neural network processing units).
As described in documents a 1000 frames/s Video Chip Using Scalable Pixel-Level Parallel Processing (IEEE j ournal OF SOLID-STATE CIRCUITS, 2017), hierarchical Parallel Processor for High-speed Video Chip (IEEE transaction CIRCUITS and Systems for Video Technology, 2016), a Programmable Video Chip OF Parallel Processors (IEEE j ournal OF Parallel Processors, file), in image Processors and visual Processors proposed in the last decade, Processing cell arrays are generally used, which implement most OF the image Processing algorithms such as SIMD (Single Instruction stream) and Multiple Data algorithms, such as 2-Pixel and Multiple Data (local area) algorithms, such as local area, etc., to detect images. The Processing Unit arrays each include a plurality of Processing units (PE), the Processing units are structurally independent and interconnected by an on-chip network, each Processing Unit generally includes a storage module and a computation module, the storage module is formed by an SRAM, and the computation module includes an Arithmetic Logic Unit (ALU). Because the storage density of the SRAM is very low and the occupied chip area is large, the storage module capacity in the processing unit is small, usually between dozens of bytes and hundreds of bytes, and when facing data intensive processing tasks, more data cannot be cached; meanwhile, the area of a single processing unit is large, the number of the processing units in the array can be limited, and the parallel computing capacity of the processing unit array is further limited; the data communication between the independent processing units also requires more time and power consumption overhead, because each processing unit is only directly connected with the adjacent processing unit, local calculation and global calculation cannot be considered at the same time.
As described in the documents A Novel ReRAM-Based Processing-in-Memory Architecture for Graph Transformation (ACM Transactions on Storage, 2018), PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory (ACM/IEEE 43rd International Symposium on Computer Architecture, 2016), ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Computation (IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018), and a nonvolatile-memory-based in-memory computation method (2018), processing-in-memory (PIM) is a method of moving computation into the storage itself: by completing computation where the data is stored, it breaks through the "memory wall" and reduces the overhead of moving data between memory and computing units. In recent years the idea of directly adding logic computing units to a nonvolatile memory or a cache to realize processing-in-memory has been proposed internationally: a nonvolatile memory (NVM) approaches the read/write performance of DRAM while offering high storage density, which reduces the overhead of data migration between memory and computing units and increases the capacity of the memory or cache.
However, current PIM designs fall mainly into two types. The first adds a relatively complex processing unit to the memory, for example moving a general-purpose embedded processor into the memory; such processing units are powerful but few in number, appear in multi-core form rather than as a large-scale array, and offer very low parallelism. The second directly uses the read/write operations of the storage cells to realize a few very simple kinds of operations: for example, a neural-network processor realized through PIM uses the crossbar directly to perform only multiply-accumulate operations, and other work uses storage-cell read/write operations to realize AND, OR, and XOR logic; such designs usually suit only specific applications and cannot perform general parallel computation such as image processing. In short, existing PIM is not designed for data-intensive tasks such as image processing and cannot provide SIMD execution together with local and global computation.
Disclosure of Invention
To overcome the defects of existing processing unit arrays, namely insufficient storage capacity, large static power consumption, a small number of processing units, limited operation capability, and the inability to support both local and global operations, the invention provides a processing unit array based on nonvolatile memory calculation and a calculation method thereof.
A processing unit array based on nonvolatile memory calculation, and its calculation method, use a nonvolatile memory as the memory: the whole NVM is logically divided into a large number of sub-blocks serving as the memories of processing units, and an operation unit is added to each sub-block to form a processing unit. All processing units are uniformly controlled by one group of control signals and complete large-scale parallel operations in SIMD fashion. Each processing unit can directly fetch data from its own memory and from the memories of adjacent processing units to complete local operations, and can fetch data from other locations on the NVM, through the NVM's original read/write function, to complete global operations.
A processing unit array based on nonvolatile memory calculation comprises an instruction decoder, used for decoding instructions of the processing unit array, and a nonvolatile memory, used for storing the data to be processed. The nonvolatile memory comprises M x N identical sub-blocks forming an array of M rows and N columns; each sub-block has a row address R and a column address C in the array and is denoted Block(R, C). An operation unit, used for completing the processing of data, is added to each sub-block to form a processing unit, so that the nonvolatile memory becomes a processing unit array. One operation unit can be connected to at most four storage sub-blocks, namely the upper-left, upper-right, lower-left, and lower-right sub-blocks.
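The sub-block adjacency just described can be sketched in Python (a behavioral illustration, not part of the patent; the function name and the boundary handling are assumptions):

```python
# Behavioral sketch: the operation unit logically sitting at the corner of
# Block(R, C) can reach at most four sub-blocks; units near the bottom/right
# boundary of the M x N array reach fewer (an assumed edge-handling policy).
def reachable_blocks(R, C, M, N):
    """Sub-blocks the operation unit at (R, C) may access: its own block,
    plus the right, lower, and lower-right neighbours, clipped to the array."""
    candidates = [(R, C), (R, C + 1), (R + 1, C), (R + 1, C + 1)]
    return [(r, c) for r, c in candidates if r < M and c < N]
```

For example, `reachable_blocks(1, 1, 2, 2)` returns only `[(1, 1)]`, consistent with the later remark that corner and edge sub-blocks connect to fewer transmission-gate groups.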
The arithmetic unit comprises a 2-4 decoder, two registers A and B, an arithmetic logic unit and 4 groups of transmission gate arrays, wherein the number of the transmission gates in each group of transmission gate arrays is consistent with the bit number of the registers and the data bit width of the storage sub-blocks.
The 4 groups of transmission gate arrays in the above operation unit are denoted TG1, TG2, TG3, and TG4. The input ends of TG1, TG2, TG3, and TG4 are connected, in order, to the data buses of the storage sub-block Block(R, C) of the processing unit, the right adjacent sub-block Block(R, C+1), the lower adjacent sub-block Block(R+1, C), and the lower-right adjacent sub-block Block(R+1, C+1). The input end of register A is connected to the output ends of TG1 and TG3 of the processing unit; the input end of register B is connected to the output ends of TG1, TG2, TG3, and TG4 of the processing unit, so that the transmission gates control the storing of sub-block data into the registers. The output end of register A is connected to one input end of the arithmetic logic unit, and the output end of register B to the other input end. Each storage sub-block is connected to at most four groups of transmission gate arrays.
The control signals of the above-mentioned instruction decoder include a function signal of the arithmetic logic unit, used for selecting the function of the arithmetic logic unit, for example an arithmetic operation such as addition or subtraction, or a logical operation such as AND, OR, or XOR; its port is connected to the function-signal interface of every arithmetic logic unit in the processing unit array.
The control signals of the instruction decoder also include an address signal of the data to be processed within a sub-block, used for selecting the data to be processed. The address signal is the relative address of the data within the sub-block: the first datum in the sub-block has address 0, the second has address 1, and so on. Its ports are connected to the address interfaces of all sub-blocks in the processing unit array.
The control signal of the instruction decoder further includes a two-bit encoding signal, a port of the two-bit encoding signal is connected to the input end of all 2-4 decoders in the processing unit array, and the two-bit encoding signal is used for selecting a transmission gate connected to one of the four storage sub-blocks adjacent to the arithmetic unit.
The four input signals 00, 01, 10, 11 of the 2-4 decoder correspond to the four output signal lines 0, 1, 2, 3, respectively, and the four output signal lines 0, 1, 2, 3 are sequentially connected to the control ports of the processing units TG1, TG2, TG3, TG4 one by one, respectively, to control the four transmission gate arrays to transmit.
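The decoder mapping just stated can be modelled directly as a truth table (a minimal sketch; the function name is an assumption, not from the patent):

```python
# 2-4 decoder sketch: a two-bit code selects exactly one of the four
# transmission-gate arrays TG1..TG4 (one-hot output lines 0..3).
def decode_2_to_4(code: int) -> list[int]:
    """Map a 2-bit code (0b00..0b11) to one-hot enables for [TG1, TG2, TG3, TG4]."""
    if not 0 <= code <= 3:
        raise ValueError("code must be a 2-bit value")
    return [1 if line == code else 0 for line in range(4)]
```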
A computing method of a processing unit array based on non-volatile memory computing, comprising two operands and a computing operation, comprising the steps of:
the instruction decoder transmits the encoding signal of the first operand to all 2-4 decoders in the processing unit array; after decoding, each 2-4 decoder outputs signals to the four groups of transmission gate arrays in its processing unit, so that one and only one group of transmission gate arrays is opened and can transmit data;
the instruction decoder transmits the address signal of the first operand to the address interfaces of all storage sub-blocks in the processing unit array, reads out the data stored in the corresponding positions in the storage sub-blocks, at the moment, only one group of a plurality of groups of transmission gate arrays connected with the sub-blocks is opened, and the read data is stored in the register A through the group of transmission gate arrays;
the instruction decoder transmits the encoding signal of the second operand to all 2-4 decoders in the processing unit array, and the 2-4 decoders output signals to four groups of transmission gate arrays in the processing unit where the 2-4 decoders are located after decoding, so that one and only one group of transmission gate arrays are opened and data can be transmitted;
the instruction decoder transmits the address signal of the second operand to the address interfaces of all storage sub-blocks in the processing unit array, reads out the data stored in the corresponding positions in the storage sub-blocks, at the moment, only one group of the transmission gate arrays connected with the sub-blocks is opened, and the read data is stored in the register B through the transmission gate arrays;
the instruction decoder transmits the function signal to a function signal interface of the arithmetic logic unit, the arithmetic logic unit determines the type of calculation operation according to the function signal, then the arithmetic logic unit performs the type of calculation on the numbers in the registers A and B, and the obtained result is output through an output end of the arithmetic logic unit.
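The steps above can be sketched as a behavioral simulation (an illustrative assumption, not the patent's circuit; the class and method names are invented):

```python
# SIMD behavioral sketch: every processing unit decodes the same operand codes
# and sub-block addresses, latches its operands into registers A and B, then
# applies the ALU function selected by the instruction decoder.
class PEArraySketch:
    # operand code -> offset of the selected sub-block relative to the unit
    OFFSETS = {0b00: (0, 0), 0b01: (0, 1), 0b10: (1, 0), 0b11: (1, 1)}

    def __init__(self, blocks):
        self.blocks = blocks            # blocks[(R, C)] models one NVM sub-block

    def _read(self, R, C, code, addr):
        dr, dc = self.OFFSETS[code]
        return self.blocks[(R + dr, C + dc)][addr]

    def execute(self, units, code_a, addr_a, code_b, addr_b, alu):
        # Per the description, register A is fed only by TG1/TG3 (codes 00, 10).
        assert code_a in (0b00, 0b10)
        results = {}
        for (R, C) in units:                      # same instruction for all PEs
            a = self._read(R, C, code_a, addr_a)  # steps 1-2: latch register A
            b = self._read(R, C, code_b, addr_b)  # steps 3-4: latch register B
            results[(R, C)] = alu(a, b)           # step 5: ALU computes
        return results
```

For example, with blocks `{(0, 0): [3], (0, 1): [4], (1, 0): [5], (1, 1): [6]}`, a single unit at (0, 0) executing addition with operand codes 00 and 01 yields `{(0, 0): 7}`.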
A processing unit array based on nonvolatile memory calculation adopts a nonvolatile memory with higher storage density as the storage of each processing unit, so the storage capacity of the processing unit array is large, more data can be cached, and the hit rate is improved.
Meanwhile, the nonvolatile memory has low static power consumption and is more energy-saving than a processing unit array formed by a volatile memory such as an SRAM.
Secondly, the area of the processing unit in the invention is smaller, so that more processing units can be integrated in chips such as a vision processor/an image processor, and the parallelism of the processing unit array is greatly improved.
And thirdly, the universal operation unit is adopted in the invention, so that the function ductility is stronger, more and more complex operation functions can be realized according to the application, and the application range is wider.
Finally, the processing unit in the invention can directly process the data in the four adjacent storage sub-blocks, has excellent local arithmetic capability, and can conveniently complete various local algorithms if all the pixels of the local area are stored in the four sub-blocks in the image processing; meanwhile, the processing unit array is realized on a nonvolatile memory, so that data of any address on the memory can be read through the read-write function of the memory to carry out global operation, and local operation and global operation are both considered.
Drawings
FIG. 1 is a diagram of a processing unit array architecture based on ReRAM memory computations;
FIG. 2 is a structural diagram of a crossbar-based processing unit.
In the figures: 1. ReRAM; 2. instruction decoder; 3. crossbar; 4. operation unit; 5. 2-4 decoder; 6. transmission gate array TG1; 7. transmission gate array TG2; 8. transmission gate array TG3; 9. transmission gate array TG4; 10. register A; 11. register B; 12. arithmetic logic unit.
Detailed Description
The following describes in detail specific embodiments of the present invention with reference to the drawings, but the present invention is not limited to the embodiments disclosed below, and can be implemented in various ways.
The present embodiment employs ReRAM as the nonvolatile memory. The ReRAM is composed of a large number of crossbars, each having word lines (WL) and bit lines (BL): WL selects a row of memory cells, and BL reads out the data of the cell at its intersection with WL; therefore, WL + BL can be regarded as the address of a memory cell within a crossbar.
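The WL + BL addressing remark can be illustrated as follows (a sketch; the row-major linearisation and the parameter names are assumptions, not from the patent):

```python
# Treating (WL, BL) as a cell address inside one crossbar: with num_bls bit
# lines crossing each word line, a row-major linearisation gives the relative
# address used by the instruction decoder's address signals.
def cell_address(wl: int, bl: int, num_bls: int) -> int:
    """Relative address of the cell at word line wl, bit line bl."""
    return wl * num_bls + bl
```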
The processing unit array in this embodiment includes an instruction decoder, used for decoding instructions of the processing unit array, and a ReRAM, used for storing the data to be processed. The ReRAM comprises M x N identical crossbars forming an array of M rows and N columns; each crossbar has a row address R and a column address C in the array and is denoted Block(R, C). An operation unit, used for completing the processing of data, is added to each crossbar to form a processing unit, so the ReRAM becomes a processing unit array, as shown in FIG. 1, in which one operation unit can be connected to at most four crossbars: the upper-left, upper-right, lower-left, and lower-right.
As shown in FIG. 2, the above-mentioned operation unit includes a 2-4 decoder, two registers A and B, an arithmetic logic unit, and 4 groups of transmission gate arrays. The number of transmission gates in each group matches the bit width of the registers and the data bus width of the crossbar; each transmission gate has a control-signal port, the control ports of each group are tied together, and when the control signal is high the gates open and can transmit data. The arithmetic logic unit can complete operations such as addition, subtraction, AND, OR, NOT, XOR, and shift, the operation to be executed being determined by the function control signal. Registers A and B are each composed of edge-triggered flip-flops, active on the rising edge.
The 4 groups of transmission gate arrays in the above operation unit are denoted TG1, TG2, TG3, and TG4. The input ends of TG1, TG2, TG3, and TG4 are connected, in order, to the data buses of the crossbar Block(R, C) of the processing unit, the right adjacent Block(R, C+1), the lower adjacent Block(R+1, C), and the lower-right adjacent Block(R+1, C+1). The input end of register A is connected to the output ends of TG1 and TG3 of the processing unit; the input end of register B is connected to the output ends of TG1, TG2, TG3, and TG4 of the processing unit, so that the transmission gates control the storing of crossbar data into the registers. The output end of register A is connected to one input end of the arithmetic logic unit, and the output end of register B to the other input end. Each crossbar is connected to at most four groups of transmission gate arrays: a crossbar at one of the four corners of the array connects to one group, and a crossbar on an edge of the array connects to two groups.
The control signals of the instruction decoder described above include functional signals of the arithmetic logic unit for selecting the function of the arithmetic logic unit, including addition, subtraction, and, or, not, exclusive or, shift, etc., and the ports thereof are connected to the functional signal interfaces of all the arithmetic logic units in the processing unit array.
The control signals of the instruction decoder also include the address signals, namely WL and BL, of the data to be processed within a crossbar, used for selecting the data to be processed. The address is the relative address of the data within the crossbar: the first datum in the crossbar has address 0, the second has address 1, and so on. Its ports are connected to the address interfaces of all crossbars in the processing unit array.
The control signal of the instruction decoder further comprises a two-bit encoding signal, a port of the two-bit encoding signal is connected with the input end of all 2-4 decoders in the processing unit array, and the two-bit encoding signal is used for selecting a transmission gate connected with one of four crossbar adjacent to the operation unit.
The four input signals 00, 01, 10, 11 of the 2-4 decoder correspond to the four output signal lines 0, 1, 2, 3, respectively, and the four output signal lines 0, 1, 2, 3 are sequentially connected to the control ports of the processing units TG1, TG2, TG3, TG4 one by one, respectively, to control the four transmission gate arrays to transmit.
The computing method based on the processing unit array involves two operands and one computing operation. Take X + Y as an embodiment, where X is the datum at address Addr1 in each processing unit's own crossbar, i.e., the crossbar at the upper left of the operation unit, with code 00, and Y is the datum at address Addr2 in the crossbar to the right, with code 01. The method comprises the following steps:
firstly, an instruction decoder transmits an encoding signal 00 of a first operand X to all 2-4 decoders in the processing unit array, the 2-4 decoders output signals to four groups of transmission gate arrays in the processing unit where the 2-4 decoders are located after decoding, and one group of transmission gate arrays, namely TG1, is opened to transmit data;
secondly, the instruction decoder transmits the address signal Addr1 of the first operand X to the WL and BL of all crossbars in the processing unit array, reading out the data X stored at the corresponding position in each crossbar; at this moment only TG1, among the groups of transmission gate arrays connected to the crossbar, is open, so on the rising clock edge register A latches the read data X through that transmission gate array, while register B receives no rising-edge trigger and does not read the data;
thirdly, the instruction decoder transmits the encoding signal of the second operand Y to all 2-4 decoders in the processing unit array, and the 2-4 decoders output signals to four groups of transmission gate arrays in the processing unit where the 2-4 decoders are located after decoding, so that one group of transmission gate arrays and only one group of transmission gate arrays TG2 are opened to transmit data;
fourthly, the instruction decoder transmits an address signal Addr2 of a second operand Y to WL and BL of all cross bars in the processing unit array, reads data Y stored at a corresponding position in the cross bars, at the moment, only TG2 in a plurality of groups of transmission gate arrays connected with the cross bars is opened, a clock rising edge triggers a register B, the read data is stored in the register B through a transmission gate array TG2, at the moment, the register A does not obtain rising edge triggering, and the stored data cannot be changed;
and fifthly, the instruction decoder transmits the function signal to a function signal interface of the arithmetic logic unit, the arithmetic logic unit determines an addition function according to the function signal, then the numbers in the registers A and B are added, and the obtained result is output through an output end of the arithmetic logic unit.
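The X + Y walk-through above can be reproduced as a small self-contained sketch (the grid contents, sizes, and names are illustrative assumptions, not from the patent):

```python
# X lives at Addr1 in each PE's own (upper-left, code 00) crossbar; Y lives at
# Addr2 in the right-hand (code 01) crossbar. Every PE adds its pair in lockstep.
ADDR1, ADDR2 = 0, 1
crossbars = {                      # a 2 x 3 grid of crossbars -> a 1 x 2 row of PEs
    (0, 0): [10, 0], (0, 1): [20, 5], (0, 2): [30, 7],
    (1, 0): [0, 0],  (1, 1): [0, 0],  (1, 2): [0, 0],
}

def pe_add(R, C):
    x = crossbars[(R, C)][ADDR1]       # code 00: own crossbar, address Addr1
    y = crossbars[(R, C + 1)][ADDR2]   # code 01: right neighbour, address Addr2
    return x + y

sums = [pe_add(0, C) for C in (0, 1)]  # both PEs execute the same instruction
```

Here `sums` evaluates to `[15, 27]`: each PE has read its own X through TG1 and its neighbour's Y through TG2 before the shared addition instruction fires.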

Claims (8)

1. A processing unit array based on non-volatile memory computing, comprising: a non-volatile memory comprising M × N identical sub-blocks forming an array of M rows and N columns, each sub-block having a row address R and a column address C in the array and being denoted Block(R, C); each sub-block is provided with an operation unit to form a processing unit.
2. The non-volatile memory computing-based processing unit array of claim 1, wherein: the arithmetic unit comprises a 2-4 decoder, two registers A and B, an arithmetic logic unit and 4 groups of transmission gate arrays, wherein the number of transmission gates in each group of transmission gate arrays is consistent with the bit number of the registers and the data bit width of the storage subblocks.
3. The non-volatile memory computing-based processing unit array of claim 1, wherein: the 4 groups of transmission gate arrays in the operation unit are respectively denoted TG1, TG2, TG3 and TG4; the input ends of TG1, TG2, TG3 and TG4 are sequentially and respectively connected with the data buses of the storage sub-block Block(R, C) of the processing unit, the right adjacent sub-block Block(R, C+1), the lower adjacent sub-block Block(R+1, C) and the lower-right adjacent sub-block Block(R+1, C+1); the input end of register A is connected with the output ends of TG1 and TG3 of the processing unit; the input end of register B is connected with the output ends of TG1, TG2, TG3 and TG4 of the processing unit; the output end of register A is connected with one input end of the arithmetic logic unit, and the output end of register B is connected with the other input end of the arithmetic logic unit.
4. The non-volatile memory computing-based processing unit array of claim 1, wherein: the control signals of the instruction decoder include functional signals of the arithmetic logic unit for selecting the function of the arithmetic logic unit, and the port of the control signals of the instruction decoder is connected with the functional signal interfaces of all the arithmetic logic units in the processing unit array.
5. The non-volatile memory computing-based processing unit array of claim 1, wherein: the control signals of the instruction decoder also comprise address signals of the data to be processed in the subblocks, which are used for selecting the data to be processed, and the ports of the control signals are connected with the address interfaces of all the subblocks in the processing unit array.
6. The non-volatile memory computing-based processing unit array of claim 1, wherein: the control signals of the instruction decoder also comprise two-bit encoding signals for selecting the storage sub-block where the operand is located, and the port of the two-bit encoding signals is connected with the input end of all 2-4 decoders in the processing unit array.
7. The non-volatile memory computing-based processing unit array of claim 2, wherein: the four input signals 00, 01, 10 and 11 of the 2-4 decoder respectively correspond to four output signal lines 0, 1, 2 and 3, and the four output signal lines 0, 1, 2 and 3 are sequentially and respectively connected with control ports of TG1, TG2, TG3 and TG4 of a processing unit where the four output signal lines are located one by one.
8. A computing method of a processing unit array based on non-volatile memory computing, characterized by: comprising two operands and a calculation operation, which comprises in particular the steps of:
firstly, an instruction decoder transmits an encoding signal of a first operand to all 2-4 decoders in the processing unit array, and the 2-4 decoders output signals to four groups of transmission gate arrays in the processing unit where the 2-4 decoders are located after decoding, so that one group and only one group of transmission gate arrays are opened and can transmit data;
secondly, the instruction decoder transmits the address signal of the first operand to the address interfaces of all storage sub-blocks in the processing unit array, reads out the data stored at the corresponding position in each storage sub-block, and stores each datum into register A through the open transmission gate array connected to that sub-block;
thirdly, the instruction decoder transmits the encoding signal of the second operand to all 2-4 decoders in the processing unit array, and the 2-4 decoders output signals to four groups of transmission gate arrays in the processing unit where the 2-4 decoders are located after decoding, so that one group and only one group of transmission gate arrays are opened and can transmit data;
fourthly, the instruction decoder transmits the address signal of the second operand to the address interfaces of all storage sub-blocks in the processing unit array, reads out the data stored at the corresponding position in each storage sub-block, and stores each datum into register B through the open transmission gate array connected to that sub-block;
in a fifth step, the instruction decoder transmits the function signal to a function signal interface of the arithmetic logic unit, and the arithmetic logic unit determines the type of calculation operation according to the function signal and then performs the type of calculation on the numbers in the registers a and B.
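As a rough behavioral sketch of the five-step flow for a single processing unit (not the patented circuit itself), the sequence might look as follows; all names here (ProcessingUnit, ALU_OPS, the bank contents) are hypothetical, and the instruction decoder would in practice broadcast each signal to every unit in the array:

```python
# Hypothetical single-unit model: four storage sub-blocks ("banks"), one
# transmission gate group per bank, registers A/B, and a small ALU.
ALU_OPS = {"ADD": lambda a, b: a + b, "AND": lambda a, b: a & b}

class ProcessingUnit:
    def __init__(self, banks):
        self.banks = banks        # four storage sub-blocks, one per TG group
        self.reg_a = 0
        self.reg_b = 0
        self.selected = 0         # index of the currently open TG group

    def select(self, code):       # steps 1 and 3: 2-4 decode opens one TG group
        self.selected = int(code, 2)

    def load(self, addr, target): # steps 2 and 4: read through the open TG group
        setattr(self, target, self.banks[self.selected][addr])

    def execute(self, op):        # step 5: ALU applies the function signal
        return ALU_OPS[op](self.reg_a, self.reg_b)

pu = ProcessingUnit([[3, 5], [7, 9], [2, 4], [6, 8]])
pu.select("00"); pu.load(1, "reg_a")   # first operand: bank 0, address 1 -> A
pu.select("01"); pu.load(0, "reg_b")   # second operand: bank 1, address 0 -> B
result = pu.execute("ADD")             # ALU adds registers A and B
```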
CN201811193218.8A 2018-10-14 2018-10-14 Processing unit array based on nonvolatile memory calculation and calculation method thereof Active CN111045727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811193218.8A CN111045727B (en) 2018-10-14 2018-10-14 Processing unit array based on nonvolatile memory calculation and calculation method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811193218.8A CN111045727B (en) 2018-10-14 2018-10-14 Processing unit array based on nonvolatile memory calculation and calculation method thereof

Publications (2)

Publication Number Publication Date
CN111045727A true CN111045727A (en) 2020-04-21
CN111045727B CN111045727B (en) 2023-09-05

Family

ID=70230117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811193218.8A Active CN111045727B (en) 2018-10-14 2018-10-14 Processing unit array based on nonvolatile memory calculation and calculation method thereof

Country Status (1)

Country Link
CN (1) CN111045727B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723913A (en) * 2020-06-19 2020-09-29 浪潮电子信息产业股份有限公司 Data processing method, device and equipment and readable storage medium
CN111723907A (en) * 2020-06-11 2020-09-29 浪潮电子信息产业股份有限公司 Model training device, method, system and computer readable storage medium
CN112967172A (en) * 2021-02-26 2021-06-15 成都商汤科技有限公司 Data processing device, method, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1317799A (en) * 2000-04-10 2001-10-17 黄松柏 On-line programmable computer ICs for automatic control
CN103150146A (en) * 2013-01-31 2013-06-12 西安电子科技大学 ASIP (application-specific instruction-set processor) based on extensible processor architecture and realizing method thereof
CN103345448A (en) * 2013-07-10 2013-10-09 广西科技大学 Two-read-out and one-read-in storage controller integrating addressing and storage
CN104090859A (en) * 2014-06-26 2014-10-08 北京邮电大学 Address decoding method based on multi-valued logic circuit
CN105845174A (en) * 2015-01-12 2016-08-10 上海新储集成电路有限公司 Nonvolatile look-up table memory cell composition and implementation method of look-up table circuit
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN108335716A (en) * 2018-01-26 2018-07-27 北京航空航天大学 A kind of memory computational methods based on nonvolatile storage


Also Published As

Publication number Publication date
CN111045727B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
Akyel et al. DRC 2: Dynamically Reconfigurable Computing Circuit based on memory architecture
CN109766309B (en) Spin-save integrated chip
Bavikadi et al. A review of in-memory computing architectures for machine learning applications
US7791962B2 (en) Semiconductor device and semiconductor signal processing apparatus
WO2017172398A1 (en) Apparatuses and methods for data movement
WO2017142826A1 (en) Apparatuses and methods for data movement
US20030196030A1 (en) Method and apparatus for an energy efficient operation of multiple processors in a memory
JPH07282237A (en) Semiconductor integrated circuit
CN111045727B (en) Processing unit array based on nonvolatile memory calculation and calculation method thereof
US20060101231A1 (en) Semiconductor signal processing device
CN110674462B (en) Matrix operation device, method, processor and computer readable storage medium
US11468002B2 (en) Computational memory with cooperation among rows of processing elements and memory thereof
Angizi et al. Pisa: A binary-weight processing-in-sensor accelerator for edge image processing
CN111048135A (en) CNN processing device based on memristor memory calculation and working method thereof
Zhao et al. NAND-SPIN-based processing-in-MRAM architecture for convolutional neural network acceleration
US20210295906A1 (en) Storage Unit and Static Random Access Memory
CN111124999A (en) Dual-mode computer framework supporting in-memory computation
CN106021171A (en) An SM4-128 secret key extension realization method and system based on a large-scale coarseness reconfigurable processor
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
US20220318610A1 (en) Programmable in-memory computing accelerator for low-precision deep neural network inference
Do et al. Enhancing matrix multiplication with a monolithic 3-d-based scratchpad memory
US20210241806A1 (en) Streaming access memory device, system and method
Zhao et al. A Novel Transpose 2T-DRAM based Computing-in-Memory Architecture for On-chip DNN Training and Inference
Wang et al. Parallel stateful logic in RRAM: Theoretical analysis and arithmetic design
JPH0259943A (en) Memory device with operational function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant