CN111045727A - Processing unit array based on nonvolatile memory calculation and calculation method thereof - Google Patents

Processing unit array based on nonvolatile memory calculation and calculation method thereof Download PDF

Info

Publication number
CN111045727A
CN111045727A CN201811193218.8A
Authority
CN
China
Prior art keywords
processing unit
array
transmission gate
data
instruction decoder
Prior art date
Legal status
Granted
Application number
CN201811193218.8A
Other languages
Chinese (zh)
Other versions
CN111045727B (en)
Inventor
马建国
刘鹏
周绍华
Current Assignee
Tianjin University Marine Technology Research Institute
Original Assignee
Tianjin University Marine Technology Research Institute
Priority date
Filing date
Publication date
Application filed by Tianjin University Marine Technology Research Institute filed Critical Tianjin University Marine Technology Research Institute
Priority to CN201811193218.8A
Publication of CN111045727A
Application granted
Publication of CN111045727B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/30098 Register arrangements
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3885 Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F 9/3887 Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A processing unit array based on nonvolatile memory calculation, and a calculation method thereof, belong to the field of nonvolatile storage and computation. A nonvolatile memory (NVM) serves as the memory: the whole NVM is logically divided into a large number of sub-blocks, each serving as the memory of one processing unit, and an operation unit is added to each sub-block to form a processing unit. All processing units are uniformly controlled by one group of control signals and complete large-scale parallel operations in SIMD fashion. Each processing unit can directly fetch data from its own memory and from the memories of adjacent processing units to complete local operations, and can fetch data from other locations on the NVM, through the NVM's original read/write function, to complete global operations. The invention increases storage capacity, reduces the overhead of reading data, lowers storage power consumption, and shrinks the area of each processing unit, thereby increasing the number of units and the parallelism of the array; it has a wide application range and supports both local and global operations.

Description

Processing unit array based on nonvolatile memory calculation and calculation method thereof
Technical Field
The invention belongs to the field of nonvolatile storage and calculation, and particularly relates to a processing unit array based on nonvolatile memory calculation and a calculation method thereof.
Background
A processing element array (PE array) is an important module for performing massively parallel computing; it is a core component of most image processors and vision processors, and the same or similar structures appear in GPUs and NPUs (neural network processing units).
As described in documents a 1000 frames/s Video Chip Using Scalable Pixel-Level Parallel Processing (IEEE j ournal OF SOLID-STATE CIRCUITS, 2017), hierarchical Parallel Processor for High-speed Video Chip (IEEE transaction CIRCUITS and Systems for Video Technology, 2016), a Programmable Video Chip OF Parallel Processors (IEEE j ournal OF Parallel Processors, file), in image Processors and visual Processors proposed in the last decade, Processing cell arrays are generally used, which implement most OF the image Processing algorithms such as SIMD (Single Instruction stream) and Multiple Data algorithms, such as 2-Pixel and Multiple Data (local area) algorithms, such as local area, etc., to detect images. The Processing Unit arrays each include a plurality of Processing units (PE), the Processing units are structurally independent and interconnected by an on-chip network, each Processing Unit generally includes a storage module and a computation module, the storage module is formed by an SRAM, and the computation module includes an Arithmetic Logic Unit (ALU). Because the storage density of the SRAM is very low and the occupied chip area is large, the storage module capacity in the processing unit is small, usually between dozens of bytes and hundreds of bytes, and when facing data intensive processing tasks, more data cannot be cached; meanwhile, the area of a single processing unit is large, the number of the processing units in the array can be limited, and the parallel computing capacity of the processing unit array is further limited; the data communication between the independent processing units also requires more time and power consumption overhead, because each processing unit is only directly connected with the adjacent processing unit, local calculation and global calculation cannot be considered at the same time.
As described in the documents A Novel ReRAM-Based Processing-in-Memory Architecture for Graph Transformation (ACM Transactions on Storage, 2018), PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory (ACM/IEEE 43rd International Symposium on Computer Architecture, 2016), ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Computation (IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018), and a nonvolatile-memory-based in-memory computation method (2018), processing-in-memory (PIM) is a method of moving computation into the storage itself: by completing computation where the data is stored, it breaks through the "memory wall" and reduces the overhead of moving data between memory and computing units. In recent years the idea of directly adding logic computing units to a nonvolatile memory or a cache to realize processing-in-memory has been proposed internationally: a nonvolatile memory (NVM) approaches the read/write performance of DRAM while offering high storage density, which reduces the overhead of data migration between memory and computing units and increases the capacity of the memory or cache.
However, current PIM designs fall mainly into two types. The first adds a relatively complex processing unit to the memory, for example moving a general-purpose embedded processor into the memory; such processing units are powerful but few in number, appear in multi-core form rather than as a large-scale array, and offer very low parallelism. The second directly uses the read/write operations of the storage cells to realize a few very simple kinds of operations: for example, a neural-network processor realized through PIM uses the crossbar directly to perform only multiply-accumulate operations, and other work uses storage-cell read/write operations to realize AND, OR, and XOR logic; such designs usually suit only specific applications and cannot perform general parallel computation such as image processing. In short, existing PIM is not designed for data-intensive tasks such as image processing and cannot provide SIMD execution together with local and global computation.
Disclosure of Invention
To overcome the defects of existing processing unit arrays, namely insufficient storage capacity, large static power consumption, a small number of processing units, limited operation capability, and the inability to support both local and global operations, the invention provides a processing unit array based on nonvolatile memory calculation and a calculation method thereof.
A processing unit array based on nonvolatile memory calculation, and its calculation method, use a nonvolatile memory as the memory: the whole NVM is logically divided into a large number of sub-blocks serving as the memories of processing units, and an operation unit is added to each sub-block to form a processing unit. All processing units are uniformly controlled by one group of control signals and complete large-scale parallel operations in SIMD fashion. Each processing unit can directly fetch data from its own memory and from the memories of adjacent processing units to complete local operations, and can fetch data from other locations on the NVM, through the NVM's original read/write function, to complete global operations.
A processing unit array based on nonvolatile memory calculation comprises an instruction decoder, used for decoding instructions of the processing unit array, and a nonvolatile memory, used for storing the data to be processed. The nonvolatile memory comprises M x N identical sub-blocks forming an array of M rows and N columns; each sub-block has a row address R and a column address C in the array and is denoted Block(R, C). An operation unit, used for completing the processing of data, is added to each sub-block to form a processing unit, so that the nonvolatile memory becomes a processing unit array. One operation unit can be connected to at most four storage sub-blocks, namely the upper-left, upper-right, lower-left, and lower-right sub-blocks.
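The sub-block adjacency just described can be sketched in Python (a behavioral illustration, not part of the patent; the function name and the boundary handling are assumptions):

```python
# Behavioral sketch: the operation unit logically sitting at the corner of
# Block(R, C) can reach at most four sub-blocks; units near the bottom/right
# boundary of the M x N array reach fewer (an assumed edge-handling policy).
def reachable_blocks(R, C, M, N):
    """Sub-blocks the operation unit at (R, C) may access: its own block,
    plus the right, lower, and lower-right neighbours, clipped to the array."""
    candidates = [(R, C), (R, C + 1), (R + 1, C), (R + 1, C + 1)]
    return [(r, c) for r, c in candidates if r < M and c < N]
```

For example, `reachable_blocks(1, 1, 2, 2)` returns only `[(1, 1)]`, consistent with the later remark that corner and edge sub-blocks connect to fewer transmission-gate groups.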
The arithmetic unit comprises a 2-4 decoder, two registers A and B, an arithmetic logic unit and 4 groups of transmission gate arrays, wherein the number of the transmission gates in each group of transmission gate arrays is consistent with the bit number of the registers and the data bit width of the storage sub-blocks.
The 4 groups of transmission gate arrays in the above operation unit are denoted TG1, TG2, TG3, and TG4. The input ends of TG1, TG2, TG3, and TG4 are connected, in order, to the data buses of the storage sub-block Block(R, C) of the processing unit, the right adjacent sub-block Block(R, C+1), the lower adjacent sub-block Block(R+1, C), and the lower-right adjacent sub-block Block(R+1, C+1). The input end of register A is connected to the output ends of TG1 and TG3 of the processing unit; the input end of register B is connected to the output ends of TG1, TG2, TG3, and TG4 of the processing unit, so that the transmission gates control the storing of sub-block data into the registers. The output end of register A is connected to one input end of the arithmetic logic unit, and the output end of register B to the other input end. Each storage sub-block is connected to at most four groups of transmission gate arrays.
The control signals of the above-mentioned instruction decoder include a function signal of the arithmetic logic unit, used for selecting the function of the arithmetic logic unit, for example an arithmetic operation such as addition or subtraction, or a logical operation such as AND, OR, or XOR; its port is connected to the function-signal interface of every arithmetic logic unit in the processing unit array.
The control signals of the instruction decoder also include an address signal of the data to be processed within a sub-block, used for selecting the data to be processed. The address signal is the relative address of the data within the sub-block: the first datum in the sub-block has address 0, the second has address 1, and so on. Its ports are connected to the address interfaces of all sub-blocks in the processing unit array.
The control signal of the instruction decoder further includes a two-bit encoding signal, a port of the two-bit encoding signal is connected to the input end of all 2-4 decoders in the processing unit array, and the two-bit encoding signal is used for selecting a transmission gate connected to one of the four storage sub-blocks adjacent to the arithmetic unit.
The four input signals 00, 01, 10, 11 of the 2-4 decoder correspond to the four output signal lines 0, 1, 2, 3, respectively, and the four output signal lines 0, 1, 2, 3 are sequentially connected to the control ports of the processing units TG1, TG2, TG3, TG4 one by one, respectively, to control the four transmission gate arrays to transmit.
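The decoder mapping just stated can be modelled directly as a truth table (a minimal sketch; the function name is an assumption, not from the patent):

```python
# 2-4 decoder sketch: a two-bit code selects exactly one of the four
# transmission-gate arrays TG1..TG4 (one-hot output lines 0..3).
def decode_2_to_4(code: int) -> list[int]:
    """Map a 2-bit code (0b00..0b11) to one-hot enables for [TG1, TG2, TG3, TG4]."""
    if not 0 <= code <= 3:
        raise ValueError("code must be a 2-bit value")
    return [1 if line == code else 0 for line in range(4)]
```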
A computing method of a processing unit array based on non-volatile memory computing, comprising two operands and a computing operation, comprising the steps of:
the instruction decoder transmits the encoding signal of the first operand to all 2-4 decoders in the processing unit array; after decoding, each 2-4 decoder outputs signals to the four groups of transmission gate arrays in its processing unit, so that one and only one group of transmission gate arrays is opened and can transmit data;
the instruction decoder transmits the address signal of the first operand to the address interfaces of all storage sub-blocks in the processing unit array, reads out the data stored in the corresponding positions in the storage sub-blocks, at the moment, only one group of a plurality of groups of transmission gate arrays connected with the sub-blocks is opened, and the read data is stored in the register A through the group of transmission gate arrays;
the instruction decoder transmits the encoding signal of the second operand to all 2-4 decoders in the processing unit array, and the 2-4 decoders output signals to four groups of transmission gate arrays in the processing unit where the 2-4 decoders are located after decoding, so that one and only one group of transmission gate arrays are opened and data can be transmitted;
the instruction decoder transmits the address signal of the second operand to the address interfaces of all storage sub-blocks in the processing unit array, reads out the data stored in the corresponding positions in the storage sub-blocks, at the moment, only one group of the transmission gate arrays connected with the sub-blocks is opened, and the read data is stored in the register B through the transmission gate arrays;
the instruction decoder transmits the function signal to a function signal interface of the arithmetic logic unit, the arithmetic logic unit determines the type of calculation operation according to the function signal, then the arithmetic logic unit performs the type of calculation on the numbers in the registers A and B, and the obtained result is output through an output end of the arithmetic logic unit.
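The steps above can be sketched as a behavioral simulation (an illustrative assumption, not the patent's circuit; the class and method names are invented):

```python
# SIMD behavioral sketch: every processing unit decodes the same operand codes
# and sub-block addresses, latches its operands into registers A and B, then
# applies the ALU function selected by the instruction decoder.
class PEArraySketch:
    # operand code -> offset of the selected sub-block relative to the unit
    OFFSETS = {0b00: (0, 0), 0b01: (0, 1), 0b10: (1, 0), 0b11: (1, 1)}

    def __init__(self, blocks):
        self.blocks = blocks            # blocks[(R, C)] models one NVM sub-block

    def _read(self, R, C, code, addr):
        dr, dc = self.OFFSETS[code]
        return self.blocks[(R + dr, C + dc)][addr]

    def execute(self, units, code_a, addr_a, code_b, addr_b, alu):
        # Per the description, register A is fed only by TG1/TG3 (codes 00, 10).
        assert code_a in (0b00, 0b10)
        results = {}
        for (R, C) in units:                      # same instruction for all PEs
            a = self._read(R, C, code_a, addr_a)  # steps 1-2: latch register A
            b = self._read(R, C, code_b, addr_b)  # steps 3-4: latch register B
            results[(R, C)] = alu(a, b)           # step 5: ALU computes
        return results
```

For example, with blocks `{(0, 0): [3], (0, 1): [4], (1, 0): [5], (1, 1): [6]}`, a single unit at (0, 0) executing addition with operand codes 00 and 01 yields `{(0, 0): 7}`.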
A processing unit array based on nonvolatile memory calculation adopts a nonvolatile memory with higher storage density as the storage of each processing unit, so the storage capacity of the processing unit array is large, more data can be cached, and the hit rate is improved.
Meanwhile, the nonvolatile memory has low static power consumption and is more energy-saving than a processing unit array formed by a volatile memory such as an SRAM.
Secondly, the area of the processing unit in the invention is smaller, so that more processing units can be integrated in chips such as a vision processor/an image processor, and the parallelism of the processing unit array is greatly improved.
And thirdly, the universal operation unit is adopted in the invention, so that the function ductility is stronger, more and more complex operation functions can be realized according to the application, and the application range is wider.
Finally, the processing unit in the invention can directly process the data in the four adjacent storage sub-blocks, has excellent local arithmetic capability, and can conveniently complete various local algorithms if all the pixels of the local area are stored in the four sub-blocks in the image processing; meanwhile, the processing unit array is realized on a nonvolatile memory, so that data of any address on the memory can be read through the read-write function of the memory to carry out global operation, and local operation and global operation are both considered.
Drawings
FIG. 1 is a diagram of a processing unit array architecture based on ReRAM memory computations;
FIG. 2 is a structural diagram of a crossbar-based processing unit.
In the figures: 1. ReRAM; 2. instruction decoder; 3. crossbar; 4. operation unit; 5. 2-4 decoder; 6. transmission gate array TG1; 7. transmission gate array TG2; 8. transmission gate array TG3; 9. transmission gate array TG4; 10. register A; 11. register B; 12. arithmetic logic unit.
Detailed Description
The following describes in detail specific embodiments of the present invention with reference to the drawings, but the present invention is not limited to the embodiments disclosed below, and can be implemented in various ways.
The present embodiment employs ReRAM as the nonvolatile memory. The ReRAM is composed of a large number of crossbars, each having word lines (WL) and bit lines (BL): WL selects a row of memory cells, and BL reads out the data of the cell at its intersection with WL; therefore, WL + BL can be regarded as the address of a memory cell within a crossbar.
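The WL + BL addressing remark can be illustrated as follows (a sketch; the row-major linearisation and the parameter names are assumptions, not from the patent):

```python
# Treating (WL, BL) as a cell address inside one crossbar: with num_bls bit
# lines crossing each word line, a row-major linearisation gives the relative
# address used by the instruction decoder's address signals.
def cell_address(wl: int, bl: int, num_bls: int) -> int:
    """Relative address of the cell at word line wl, bit line bl."""
    return wl * num_bls + bl
```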
The processing unit array in this embodiment includes an instruction decoder, used for decoding instructions of the processing unit array, and a ReRAM, used for storing the data to be processed. The ReRAM comprises M x N identical crossbars forming an array of M rows and N columns; each crossbar has a row address R and a column address C in the array and is denoted Block(R, C). An operation unit, used for completing the processing of data, is added to each crossbar to form a processing unit, so the ReRAM becomes a processing unit array, as shown in FIG. 1, in which one operation unit can be connected to at most four crossbars: the upper-left, upper-right, lower-left, and lower-right.
As shown in FIG. 2, the above-mentioned operation unit includes a 2-4 decoder, two registers A and B, an arithmetic logic unit, and 4 groups of transmission gate arrays. The number of transmission gates in each group matches the bit width of the registers and the data bus width of the crossbar; each transmission gate has a control-signal port, the control ports of each group are tied together, and when the control signal is high the gates open and can transmit data. The arithmetic logic unit can complete operations such as addition, subtraction, AND, OR, NOT, XOR, and shift, the operation to be executed being determined by the function control signal. Registers A and B are each composed of edge-triggered flip-flops, active on the rising edge.
The 4 groups of transmission gate arrays in the above operation unit are denoted TG1, TG2, TG3, and TG4. The input ends of TG1, TG2, TG3, and TG4 are connected, in order, to the data buses of the crossbar Block(R, C) of the processing unit, the right adjacent Block(R, C+1), the lower adjacent Block(R+1, C), and the lower-right adjacent Block(R+1, C+1). The input end of register A is connected to the output ends of TG1 and TG3 of the processing unit; the input end of register B is connected to the output ends of TG1, TG2, TG3, and TG4 of the processing unit, so that the transmission gates control the storing of crossbar data into the registers. The output end of register A is connected to one input end of the arithmetic logic unit, and the output end of register B to the other input end. Each crossbar is connected to at most four groups of transmission gate arrays: a crossbar at one of the four corners of the array connects to one group, and a crossbar on an edge of the array connects to two groups.
The control signals of the instruction decoder described above include functional signals of the arithmetic logic unit for selecting the function of the arithmetic logic unit, including addition, subtraction, and, or, not, exclusive or, shift, etc., and the ports thereof are connected to the functional signal interfaces of all the arithmetic logic units in the processing unit array.
The control signals of the instruction decoder also include the address signals, namely WL and BL, of the data to be processed within a crossbar, used for selecting the data to be processed. The address is the relative address of the data within the crossbar: the first datum in the crossbar has address 0, the second has address 1, and so on. Its ports are connected to the address interfaces of all crossbars in the processing unit array.
The control signal of the instruction decoder further comprises a two-bit encoding signal, a port of the two-bit encoding signal is connected with the input end of all 2-4 decoders in the processing unit array, and the two-bit encoding signal is used for selecting a transmission gate connected with one of four crossbar adjacent to the operation unit.
The four input signals 00, 01, 10, 11 of the 2-4 decoder correspond to the four output signal lines 0, 1, 2, 3, respectively, and the four output signal lines 0, 1, 2, 3 are sequentially connected to the control ports of the processing units TG1, TG2, TG3, TG4 one by one, respectively, to control the four transmission gate arrays to transmit.
The computing method based on the processing unit array involves two operands and one computing operation. Take X + Y as an embodiment, where X is the datum at address Addr1 in each processing unit's own crossbar, i.e., the crossbar at the upper left of the operation unit, with code 00, and Y is the datum at address Addr2 in the crossbar to the right, with code 01. The method comprises the following steps:
firstly, an instruction decoder transmits an encoding signal 00 of a first operand X to all 2-4 decoders in the processing unit array, the 2-4 decoders output signals to four groups of transmission gate arrays in the processing unit where the 2-4 decoders are located after decoding, and one group of transmission gate arrays, namely TG1, is opened to transmit data;
secondly, the instruction decoder transmits the address signal Addr1 of the first operand X to the WL and BL of all crossbars in the processing unit array, reading out the data X stored at the corresponding position in each crossbar; at this moment only TG1, among the groups of transmission gate arrays connected to the crossbar, is open, so on the rising clock edge register A latches the read data X through that transmission gate array, while register B receives no rising-edge trigger and does not read the data;
thirdly, the instruction decoder transmits the encoding signal of the second operand Y to all 2-4 decoders in the processing unit array, and the 2-4 decoders output signals to four groups of transmission gate arrays in the processing unit where the 2-4 decoders are located after decoding, so that one group of transmission gate arrays and only one group of transmission gate arrays TG2 are opened to transmit data;
fourthly, the instruction decoder transmits an address signal Addr2 of a second operand Y to WL and BL of all cross bars in the processing unit array, reads data Y stored at a corresponding position in the cross bars, at the moment, only TG2 in a plurality of groups of transmission gate arrays connected with the cross bars is opened, a clock rising edge triggers a register B, the read data is stored in the register B through a transmission gate array TG2, at the moment, the register A does not obtain rising edge triggering, and the stored data cannot be changed;
and fifthly, the instruction decoder transmits the function signal to a function signal interface of the arithmetic logic unit, the arithmetic logic unit determines an addition function according to the function signal, then the numbers in the registers A and B are added, and the obtained result is output through an output end of the arithmetic logic unit.
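The X + Y walk-through above can be reproduced as a small self-contained sketch (the grid contents, sizes, and names are illustrative assumptions, not from the patent):

```python
# X lives at Addr1 in each PE's own (upper-left, code 00) crossbar; Y lives at
# Addr2 in the right-hand (code 01) crossbar. Every PE adds its pair in lockstep.
ADDR1, ADDR2 = 0, 1
crossbars = {                      # a 2 x 3 grid of crossbars -> a 1 x 2 row of PEs
    (0, 0): [10, 0], (0, 1): [20, 5], (0, 2): [30, 7],
    (1, 0): [0, 0],  (1, 1): [0, 0],  (1, 2): [0, 0],
}

def pe_add(R, C):
    x = crossbars[(R, C)][ADDR1]       # code 00: own crossbar, address Addr1
    y = crossbars[(R, C + 1)][ADDR2]   # code 01: right neighbour, address Addr2
    return x + y

sums = [pe_add(0, C) for C in (0, 1)]  # both PEs execute the same instruction
```

Here `sums` evaluates to `[15, 27]`: each PE has read its own X through TG1 and its neighbour's Y through TG2 before the shared addition instruction fires.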

Claims (8)

1. A processing unit array based on non-volatile memory computing, comprising: a non-volatile memory comprising M × N identical sub-blocks forming an array of M rows and N columns, each sub-block having a row address R and a column address C in the array and being denoted Block(R, C); each sub-block is provided with an operation unit to form a processing unit.
2. The non-volatile memory computing-based processing unit array of claim 1, wherein: the arithmetic unit comprises a 2-4 decoder, two registers A and B, an arithmetic logic unit and 4 groups of transmission gate arrays, wherein the number of transmission gates in each group of transmission gate arrays is consistent with the bit number of the registers and the data bit width of the storage subblocks.
3. The non-volatile memory computing-based processing unit array of claim 1, wherein: the 4 groups of transmission gate arrays in the operation unit are respectively denoted TG1, TG2, TG3 and TG4; the input ends of TG1, TG2, TG3 and TG4 are sequentially and respectively connected with the data buses of the storage sub-block Block(R, C) of the processing unit, the right adjacent sub-block Block(R, C+1), the lower adjacent sub-block Block(R+1, C) and the lower-right adjacent sub-block Block(R+1, C+1); the input end of register A is connected with the output ends of TG1 and TG3 of the processing unit; the input end of register B is connected with the output ends of TG1, TG2, TG3 and TG4 of the processing unit; the output end of register A is connected with one input end of the arithmetic logic unit, and the output end of register B is connected with the other input end of the arithmetic logic unit.
4. The non-volatile memory computing-based processing unit array of claim 1, wherein: the control signals of the instruction decoder include functional signals of the arithmetic logic unit for selecting the function of the arithmetic logic unit, and the port of the control signals of the instruction decoder is connected with the functional signal interfaces of all the arithmetic logic units in the processing unit array.
5. The non-volatile memory computing-based processing unit array of claim 1, wherein: the control signals of the instruction decoder also comprise address signals of the data to be processed in the subblocks, which are used for selecting the data to be processed, and the ports of the control signals are connected with the address interfaces of all the subblocks in the processing unit array.
6. The non-volatile memory computing-based processing unit array of claim 1, wherein: the control signals of the instruction decoder also comprise two-bit encoding signals for selecting the storage sub-block where the operand is located, and the port of the two-bit encoding signals is connected with the input end of all 2-4 decoders in the processing unit array.
7. The non-volatile memory computing-based processing unit array of claim 2, wherein: the four input signals 00, 01, 10 and 11 of the 2-4 decoder respectively correspond to four output signal lines 0, 1, 2 and 3, and the four output signal lines 0, 1, 2 and 3 are sequentially and respectively connected with control ports of TG1, TG2, TG3 and TG4 of a processing unit where the four output signal lines are located one by one.
8. A computing method of a processing unit array based on non-volatile memory computing, characterized by: comprising two operands and a calculation operation, which comprises in particular the steps of:
firstly, an instruction decoder transmits an encoding signal of a first operand to all 2-4 decoders in the processing unit array, and the 2-4 decoders output signals to four groups of transmission gate arrays in the processing unit where the 2-4 decoders are located after decoding, so that one group and only one group of transmission gate arrays are opened and can transmit data;
secondly, the instruction decoder transmits the address signal of the first operand to the address interfaces of all storage sub-blocks in the processing unit array, reads out the data stored at the corresponding position in each storage sub-block, and stores each datum into register A through the open transmission gate array connected to that sub-block;
thirdly, the instruction decoder transmits the encoding signal of the second operand to all 2-4 decoders in the processing unit array, and the 2-4 decoders output signals to four groups of transmission gate arrays in the processing unit where the 2-4 decoders are located after decoding, so that one group and only one group of transmission gate arrays are opened and can transmit data;
fourthly, the instruction decoder transmits the address signal of the second operand to the address interfaces of all storage sub-blocks in the processing unit array, reads out the data stored at the corresponding position in each storage sub-block, and stores each datum into register B through the open transmission gate array connected to that sub-block;
in a fifth step, the instruction decoder transmits the function signal to a function signal interface of the arithmetic logic unit, and the arithmetic logic unit determines the type of calculation operation according to the function signal and then performs the type of calculation on the numbers in the registers a and B.
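As a rough behavioral sketch of the five-step flow for a single processing unit (not the patented circuit itself), the sequence might look as follows; all names here (ProcessingUnit, ALU_OPS, the bank contents) are hypothetical, and the instruction decoder would in practice broadcast each signal to every unit in the array:

```python
# Hypothetical single-unit model: four storage sub-blocks ("banks"), one
# transmission gate group per bank, registers A/B, and a small ALU.
ALU_OPS = {"ADD": lambda a, b: a + b, "AND": lambda a, b: a & b}

class ProcessingUnit:
    def __init__(self, banks):
        self.banks = banks        # four storage sub-blocks, one per TG group
        self.reg_a = 0
        self.reg_b = 0
        self.selected = 0         # index of the currently open TG group

    def select(self, code):       # steps 1 and 3: 2-4 decode opens one TG group
        self.selected = int(code, 2)

    def load(self, addr, target): # steps 2 and 4: read through the open TG group
        setattr(self, target, self.banks[self.selected][addr])

    def execute(self, op):        # step 5: ALU applies the function signal
        return ALU_OPS[op](self.reg_a, self.reg_b)

pu = ProcessingUnit([[3, 5], [7, 9], [2, 4], [6, 8]])
pu.select("00"); pu.load(1, "reg_a")   # first operand: bank 0, address 1 -> A
pu.select("01"); pu.load(0, "reg_b")   # second operand: bank 1, address 0 -> B
result = pu.execute("ADD")             # ALU adds registers A and B
```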
CN201811193218.8A 2018-10-14 2018-10-14 Processing unit array based on nonvolatile memory calculation and calculation method thereof Active CN111045727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811193218.8A CN111045727B (en) 2018-10-14 2018-10-14 Processing unit array based on nonvolatile memory calculation and calculation method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811193218.8A CN111045727B (en) 2018-10-14 2018-10-14 Processing unit array based on nonvolatile memory calculation and calculation method thereof

Publications (2)

Publication Number Publication Date
CN111045727A true CN111045727A (en) 2020-04-21
CN111045727B CN111045727B (en) 2023-09-05

Family

ID=70230117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811193218.8A Active CN111045727B (en) 2018-10-14 2018-10-14 Processing unit array based on nonvolatile memory calculation and calculation method thereof

Country Status (1)

Country Link
CN (1) CN111045727B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723913A (en) * 2020-06-19 2020-09-29 浪潮电子信息产业股份有限公司 Data processing method, device and equipment and readable storage medium
CN111723907A (en) * 2020-06-11 2020-09-29 浪潮电子信息产业股份有限公司 Model training device, method, system and computer readable storage medium
CN112967172A (en) * 2021-02-26 2021-06-15 成都商汤科技有限公司 Data processing device, method, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1317799A (en) * 2000-04-10 2001-10-17 黄松柏 On-line programmable computer ICs for automatic control
CN103150146A (en) * 2013-01-31 2013-06-12 西安电子科技大学 ASIP (application-specific instruction-set processor) based on extensible processor architecture and realizing method thereof
CN103345448A (en) * 2013-07-10 2013-10-09 广西科技大学 Two-read-out and one-read-in storage controller integrating addressing and storage
CN104090859A (en) * 2014-06-26 2014-10-08 北京邮电大学 Address decoding method based on multi-valued logic circuit
CN105845174A (en) * 2015-01-12 2016-08-10 上海新储集成电路有限公司 Nonvolatile look-up table memory cell composition and implementation method of look-up table circuit
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN108335716A (en) * 2018-01-26 2018-07-27 北京航空航天大学 A kind of memory computational methods based on nonvolatile storage


Also Published As

Publication number Publication date
CN111045727B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
Akyel et al. DRC 2: Dynamically Reconfigurable Computing Circuit based on memory architecture
CN109766309B (en) Spin-save integrated chip
Bavikadi et al. A review of in-memory computing architectures for machine learning applications
US7791962B2 (en) Semiconductor device and semiconductor signal processing apparatus
WO2017172398A1 (en) Apparatuses and methods for data movement
WO2017142826A1 (en) Apparatuses and methods for data movement
US20030196030A1 (en) Method and apparatus for an energy efficient operation of multiple processors in a memory
JPH07282237A (en) Semiconductor integrated circuit
CN111045727B (en) Processing unit array based on nonvolatile memory calculation and calculation method thereof
US20060101231A1 (en) Semiconductor signal processing device
CN110674462B (en) Matrix operation device, method, processor and computer readable storage medium
US11468002B2 (en) Computational memory with cooperation among rows of processing elements and memory thereof
Angizi et al. Pisa: A binary-weight processing-in-sensor accelerator for edge image processing
CN111048135A (en) CNN processing device based on memristor memory calculation and working method thereof
Zhao et al. NAND-SPIN-based processing-in-MRAM architecture for convolutional neural network acceleration
US20210295906A1 (en) Storage Unit and Static Random Access Memory
CN111124999A (en) Dual-mode computer framework supporting in-memory computation
CN106021171A (en) An SM4-128 secret key extension realization method and system based on a large-scale coarseness reconfigurable processor
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
US20220318610A1 (en) Programmable in-memory computing accelerator for low-precision deep neural network inference
Do et al. Enhancing matrix multiplication with a monolithic 3-d-based scratchpad memory
US20210241806A1 (en) Streaming access memory device, system and method
Zhao et al. A Novel Transpose 2T-DRAM based Computing-in-Memory Architecture for On-chip DNN Training and Inference
Wang et al. Parallel stateful logic in RRAM: Theoretical analysis and arithmetic design
JPH0259943A (en) Memory device with operational function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant