CN113378115B - Near-memory sparse vector multiplier based on magnetic random access memory


Info

Publication number
CN113378115B
Authority
CN
China
Prior art keywords
sparse
memory
vector
bit
data
Prior art date
Legal status
Active
Application number
CN202110689836.7A
Other languages
Chinese (zh)
Other versions
CN113378115A (en)
Inventor
蔡浩
陈骏通
张优优
郭亚楠
周永亮
刘波
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110689836.7A
Publication of CN113378115A
Application granted
Publication of CN113378115B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234 Power saving characterised by the action undertaken
    • G06F1/325 Power saving in peripheral device
    • G06F1/3275 Power saving in memory, e.g. RAM, cache
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/02 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using magnetic elements
    • G11C11/16 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using magnetic elements using elements in which the storage effect is based on magnetic spin effect
    • G11C11/165 Auxiliary circuits

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a near-memory sparse vector multiplier based on a magnetic random access memory (MRAM), belonging to the field of integrated circuit design, which comprises a sparse flag generator, an input unit, a controller, a near-memory multiply-accumulator, near-memory processing units, a core memory array, cache memory arrays, sense amplifiers, and a shift adder tree. The invention multiplies two signed integer vectors and automatically skips zero vectors. MRAM is non-volatile and has extremely low standby power consumption; in addition, sparse flag bits are generated and evaluated at the memory output, which reduces data-transfer power and switching power respectively. Compared with a neural network accelerator based on the traditional von Neumann architecture, the invention effectively improves the energy efficiency of vector multiplication.

Description

Near-memory sparse vector multiplier based on magnetic random access memory
Technical Field
The invention relates to the field of integrated circuits, in particular to a near-memory sparse vector multiplier based on a magnetic random access memory.
Background
In recent years, neural networks have shone in fields such as computer vision and natural language processing, driving a new wave of artificial intelligence. A neural network is composed of layers with different functions; mainstream designs currently include convolution layers, fully connected layers, activation-function layers, normalization layers, attention layers, and the like. In application, the core computation of these layers can be abstracted as a vector multiplication, as shown in formula (1):
$$y = \vec{i} \cdot \vec{w} = \sum_{k} i_k\, w_k \tag{1}$$

where $\vec{i}$ is the input or the computed result of each layer, and $\vec{w}$ is a fixed weight that does not change.
To effectively reduce hardware resource consumption, especially in embedded mobile devices, one approach is quantization: the activation values and weights are converted from 32-bit floating-point numbers to 8-bit integers, greatly reducing the storage requirement and the amount of computation without losing application performance, thereby improving energy efficiency. Another approach exploits the sparsity of the activation values or weights. Consider, for example, a vector whose first four elements are zero, such as $\vec{i} = (0, 0, 0, 0, i_4, i_5, i_6, i_7)$: the products of the first four elements with any vector are 0, so skipping the zero multiplications effectively reduces power consumption. At present, most schemes for sparse vector multiplication decide whether to skip only after the data has been read out; this reduces computation power, but the memory access is still required, and since each element occupies 8 bits, memory-access power remains the dominant cost, leaving room for optimization.
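To make the principle concrete, the following minimal sketch (an illustration only, not the patented circuit) shows a dot product that skips zero elements:

```python
# Illustration of the sparsity principle (not the patented circuit): a dot
# product that skips zero activation elements entirely.
def sparse_dot(activations, weights):
    acc = 0
    for a, w in zip(activations, weights):
        if a == 0:       # zero element: its product contributes nothing,
            continue     # so skipping saves the multiply (and, in hardware, the access)
        acc += a * w
    return acc

a = [0, 0, 0, 0, 3, -5, 2, 7]        # first four elements are zero
w = [9, -1, 4, 2, 1, 1, 1, 1]
assert sparse_dot(a, w) == sum(x * y for x, y in zip(a, w))
```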
In the conventional von Neumann architecture, the memory and the computation unit are separate: before a computation can be performed, data must be moved into the computation unit's cache, typically built from static random access memory (SRAM) or flip-flops, and the result must then be moved back to memory, so a great deal of energy is spent on data transfer and cache updates. Near Memory Computing (NMC) breaks through the traditional von Neumann architecture by integrating the computing circuits with the memory, greatly reducing data-transfer and memory-access power. Since NMC usually combines a memory array with a digital processing unit, computational accuracy is guaranteed, but further reducing the power consumption of both circuits remains a key challenge for NMC architectures. Most NMC designs are based on dynamic random access memory (DRAM) or FLASH: DRAM requires frequent refresh operations to retain data, FLASH access is slow, and both fall short when facing neural-network applications with large amounts of computation. The emerging non-volatile MRAM retains its data when powered off, greatly reducing data-retention and leakage power, and its faster access speed meets the computational demands of neural networks, so a near-memory sparse vector multiplier based on MRAM has great advantages over other NMC technologies.
Disclosure of Invention
Technical problem: Aiming at the defects of the prior art, the invention discloses a near-memory sparse vector multiplier based on a magnetic random access memory (MRAM). One additional flag bit is written alongside each data write, and a near-memory processing unit uses this sparse flag information to skip the corresponding memory accesses and computations, realizing near-memory sparse vector multiplication. The multiplier is optimized for power consumption at both the circuit and network level, addressing the low speed and high energy consumption of existing NMC technologies.
The technical scheme is as follows: the near-memory sparse vector multiplier based on a magnetic random access memory according to the invention comprises a sparse flag generator, an input unit, a near-memory multiply-accumulator and a controller;
the sparse flag generator is connected with the input unit; it judges through a logic circuit whether the input data is 0, generates the sparse flag bit, and passes the data together with the sparse flag bit to the input unit; the input data comprise a weight vector and an activation vector;
the input unit is connected with the near-memory multiply-accumulator; the near-memory multiply-accumulator receives the data from the input unit and performs the near-memory multiply-accumulate computation, skipping the memory accesses and computations of zero vectors during the process;
the controller is respectively connected with the sparse flag generator, the input unit and the near-memory multiply-accumulator; it controls the functions of these three blocks and generates the address signals for reading and storing data.
Further, the sparse flag generator comprises six two-input OR gates and one two-input NOR gate, and judges whether the 8 bits of data are all 0, generating the sparse flag bit of the data. The six OR gates are denoted the first to sixth two-input OR gates: the inputs of the first to fourth OR gates form the input of the sparse flag generator; the outputs of the first and second OR gates drive the fifth OR gate; the outputs of the third and fourth OR gates drive the sixth OR gate; the outputs of the fifth and sixth OR gates drive the two-input NOR gate, whose output is the output of the sparse flag generator.
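A behavioral sketch of this gate network (a model for clarity, not a netlist; the function and variable names are illustrative), together with the eight-cycle flag accumulation performed by the input unit per formula (4) below:

```python
# Behavioral model of the sparse flag generator: six 2-input OR gates reduce
# the eight data bits pairwise, and a final 2-input NOR yields the flag.
def sparse_flag(bits):
    """bits: eight 0/1 values; returns 1 iff all eight bits are 0."""
    assert len(bits) == 8
    or1, or2 = bits[0] | bits[1], bits[2] | bits[3]   # first-level OR gates 1-4
    or3, or4 = bits[4] | bits[5], bits[6] | bits[7]
    or5, or6 = or1 | or2, or3 | or4                   # second-level OR gates 5-6
    return int(not (or5 | or6))                       # final two-input NOR

assert sparse_flag([0] * 8) == 1
assert sparse_flag([0, 0, 1, 0, 0, 0, 0, 0]) == 0

# Input-unit accumulation over eight cycles (formula (4)): the vector-level
# flag F is the AND of the eight per-word flags F_i.
words = [[0] * 8 for _ in range(8)]   # eight all-zero 8-bit words
F = 1
for word in words:
    F &= sparse_flag(word)
assert F == 1                          # the whole 64-bit vector is zero
```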
Further, the input unit is configured to receive the input data and sparse flag bits from the sparse flag generator: in each cycle it receives 8 bits of write data together with that data's sparse flag bit and updates the current sparse flag bit; over eight cycles it receives eight 8-bit words from the sparse flag generator, and after the eighth cycle it outputs a total of 64 data bits and 1 sparse flag bit;
as shown in formula (4), the sparse flag bit F is used for representing whether the vector with the length of 8 and the bit width of 8bits is zero or not, and F i Indicating whether the vector written in the i-th cycle is zero.
Further, the near-memory multiply-accumulator comprises near-memory processing units (PE) and a partial-sum accumulator; each near-memory processing unit PE in the multiply-accumulator computes in parallel, and the final result is accumulated by the partial-sum accumulator;
the near-memory processing unit comprises an address decoder, a core array MRAM1, a cache array MRAM2, a cache array MRAM3, a first sense amplifier, a second sense amplifier, a shift adder tree and a logical AND module.
The address decoder is respectively connected to the core array MRAM1, the cache array MRAM2 and the cache array MRAM3; it decodes the address signals output by the controller and, according to these signals, stores data at the corresponding addresses or reads out the data participating in the computation;
the core array MRAM1 is used for storing weight vectors, the cache array MRAM2 is used for storing activation vectors, and the cache array MRAM3 is used for storing output vectors;
the first sense amplifier is connected with the core array MRAM1 and is used for reading the weight vector sparse flag bit F of the core array MRAM1 0 The second sense amplifier is connected with the buffer array MRAM2, and the first sense amplifier and the second sense amplifier are sensitive to sparse flag signals and are used for reading sparse flag bits F of activation vectors in the buffer array MRAM2 1 And a data bit;
the first sense amplifier and the second sense amplifier first read sparse flag bits in the weight vector and the activation vector, where F 0 And F is equal to 1 Mutually interact and feed back to the first sense amplifier and the second sense amplifier, if F 0 |F 1 If true, at least one group of vectors in the representative weight vector or the activation vector is zero, so that the first sense amplifier and the second sense amplifier are all turned off, and the memory access of the zero vector is skipped. If F 0 |F 1 If false, the weight vector and the activation vector are subjected to AND multiplication operation through a logic AND module and are sent to a shift adder tree.
The shift adder tree is sensitive to the sparse flag signal, the shift adder tree receives the sparse flag bit transmitted by the first sensitive amplifier and the second sensitive amplifier, if the sparse flag bit indicates that a zero vector exists in the vector to be multiplied, calculation of the zero vector is skipped, all data are maintained unchanged, and output is set to 0 through the combinational logic, so that the overturning power consumption is reduced. Otherwise, the inputs of the first sensitive amplifier and the second sensitive amplifier are multiplied by logic AND and sent to a shift adder tree for shift addition;
the logical AND module is used for calculating the product of the activation vector and the weight vector, and the logical AND module calculates (1 bit multiplied by 8 bits) each time and sends the result to the shift adder tree, and the calculation is completed (8 bits multiplied by 8 bits) after 8 periods.
The near-memory multiply-accumulator works as a three-stage pipeline: "PE computation — partial-sum accumulation — write-back". The vector multiplication is completed inside each PE; the accumulated results of the 48 PEs are then sent to the partial-sum accumulator; after the accumulation operation, the result is shifted to restore the data to 8 bits, and finally the 8-bit data is written back to the cache array MRAM3. Throughout the process, read operations occur in the core array MRAM1 and the cache array MRAM2, and write operations occur in the cache array MRAM3.
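A block-level sketch of the pipeline's second and third stages follows (the shift amount and saturation policy are assumptions for illustration; the text only states that the data is restored to 8 bits):

```python
# Partial-sum accumulation and write-back (stages 2-3 of the pipeline).
# The right-shift amount and the saturation policy below are assumptions.
def accumulate_and_write_back(pe_results, shift=8):
    total = sum(pe_results)            # partial-sum accumulator over 48 PE outputs
    out = total >> shift               # shift to restore 8-bit data
    return max(-128, min(127, out))    # saturated 8-bit value written to MRAM3
```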
Further, the core array MRAM1 is used for storing the weight vectors. The weight matrix M is mapped into the core array MRAM1 of the near-memory processing unit PE as shown in formula (2): each element of M is expanded into an 8-bit binary number, the eight elements of a row are stored side by side, and one sparse flag bit is appended to each row to indicate whether that row's vector is zero:

$$\text{row}_x:\ \big[\, m_{x,0}\ m_{x,1}\ \cdots\ m_{x,7} \;\big|\; f_{wx} \,\big], \qquad m_{x,k}\ \text{an 8-bit binary number} \tag{2}$$
Further, the cache array MRAM2 is used for storing the activation vector $\vec{a}$, which is mapped into the cache array MRAM2 as shown in formula (3): each element of $\vec{a}$ is expanded into an 8-bit binary number, each row of MRAM2 holds the bits of the same significance (the same address bit) of the eight operands, and each row carries one sparse flag bit indicating whether this row and all previous rows are zero. For example, $f_{a7}$ indicates whether the current row is zero and whether $f_{a0}$–$f_{a6}$ are also zero; $f_{a7}$ therefore indicates whether the activation vector $\vec{a}$ is a zero vector:

$$\text{row}_b:\ \big[\, a_0^{(b)}\ a_1^{(b)}\ \cdots\ a_7^{(b)} \;\big|\; f_{ab} \,\big], \qquad f_{ab} = f_{a(b-1)} \,\&\, \overline{\left(a_0^{(b)} \vee \cdots \vee a_7^{(b)}\right)} \tag{3}$$
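The following sketch models this bit-plane layout with cumulative flags (an illustrative model; row order and names are assumptions):

```python
# Model of the MRAM2 mapping: row b holds bit b of all eight activation
# elements (a bit-plane) plus the cumulative zero flag f_ab of formula (3).
def map_activations(vec):
    """vec: eight signed 8-bit integers -> list of (bit_plane, flag) rows."""
    rows, f = [], 1
    for b in range(8):                       # one row per bit position
        plane = [(x >> b) & 1 for x in vec]  # two's-complement bit b of each element
        f &= int(not any(plane))             # zero so far: this plane and all before it
        rows.append((plane, f))
    return rows

rows = map_activations([0] * 8)
assert rows[7][1] == 1        # f_a7 = 1: the whole activation vector is zero
assert map_activations([0, 0, 0, 1, 0, 0, 0, 0])[7][1] == 0
```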
The beneficial effects are that: by adopting the above technical scheme, the invention achieves the following:
(1) The invention constructs an MRAM-based near-memory sparse vector multiplier. Data stored in the MRAM array is not lost on power-off, matching the storage requirement of neural-network applications in which the large body of weights is almost never updated, and effectively reducing data-retention power; at the same time, the near-memory computation greatly reduces data-transfer power, improving overall energy efficiency.
(2) The invention uses the sparse flag generator to judge the sparsity of the input data, recording the sparsity with only 1.6% storage overhead and overcoming the drawback that all data must still be read during sparse vector operations.
(3) The invention uses the near-memory sparse vector multiplier to realize fully connected, 8-bit quantized neural network computation, skipping the memory-access and computation stages based on the sparse flag bits and reducing both memory-access and computation power.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a block diagram of a structure for realizing MNIST handwriting digital recognition by using a near-memory sparse vector multiplier based on a magnetic random access memory according to an embodiment of the present invention;
FIG. 2 is a block diagram of a magnetic random access memory-based near-memory sparse vector multiplier according to an embodiment of the present invention;
FIG. 3 is a circuit diagram of a sparse flag generator provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a near-memory multiply-accumulator provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a near-memory processing unit according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a shift adder according to an embodiment of the present invention;
FIG. 7 is a timing diagram of the operation of the near memory processing unit according to the embodiment of the invention;
FIG. 8 is a schematic diagram of a working pipeline of a near-memory sparse vector multiplier based on a magnetic random access memory according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of power consumption contrast of near-memory sparse vector multiplication calculation provided by an embodiment of the present invention;
FIG. 10 shows the sparsity statistics of the neural network in the MNIST handwriting database application provided in the embodiment of the present invention;
FIG. 11 is a block diagram of the multiplier of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art without inventive effort based on these embodiments fall within the scope of the present invention.
FIG. 1 is a block diagram of the structure in which the near-memory sparse vector multiplier based on a magnetic random access memory realizes MNIST handwritten digit recognition according to an embodiment of the present invention. The picture to be recognized is converted into an input vector; the circles in the box represent weight vectors; a probability vector is obtained through repeated vector multiplications, and the digit corresponding to its maximum value is taken as the recognized value. The vector multiplication is realized by the near-memory sparse vector multiplier.
As shown in fig. 2 and 11, a near-memory sparse vector multiplier based on a magnetic random access memory according to the present invention includes a sparse flag generator, an input unit, a controller, and a near-memory multiply accumulator.
The sparse flag generator is connected with the input unit, judges whether input data is 0 through the logic circuit, generates sparse flag bits, and transmits the data and the sparse flag bits into the input unit. The input data includes a weight vector and an activation vector.
The input unit is connected with the near-memory multiply-accumulate device, the near-memory multiply-accumulate device receives the data from the input unit and performs near-memory multiply-accumulate calculation, and the memory access and calculation of the zero vector are skipped in the near-memory multiply-accumulate calculation process;
the controller is respectively connected with the sparse mark generator, the input unit and the near-memory multiply-accumulator, and is used for controlling the realization of the functions of the sparse mark generator, the input unit and the near-memory multiply-accumulator.
As shown in FIG. 3, the sparse flag generator is a combinational logic circuit composed of six two-input OR gates and one two-input NOR gate; it judges whether all 8 bits of data are 0, implementing the logical operation of formula (5) and generating the sparse flag bit of the data:

$$F = \overline{d_7 \vee d_6 \vee d_5 \vee d_4 \vee d_3 \vee d_2 \vee d_1 \vee d_0} \tag{5}$$
In this embodiment, a 64×384 fully connected layer is taken as the design object: the weight data form a 64×384 matrix, the input is an ordered 1×384 activation vector, and the output is an ordered 1×64 vector. The system completes the computation of formula (6), where $-128 \le i, w \le 127$:

$$O_j = \sum_{k=1}^{384} I_k\, W_{j,k}, \qquad j = 1, 2, \ldots, 64 \tag{6}$$
As shown in FIG. 4, the near-memory multiply-accumulator provided in the embodiment of the present invention comprises 48 near-memory processing units (PE) and a partial-sum accumulator; each near-memory processing unit PE computes in parallel, and the computation results are accumulated in the partial-sum accumulator. The 64×384 weight array is therefore divided into 48 groups of 64×8 data in one-to-one correspondence with the PEs, and the input activation vector is likewise divided into 48 groups of 1×8 data in one-to-one correspondence with the PEs; formula (6) can thus be rewritten as formula (7), where $j$ denotes the $j$-th PE unit:

$$O_m = \sum_{j=1}^{48} \sum_{k=1}^{8} I_{j,k}\, W_{m,j,k}, \qquad m = 1, 2, \ldots, 64 \tag{7}$$
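The equivalence of formulas (6) and (7) can be checked with a short model of the partition (purely a sanity sketch of the data split, not the hardware):

```python
# Sanity check of the 48-PE partition: splitting the 384-element dot product
# into 48 chunks of 8 leaves every output element unchanged.
import random

W = [[random.randint(-128, 127) for _ in range(384)] for _ in range(64)]
I = [random.randint(-128, 127) for _ in range(384)]

# Formula (6): O_j is the sum over all 384 products
O_direct = [sum(I[k] * W[j][k] for k in range(384)) for j in range(64)]

# Formula (7): accumulate 48 per-PE partial sums of 8 products each
O_pe = [sum(sum(I[8 * p + k] * W[j][8 * p + k] for k in range(8))
            for p in range(48))
        for j in range(64)]

assert O_pe == O_direct
```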
As shown in fig. 5, the near memory processing unit includes an address decoder, a core array MRAM1, a buffer array MRAM2, a buffer array MRAM3, a first sense amplifier, a second sense amplifier, a shift adder tree, and a logic and module.
The address decoder is respectively connected with the core array MRAM1, the cache array MRAM2 and the cache array MRAM3, and is used for decoding the address signals output by the controller and storing the data into corresponding addresses; or reading the data participating in calculation;
the core array MRAM1 is used to store the weight vector, the cache array MRAM2 is used to store the activation vector, and the cache array MRAM3 is used to store the output vector.
The first sense amplifier is connected to the core array MRAM1 and reads the weight-vector sparse flag bit $F_0$ from MRAM1; the second sense amplifier is connected to the cache array MRAM2 and reads the activation-vector sparse flag bit $F_1$ and the data bits from MRAM2. $F_0$ and $F_1$ are combined and fed back to both sense amplifiers: if $F_0 \,|\, F_1$ is true, at least one of the weight vector and the activation vector is zero, so both sense amplifiers are turned off and the computation cycle is skipped; if $F_0 \,|\, F_1$ is false, the weight vector and the activation vector are multiplied through the logical AND module and sent to the shift adder tree.
The logical AND module computes the product of the activation vector and the weight vector: in each cycle it computes one 1-bit × 8-bit partial product per element and sends the result to the shift adder tree, so an 8-bit × 8-bit multiplication completes after 8 cycles.
The mapping of the weight matrix $W_{ij}$ of formula (7) into the core array MRAM1 of a PE is shown in formula (8): each element of the weight matrix W is expanded into an 8-bit binary number, and each row additionally stores one sparse flag bit $f_{wjx}$ ($j$ denoting the $j$-th PE and $x$ the $x$-th operand of that PE):

$$\text{row}_x \text{ of PE}_j:\ \big[\, W_{j,x,0}\ W_{j,x,1}\ \cdots\ W_{j,x,7} \;\big|\; f_{wjx} \,\big] \tag{8}$$

The controller provided by the embodiment of the present invention generates a weight-vector write signal to write the weight vectors into the core array MRAM1; since the MRAM adopted in this embodiment retains its data when powered off, the weights need to be uploaded only once for the MNIST handwritten digit recognition application of this embodiment.
The activation vector is mapped as shown in formula (9), where $\vec{I_j}$ denotes the input vector corresponding to the $j$-th PE: each element of the input vector is expanded into an 8-bit binary number, each row holds the bits of the same significance (the same address bit) of the eight operands, and each row carries one sparse flag bit $f_{ijx}$ ($j$ the $j$-th PE, $x$ the $x$-th operand of the PE):

$$\text{row}_x \text{ of PE}_j:\ \big[\, I_{j,0}^{(x)}\ I_{j,1}^{(x)}\ \cdots\ I_{j,7}^{(x)} \;\big|\; f_{ijx} \,\big] \tag{9}$$

The controller provided by the embodiment of the present invention generates an activation-vector write signal to write the activation vector into the cache array MRAM2.
The computation inside the PE is therefore as shown in formula (10):

$$S_j = \sum_{k=0}^{7} I_{j,k}\, W_{j,k} \;=\; -2^{7}\sum_{k=0}^{7} I_{j,k}^{(7)}\, W_{j,k} \;+\; \sum_{b=0}^{6} 2^{b} \sum_{k=0}^{7} I_{j,k}^{(b)}\, W_{j,k} \tag{10}$$

where $I_{j,k}^{(b)}$ denotes bit $b$ of the two's-complement activation element $I_{j,k}$; each inner sum is one cycle's 1-bit × 8-bit logical AND result, accumulated by the shift adder tree.
FIG. 6 is a schematic diagram of the shift adder tree provided by the embodiment of the present invention: the eight 8-bit partial products are added pairwise through the tree, and the result is stored in the register $S_{reg}$. The shift adder tree is sensitive to the sparse flag bits: if the input vector is a zero vector, $S_{reg}$ holds its value and the output is forced to 0; otherwise it performs shift-and-add. The pairwise reduction is modeled in the sketch below.
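A minimal model of the pairwise reduction (a three-level tree for eight inputs; an illustrative model only):

```python
# Pairwise adder tree: eight partial products collapse in log2(8) = 3 levels.
def adder_tree(values):
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

assert adder_tree([1, 2, 3, 4, 5, 6, 7, 8]) == 36
```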
As shown in FIG. 7, the working timing of the near-memory processing unit is as follows: the controller generates a read-enable signal SAE; at the falling edge (1) of SAE, the core array MRAM1 reads out the weight sparse flag bit $F_0$ and the cache array MRAM2 reads out the activation sparse flag bit $F_1$.
At the rising edge (2) of SAE, if both sparse flag bits are 0, neither the activation vector nor the weight vector is 0, so the computation operation begins, comprising the following steps, of which a), b) and c) occur simultaneously:
a) According to the storage mapping described above, all data bits of the weight vector (8×8 bits) and the most significant bit-plane of the activation vector (8×1 bits) are read at the rising edge; their logical AND produces the partial product (8×8 bits) of the weight vector with the most significant activation bits, which is sent into the shift adder tree;
b) The output of the shift adder tree is reset at the rising edge (2), and it outputs $S_0$ in the next cycle;
c) The read enable of the first sense amplifier is turned off at (2) (the data is held by a register inside the sense amplifier, so no switching or read power is consumed), while the read enable of the second sense amplifier is maintained; in the next cycle the next-most-significant bit-plane of the activation vector is read out, ANDed with the weight vector, and fed into the shift adder tree, which shifts $S_0$ left by one bit and accumulates the current result;
d) Step c) is repeated until the least significant bit-plane of the activation vector has been read, whereupon the shift adder tree outputs the final accumulated result $S_7$.
If either sparse flag bit is 1, a zero vector exists among the activation vector and the weight vector at that address; during the following eight cycles neither the weight vector, the activation vector, nor the value (PSUM) of the register inside the shift adder tree is updated, and the output of the shift adder tree is forced to 0 by combinational logic. The whole sequence is modeled in the sketch below.
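Putting steps a)–d) together, one PE sequence can be modeled as follows (a behavioral sketch; treating the MSB plane with weight $-2^7$ is an assumption consistent with signed two's-complement operands, and the names are illustrative):

```python
# Behavioral model of one PE: MSB-first bit-serial multiply-accumulate with
# sparse skip. Bit 7 is handled with weight -2^7 (two's-complement assumption).
def pe_multiply(weights, acts, f0, f1):
    """weights, acts: eight signed 8-bit ints; f0, f1: sparse flag bits."""
    if f0 | f1:                  # a zero vector present: skip all eight cycles,
        return 0                 # output forced to 0 by combinational logic
    s = 0
    for b in range(7, -1, -1):   # MSB first, one activation bit-plane per cycle
        plane = sum(w * ((a >> b) & 1) for w, a in zip(weights, acts))
        s = -plane if b == 7 else (s << 1) + plane   # shift-and-accumulate
    return s

ws = [3, -5, 0, 1, 7, -2, 4, 6]
xs = [10, -3, 5, 0, -8, 2, 1, -1]
assert pe_multiply(ws, xs, 0, 0) == sum(w * x for w, x in zip(ws, xs))
assert pe_multiply(ws, xs, 1, 0) == 0   # zero weight vector: computation skipped
```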
FIG. 8 is a schematic diagram of the working pipeline of the near-memory sparse vector multiplier based on a magnetic random access memory according to an embodiment of the present invention. After the weight matrix has been uploaded, the near-memory multiply-accumulator enters the inference stage: vector multiply-accumulation is first completed inside each PE; the accumulated results of the 48 PEs are then sent to the partial-sum accumulator; after the accumulation operation, the result is shifted to restore the data to 8 bits, and finally the 8-bit data is written back to the cache array MRAM3. Throughout the process, read operations occur in the core array MRAM1 and the cache array MRAM2, and write operations occur in the cache array MRAM3.
FIG. 9 compares the power consumption of the near-memory sparse vector multiplication according to an embodiment of the present invention. From the computation process analyzed above, when computing a non-zero vector a single PE must read 130 bits of data (8×8 bits of weight vector plus 1 sparse flag bit, and 8×8 bits of activation vector plus 1 sparse flag bit, the activation vector being read 8 bits per cycle) and requires 93 bits of registers in total (64+1 bits for the weight vector and its flag, 8+1 bits for the current activation bits and their flag, and 19 bits for the accumulated result), plus part of the combinational logic. Thanks to the sparse flag bits, when a PE processes an activation or weight vector that is zero, only the 2 flag bits need to be read in the first cycle; all sense amplifiers are then turned off, the registers hold their previous values, and the output is forced to 0 by combinational logic. In this embodiment, register toggling and the sense-amplifier read phase consume more than 80% of the energy, so this method effectively reduces power consumption and improves energy efficiency.
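As a back-of-the-envelope check using only the counts stated above, skipping one zero block replaces a 130-bit read with a 2-bit flag read:

$$\frac{130 - 2}{130} \approx 98.5\%$$

of the read bits are avoided for every skipped block, consistent with the reported dominance of sense-amplifier and register energy.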
FIG. 10 shows the sparsity statistics of the neural network in the MNIST handwriting dataset application according to an embodiment of the present invention, obtained by analyzing the weights uploaded to the PE units and the input data. The statistics correspond to the structure of the near-memory vector multiplier: each box represents a PE unit, with darker colors indicating lower sparsity and lighter colors higher sparsity. The overall average sparsity is 61.2%; on average, more than six out of every ten computations are skipped. The near-memory sparse vector multiplier of this embodiment therefore saves power by recognizing sparsity and skipping the computation process, improving energy efficiency.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A near-memory sparse vector multiplier based on a magnetic random access memory, characterized in that it comprises a sparse flag generator, an input unit, a near-memory multiply-accumulator and a controller;
the sparse flag generator is connected with the input unit; it judges through a logic circuit whether the input data is 0, generates the sparse flag bit, and passes the data together with the sparse flag bit to the input unit; the input data comprise a weight vector and an activation vector;
the input unit is connected with the near-memory multiply-accumulator; the near-memory multiply-accumulator receives the data from the input unit and performs the near-memory multiply-accumulate computation, skipping the memory accesses and computations of zero vectors during the process;
the controller is respectively connected with the sparse flag generator, the input unit and the near-memory multiply-accumulator; it controls the functions of these three blocks and generates the address signals for reading and storing data;
the near-memory multiply-accumulator comprises near-memory processing units (PE) and a partial-sum accumulator; each near-memory processing unit PE in the multiply-accumulator computes in parallel, and the final result is accumulated by the partial-sum accumulator;
the near-memory processing unit comprises an address decoder, a core array MRAM1, a cache array MRAM2, a cache array MRAM3, a first sense amplifier, a second sense amplifier, a shift adder tree and a logical AND module;
the address decoder is respectively connected with the core array MRAM1, the cache array MRAM2 and the cache array MRAM3, and is used for decoding address signals output by the controller and storing data into corresponding addresses according to the address signals; or reading the data participating in calculation;
the core array MRAM1 is used for storing weight vectors, the cache array MRAM2 is used for storing activation vectors, and the cache array MRAM3 is used for storing output vectors;
the first sense amplifier is connected with the core array MRAM1 and is used for reading the weight vector sparse flag bit F of the core array MRAM1 0 The second sense amplifier is connected with the buffer array MRAM2, and the first sense amplifier and the second sense amplifier are sensitive to sparse flag signals and are used for reading sparse flag bits F of activation vectors in the buffer array MRAM2 1 And a data bit;
the first sense amplifier and the second sense amplifier first read sparse flag bits in the weight vector and the activation vector, where F 0 And F is equal to 1 Mutually interact and feed back to the first sense amplifier and the second sense amplifier, if F 0 |F 1 If true, at least one group of vectors in the representative weight vector or the activation vector is zero, so that the first sensitive amplifier and the second sensitive amplifier are all turned off, and the memory of the zero vector is skipped; if F 0 |F 1 If yes, performing multiplication operation on the weight vector and the activation vector through a logic AND module and sending the multiplication operation to a shift adder tree;
the shift adder tree is sensitive to the sparse flag signal, the shift adder tree receives the sparse flag bits transmitted by the first sensitive amplifier and the second sensitive amplifier, if the sparse flag bits indicate zero vectors exist in vectors to be multiplied, calculation of the zero vectors is skipped, all data are maintained unchanged, and output is set to 0 through combinational logic, so that turnover power consumption is reduced; otherwise, the inputs of the first sensitive amplifier and the second sensitive amplifier are multiplied by logic AND and sent to a shift adder tree for shift addition;
after the multiplication of the weight vector and the activation vector is completed in the near memory processing unit PE, the accumulated result of each PE is then sent to a part of accumulator, the accumulated result is shifted after the accumulated operation is executed, the data is restored to 8bits, and finally the 8-bit output vector is written back to the cache array MRAM 3.
2. The MRAM-based near-memory sparse vector multiplier of claim 1, wherein the sparse flag generator comprises six two-input OR gates and one two-input NOR gate, and is configured to determine whether the 8 bits of data are all 0 and to generate the sparse flag bit of the data.
3. The near-memory sparse vector multiplier based on a magnetic random access memory of claim 1, wherein the input unit is configured to receive the input data and sparse flag bits from the sparse flag generator: in each cycle it receives 8 bits of write data together with that data's sparse flag bit and updates the current sparse flag bit; over eight cycles it receives eight 8-bit words from the sparse flag generator, and after the eighth cycle it outputs a total of 64 data bits and 1 sparse flag bit;
as shown in formula (4), the sparse flag bit $F$ indicates whether the vector of length 8 and bit width 8 bits is zero, where $F_i$ indicates whether the word written in the $i$-th cycle is zero:

$$F = F_0 \,\&\, F_1 \,\&\, \cdots \,\&\, F_7 \tag{4}$$
4. The MRAM-based near-memory sparse vector multiplier of claim 1, wherein the core array MRAM1 is configured to store the weight vectors, and the weight matrix M is mapped into the core array MRAM1 of the near-memory processing unit PE as shown in formula (2): each element of M is expanded into an 8-bit binary number, the eight elements of a row are stored side by side, and one sparse flag bit is appended to each row to indicate whether that row's vector is zero:

$$\text{row}_x:\ \big[\, m_{x,0}\ m_{x,1}\ \cdots\ m_{x,7} \;\big|\; f_{wx} \,\big] \tag{2}$$
5. The MRAM-based near-memory sparse vector multiplier of claim 1, wherein the cache array MRAM2 is configured to store the activation vector $\vec{a}$, which is mapped into the cache array MRAM2 as shown in formula (3): each element of $\vec{a}$ is expanded into an 8-bit binary number, each row holds the bits of the same significance (the same address bit) of the eight operands, and each row carries one sparse flag bit indicating whether this row and all previous rows are zero; $f_{a7}$ indicates whether the current row is zero and whether $f_{a0}$–$f_{a6}$ are also zero, so $f_{a7}$ indicates whether the activation vector $\vec{a}$ is a zero vector:

$$\text{row}_b:\ \big[\, a_0^{(b)}\ a_1^{(b)}\ \cdots\ a_7^{(b)} \;\big|\; f_{ab} \,\big], \qquad f_{ab} = f_{a(b-1)} \,\&\, \overline{\left(a_0^{(b)} \vee \cdots \vee a_7^{(b)}\right)} \tag{3}$$
CN202110689836.7A 2021-06-22 2021-06-22 Near-memory sparse vector multiplier based on magnetic random access memory Active CN113378115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110689836.7A CN113378115B (en) 2021-06-22 2021-06-22 Near-memory sparse vector multiplier based on magnetic random access memory


Publications (2)

Publication Number Publication Date
CN113378115A CN113378115A (en) 2021-09-10
CN113378115B (en) 2024-04-09

Family

ID=77578375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110689836.7A Active CN113378115B (en) 2021-06-22 2021-06-22 Near-memory sparse vector multiplier based on magnetic random access memory

Country Status (1)

Country Link
CN (1) CN113378115B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115981751B (en) * 2023-03-10 2023-06-06 之江实验室 Near-memory computing system, near-memory computing method, near-memory computing device, medium and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110325988A * 2017-01-22 2019-10-11 GSI Technology Inc. Sparse matrix multiplication in associative memory devices
CN110889259A * 2019-11-06 2020-03-17 Beijing Zhongke Shengxin Technology Co., Ltd. Sparse matrix vector multiplication calculation unit for arranged block diagonal weight matrix

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10997496B2 (en) * 2016-08-11 2021-05-04 Nvidia Corporation Sparse convolutional neural network accelerator


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An efficient FPGA-based sparse matrix multiplier; Liu Shipei et al.; Microelectronics; 2013-04-20 (No. 02); full text *
Optimization and hardware acceleration of deep convolution algorithms; Fu Shihang; China Masters' Theses Full-text Database, Information Science and Technology; 2019-12-15; full text *

Also Published As

Publication number Publication date
CN113378115A (en) 2021-09-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant