
Memory computing architecture supporting weight sparsity and data output method thereof

Info

Publication number
CN111079919A
Authority
CN
China
Prior art keywords
sub, memory cell, analog, weight, digital conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911151228.XA
Other languages
Chinese (zh)
Other versions
CN111079919B (en)
Inventor
刘勇攀
岳金山
袁哲
孙文钰
李学清
杨华中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201911151228.XA
Publication of CN111079919A
Application granted
Publication of CN111079919B
Active legal status (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Memory System (AREA)

Abstract

The embodiment of the invention provides a memory computing architecture supporting weight sparsity and a data output method thereof. The architecture comprises: a memory cell array comprising a plurality of sub memory cell blocks, with an analog-to-digital conversion unit correspondingly arranged at the output port of each column of sub memory cell blocks; an operation module used for performing block-wise sparse training of the weights of the neural network model stored in the memory cell array, so that the weights stored in each sub memory cell block are trained to be either all zero or not all zero; and a detection module used for turning off an analog-to-digital conversion unit and setting its output to zero when the sub memory cell block corresponding to that analog-to-digital conversion unit is detected to be in the working state and the stored weights are all zero. The embodiment of the invention can effectively reduce the power consumption of in-memory computing in weight-sparse neural network applications and improve the feasibility of such applications.

Description

Memory computing architecture supporting weight sparsity and data output method thereof
Technical Field
The invention relates to the technical field of circuit design, in particular to a memory computing architecture supporting weight sparsity and a data output method thereof.
Background
In-memory computing is an emerging circuit architecture. Unlike the traditional von Neumann architecture, in which storage and computation are separated, in-memory computing integrates storage and computation and completes the computation inside the memory cells. Compared with the traditional structure, in-memory computing offers high parallelism and high energy efficiency, making it an attractive alternative for algorithms that require a large number of parallel matrix-vector multiplications, especially neural network algorithms.
The neural network algorithm is a key algorithm of current artificial intelligence technology; it consists of a large number of matrix-vector multiplications and is therefore well suited to energy-efficient implementation with in-memory computing circuits. When a conventional in-memory computing architecture is applied to a neural network algorithm, the architecture comprises a memory cell array of M rows and N columns: the input image is fed into the memory cells through a digital-to-analog converter (DAC) on each row, and is then multiplied and accumulated with the neural network weights stored in the memory cells (every n adjacent columns in a row store one n-bit weight).
In each clock cycle, m rows of DACs in the memory cell array are turned on, and the multiply-accumulate result of these m rows is converted into a digital signal by the analog-to-digital converter (ADC) of each column. That is, the result obtained from the in-memory computation must be converted into a digital signal by an ADC or a similar module before it can be stored and processed in a digital circuit. Let the input on row i be a_i, and let the n-bit weight stored in row i, columns j×n to j×n+n-1, be w_ij; the multiply-accumulate result output through the ADC of column block j is then

∑_{i=1}^{m} a_i · w_ij
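For illustration only (this sketch is not part of the patent disclosure), the multiply-accumulate just defined can be modelled as follows; the array sizes and the names a, W and column_block_mac are assumptions made for the example.

```python
import numpy as np

# Illustrative model of the per-column-block multiply-accumulate described above:
# a_i are the m row inputs, w_ij is the packed n-bit weight of row i and column
# block j, and each ADC would digitize sum_i a_i * w_ij for its block.
def column_block_mac(a, W):
    """a: shape (m,) row inputs; W: shape (m, num_blocks) weights w_ij."""
    return a @ W

a = np.array([1, 2, 3])                  # inputs a_i for m = 3 active rows
W = np.array([[0, 5], [0, 1], [0, 2]])   # two column blocks; block 0 stores only zeros
print(column_block_mac(a, W))            # -> [ 0 13]; block 0 needs no ADC conversion
```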
In practical applications, given the redundancy present in neural network algorithms, a large fraction of the weight data can be set to 0 by sparsification techniques, thereby reducing the computational cost of the neural network. However, the distribution of zeros seen by in-memory computation tends to be scattered and irregular. Since in-memory computation is usually highly parallel, even if most weights are 0, the ADC of the corresponding output still has to be turned on as long as a single non-zero weight remains; this generates a large amount of power consumption and may account for up to 95% of the power budget of the whole in-memory computing module.
Disclosure of Invention
In order to overcome the above problems or at least partially solve the above problems, embodiments of the present invention provide a memory computing architecture supporting weight sparsity and a data output method thereof, so as to effectively reduce power consumption of memory computing in neural network model weight sparsity application and improve feasibility of application.
In a first aspect, an embodiment of the present invention provides an in-memory computing architecture supporting weight sparseness, including:
the memory cell array comprises a plurality of sub memory cell blocks, and an analog-to-digital conversion unit is correspondingly arranged at the output port of each column of sub memory cell blocks;
the operation module is used for carrying out sparse training on the weight of the neural network model stored in the storage unit array according to each sub storage unit block, so that the weight stored in each sub storage unit block is trained to be an all-zero value or a non-all-zero value;
and the detection module is used for turning off the analog-to-digital conversion unit and setting the output of the analog-to-digital conversion unit to be zero when the sub-storage unit block corresponding to the analog-to-digital conversion unit is detected to be in a working state and the stored weight is all zero.
Further, the operation module is further configured to adaptively adjust the number of rows and the number of columns of the sub-memory cell block in the process of performing sparse training, so as to adapt to the total number of rows and the total number of columns of the memory cell array.
Further, the operation module is further configured to mark the sub-memory cell block as a sparse block after training the weight stored in the sub-memory cell block to an all-zero value;
correspondingly, the detection module is further configured to detect whether the weight stored in each sub-memory cell block is all zero by detecting whether each sub-memory cell block includes a sparse block flag.
Optionally, the analog-to-digital conversion unit is specifically an analog-to-digital converter (ADC), a sampling amplifier circuit (SA), or an in-memory computing processing unit (PU).
Optionally, the operation module is specifically configured to mark whether the sub memory cell block is a sparse block by using a 1-bit sparse flag (sparse index).
Further, the operation module is further configured so that, in each clock cycle, the number of rows and the number of columns of the memory cell array that are turned on match the number of rows and the number of columns of the sub memory cell block, respectively.
In a second aspect, an embodiment of the present invention provides a data output method based on the memory computing architecture supporting weight sparsity as described in the first aspect, including:
according to each sub-storage unit block, carrying out sparse training on weights of the neural network model stored in the storage unit array, so that the weights stored in each sub-storage unit block are trained to be all-zero values or non-all-zero values;
if the sub-memory cell block corresponding to the analog-digital conversion unit is detected to be in a working state and the stored weight is all zero, the analog-digital conversion unit is turned off, the output of the analog-digital conversion unit is set to be zero, otherwise, multiplication and addition operation is carried out according to the input of the sub-memory cell block corresponding to the analog-digital conversion unit in the working state and the weight stored in the sub-memory cell block in the working state, and a multiplication and addition operation result is output by turning on the analog-digital conversion unit.
According to the memory computing architecture supporting weight sparsity and the data output method thereof provided by the embodiments of the invention, the weights in the in-memory-computing memory cell array are sparsely trained block by block and the memory cell array is divided into sub memory cell blocks, realizing block-wise weight sparsity of the neural network model; at the same time, by turning off the analog-to-digital conversion units corresponding to the sparse blocks, the power consumption of in-memory computing in weight-sparse neural network applications can be effectively reduced and the feasibility of such applications improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a memory computing architecture supporting weight sparseness according to an embodiment of the present invention;
FIG. 2 is a schematic circuit diagram of a memory computing architecture supporting weight sparseness according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a data output method based on a memory computing architecture supporting weight sparseness according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without any creative efforts belong to the protection scope of the embodiments of the present invention.
To address the problem in the prior art of excessive power consumption of in-memory computing in neural network applications, the embodiments of the invention perform block-wise sparse training of the weights stored in the in-memory-computing memory cell array and divide that memory cell array into sub memory cell blocks, thereby realizing block-wise weight sparsity of the neural network model; by turning off the analog-to-digital conversion units corresponding to the sparse blocks, the power consumption of in-memory computing in weight-sparse neural network applications can be effectively reduced and the feasibility of such applications improved. Embodiments of the present invention will be described and illustrated below with reference to several embodiments.
Fig. 1 is a schematic structural diagram of a memory computing architecture supporting weight sparsity according to an embodiment of the present invention. The architecture supports a regular (block-wise) weight-sparsity pattern that allows the corresponding analog-to-digital conversion circuit units to be turned off, thereby reducing the power consumption of the in-memory computing circuit system. As shown in fig. 1, the architecture includes a memory cell array 101, an operation module 102, and a detection module 103. Wherein:
the memory cell array 101 comprises a plurality of sub-memory cell blocks, and an analog-to-digital conversion unit is correspondingly arranged at an output port of each column of sub-memory cell blocks; the operation module 102 is configured to perform sparse training on weights of the neural network model stored in the memory cell array according to each sub-memory cell block, so that the weights stored in each sub-memory cell block are trained to be all-zero values or non-all-zero values; the detection module 103 is configured to turn off the analog-to-digital conversion unit and set an output of the analog-to-digital conversion unit to zero when it is detected that the sub-memory cell block corresponding to the analog-to-digital conversion unit is in the working state and the stored weight is a full zero value.
It can be understood that, in the memory computing architecture supporting weight sparsity according to the embodiment of the present invention, the weights of the neural network are trained in a regular, block-wise sparse manner, so that the weights stored in the in-memory computing circuit are sparse block by block; when the computation corresponding to a sparse weight block is performed, the architecture can directly skip it and turn off the corresponding analog-to-digital conversion circuit unit to save power. To this end, the architecture comprises at least a memory cell array 101, an operation module 102 and a detection module 103, which respectively store the weights of the neural network model, perform block-wise sparse training of those weights, and implement the power-saving flow of detecting sparse blocks and turning off the corresponding analog-to-digital conversion units.
Specifically, fig. 2 is a schematic circuit diagram of a memory computing architecture supporting weight sparsity according to an embodiment of the present invention. The memory cell array 101 contains M rows and N columns of memory cells, which are divided into small blocks of m rows and n columns; each block of m rows and n columns forms a sub memory cell block. During algorithm training, the weights of each small block are trained to be either all 0 (sparse) or not all 0. Meanwhile, the multiply-add output of each column of sub memory cell blocks is provided with a corresponding analog-to-digital conversion unit, which converts the result obtained by the in-memory computation, e.g. an analog voltage or current signal, into a digital signal to be stored and processed in the digital circuit.
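The block partitioning just described can be pictured with the following sketch; the variable names and the divisibility assumption are illustrative and not taken from the patent.

```python
import numpy as np

# Sketch of dividing an M x N memory cell array into sub memory cell blocks of
# m rows x n columns; assumes M and N are multiples of m and n for simplicity.
def partition_into_blocks(weights, m, n):
    M, N = weights.shape
    assert M % m == 0 and N % n == 0, "illustrative assumption: divisible sizes"
    # Resulting shape (M//m, N//n, m, n): one m x n tile per sub memory cell block.
    return weights.reshape(M // m, m, N // n, n).transpose(0, 2, 1, 3)

W = np.arange(64).reshape(8, 8)          # example 8 x 8 memory cell array
blocks = partition_into_blocks(W, 4, 4)  # four sub memory cell blocks of 4 x 4
print(blocks.shape)                      # -> (2, 2, 4, 4)
```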
Optionally, the analog-to-digital conversion unit is specifically an analog-to-digital converter (ADC), a sampling amplifier circuit (SA), or an in-memory computing processing unit (PU). That is, in different in-memory computing architectures, the ADC may be replaced by a sampling amplifier circuit (SA), a processing unit (PU), or the like. In any case, the function implemented is to convert the result of the in-memory computation from an analog voltage or current into a digital-circuit representation.
The operation module 102 mainly implements the computation function in the in-memory computing architecture. Specifically, through the neural network algorithm, it trains the weights of the neural network model into a block-wise sparse form corresponding to the m-row, n-column sub memory cell blocks of the in-memory-computing memory cell array. That is, after block-wise sparse training, the weights stored in each sub memory cell block are either all zero (a sparse block) or not all zero (a non-sparse block).
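As a hedged sketch of what such block-wise sparse training could look like (the magnitude-based pruning criterion, the sparsity target and the retraining loop around it are assumptions; the patent only requires that each block end up all zero or not all zero):

```python
import numpy as np

# One pruning step of block-wise sparse training: score each m x n block, zero the
# weakest blocks entirely, keep the rest. In practice this step would be interleaved
# with retraining; the magnitude criterion and target used here are assumptions.
def prune_blocks(weights, m, n, target_sparsity=0.95):
    M, N = weights.shape
    blocks = weights.reshape(M // m, m, N // n, n).transpose(0, 2, 1, 3)
    scores = np.abs(blocks).sum(axis=(2, 3))            # one score per sub block
    k = int(target_sparsity * scores.size)              # number of blocks to zero
    cutoff = np.sort(scores, axis=None)[k - 1] if k else -np.inf
    keep = scores > cutoff                               # True only for the strongest blocks
    blocks = blocks * keep[:, :, None, None]             # sparse blocks become all zero
    return blocks.transpose(0, 2, 1, 3).reshape(M, N)
```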
The detection module 103 detects, on the basis of the sparse training, the weights stored in each sub memory cell block in order to determine the state of the sub memory cell block corresponding to each analog-to-digital conversion unit: whether it is idle or in the working state, and, for a block in the working state, whether the stored weights are all zero or not all zero. If the weights stored in the sub memory cell block in the working state corresponding to a given analog-to-digital conversion unit are all zero, that analog-to-digital conversion unit is turned off. Moreover, since the in-memory-computing memory cell array operates on m rows per clock cycle and all data stored in a sparse block is 0, the corresponding multiply-accumulate result is known to be 0. The detection module 103 therefore also sets the output of the corresponding analog-to-digital conversion unit to zero when turning it off.
According to the memory computing architecture supporting weight sparsity provided by the embodiment of the invention, the weights in the in-memory-computing memory cell array are sparsely trained block by block and the memory cell array is divided into sub memory cell blocks, realizing block-wise weight sparsity of the neural network model; at the same time, by turning off the analog-to-digital conversion units corresponding to the sparse blocks, the power consumption of in-memory computing in weight-sparse neural network applications can be effectively reduced and the feasibility of such applications improved.
In addition, on the basis of the foregoing embodiments, the operation module may further be configured to adaptively adjust the number of rows and the number of columns of the sub-memory cell block in the process of performing the sparse training, so as to adapt to the total number of rows and the total number of columns of the memory cell array.
It can be understood that, during the block-wise sparse training through the neural network, the number of rows and columns of each block may be adaptively adjusted, so that the number of rows and columns of the sub-memory cell blocks is also adaptively divided to adapt to the total number of rows and the total number of columns of the memory cell array.
Furthermore, the operation module is further configured so that, in each clock cycle, the number of rows and the number of columns of the memory cell array that are turned on match the number of rows and the number of columns of the sub memory cell block, respectively. That is, in each clock cycle, the number of rows turned on in the memory cell array is consistent with the number of rows of the sub memory cell blocks, and the number of columns turned on is consistent with the number of bits of the weight.
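The bit-level mapping implied here can be illustrated with a hedged sketch; the shift-add recombination shown below is one common possibility and is an assumption, not something the patent specifies:

```python
import numpy as np

# Hedged sketch: one common way an n-bit weight can be spread over n columns and the
# per-column partial sums recombined with bit significance afterwards.
def unpack_bits(w, n):
    """Split non-negative integer weights w (shape (m,)) into n bit columns, LSB first."""
    return np.array([(w >> b) & 1 for b in range(n)]).T           # shape (m, n)

def block_mac_bitwise(a, w, n):
    """Multiply-accumulate of inputs a (m,) with n-bit weights w (m,), column by column."""
    bit_cols = unpack_bits(w, n)                                   # m rows x n bit columns
    per_column = a @ bit_cols                                      # one partial sum per column
    return int(sum(per_column[b] << b for b in range(n)))          # shift-add recombination

a = np.array([1, 2, 3])
w = np.array([5, 1, 2])                                            # three weights, n = 4 bits
print(block_mac_bitwise(a, w, 4), int(a @ w))                      # -> 13 13 (they agree)
```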
Furthermore, the operation module is further used for marking the sub-storage unit block as a sparse block after training the weight stored in the sub-storage unit block to be an all-zero value; correspondingly, the detection module is further configured to detect whether the weight stored in the sub-memory cell block is all zero values by detecting whether each sub-memory cell block includes the sparse block flag.
It can be understood that, after all the weights stored in a given sub memory cell block have been trained to zero by the neural network algorithm, that sub memory cell block is marked as a sparse block, yielding a corresponding sparse block flag. Such a sparse block is called a Sparse Weight Block (SWB). Optionally, the operation module marks whether a sub memory cell block is a sparse block with a 1-bit sparse flag (sparse index). That is, a 1-bit sparse flag can mark whether the weight data block corresponding to the current ADC is an SWB. If it is an SWB, the ADC is powered off and the subsequent circuit directly outputs 0, which reduces the operating power consumption of the ADC.
Accordingly, when detecting whether each sub-memory cell block is a sparse block, the detection module only needs to detect whether each sub-memory cell block includes a corresponding sparse block flag to detect whether all the weights stored in each sub-memory cell block are zero values. For example, when a certain sub-memory cell block includes a corresponding sparse block flag, it indicates that the weights stored therein are all zero.
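A minimal sketch of deriving the 1-bit sparse flags, assuming the block layout from the earlier sketches (the flag storage format is an assumption, not the patent's exact implementation):

```python
import numpy as np

# Compute the 1-bit sparse flag (sparse index) for every sub memory cell block:
# flag = 1 marks an SWB whose ADC can be powered off and whose output is forced to 0.
def sparse_index(blocks):
    """blocks: shape (block_rows, block_cols, m, n), e.g. from partition_into_blocks."""
    return np.all(blocks == 0, axis=(2, 3)).astype(np.uint8)
```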
Based on the same inventive concept, the embodiment of the invention also provides a data output method based on the memory computing architecture supporting weight sparsity, which is based on the above embodiments. Therefore, the description and definition in the memory computing architecture supporting weight sparsity in the above embodiments may be used for understanding the processing steps in the embodiments of the present invention, and reference may be made to the above embodiments specifically, and details are not repeated here.
As an embodiment of the present invention, a data output method based on the memory computing architecture supporting weight sparsity according to the above embodiments is shown in fig. 3, which is a schematic flow chart of the data output method based on the memory computing architecture supporting weight sparsity according to the embodiment of the present invention, and includes the following processing procedures:
s301, according to each sub-storage unit block, sparse training is carried out on the weight of the neural network model stored in the storage unit array, so that the weight stored in each sub-storage unit block is trained to be all-zero or non-all-zero.
It can be understood that this step mainly implements the data computation in the memory cell array. Specifically, through the neural network algorithm, the weights of the neural network model are trained into a block-wise sparse form corresponding to the m-row, n-column sub memory cell blocks of the in-memory-computing memory cell array. That is, after block-wise sparse training, the weights stored in each sub memory cell block are either all zero or not all zero.
S302, if the sub-memory cell block corresponding to the analog-to-digital conversion unit is detected to be in the working state and the stored weight is all zero, the analog-to-digital conversion unit is turned off, and the output of the analog-to-digital conversion unit is set to be zero, otherwise, multiplication and addition operation is carried out according to the input of the sub-memory cell block corresponding to the analog-to-digital conversion unit in the working state and the weight stored in the sub-memory cell block in the working state, and a multiplication and addition operation result is output by turning on the analog-to-digital conversion unit.
It can be understood that, in this step, on the basis of the sparse training, the weights stored in each sub memory cell block are detected to determine the state of the sub memory cell block corresponding to each analog-to-digital conversion unit: whether it is idle or in the working state, and, for a block in the working state, whether the stored weights are all zero or not all zero.
And if the weights stored in the sub-memory cell blocks in the working state corresponding to a certain analog-to-digital conversion unit are all zero values, correspondingly turning off the analog-to-digital conversion unit. In addition, since the memory cell array of the memory calculation uses m rows as an operation unit in one clock cycle, and since all stored data of one sparse block is 0, the corresponding multiply-accumulate result can be directly determined to be 0. Therefore, the output of the corresponding analog-to-digital conversion unit is set to zero on the basis of turning off the analog-to-digital conversion unit.
In addition, if the weights stored in the sub memory cell blocks in the working state corresponding to a certain analog-to-digital conversion unit are not all zero values, the analog-to-digital conversion unit is correspondingly turned on. And meanwhile, performing multiply-add operation on the input of the sub-storage unit block in the working state corresponding to the analog-digital conversion unit and the weight stored in the sub-storage unit block, and outputting a corresponding multiply-add operation result.
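A behavioural sketch of step S302 under the same illustrative assumptions (the names W_row, flags_row and the ideal adc callable are hypothetical, not the patent's circuit):

```python
import numpy as np

# Per clock cycle: for every active column block, skip the ADC and output 0 when the
# block's sparse flag is set; otherwise multiply-accumulate and convert the result.
def cycle_output(a, W_row, flags_row, adc):
    """a: (m,) inputs of the m active rows; W_row: (num_blocks, m) packed weights w_ij
    of the active block row; flags_row: (num_blocks,) 1-bit sparse flags; adc: quantizer."""
    outputs = []
    for w_j, flag in zip(W_row, flags_row):
        if flag:                                  # SWB: ADC stays off, result known to be 0
            outputs.append(0)
        else:                                     # non-sparse block: MAC, then A/D conversion
            outputs.append(adc(float(a @ w_j)))
    return outputs

print(cycle_output(np.array([1, 2]),              # inputs a_i for m = 2 active rows
                   np.array([[0, 0], [3, 1]]),    # block 0 all zero, block 1 dense
                   np.array([1, 0]),              # sparse flags: block 0 is an SWB
                   adc=round))                    # -> [0, 5]
```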
According to the data output method based on the memory computing architecture supporting weight sparsity provided by the embodiments of the invention, the weights in the in-memory-computing memory cell array are sparsely trained block by block and the memory cell array is divided into sub memory cell blocks, realizing block-wise weight sparsity of the neural network model; at the same time, by turning off the analog-to-digital conversion units corresponding to the sparse blocks, the power consumption of in-memory computing in weight-sparse neural network applications can be effectively reduced and the feasibility of such applications improved.
To further illustrate the technical solutions of the embodiments of the present invention, the embodiments of the present invention provide the following specific processes according to the above embodiments, but do not limit the scope of the embodiments of the present invention.
According to the embodiment of the invention, an integrated circuit chip containing an example of the in-memory computing architecture of the embodiment was obtained through front-end design, back-end design and wafer fabrication of the digital and analog circuits. The chip was fabricated in a TSMC 65 nm process, and power consumption and performance were tested after packaging. The chip area is 3.0 mm x 3.0 mm and contains 4 identical instances of the invention example, each with an area of 0.37 mm x 0.40 mm. The tested operating frequency is 50-100 MHz, at a corresponding supply voltage of 0.90-1.05 V.
The data storage and operation process comprises the following steps:
and training the weight into a sparse form through a neural network algorithm, and calculating the SWB of m rows and n columns of the array corresponding to the memory.
The number of rows of the in-memory computing array turned on in each cycle is consistent with m, and n is consistent with the number of bits of the weight. Both m and n can be adjusted flexibly during algorithm training to fit the actual in-memory computing array.
Dynamic shutdown of the ADCs based on SWBs: each SWB of m rows and n columns corresponds to a 1-bit sparse flag, and the ADC is dynamically turned on or off when the multiply-accumulate operation of the current block is executed.
Experiments show that the embodiment of the invention reduces the power consumption overhead of the in-memory computing architecture by supporting weight sparsity and dynamically turning off the ADCs. Sparse training and chip testing were performed on different neural network algorithms: using the VGG16 and ResNet18 models on two image recognition test sets, MNIST and Cifar-10, weight data block compression of 20-39 times was achieved, i.e. SWBs account for 95%-97.4% of all weights (a 95% SWB fraction leaves only 5% of the blocks, about 20x compression, and 97.4% corresponds to about 39x). For a VGG16 network on Cifar-10 with 4-bit input images and 4-bit weights, the invention-example portion of the actual chip achieves a power saving of 2.4-13.6 times (varying with the SWB fraction of different layers of the network), 10.1 times on average.
It will be appreciated that the above described embodiments of in-memory computing architectures are merely illustrative, in which the units illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over different network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on the understanding, the above technical solutions may be embodied in software products, or hardware products, which may be stored in a computer-readable storage medium, such as a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, and include instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute the method described in the method embodiments or some parts of the method embodiments.
In addition, it should be understood by those skilled in the art that in the specification of the embodiments of the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In the description of the embodiments of the invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention, and not to limit the same; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. An in-memory computing architecture that supports weight sparseness, comprising:
the memory cell array comprises a plurality of sub memory cell blocks, and an analog-to-digital conversion unit is correspondingly arranged at the output port of each column of sub memory cell blocks;
the operation module is used for carrying out sparse training on the weight of the neural network model stored in the storage unit array according to each sub storage unit block, so that the weight stored in each sub storage unit block is trained to be an all-zero value or a non-all-zero value;
and the detection module is used for turning off the analog-to-digital conversion unit and setting the output of the analog-to-digital conversion unit to be zero when the sub-storage unit block corresponding to the analog-to-digital conversion unit is detected to be in a working state and the stored weight is all zero.
2. The memory computing architecture of claim 1, wherein the computing module is further configured to adaptively adjust the number of rows and the number of columns of the sub-memory cell blocks to adapt to the total number of rows and the total number of columns of the memory cell array during the sparse training.
3. The memory computing architecture supporting weight sparsity according to claim 1 or 2, wherein the operation module is further configured to mark the sub-memory cell blocks as sparse blocks after training weights stored in the sub-memory cell blocks to all-zero values;
correspondingly, the detection module is further configured to detect whether the weight stored in each sub-memory cell block is all zero by detecting whether each sub-memory cell block includes a sparse block flag.
4. The memory computing architecture supporting weight sparseness of claim 1 or 2, wherein the analog-to-digital conversion unit is specifically an analog/digital converter (ADC), a sampling amplification circuit (SA), or a memory computing Processing Unit (PU).
5. The memory computing architecture of claim 3, wherein the operation module is specifically configured to mark whether the sub-memory cell block is a sparse block by using a 1-bit sparse flag (sparse index).
6. The memory computing architecture of claim 2, wherein the operation module is further configured to, in each clock cycle, match the number of rows and the number of columns of the memory cell array that are turned on with the number of rows and the number of columns of the sub-memory cell blocks, respectively.
7. A data output method based on the in-memory computing architecture supporting weight sparsity of any one of claims 1 to 6, comprising:
according to each sub-storage unit block, carrying out sparse training on weights of the neural network model stored in the storage unit array, so that the weights stored in each sub-storage unit block are trained to be all-zero values or non-all-zero values;
if the sub-memory cell block corresponding to the analog-digital conversion unit is detected to be in a working state and the stored weight is all zero, the analog-digital conversion unit is turned off, the output of the analog-digital conversion unit is set to be zero, otherwise, multiplication and addition operation is carried out according to the input of the sub-memory cell block corresponding to the analog-digital conversion unit in the working state and the weight stored in the sub-memory cell block in the working state, and a multiplication and addition operation result is output by turning on the analog-digital conversion unit.
CN201911151228.XA 2019-11-21 2019-11-21 Memory computing architecture supporting weight sparseness and data output method thereof Active CN111079919B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911151228.XA CN111079919B (en) 2019-11-21 2019-11-21 Memory computing architecture supporting weight sparseness and data output method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911151228.XA CN111079919B (en) 2019-11-21 2019-11-21 Memory computing architecture supporting weight sparseness and data output method thereof

Publications (2)

Publication Number Publication Date
CN111079919A (en) 2020-04-28
CN111079919B (en) 2022-05-20

Family

ID=70311698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911151228.XA Active CN111079919B (en) 2019-11-21 2019-11-21 Memory computing architecture supporting weight sparseness and data output method thereof

Country Status (1)

Country Link
CN (1) CN111079919B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111431536A (en) * 2020-05-18 2020-07-17 深圳市九天睿芯科技有限公司 Subunit, MAC array and analog-digital mixed memory computing module with reconfigurable bit width
CN111709872A (en) * 2020-05-19 2020-09-25 北京航空航天大学 Spin memory computing architecture of graph triangle counting algorithm
CN112214326A (en) * 2020-10-22 2021-01-12 南京博芯电子技术有限公司 Equalization operation acceleration method and system for sparse recurrent neural network
CN112529171A (en) * 2020-12-04 2021-03-19 中国科学院深圳先进技术研究院 Memory computing accelerator and optimization method thereof
CN113313247A (en) * 2021-02-05 2021-08-27 中国科学院计算技术研究所 Operation method of sparse neural network based on data flow architecture
WO2022029790A1 (en) * 2020-08-04 2022-02-10 Indian Institute Of Technology, Madras A flash adc based method and process for in-memory computation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106851076A (en) * 2017-04-01 2017-06-13 重庆大学 Compressed sensing video image acquisition circuit based on address decoding
US20180046916A1 (en) * 2016-08-11 2018-02-15 Nvidia Corporation Sparse convolutional neural network accelerator
CN108988865A (en) * 2018-07-11 2018-12-11 西安空间无线电技术研究所 A kind of optimum design method of compressed sensing observing matrix
CN109472350A (en) * 2018-10-30 2019-03-15 南京大学 A kind of neural network acceleration system based on block circulation sparse matrix
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110427171A (en) * 2019-08-09 2019-11-08 复旦大学 Expansible fixed-point number matrix multiply-add operation deposits interior calculating structures and methods

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046916A1 (en) * 2016-08-11 2018-02-15 Nvidia Corporation Sparse convolutional neural network accelerator
CN106851076A (en) * 2017-04-01 2017-06-13 重庆大学 Compressed sensing video image acquisition circuit based on address decoding
CN108988865A (en) * 2018-07-11 2018-12-11 西安空间无线电技术研究所 A kind of optimum design method of compressed sensing observing matrix
CN109472350A (en) * 2018-10-30 2019-03-15 南京大学 A kind of neural network acceleration system based on block circulation sparse matrix
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110427171A (en) * 2019-08-09 2019-11-08 复旦大学 Expansible fixed-point number matrix multiply-add operation deposits interior calculating structures and methods

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANGSHUMAN PARASHAR et al.: "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks", 《ARXIV:1708.04485V1》 *
PEIQI WANG et al.: "SNrram: An Efficient Sparse Neural Network Computation Architecture Based on Resistive Random-Access Memory", 《2018 55TH ACM/ESDA/IEEE DESIGN AUTOMATION CONFERENCE》 *
ZHE YUAN et al.: "A Sparse-Adaptive CNN Processor with Area/Performance balanced N-Way Set-Associate PE Arrays Assisted by a Collision-Aware Scheduler", 《IEEE ASIAN SOLID-STATE CIRCUITS CONFERENCE》 *
陈桂林 et al.: "Survey of hardware-accelerated neural networks" (硬件加速神经网络综述), 《计算机研究与发展》 (Journal of Computer Research and Development) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111431536A (en) * 2020-05-18 2020-07-17 深圳市九天睿芯科技有限公司 Subunit, MAC array and analog-digital mixed memory computing module with reconfigurable bit width
US11948659B2 (en) 2020-05-18 2024-04-02 Reexen Technology Co., Ltd. Sub-cell, mac array and bit-width reconfigurable mixed-signal in-memory computing module
CN111709872A (en) * 2020-05-19 2020-09-25 北京航空航天大学 Spin memory computing architecture of graph triangle counting algorithm
CN111709872B (en) * 2020-05-19 2022-09-23 北京航空航天大学 Spin memory computing architecture of graph triangle counting algorithm
WO2022029790A1 (en) * 2020-08-04 2022-02-10 Indian Institute Of Technology, Madras A flash adc based method and process for in-memory computation
CN112214326A (en) * 2020-10-22 2021-01-12 南京博芯电子技术有限公司 Equalization operation acceleration method and system for sparse recurrent neural network
CN112529171A (en) * 2020-12-04 2021-03-19 中国科学院深圳先进技术研究院 Memory computing accelerator and optimization method thereof
CN112529171B (en) * 2020-12-04 2024-01-05 中国科学院深圳先进技术研究院 In-memory computing accelerator and optimization method thereof
CN113313247A (en) * 2021-02-05 2021-08-27 中国科学院计算技术研究所 Operation method of sparse neural network based on data flow architecture
CN113313247B (en) * 2021-02-05 2023-04-07 中国科学院计算技术研究所 Operation method of sparse neural network based on data flow architecture

Also Published As

Publication number Publication date
CN111079919B (en) 2022-05-20

Similar Documents

Publication Publication Date Title
CN111079919B (en) Memory computing architecture supporting weight sparseness and data output method thereof
Tang et al. Binary convolutional neural network on RRAM
CN111026700B (en) Memory computing architecture for realizing acceleration and acceleration method thereof
Zhou et al. Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach
Gupta et al. Masr: A modular accelerator for sparse rnns
CN109543830B (en) Splitting accumulator for convolutional neural network accelerator
CN108491926B (en) Low-bit efficient depth convolution neural network hardware accelerated design method, module and system based on logarithmic quantization
CN112070204B (en) Neural network mapping method and accelerator based on resistive random access memory
CN110991631A (en) Neural network acceleration system based on FPGA
Qin et al. Diagonalwise refactorization: An efficient training method for depthwise convolutions
CN113283587A (en) Winograd convolution operation acceleration method and acceleration module
Yang et al. Fusekna: Fused kernel convolution based accelerator for deep neural networks
Jiang et al. A low-latency LSTM accelerator using balanced sparsity based on FPGA
CN112529171B (en) In-memory computing accelerator and optimization method thereof
Saxena et al. Towards adc-less compute-in-memory accelerators for energy efficient deep learning
CN111381968A (en) Convolution operation optimization method and system for efficiently running deep learning task
Wu et al. A 3.89-GOPS/mW scalable recurrent neural network processor with improved efficiency on memory and computation
KR102541461B1 (en) Low power high performance deep-neural-network learning accelerator and acceleration method
Basumallik et al. Adaptive block floating-point for analog deep learning hardware
Chen et al. CompRRAE: RRAM-based convolutional neural network accelerator with r educed computations through ar untime a ctivation e stimation
He et al. Infox: An energy-efficient reram accelerator design with information-lossless low-bit adcs
WO2023146613A1 (en) Reduced power consumption analog or hybrid mac neural network
CN114897159A (en) Method for rapidly deducing incident angle of electromagnetic signal based on neural network
Zhu et al. Exploiting parallelism with vertex-clustering in processing-in-memory-based GCN accelerators
JP2023545575A (en) Quantization for neural network calculations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant