CN115586885A - Memory computing unit and acceleration method - Google Patents

Memory computing unit and acceleration method

Info

Publication number
CN115586885A
CN115586885A
Authority
CN
China
Prior art keywords
memory
output
result
input data
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211215424.0A
Other languages
Chinese (zh)
Other versions
CN115586885B (en)
Inventor
张盛
赵越
李政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Crystal Iron Semiconductor Technology Guangdong Co ltd
Original Assignee
Crystal Iron Semiconductor Technology Guangdong Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Crystal Iron Semiconductor Technology Guangdong Co ltd filed Critical Crystal Iron Semiconductor Technology Guangdong Co ltd
Priority to CN202211215424.0A
Publication of CN115586885A
Application granted
Publication of CN115586885B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Memory System (AREA)

Abstract

The invention relates to an in-memory computing unit and an acceleration method, and belongs to the field of in-memory computing. The invention combines the output memory of a DNN accelerator with its inner-product accumulation unit, moving the accumulation operation into the storage Cell to solve the memory-wall problem at the output end. The in-memory computing unit comprises a plurality of Blocks; each Block comprises parallel Cells, and each Cell comprises a state decoder, a pre-storage memory, a result memory, an adder, a data selector and an activation unit; the activation unit comprises activation functions, a custom polynomial and a data selector. Because the accumulation operation is performed inside the storage Cell, the partial sum no longer needs to be fetched, accumulated and written back; multiplexing of the output is thus realized, the operation speed is increased and the operation power consumption is reduced. The method inputs the combination of the input data group and the data address group into the memory and then performs the addressing computation inside it, further increasing the operation speed and reducing the power consumption.

Description

Memory computing unit and acceleration method
Technical Field
The invention belongs to the field of in-memory computing, and particularly relates to an in-memory computing unit and an acceleration method.
Background
In recent years, Artificial Intelligence (AI) technology has developed rapidly, and deep neural networks (DNNs), one of its most important representatives, are applied especially widely. Because of the peculiarities of DNNs, such as data reuse and convolution computation, traditional processors such as CPUs and GPUs are poorly suited to them, which has driven both lightweight model processing and the emergence of AI accelerators.
A deep learning model processed by an AI accelerator performs multiply-accumulate operations on a number of input Activation and Weight values to generate the corresponding Output; compared with a traditional computing architecture, the accelerator realizes operations of the same scale at a smaller cost.
One of the most important characteristics of a deep learning model is its abundant reusability, which is reflected in the three elements participating in DNN computation: Activation, Weight and Output. Weight is multiplexed when the same network is used to process different models; Activation is multiplexed when the same Feature Map must pass through several network operations, or when one Activation is operated on by several different Kernels to obtain the values of different channels of the next layer; and, because of hardware resource limits, the Output at one position may be produced by multiple operations, in which case the Output must be multiplexed.
The special data reusability of DNN operations creates a memory wall when DNNs run on a conventional computing architecture, and various accelerators try to save read and write operations at different levels of the storage hierarchy by multiplexing data in hardware, so as to increase the speed and energy efficiency of DNN computation.
In the prior art, in-memory computing was devised at the memory side to overcome the memory wall: a computing unit is usually placed directly inside the memory module to remove the bottleneck caused by memory access. Such designs generally realize Weight multiplexing or Activation multiplexing at the input, increasing computing speed and reducing power consumption by lowering the refresh frequency on one input side. Because DNN models differ in size, the amount of memory to be saved must be traded off, usually for the sake of generality: when an input-multiplexing in-memory computing unit is too small, or its storage level is too high, frequent reads and writes introduce extra overhead and in-memory computing loses much of its value; when it is designed too large, the model cannot be deployed fully. Moreover, as long as data multiplexing is used, the problem of Tiling is inevitably involved; when the computation results must be accumulated, an additional memory access is performed again, i.e. the partial sums produced by the inner-product operation must first be stored in memory, after which the inner-product computing unit executes a 'fetch → compute → write' procedure. This procedure involves at least two addressings with almost no parallelism; power and cycles are wasted on addressing, and the operation becomes a real bottleneck.
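To make the bottleneck concrete, the following Python sketch contrasts the conventional 'fetch → compute → write' partial-sum update with the single-write update pursued here. It is an illustration only: the function names, the toy output memory and its size are assumptions, not taken from the patent.

```python
output_sram = [0] * 16   # toy output memory holding running partial sums

def conventional_update(addr, partial_sum):
    """Prior art: 'fetch -> compute -> write', i.e. at least two addressings."""
    acc = output_sram[addr]       # 1st addressing: fetch the running sum
    acc += partial_sum            # the inner-product unit adds outside memory
    output_sram[addr] = acc       # 2nd addressing: write the sum back

def in_memory_update(addr, partial_sum):
    """Target behavior: one addressed 'write'; the add happens in the Cell."""
    output_sram[addr] += partial_sum   # modeled as a single write command
```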
Disclosure of Invention
In view of the above analysis, the present invention aims to provide an in-memory computing unit and an acceleration method in which the output memory of a DNN accelerator is combined with its inner-product accumulation unit and the accumulation operation is moved into the storage Cell. Since the partial sum no longer has to be fetched, accumulated and stored again, the output of the DNN accelerator is multiplexed and the memory-wall problem at the output end of the DNN accelerator is solved, so that the operation speed is increased and the operation power consumption is reduced.
The location of the in-memory computing unit of the present invention in the DNN computing architecture is shown in fig. 1.
In one aspect, the present invention provides an in-memory computing unit comprising a plurality of Blocks, each Block comprising a plurality of Cells, and each Cell comprising a storage accumulation part and an output part; the storage accumulation part is used for storing input data and performing the accumulation operation, and the output part is used for activating the result of the accumulation operation to obtain the output result of the Cell;
the storage accumulation part comprises a state decoder, a pre-storage memory, a result memory, an adder and a first data selector; wherein,
the state decoder is used for determining the operation types of the pre-storage memory and the result memory based on the input data, and for obtaining the first input data of the adder based on the input data;
the pre-storage memory is used for outputting its content to the adder, as the second input data of the adder, based on the operation type;
the adder is used for performing the addition operation on the first input data and the second input data;
the first data selector is used for selecting, based on the reset flag, either to set the result memory to 0 or to store the output of the adder into the result memory;
the result memory is used for outputting its content to the pre-storage memory or the output part based on the operation type.
Furthermore, each Block is used for completing one computation task, and the number of Blocks is determined by the maximum number of concurrent tasks; each Cell is used for operating on data based on the instruction of the Block in which it is located, the number of Cells is determined by the size reserved for the output storage space, and that size is determined by the DNN model.
Further, the output part comprises an activation unit, and the activation unit comprises a plurality of activation functions, a custom polynomial and a second data selector, wherein,
the activation functions and the custom polynomial in the activation unit are used for performing the corresponding activation operations on the input data respectively;
the second data selector is used for selecting one activation-operation result as the output result of the Cell based on the activation selection.
Further, the state decoder determining the operation types of the pre-storage memory and the result memory based on the input data, and obtaining the first input data of the adder based on the input data, comprises:
the input data comprise a read-write state, an input data group and a data address group;
the operation types comprise sleep, write, stop write and read;
the operation types of the pre-storage memory and the result memory are determined based on the read-write state;
the first input data of the adder is obtained based on the input data group and the data address group.
Further, the pre-storage memory outputting its content to the adder as the second input data of the adder based on the operation type comprises: when the operation type is a write operation, the content of the pre-storage memory is output to the adder; when the operation type is any other type, the content of the pre-storage memory is not output.
Further, the result memory outputting its content to the pre-storage memory or the output part based on the operation type comprises: when the operation type is a write operation, the content of the result memory is output to the pre-storage memory; when the operation type is a read operation, the content of the result memory is output to the output part; when the operation type is any other type, the content of the result memory is not output.
Further, the input of the in-memory computing unit comprises a read-write state, a Block instruction group, a reset flag and a custom activation configuration; the Block instruction group comprises a plurality of Block instructions, and a Block instruction comprises an activation selection, an input data group and a data address group; wherein,
the reset flag is used for representing the reset state of the computation;
the custom activation configuration is used for determining the coefficients of the custom polynomial;
the activation selection is input into the activation unit;
the read-write state, the input data group and the data address group are input into the state decoder.
Further, the Block instruction further includes a Block address, and the Block address is used to determine the valid Blocks that participate in the computation.
Furthermore, the output of the in-memory computing unit is obtained by splicing the output data of all Blocks participating in the computation, and the output data of a Block is obtained by splicing the output data of all Cells participating in the computation within that Block.
In another aspect, the invention also provides an in-memory computation acceleration method using the above in-memory computing unit, which specifically comprises the following steps:
S1, determining the Blocks participating in the computation based on the Block instruction group of the input data;
S2, determining the Cells participating in the computation based on the input data group and the data address group in the Block instruction;
S3, determining the operation type of each Cell in the in-memory computing unit based on the read-write state;
S4, based on the operation type, the Cell performing the corresponding accumulation operation on the input data;
S5, based on the activation selection, the Cell performing the activation operation on the accumulation result to obtain the output result of the Cell;
S6, obtaining the result of the in-memory computation based on the output results of the Cells.
The invention can realize at least one of the following beneficial effects:
1. By putting the accumulation operation into the storage Cell, the output of the DNN accelerator is multiplexed without the partial sum having to be fetched and then stored again, so the operation speed is increased and the operation power consumption is reduced.
2. The combination of the input data group and the data address group is input into the memory, and the addressing computation is then carried out inside it; that is, after the inner-product operation unit produces the corresponding partial sum, the accumulation can be completed directly with a single addressing, and the 'fetch → compute → write' procedure shrinks to a single 'write', with the fetch and the computation completed by the in-memory computing unit. Compared with addressing before storing, this is more efficient, further increasing the operation speed and reducing the power consumption.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a schematic diagram of the location of a memory computing unit in a DNN computing architecture according to the present invention;
FIG. 2 is a schematic diagram of a memory computing unit according to the present invention;
FIG. 3 is a diagram of a Cell structure of a memory computing unit according to the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
Device embodiment
One embodiment of the present invention discloses an in-memory computing unit for performing concurrent computation tasks of a DNN model.
As shown in fig. 2, the in-memory computing unit comprises m Blocks arranged in parallel; each Block comprises n parallel Cells, and each Cell comprises a storage accumulation part and an output part; the storage accumulation part is used for storing input data and performing the accumulation operation, and the output part is used for activating the result of the accumulation operation to obtain the output result of the Cell.
Each Block is used for completing one of the computation tasks, and the value of m is determined by the maximum number of concurrent tasks.
Each Cell is used for operating on data based on the instruction of the Block in which it is located; the value of n is determined by the size reserved for the output storage space, which in turn is determined by the DNN model.
Specifically, as shown in fig. 3, the storage accumulation part of each Cell comprises a state decoder, a pre-storage memory, a result memory, an adder and a first data selector, and the output part of each Cell comprises an activation unit containing a plurality of activation functions, a custom polynomial and a second data selector.
The state decoder is used for receiving the read-write state, the input data group and the data address group. Optionally, the Cell may also receive a Block address, in which case the Block address is spliced with the data address group before being used as the input.
The read-write state is generated by a read enable signal and a write enable signal and is used for determining the operation types of the pre-storage memory and the result memory.
The input data set comprises a set of data used for participating in calculation and a Cell address corresponding to the data.
And the data address group is used for indicating whether the Cell participates in the calculation.
And the Block address is used for indicating the address of the Block where the Cell is located.
And the state decoder respectively outputs the corresponding operation types to the pre-storage memory and the result memory based on the read-write state.
Specifically, the read-write state is formed from a read enable signal (0 or 1) and a write enable signal (0 or 1), and the operation types include sleep, write, stop write and read. The correspondence between the read-write state and the operation type is shown in Table 1:
[Table 1 appears as an image in the original publication; it pairs each combination of the read enable and write enable signals with one of the four operation types: sleep, write, stop write and read.]
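As a minimal illustration of this decode, the Python sketch below maps the (read enable, write enable) pair to an operation type. Only the four type names appear in the text; the exact pairing lives in the Table 1 drawing, so the mapping chosen here is an assumption for illustration.

```python
# Assumed decode of Table 1; the four type names come from the description.
OP_TABLE = {
    (0, 0): "sleep",       # assumed: neither enable asserted
    (0, 1): "write",       # assumed: write enable only, accept a partial sum
    (1, 1): "stop_write",  # assumed: both enables end the write phase
    (1, 0): "read",        # assumed: read enable only, emit the result
}

def decode_op(read_en, write_en):
    """State decoder: read-write state -> operation type."""
    return OP_TABLE[(read_en, write_en)]
```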
Based on the Cell addresses in the data address group, the state decoder judges whether its Cell participates in the operation and outputs the result.
Specifically, if the data address group does not contain the address of the Cell, all inputs of the Cell are ignored and the Cell does not participate in the computation; if the address of the Cell is present, the data participating in this Cell's computation is located in the input data group based on the Cell address and output to the adder. The Block address is used for determining the Blocks participating in the computation; inside a Block, the state decoder does not process the Block address.
The adder is used for performing the addition operation on the data participating in the Cell computation and the data in the pre-storage memory, and its result serves as one input of the first data selector.
The result memory is used for outputting its content to the pre-storage memory or the activation unit based on the operation type.
Specifically, when the operation type is a write operation, the content of the result memory is output to the pre-storage memory; when the operation type is a read operation, the content of the result memory is output to the activation unit; when the operation type is any other type, the content of the result memory is not output.
Specifically, the length of the result memory is determined by the maximum value of the DNN-model computation results supported by the in-memory computing unit, and must accommodate the upper limit of the accumulation result.
Preferably, the result memory is an SRAM.
The pre-storage memory is used for outputting its content to the adder based on the operation type.
Specifically, when the operation type is a write operation, the content of the pre-storage memory is output to the adder; when the operation type is any other type, the content of the pre-storage memory is not output.
Specifically, the length of the pre-storage memory is determined by the maximum value of the DNN-model computation results supported by the in-memory computing unit, and must accommodate the upper limit of the accumulation result.
Preferably, the pre-storage memory is an SRAM.
The first data selector selects between two inputs based on the reset flag and outputs the selection to the result memory: one input is the output of the adder, and the other is logic 0. The reset flag represents the reset state of the computation; it is true before the computation begins and no longer true once the computation has begun.
Specifically, when the reset flag is true, the output of the first data selector is 0; when the reset flag is not true, the output of the first data selector is the output of the adder.
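The storage accumulation part described above can be summarized in the following behavioral Python sketch (a model, not RTL). The class and method names are invented, and the exact cycle in which each memory updates is an interpretation of this description and of step S4 of the method embodiment.

```python
class CellAccumulator:
    """Behavioral model of one Cell's storage accumulation part."""

    def __init__(self):
        self.pre_store = 0    # pre-storage memory (an SRAM in this embodiment)
        self.result = 0       # result memory (an SRAM in this embodiment)
        self._adder_out = 0   # adder output, latched here for readability

    def step(self, op, operand=0, reset=False):
        if reset:
            self.result = 0   # first data selector outputs logic 0
            return None
        if op == "write":
            self.pre_store = self.result                 # result -> pre-storage
            self._adder_out = operand + self.pre_store   # adder: 1st + 2nd input
        elif op == "stop_write":
            self.result = self._adder_out   # first data selector passes adder
        elif op == "read":
            return self.result              # accumulated sum to output section
        return None                         # "sleep": both memories unchanged
```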
The activation unit is used for determining the corresponding activation function or the custom polynomial based on the activation selection, and for performing the activation operation on the input data to obtain the output data of the Cell in which it is located.
Specifically, the activation functions and the custom polynomial in the activation unit each compute on the input data and output their results to the second data selector, and the second data selector selects the corresponding output based on the activation selection to obtain the output data.
Specifically, the custom polynomial determines its coefficients based on the custom activation configuration.
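A compact Python sketch of this output path follows. The particular fixed functions (ReLU, sigmoid, tanh) are assumptions for illustration, since the description only specifies 'a plurality of activation functions' alongside the custom polynomial.

```python
import math

def activation_unit(x, act_sel, poly_coeffs):
    """All candidates are computed; the second data selector picks one."""
    candidates = [
        max(0.0, x),                     # assumed fixed function: ReLU
        1.0 / (1.0 + math.exp(-x)),      # assumed fixed function: sigmoid
        math.tanh(x),                    # assumed fixed function: tanh
        sum(c * x ** i for i, c in enumerate(poly_coeffs)),  # custom polynomial
    ]
    return candidates[act_sel]           # act_sel: the activation selection
```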
Next, the relationship between the input and output data of the in-memory computing unit and the input and output data of the Blocks and Cells is described.
Specifically, the input of the in-memory computing unit comprises a read-write state, a Block instruction group, a reset flag and a custom activation configuration, all connected by splicing.
The read-write state data comprises the read-write states of all Cells of the in-memory computing unit, with a data length of m × n × 4 bits; the correspondence between a read-write state and its Cell is implied by the position of the state within the whole data word. After the data are input into the in-memory computing unit, the read-write state of each Cell is delivered to the corresponding Cell based on this positional relationship.
The Block instruction group comprises a plurality of Block instructions connected by splicing; the number of Block instructions equals the number of concurrent tasks to be completed.
A Block instruction comprises an activation selection, an input data group and a data address group, and optionally a Block address.
When a Block instruction does not include a Block address, the correspondence between the data and its Block is implied by the position within the whole data word, and the in-memory computing unit controls which Blocks participate in the computation based on this positional information; when a Block instruction includes a Block address, the instruction is matched to its Block through that address, and the in-memory computing unit judges the valid Blocks participating in the computation based on the Block address.
A Block instruction is broadcast to all Cells in its Block, so the inputs of all these Cells are the same Block instruction.
The activation selection is used by the activation unit to select an activation function or the custom polynomial, and is a 2-bit or 3-bit value.
The input data group and the data address group correspond to each other: the data address group contains the addresses of the Cells in the Block that need to participate in the computation, and the input data group contains the input data of the corresponding Cells.
The reset flags comprise the reset flags for all Blocks of the in-memory computing unit, with a length of m bits; the Cells of one Block share a single reset flag. The correspondence between a reset flag and its Block is implied by the position within the whole data word.
The custom activation configuration comprises the custom activation configurations of all Blocks of the in-memory computing unit, and the Cells of one Block use the same custom activation configuration. The correspondence between a custom activation configuration and its Block is implied by the position within the whole data word.
The output of the in-memory computing unit is obtained by splicing the output data of all Blocks participating in the computation, and the output data of a Block is obtained by splicing the output data of all Cells participating in the computation within that Block.
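The positional convention used throughout these spliced inputs and outputs can be sketched as plain list slicing in Python; the field widths and the three-Block example are illustrative assumptions.

```python
def split_by_position(packed, m):
    """Recover m equal-width per-Block fields from one spliced input word."""
    width = len(packed) // m
    return [packed[i * width:(i + 1) * width] for i in range(m)]

# Example: reset flags for m = 3 Blocks spliced into one m-bit field.
flags = split_by_position([1, 0, 1], 3)   # -> [[1], [0], [1]]
# Blocks 0 and 2 are reset; every Cell of a Block shares its Block's flag.
```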
This embodiment discloses an in-memory computing unit that puts the accumulation operation into the storage Cell. On the one hand, output multiplexing is realized without the partial sum having to be fetched and then stored again, so the operation speed is increased and the operation power consumption is reduced; on the other hand, for the input data, the combination of the input data group and the data address group is input into the memory and the addressing computation is then performed inside it, which is more efficient than addressing before storing, further increasing the operation speed and reducing the power consumption. The in-memory computing unit of this embodiment effectively solves the memory-wall problem at the output end.
Method embodiment
The invention also discloses an in-memory computation acceleration method using the in-memory computing unit of the above embodiment, which specifically comprises the following steps:
S1, determining the Blocks participating in the computation based on the Block instruction group of the input data.
Specifically, when a Block instruction in the Block instruction group does not include a Block address, the Blocks participating in the computation are determined based on the position of the Block instruction within the Block instruction group; when a Block instruction includes a Block address, the Blocks participating in the computation are determined based on the Block address.
S2, determining the Cells participating in the computation based on the input data group and the data address group in the Block instruction.
Specifically, the Block instruction of a Block is transmitted to each of its Cells by broadcast. For a given Cell, if the data address group does not contain the address of the Cell, all inputs of the Cell are ignored and the Cell does not participate in the computation; if the address of the Cell is present, the corresponding data participating in the Cell computation is located in the input data group and output to the adder. The state decoder does not process Block addresses.
S3, determining the operation type of each Cell in the in-memory computing unit based on the read-write state.
Specifically, referring to Table 1 of the device embodiment, the operation type of a Cell is determined, based on the read-write state, to be sleep, write, stop write or read.
S4, based on the operation type, the Cell performs the corresponding accumulation operation on the input data.
Specifically, before the computation starts, the reset flag is true and the output of the first data selector is 0, i.e. the content of the result memory is set to 0. When the operation type is write, the Cell loads the content of the result memory into the pre-storage memory and outputs the data participating in the Cell computation, obtained by the state decoder, to the adder; after the write operation finishes, the operation type changes to stop write, the content of the pre-storage memory is no longer updated, and the computation result of the adder is output to the result memory. By repeating these steps, the Cell accumulates the whole input data set. When the operation type changes to read, the accumulation has finished, and the content of the result memory, which is the accumulation result, is output to the activation unit; this sequence is traced in the sketch below.
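Tracing this sequence with the CellAccumulator sketch from the device embodiment (again an interpretive model, with the partial-sum values invented for the example):

```python
cell = CellAccumulator()
cell.step("write", reset=True)          # before computation: result := 0
for partial in (3, 5, 7):               # partial sums from inner products
    cell.step("write", operand=partial) # result -> pre-storage; adder adds
    cell.step("stop_write")             # commit the adder output to result
assert cell.step("read") == 15          # accumulation result to activation
```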
S5, based on the activation selection, the Cell performs the activation operation on the accumulation result to obtain the output result of the Cell.
Specifically, the activation functions and the custom polynomial in the activation unit each compute on the accumulation result of step S4, and the activation unit selects one computation result as the output result of the Cell based on the activation selection.
S6, obtaining the result of the in-memory computation based on the output results of the Cells.
Specifically, the output data of all Cells participating in the computation within the same Block are spliced to obtain the output result of the Block, and the output results of all Blocks participating in the computation are spliced to obtain the result of the in-memory computation.
Compared with the prior art, the in-memory computation acceleration method of this embodiment puts the accumulation operation into the storage Cell, realizing output multiplexing without the partial sum having to be fetched and stored again; moreover, the step of addressing before storing is avoided, further improving efficiency and accelerating the operation.
It should be noted that the above embodiments are based on the same inventive concept; their common details are not repeated, and the embodiments may refer to one another.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. An in-memory computing unit comprising a plurality of Blocks, each Block comprising a plurality of Cells, each Cell comprising a storage accumulation part and an output part; the storage accumulation part is used for storing input data and performing the accumulation operation, and the output part is used for activating the result of the accumulation operation to obtain the output result of the Cell;
the storage accumulation part comprises a state decoder, a pre-storage memory, a result memory, an adder and a first data selector; wherein,
the state decoder is used for determining the operation types of the pre-storage memory and the result memory based on the input data, and for obtaining the first input data of the adder based on the input data;
the pre-storage memory is used for outputting its content to the adder, as the second input data of the adder, based on the operation type;
the adder is used for performing the addition operation on the first input data and the second input data;
the first data selector is used for selecting, based on the reset flag, either to set the result memory to 0 or to store the output of the adder into the result memory;
the result memory is used for outputting its content to the pre-storage memory or the output part based on the operation type.
2. The in-memory computing unit of claim 1, wherein each Block is used for completing one computation task, and the number of Blocks is determined by the maximum number of concurrent tasks; each Cell is used for operating on data based on the instruction of the Block in which it is located, the number of Cells is determined by the size reserved for the output storage space, and that size is determined by the DNN model.
3. The in-memory computing unit of claim 1, wherein the output part comprises an activation unit, the activation unit comprising a plurality of activation functions, a custom polynomial and a second data selector, wherein,
the activation functions and the custom polynomial in the activation unit are used for performing the corresponding activation operations on the input data respectively;
the second data selector is used for selecting one activation-operation result as the output result of the Cell based on the activation selection.
4. The in-memory computing unit of claim 2 or 3, wherein the state decoder determining the operation types of the pre-storage memory and the result memory based on the input data, and obtaining the first input data of the adder based on the input data, comprises:
the input data comprise a read-write state, an input data group and a data address group;
the operation types comprise sleep, write, stop write and read;
the operation types of the pre-storage memory and the result memory are determined based on the read-write state;
the first input data of the adder is obtained based on the input data group and the data address group.
5. The in-memory computing unit of claim 4, wherein the pre-storage memory outputting its content to the adder as the second input data of the adder based on the operation type comprises: when the operation type is a write operation, the content of the pre-storage memory is output to the adder; when the operation type is any other type, the content of the pre-storage memory is not output.
6. The in-memory computing unit of claim 4, wherein the result memory outputting its content to the pre-storage memory or the output part based on the operation type comprises: when the operation type is a write operation, the content of the result memory is output to the pre-storage memory; when the operation type is a read operation, the content of the result memory is output to the output part; when the operation type is any other type, the content of the result memory is not output.
7. The in-memory computing unit of claim 4, wherein the input of the in-memory computing unit comprises a read-write state, a Block instruction group, a reset flag and a custom activation configuration; the Block instruction group comprises a plurality of Block instructions, and a Block instruction comprises an activation selection, an input data group and a data address group; wherein,
the reset flag is used for representing the reset state of the computation;
the custom activation configuration is used for determining the coefficients of the custom polynomial;
the activation selection is input into the activation unit;
the read-write state, the input data group and the data address group are input into the state decoder.
8. The in-memory computing unit of claim 7, wherein the Block instruction further comprises a Block address, the Block address being used to determine the valid Blocks that participate in the computation.
9. The in-memory computing unit of claim 7 or 8, wherein the output of the in-memory computing unit is obtained by splicing the output data of all Blocks participating in the computation, and the output data of a Block is obtained by splicing the output data of all Cells participating in the computation within that Block.
10. An in-memory computation acceleration method using the in-memory computing unit of claim 9, characterized by comprising the following steps:
S1, determining the Blocks participating in the computation based on the Block instruction group of the input data;
S2, determining the Cells participating in the computation based on the input data group and the data address group in the Block instruction;
S3, determining the operation type of each Cell in the in-memory computing unit based on the read-write state;
S4, based on the operation type, the Cell performing the corresponding accumulation operation on the input data;
S5, based on the activation selection, the Cell performing the activation operation on the accumulation result to obtain the output result of the Cell;
S6, obtaining the result of the in-memory computation based on the output results of the Cells.
CN202211215424.0A 2022-09-30 2022-09-30 In-memory computing unit and acceleration method Active CN115586885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211215424.0A CN115586885B (en) 2022-09-30 2022-09-30 In-memory computing unit and acceleration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211215424.0A CN115586885B (en) 2022-09-30 2022-09-30 In-memory computing unit and acceleration method

Publications (2)

Publication Number Publication Date
CN115586885A (en) 2023-01-10
CN115586885B CN115586885B (en) 2023-05-05

Family

ID=84778137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211215424.0A Active CN115586885B (en) 2022-09-30 2022-09-30 In-memory computing unit and acceleration method

Country Status (1)

Country Link
CN (1) CN115586885B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389214A (en) * 2017-08-11 2019-02-26 谷歌有限责任公司 Neural network accelerator with the parameter resided on chip
CN111667051A (en) * 2020-05-27 2020-09-15 上海赛昉科技有限公司 Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
WO2022199684A1 (en) * 2021-03-26 2022-09-29 南京后摩智能科技有限公司 Circuit based on digital domain in-memory computing
CN113688984A (en) * 2021-08-25 2021-11-23 东南大学 In-memory binarization neural network computing circuit based on magnetic random access memory
CN114999544A (en) * 2022-05-27 2022-09-02 电子科技大学 Memory computing circuit based on SRAM

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115981594A (en) * 2023-03-20 2023-04-18 国仪量子(合肥)技术有限公司 Data accumulation processing method and device, FPGA chip and medium

Also Published As

Publication number Publication date
CN115586885B (en) 2023-05-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant