CN115586885B - In-memory computing unit and acceleration method - Google Patents


Info

Publication number
CN115586885B
CN115586885B CN202211215424.0A CN202211215424A
Authority
CN
China
Prior art keywords
memory
output
input data
result
data
Prior art date
Legal status
Active
Application number
CN202211215424.0A
Other languages
Chinese (zh)
Other versions
CN115586885A (en)
Inventor
张盛
赵越
李政
Current Assignee
Crystal Iron Semiconductor Technology Guangdong Co ltd
Original Assignee
Crystal Iron Semiconductor Technology Guangdong Co ltd
Priority date
Filing date
Publication date
Application filed by Crystal Iron Semiconductor Technology Guangdong Co ltd filed Critical Crystal Iron Semiconductor Technology Guangdong Co ltd
Priority to CN202211215424.0A priority Critical patent/CN115586885B/en
Publication of CN115586885A publication Critical patent/CN115586885A/en
Application granted granted Critical
Publication of CN115586885B publication Critical patent/CN115586885B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Memory System (AREA)

Abstract

The invention relates to an in-memory computing unit and an acceleration method, and belongs to the field of in-memory computing. The invention combines the output memory of a DNN accelerator with the inner-product accumulation unit and places the accumulation operation inside the memory Cell, solving the memory-wall problem at the output end. The in-memory computing unit comprises a plurality of Blocks; each Block comprises parallel Cells, and each Cell comprises a state decoder, a pre-storage memory, a result memory, an adder, a data selector and an activation unit; the activation unit includes activation functions, a custom polynomial and a data selector. Because the accumulation operation is placed inside the memory Cell, no process of taking results out, accumulating and storing them back is needed, so output multiplexing is realized, operation speed is improved and operation power consumption is reduced. The combination of the input data group and the data address group is input into the memory, where the addressing calculation is carried out, further improving operation speed and reducing power consumption.

Description

In-memory computing unit and acceleration method
Technical Field
The invention belongs to the field of in-memory computing, and particularly relates to an in-memory computing unit and an acceleration method.
Background
In recent years, Artificial Intelligence (AI) technology has developed rapidly, and deep neural networks (DNNs), one of its most important representatives, are used particularly widely. Because of the particularities of DNN workloads, such as data multiplexing and convolution computation, conventional CPU and GPU processors are poorly suited to them, which has driven lightweight model processing and the emergence of AI accelerators.
The deep learning model processed by an AI accelerator multiplies and accumulates a number of input Activations with Weight values to generate the corresponding Outputs; compared with a traditional computing architecture, the accelerator realizes operation of the same scale with less overhead.
One of the most important features of the deep learning model is its high degree of reusability, which is reflected in the three elements involved in DNN calculation: Activation, Weight and Output. When the same network is used for processing different inputs, Weight multiplexing occurs; when the same Feature Map or Activation needs to be operated on by several networks, or by several different Kernels to obtain the values corresponding to different channels of the next layer, the Activations are multiplexed; because of hardware resource limitations, the Output at the same position may be obtained through multiple operations, and at that point the Output needs to be multiplexed.
The special data-multiplexing property of DNN operation causes a memory wall in traditional computing architectures, and various accelerators try to save read-write operations at every level of storage by multiplexing data in hardware, so as to improve the speed and energy efficiency of DNN computation.
In the prior art, in-memory computation is designed at the input memory to solve the memory-wall problem: a computing unit is usually placed directly inside the memory module to remove the bottleneck caused by memory access. Because DNN models differ in size, the memory size must be traded off for versatility. When the in-memory computing unit using input multiplexing is too small, or the memory level is too high, frequent reads and writes introduce extra overhead and in-memory computing loses much of its point; when it is designed too large, the model cannot be fully deployed. Moreover, as long as data multiplexing is realized, the problem of Tiling is necessarily involved, which forces additional memory accesses when partial results must be accumulated: the partial sums produced by the inner-product operation have to be stored in memory, and the inner-product computing unit then performs a 'fetch -> compute -> write' round trip. This process involves at least two addressing steps and is almost completely unparallelizable; the addressing steps waste power and cycles and become the real bottleneck.
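To make the bottleneck concrete, the following hypothetical Python pseudocode (the mem.accumulate method is invented for illustration and is not an API described in the patent) contrasts the conventional 'fetch -> compute -> write' round trip, with its two addressing steps, against the single-addressing update that in-memory accumulation enables:

```python
def conventional_update(mem, addr, partial_sum):
    acc = mem[addr]            # addressing step 1: fetch the stored partial sum
    acc = acc + partial_sum    # accumulate outside the memory
    mem[addr] = acc            # addressing step 2: write the new partial sum back

def in_memory_update(mem, addr, partial_sum):
    # single addressing step; the addition happens inside the memory cell.
    # `accumulate` is an invented stand-in for the in-memory adder.
    mem.accumulate(addr, partial_sum)
```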
Disclosure of Invention
In view of the above analysis, the present invention aims to provide an in-memory computing unit and an acceleration method which combine the output memory of a DNN accelerator with the inner-product accumulation unit and place the accumulation operation inside the memory Cell. Multiplexing of the DNN accelerator's output is realized without the process of taking results out, accumulating and then storing them back, which solves the memory-wall problem at the output end of the DNN accelerator and thereby improves operation speed and reduces operation power consumption.
The location of the in-memory computing unit in the DNN computing architecture of the present invention is shown in fig. 1.
In one aspect, the present invention provides an in-memory computing unit comprising a plurality of Blocks, each Block comprising a plurality of Cells, each Cell comprising a storage accumulation part and an output part; the storage accumulation part is used for storing input data and performing an accumulation operation, and the output part is used for activating the result of the accumulation operation to obtain an output result of the Cell;
the storage accumulation part comprises a state decoder, a pre-storage memory, a result memory, an adder and a first data selector; wherein,
the state decoder is used for determining operation types of the pre-storage memory and the result memory based on the input data and obtaining first input data of the adder based on the input data;
the pre-storage memory is used for outputting the content in the pre-storage memory to the adder as second input data of the adder based on the operation type;
the adder is used for carrying out addition operation on the first input data and the second input data;
the first data selector is used for selecting to set the result memory to 0 or store the output of the adder into the result memory based on the reset flag;
the result memory is used for outputting the content in the result memory to the pre-storage memory or the output part based on the operation type.
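As an illustration only, the following Python behavioral model sketches how these five components interact; all names (CellAccumulator, OpType, step) are invented for this sketch, and the write/stop-write timing follows the description given later in the device embodiment.

```python
from enum import Enum

class OpType(Enum):
    SLEEP = 0
    WRITE = 1
    STOP_WRITE = 2
    READ = 3

class CellAccumulator:
    """Behavioral sketch of one Cell's storage accumulation part."""

    def __init__(self):
        self.pre_store = 0   # pre-storage memory
        self.result = 0      # result memory
        self.adder_out = 0   # adder output, latched by the first data selector

    def step(self, op, first_input=0, reset=False):
        """One update of the Cell; returns a value only on a read."""
        if reset:
            # reset flag true: the first data selector picks logic 0,
            # clearing the result memory before a calculation starts
            self.result = 0
            return None
        if op is OpType.WRITE:
            # the result memory feeds the pre-storage memory, and the adder
            # sums the decoded first input with the pre-stored partial sum
            self.pre_store = self.result
            self.adder_out = self.pre_store + first_input
        elif op is OpType.STOP_WRITE:
            # the first data selector stores the adder output into the
            # result memory; the pre-storage memory stops changing
            self.result = self.adder_out
        elif op is OpType.READ:
            # the result memory hands the accumulated value to the output part
            return self.result
        return None  # SLEEP: nothing changes
```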
Further, each Block is used for completing one calculation task, and the number of Blocks is determined by the maximum number of concurrent tasks; each Cell is used for operating on data based on the instruction of the Block where it is located, the number of Cells is determined by the size reserved for the output storage space, and the size reserved for the output storage space is determined by the DNN model.
Further, the output section includes an activation unit, the activation unit including a plurality of activation functions, a custom polynomial and a second data selector, wherein,
the activation unit is used for performing the corresponding activation operation on the input data with each of the plurality of activation functions and the custom polynomial;
the second data selector is used for selecting one activation operation result as the output result of the Cell based on the activation selection.
Further, that the state decoder is configured to determine the operation types of the pre-storage memory and the result memory based on the input data, and to obtain the first input data of the adder based on the input data, includes:
the input data comprises a read-write state, an input data group and a data address group;
the operation types include sleep, write, stop write, and read;
determining operation types of a pre-storage memory and a result memory based on the read-write state;
first input data of the adder is obtained based on the input data set and the data address set.
Further, the pre-storage memory being configured to output, based on the operation type, the content of the pre-storage memory to the adder as the second input data of the adder includes: when the operation type is a write operation, the content of the pre-storage memory is output to the adder; when the operation type is any other type, the content of the pre-storage memory is not output.
Further, the outputting of the content of the result memory to the pre-storage memory or the output section based on the operation type includes: when the operation type is a write operation, the content of the result memory is output to the pre-storage memory; when the operation type is a read operation, the content of the result memory is output to the output section; when the operation type is any other type, the content of the result memory is not output.
Further, the input of the in-memory computing unit comprises a read-write state, a Block instruction group, a reset flag and a custom activation configuration; the Block instruction group comprises a plurality of Block instructions, and each Block instruction comprises an activation selection, an input data group and a data address group; wherein,
the reset flag is used for representing the reset state of the calculation;
the custom activation configuration is used by the custom polynomial to determine its coefficients;
the activation selection is input into the activation unit;
the read-write state, the input data group and the data address group are input into the state decoder.
Further, the Block instruction further comprises a Block address, and the Block address is used for determining the valid Blocks participating in the calculation.
Further, the output of the in-memory computing unit is obtained by splicing the output data of all the Blocks participating in the computation, and the output data of a Block is obtained by splicing the output data of all the Cells participating in the computation in that Block.
On the other hand, the invention also provides an in-memory computing acceleration method using the in-memory computing unit, which specifically comprises the following steps:
step S1, determining a Block participating in calculation based on a Block instruction group of input data;
step S2, determining a Cell participating in calculation based on an input data group and a data address group in a Block instruction;
step S3, determining the operation type of each Cell in the in-memory computing unit based on the read-write state;
step S4, based on the operation type, the Cell carries out corresponding accumulation operation on the input data;
step S5, based on the activation selection, the Cell performs an activation operation on the accumulated operation result to obtain the output result of the Cell;
Step S6, obtaining a result of in-memory calculation based on the output result of each Cell.
The invention can realize at least one of the following beneficial effects:
1. through putting the accumulation operation into the storage Cell, the process of taking out and accumulating and then storing is not needed, and the output multiplexing of the DNN accelerator is realized, so that the operation speed is improved and the operation power consumption is reduced.
2. The combination of the input data group and the data address group is input into the memory, where the addressing calculation is performed: after the inner-product operation unit generates the corresponding partial sums, the accumulation operation can be completed with a single direct addressing. The 'fetch -> compute -> write' process is thus reduced to just 'write', since the fetch and the computation are completed by the in-memory computing unit. Compared with the mode of addressing before storing, this is more efficient, further improving operation speed and reducing power consumption.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is a schematic diagram of the location of an in-memory computing unit in a DNN computing architecture according to the present invention;
FIG. 2 is a schematic diagram of an in-memory computing unit according to the present invention;
FIG. 3 is a schematic diagram of a Cell structure of an in-memory computing unit according to the present invention.
Detailed Description
Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and together with the description serve to explain the principles of the invention, and are not intended to limit the scope of the invention.
Device embodiment
In one embodiment of the invention, an in-memory computing unit is disclosed, which is used for completing concurrent computing tasks of a DNN model.
As shown in fig. 2, the in-memory computing unit comprises m parallel Blocks; each Block comprises n parallel Cells, and each Cell comprises a storage accumulation part and an output part. The storage accumulation part is used for storing input data and performing the accumulation operation; the output part is used for activating the result of the accumulation operation to obtain the output result of the Cell.
Each Block is used for completing one of the calculation tasks, and the value of m is determined by the maximum number of concurrent tasks.
Each Cell is used for operating on data based on the instruction of the Block where it is located; the value of n is determined by the size reserved for the output storage space, which is in turn determined by the DNN model.
Specifically, as shown in fig. 3, the storage accumulation section of each Cell includes a state decoder, a pre-storage memory, a result memory, an adder, and a first data selector, and the output section of each Cell includes an activation unit including a plurality of activation functions, a custom polynomial, and a second data selector.
The state decoder is used for receiving the read-write state, the input data group and the data address group. Optionally, the Cell may also receive the Block address, which is then spliced onto the data address group as input.
The read-write state is generated by a read enable signal and a write enable signal and is used for determining operation types of the pre-storage memory and the result memory.
The input data set comprises a set of data for participating in calculation and a Cell address corresponding to the data.
The data address group is used for marking whether the Cell participates in the calculation.
The Block address is used for marking the address of the Block where the Cell is located.
The state decoder outputs the corresponding operation types to the pre-storage memory and the result memory respectively based on the read-write state.
Specifically, the read-write state includes a read enable signal (0, 1) and a write enable signal (0, 1); the operation types include sleep, write, stop write, and read. The correspondence between the read-write state and the operation type is shown in Table 1:
TABLE 1
(Table 1 appears as an image in the original publication; it maps each combination of the read enable and write enable signals to one of the four operation types: sleep, write, stop write and read.)
The state decoder judges whether the Cell participates in the operation or not based on the Cell address in the data address group and outputs a result.
Specifically, if the address of the Cell does not exist in the data address group, all inputs of the Cell are ignored and the Cell does not participate in the calculation; if the address exists, the data participating in this Cell's calculation is found in the input data group based on the Cell address and output to the adder. The Block address is used for determining the Blocks participating in the calculation; the state decoders inside a Block do not process the Block address.
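A minimal sketch of this decode step, assuming list-like Python containers for the data address group and the input data group (the physical encoding is not specified at this level):

```python
def decode_first_input(cell_addr, data_address_group, input_data_group):
    """Return the adder's first input for this Cell, or None if the Cell's
    address is absent from the data address group (the Cell then ignores
    all of its inputs and sits out this calculation)."""
    if cell_addr not in data_address_group:
        return None
    idx = data_address_group.index(cell_addr)
    return input_data_group[idx]
```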
The adder is used for adding the data participating in the Cell calculation and the data in the pre-storage memory, and the result is used as one input of the first data selector.
The result memory is used for outputting its content to the pre-storage memory or the activation unit based on the operation type.
Specifically, when the operation type is a write operation, the content of the result memory is output to the pre-storage memory; when the operation type is a read operation, the content of the result memory is output to the activation unit; when the operation type is any other type, the content of the result memory is not output.
Specifically, the length of the result memory is determined by the maximum value of the DNN model calculation result supported by the in-memory calculation unit, and the length of the result memory needs to satisfy the upper limit of the result of the accumulation calculation.
Preferably, the result memory is SRAM.
The pre-storage memory is used for outputting the content in the pre-storage memory to the adder based on the operation type.
Specifically, when the operation type is a write operation, the content of the pre-storage memory is output to the adder. When the operation type is other types, the content of the pre-storage memory is not output.
Specifically, the length of the pre-storage memory is determined by the maximum value of the DNN model calculation result supported by the in-memory calculation unit, and the length of the pre-storage memory needs to satisfy the upper limit of the result of the accumulation calculation.
Preferably, the pre-memory is an SRAM.
The first data selector selects between two inputs based on the reset flag and outputs the selected value to the result memory; one input is the output of the adder, and the other is a logic 0. The reset flag represents the reset state of the calculation: before the calculation starts, the reset flag is true; after the calculation starts, the reset flag is false.
Specifically, when the reset flag is true, the output of the first data selector is 0; when the reset flag is false, the output of the first data selector is the output of the adder.
The activation unit is used for determining a corresponding activation function or a custom polynomial based on activation selection, and performing activation operation on the input data to obtain the output data of the Cell.
Specifically, the plurality of activation functions and the custom polynomial in the activation unit each calculate on the input data and output their results to the second data selector; the second data selector selects the corresponding output based on the activation selection, so as to obtain the output data.
Specifically, the custom polynomial determines the coefficients based on the custom activation configuration.
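As a sketch only, the output part can be modeled as below; the concrete activation functions (ReLU, sigmoid, tanh) are illustrative assumptions, since the patent fixes only the structure: several activation functions and one custom polynomial evaluated in parallel, with the second data selector picking one result.

```python
import math

def activation_unit(x, activation_select, poly_coeffs):
    candidates = [
        max(0.0, x),                               # e.g. ReLU (assumed)
        1.0 / (1.0 + math.exp(-x)),                # e.g. sigmoid (assumed)
        math.tanh(x),                              # e.g. tanh (assumed)
        sum(c * x ** i for i, c in enumerate(poly_coeffs)),  # custom polynomial
    ]
    # second data selector: the activation selection picks one parallel result
    return candidates[activation_select]
```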
Next, the relationship between the input/output data of the in-memory computing unit and the input/output data of the Blocks and Cells will be described.
Specifically, the input of the in-memory computing unit includes the read-write state, the Block instruction group, the reset flag and the custom activation configuration; all of these parts are concatenated (spliced) together.
The read-write state data comprises the read-write states of all Cells of the in-memory computing unit, with a data length of m x n x 4 bits. The correspondence between a read-write state and its Cell is implicit in the position of the field within the whole piece of data; after the data is input into the in-memory computing unit, the read-write state of each Cell is routed to the corresponding Cell based on this positional relationship.
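For illustration, slicing one Cell's read-write field out of the flat input could look like this (a sketch assuming a bit-string representation; the 4-bit field width follows the patent text):

```python
def rw_state_for_cell(rw_bits: str, block_idx: int, cell_idx: int,
                      n: int, width: int = 4) -> str:
    """The Cell a field belongs to is implied purely by its position
    within the m*n*width-bit read-write state data."""
    start = (block_idx * n + cell_idx) * width
    return rw_bits[start:start + width]
```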
The Block instruction group comprises a plurality of Block instructions connected in a splicing mode, and the number of the Block instructions is the number of concurrent tasks to be completed.
The Block instruction comprises an activation selection, an input data set and a data address set, and optionally comprises a Block address.
When the Block instruction does not include a Block address, the correspondence between the data and the Block is implicit in the position of the instruction within the whole piece of data, and the in-memory computing unit controls which Blocks participate in the calculation based on this position; when the Block instruction includes a Block address, the instruction corresponds to its Block through the Block address, and the in-memory computing unit determines the valid Blocks participating in the calculation based on the Block address.
The Block instruction is broadcast to all cells in the Block, and the input of all cells is the same Block instruction.
The activation selection is used by the activation unit to select an activation function or the custom polynomial, and is a 2-bit or 3-bit value.
The input data group and the data address group correspond to each other: the data address group contains the addresses of the Cells in the Block that need to participate in the calculation, and the input data group contains the input data of the corresponding Cells.
The reset flags comprise the reset flags for all Blocks of the in-memory computing unit, with a length of m bits. The Cells of the same Block share one reset flag. The correspondence between a reset flag and its Block is implicit in the position of the bit within the whole piece of data.
The custom activation configurations comprise the custom activation configurations for all Blocks of the in-memory computing unit; the Cells of the same Block share the same custom activation configuration. The correspondence between a custom activation configuration and its Block is implicit in the position of the field within the whole piece of data.
The output of the in-memory computing unit is obtained by splicing the output data of all the Blocks participating in the computation, and the output data of a Block is obtained by splicing the output data of all the Cells participating in the computation in that Block.
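A one-function sketch of this assembly (assuming the per-Cell outputs are already collected per Block, in positional order):

```python
def splice_outputs(per_block_cell_outputs):
    """Concatenate Cell outputs into Block outputs, and Block outputs into
    the overall output of the in-memory computing unit."""
    return [cell_out
            for block in per_block_cell_outputs   # Blocks participating in computation
            for cell_out in block]                # Cells participating in computation
```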
This embodiment discloses an in-memory computing unit which places the accumulation operation inside the memory Cell. On the one hand, multiplexing of the output is realized without taking results out, accumulating and then storing them back, which improves operation speed and reduces operation power consumption; on the other hand, for the input data, the combination of the input data group and the data address group is input into the memory, where the addressing calculation is performed, which is more efficient than the mode of addressing before storing and further improves operation speed and reduces power consumption. The in-memory computing unit of this embodiment effectively solves the memory-wall problem existing at the output end.
Method embodiment
The invention also discloses an in-memory computing acceleration method using the in-memory computing unit of the embodiment, which comprises the following steps:
step S1, determining a Block participating in calculation based on a Block instruction group of input data.
Specifically, when a Block instruction in the Block instruction group does not include a Block address, determining a Block participating in calculation based on position information of the Block instruction in the Block instruction group; when the Block instruction includes a Block address, a Block participating in the calculation is determined based on the Block address.
Step S2, determining a Cell participating in calculation based on the input data group and the data address group in the Block instruction.
Specifically, the Block instruction of a Block is broadcast to each of its Cells. For a Cell, if the address of the Cell is not in the data address group, all inputs of the Cell are ignored and the Cell does not participate in the calculation; if the address exists, the corresponding data participating in the Cell's calculation is found in the input data group and output to the adder. The state decoder does not process Block addresses.
Step S3, based on the read-write state, determining the operation type of each Cell in the in-memory computing unit.
Specifically, referring to Table 1 of the device embodiment, the operation type of each Cell can be determined to be sleep, write, stop write or read based on the read-write state.
Step S4, based on the operation type, the Cell carries out corresponding accumulation operation on the input data.
Specifically, before the calculation starts, the reset flag is true and the output of the first data selector is 0, i.e. the content of the result memory is set to 0. When the operation type is write, the Cell loads the content of the result memory into the pre-storage memory, and the state decoder outputs the data participating in this Cell's calculation to the adder. When the write operation is finished, the operation type changes to stop write: the content of the pre-storage memory stops changing, and the calculation result of the adder is output to the result memory. By repeating these steps, the Cell realizes the accumulation operation over the input data set. When the operation type changes to read, the accumulation is finished, and the content of the result memory, which is the accumulation result, is output to the activation unit.
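Using the CellAccumulator sketch given earlier (an invented behavioral model, not the patent's circuit), the sequence of step S4 would play out as follows, with illustrative values:

```python
cell = CellAccumulator()
cell.step(OpType.WRITE, reset=True)            # reset flag true: result memory := 0
for partial_sum in (3, 5, 7):                  # illustrative inner-product partial sums
    cell.step(OpType.WRITE, first_input=partial_sum)  # load pre-store, add
    cell.step(OpType.STOP_WRITE)                      # latch the new partial sum
acc = cell.step(OpType.READ)                   # acc == 3 + 5 + 7 == 15
```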
Step S5, based on activation selection, the Cell performs activation operation on the accumulation operation result to obtain an output result of the Cell.
Specifically, each activation function and the custom polynomial in the activation unit respectively operate on the accumulation result of step S4, and the activation unit selects one of the calculation results as the output result of the Cell based on the activation selection.
Step S6, obtaining a result of in-memory calculation based on the output result of each Cell.
Specifically, output data of all cells participating in calculation in the same Block are spliced to obtain an output result of the Block; and splicing output results of all the blocks participating in calculation to obtain an in-memory calculation result.
Compared with the prior art, the in-memory computation acceleration method provided by this embodiment puts the accumulation operation into the memory Cell, realizing output multiplexing without the process of taking results out, accumulating and then storing them back; in addition, the step of addressing and then storing is avoided, so that efficiency is further improved and the operation is accelerated.
It should be noted that the above embodiments are based on the same inventive concept; repeated description is omitted, and the embodiments may refer to each other.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (8)

1. An in-memory computing unit comprising a plurality of Blocks, each Block comprising a plurality of Cells, each Cell comprising a storage accumulation part and an output part; the storage accumulation part is used for storing input data and performing an accumulation operation, and the output part is used for activating the result of the accumulation operation to obtain an output result of the Cell;
the storage accumulation part comprises a state decoder, a pre-storage memory, a result memory, an adder and a first data selector; wherein,
the state decoder is used for determining operation types of the pre-storage memory and the result memory based on the input data and obtaining first input data of the adder based on the input data;
the obtaining the first input data of the adder based on the input data includes:
the input data comprises a read-write state, an input data group and a data address group;
the operation types include sleep, write, stop write, and read;
determining operation types of a pre-storage memory and a result memory based on the read-write state;
obtaining first input data of the adder based on the input data group and the data address group;
the pre-storage memory is used for outputting the content in the pre-storage memory to the adder as second input data of the adder based on the operation type;
the outputting the content in the pre-memory to the adder as second input data of the adder includes:
when the operation type is a write operation, the content of the pre-storage memory is output to the adder; when the operation type is any other type, the content of the pre-storage memory is not output;
the adder is used for carrying out addition operation on the first input data and the second input data;
the first data selector is used for selecting to set the result memory to 0 or store the output of the adder into the result memory based on the reset flag;
the result memory is used for outputting the content in the result memory to the pre-storage memory or the output part based on the operation type.
2. The in-memory computing unit of claim 1, wherein each Block is configured to complete one computing task, and the number of Blocks is determined by the maximum number of concurrent tasks; each Cell is configured to operate on data based on the instruction of the Block where it is located, the number of Cells is determined by the size reserved for the output storage space, and the size reserved for the output storage space is determined by a DNN model.
3. The in-memory computing unit of claim 1, wherein the output portion comprises an activation unit comprising a plurality of activation functions, a custom polynomial, and a second data selector, wherein,
the activation unit is used for respectively carrying out corresponding activation operation on the input data by a plurality of activation functions and the custom polynomials;
the second data selector is used for selecting an activation operation result as an output result of the Cell based on activation selection.
4. The in-memory computing unit according to claim 3, wherein the result memory being configured to output the content of the result memory to the pre-storage memory based on the operation type includes: when the operation type is a write operation, the content of the result memory is output to the pre-storage memory; when the operation type is a read operation, the content of the result memory is output to the output section; when the operation type is any other type, the content of the result memory is not output.
5. The in-memory computing unit of claim 3, wherein the input to the in-memory computing unit comprises a read-write state, a Block instruction group, a reset flag, and a custom activation configuration; the Block instruction group comprises a plurality of Block instructions, and each Block instruction comprises an activation selection, an input data group and a data address group; wherein,
the reset flag is used for representing the reset state of the calculation;
the custom activation configuration is used by the custom polynomial to determine its coefficients;
the activation selection is input into the activation unit;
the read-write state, the input data group and the data address group are input into the state decoder.
6. The in-memory computing unit of claim 5, wherein the Block instruction further comprises a Block address to determine a valid Block to participate in the computation.
7. The in-memory computing unit according to claim 5 or 6, wherein the output of the in-memory computing unit is obtained by stitching output data of all blocks involved in computation, the output data of the blocks being obtained by stitching output data of all cells involved in computation in the Block.
8. An in-memory computation acceleration method using the in-memory computation unit according to claim 7, characterized by comprising the steps of:
step S1, determining a Block participating in calculation based on a Block instruction group of input data;
step S2, determining a Cell participating in calculation based on an input data group and a data address group in a Block instruction;
step S3, determining the operation type of each Cell in the in-memory computing unit based on the read-write state;
step S4, based on the operation type, the Cell carries out corresponding accumulation operation on the input data;
step S5, based on the activation selection, the Cell performs an activation operation on the accumulated operation result to obtain the output result of the Cell;
Step S6, obtaining a result of in-memory calculation based on the output result of each Cell.
CN202211215424.0A 2022-09-30 2022-09-30 In-memory computing unit and acceleration method Active CN115586885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211215424.0A CN115586885B (en) 2022-09-30 2022-09-30 In-memory computing unit and acceleration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211215424.0A CN115586885B (en) 2022-09-30 2022-09-30 In-memory computing unit and acceleration method

Publications (2)

Publication Number Publication Date
CN115586885A CN115586885A (en) 2023-01-10
CN115586885B true CN115586885B (en) 2023-05-05

Family

ID=84778137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211215424.0A Active CN115586885B (en) 2022-09-30 2022-09-30 In-memory computing unit and acceleration method

Country Status (1)

Country Link
CN (1) CN115586885B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115981594B (en) * 2023-03-20 2023-06-06 国仪量子(合肥)技术有限公司 Data accumulation processing method and device, FPGA chip and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667051A (en) * 2020-05-27 2020-09-15 上海赛昉科技有限公司 Neural network accelerator suitable for edge equipment and neural network acceleration calculation method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2568776B (en) * 2017-08-11 2020-10-28 Google Llc Neural network accelerator with parameters resident on chip
CN112711394B (en) * 2021-03-26 2021-06-04 南京后摩智能科技有限公司 Circuit based on digital domain memory computing
CN113688984B (en) * 2021-08-25 2024-01-30 东南大学 Memory binarization neural network calculation circuit based on magnetic random access memory
CN114999544A (en) * 2022-05-27 2022-09-02 电子科技大学 Memory computing circuit based on SRAM

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667051A (en) * 2020-05-27 2020-09-15 上海赛昉科技有限公司 Neural network accelerator suitable for edge equipment and neural network acceleration calculation method

Also Published As

Publication number Publication date
CN115586885A (en) 2023-01-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant