CN114418072A - Convolution operator mapping method for multi-core memristor storage and calculation integrated platform

Convolution operator mapping method for multi-core memristor storage and calculation integrated platform

Info

Publication number
CN114418072A
CN114418072A (Application CN202210104656.2A)
Authority
CN
China
Prior art keywords
core
memristor
mapping
convolution
matrix
Prior art date
Legal status
Pending
Application number
CN202210104656.2A
Other languages
Chinese (zh)
Inventor
绳伟光
邓博
李忻默
景乃锋
王琴
蒋剑飞
贺光辉
毛志刚
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202210104656.2A priority Critical patent/CN114418072A/en
Publication of CN114418072A publication Critical patent/CN114418072A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C13/00 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
    • G11C13/0002 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
    • G11C13/0021 Auxiliary circuits

Abstract

The invention discloses a convolution operator mapping method for a multi-core memristor storage and calculation integrated platform, and relates to the technical field of memristor storage and calculation integrated platforms. The method takes into account the locality of the input data and the inter-core communication overhead, uses the total communication cost as the optimization target, and finally obtains a mapping scheme with low overall communication overhead. The method comprehensively considers the communication cost of directly reading and writing memory and the communication cost of multi-core synchronization, can effectively reuse input data, exploits the parallelism of the memristor arrays, and obtains a mapping scheme with lower communication cost. The method is simple to implement and highly portable; it can be executed at the back end of a neural network compiler to complete convolution operator mapping for the multi-core memristor storage and calculation integrated platform.

Description

Convolution operator mapping method for multi-core memristor storage and calculation integrated platform
Technical Field
The invention relates to the technical field of memristor storage and calculation integrated platforms, and in particular to a convolution operator mapping method for a multi-core memristor storage and calculation integrated platform.
Background
As neural networks grow in scale and algorithmic complexity, the computation-centric von Neumann architecture is limited by memory bandwidth and the cost of data movement. Researchers have therefore proposed in-memory computing. Unlike the traditional von Neumann architecture, in-memory computing is data-centric: computation logic is placed inside the memory, or the physical characteristics of the memory are used directly for computation, which greatly reduces the communication overhead between the processing core and external storage. This makes it particularly suitable for the massively parallel, data-intensive workloads of deep learning neural networks, and accelerators have consequently appeared that use non-volatile memory (NVM) such as memristors as integrated storage-and-computation units. Such accelerators effectively relieve the bandwidth bottleneck and are characterized by low power consumption and high speed.
Memristive memory (ReRAM) is a non-volatile memory with nearly zero leakage power. It stores information as resistance states, and this resistive storage principle provides inherent computing capability, so data storage and data processing can be integrated at the same physical cell. As a leading-edge memory technology, ReRAM is expected to replace today's Flash memory, with lower cost and better performance: the energy consumption of a ReRAM memory chip can be as low as 1/20 of Flash, and its write-erase endurance is about 10 times higher. In addition, its non-volatility allows data to be kept directly in the system-on-chip, enabling instant power on/off without extra off-chip memory. ReRAM supports a variety of in-memory operations such as Matrix-Vector Multiplication (MVM), search and bit operations. Convolutional neural networks contain a large number of convolution operations that reduce to MVM, accounting for more than 95% of the computation of the whole network, and ReRAM can reduce the complexity of a matrix-vector multiplication from O(n²) to O(1), so ReRAM-based accelerators can significantly improve the efficiency of neural network computation. These favorable characteristics have led researchers to build a variety of ReRAM-based neural network accelerators. Among the many in-memory computing designs, ReRAM has significant advantages over conventional CMOS-based designs.
With the progress of deep learning research, many deep learning development frameworks such as TensorFlow and PyTorch have appeared, which simplify the development of deep learning models. Unlike the relatively unified front end, the hardware at the back end is diverse, and researchers have made many efforts to map deep learning models efficiently, such as the highly optimized linear algebra libraries MKL and cuBLAS for general-purpose processors, and the GPU-oriented TensorRT, which supports graph optimization and provides a large number of optimized kernels. The drawback of relying on libraries is that library development lags behind the development of deep learning models, so new models cannot be deployed quickly on the back end.
To solve the above problems, many deep learning compilers have been proposed that map deep learning models to back-end hardware, such as TVM, Tensor Comprehensions and XLA. Their technical details differ, but the basic flow is broadly similar: they take a model defined in a front-end deep learning framework as input, produce efficient code for various back-end hardware as output, and use multi-level IR to highly optimize both the model and the generated code during the conversion.
The basic concept of storage-and-computation integration can be traced back to the late 1960s, when it was first proposed by Kautz et al. of the Stanford Research Institute in 1969. In 2010, Professor Williams of Hewlett-Packard proposed and verified the feasibility of implementing simple Boolean logic functions with ReRAM. In 2016, Professor Xie's group at the University of California, Santa Barbara (UCSB) proposed PRIME, a storage-and-computation-integrated deep learning neural network architecture built on ReRAM, which attracted wide attention from industry. Test results show that, compared with conventional von Neumann computing architectures, PRIME can reduce power consumption by about 20 times and increase speed by about 50 times. Also in 2016, Ali Shafiee et al. of the University of Utah proposed ISAAC, a ReRAM-based convolutional neural network accelerator that improves throughput, energy consumption and computational density by 14.8x, 5.5x and 7.5x respectively compared with the DaDianNao architecture; they also proposed an intra-block pipeline design that further improves accelerator throughput. In 2019, Aayush Ankit et al. of Purdue University proposed PUMA, a memristor-based neural network accelerator.
There are currently few compilers or code generation tools in the industry oriented toward storage-and-computation-integrated programming frameworks. ReRAM-based accelerators such as ISAAC, FloatPIM and Atomlayer only discuss the computational-efficiency advantages of ReRAM arrays at the architecture level and provide no programming or compilation tools; PRIME and PipeLayer provide software/hardware programming interfaces but do not integrate their compilation tools with mainstream neural network frameworks. PUMA provides a complete compiler, but it only performs a simple mapping process without further optimization.
Therefore, those skilled in the art are dedicated to developing a convolution operator mapping method for multi-core memristor storage and calculation integrated platforms that takes into account the locality of the input data and the inter-core communication overhead, uses the total communication cost as the optimization target, and finally obtains a mapping scheme with low overall communication overhead.
Disclosure of Invention
In view of the above defects of the prior art, the technical problem to be solved by the invention is that existing approaches only perform a simple mapping of the convolution operator and do not fully exploit the locality of the memristor-array input data during convolution; moreover, for larger convolution operators, simple mapping may introduce additional inter-core communication overhead.
In order to achieve the above purpose, the invention provides a convolution operator mapping method for a multi-core memristor storage and calculation integrated platform, comprising the following steps:
Step 1: expand the weights W[OC][IC][KH][KW] of the convolution layer into a weight matrix W[IC×KH×KW][OC]; all weights in a row are multiplied by the same input in parallel and outputs are obtained in parallel by column, the output of each column being the result of multiplying and accumulating its [IC×KH×KW] weights with the corresponding inputs, i.e. each column is one convolution kernel and there are OC convolution kernels in total;
Step 2: divide the rows and columns of the matrix obtained in step 1 according to the memristor array size xbar_size; the resulting matrix is denoted P, each element of P represents one memristor array of that size, and P[i][j] is the Core ID to which the memristor array in the i-th row and j-th column belongs;
Step 3: complete the mapping for the matrix P.
Further, the multi-Core memristor storage and calculation integrated platform comprises, from top to bottom, a Core level and a Crossbar level.
Further, a plurality of cores of the multi-Core memristor storage and calculation integrated platform share a global memory through a bus.
Furthermore, the Core comprises an instruction fetching decoding module, a loading module, a storage module and a calculation module.
Further, the Core includes a data memory.
Further, the Crossbar unit and tensor ALU on the Core are Core computation units.
Further, step 3 adopts a greedy strategy.
Further, the communication overhead between the convolution operator and the Core-external memory is:
Target = rd_factor(P) + sync_factor(P)
wherein the read-write cost rd_factor(P) is incurred by a Core directly reading and writing the external memory to obtain input data, and the synchronization cost sync_factor(P) is incurred by transferring data between Cores through the external memory.
Further, the communication overhead between the convolution operator and the Core-external memory is:
Target = OH · OW · Σ_{i=1..L} ( diffrow(core_i) - reuse(core_i, stride) ) + 2 · OH · OW · ( Σ_{i=1..L} diffcol(core_i) - N )
wherein the convolution inputs are the feature map In, the weights W and the step size stride, and the output feature map is Out[B][OC][OH][OW]; N is the number of columns of the matrix P; diffrow(core_i) denotes the number of elements of core_i located in different rows of the matrix P; diffcol(core_i) denotes the number of elements of core_i located in different columns of the matrix P; reuse(core_i, stride) denotes the number of memristor arrays in the matrix P whose input data can be reused when the convolution step is stride; and L is the number of Cores participating in the mapping.
Further, in step 3, with X memristor arrays on each Core, the number K of Cores required to map the weights W is first determined; all elements of the matrix P are then initialized to K, i.e. all arrays are first logically assigned to Core K; the whole matrix P is then traversed, the Target obtained by assigning an element to the Core currently being filled or to its adjacent Core is calculated, and the scheme with the smallest Target is selected; this process is repeated for the unassigned elements until the number of elements assigned to Core K is not greater than X.
In a preferred embodiment, the invention first presents an abstract multi-core storage and calculation integrated hardware architecture with a degree of generality, and aims to optimize the performance of the convolution operator on this architecture and reduce the number of communication instructions.
1 hardware architecture
The invention targets storage and calculation integrated acceleration systems with multiple compute cores, and for generality assumes a simple multi-core accelerator architecture. The abstract architecture consists, from top to bottom, of a Core level and a Crossbar level. Multiple Cores share a global memory through a bus. Fig. 1 shows the details of the Core compute unit, which mainly comprises four modules: the instruction fetch and decode module 101, the load module 102, the store module 103 and the compute module 104 in Fig. 1. Each Core additionally includes a data memory used for (a) buffering input data, (b) temporarily storing intermediate results, and (c) storing the instructions executed by the Core. The Crossbar units and the tensor ALU on a Core are its compute units, responsible for matrix-vector multiplication and other tensor ALU operations respectively.
In terms of data path, before the accelerator starts computation, the input data and weight data are first copied from the host to the global memory, and the instructions required by the computation task are copied to each Core (these instructions are generated statically by the compiler in advance). After the instructions are copied, the instruction prefetch module on the Core reads the current instruction, performs preliminary decoding, and dispatches it to the load, compute or store module according to the decoding result; instructions flow between these modules.
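For use in the sketches later in this description, the architectural parameters that the mapping method depends on can be collected in a small configuration object. The following Python sketch is purely illustrative; the names AcceleratorConfig, corenum, xbarnum and xbar_size are assumptions chosen to match the symbols used in this text and are not part of any existing framework.

```python
from dataclasses import dataclass

@dataclass
class AcceleratorConfig:
    corenum: int    # number of compute Cores (L in the text)
    xbarnum: int    # memristor arrays (Crossbars) per Core (X in the text)
    xbar_size: int  # each Crossbar stores an xbar_size x xbar_size weight tile

# Values matching the evaluation section; the 128x128 crossbar size is assumed for illustration.
cfg = AcceleratorConfig(corenum=30, xbarnum=16, xbar_size=128)
```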
For this abstract architecture, the invention consistently reduces the number of communication instructions generated when executing the convolution operator.
2 mapping strategy
2.1 problem abstraction
Algorithm 1 is a naive convolution: the inputs are the feature map In, the weights W and the step size stride, and the output feature map is Out[B][OC][OH][OW]. Observing the algorithm, for W the oc axis can be computed in parallel, while the data along the ic, kh and kw axes are multiplied with the corresponding inputs and accumulated; therefore, when mapping, the oc axis is mapped to different columns of a crossbar and the ic, kh and kw axes to different rows. At the same time, adjacent convolution windows partially overlap, and exploiting this overlapping data can further reduce the number of read/write instructions.
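The following Python sketch illustrates the loop-nest formulation described above. It mirrors the axis names B, OC, IC, KH, KW, OH, OW used here and is an illustration of a naive convolution, not a verbatim copy of Algorithm 1 from the original filing.

```python
import numpy as np

def naive_conv2d(In, W, stride):
    """Naive convolution: In[B][IC][IH][IW], W[OC][IC][KH][KW] -> Out[B][OC][OH][OW]."""
    B, IC, IH, IW = In.shape
    OC, _, KH, KW = W.shape
    OH = (IH - KH) // stride + 1
    OW = (IW - KW) // stride + 1
    Out = np.zeros((B, OC, OH, OW))
    for b in range(B):
        for oc in range(OC):              # oc axis is independent -> different crossbar columns
            for oh in range(OH):
                for ow in range(OW):      # adjacent (oh, ow) windows overlap -> input reuse
                    acc = 0.0
                    for ic in range(IC):  # ic, kh, kw are accumulated -> different crossbar rows
                        for kh in range(KH):
                            for kw in range(KW):
                                acc += W[oc, ic, kh, kw] * In[b, ic, oh * stride + kh, ow * stride + kw]
                    Out[b, oc, oh, ow] = acc
    return Out
```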
Assume the back-end architecture has L Cores, each Core has X memristor arrays (crossbars), and each crossbar has size xbar_size × xbar_size. After the weights are partitioned according to xbar_size, the data along the oc axis are distributed to different columns of the crossbars and the data along the ic, kh and kw axes to different rows, so that the memristor characteristics can be exploited to accelerate the convolution operation.
The mapping problem is abstracted further. First, the weight tensor is unfolded into a matrix W[IC×KH×KW][OC], where IC is the number of input channels, OC the number of output channels, and KH and KW the convolution kernel sizes; the rows and columns of W are then partitioned according to the memristor array size xbar_size to obtain a matrix P[M][N], which is the mapping scheme. This process is shown in FIG. 2. P[i][j] is the number of the Core to which the memristor array in the i-th row and j-th column belongs; the problem is thus converted into constructing the matrix P so that the objective function is minimized.
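As a concrete illustration of this abstraction, the sketch below unfolds the weight tensor and derives the dimensions M and N of the mapping matrix P. The helper names unfold_weights and mapping_shape are assumptions made for illustration only.

```python
import math
import numpy as np

def unfold_weights(W):
    """W[OC][IC][KH][KW] -> W_mat[IC*KH*KW][OC]; each column is one flattened convolution kernel."""
    OC, IC, KH, KW = W.shape
    return W.reshape(OC, IC * KH * KW).T

def mapping_shape(W_shape, xbar_size):
    """Number of xbar_size-sized tiles along the rows (M) and columns (N) of the unfolded weights."""
    OC, IC, KH, KW = W_shape
    M = math.ceil(IC * KH * KW / xbar_size)
    N = math.ceil(OC / xbar_size)
    return M, N

# Example: a 3x3 convolution with 64 input and 128 output channels on 128x128 crossbars.
M, N = mapping_shape((128, 64, 3, 3), xbar_size=128)  # M = ceil(576/128) = 5, N = 1
P = np.zeros((M, N), dtype=int)  # P[i][j] will hold the Core ID of the crossbar tile in row i, column j
```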
[Algorithm 1 (naive convolution) is rendered as an image in the original publication.]
2.2 objective function
The communication overhead between the convolution operator and the Core-external memory has two sources: one part is incurred by a Core directly reading and writing the external memory to obtain input data, i.e. the read-write cost rd_factor(P); the other part is incurred by transferring data between Cores through the external memory, i.e. the synchronization cost sync_factor(P). The total of these two parts is the optimization target of the invention:
Target = rd_factor(P) + sync_factor(P)    (1)
in convolution operation, the read-write cost of one convolution window operation is the sum of the read-write costs of each Core, and if one communication operation is the Core reading or writing xbar _ size data, for one mapping P, because the same input data is used, the same element which is equal and located in the same row can only bring one communication operation, the same element number can be brought by the same element which is equal and located in the same column under the condition that the data multiplexing in the Core is not considered, and the number of the input multiplexing elements needs to be subtracted when the data multiplexing in the Core is considered. Thus, from
rd_factor(P) = OH · OW · Σ_{i=1..L} ( diffrow(core_i) - reuse(core_i, stride) )    (2)
where diffrow(core_i) denotes the number of elements of core_i located in different rows of the matrix P, reuse(core_i, stride) denotes the number of memristor arrays in the matrix P whose input data can be reused when the convolution step is stride, and L is the number of Cores participating in the mapping.
Similarly, the number of synchronizations for one convolution-window operation equals the number of unequal elements in each column of P minus 1, and each synchronization causes two communication operations, one read and one write; hence, from the perspective of the matrix P,
sync_factor(P) = 2 · OH · OW · ( Σ_{i=1..L} diffcol(core_i) - N )    (3)
where diffcol(core_i) denotes the number of elements of core_i located in different columns of the matrix P and N is the number of columns of P.
the observation of the two formulas (2) and (3) shows that they have the same factors OH and OW, which can be further simplified:
Target = Σ_{i=1..L} ( diffrow(core_i) - reuse(core_i, stride) ) + 2 · ( Σ_{i=1..L} diffcol(core_i) - N )    (4)
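The simplified objective (4) can be evaluated directly on a candidate matrix P. The sketch below is a minimal rendering under the reconstruction above: the reuse term is left as an optional caller-supplied function, since its exact form depends on the convolution step and the in-Core buffering, and all function names are assumptions made for illustration.

```python
import numpy as np

def mapping_cost(P, stride=1, reuse=None):
    """Simplified objective of formula (4) for a mapping matrix P of Core IDs.

    diffrow(core_i): number of distinct rows of P occupied by Core i -> read cost.
    diffcol(core_i): number of distinct columns of P occupied by Core i; summing over Cores and
    subtracting N (the number of columns) gives the number of synchronizations, each of which
    costs two communication operations (one write, one read).
    reuse(core_id, stride), if supplied, returns the number of crossbars of that Core whose
    input rows can be reused across adjacent convolution windows.
    """
    P = np.asarray(P)
    N = P.shape[1]
    rd = 0
    diffcol_sum = 0
    for c in np.unique(P):
        rows, cols = np.nonzero(P == c)
        rd += len(np.unique(rows)) - (reuse(c, stride) if reuse is not None else 0)
        diffcol_sum += len(np.unique(cols))
    return rd + 2 * (diffcol_sum - N)

# Row-first vs. column-first on a 4x2 tiling with two Cores of four crossbars each:
P_row = np.array([[1, 1], [1, 1], [2, 2], [2, 2]])
P_col = np.array([[1, 2], [1, 2], [1, 2], [1, 2]])
print(mapping_cost(P_row))  # rd = 2 + 2, sync = 2 * (2 + 2 - 2) = 4  -> 8
print(mapping_cost(P_col))  # rd = 4 + 4, sync = 2 * (1 + 1 - 2) = 0  -> 8
```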
2.3 mapping Algorithm
The direct mapping algorithm comes in two variants, row-first and column-first, which traverse the matrix P by rows or by columns respectively and assign the elements to Cores in order. Row-first mapping minimizes the number of different elements in the same row but increases the number of different elements in the same column, i.e. it reduces rd_factor(P) and increases sync_factor(P). Similarly, column-first mapping minimizes the number of different elements in the same column but increases the number of different elements in the same row, i.e. it increases rd_factor(P) and reduces sync_factor(P).
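As a baseline for comparison, the two direct mapping strategies can be sketched as follows (illustrative code only; Core numbering starts at 1, and the function name direct_map is an assumption):

```python
import numpy as np

def direct_map(M, N, xbarnum, order="row"):
    """Row-first or column-first direct mapping: walk P in the given order, filling xbarnum tiles per Core."""
    P = np.zeros((M, N), dtype=int)
    coords = ([(i, j) for i in range(M) for j in range(N)] if order == "row"
              else [(i, j) for j in range(N) for i in range(M)])
    for idx, (i, j) in enumerate(coords):
        P[i, j] = idx // xbarnum + 1  # Core IDs 1, 2, 3, ...
    return P

# A 4x3 tiling with 4 crossbars per Core:
print(direct_map(4, 3, xbarnum=4, order="row"))
print(direct_map(4, 3, xbarnum=4, order="col"))
```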
Based on a greedy algorithm that combines the characteristics of row-first and column-first mapping, a mapping scheme with relatively low communication cost can be obtained according to the size of the weight matrix, as shown in Algorithm 2.
[Algorithm 2 (greedy mapping algorithm) is rendered as an image in the original publication.]
The inputs of the algorithm are the number of Cores corenum, the number of crossbars per Core xbarnum, the crossbar size xbar_size, the weights W, the convolution step size stride and an initial mapping scheme P; the output is the final mapping scheme P.
The main process is as follows: first determine the number of Cores K required to map the weights W; initialize all elements of the P array to K, i.e. first logically assign all arrays to Core K; then traverse the whole matrix P, compute the Target obtained by assigning an element to the Core currently being filled or to its adjacent Core, and select the scheme with the smallest Target; repeat this process for the unassigned points until the number of elements assigned to Core K is not greater than X. The overall flow is shown in the flowchart of FIG. 3.
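The main process just described can be sketched in Python as follows. This is a simplified rendering of Algorithm 2 under the assumptions stated in this description, not the verbatim patented algorithm: the default cost function is the simplified objective (4) with the input-reuse term omitted, and any cost(P, stride) callable, such as the mapping_cost sketch in section 2.2, can be substituted.

```python
import math
import numpy as np

def default_cost(P, stride=1):
    """Objective (4) with the input-reuse term omitted (illustrative simplification)."""
    N = P.shape[1]
    cores = np.unique(P)
    diffrow = sum(len(np.unique(np.nonzero(P == c)[0])) for c in cores)
    diffcol = sum(len(np.unique(np.nonzero(P == c)[1])) for c in cores)
    return diffrow + 2 * (diffcol - N)

def greedy_map(W_shape, corenum, xbarnum, xbar_size, stride=1, cost=default_cost):
    """Sketch of the greedy mapping (Algorithm 2): returns a mapping matrix P of Core IDs."""
    OC, IC, KH, KW = W_shape
    M = math.ceil(IC * KH * KW / xbar_size)        # crossbar tiles along the rows of the unfolded W
    N = math.ceil(OC / xbar_size)                  # crossbar tiles along the columns
    K = math.ceil(M * N / xbarnum)                 # number of Cores needed for this operator
    assert K <= corenum, "operator does not fit on this accelerator"

    P = np.full((M, N), K, dtype=int)              # logically assign every tile to Core K first
    cur = 1                                        # Core currently being filled
    while np.count_nonzero(P == K) > xbarnum:
        best = None
        for i in range(M):
            for j in range(N):
                if P[i, j] != K:
                    continue
                for cand in (cur, cur + 1):        # try the current Core and its neighbour
                    if cand >= K or np.count_nonzero(P == cand) >= xbarnum:
                        continue                   # do not over-subscribe a Core
                    trial = P.copy()
                    trial[i, j] = cand
                    c = cost(trial, stride)
                    if best is None or c < best[0]:
                        best = (c, i, j, cand)
        _, i, j, cand = best
        P[i, j] = cand
        if np.count_nonzero(P == cur) >= xbarnum:  # current Core is full: move on to the next one
            cur += 1
    return P

# Example: map a 3x3 convolution with 128 input and 256 output channels
# onto 128x128 crossbars, 16 crossbars per Core, 30 Cores.
print(greedy_map((256, 128, 3, 3), corenum=30, xbarnum=16, xbar_size=128))
```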
3 evaluation of results
Let L = 30 and X = 16, take the direct mapping algorithm as the baseline, and select a number of convolution layers arbitrarily for evaluation; the results are shown in FIG. 4. They show that, compared with the direct mapping algorithm, the mapping algorithm proposed by the invention consistently obtains a relatively better mapping scheme and reduces the communication cost by 21% on average.
Compared with the prior art, the invention has the following obvious substantive characteristics and obvious advantages:
1. Compared with a direct mapping strategy, the method comprehensively considers the communication cost of directly reading and writing memory and the communication cost of multi-core synchronization, can effectively reuse input data, exploits the parallelism of the memristor arrays, and obtains a mapping scheme with lower communication cost.
2. The method is simple to implement and highly portable; it can be executed at the back end of a neural network compiler to complete convolution operator mapping for the memristor storage and calculation integrated platform.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is the abstract architecture of the multi-core memristor storage and calculation integrated platform of a preferred embodiment of the invention;
FIG. 2 is the mapping process from convolution weights to memristor arrays in a preferred embodiment of the invention;
FIG. 3 is the flow of the greedy-strategy-based mapping algorithm in a preferred embodiment of the invention;
FIG. 4 shows the mapping results of a preferred embodiment of the invention.
In the drawings: 101, instruction fetch and decode module; 102, load module; 103, store module; 104, compute module.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
In the drawings, structurally identical elements are represented by like reference numerals, and structurally or functionally similar elements are represented by like reference numerals throughout the several views. The size and thickness of each component shown in the drawings are arbitrarily illustrated, and the present invention is not limited to the size and thickness of each component. The thickness of the components may be exaggerated where appropriate in the figures to improve clarity.
The invention is mainly directed to compilation for multi-core memristor storage and calculation integrated platforms, and in particular to the mapping optimization of convolution operators during compilation.
A multi-core memristor storage and calculation integrated platform supports matrix-vector computation based on memristor arrays. When performing convolution, the convolution is decomposed into multiple matrix-vector multiplications and the weights of the convolution kernels are mapped onto multiple memristor arrays; when the convolution is large, the weight data are mapped onto multiple compute cores, which must jointly complete the computation of the convolution operator. Different mapping schemes bring different communication costs, and the traditional direct mapping strategy considers neither the inter-core communication cost nor data locality.
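To make this decomposition concrete, the sketch below computes a convolution as one matrix-vector multiplication per output position against the unfolded weight matrix, which is the operation a memristor crossbar performs in place; the function name conv_as_mvm is assumed for illustration.

```python
import numpy as np

def conv_as_mvm(In, W, stride):
    """Convolution as one matrix-vector multiplication per output position.

    In[B][IC][IH][IW], W[OC][IC][KH][KW] -> Out[B][OC][OH][OW].
    W_mat has shape (IC*KH*KW, OC): the layout mapped onto memristor crossbars.
    """
    B, IC, IH, IW = In.shape
    OC, _, KH, KW = W.shape
    OH = (IH - KH) // stride + 1
    OW = (IW - KW) // stride + 1
    W_mat = W.reshape(OC, IC * KH * KW).T            # each column is one convolution kernel
    Out = np.zeros((B, OC, OH, OW))
    for b in range(B):
        for oh in range(OH):
            for ow in range(OW):
                window = In[b, :, oh*stride:oh*stride+KH, ow*stride:ow*stride+KW].reshape(-1)
                Out[b, :, oh, ow] = window @ W_mat   # one MVM yields all OC outputs in parallel
    return Out
```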
The mapping strategy provided by the invention is mainly applied at the back end of a neural network compiler, where a better mapping scheme can be obtained from the parameters of each convolution operator and the memristor array size. The mapping process is as follows:
1. Expand the weights W[OC][IC][KH][KW] of the convolution layer into a weight matrix W[IC×KH×KW][OC]; all weights in a row can be multiplied by the same input in parallel, and outputs are obtained in parallel by column, the output of each column being the result of multiplying and accumulating its [IC×KH×KW] weights with the corresponding inputs, i.e. each column is one convolution kernel and there are OC convolution kernels in total, as shown in the first step of FIG. 2.
2. Divide the rows and columns of the matrix obtained in the previous step according to the memristor array size xbar_size; the resulting matrix is denoted P in the second step of FIG. 2, each element of P represents one memristor array of that size, and P[i][j] is the Core ID to which the memristor array in the i-th row and j-th column belongs.
3. Mapping is completed with Algorithm 2 for the matrix P.
The invention discloses a convolution operator mapping method for a multi-core memristor storage and calculation integrated platform, which comprises the following steps of:
Step 1: input the number of Cores (compute cores) corenum, the number of Crossbars (memristor arrays) per Core xbarnum, the Crossbar size xbar_size, the weight matrix W (including its dimensions OC, IC, KH, KW), and the convolution kernel step size stride.
Step 2: derive the size (M × N) of the mapping-scheme matrix P and the number K of Cores needed to map the current convolution operator according to the method in section 2.1, combined with the parameters of step 1.
Step 3: initialize all elements of the matrix P to K.
Step 4: initialize the number curId of the Core currently being assigned to 1, and initialize the current best mapping scheme curP to P.
Step 5: initialize the communication cost of the current mapping scheme to infinity.
Step 5.1: traverse the whole matrix P, compute according to formula (4) in section 2.2 the communication cost obtained when an element with value K is replaced by curId or curId + 1, and keep the scheme with the lowest communication cost in curP.
Step 5.2: if the Core numbered curId has no allocatable crossbar left, increment curId by 1 and continue with the next Core.
Step 5.3: update P to curP.
Step 6: if the number of elements of P with value K is greater than xbarnum, repeat step 5.
Step 7: P is the final mapping scheme.
The invention thus provides a convolution operator mapping method for a multi-core memristor storage and calculation integrated architecture.
The invention provides a communication-cost evaluation function for the convolution operator that considers both the communication cost of direct memory access and the communication cost of multi-core synchronization, together with the corresponding expression.
The method considers the reuse of input data when mapping the weights, effectively reducing the communication cost and yielding a better mapping scheme.
The greedy strategy adopted by the method is simple to implement, highly portable, and faster than other intelligent search algorithms.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A convolution operator mapping method for a multi-core memristor storage and calculation integrated platform is characterized by comprising the following steps:
step 1, expanding the weights W[OC][IC][KH][KW] of the convolution layer into a weight matrix W[IC×KH×KW][OC], wherein all weights in a row are multiplied by the same input in parallel and outputs are obtained in parallel by column, the output of each column being the result of multiplying and accumulating its [IC×KH×KW] weights with the corresponding inputs, i.e. each column is one convolution kernel and there are OC convolution kernels in total;
step 2, dividing the rows and columns of the matrix obtained in step 1 according to the memristor array size xbar_size, wherein the resulting matrix is denoted P, each element of P represents one memristor array of that size, and P[i][j] represents the Core ID to which the memristor array in the i-th row and j-th column belongs;
and 3, finishing mapping aiming at the matrix P.
2. The method as recited in claim 1, wherein the multi-Core memristor storage and calculation integrated platform comprises, from top to bottom, a Core level and a Crossbar level.
3. The method for mapping convolution operators of a multi-Core memristor-computing-oriented platform as claimed in claim 1, wherein a plurality of cores of the multi-Core memristor-computing-oriented platform share a global memory through a bus.
4. The method for mapping convolution operators of a multi-Core memristor-memory-computation-integrated platform as claimed in claim 1, wherein the Core comprises an instruction fetch decoding module, a loading module, a storage module and a computation module.
5. The method of claim 1, in which the Core comprises a data store.
6. The method of claim 1, in which cross bar units and tensor ALUs on the Core are Core computation units.
7. The method for mapping convolution operators of a multi-core memristor-memory-computation-integrated platform as claimed in claim 1, wherein greedy strategy is adopted in the step 3.
8. The method for mapping convolution operators oriented to a multi-Core memristor storage and calculation integrated platform according to claim 1, wherein the communication overhead between the convolution operator and the Core-external memory is:
Target = rd_factor(P) + sync_factor(P)
wherein the read-write cost rd_factor(P) is incurred by a Core directly reading and writing the external memory to obtain input data, and the synchronization cost sync_factor(P) is incurred by transferring data between Cores through the external memory.
9. The method for mapping convolution operators oriented to a multi-Core memristor memory integrated platform according to claim 1, wherein communication overhead of the convolution operators and a Core external memory is as follows:
Target = OH · OW · Σ_{i=1..L} ( diffrow(core_i) - reuse(core_i, stride) ) + 2 · OH · OW · ( Σ_{i=1..L} diffcol(core_i) - N )
wherein the convolution inputs are the feature map In, the weights W and the step size stride, and the output feature map is Out[B][OC][OH][OW]; N is the number of columns of the matrix P; diffrow(core_i) denotes the number of elements of core_i located in different rows of the matrix P; diffcol(core_i) denotes the number of elements of core_i located in different columns of the matrix P; reuse(core_i, stride) denotes the number of memristor arrays in the matrix P whose input data can be reused when the convolution step is stride; and L is the number of Cores participating in the mapping.
10. The convolution operator mapping method for a multi-core memristor storage and calculation integrated platform as claimed in claim 8 or 9, wherein in step 3, with X memristor arrays on each Core, the number K of Cores required to map the weights W is first determined, all elements of the P array are then initialized to K, i.e. all arrays are first logically assigned to Core K, the whole matrix P is then traversed, the Target obtained when an element is assigned to the Core currently being assigned or to its adjacent Core is calculated and the scheme with the smallest Target is selected, and this process is repeated for the unassigned elements until the number of elements assigned to Core K is not greater than X.
CN202210104656.2A 2022-01-28 2022-01-28 Convolution operator mapping method for multi-core memristor storage and calculation integrated platform Pending CN114418072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210104656.2A CN114418072A (en) 2022-01-28 2022-01-28 Convolution operator mapping method for multi-core memristor storage and calculation integrated platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210104656.2A CN114418072A (en) 2022-01-28 2022-01-28 Convolution operator mapping method for multi-core memristor storage and calculation integrated platform

Publications (1)

Publication Number Publication Date
CN114418072A true CN114418072A (en) 2022-04-29

Family

ID=81279766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210104656.2A Pending CN114418072A (en) 2022-01-28 2022-01-28 Convolution operator mapping method for multi-core memristor storage and calculation integrated platform

Country Status (1)

Country Link
CN (1) CN114418072A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115099396A (en) * 2022-05-09 2022-09-23 清华大学 Full weight mapping method and device based on memristor array
CN115099396B (en) * 2022-05-09 2024-04-26 清华大学 Full-weight mapping method and device based on memristor array
CN114781634A (en) * 2022-06-21 2022-07-22 之江实验室 Memristor-based neural network array automatic mapping method and device
CN114781634B (en) * 2022-06-21 2022-11-04 之江实验室 Automatic mapping method and device of neural network array based on memristor
CN116089095A (en) * 2023-02-28 2023-05-09 苏州亿铸智能科技有限公司 Deployment method for ReRAM neural network computing engine network
CN116089095B (en) * 2023-02-28 2023-10-27 苏州亿铸智能科技有限公司 Deployment method for ReRAM neural network computing engine network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination