CN115238877A - Data mapping method and data mapping device for improving utilization rate of storage array - Google Patents

Data mapping method and data mapping device for improving utilization rate of storage array

Info

Publication number
CN115238877A
Authority
CN
China
Prior art keywords
array
unit
unit array
convolution kernel
column
Prior art date
Legal status
Pending
Application number
CN202210875598.3A
Other languages
Chinese (zh)
Inventor
梁峰
王尚
Current Assignee
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202210875598.3A
Publication of CN115238877A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F 1/26 Power supply means, e.g. regulation thereof
    • G06F 1/32 Means for saving power
    • G06F 1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F 1/3234 Power saving characterised by the action undertaken
    • G06F 1/3243 Power saving in microcontroller unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G06F 17/153 Multidimensional correlation or convolution


Abstract

The invention provides a data mapping method and a data mapping device for improving the utilization rate of a storage and computation array, and relates to the technical field of chips. The method comprises the following steps: splitting a k × k convolution kernel and mapping it into k unit arrays, with each unit array mapping a 1 × k × C vector and the mapping continuing into the next unit array if the rows of a unit array are insufficient; and determining, according to the relation between the convolution kernel size and the storage and computation array size, whether the storage and computation array uses a first data mapping method or a second data mapping method. The invention effectively relieves the pressure on the input buffer and reduces the amount of data transferred, thereby reducing chip latency and power consumption. In addition, depending on the size of the convolution kernel, the method can increase the speed/area ratio by up to a factor of three, effectively utilizes the idle rows and columns in each array, and improves the utilization rate and calculation speed of the storage and computation arrays.

Description

Data mapping method and data mapping device for improving utilization rate of storage array
Technical Field
The invention relates to the technical field of chips, in particular to a data mapping method and a data mapping device for improving the utilization rate of a storage and computation array.
Background
In the computation part of a storage and computation (compute-in-memory) chip, the hierarchy of the units responsible for calculation, from top to bottom, is: tile, processing element (PE), array (XBar), and the rows and columns within each array, with one computing element (bitcell) at each row-column intersection; each computing element can perform one computation per cycle.
When a storage and computation chip is used to compute a neural network, the parameter matrix of the neural network is mapped onto the arrays according to specific rules. One of the main purposes of weight mapping is to achieve the highest possible data reuse rate: with a common 3 × 3 convolution kernel, each number in the input matrix participates in an average of 9 multiply-add operations, which increases the pressure on the input buffer and the amount of data transferred, and thereby increases chip latency and power consumption. In addition, when one layer of the neural network is accelerated with multiple copies, more processing elements and arrays are often used for pipelined accelerated computation, yet the idle rows and columns inside each array cannot be utilized effectively; the problem is especially obvious when the convolution kernel has few output channels. How to effectively utilize the idle rows and columns in each array and improve the utilization rate and calculation speed of the storage and computation array is an urgent problem to be solved.
Disclosure of Invention
In view of the above problems, the present invention has been made to provide a data mapping method and a data mapping apparatus that overcome or at least partially solve the above problems and improve the utilization rate of a computational array.
A first aspect of an embodiment of the present invention provides a data mapping method for increasing a utilization rate of a storage array, where the storage array includes: a plurality of cell arrays, the data mapping method comprising:
splitting a k multiplied by k convolution kernel and mapping it into k unit arrays, mapping a 1 multiplied by k multiplied by C vector in each unit array, and continuing the mapping into the next unit array if the number of rows of a unit array is not enough, wherein k multiplied by k represents the height multiplied by the width of the convolution kernel, and C represents the number of input channels of the convolution kernel;
determining whether the storage array uses a first data mapping method or a second data mapping method according to the relation between the size of the convolution kernel and the size of the storage array;
taking k = 3 as an example, the first data mapping method includes:
mapping a first row of a 3 × 3 convolution kernel to any unit array, wherein elements in a first column in the unit array correspond to an arrangement mode of the first row in the 3 × 3 convolution kernel after being expanded into a column vector;
respectively mapping the same weights as those in the first column of the unit array in the second column and the third column of the unit array, wherein the mapping position corresponding to the second column of the unit array is vacant downwards by C rows, and the mapping position corresponding to the third column of the unit array is vacant downwards by 2C rows;
expanding the input elements in the input buffer to a 1 × 5 × C column vector;
in the current period, the input buffer outputs 5 elements in the input matrix to the unit array and carries out convolution operation;
when the current period is finished, obtaining a complete output element at the tail end of each column in the unit array, and then performing the convolution operation of the next period;
taking k = 3 as an example, the second data mapping method includes:
mapping parameters on each input channel of a 3 × 3 convolution kernel into different unit arrays in the storage and computation array respectively, and mapping the same weight on three columns in each unit array respectively without empty rows, so that each input channel occupies ceil(C/n) unit arrays, ceil(C/n) × 3 arrays are occupied in the row dimension in total, and ceil(M × 3/n) arrays are occupied in the column dimension, wherein ceil is a rounding-up function, and n represents the size of the storage and computation array;
in the current period, the input buffer outputs 3 elements in the input matrix to three unit arrays in the storage and calculation array and carries out convolution operation to obtain partial results;
and adding partial results obtained in each period in groups to obtain a convolution result.
Optionally, the size of the convolution kernel is k × k × C × M, and the size of the storage array is n × n, where M represents the number of output channels of the convolution kernel;
determining whether the storage array uses a first data mapping method or a second data mapping method according to the relation between the size of the convolution kernel and the size of the storage array, wherein the method comprises the following steps:
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not more than k/(2k-1) of the size of the unit array, or the number of occupied columns of the unit array is less than 1/k of the size of the unit array, determining that the storage and computation array uses the first data mapping method;
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not more than k/(2k-1) of the size of the unit array, and the number of occupied columns of the unit array is less than 1/k of the size of the unit array, determining that the storage and computation array uses the first data mapping method;
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not less than k times the size of the unit array, calculating, based on the size of the convolution kernel and the size of the storage and computation array, the ratio of a first occupied array to a calculation speed and the ratio of a second occupied array to the calculation speed, wherein the ratio of the first occupied array to the calculation speed is the ratio of occupied arrays to calculation speed when the convolution kernel is mapped to the storage and computation array by the conventional method, and the ratio of the second occupied array to the calculation speed is the ratio of occupied arrays to calculation speed obtained by mapping the parameters on each input channel of the convolution kernel to different unit arrays in the storage and computation array respectively and mapping the same weight in k columns of each unit array without empty rows;
and if the value of the ratio of the first occupied array to the calculation speed is larger than the value of the ratio of the second occupied array to the calculation speed, determining that the storage and calculation array uses the second data mapping method.
Optionally, the formula of the ratio between the first occupancy array and the calculated speed is:
k×ceil(k×C/n)×ceil(M/n)
the formula of the ratio of the second occupied array to the calculation speed is:
ceil(C/n) × k × ceil(M × k/n)
in the above two formulas, ceil(k × C/n) indicates that the convolution kernel mapping needs to occupy ceil(k × C/n) unit arrays in the row dimension in the conventional method, ceil(M/n) indicates that it needs to occupy ceil(M/n) unit arrays in the column dimension in the conventional method, ceil(C/n) indicates that each input channel occupies ceil(C/n) unit arrays, ceil(C/n) × k indicates that ceil(C/n) × k unit arrays need to be occupied in the row dimension, and ceil(M × k/n) indicates that ceil(M × k/n) unit arrays need to be occupied in the column dimension.
Optionally, in the current cycle, the outputting, by the input buffer, 5 elements in the input matrix to the cell array and performing convolution operation includes:
the first two elements of the 5 elements are partially calculated in the previous period, the rest of the calculations are completed in the current period, the middle element is input once in the current period to complete the whole 3 calculations, the last two elements are partially calculated in the current period, and the rest of the calculations are completed by placing the last two elements at the positions of the first two elements in the next period.
Optionally, when the current period ends, a complete output element is obtained at the end of each column in the cell array, and then the convolution operation in the next period is performed, where the convolution operation includes:
when the current period is finished, obtaining a complete output element at the tail end of each column in the unit array, and then, when the next period starts, translating the value-taking area of the input buffer on the input matrix rightwards by 3 pixels relative to the value-taking area of the current period, taking 5 elements and outputting them to the unit array for convolution operation.
Optionally, the three cell arrays comprise: a first cell array, a second cell array, and a third cell array;
adding the partial result groups obtained in each period to obtain a convolution result, wherein the convolution result comprises the following steps:
adding the result obtained at the end of the first column in the first unit array, the result obtained at the end of the first column in the second unit array and the result obtained at the end of the first column in the third unit array in the current period to obtain a complete output element;
adding a result obtained at the end of a second column in the first unit array, a result obtained at the end of a third column in the first unit array and a result obtained at the end of a third column in the second unit array in the current period to a partial result obtained in the previous period respectively to obtain a complete output element;
and adding the result obtained at the end of the second column in the second unit array, the result obtained at the end of the second column in the third unit array and the result obtained at the end of the third column in the third unit array in the current period respectively to a partial result obtained in the next period to obtain a complete output element.
A second aspect of the embodiments of the present invention provides a data mapping apparatus for increasing a utilization rate of a storage array, where the data mapping apparatus includes:
a splitting and mapping module, used for splitting a k multiplied by k convolution kernel and mapping it into k unit arrays, each unit array being mapped with a 1 multiplied by k multiplied by C vector, and, if the number of rows of a unit array is not enough, continuing the mapping into the next unit array, wherein k multiplied by k represents the height multiplied by the width of the convolution kernel, and C represents the number of input channels of the convolution kernel;
a determining method module, used for determining whether the storage and computation array uses a first data mapping method or a second data mapping method according to the relation between the size of the convolution kernel and the size of the storage and computation array;
taking k = 3 as an example, the data mapping apparatus further includes: a first method module and a second method module;
wherein the first method module comprises:
the expansion mapping unit is used for mapping the first row of the 3 x 3 convolution kernel to any unit array in the storage and computation array, and elements of a first column in the unit array correspond to an arrangement mode of the first row in the 3 x 3 convolution kernel after being expanded into column vectors;
the empty row mapping unit is used for respectively mapping the same weights as those of the first column in the unit array in the second column and the third column in the unit array, the mapping positions corresponding to the second column elements in the unit array are empty downwards for C rows, and the mapping positions corresponding to the third column elements in the unit array are empty downwards for 2C rows;
an expansion input unit for expanding input data in the input buffer into a 1 × 5 × C column vector;
the first convolution unit is used for outputting 5 elements in an input matrix to the unit array by the input buffer in the current period and performing convolution operation;
the value output unit is used for obtaining a complete output element at the tail end of each column in the unit array when the current period is finished, and then carrying out the convolution operation of the next period;
the second method module comprises:
a full row mapping unit, configured to map parameters on each input channel of the 3 × 3 convolution kernel to different unit arrays in the storage array respectively, and map the same weight to three columns in each unit array respectively without empty rows, so that each input channel occupies ceil(C/n) unit arrays, ceil(C/n) × 3 arrays are occupied in the row dimension in total, and ceil(M × 3/n) arrays are occupied in the column dimension, where ceil is a rounding-up function, and n represents the size of the storage array;
the second convolution unit is used for outputting 3 elements in an input matrix to three unit arrays in the storage and calculation array by the input buffer in the current period and performing convolution operation to obtain a partial result;
and the adding unit is used for adding the partial results obtained in each period in groups to obtain a convolution result.
Optionally, the size of the convolution kernel is k × k × C × M, and the size of the storage array is n × n, where M represents the number of output channels of the convolution kernel;
the determination method module is specifically configured to:
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not more than k/(2k-1) of the size of the unit array, or the number of occupied columns of the unit array is less than 1/k of the size of the unit array, determining that the storage and computation array uses the first data mapping method;
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not more than k/(2k-1) of the size of the unit array, and the number of occupied columns of the unit array is less than 1/k of the size of the unit array, determining that the storage and computation array uses the first data mapping method;
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not less than k times the size of the unit array, calculating, based on the size of the convolution kernel and the size of the storage and computation array, the ratio of a first occupied array to a calculation speed and the ratio of a second occupied array to the calculation speed, wherein the ratio of the first occupied array to the calculation speed is the ratio of occupied arrays to calculation speed when the convolution kernel is mapped to the storage and computation array by the conventional method, and the ratio of the second occupied array to the calculation speed is the ratio of occupied arrays to calculation speed obtained by mapping the parameters on each input channel of the convolution kernel to different unit arrays in the storage and computation array respectively and mapping the same weight in three columns of each unit array without empty rows;
and if the value of the ratio of the first occupied array to the calculation speed is larger than the value of the ratio of the second occupied array to the calculation speed, determining that the storage and calculation array uses the second data mapping method.
Optionally, the first convolution unit is specifically configured to:
the first two elements of the 5 elements are partially calculated in the previous cycle, the rest of the calculations are completed in the current cycle, the middle element is input once in the current cycle to complete all 3 calculations, the last two elements are partially calculated in the current cycle, and the rest of the calculations are completed by placing the last two elements at the positions of the first two elements in the next cycle.
Optionally, the value output unit is specifically configured to:
when the current period is finished, obtaining a complete output element at the tail end of each column in the unit array, and then, when the next period starts, translating the value-taking area of the input buffer on the input matrix rightwards by 3 pixels relative to the value-taking area of the current period, taking 5 elements and outputting them to the unit array for convolution operation.
Optionally, the three cell arrays comprise: a first cell array, a second cell array, and a third cell array;
the adding unit is specifically configured to:
adding the result obtained at the tail end of the first column in the first unit array, the result obtained at the tail end of the first column in the second unit array and the result obtained at the tail end of the first column in the third unit array in the current period to obtain a complete output element;
adding a result obtained at the end of a second column in the first unit array, a result obtained at the end of a third column in the first unit array and a result obtained at the end of a third column in the second unit array in the current period to a partial result obtained in the previous period respectively to obtain a complete output element;
and adding a result obtained at the end of the second column in the second unit array, a result obtained at the end of the second column in the third unit array and a result obtained at the end of the third column in the third unit array in the current period to a partial result obtained in the next period respectively to obtain a complete output element.
The invention provides a data mapping method for improving the utilization rate of storage and computation arrays, in which a k multiplied by k convolution kernel is split and mapped into k unit arrays, each unit array is mapped with a 1 multiplied by k multiplied by C vector, the mapping continues into the next unit array if the number of rows of a unit array is not enough, and the storage and computation array is determined to use a first data mapping method or a second data mapping method according to the relation between the size of the convolution kernel and the size of the storage and computation array.
Taking a 3 × 3 convolution kernel as an example, when the convolution kernel is small, the first data mapping method is used: 80% of the input data needs to be input twice in total and the remaining 20% only once, giving an average multiplexing rate of 1.8. When the convolution kernel is large, the second data mapping method is used: all input data needs to be input only once, and the average multiplexing rate is 3. Both data mapping methods effectively relieve the pressure on the input buffer and reduce the amount of data transferred, thereby reducing chip latency and power consumption.
In addition, depending on the size of the convolution kernel, the method provided by the invention can increase the ratio of calculation speed to occupied arrays (speed/area for short) by up to a factor of three. When the second mapping method is used, roughly 3 times as many arrays are occupied while the calculation speed also increases 3-fold, so the speed/area is unchanged. When the convolution kernel has very few input or output channels, in particular when the number of occupied rows grows to 5/3 of the original and the number of occupied columns grows to 3 times the original without exceeding the size of a single array, the calculation speed increases 3-fold while the number of occupied arrays is unchanged, so the speed/area also increases 3-fold; the idle rows and columns inside each array are used effectively, and the utilization rate and calculation speed of the storage and computation arrays are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a diagram illustrating a simulation of a mapping scheme when a convolution kernel is small according to an embodiment of the present invention;
fig. 2 is a diagram illustrating a simulation of a mapping manner when a convolution kernel is large according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The inventor finds that the conventional methods for performing convolution operations with a storage and computation array suffer from high power consumption, a low speed/area ratio and similar problems. Taking M convolution kernels of size 3 × 3 × C as an example, each number in the input matrix participates in an average of 9 multiply-add operations, and the common ways of mapping the weight matrix and streaming the input data are as follows:
firstly, each 3 × 3 × C convolution kernel is unrolled into a one-dimensional column vector and placed in a single column, with different columns mapping different convolution kernels; secondly, the 3 × 3 convolution kernel is split and mapped into 9 unit arrays, each column of each unit array mapping a 1 × C vector; thirdly, the 3 × 3 convolution kernel is split and mapped into 3 unit arrays, each unit array mapping a 1 × 3 × C column vector. These three approaches still have two significant drawbacks:
1) Multiplexing of the input data inside the array cannot be realized. As mentioned above, since each input datum participates in an average of 9 multiply-add operations under a 3 × 3 convolution kernel, the above methods still need to input each datum 9 times into different unit arrays or into different rows of the same unit array. The multiplexing rate of the input data is insufficient.
2) When one layer of the neural network is accelerated with multiple copies, more processing units and arrays are often used for accelerated calculation in a pipeline, and the idle rows and columns inside each unit array cannot be utilized effectively; the problem is especially prominent when the convolution kernel has few output channels.
In view of the above problems, the inventors have creatively proposed the data mapping method for improving the utilization rate of the storage array of the present invention, and the method of the present invention will be described in detail below.
In general, a storage and computation array is composed of a plurality of unit arrays. When the data mapping method of the present invention is applied, a k × k convolution kernel is first split and mapped into k unit arrays, each unit array mapping a 1 × k × C column vector; if the number of rows of a unit array is not enough, the mapping continues into the next unit array, where k × k represents the height × width of the convolution kernel and C represents the number of input channels of the convolution kernel.
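As an illustration of this splitting step, the following Python sketch (an assumption of this description rather than part of the original disclosure; names such as split_kernel_rows and unit_arrays are made up) unrolls each kernel row of a k × k × C × M kernel into a (k·C)-row weight block and spills it across n-row unit arrays when necessary.

```python
import numpy as np

def split_kernel_rows(kernel, n):
    """Split a k x k x C x M kernel by kernel row: row r is unrolled into a
    (k*C) x M weight block intended for unit array r, and spills over into
    further n-row unit arrays when k*C > n.  Purely illustrative: a 'unit
    array' here is just a numpy block of at most n rows."""
    k, k2, C, M = kernel.shape
    assert k == k2, "square kernel assumed"
    unit_arrays = []
    for r in range(k):
        col_block = kernel[r].reshape(k * C, M)   # the 1 x k x C vector per output channel
        for start in range(0, k * C, n):          # continue into the next array if rows run out
            unit_arrays.append(col_block[start:start + n, :])
    return unit_arrays

# Example: 3x3 kernel, C = 4 input channels, M = 2 output channels, 16-row arrays.
blocks = split_kernel_rows(np.arange(3 * 3 * 4 * 2.0).reshape(3, 3, 4, 2), n=16)
print([b.shape for b in blocks])                  # three (12, 2) blocks, one per kernel row
```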
Then, whether the storage and computation array uses the first data mapping method or the second data mapping method is determined according to the relation between the size of the convolution kernel and the size of the storage and computation array. Specifically, the choice can be made as follows:
assuming that the size of the convolution kernel is k (height of the convolution kernel) × k (width of the convolution kernel) × C (number of input channels of the convolution kernel) × M (number of output channels of the convolution kernel), the size of the computing array is n (number of rows of the computing array) × n (number of columns of the computing array).
When the convolution kernel is split and mapped to any unit array, if the number of occupied rows is not more than k/(2k-1) of the size of the unit array, or the number of occupied columns is less than 1/k of the size of the unit array, the storage and computation array is determined to use the first data mapping method.
If the number of occupied rows is not more than k/(2k-1) of the size of the unit array but the number of occupied columns is not less than 1/k of the size of the unit array, the speed/area of this data mapping method can be increased by a factor of between 1 and k/2 compared with the existing mapping method; if the number of occupied columns is less than 1/k of the size of the unit array but the number of occupied rows is more than k/(2k-1) of the size of the unit array, the speed/area can be increased by a factor of between k/2 and k compared with the existing mapping method.
When the convolution kernel is split and mapped to any unit array, if the number of occupied rows is not more than k/(2k-1) of the size of the unit array and the number of occupied columns is less than 1/k of the size of the unit array, the storage and computation array is determined to use the first data mapping method; in this case, the full k-fold increase in speed/area can be achieved.
The foregoing handles the case where the convolution kernel is small. When the convolution kernel is large, it needs to occupy multiple unit arrays to map all of its data completely. In this case, when the convolution kernel is split and mapped to any unit array, the number of occupied rows is not less than k times the size of the unit array, and, based on the size of the convolution kernel and the size of the storage and computation array, the ratio of the first occupied array to the calculation speed and the ratio of the second occupied array to the calculation speed are calculated, that is, the first speed/area and the second speed/area are obtained.
The ratio of the first occupied array to the calculation speed is: mapping the convolution kernel to the ratio of the occupied array to the calculation speed when storing and calculating the array by adopting a traditional method; the ratio of the second occupancy array to the calculation speed is: and respectively mapping the parameters on each input channel of the convolution kernel to different unit arrays in the storage and calculation array, and respectively mapping the same weight to k columns in each unit array without a row-empty method to obtain the ratio of the occupied array to the calculation speed. The meaning of the empty row is explained below and will not be described in detail.
And after the ratio of the first occupation array to the calculation speed and the ratio of the second occupation array to the calculation speed are obtained, if the value of the ratio of the first occupation array to the calculation speed is larger than the value of the ratio of the second occupation array to the calculation speed, determining that the storage and calculation array uses a second data mapping method.
The formula of the ratio between the first occupancy array and the calculated speed may be:
k×ceil(k×C/n)×ceil(M/n)
the formula for the ratio of the second occupancy array to the calculated speed may be:
ceil(C/n) × k × ceil(M × k/n)
In the above two formulas, ceil(k × C/n) indicates that the convolution kernel mapping needs to occupy ceil(k × C/n) unit arrays in the row dimension in the conventional method, ceil(M/n) indicates that it needs to occupy ceil(M/n) unit arrays in the column dimension in the conventional method, ceil(C/n) indicates that each input channel occupies ceil(C/n) unit arrays in the method of the present invention, ceil(C/n) × k indicates that ceil(C/n) × k unit arrays are occupied in the row dimension in the method of the present invention, and ceil(M × k/n) indicates that ceil(M × k/n) unit arrays are occupied in the column dimension in the method of the present invention.
If the value calculated by k × ceil(k × C/n) × ceil(M/n) is greater than the value calculated by ceil(C/n) × k × ceil(M × k/n), the second data mapping method of the present invention is used.
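A minimal sketch of this selection rule, implementing only the two formulas above (the function and variable names, as well as the example sizes, are assumptions made for illustration):

```python
import math

def area_speed_ratios(k, C, M, n):
    """Return the (conventional, proposed) array-to-speed ratios from the two formulas above."""
    conventional = k * math.ceil(k * C / n) * math.ceil(M / n)
    proposed = math.ceil(C / n) * k * math.ceil(M * k / n)
    return conventional, proposed

def choose_mapping(k, C, M, n):
    conventional, proposed = area_speed_ratios(k, C, M, n)
    return "second data mapping method" if conventional > proposed else "conventional mapping"

# Example with assumed sizes: 3x3 kernel, C = 128 input channels, M = 64 output channels,
# 256 x 256 unit arrays.
print(area_speed_ratios(3, 128, 64, 256))   # (6, 3) -> the proposed mapping wins
print(choose_mapping(3, 128, 64, 256))      # "second data mapping method"
```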
The following describes the contents of the first data mapping method and the second data mapping method in the embodiment of the present invention, taking a 3 × 3 convolution kernel as an example.
In combination with the simulation diagram of the mapping mode for a smaller convolution kernel shown in Fig. 1, the first data mapping method includes: mapping the first row of the 3 × 3 convolution kernel to any unit array, wherein the elements of the first column in the unit array are arranged as the first row of the 3 × 3 convolution kernel expanded into a column vector. For example, as shown in Fig. 1, the 3 × 3 convolution kernel is illustrated by X1 to Xc, Y1 to Yc and Z1 to Zc, and the input matrix by the matrix data INPUT01 and INPUT02, where INPUT01 comprises IN01~IN0c, IN11~IN1c, IN21~IN2c and IN31~IN3c.
In the first cycle, the input buffer inputs the corresponding matrix data INPUT01 into the storage and computation array, and the first element OUT01 of the output matrix is calculated. In the conventional mapping method, the input buffer would input INPUT02 in the next cycle and obtain the second element OUT02 of the output matrix at the end of the first column of the array. However, as shown in the upper left corner of Fig. 1, two thirds of the input data of two adjacent cycles are repeated. To make effective use of the data already input in the first cycle, the same weights as those in the first column are mapped in the second and third columns of the array, but the mapping positions are shifted downwards in sequence, by C rows each time. That is, the position of X1 in the second column is aligned with the position of Y1 in the first column, and the position of X1 in the third column is aligned with the position of Y1 in the second column and with the position of Z1 in the first column.
Expanding the input elements of the input buffer into a 1 × 5 × C column vector then yields three elements of the output matrix, OUT01, OUT02 and OUT03, in one cycle, so the calculation speed is tripled. Of the input data, one fifth is used in three calculations, two fifths in two calculations, and the remaining two fifths in one calculation, giving an average data multiplexing rate of 1.8. The cost is that the number of occupied rows in the array grows to 5/3 of the original, the number of occupied columns grows to 3 times the original, and the amount of data input by the input buffer per cycle grows to 5/3 of the original.
That is, of the 5 input elements, the first two have already been partially computed in the previous cycle and the current cycle completes the rest of their computations; the middle element needs to be input only once in this cycle to complete all 3 of its calculations; the last two elements complete part of their computations in this cycle, and the remaining computations are completed when these two elements occupy the positions of the first two elements in the next cycle. Under this data mapping method, a complete output element is obtained at the end of each column after every cycle, and the value-taking area of the input buffer on the input matrix is then shifted 3 pixels to the right. Thus, under this data mapping method, the input buffer inputs 5/3 times as much data per cycle, but the number of calculation cycles becomes one third of the original.
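The shifted-column weight layout and the resulting input reuse can be modelled with a few lines of numpy; the sketch below is an illustration under assumed sizes (C = 4, random weights X, Y, Z for one kernel row and a random 1 × 5 × C input window), not the patent's implementation:

```python
import numpy as np

C = 4                                        # assumed number of input channels
rng = np.random.default_rng(0)
X, Y, Z = rng.standard_normal((3, C))        # one kernel row (three positions, C channels each)
w_col = np.concatenate([X, Y, Z])            # the 1 x 3 x C vector unrolled into a 3C column

# First mapping method: a 5C-row, 3-column weight block; every column holds the
# same 3C weights, shifted down by 0, C and 2C rows respectively.
W = np.zeros((5 * C, 3))
for col, shift in enumerate((0, C, 2 * C)):
    W[shift:shift + 3 * C, col] = w_col

p = rng.standard_normal((5, C))              # one cycle of input: a 1 x 5 x C window (5 pixels)
outs = p.reshape(-1) @ W                     # three column-end results in a single cycle

# Reference: each column computes the dot product of the kernel row with three
# adjacent pixels (stride 1), i.e. three neighbouring outputs per cycle.
ref = [p[j] @ X + p[j + 1] @ Y + p[j + 2] @ Z for j in range(3)]
print(np.allclose(outs, ref))                # True
```

Here each input pixel is reused by up to three columns within the same cycle, which is the source of the 1.8 average multiplexing rate described above.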
In combination with the simulation diagram of the mapping mode for a larger convolution kernel shown in Fig. 2, the second data mapping method includes: mapping the parameters on each input channel of the 3 × 3 convolution kernel to different unit arrays in the storage and computation array respectively, with three columns in each unit array mapped with the same weight and without empty rows. For example, as shown in Fig. 2, the 3 × 3 convolution kernel is illustrated by X1 to Xc, Y1 to Yc and Z1 to Zc, and the input matrix by the matrix data INPUTm1, where INPUTm1 comprises INm1~INmc, IN(m+1)1~IN(m+1)c and IN(m+2)1~IN(m+2)c.
And respectively mapping the parameters on each input channel into 3 unit arrays, wherein three columns in each unit array are respectively mapped with the same weight without empty rows. The parameters mapped by each column of each of the three cell arrays are different from those mapped by the other cell arrays. Such as: the first column mapping parameters in the upper first cell array in the memory array are X1-Xc, the first column mapping parameters in the middle second cell array are Y1-Yc, and the first column mapping parameters in the lower third cell array are Z1-Zc. And partial results are output at the tail end of each column of the unit array, and partial results obtained in each period are grouped and added to obtain a convolution result. In this case, the average multiplexing rate of data reaches 3, and no additional rows are occupied compared to the first data mapping, and the input buffer does not need to input additional data per cycle.
The result OUT01 obtained at the tail end of the first column in the first unit array, the result OUT04 obtained at the tail end of the first column in the second unit array and the result OUT07 obtained at the tail end of the first column in the third unit array in the current period are added to obtain a complete output element.
And adding the result OUT02 obtained at the end of the second column in the first unit array, the result OUT03 obtained at the end of the third column in the first unit array and the result OUT06 obtained at the end of the third column in the second unit array in the current period to the partial result obtained in the previous period respectively to obtain a complete output element.
And adding the result OUT05 obtained at the end of the second column in the second unit array in the current period, the result OUT08 obtained at the end of the second column in the third unit array in the current period and the result OUT09 obtained at the end of the third column in the third unit array to the partial result obtained in the next period respectively to obtain a complete output element. Therefore, all calculation can be completed by inputting each element in the input matrix once, the input data of the input buffer per period is unchanged, and the calculation period number is changed to one third of the original calculation period number.
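The grouping of partial results across neighbouring cycles can be sketched as follows. The exact column-to-output wiring is defined by Fig. 2, which is not reproduced here, so the staggered layout below is only an assumed reconstruction that matches the behaviour described in the text: three new pixels are input per cycle (each input exactly once), nine column-end results are produced, and they combine, partly within the cycle and partly with the neighbouring cycles, into three complete outputs per cycle.

```python
import numpy as np

C, P = 4, 12                                 # assumed channel count and number of input pixels
rng = np.random.default_rng(1)
X, Y, Z = rng.standard_normal((3, C))        # one kernel row, per input channel
pix = rng.standard_normal((P, C))            # a row of input pixels

# Reference: 1-D convolution of this kernel row over the pixel row.
ref = [pix[j] @ X + pix[j + 1] @ Y + pix[j + 2] @ Z for j in range(P - 2)]

# Assumed wiring: array a holds one weight vector (X, Y or Z) in all three
# columns, staggered by C rows, so column c of array a multiplies pixel
# (c + a) mod 3 of the current 3-pixel window and contributes to the output
# with index 3t + (c + a) mod 3 - a.
partial = {}                                 # output index -> accumulated partial sum
for t in range(P // 3):                      # one cycle: 3 new pixels, each input once
    window = pix[3 * t: 3 * t + 3]
    for a, w in enumerate((X, Y, Z)):
        for c in range(3):
            s = (c + a) % 3                  # which pixel of the window this column sees
            j = 3 * t + s - a                # output this partial product belongs to
            if 0 <= j <= P - 3:
                partial[j] = partial.get(j, 0.0) + window[s] @ w

print(np.allclose([partial[j] for j in range(P - 2)], ref))   # True
```

In each cycle, three of the nine column results form one complete output immediately, while the remaining results either complete outputs started in the previous cycle or are held as partial sums for the next one, which mirrors the grouping of additions described above.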
Taking a 3 × 3 convolution kernel as an example, when the convolution kernel is small, the first data mapping method is used: 80% of the input data needs to be input twice in total and the remaining 20% only once, giving an average multiplexing rate of 1.8. When the convolution kernel is large, the second data mapping method is used: all input data needs to be input only once, and the average multiplexing rate is 3. Both data mapping methods effectively relieve the pressure on the input buffer and reduce the amount of data transferred, thereby reducing chip latency and power consumption.
In addition, depending on the size of the convolution kernel, the method provided by the invention can increase the ratio of calculation speed to occupied arrays (speed/area for short) by up to a factor of three. When the second mapping method is used, roughly 3 times as many arrays are occupied while the calculation speed also increases 3-fold, so the speed/area is unchanged. When the convolution kernel has very few input or output channels, in particular when the number of occupied rows grows to 5/3 of the original and the number of occupied columns grows to 3 times the original without exceeding the size of a single array, the calculation speed increases 3-fold while the number of occupied arrays is unchanged, so the speed/area also increases 3-fold; the idle rows and columns inside each array are used effectively, and the utilization rate and calculation speed of the storage and computation arrays are improved.
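As a quick numeric check of this claim (the kernel and array sizes below are assumed purely for illustration):

```python
# Speed/area check for the small-channel case: one 256 x 256 unit array,
# a 3 x 3 kernel with C = 32 input channels and M = 16 output channels.
k, C, M, n = 3, 32, 16, 256
rows_before, cols_before = k * C, M               # conventional footprint: 96 rows x 16 columns
rows_after, cols_after = (2 * k - 1) * C, k * M   # first mapping method: 160 rows (5/3x) x 48 columns (3x)
assert rows_after <= n and cols_after <= n        # still fits inside a single unit array
speedup, extra_arrays = k, 1                      # 3 outputs per cycle, same number of arrays
print(speedup / extra_arrays)                     # speed/area improves 3x
```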
Based on the above data mapping method for improving the utilization rate of the storage and computation array, an embodiment of the present invention further provides a data mapping apparatus for improving the utilization rate of the storage and computation array, where the data mapping apparatus includes:
a splitting and mapping module, configured to split a k multiplied by k convolution kernel and map it into k unit arrays, each unit array being mapped with a 1 multiplied by k multiplied by C vector, and, if the number of rows of a unit array is not enough, to continue the mapping into the next unit array, wherein k multiplied by k represents the height multiplied by the width of the convolution kernel, and C represents the number of input channels of the convolution kernel;
a determining method module, configured to determine whether the computational array uses a first data mapping method or a second data mapping method according to the relationship between the convolution kernel size and the computational array size;
taking k = 3 as an example, the data mapping apparatus further includes: a first method module and a second method module;
wherein the first method module comprises:
the expansion mapping unit is used for mapping the first row of the 3 x 3 convolution kernel to any unit array in the storage array, and elements in the first column in the unit array correspond to an arrangement mode of the first row in the 3 x 3 convolution kernel after being expanded into column vectors;
the empty row mapping unit is used for respectively mapping the same weights as those of the first column in the unit array in the second column and the third column in the unit array, a C row is vacated downwards from the mapping position corresponding to the second column element in the unit array, and a 2C row is vacated downwards from the mapping position corresponding to the third column element in the unit array;
an expansion input unit for expanding input data in the input buffer into a 1 × 5 × C column vector;
the first convolution unit is used for outputting 5 elements in an input matrix to the unit array by the input buffer in the current period and performing convolution operation;
the value output unit is used for obtaining a complete output element at the tail end of each column in the unit array when the current period is finished, and then carrying out the convolution operation of the next period;
the second method module comprises:
a full row mapping unit, configured to map parameters on each input channel of the 3 × 3 convolution kernel to different unit arrays in the storage array respectively, and map the same weight to three columns in each unit array respectively without empty rows, so that each input channel occupies ceil(C/n) unit arrays, ceil(C/n) × 3 arrays are occupied in the row dimension in total, and ceil(M × 3/n) arrays are occupied in the column dimension, where ceil is a rounding-up function, and n represents the size of the storage array;
the second convolution unit is used for outputting 3 elements in the input matrix to three unit arrays in the storage and calculation array by the input buffer in the current period and performing convolution operation to obtain a partial result;
and the adding unit is used for adding the partial results obtained in each period in groups to obtain a convolution result.
Optionally, the size of the convolution kernel is k × k × C × M, and the size of the storage array is n × n, where M represents the number of output channels of the convolution kernel;
the determination method module is specifically configured to:
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not more than k/(2k-1) of the size of the unit array, or the number of occupied columns of the unit array is less than 1/k of the size of the unit array, determining that the storage and computation array uses the first data mapping method;
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not more than k/(2k-1) of the size of the unit array, and the number of occupied columns of the unit array is less than 1/k of the size of the unit array, determining that the storage and computation array uses the first data mapping method;
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not less than k times the size of the unit array, calculating, based on the size of the convolution kernel and the size of the storage and computation array, the ratio of a first occupied array to a calculation speed and the ratio of a second occupied array to the calculation speed, wherein the ratio of the first occupied array to the calculation speed is the ratio of occupied arrays to calculation speed when the convolution kernel is mapped to the storage and computation array by the conventional method, and the ratio of the second occupied array to the calculation speed is the ratio of occupied arrays to calculation speed obtained by mapping the parameters on each input channel of the convolution kernel to different unit arrays in the storage and computation array respectively and mapping the same weight in three columns of each unit array without empty rows;
and if the value of the ratio of the first occupied array to the calculation speed is larger than the value of the ratio of the second occupied array to the calculation speed, determining that the storage and calculation array uses the second data mapping method.
Optionally, the first convolution unit is specifically configured to:
the first two elements of the 5 elements are partially calculated in the previous period, the rest of the calculations are completed in the current period, the middle element is input once in the current period to complete the whole 3 calculations, the last two elements are partially calculated in the current period, and the rest of the calculations are completed by placing the last two elements at the positions of the first two elements in the next period.
Optionally, the value output unit is specifically configured to:
when the current period is finished, a complete output element is obtained at the tail end of each column in the unit array; then, when the next period starts, the value-taking area of the input buffer on the input matrix is translated rightwards by 3 pixels relative to the value-taking area of the current period, and 5 elements are taken and output to the unit array for convolution operation.
Optionally, the three cell arrays include: a first cell array, a second cell array, a third cell array;
the adding unit is specifically configured to:
adding the result obtained at the tail end of the first column in the first unit array, the result obtained at the tail end of the first column in the second unit array and the result obtained at the tail end of the first column in the third unit array in the current period to obtain a complete output element;
adding a result obtained at the end of a second column in the first unit array, a result obtained at the end of a third column in the first unit array and a result obtained at the end of a third column in the second unit array in the current period to a partial result obtained in the previous period respectively to obtain a complete output element;
and adding a result obtained at the end of the second column in the second unit array, a result obtained at the end of the second column in the third unit array and a result obtained at the end of the third column in the third unit array in the current period to a partial result obtained in the next period respectively to obtain a complete output element.
Through the above examples, in the data mapping method for improving the utilization rate of the storage and computation array, a k × k convolution kernel is split and mapped into k unit arrays, each unit array maps one 1 × k × C vector, the mapping continues into the next unit array if the number of rows of a unit array is not enough, and the storage and computation array is determined to use the first data mapping method or the second data mapping method according to the relation between the convolution kernel size and the storage and computation array size.
Taking a 3 × 3 convolution kernel as an example, when the convolution kernel is small, the first data mapping method is used: 80% of the input data needs to be input twice in total and the remaining 20% only once, giving an average multiplexing rate of 1.8. When the convolution kernel is large, the second data mapping method is used: all input data needs to be input only once, and the average multiplexing rate is 3. Both data mapping methods effectively relieve the pressure on the input buffer and reduce the amount of data transferred, thereby reducing chip latency and power consumption.
In addition, depending on the size of the convolution kernel, the method provided by the invention can increase the ratio of calculation speed to occupied arrays (speed/area for short) by up to a factor of three. When the second mapping method is used, roughly 3 times as many arrays are occupied while the calculation speed also increases 3-fold, so the speed/area is unchanged. When the convolution kernel has very few input or output channels, in particular when the number of occupied rows grows to 5/3 of the original and the number of occupied columns grows to 3 times the original without exceeding the size of a single array, the calculation speed increases 3-fold while the number of occupied arrays is unchanged, so the speed/area also increases 3-fold; the idle rows and columns inside each array are used effectively, and the utilization rate and calculation speed of the storage and computation arrays are improved.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
While the present invention has been described with reference to the particular illustrative embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but is intended to cover various modifications, equivalent arrangements, and equivalents thereof, which may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A data mapping method for increasing the utilization rate of a storage and computation array, the storage and computation array comprising: a plurality of unit arrays, the data mapping method comprising:
splitting a k multiplied by k convolution kernel and mapping it into k unit arrays, mapping a 1 multiplied by k multiplied by C vector in each unit array, and continuing the mapping into the next unit array if the number of rows of a unit array is not enough, wherein k multiplied by k represents the height multiplied by the width of the convolution kernel, and C represents the number of input channels of the convolution kernel;
determining whether the storage and computation array uses a first data mapping method or a second data mapping method according to the relation between the size of the convolution kernel and the size of the storage and computation array;
taking k = 3 as an example, the first data mapping method includes:
mapping a first row of a 3 × 3 convolution kernel to any unit array, wherein elements in a first column in the unit array correspond to an arrangement mode of the first row in the 3 × 3 convolution kernel after being expanded into a column vector;
respectively mapping the same weights as those in the first column of the unit array in the second column and the third column of the unit array, wherein the mapping position corresponding to the second column of the unit array is vacant downwards by C rows, and the mapping position corresponding to the third column of the unit array is vacant downwards by 2C rows;
expanding the input elements in the input buffer to a 1 × 5 × C column vector;
in the current period, the input buffer outputs 5 elements in the input matrix to the unit array and carries out convolution operation;
when the current period is finished, obtaining a complete output element at the tail end of each column in the unit array, and then performing the convolution operation of the next period;
taking k = 3 as an example, the second data mapping method includes:
mapping parameters on each input channel of a 3 × 3 convolution kernel into different unit arrays in the storage array respectively, and mapping the same weight on three columns in each unit array respectively without empty rows, so that each input channel occupies ceil (C/n) unit arrays, which occupy ceil (C/n) × 3 arrays in the row dimension in total, and ceil (M × 3/n) arrays in the column dimension, wherein ceil is an upward rounding function, and n represents the size of the storage array;
in the current period, the input buffer outputs 3 elements in the input matrix to three unit arrays in the storage and calculation array and carries out convolution operation to obtain partial results;
and adding partial results obtained in each period in groups to obtain a convolution result.
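As a minimal illustration of the first data mapping method in claim 1 (a sketch, not the claimed implementation), the following Python fragment lays out one k × k × C kernel for a single output channel. It assumes that n is the side length of one n × n unit array and that column j of each unit array repeats the weights of the first column shifted down by j × C vacant rows; the function and variable names are invented for illustration.

import numpy as np

def first_method_layout(kernel, n):
    # Lay out a (k, k, C) kernel row by row into k unit arrays (claim 1, first method).
    # Assumptions: single output channel, n is the side length of one n x n unit array.
    k, _, C = kernel.shape
    if (2 * k - 1) * C > n:
        raise ValueError("kernel row does not fit one unit array with the shifted columns")
    unit_arrays = []
    for r in range(k):                       # one unit array per kernel row
        col = kernel[r].reshape(k * C)       # 1 x k x C kernel row expanded into a column vector
        ua = np.zeros((n, n))
        for j in range(k):                   # column j is vacated j*C rows downwards
            ua[j * C:j * C + k * C, j] = col
        unit_arrays.append(ua)
    return unit_arrays

# toy sizes: k = 3, C = 8, unit array 64 x 64
arrays = first_method_layout(np.random.rand(3, 3, 8), n=64)
print(len(arrays), arrays[0].shape)          # 3 unit arrays, each 64 x 64

The fit check at the top reflects that the shifted columns span (2k - 1) × C rows of a unit array, which appears to be the origin of the k/(2k-1) row threshold used in claim 2.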
2. The data mapping method of claim 1, wherein the size of the convolution kernel is k × k × C × M, and the size of the storage array is n × n, wherein M represents the number of output channels of the convolution kernel;
the determining, according to the relation between the size of the convolution kernel and the size of the storage array, whether the storage array uses the first data mapping method or the second data mapping method comprises:
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not more than k/(2k-1) of the size of the unit array, or the number of occupied columns of the unit array is less than 1/k of the size of the unit array, determining that the storage array uses the first data mapping method;
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not more than k/(2k-1) of the size of the unit array, and the number of occupied columns of the unit array is less than 1/k of the size of the unit array, determining that the storage array uses the first data mapping method;
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not less than k times the size of the unit array, calculating a first ratio of occupied arrays to computation speed and a second ratio of occupied arrays to computation speed based on the size of the convolution kernel and the size of the storage array, wherein the first ratio of occupied arrays to computation speed is the ratio obtained when the convolution kernel is mapped to the storage array by a conventional method, and the second ratio of occupied arrays to computation speed is the ratio obtained when the parameters on each input channel of the convolution kernel are mapped to different unit arrays in the storage array and the same weights are mapped to k columns in each unit array without vacant rows;
and if the first ratio of occupied arrays to computation speed is greater than the second ratio of occupied arrays to computation speed, determining that the storage array uses the second data mapping method.
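A compact sketch of this selection rule, with the claim-3 ratios folded in. The occupancy arguments, the reading of n as the unit-array side length, and the function name are assumptions made for illustration, not part of the claims.

import math

def choose_mapping(rows_occupied, cols_occupied, k, C, M, n):
    # Decide between the first and second data mapping methods (claims 2-3).
    # rows_occupied / cols_occupied: rows and columns one unit array takes when
    # the kernel is split-mapped onto it; n: assumed unit-array side length;
    # k, C, M: kernel height/width, input channels, output channels.
    if rows_occupied <= n * k / (2 * k - 1) or cols_occupied < n / k:
        # either threshold of claim 2 (the "or" form subsumes the "and" form)
        return "first"
    if rows_occupied >= n * k:
        # claim-3 ratios of occupied arrays to computation speed
        first_ratio = k * math.ceil(k * C / n) * math.ceil(M / n)
        second_ratio = math.ceil(C / n) * k * math.ceil(M * k / n)
        if first_ratio > second_ratio:
            return "second"
    return "first"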
3. The data mapping method of claim 2, wherein the first ratio of occupied arrays to computation speed is given by:
k × ceil(k × C/n) × ceil(M/n)
and the second ratio of occupied arrays to computation speed is given by:
ceil(C/n) × k × ceil(M × k/n)
in the above two formulas, ceil(k × C/n) indicates that the convolution kernel mapping occupies ceil(k × C/n) unit arrays in the row dimension under the conventional method, ceil(M/n) indicates that the convolution kernel mapping occupies ceil(M/n) unit arrays in the column dimension under the conventional method, ceil(C/n) indicates that each input channel occupies ceil(C/n) unit arrays, ceil(C/n) × k indicates that ceil(C/n) × k unit arrays are occupied in the row dimension, and ceil(M × k/n) indicates that ceil(M × k/n) unit arrays are occupied in the column dimension.
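As an illustrative calculation with assumed sizes k = 3, C = 200, M = 64 and n = 128: the first ratio is 3 × ceil(600/128) × ceil(64/128) = 3 × 5 × 1 = 15, and the second ratio is ceil(200/128) × 3 × ceil(192/128) = 2 × 3 × 2 = 12; since 15 > 12, the second data mapping method would be selected under claim 2.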
4. The data mapping method of claim 1, wherein the outputting, by the input buffer in the current period, 5 elements of the input matrix to the unit array and performing the convolution operation comprises:
the first two of the 5 elements were partially computed in the previous period and their remaining computations are completed in the current period; the middle element is input once in the current period and completes all 3 of its computations; the last two elements are partially computed in the current period, and their remaining computations are completed by placing them at the positions of the first two elements in the next period.
5. The data mapping method according to claim 4, wherein the obtaining, at the end of the current period, a complete output element at the end of each column in the unit array and then performing the convolution operation of the next period comprises:
at the end of the current period, obtaining a complete output element at the end of each column in the unit array; then, at the start of the next period, translating the sampling window of the input buffer on the input matrix rightwards by 3 pixels relative to its position in the current period, taking 5 elements, and outputting them to the unit array for the convolution operation.
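The sliding behaviour of claims 4 and 5 can be checked with a short sketch. It assumes one kernel row, C = 1, and a 1-D input for readability, and the function name is invented; it is not the claimed implementation.

import numpy as np

def first_method_cycles(x, w):
    # Emulate the per-period behaviour of claims 4-5 for one kernel row with C = 1.
    # Each period the buffer presents a window of 2k-1 = 5 elements; the three
    # column-shifted copies of the 3-tap weights yield three consecutive outputs,
    # and the window then slides right by 3 pixels.
    k = len(w)                                            # k = 3 here
    outputs = []
    for start in range(0, len(x) - (2 * k - 1) + 1, k):   # slide by 3 per period
        window = x[start:start + 2 * k - 1]               # 5 elements per period
        for j in range(k):                                # column j sees window[j:j+3]
            outputs.append(float(np.dot(w, window[j:j + k])))
    return outputs

x = np.arange(12.0)                  # toy 1-D input row
w = np.array([1.0, 2.0, 3.0])        # one row of a 3 x 3 kernel, C = 1
print(first_method_cycles(x, w))     # the sliding dot products at output positions 0..8

In this sketch the last two elements of one period's window reappear as the first two of the next, matching the "first two / last two" bookkeeping in claim 4: each input element ends up contributing to three output elements.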
6. The data mapping method of claim 2, wherein the three unit arrays comprise a first unit array, a second unit array, and a third unit array;
and the adding the partial results obtained in each period in groups to obtain the convolution result comprises:
adding the result obtained at the end of the first column in the first unit array, the result obtained at the end of the first column in the second unit array, and the result obtained at the end of the first column in the third unit array in the current period to obtain a complete output element;
adding each of the result obtained at the end of the second column in the first unit array, the result obtained at the end of the third column in the first unit array, and the result obtained at the end of the third column in the second unit array in the current period to a corresponding partial result obtained in the previous period to obtain a complete output element;
and adding each of the result obtained at the end of the second column in the second unit array, the result obtained at the end of the second column in the third unit array, and the result obtained at the end of the third column in the third unit array in the current period to a corresponding partial result obtained in the next period to obtain a complete output element.
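Purely as bookkeeping, and assuming nothing beyond the claim text, the grouping above can be written out as a small lookup table; the pairs are (unit array, column), 1-based as in the claim.

# Grouping of per-column results under the second data mapping method (claim 6).
# Each key states which period supplies the partial result that completes the output.
PARTIAL_SUM_GROUPS = {
    "complete within the current period":       [(1, 1), (2, 1), (3, 1)],
    "completed with a previous-period partial": [(1, 2), (1, 3), (2, 3)],
    "completed with a next-period partial":     [(2, 2), (3, 2), (3, 3)],
}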
7. A data mapping apparatus for increasing utilization of a storage array, the data mapping apparatus comprising:
a split mapping module, configured to split and map a k × k convolution kernel into k unit arrays, map a 1 × k × C vector in each unit array, and continue the mapping into the next unit array if the number of rows of a unit array is insufficient, wherein k × k represents the height × width of the convolution kernel and C represents the number of input channels of the convolution kernel;
a determination method module, configured to determine, according to the relation between the size of the convolution kernel and the size of the storage array, whether the storage array uses a first data mapping method or a second data mapping method;
taking k = 3 as an example, the data mapping apparatus further comprises: a first method module and a second method module;
wherein the first method module comprises:
an expansion mapping unit, configured to map the first row of the 3 × 3 convolution kernel to any unit array in the storage array, wherein the elements in the first column of the unit array are arranged as the first row of the 3 × 3 convolution kernel expanded into a column vector;
an empty row mapping unit, configured to map the same weights as those in the first column of the unit array into the second column and the third column of the unit array, respectively, wherein the mapping position of the second column is shifted downwards by C vacant rows and the mapping position of the third column is shifted downwards by 2C vacant rows;
an expansion input unit, configured to expand the input data in the input buffer into a 1 × 5 × C column vector;
a first convolution unit, configured to output, by the input buffer in the current period, 5 elements of the input matrix to the unit array and perform the convolution operation;
a value output unit, configured to obtain a complete output element at the end of each column in the unit array at the end of the current period, and then perform the convolution operation of the next period;
the second method module comprises:
a full row mapping unit, configured to map the parameters on each input channel of the 3 × 3 convolution kernel to different unit arrays in the storage array, respectively, and map the same weights to three columns in each unit array, respectively, without vacant rows, so that each input channel occupies ceil(C/n) unit arrays, ceil(C/n) × 3 unit arrays are occupied in total in the row dimension, and ceil(M × 3/n) unit arrays are occupied in the column dimension, wherein ceil is the rounding-up function and n represents the size of the storage array;
a second convolution unit, configured to output, by the input buffer in the current period, 3 elements of the input matrix to three unit arrays in the storage array and perform the convolution operation to obtain partial results;
and an adding unit, configured to add the partial results obtained in each period in groups to obtain a convolution result.
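For orientation only, the module structure of claim 7 can be sketched as plain Python classes; every class and method name here is invented and the method bodies are placeholders, so this is a structural outline rather than the claimed apparatus.

class FirstMethodModule:
    # Units of the first data mapping method (claim 7).
    def expansion_mapping(self, kernel_row, unit_array): ...
    def empty_row_mapping(self, unit_array, C): ...
    def expansion_input(self, input_buffer): ...
    def first_convolution(self, elements, unit_array): ...
    def value_output(self, unit_array): ...

class SecondMethodModule:
    # Units of the second data mapping method (claim 7).
    def full_row_mapping(self, kernel, storage_array): ...
    def second_convolution(self, elements, unit_arrays): ...
    def add_partial_results(self, partials): ...

class DataMappingApparatus:
    # Top-level apparatus: split mapping, method determination, and the two method modules.
    def __init__(self):
        self.first_method = FirstMethodModule()
        self.second_method = SecondMethodModule()
    def split_mapping(self, kernel, storage_array): ...
    def determine_method(self, kernel_shape, array_size): ...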
8. The data mapping apparatus according to claim 7, wherein the size of the convolution kernel is k × k × C × M, and the size of the storage array is n × n, wherein M represents the number of output channels of the convolution kernel;
the determination method module is specifically configured to:
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not more than k/(2k-1) of the size of the unit array, or the number of occupied columns of the unit array is less than 1/k of the size of the unit array, determine that the storage array uses the first data mapping method;
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not more than k/(2k-1) of the size of the unit array, and the number of occupied columns of the unit array is less than 1/k of the size of the unit array, determine that the storage array uses the first data mapping method;
when the convolution kernel is split and mapped to any unit array, if the number of occupied rows of the unit array is not less than k times the size of the unit array, calculate a first ratio of occupied arrays to computation speed and a second ratio of occupied arrays to computation speed based on the size of the convolution kernel and the size of the storage array, wherein the first ratio of occupied arrays to computation speed is the ratio obtained when the convolution kernel is mapped to the storage array by a conventional method, and the second ratio of occupied arrays to computation speed is the ratio obtained when the parameters on each input channel of the convolution kernel are mapped to different unit arrays in the storage array and the same weights are mapped to three columns in each unit array without vacant rows;
and if the first ratio of occupied arrays to computation speed is greater than the second ratio of occupied arrays to computation speed, determine that the storage array uses the second data mapping method.
9. The data mapping apparatus of claim 7, wherein the first convolution unit is specifically configured to:
the first two of the 5 elements were partially computed in the previous period and their remaining computations are completed in the current period; the middle element is input once in the current period and completes all 3 of its computations; the last two elements are partially computed in the current period, and their remaining computations are completed by placing them at the positions of the first two elements in the next period.
10. The data mapping apparatus according to claim 9, wherein the value output unit is specifically configured to:
at the end of the current period, obtain a complete output element at the end of each column in the unit array; then, at the start of the next period, translate the sampling window of the input buffer on the input matrix rightwards by 3 pixels relative to its position in the current period, take 5 elements, and output them to the unit array for the convolution operation.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination