CN114330656A - Convolution operation hardware accelerator and data processing method - Google Patents


Info

Publication number
CN114330656A
CN114330656A
Authority
CN
China
Prior art keywords
input
data
cache
convolution kernel
characteristic
Prior art date
Legal status
Granted
Application number
CN202111596275.2A
Other languages
Chinese (zh)
Other versions
CN114330656B (en)
Inventor
丁昊杰
王文华
Current Assignee
Hangzhou Flyslice Technologies Co ltd
Original Assignee
Hangzhou Flyslice Technologies Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Flyslice Technologies Co., Ltd.
Priority to CN202111596275.2A
Publication of CN114330656A
Application granted
Publication of CN114330656B
Legal status: Active

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a convolution operation hardware accelerator, comprising: a feature map reorganization module that segments the input feature data of each row of the input feature map, with at most M data per segment; a feature map cache write control module that generates, for each input feature datum, the sequence number of the feature map cache unit it is written to and the corresponding cache write address, so that at most M input feature data are written in parallel into the M feature map cache units of the feature map cache module in the same clock cycle; a feature map cache read control module that reads at most M input feature data in parallel in a single clock cycle of one convolution operation; a convolution kernel cache module that writes the convolution kernels of a single output channel into M convolution kernel cache units in groups of M convolution kernel data; and a calculation module that performs the corresponding convolution operation on the convolution kernel data and the input feature data. The invention enables the hardware accelerator to support convolution operations with parameters of different sizes and improves the calculation efficiency of the convolution operation.

Description

Convolution operation hardware accelerator and data processing method
Technical Field
The invention relates to the technical field of deep learning hardware acceleration, in particular to a convolution operation hardware accelerator and a data processing method.
Background
The convolutional neural network is one of the most important algorithms in deep learning and is widely applied in fields such as object recognition, autonomous driving, and artificial intelligence. The convolution operation is the most computation-intensive part of a convolutional neural network. Because current convolutional neural networks grow ever deeper, their computation load keeps increasing, and the data size of each layer's operator varies, improving the calculation efficiency of the convolution operation and optimizing its computation have become urgent problems for hardware accelerators.
In the prior art, support for convolution operations is not flexible or general enough: convolution operations of different sizes cannot be supported efficiently, the computing resources of a hardware accelerator are difficult to utilize fully, and calculation efficiency is therefore low.
Disclosure of Invention
Based on this, the present invention provides a convolution operation hardware accelerator and a data processing method which can markedly improve the calculation efficiency of convolution operations.
In order to achieve the above object, the present invention provides a convolution operation hardware accelerator, which includes a feature map reorganization module, a feature map cache write control module, a feature map cache module, a feature map cache read control module, N convolution kernel cache modules, and N calculation modules, wherein the feature map cache module includes M parallel feature map cache units, and M and N are integers greater than or equal to 1, wherein:
the feature map reorganization module is used for segmenting the input feature data of each row of the input feature map (IC, IH, IW), which is input in column-priority order, each segment containing at most M input feature data, where IC is the number of input channels, IH is the height of the input feature map, and IW is the width of the input feature map;
the feature map cache write control module is configured to generate, based on a cache write address rule, a sequence number of a feature map cache unit to which each piece of input feature data is to be written and a corresponding cache write address, where the cache write address rule includes: the sequence number deviant value between the characteristic diagram cache units written by the input characteristic data of the same column of the adjacent rows in the input characteristic diagram is KW% M, and the sequence number deviant value (KH KW)% M between the characteristic diagram cache units corresponding to the input characteristic data of the same column of the same row of the adjacent input channels in the input characteristic diagram is selected, wherein% is a modulus operator, KH represents the height of a convolution kernel, and KW represents the width of the convolution kernel;
the characteristic diagram caching module is used for writing each input characteristic data into the characteristic diagram caching module according to the serial number of the characteristic diagram caching unit to be written and the corresponding caching write address, and writing at most M input characteristic data into the corresponding M characteristic diagram caching units in parallel in the same clock cycle;
the characteristic graph cache read control module is used for reading at most M input characteristic data from the characteristic cache module in parallel in a single clock cycle in one convolution operation;
the convolution kernel cache module is used for grouping convolution kernels K (IC, KH and KW) of a single output channel into a group according to M convolution kernel data in a row-first order, writing each group of convolution kernel data into corresponding M convolution kernel cache units in parallel, and simultaneously storing convolution kernel data of N output channels by the N convolution kernel cache modules and outputting convolution kernel data of N output channels required by calculation in parallel during calculation;
the calculation module is used for performing corresponding convolution operation on the convolution kernel data read from the convolution kernel cache module and the input feature data read from the feature map cache module, the N calculation modules perform convolution operation on the N output channels at the same time, and the N calculation modules perform convolution operation on the N output channels at the same time and output convolution operation results to obtain the output feature map.
Preferably, the column-priority order includes the input order of input column, then input row, then input channel, specifically: inputting the input feature data of all columns in one row of one channel, then inputting the input feature data of all columns in the next row of the same channel; after the input feature data of all columns in all rows of the channel have been input, the next channel is input in the same order, until the input feature data of all columns in all rows of all channels have been input.
Preferably, the column-priority order includes the input order of input column, then input channel, then input row, specifically: inputting the input feature data of all columns of one channel in one row, then inputting the input feature data of all columns of the next channel in the same row; after the input feature data of all columns of all channels in the row have been input, the next row is input in the same order, until the input feature data of all columns of all channels in all rows have been input.
Preferably, the feature map reorganization module divides each row of the input feature map (IC, IH, IW) into L segments of at most M input feature data each, where L = ⌈IW/M⌉ and ⌈ ⌉ denotes rounding up.
Preferably, the cache write address rule further includes: data of adjacent columns in the same row of the input feature map are written into feature map cache units with adjacent sequence numbers, sequence number M-1 and sequence number 0 also counting as adjacent sequence numbers.
Preferably, the feature map cache read control module is configured to generate, based on a cache read address rule, the feature map cache unit sequence number and cache read address of each input feature datum required by the current convolution operation in the feature map cache module, to divide the generated IC*KH*KW cache read addresses into P groups of M, and to output the M cache read addresses of each group to the M feature map cache units with the corresponding sequence numbers, where P = ⌈(IC*KH*KW)/M⌉, ⌈ ⌉ denotes rounding up, IC is the number of input channels, KH is the height of the convolution kernel, and KW is the width of the convolution kernel.
Preferably, the cache read address rule specifically includes:
generating IC*KH*KW initial cache read addresses, where every KW consecutive initial cache read addresses are identical, the increment between adjacent groups of KW initial cache read addresses is the address depth that one row of input feature data occupies in the feature map cache module, and the increment between adjacent groups of KH*KW initial cache read addresses is the address depth that the input feature data of all rows of a single input channel occupy in the feature map cache module; the initial cache read addresses are the storage addresses, in the feature map cache module, of the input feature data required to calculate the datum in row 0, column 0 of the output feature map;
when reading the input feature data required for calculating the next column in the same row of the output feature map, adding the span offset value stride to each of the IC*KH*KW current feature map cache unit sequence numbers on the basis of the sequence numbers used previously, taking the result modulo M if it is not less than M, and adding to each current cache read address, on the basis of the corresponding initial read address, an address offset equal to the address offset in the feature map cache module between adjacent data in the same row of the same input channel;
when reading the input feature data required for calculating the column-0 datum of the next row of the output feature map, adding the offset KW to each of the IC*KH*KW current feature map cache unit sequence numbers on the basis of the sequence numbers used for column 0 of the previous row, taking the result modulo M if it is not less than M, and adding to each current cache read address, on the basis of the corresponding initial read address of the previous row, an address offset equal to the address offset in the feature map cache module between data of adjacent rows of the same input channel.
Preferably, the convolution kernel cache module is configured to group the convolution kernel data, input in column-priority order, into groups of M data, dividing the kernel data of one output channel into P groups that are written in parallel, group by group, into the P addresses of the M cache units of the convolution kernel cache module; during calculation the P addresses are read in sequence, and M convolution kernel data are read in parallel from the M cache units and output to the calculation module.
Preferably, during a convolution operation the calculation module multiplies M input feature data and M convolution kernel data pairwise and accumulates the M products to obtain the result of one convolution operation of a single output channel.
In order to achieve the above object, the present invention provides a data processing method of a convolution operation hardware accelerator, the method comprising:
segmenting the input feature data of each row of the input feature map (IC, IH, IW), which is input in column-priority order, each segment containing at most M input feature data, where IC is the number of input channels, IH is the height of the input feature map, and IW is the width of the input feature map;
generating, based on a cache write address rule, the sequence number of the feature map cache unit to which each input feature datum of each segment is to be written and the corresponding cache write address, the cache write address rule including: the sequence number offset between the feature map cache units written by input feature data in the same column of adjacent rows of the input feature map is KW%M, and the sequence number offset between the feature map cache units written by input feature data in the same row and same column of adjacent input channels is (KH*KW)%M, where % is the modulo operator, KH denotes the height of the convolution kernel, and KW denotes the width of the convolution kernel;
writing each input feature datum into the feature map cache module according to the sequence number of the feature map cache unit to be written and the corresponding cache write address, writing at most M input feature data into the corresponding M feature map cache units in parallel in the same clock cycle, and reading at most M input feature data in parallel from the feature map cache module in a single clock cycle of one convolution operation;
grouping the convolution kernels K(IC, KH, KW) of a single output channel into groups of M convolution kernel data in row-first order and writing each group of convolution kernel data into the corresponding M convolution kernel cache units in parallel, the N convolution kernel cache modules simultaneously storing the convolution kernel data of N output channels and outputting in parallel, during calculation, the convolution kernel data of the N output channels required for the calculation;
and performing the corresponding convolution operations on the convolution kernel data read from the convolution kernel cache modules and the input feature data read from the feature map cache module, the N calculation modules performing the convolution operations of the N output channels simultaneously and outputting the convolution operation results to obtain the output feature map.
Compared with the prior art, the convolution operation hardware accelerator and data processing method of the present invention have the following beneficial effects: calculation efficiency and calculation speed are improved while the generality of the convolution operation is preserved, and the parallelism can be adjusted flexibly to match computing resources of different scales; by rearranging the input feature data through the caches, the input feature data can be read in efficiently and the calculation data can be supplied efficiently during computation.
Drawings
FIG. 1 is a system block diagram of a convolution operation hardware accelerator according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a convolution operation in the prior art.
FIG. 3 is a schematic diagram of the data dimension transformation process of an input feature map.
FIG. 4 is a schematic diagram of storage in the input feature map cache module according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of the convolution operation for row 0, column 0 of the output feature map according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of the convolution operation for row 0, column 1 of the output feature map according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of the data dimension transformation process of a convolution kernel.
Detailed Description
The present invention will be described in detail with reference to the specific embodiments shown in the drawings; these embodiments are not intended to limit the present invention, and structural, methodological, or functional changes made by those skilled in the art according to these embodiments are all included in the scope of the present invention.
As shown in fig. 1, according to an embodiment of the present invention, a convolution operation hardware accelerator is provided, which includes a feature map reorganization module 10, a feature map cache write control module 11, a feature map cache module 12, a feature map cache read control module 13, N convolution kernel cache modules 14, and N calculation modules 15, where the feature map cache module includes M parallel feature map cache units, and M and N are integers greater than or equal to 1, wherein:
a feature map reorganizing module 10, configured to segment input feature data of each row of input feature maps (IC, IH, IW) input according to a column-priority order, where the number of input feature data in each segment is M at most, where IC is the number of input channels, IH is the height of the input feature map, and IW is the width of the input feature map;
the feature map cache write control module 11 is configured to generate, based on a cache write address rule, the sequence number of the feature map cache unit to which each input feature datum is to be written and the corresponding cache write address, the cache write address rule including: the sequence number offset between the feature map cache units written by input feature data in the same column of adjacent rows of the input feature map is KW%M, and the sequence number offset between the feature map cache units written by input feature data in the same row and same column of adjacent input channels is (KH*KW)%M, where % is the modulo operator, KH denotes the height of the convolution kernel, and KW denotes the width of the convolution kernel;
the feature map cache module 12 is configured to write each input feature datum into itself according to the sequence number of the feature map cache unit to be written and the corresponding cache write address, writing at most M input feature data into the corresponding M feature map cache units in parallel in the same clock cycle;
the feature map cache read control module 13 is configured to read at most M input feature data in parallel from the feature map cache module in a single clock cycle of one convolution operation;
the convolution kernel cache module 14 is configured to group convolution kernels K (IC, KH, KW) of a single output channel into a group according to M pieces of convolution kernel data in a row-first order, write each group of convolution kernel data into corresponding M convolution kernel cache units in parallel, store convolution kernel data of N output channels by the N convolution kernel cache modules at the same time, and output convolution kernel data of N output channels required for calculation in parallel during calculation;
and the calculation modules 15 are configured to perform the corresponding convolution operations on the convolution kernel data read from the convolution kernel cache modules and the input feature data read from the feature map cache module; the N calculation modules perform the convolution operations of the N output channels simultaneously and output the convolution operation results to obtain the output feature map.
Generally, before the convolution operation is performed, the input feature map sometimes needs to be padded to adjust the data size: the input feature data are padded from (IC, IH, IW) to (IC, IH+2*pad, IW+2*pad), where IC is the number of input channels, IH is the input height, i.e. the number of input rows, IW is the input width, i.e. the number of input columns, and pad is the padding width. The convolution operation then proceeds as shown in fig. 2: from the padded input feature data (IC, IH+2*pad, IW+2*pad), the data corresponding to the convolution kernel data of a single output channel are selected, i.e. the data within the gray region of the padded feature map, and point-to-point multiplication and accumulation are performed with the convolution kernel data of the different output channels, yielding one result datum for each channel of the output feature map, i.e. the gray part of the output feature map; one datum of row 0, column 0 of the output feature map is thus obtained by one convolution operation. The window is then shifted by stride (the span) in the row direction from the gray region of the padded feature map and the calculation is repeated; after one row has been calculated, the window is shifted by stride in the column direction and the calculation of the next row is executed. Repeating this process yields all convolution operation result data of the output feature map.
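The padding and sliding-window behaviour just described can be written down as a short reference model. The following is a minimal NumPy sketch for illustration only; the function and argument names are our own, not the patent's:

```python
import numpy as np

def conv_reference(feature_map, kernels, stride=1, pad=0):
    """Plain sliding-window convolution: feature_map is (IC, IH, IW) and
    kernels is (OC, IC, KH, KW); returns the (OC, OH, OW) output feature map."""
    IC, IH, IW = feature_map.shape
    OC, _, KH, KW = kernels.shape
    # Fill the input from (IC, IH, IW) to (IC, IH + 2*pad, IW + 2*pad).
    fp = np.pad(feature_map, ((0, 0), (pad, pad), (pad, pad)))
    OH = (IH + 2 * pad - KH) // stride + 1
    OW = (IW + 2 * pad - KW) // stride + 1
    out = np.zeros((OC, OH, OW), dtype=fp.dtype)
    for oc in range(OC):
        for j in range(OH):        # move the window down by stride per output row
            for i in range(OW):    # move the window right by stride per output column
                window = fp[:, j * stride:j * stride + KH, i * stride:i * stride + KW]
                # point-to-point multiplication and accumulation with one
                # output channel's convolution kernel
                out[oc, j, i] = (window * kernels[oc]).sum()
    return out
```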
Based on the data dimension transformation and convolution operation process of the input feature map described herein, the feature map cache module includes M parallel feature map cache units, and each convolution kernel cache module includes M convolution kernel cache units. The input feature data are written into the feature map cache module in column-priority order and, during a convolution operation, are read out of the feature map cache module in column-priority order; the convolution kernel data are stored into the convolution kernel cache modules in the column-priority order corresponding to the input feature map and are read out of the convolution kernel cache modules in column-priority order for the convolution operation.
The feature map reorganization module is used for segmenting the input feature data of each row of the input feature map (IC, IH, IW), which is input in column-priority order, each segment containing at most M input feature data, where IC is the number of input channels, IH is the height of the input feature map, and IW is the width of the input feature map. As one implementation of the present invention, the feature map reorganization module divides each row of the input feature map into L segments of at most M input feature data each, where L = ⌈IWp/M⌉, ⌈ ⌉ denotes rounding up, and at most M input feature data are written into the feature map cache module in parallel in the same clock cycle. Specifically, the data dimensions of the input feature map are transformed as shown in fig. 3. The input feature map F0(IC, IHp, IWp) participating in the convolution operation, where IC, IHp, and IWp respectively denote the number of input channels, the padded input height, and the padded input width, is transformed from three dimensions into a two-dimensional feature map F1(IC*IHp, IWp). The two-dimensional feature map F1(IC*IHp, IWp) is then divided into L two-dimensional feature maps F2(IC*IHp, M), where M is the parallelism of the single-output-channel calculation and M is an integer greater than or equal to 1; when IWp is not exactly divisible by M, the effective data size of the last, i.e. the (L-1)-th, two-dimensional feature map is (IC*IHp, S) with S = IWp%M, and each of its rows is filled with 0 to expand the width from S to M. The L two-dimensional feature maps F2(IC*IHp, M) are finally recombined into a single two-dimensional feature map F3(IC*IHp*L, M).
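The F0 to F3 transformation can be sketched with NumPy reshapes. This is an illustrative sketch under the assumption, consistent with the write addresses given below, that the L segments of each row are stacked consecutively; the function name is ours:

```python
import numpy as np

def reorganize_feature_map(f0, m):
    """F0 (IC, IHp, IWp) -> F3 (IC*IHp*L, M): flatten to F1 (IC*IHp, IWp),
    split each row into L segments of width M (filling the last segment
    with 0 from S = IWp % M up to M), and stack the segments row by row."""
    ic_ihp = f0.shape[0] * f0.shape[1]
    iwp = f0.shape[2]
    l = -(-iwp // m)                        # L = ceil(IWp / M)
    f1 = f0.reshape(ic_ihp, iwp)            # F1 (IC*IHp, IWp)
    f1 = np.pad(f1, ((0, 0), (0, l * m - iwp)))
    return f1.reshape(ic_ihp * l, m)        # F3 (IC*IHp*L, M)
```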
The feature map reorganization module inputs the input feature data in column-priority order, including the input order of input column, then input row, then input channel, specifically: inputting the input feature data of all columns in one row of one channel, then inputting the input feature data of all columns in the next row of the same channel; after the input feature data of all columns in all rows of the channel have been input, the next channel is input in the same order, until the input feature data of all columns in all rows of all channels have been input.
As another implementation of the present invention, the column-priority order of the feature map reorganization module includes the input order of input column, then input channel, then input row, specifically: inputting the input feature data of all columns of one channel in one row, then inputting the input feature data of all columns of the next channel in the same row; after the input feature data of all columns of all channels in the row have been input, the next row is input in the same order, until the input feature data of all columns of all channels in all rows have been input.
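The two column-priority orders amount to two nestings of the same three loops; a small sketch (the function name is ours) makes the difference explicit:

```python
def column_priority_indices(IC, IH, IW, channel_outermost=True):
    """Yield (channel, row, column) indices in one of the two column-priority
    orders: column -> row -> channel, or column -> channel -> row."""
    if channel_outermost:                 # column, then row, then channel
        for k in range(IC):
            for j in range(IH):
                for i in range(IW):
                    yield k, j, i
    else:                                 # column, then channel, then row
        for j in range(IH):
            for k in range(IC):
                for i in range(IW):
                    yield k, j, i
```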
The feature map cache write control module generates, based on the cache write address rule, the sequence number of the feature map cache unit to which each input feature datum of each segment is written and the corresponding cache write address. The feature map cache unit sequence numbers run from 0 to M-1, and the number of cache write addresses is IC*IH*L, which can be numbered from 0 to IC*IH*L-1; if the input feature map (IC, IH, IW) is padded with data of width pad, giving the padded input feature map (IC, IH+2*pad, IW+2*pad), the number of cache write addresses is IC*(IH+2*pad)*L. The number of cache addresses is the cache depth of a feature map cache unit. The cache write address rule includes: the sequence number offset between the feature map cache units written by input feature data in the same column of adjacent rows of the input feature map is KW%M, and the sequence number offset between the feature map cache units written by input feature data in the same row and same column of adjacent input channels is (KH*KW)%M, where % is the modulo operator, KH denotes the height of the convolution kernel, and KW denotes the width of the convolution kernel. The cache write address rule further includes: data of adjacent columns in the same row of the input feature map are written into feature map cache units with adjacent sequence numbers, sequence number M-1 and sequence number 0 also counting as adjacent. The cache unit sequence numbers and cache write addresses set on the basis of these rules ensure that the input feature data required by a convolution operation can always be output in parallel.
The feature map cache module writes each input feature datum into itself according to the sequence number of the feature map cache unit to be written and the corresponding cache write address; at most M input feature data are written into the corresponding M feature map cache units in parallel in the same clock cycle. Sometimes the input feature map needs to be padded before the convolution operation to adjust the data size. The data output by the feature map cache module during the convolution operation are the padded input feature data, but directly writing padded input feature data would reduce the write efficiency. Therefore, to improve write efficiency, the feature map cache module is written with the unpadded input feature data, and the padding is supplied separately. One padding mode clears the data in the feature map cache module in advance, writes the input feature map into the feature map cache module, and writes the padding data to the corresponding feature map cache unit sequence numbers and write addresses in the feature map cache module, the padding data usually being 0. The other padding mode does not clear the feature map cache module in advance but records the positions of the padding data in it, each position comprising a feature map cache unit sequence number and a write address, and replaces the data at these positions with 0 when the data are read out of the feature map cache module.
The input feature map (IC, IH, IW) is three-dimensional data whose dimensions are the input column, the input row, and the input channel; when the input feature data are written into the feature map cache module, they may be written in the order of input column, input row, input channel, or in the order of input column, input channel, input row. Specifically, if the input feature data of the input feature map (IC, IH, IW) are written into the feature map cache module in the order of input column, input row, input channel, the input feature datum in row j, column i of channel k is written into the feature map cache unit with sequence number:
wr_seq[k,j,i]=(k*KW*KH+(j+pad)*KW+i+pad)%M;
and the cache write address, within that feature map cache unit, of the input feature datum in row j, column i of channel k is:
wr_addr[k,j,i]=(k*(IH+2*pad)+j+pad)*L+⌊(i+pad)/M⌋;
where KW is the convolution kernel width, KH is the convolution kernel height, pad is the row and column padding width, M is the parallelism, k ∈ [0, IC-1], j ∈ [0, IH-1], i ∈ [0, IW-1], L = ⌈(IW+2*pad)/M⌉, ⌈ ⌉ denotes rounding up, and ⌊ ⌋ denotes rounding down.
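As a cross-check, the two placement functions can be sketched directly from the formulas above (a minimal Python sketch; wr_addr follows the address formula as reconstructed here, and the function names are ours):

```python
import math

def wr_seq(k, j, i, KH, KW, M, pad=0):
    # Feature map cache unit sequence number for channel k, row j, column i.
    return (k * KW * KH + (j + pad) * KW + i + pad) % M

def wr_addr(k, j, i, IH, IW, M, pad=0):
    # Cache write address; L is the address depth one padded row occupies.
    L = math.ceil((IW + 2 * pad) / M)
    return (k * (IH + 2 * pad) + j + pad) * L + (i + pad) // M
```

With M = 8, IC = 3, IH = IW = 5, KH = KW = 3, and pad = 0, this reproduces the placements of fig. 4, e.g. wr_seq(0,1,0) = 3, wr_seq(0,2,0) = 6, and wr_seq(1,0,0) = 1, at wr_addr 1, 2, and 5 respectively.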
According to a specific embodiment of the present invention, fig. 4 shows an example of storing an input feature map in the feature map cache module. In this embodiment the input feature data are written into the feature map cache module in the order of input column, input row, input channel, using the feature map cache unit sequence numbers and write addresses of the feature map cache write control module in the example above. Here M = 8, IC = 3, IH = IW = 5, KH = KW = 3, and pad = 0; the feature map cache unit sequence numbers are 0 to 7 and the write addresses are 0 to 14. The data in the frame correspond to the input feature data of the input feature map, the symbols C_H_W denoting the input channel, the input height (row number), and the input width (column number) of each datum. According to the write address rule above, each input feature datum of the input feature map is written to the corresponding feature map cache unit sequence number and cache write address. Input feature data of the same row and the same segment of the input feature map correspond to the same cache write address in feature map cache units of different sequence numbers: as shown in fig. 4, the input feature data (0_0_0), (0_0_1), (0_0_2), (0_0_3), and (0_0_4) are written to cache write address 0 of the feature map cache units numbered 0, 1, 2, 3, and 4, respectively. Input feature data of adjacent columns in the same row correspond to adjacent feature map cache unit sequence numbers; for example, the feature map cache unit sequence numbers of the input feature data (0_1_0) and (0_1_1) in fig. 4 are 3 and 4, respectively. The offset between the feature map cache unit sequence numbers of input feature data in the same column of adjacent rows is KW%M; for example, the feature map cache unit sequence numbers of the input feature data (0_1_0) and (0_2_0) in fig. 4 are 3 and 6, and the offset is 3%8 = 3. The offset between the feature map cache unit sequence numbers of input feature data in the same row and column of adjacent input channels is (KH*KW)%M; as shown in fig. 4, the feature map cache unit sequence numbers of the input feature data (0_0_0) and (1_0_0) are 0 and 1, and the offset is (3*3)%8 = 1. Written according to these preset rules, the input feature data of the input feature map are stored in the feature map cache module as shown in fig. 4.
In one convolution operation, the feature map cache read control module reads at most M input feature data in parallel from the feature map cache module in a single clock cycle. Based on a cache read address rule, the feature map cache read control module generates the feature map cache unit sequence number and cache read address of each input feature datum required by the current convolution operation, divides the generated IC*KH*KW cache read addresses into P groups of M, where P = ⌈(IC*KH*KW)/M⌉, ⌈ ⌉ denotes rounding up, IC is the number of input channels, KH is the height of the convolution kernel, and KW is the width of the convolution kernel, and outputs the M cache read addresses of each group to the M feature map cache units with the corresponding sequence numbers. During the convolution operation, the cache read address of each input feature datum in the feature map cache module must be adjusted so that the required input feature data can be read out of the feature map cache module in parallel. The cache read address rule specifically includes: generating IC*KH*KW initial cache read addresses, where every KW consecutive initial cache read addresses are identical, the increment between adjacent groups of KW initial cache read addresses is the address depth that one row of input feature data occupies in the feature map cache module, and the increment between adjacent groups of KH*KW initial cache read addresses is the address depth that the input feature data of all rows of a single input channel occupy; the initial cache read addresses are the storage addresses, in the feature map cache module, of the input feature data required to calculate the datum in row 0, column 0 of the output feature map. When reading the input feature data required for calculating the next column in the same row of the output feature map, the span offset value stride is added to each of the IC*KH*KW current feature map cache unit sequence numbers on the basis of the sequence numbers used previously; if a resulting sequence number is not less than M it is taken modulo M, and an address offset is added to the current cache read address on the basis of the corresponding initial read address, the address offset being the address offset in the feature map cache module between adjacent data in the same row of the same input channel. When reading the input feature data required for calculating the column-0 datum of the next row of the output feature map, the offset KW is added to each of the IC*KH*KW current feature map cache unit sequence numbers on the basis of the sequence numbers used for column 0 of the previous row; if a resulting sequence number is not less than M it is taken modulo M, and an address offset is added to the current cache read address on the basis of the corresponding initial read address of the previous row, the address offset being the address offset in the feature map cache module between data of adjacent rows of the same input channel.
In the above embodiments, the input feature data of the input feature map are written into the feature map cache module either in the order of input column, input row, input channel, or in the order of input column, input channel, input row, and corresponding read address embodiments are therefore given for both input modes. If the input feature data of the input feature map (IC, IH, IW) are written into the feature map cache module in the order of input column, input row, input channel, the feature map cache unit sequence number corresponding to member m of each group of cache read addresses when calculating row j, column i of the output feature map is:
rd_seq[j,i,m]=(m+j*stride)%M;
and the cache read address value of member m of each group of cache read addresses when calculating the result in row j, column i of the output feature map is:
rd_addr[j,i,m]=init_rd_addr[m]+i*stride*L+⌊(m+j*stride)/M⌋;
where init_rd_addr[m] is the corresponding initial cache read address, the address offset of adjacent rows of the same channel is L, the address offset of adjacent segments of the same row of the same channel is 1, stride is the span, m ∈ [0, M-1], j ∈ [0, OH-1], i ∈ [0, OW-1], OH is the height of the output feature map, and OW is the width of the output feature map.
As shown in fig. 5, with M = 8, IC = 3, KW = KH = 3, IW = 5, and pad = 0, IW is smaller than M, so one row of input feature data occupies address depth 1 in the feature map cache and all rows of input feature data of a single input channel occupy address depth 5. The group of initial cache read addresses init_rd_addr, containing 27 addresses, is then {0,0,0,1,1,1,2,2,2,5,5,5,6,6,6,7,7,7,10,10,10,11,11,11,12,12,12}. When the output feature map datum of row 0, column 0 is calculated, the cache read addresses are the initial cache read addresses, and the cache unit sequence numbers corresponding to this group of cache read addresses are {0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2}; the data corresponding to the initial cache read addresses and initial cache unit sequence numbers are the dark gray part of the feature map cache in fig. 5. As shown in fig. 6, when the output feature map datum of row 0, column 1 is calculated, the cache read addresses are still the initial cache read addresses, while the corresponding cache unit sequence numbers are {1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7,0,1,2,3}; the data corresponding to these addresses and sequence numbers are the dark gray part of the feature map cache in fig. 6. When row 1 of the output feature map is calculated, the cache read addresses are {1,1,1,2,2,2,3,3,3,6,6,6,7,7,7,8,8,8,11,11,11,12,12,12,13,13,13}. It can be seen that if the cache unit sequence numbers are divided sequentially into 4 groups of 8 sequence numbers, the 8 sequence numbers within each group differ from one another, so the input feature data required for the calculation can be read in parallel from 8 different cache units in the same clock cycle.
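The initial cache read address sequence above follows directly from the two address-depth increments of the read rule; a minimal sketch (the function name is ours):

```python
def init_rd_addrs(IC, KH, KW, IH, IW, M, pad=0):
    """Initial cache read addresses for row 0, column 0 of the output feature
    map: KW equal addresses per kernel row, incremented by L between kernel
    rows and by IHp*L between input channels."""
    L = -(-(IW + 2 * pad) // M)    # address depth of one row of input feature data
    IHp = IH + 2 * pad             # all rows of one channel occupy IHp*L addresses
    addrs = []
    for k in range(IC):
        for kh in range(KH):
            addrs.extend([(k * IHp + kh) * L] * KW)
    return addrs
```

With IC = 3, KH = KW = 3, IH = IW = 5, M = 8, and pad = 0 this returns the 27-element sequence {0,0,0,1,1,1,2,2,2,5,5,5,...,12,12,12} of fig. 5.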
The convolution kernel cache module groups the convolution kernels K(IC, KH, KW) of a single output channel into groups of M convolution kernel data in row-first order and writes each group of convolution kernel data into the corresponding M convolution kernel cache units in parallel; the N convolution kernel cache modules simultaneously store the convolution kernel data of N output channels and output in parallel, during calculation, the convolution kernel data of the N output channels required for the calculation. The data dimension transformation of the convolution kernel of a single output channel is shown in fig. 7; the convolution kernels of the other output channels are transformed in the same way. The three-dimensional convolution kernels K0(IC, KH, KW) of the OC output channels participating in the convolution operation, where OC denotes the number of output channels and KH, KW denote the height and width of the convolution kernel, are each transformed into a one-dimensional convolution kernel K1(IC*KH*KW); each of the OC one-dimensional convolution kernels K1(IC*KH*KW) is then arranged into rows of M data and transformed into a two-dimensional convolution kernel K2(P, M), where P = ⌈(IC*KH*KW)/M⌉ and ⌈ ⌉ denotes rounding up. When IC*KH*KW is not exactly divisible by M, the effective data width of the last, i.e. the (P-1)-th, row is T = (IC*KH*KW)%M, and that row is filled with 0 values to expand its width from T to M.
Based on this principle, the convolution kernel cache module of a single output channel includes M convolution kernel cache units with sequence numbers 0 to M-1, and the number of cache addresses of each convolution kernel cache unit is P, where P = ⌈(IC*KH*KW)/M⌉. The convolution kernel data input in column-priority order are grouped with M data per group, dividing the one-dimensional kernel data into P groups, which are written in parallel, group by group, into the P addresses of the M cache units of the convolution kernel cache module; during calculation the P addresses are read in sequence, and M convolution kernel data are read in parallel from the M cache units and output to the calculation module. When IC*KH*KW is not exactly divisible by M, the number of valid data in the last, i.e. the (P-1)-th, group is T = (IC*KH*KW)%M, and the group is filled with 0 to expand its width from T to M.
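The kernel-side packing can be sketched in the same style (an illustrative sketch; the function name is ours):

```python
import numpy as np

def pack_kernel(k0, m):
    """K0 (IC, KH, KW) -> K2 (P, M): flatten to K1 (IC*KH*KW,) in row-first
    order, fill the tail with 0 from T = (IC*KH*KW) % M up to a multiple of
    M, and regroup M values per cache address; column m feeds cache unit m."""
    k1 = k0.reshape(-1)                      # K1, one dimension
    p = -(-k1.size // m)                     # P = ceil(IC*KH*KW / M)
    k1 = np.pad(k1, (0, p * m - k1.size))    # 0-value filling of the last group
    return k1.reshape(p, m)                  # K2 (P, M)
```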
The calculation module performs the corresponding convolution operation on the convolution kernel data read from the convolution kernel cache module and the input feature data read from the feature map cache module; the N calculation modules perform the convolution operations of the N output channels simultaneously and output the convolution operation results to obtain the output feature map. In one convolution operation, at most M input feature data are read from the feature map cache module in parallel in a single clock cycle, and the input feature data required by one convolution operation are read in P clock cycles. Likewise, during the convolution operation, M convolution kernel data are sent to the calculation module in parallel in the same clock cycle; the groups of M convolution kernel data are sent in turn, P times, so the convolution kernel is delivered to the calculation module over P clock cycles. Under the write control of the feature map cache write control module, at most M input feature data are written into the M feature map cache units in parallel in the same clock cycle, and repeating this operation writes all input feature data of the input feature map into the feature map cache module. During a convolution operation, under the read control of the feature map cache read control module, input feature data are read in parallel from the M feature map cache units in the same clock cycle, P clock cycles in total being required to read the input feature data needed by one convolution operation, and the read input feature data are sent to the calculation module. Similarly, convolution kernel data are read in parallel from the M convolution kernel cache units in the same clock cycle, P clock cycles being required to read all of the convolution kernel data; the read convolution kernel data and input feature data undergo the corresponding convolution operation, and the convolution operation result is output.
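Behaviourally, producing one output value therefore reduces to a dot product consumed M values per clock cycle over P cycles; a sketch of this accumulation (of the behaviour only, not of the hardware itself; the names are ours):

```python
def convolve_one_point(feature_groups, kernel_groups):
    """One output value of one output channel: in each of the P clock cycles,
    M input feature data and M convolution kernel data are multiplied pairwise
    and the M products are accumulated; 0-filled slots contribute nothing."""
    acc = 0
    for feats, kerns in zip(feature_groups, kernel_groups):   # P cycles
        acc += sum(f * w for f, w in zip(feats, kerns))       # M multipliers
    return acc
```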
According to an embodiment of the present invention, fig. 5 shows the convolution operation that calculates row 0, column 0 of one output channel of the output feature map. The dark gray parts of the input feature map cache module and of the convolution kernel cache module in the figure are the input feature data and convolution kernel data required by the current convolution operation, with M = 8, IC = 3, IH = IW = 5, KH = KW = 3, and pad = 0. As shown in fig. 5, 8 input feature data and 8 convolution kernel data are output in parallel in the first clock cycle and sent to the corresponding multipliers for pairwise multiplication; 4 clock cycles are required in total, after which the output feature map datum of row 0, column 0 of this output channel is obtained. When IC*KH*KW is not exactly divisible by M, padding data appear in the feature map cache output and the convolution kernel cache output in the (P-1)-th clock cycle; as long as the padding data on one side are 0, the products are also 0, so the final accumulated result is not affected.
According to an embodiment of the present invention, fig. 6 shows the convolution operation that calculates row 0, column 1 of one output channel of the output feature map, with M = 8, IC = 3, IH = IW = 5, KH = KW = 3, and pad = 0; the dark gray parts of the figure are the input feature data and convolution kernel data required by the current convolution operation. 8 input feature data and 8 convolution kernel data are output in parallel in the first clock cycle and sent to the corresponding multipliers for pairwise multiplication, and so on; 4 clock cycles are required, after which the output feature map datum of row 0, column 1 of this output channel is obtained.
The actual values of M and N are determined jointly by the hardware resources of the hardware accelerator, the input and output interface bandwidth, and the parameters of the convolution operations to be accelerated, including IW, IH, KW, KH, IC, and OC, so as to maximize calculation efficiency. In some cases performance is not the only criterion; for example, the values of M and N may be reduced appropriately to lower hardware resource usage and thereby reduce power consumption. In any case the generality of the convolution operation is preserved: once specific values of M and N have been determined in this way, weighing performance, power consumption, and other criteria, the resulting convolution operation accelerator can accommodate changes in convolution parameters such as IW, IH, KW, KH, IC, and OC with the same hardware circuit structure.
According to an embodiment of the present invention, a data processing method of a convolution operation hardware accelerator is provided, the method including:
segmenting the input feature data of each row of the input feature map (IC, IH, IW), which is input in column-priority order, each segment containing at most M input feature data, where IC is the number of input channels, IH is the height of the input feature map, and IW is the width of the input feature map;
generating, based on a cache write address rule, the sequence number of the feature map cache unit to which each input feature datum of each segment is to be written and the corresponding cache write address, the cache write address rule including: the sequence number offset between the feature map cache units written by input feature data in the same column of adjacent rows of the input feature map is KW%M, and the sequence number offset between the feature map cache units written by input feature data in the same row and same column of adjacent input channels is (KH*KW)%M, where % is the modulo operator, KH denotes the height of the convolution kernel, and KW denotes the width of the convolution kernel;
writing each input feature datum into the feature map cache module according to the sequence number of the feature map cache unit to be written and the corresponding cache write address, writing at most M input feature data into the corresponding M feature map cache units in parallel in the same clock cycle, and reading at most M input feature data in parallel from the feature map cache module in a single clock cycle of one convolution operation;
grouping the convolution kernels K(IC, KH, KW) of a single output channel into groups of M convolution kernel data in row-first order and writing each group of convolution kernel data into the corresponding M convolution kernel cache units in parallel, the N convolution kernel cache modules simultaneously storing the convolution kernel data of N output channels and outputting in parallel, during calculation, the convolution kernel data of the N output channels required for the calculation;
and performing the corresponding convolution operations on the convolution kernel data read from the convolution kernel cache modules and the input feature data read from the feature map cache module, the N calculation modules performing the convolution operations of the N output channels simultaneously and outputting the convolution operation results to obtain the output feature map.
Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims.

Claims (10)

1. A convolution operation hardware accelerator, characterized by comprising a feature map reorganization module, a feature map cache write control module, a feature map cache module, a feature map cache read control module, N convolution kernel cache modules, and N calculation modules, wherein the feature map cache module comprises M parallel feature map cache units, and M and N are integers greater than or equal to 1, wherein:
the feature map reorganization module is used for segmenting the input feature data of each row of the input feature map (IC, IH, IW), which is input in column-priority order, each segment containing at most M input feature data, where IC is the number of input channels, IH is the height of the input feature map, and IW is the width of the input feature map;
the feature map cache write control module is configured to generate, based on a cache write address rule, a sequence number of a feature map cache unit to which each piece of input feature data is to be written and a corresponding cache write address, where the cache write address rule includes: the sequence number deviant value between the characteristic diagram cache units written by the input characteristic data of the same column of the adjacent rows in the input characteristic diagram is KW% M, and the sequence number deviant value (KH KW)% M between the characteristic diagram cache units corresponding to the input characteristic data of the same column of the same row of the adjacent input channels in the input characteristic diagram is selected, wherein% is a modulus operator, KH represents the height of a convolution kernel, and KW represents the width of the convolution kernel;
the characteristic diagram caching module is used for writing each input characteristic data into the characteristic diagram caching module according to the serial number of the characteristic diagram caching unit to be written and the corresponding caching write address, and writing at most M input characteristic data into the corresponding M characteristic diagram caching units in parallel in the same clock cycle;
the characteristic graph cache read control module is used for reading at most M input characteristic data from the characteristic cache module in parallel in a single clock cycle in one convolution operation;
the convolution kernel cache module is used for grouping convolution kernels K (IC, KH and KW) of a single output channel into a group according to M convolution kernel data in a row-first order, writing each group of convolution kernel data into corresponding M convolution kernel cache units in parallel, and simultaneously storing convolution kernel data of N output channels by the N convolution kernel cache modules and outputting convolution kernel data of N output channels required by calculation in parallel during calculation; the calculation modules are used for performing corresponding convolution operation on the convolution kernel data read from the convolution kernel cache module and the input feature data read from the feature map cache module, and the N calculation modules perform convolution operation on the N output channels at the same time and output convolution operation results to obtain the output feature map.
2. The convolution operation hardware accelerator of claim 1, wherein the column-priority input order includes an order of input columns, then input rows, then input channels, specifically: the input feature data of all columns of one row of one channel are input, then the input feature data of all columns of the next row of the same channel are input; after the input feature data of all columns of all rows of the channel have been input, the input feature data of the next channel are input in the same order, until the input feature data of all columns of all rows of all channels have been input.
3. The convolution operation hardware accelerator of claim 1, wherein the column-priority input order includes an order of input columns, then input channels, then input rows, specifically: the input feature data of all columns of one row of one channel are input, then the input feature data of all columns of the same row of the next channel are input; after the input feature data of all columns of the row have been input for all channels, the input feature data of the next row are input in the same order, until the input feature data of all columns of all channels of all rows have been input.
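For illustration only: the orders of claims 2 and 3 differ solely in whether the row loop or the channel loop is outermost; the column loop is innermost in both. A sketch with hypothetical generator names:

```python
def order_claim2(IC, IH, IW):
    """Claim 2 order: columns, then rows, then channels (channel outermost)."""
    for ic in range(IC):
        for ih in range(IH):
            for iw in range(IW):
                yield (ic, ih, iw)

def order_claim3(IC, IH, IW):
    """Claim 3 order: columns, then channels, then rows (row outermost)."""
    for ih in range(IH):
        for ic in range(IC):
            for iw in range(IW):
                yield (ic, ih, iw)
```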
4. The convolution operation hardware accelerator of claim 1, wherein the feature map reorganization module is configured to segment each row of the input feature map (IC, IH, IW) into L segments, each segment containing at most M input feature data, where L = ⌈IW / M⌉ and ⌈·⌉ indicates rounding up.
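For illustration only, the segmentation of claim 4 is a plain ceiling split; names in this sketch are editorial:

```python
import math

def segment_row(row, M):
    """Split one row of IW input feature data into L = ceil(IW / M)
    segments of at most M data each (claim 4)."""
    L = math.ceil(len(row) / M)
    return [row[i * M:(i + 1) * M] for i in range(L)]

assert len(segment_row(list(range(10)), 4)) == 3  # IW = 10, M = 4 -> L = 3
```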
5. The convolution operation hardware accelerator of claim 1, wherein the cache write address rule further comprises: input feature data in adjacent columns of the same row of the input feature map are written into feature map cache units with adjacent sequence numbers, sequence number M-1 and sequence number 0 also being regarded as adjacent sequence numbers.
6. The convolution operation hardware accelerator of claim 5, wherein: the feature map cache read control module is configured to generate, based on a cache read address rule, the feature map cache unit sequence number and cache read address of each input feature data required by the current convolution operation, to divide the generated IC × KH × KW cache read addresses into P groups of M, and to output the M cache read addresses of each group to the M feature map cache units with the corresponding sequence numbers respectively, where P = ⌈(IC × KH × KW) / M⌉, ⌈·⌉ indicates rounding up, IC is the number of input channels, KH is the height of the convolution kernel, and KW is the width of the convolution kernel.
7. The convolution operation hardware accelerator of claim 6, wherein the cache read address rule specifically includes:
generating IC × KH × KW initial cache read addresses, wherein every group of KW consecutive initial cache read addresses are identical, the increment between initial cache read addresses of adjacent groups of KW is the address depth occupied in the feature map cache module by one row of input feature data, the increment between initial cache read addresses of adjacent groups of KH × KW is the address depth occupied in the feature map cache module by the input feature data of all rows of a single input channel, and the initial cache read addresses are the storage addresses, in the feature map cache module, of the input feature data required to calculate the output feature data at row 0, column 0 of the output feature map;
when reading the input feature data required to calculate the output feature data in the next column of the same row of the output feature map, adding the stride offset value stride to each of the feature map cache unit sequence numbers corresponding to the IC × KH × KW current cache read addresses, on the basis of the sequence numbers used previously; if a resulting sequence number is greater than or equal to M, taking it modulo M and adding an address offset to the corresponding current cache read address on the basis of the corresponding initial read address, the address offset being the address offset in the feature map cache module between adjacent data of the same row of the same input channel;
when reading the input feature data required to calculate the output feature data at column 0 of the next row of the output feature map, adding an offset KW to each of the feature map cache unit sequence numbers corresponding to the IC × KH × KW current cache read addresses, on the basis of the sequence numbers used for column 0 of the previous row; if a resulting sequence number is greater than or equal to M, taking it modulo M and adding an address offset to the corresponding current cache read address on the basis of the corresponding initial read address of the previous row, the address offset being the address offset in the feature map cache module between adjacent rows of data of the same input channel.
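For illustration only: a sketch of the initial read addresses of claim 7 and the sequence-number rotation for the next output column. The per-row address depth of ceil(IW / M) and the back-to-back channel layout are editorial assumptions; the claim fixes only the relative increments, not absolute depths.

```python
import math

def initial_read_addrs(IC, IH, IW, KH, KW, M):
    """Initial cache read addresses for the IC*KH*KW input feature data
    needed for output row 0, column 0. Every KW consecutive addresses are
    identical: the KW data of one kernel row sit in KW adjacent cache
    units at the same address, so only ic and kh contribute."""
    row_depth = math.ceil(IW / M)   # depth of one input row (assumed)
    chan_depth = IH * row_depth     # depth of one input channel (assumed)
    return [ic * chan_depth + kh * row_depth
            for ic in range(IC) for kh in range(KH) for _ in range(KW)]

def next_column_banks(banks, stride, M):
    """Sequence numbers for the next output column of the same row: each
    sequence number advances by the stride offset value, modulo M."""
    return [(b + stride) % M for b in banks]
```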
8. The convolution operation hardware accelerator of claim 7, wherein: the convolution kernel cache module is configured to group the convolution kernel data, input in row-priority order, into groups of M, the convolution kernel data of one output channel being divided into P groups; the P groups of convolution kernel data are written in sequence, in parallel, into P addresses of the M cache units of the convolution kernel cache module; during computation, the P addresses are read in sequence, and M convolution kernel data are read in parallel from the M cache units and output to the calculation unit.
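For illustration only, the kernel packing of claim 8: the convolution kernel data of one output channel, taken in row-priority order, fill address p of all M cache units with group p. Zero-padding of a short final group is an editorial assumption.

```python
import math

def pack_kernel(kernel_flat, M):
    """banks[m][p] holds the convolution kernel datum written to cache
    unit m at address p; datum t goes to bank t % M, address t // M."""
    P = math.ceil(len(kernel_flat) / M)
    banks = [[0] * P for _ in range(M)]
    for t, k in enumerate(kernel_flat):
        banks[t % M][t // M] = k
    return banks
```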
9. The convolution operation hardware accelerator of claim 8, wherein: during a convolution operation, the calculation module simultaneously performs M multiplications between the M input feature data and the M convolution kernel data, and accumulates the M products to obtain one convolution operation result of a single output channel.
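For illustration only, the per-cycle behavior of claim 9; with P groups per window (claim 6), P such cycles accumulate one output value:

```python
def mac_cycle(feats, kernels, acc=0):
    """One clock cycle: M parallel multiplications between input feature
    data and convolution kernel data, then accumulation of the M products."""
    assert len(feats) == len(kernels)
    return acc + sum(f * k for f, k in zip(feats, kernels))
```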
10. A method of processing data in a convolution operation hardware accelerator according to any one of claims 1 to 9, the method comprising:
segmenting the input feature data of each row of an input feature map (IC, IH, IW) that is input in column-priority order, each segment containing at most M input feature data, where IC is the number of input channels, IH is the height of the input feature map, and IW is the width of the input feature map;
generating, based on a cache write address rule, the sequence number of the feature map cache unit to which each input feature data of each segment is to be written and the corresponding cache write address, wherein the cache write address rule includes: the offset between the sequence numbers of the feature map cache units to which input feature data in the same column of adjacent rows of the input feature map are written is KW % M, and the offset between the sequence numbers of the feature map cache units corresponding to input feature data in the same row and same column of adjacent input channels of the input feature map is (KH × KW) % M, where % is the modulo operator, KH is the height of the convolution kernel, and KW is the width of the convolution kernel;
writing each input feature data into the feature map cache module according to the sequence number of the feature map cache unit to be written and the corresponding cache write address, at most M input feature data being written in parallel into the corresponding M feature map cache units in the same clock cycle, and at most M input feature data being read in parallel from the feature map cache module in a single clock cycle during one convolution operation;
grouping the convolution kernels K (IC, KH, KW) of a single output channel into groups of M convolution kernel data in row-priority order, writing each group of convolution kernel data in parallel into the corresponding M convolution kernel cache units, the N convolution kernel cache modules simultaneously storing the convolution kernel data of N output channels and outputting in parallel, during computation, the convolution kernel data of the N output channels required for the computation; and
performing the corresponding convolution operation on the convolution kernel data read from the convolution kernel cache modules and the input feature data read from the feature map cache module, the N calculation modules simultaneously performing the convolution operations of the N output channels and outputting the convolution results to obtain an output feature map.
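For illustration only: under the closed-form bank assignment sketched after claim 1, the IC × KH × KW input feature data of any convolution window occupy consecutive sequence numbers modulo M, so every full group of M window data lands in M distinct feature map cache units and can be read in a single clock cycle. A small self-check of that property (stride and dimensions are arbitrary test values):

```python
def window_banks(oh, ow, IC, KH, KW, M, stride):
    """Sequence numbers of the window for output (oh, ow), in
    row-priority order, under the assumed bank assignment."""
    return [(ic * KH * KW + (oh * stride + kh) * KW + ow * stride + kw) % M
            for ic in range(IC) for kh in range(KH) for kw in range(KW)]

IC, KH, KW, M, stride = 4, 3, 3, 8, 2
for oh in range(3):
    for ow in range(3):
        b = window_banks(oh, ow, IC, KH, KW, M, stride)
        for g in range(0, len(b) - M + 1, M):   # each full group of M
            assert len(set(b[g:g + M])) == M    # M distinct cache units
```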
CN202111596275.2A 2021-12-24 2021-12-24 Convolution operation hardware accelerator and data processing method Active CN114330656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111596275.2A CN114330656B (en) 2021-12-24 2021-12-24 Convolution operation hardware accelerator and data processing method

Publications (2)

Publication Number Publication Date
CN114330656A true CN114330656A (en) 2022-04-12
CN114330656B CN114330656B (en) 2024-07-23

Family

ID=81012232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111596275.2A Active CN114330656B (en) 2021-12-24 2021-12-24 Convolution operation hardware accelerator and data processing method

Country Status (1)

Country Link
CN (1) CN114330656B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340201A (en) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolutional operation thereof
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
WO2020258528A1 (en) * 2019-06-25 2020-12-30 东南大学 Configurable universal convolutional neural network accelerator
CN111738433A (en) * 2020-05-22 2020-10-02 华南理工大学 Reconfigurable convolution hardware accelerator
CN113095024A (en) * 2021-03-12 2021-07-09 苏州芯启微电子科技有限公司 Regional parallel loading device and method for tensor data
CN113254391A (en) * 2021-06-25 2021-08-13 之江实验室 Neural network accelerator convolution calculation and data loading parallel method and device
CN113807509A (en) * 2021-09-14 2021-12-17 绍兴埃瓦科技有限公司 Neural network acceleration device, method and communication equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115878863A (en) * 2022-12-01 2023-03-31 杭州菲数科技有限公司 Data searching method and data searching device
CN115878863B (en) * 2022-12-01 2023-12-19 杭州菲数科技有限公司 Data searching method and data searching device
CN116432711A (en) * 2023-02-13 2023-07-14 杭州菲数科技有限公司 Hardware implementation method and device of SiLU activation function and computing equipment
CN116432711B (en) * 2023-02-13 2023-12-05 杭州菲数科技有限公司 Hardware implementation method and device of SiLU activation function and computing equipment

Also Published As

Publication number Publication date
CN114330656B (en) 2024-07-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant