CN117787365A - Method, device, medium and equipment for scheduling convolution data stream - Google Patents

Method, device, medium and equipment for scheduling convolution data stream

Info

Publication number: CN117787365A
Application number: CN202311849647.7A
Authority: CN
Prior art keywords: input, convolution kernel, convolution, channel, feature map
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 乔树山, 王建超, 游恒, 尚德龙, 周玉梅
Assignee (original and current): Zhongke Nanjing Intelligent Technology Research Institute
Filing date: 2023-12-29
Publication date: 2024-03-29
Application filed by Zhongke Nanjing Intelligent Technology Research Institute

Abstract

The invention discloses a method, an apparatus, a medium and a device for scheduling a convolution data stream, wherein the method comprises the following steps: dividing an input feature map in sequence along the channel direction into C/M input channel groups and, within each input channel group, dividing the group from left to right into several input blocks according to the calculation-block width required by the multiplier array; dividing the convolution kernels participating in the operation into Z/N columns of convolution kernel groups, and dividing each column of convolution kernel groups in sequence along the channel direction into C/M convolution kernel channel groups, wherein Z represents the number of channels of the output feature map and N the number of convolution kernels in a single-column convolution kernel group; and outputting the convolution kernels participating in the operation and the input feature map to the multiplier array to perform the convolution operation, the convolution kernels being input in units of channel groups and the input feature map in units of input blocks.

Description

Method, device, medium and equipment for scheduling convolution data stream
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a medium, and a device for scheduling a convolutional data stream.
Background
In artificial neural networks, the convolution operation extracts local features of the input data; hierarchical feature learning is performed by stacking multiple convolution layers, while parameter sharing reduces the number of network parameters and improves computational efficiency.
At present, convolution-related data is mainly transmitted to a multiplier array to carry out the multiply-accumulate convolution calculation. This approach requires transmitting large amounts of data into the multiplier array at once, which is prone to data congestion and makes it difficult to deploy and implement convolution calculations in low-power multipliers with limited bandwidth and tight computational power.
Disclosure of Invention
The invention provides a scheduling method, apparatus, medium and device for a convolution data stream, which are used to solve, or partially solve, the technical problem of data congestion caused by the excessive volume of calculation data currently transmitted to the multiplier array.
To solve the above technical problem, in a first aspect of the present invention, a scheduling method of a convolutional data stream is disclosed, where the method includes:
dividing an input feature map in sequence along the channel direction into C/M input channel groups and, within each input channel group, dividing the group from left to right along the width into several input blocks according to the calculation-block width required by a multiplier array; in a single input channel group, all input blocks participate in the convolution operation in order from left to right and from top to bottom, and the width of a single input block matches the required calculation-block width; wherein C represents the number of channels of the input feature map, and M represents the number of channels in a single channel group;
dividing convolution kernels participating in operation into Z/N columns of convolution kernel groups, and sequentially dividing each column of convolution kernel groups into C/M convolution kernel channel groups according to the channel direction; wherein Z represents the number of channels of the output feature map, and N represents the number of convolution kernels in the single-column convolution kernel group;
outputting the convolution kernel participating in the operation and the input feature map to the multiplier array to execute the convolution operation; the convolution kernel participating in the operation is input by taking a channel group as a unit, and the input feature map is input by taking an input block as a unit.
Optionally, the number of input blocks in a single input channel group depends on the output feature map size and the required calculation-block width.
Optionally, the outputting the convolution kernel and the input feature map to the multiplier array to perform convolution operation specifically includes:
inside a single input block, outputting the data to the multiplier array in row-first, then column, then channel order to perform the convolution operation;
in a single convolution kernel channel group, outputting N convolution kernel blocks contained in the single convolution kernel channel group to the multiplier array in columns to execute convolution operation; wherein the single convolution kernel channel group comprises N convolution kernel blocks arranged in a column.
Optionally, in the single convolution kernel channel group, outputting the N convolution kernel blocks included in the single convolution kernel channel group to the multiplier array in columns to perform convolution operation, which specifically includes:
for a single convolution kernel block, outputting the weight data together with the input features at the corresponding positions to the multiplier array to perform the convolution operation, traversing the output in row-first, then column, then channel order until the single convolution kernel block has been traversed.
Optionally, the weight data corresponding to the convolution kernel participating in the operation and the input feature corresponding to the input feature map are stored in a cache.
Optionally, before the input feature map is sequentially segmented into C/M input channel groups according to the channel direction, the method further includes:
sequentially dividing the output feature map along the channel direction into Z/N output channel groups, which are filled in sequence with the calculation results from the multiplier array in row-first, then column, then channel order.
Optionally, before the output feature map is sequentially segmented into Z/N output channel groups along the channel direction, the method further includes:
substituting the relevant parameters of the input feature map and of the convolution kernel into a size calculation formula in advance, and calculating the size of the output feature map;
the size calculation formula is as follows:

Ux = ⌊(LP + Wx + RP − DX·(Rx − 1) − 1) / SX⌋ + 1
Vx = ⌊(TP + Hx + BP − DY·(Sx − 1) − 1) / SY⌋ + 1

wherein Ux and Vx represent the output feature map width and height, LP represents the left boundary fill size, Wx the input feature map width, RP the right boundary fill size, Sx the convolution kernel height, DX the convolution kernel lateral dilation, SX the convolution lateral step size, TP the top boundary fill size, Hx the input feature map height, BP the bottom boundary fill size, Rx the convolution kernel width, DY the convolution kernel longitudinal dilation, and SY the convolution longitudinal step size.
In a second aspect of the present invention, a scheduling apparatus for a convolutional data stream is disclosed, including:
the first dividing module is used for dividing the input characteristic diagram into C/M input channel groups in turn according to the channel direction, and dividing each channel group into a plurality of input blocks in turn according to the width from left to right in each input channel group according to the calculated block width size required by the multiplier array; in a single input channel group, all input blocks participate in convolution operation in sequence from left to right and from top to bottom, and the size of the single input block along the width direction is consistent with the width size of the required calculation block; wherein C represents the number of channels of the input feature map, and M represents the number of channels in a single channel group;
the second segmentation module is used for dividing the convolution kernels participating in the operation into Z/N columns of convolution kernel groups, and each column of convolution kernel groups is sequentially segmented into C/M convolution kernel channel groups according to the channel direction; wherein Z represents the number of channels of the output feature map, and N represents the number of convolution kernels in the single-column convolution kernel group;
the output module is used for outputting the convolution kernel participating in the operation and the input feature map to the multiplier array to execute the convolution operation; the convolution kernel participating in the operation is input in a unit of a channel group, and the input feature map is input in a unit of a block.
In a third aspect of the present invention, a computer-readable storage medium is disclosed, on which a computer program is stored which, when being executed by a processor, implements the steps of the above-described method.
In a fourth aspect of the invention, a computer device is disclosed comprising a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the steps of the above method when executing said program.
Through one or more technical solutions of the invention, the invention has the following beneficial effects or advantages:
The invention provides a scheduling method, apparatus, medium and device for a convolution data stream, which batch-process the input feature map and the convolution kernels participating in the operation, converting a large convolution task into small convolution tasks that are output step by step to the multiplier array to perform the convolution operation. This relieves data congestion to a great extent, reduces the impact on bandwidth, and also supports the deployment and implementation of convolution calculation in low-power multipliers with limited bandwidth and tight computational power.
The foregoing description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and to make the above and other objects, features and advantages of the present invention more readily apparent, specific embodiments are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures.
In the drawings:
FIG. 1 illustrates a flow chart of a method of scheduling a convolved data stream according to one embodiment of the invention;
FIG. 2 illustrates the partitioning logic of the input feature map, convolution kernels and output feature map in the ith cycle according to one embodiment of the present invention;
FIGS. 3-4 are diagrams illustrating positional correspondence of input blocks according to one embodiment of the present invention;
FIG. 5 illustrates a convolution calculation correspondence diagram of an input block according to one embodiment of the present invention;
fig. 6 shows a schematic diagram of a scheduling apparatus for convolved data streams according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
In a first aspect, embodiments of the present disclosure provide a method for scheduling a convolution data stream. The method optimizes the bottom layer of the convolution operation and can therefore be applied in various fields such as artificial intelligence, image recognition and industrial classification. As shown in fig. 1, the scheduling method of a convolution data stream provided in the embodiment of the present disclosure comprises at least the following steps:
s101, sequentially dividing an input feature map into C/M input channel groups according to the channel direction, and sequentially dividing each channel group into a plurality of input blocks according to the width from left to right in each input channel group according to the width dimension of a calculation block required by a multiplier array.
Specifically, referring to FIG. 2, the dimensions of the input feature map are [Wx, Hx, C], where Wx is the width, Hx the height and C the number of channels. To reduce the output volume, the input feature map is first divided in sequence along the channel direction into C/M input channel groups, where C represents the number of channels of the input feature map and M the number of channels in a single channel group.
For example, an input feature map of size 6×6 with 12 channels may be segmented into 4 input channel groups of size 6×6 with 3 channels each.
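The channel-direction split can be sketched as follows (a minimal illustration assuming a NumPy array in [height, width, channel] layout; the function and variable names are ours, not the patent's):

```python
import numpy as np

def split_channel_groups(fmap, M):
    """Split a [H, W, C] input feature map into C/M input channel groups."""
    H, W, C = fmap.shape
    assert C % M == 0, "C must be divisible by the group channel count M"
    return [fmap[:, :, g * M:(g + 1) * M] for g in range(C // M)]

# 6x6 input feature map with 12 channels -> 4 groups of 6x6 with 3 channels
fmap = np.arange(6 * 6 * 12, dtype=np.int32).reshape(6, 6, 12)
groups = split_channel_groups(fmap, M=3)
print(len(groups), groups[0].shape)  # 4 (6, 6, 3)
```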
Each input channel group is then divided into several input blocks. The division is the same for every input channel group, so only one group is described below.
Notably, the number of input block divisions for a single input channel group depends on the number of weights within a single convolution kernel participating in the operation: each weight data corresponds to one input block division of the single input channel group. For example, for a 3×3 convolution kernel the number of weights is 9, so a single input channel group is divided 9 times; the number of input blocks is the same for each division, but their positions differ, since the positions of the input blocks are related to the position of the weight data inside the single convolution kernel participating in the operation. For a single input channel group, the number of input block divisions, the number of loops and the number of weights within a single convolution kernel are all the same. Taking figs. 3-4 as an example, in the 1st cycle the first weight data of a single convolution kernel participates in the calculation, so for the input feature map of S101 the positions of the input blocks divided in the single input channel group are as shown in fig. 3 (channel number not shown). In the 2nd cycle the second weight data of the single convolution kernel participates in the calculation, and the positions of the divided input blocks are as shown in fig. 4 (channel number not shown). A single input channel group need only perform input block division once per cycle.
In a single input block division operation of a single input channel group:
1. The number of divisions of the input block depends on the output feature map size and the calculation-block width required by the multiplier array: number of input blocks = (Ux × Vx) / L, wherein Ux × Vx is the size of the output feature map. For example, with an input feature map of 6x6, a convolution kernel of 3x3 and a step size of 1, the output feature map is 4x4 (the output feature map size may be calculated in advance; the calculation is described later and not repeated here). L represents the calculation-block width required by the multiplier array; it is related to the number of multipliers in the array and may be user-defined and known in advance. Referring to fig. 3, when the output feature map size is 4x4 and L = 2, the number of divisions of the input block is (4 × 4) / 2 = 8.
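A quick check of this count with the example values (plain arithmetic; the variable names mirror the formula above):

```python
Ux, Vx = 4, 4   # output feature map size
L = 2           # calculation-block width required by the multiplier array
num_input_blocks = (Ux * Vx) // L
print(num_input_blocks)  # 8, matching fig. 3
```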
2. The size of the input block is [width L, height 1, number of channels M]. The width L of the input block matches the required calculation-block width; the height is 1; the number of channels M in the depth direction is determined by the number of multipliers in the array (for example, with 16 multipliers, M is taken as 16; other values are possible and no limitation is imposed). The interior of a single input block participates in the convolution operation in row-first, then column, then channel order, and the input blocks participate in sequence from left to right and from top to bottom.
3. The division position of the input block is related to the position of the single weight data; the correspondence between the two is determined by the size of the output feature map.
In a specific implementation process, the method comprises the following steps:
(1) The number of elements of the output feature map is calculated from its size. The number of elements is the number of related input features corresponding to a single weight data, where "related" means having a convolution calculation relationship. For example, with a single-channel input feature map of 6x6, a convolution kernel of 3x3 and a step size of 1, the output feature map is 4x4 and its number of elements is 16, i.e. one weight data in the convolution kernel corresponds to 16 related input features. As shown in fig. 3, the first weight data of the convolution kernel has a convolution calculation relationship with the 16 input features of the shaded region. Notably, the convolution operation of an individual weight data with its related input features is referred to as a stripe operation; the upper limit of the number of elements of a stripe operation is determined by the number of multipliers: with 16 multipliers, the upper limit of the element count calculated by one stripe operation is 16 and the lower limit is 8.
(2) Based on the number of elements, the number of related input features of the single weight data in a single channel is determined. For example, when the number of elements of the output feature map is 16, the single weight data has a convolution calculation relationship with 16 input features.
(3) The feature position area of the related input features is determined from the number of elements and the position of the single weight data. For example, for the first weight data of the 3x3 convolution kernel, the position area of its related input features is the area of the 16 shaded input features in fig. 3.
(4) The number of features contained in each input block is obtained from the number of related input features and the number of input blocks. For example, given the determined 16 input features and the 8 input blocks to be divided, each input block contains 2 input features.
(5) The division position of each input block is determined from the feature position area, from left to right and from top to bottom, according to the number of features per block. Since each input block must contain 2 input features, 2 input features are taken from the feature position area in order from left to right and top to bottom to form one input block, giving the respective division positions of the 8 input blocks.
For further illustration, fig. 4 shows the input block division corresponding to the second weight data of the convolution kernel, i.e. another input block division result within the same input channel group. A sketch of this division procedure follows.
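Steps (1) to (5) can be sketched as follows for the unpadded, stride-1 case of the example (a simplified illustration; the function name, the [row, column, channel] layout and the (ky, kx) weight coordinates are our assumptions):

```python
import numpy as np

def input_blocks_for_weight(group, ky, kx, Ux, Vx, L):
    """Divide one input channel group [H, W, M] into (Ux*Vx)/L input blocks
    for the weight at kernel position (ky, kx), assuming stride 1, no padding."""
    related = group[ky:ky + Vx, kx:kx + Ux, :]   # feature position area: Ux*Vx related features
    blocks = []
    for row in range(Vx):                        # top to bottom
        for col in range(0, Ux, L):              # left to right, L features per block
            blocks.append(related[row:row + 1, col:col + L, :])  # block [1, L, M]
    return blocks

group = np.zeros((6, 6, 3))                      # one channel group of the 6x6x12 example
blocks = input_blocks_for_weight(group, ky=0, kx=0, Ux=4, Vx=4, L=2)
print(len(blocks), blocks[0].shape)              # 8 (1, 2, 3)
```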
In addition to dividing the input feature map into blocks, the convolution kernels participating in the operation are also segmented, to further reduce the volume of data input to the multiplier array.
S102, dividing convolution kernels participating in operation into Z/N columns of convolution kernel groups, and sequentially dividing each column of convolution kernel group into C/M convolution kernel channel groups according to the channel direction.
Specifically, the size of the convolution kernels participating in the operation is [width Rx, height Sx, number of channels C, number of convolution kernels Z], where Z also equals the number of channels of the output feature map and N represents the number of convolution kernels in a single-column convolution kernel group.
Further, referring to FIG. 2, the Z convolution kernels participating in the operation are divided into Z/N columns of convolution kernel groups, each column containing N convolution kernels; each column group is then divided in sequence along the channel direction into C/M convolution kernel channel groups, each containing N convolution kernel blocks arranged in a column. For example, convolution kernels of size [3, 3, 12] and number 3 are divided by number into 3 columns of convolution kernel groups, each column containing one convolution kernel, and each column group is divided in sequence along the channel direction into 4 convolution kernel channel groups of 3 channels each.
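A sketch of the kernel-side split in the same style (assuming a [Z, S, R, C] weight layout; all names are illustrative):

```python
import numpy as np

def split_kernels(weights, N, M):
    """Split [Z, S, R, C] kernels into Z/N column groups; each column group is
    cut along the channel direction into C/M kernel channel groups, each
    holding N kernel blocks of M channels."""
    Z, S, R, C = weights.shape
    return [[weights[col * N:(col + 1) * N, :, :, g * M:(g + 1) * M]
             for g in range(C // M)]
            for col in range(Z // N)]

weights = np.zeros((3, 3, 3, 12))        # 3 convolution kernels of size 3x3x12
col_groups = split_kernels(weights, N=1, M=3)
print(len(col_groups), len(col_groups[0]), col_groups[0][0].shape)  # 3 4 (1, 3, 3, 3)
```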
It should be noted that when the convolution kernel channel groups are segmented, the segmentation must use the same channel count M that was used when the input feature map was segmented into input channel groups, so that each convolution kernel block in each convolution kernel channel group contains M channels, facilitating the subsequent convolution calculation.
S103, outputting the convolution kernel and the input feature map which participate in the operation to a multiplier array to execute the convolution operation.
When the ith loop is executed, the convolution kernels and the input feature map participating in the operation are output to the multiplier array to perform the convolution operation of the ith cycle; the concept of the cycle period is described later and not repeated here.
The input feature map is input in units of "input blocks". Specifically, referring to FIG. 2, a single input channel group contains several input blocks. Within a single input channel group, the input blocks are output in sequence, from left to right and from top to bottom, to the multiplier array to perform the convolution operation. Inside a single input block, the data are output to the multiplier array in row-first, then column, then channel order to perform the convolution operation.
For example, the input feature map is divided into channel groups along the channel direction; each input channel group includes M channels and is divided into several input blocks of width L and height 1, so the size of an input block is [width L, height 1, number of channels M].
When the input feature map is output, the input feature values are output to the multiplier array in units of "input block" within "channel group".
For the current input block of the current input channel group, L traversals are performed in the width direction and M traversals in the channel direction; that is, M pieces of data of length L are output in sequence to the multiplier array, completing the output of the current input block. The process then moves, from left to right and from top to bottom, to the next input block, which is output in the same manner to perform the convolution operation. Within the current input channel group, the input blocks are thus output to the multiplier array from left to right and from top to bottom.
After all input blocks in the current input channel group have been output, the calculation of the current input channel group is finished and the output of the next input channel group begins in the same manner; the input channel groups are output in sequence along the channel division direction until all C/M input channel groups have been calculated, as sketched below.
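In loop form, the input-side output order reads roughly as follows (a pseudostructural sketch; `emit` stands in for whatever pushes one length-L row into the multiplier array, and all names are assumptions):

```python
def stream_input_feature_map(channel_groups, emit):
    """Stream the input feature map to the multiplier array: channel group by
    channel group; within a group, input blocks left-to-right, top-to-bottom;
    within a block [1, L, M], one length-L row per channel, M rows in total."""
    for blocks in channel_groups:        # C/M input channel groups, in split order
        for block in blocks:             # input blocks, left to right, top to bottom
            _, L, M = block.shape
            for m in range(M):           # M traversals along the channel direction
                emit(block[0, :, m])     # one piece of data of length L
```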
The convolution kernels involved in the operation are output in units of "channel groups". Specifically, referring to FIG. 2, a single set of convolution kernel channels contains N convolution kernel blocks arranged in a column. In a single convolution kernel channel group, N convolution kernel blocks are output to a multiplier array in columns to execute convolution operation.
Further, for a single convolution kernel block, the weight data and the input features at the corresponding positions are output to the multiplier array to perform the convolution operation, traversing the output in row-first, then column, then channel order until the single convolution kernel block has been traversed. The input feature and the weight data at the corresponding position have a convolution operation relationship, and the correspondence between them is determined by the size of the output feature map: the number of elements of the output feature map is calculated from its size, and the number and position area of the related input features are determined from the number of elements and the position of the single weight data. For example, if the input feature map is 6x6, the convolution kernel 3x3 and the step size 1, the output feature map is 4x4 and its number of elements is 16, i.e. one weight data in the convolution kernel corresponds to 16 input features. As shown in fig. 3, the first weight data of the convolution kernel has a convolution operation relationship with the 16 input features of the shaded region.
The output order of the convolution kernels that participate in the calculation is described in detail below.
In a single convolution kernel channel group, the N convolution kernel blocks are output by column into the multiplier array. For the first of the N convolution kernel blocks, the first channel is traversed in row-first, then column order; reaching the last row and last column indicates that the first channel of the convolution kernel block has been fully scanned out. After the first channel has been scanned out, the next channel of the convolution kernel block is traversed in the same way along the channel direction, until the whole convolution kernel block has been traversed.
It should be noted that every position in the convolution kernel block is traversed, but only the weight data at positions corresponding to input features are convolved.
After one convolution kernel block has been traversed and output, the traversal moves by column to the next of the N convolution kernel blocks, until all N blocks have been traversed and the traversal of the single convolution kernel channel group is complete. The traversal inside each convolution kernel block follows the manner described above and is not repeated.
After the single convolution kernel channel group is traversed, traversing the next convolution kernel channel group according to the channel direction until C/M convolution kernel channel groups in the single-column convolution kernel group are traversed.
After the single-column convolution kernel group has been traversed, the remaining convolution kernel groups may be traversed in arbitrary order, or traversed one by one along the division direction, until all Z/N columns of convolution kernel groups have been traversed. The kernel-side traversal is sketched below.
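The kernel-side traversal order, in the same pseudostructural style (names assumed; each `emit` passes one weight value):

```python
def stream_convolution_kernels(column_groups, emit):
    """Stream weights: column groups one by one; within a column group, the
    C/M channel groups along the channel direction; within a channel group,
    the N kernel blocks by column; within a block [S, R, M], row first, then
    column, channel by channel."""
    for column_group in column_groups:           # Z/N columns of kernel groups
        for channel_group in column_group:       # C/M kernel channel groups
            for block in channel_group:          # N kernel blocks, by column
                S, R, M = block.shape
                for m in range(M):               # one channel at a time
                    for r in range(S):           # row first
                        for c in range(R):       # then column
                            emit(block[r, c, m])
```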
In an alternative embodiment, the weight data corresponding to the convolution kernels participating in the operation and the input features corresponding to the input feature map are stored in the cache; they may also be stored in other memories, without limitation.
When extracting weight data, the weight data are extracted in the order in which the convolution kernels are output to the multiplier array. Specifically, within a convolution kernel block, weight data are extracted in row-first, then column, then channel order; between convolution kernel blocks, each block is traversed by column from top to bottom; between convolution kernel channel groups, weight data are extracted group by group in the channel-group switching order; and between convolution kernel column groups, the remaining groups may be traversed in arbitrary order or one by one along the division direction, until the weight data corresponding to all convolution kernels have been extracted.
When extracting the input feature map, the input features are extracted in the order in which the input feature map is output to the multiplier array. Specifically, within an input block, input features are extracted in row-first, then column, then channel order; between input blocks, each block is traversed in order from left to right; and between input channel groups, input features are extracted group by group in the channel-group switching order, until all the input features cut from the input feature map for this cycle have been extracted.
It should be noted that within a single cycle, the weight data may serve as the reference: after the single weight data at the same position in the convolution kernel blocks is extracted, the corresponding input features are extracted according to the convolution operation relationship. Conversely, the several input blocks in a single input channel group may serve as the target, with the corresponding single weight data extracted according to the convolution operation relationship.
In the convolution calculation of the input feature map with the convolution kernels, referring to fig. 5, a matrix multiplication is performed between a single input block [width L, height 1, number of channels M] and a convolution kernel block [number of kernels N, number of channels M] of a single convolution kernel channel group, giving the partial output features corresponding to the single input block, of size [width L, height 1, number of channels N]. These partial output features are filled into the output feature map.
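The per-block computation of fig. 5 is a small matrix multiplication; a minimal NumPy sketch with the sizes used above (variable names are illustrative):

```python
import numpy as np

L, M, N = 2, 3, 1
input_block = np.random.rand(L, M)        # [width L, height 1, channels M], height squeezed out
kernel_block = np.random.rand(N, M)       # N kernels x M channels at one weight position

partial_output = input_block @ kernel_block.T   # [L, N]
print(partial_output.shape)                     # (2, 1): width L, height 1, channels N
```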
The output feature map is divided in sequence along the channel direction into Z/N output channel groups, which are filled in sequence with the calculation results from the multiplier array in row-first, then column, then channel order. Referring to fig. 2, each output channel group is filled with the calculation results of the multiplier array along the width direction and then the channel direction; when all calculation blocks are full, the output features of that channel group are finished and filling of the next output channel group begins, continuing in sequence until all Z/N output channel groups are filled. Note that this is the filling logic of the output feature map in the ith cycle; the output feature map of each cycle is filled according to this logic.
Note that S101 to S103 describe the output of the ith cycle; S101 to S103 are executed cyclically until the final output feature map is obtained. The number of cycles is determined by the number of weight data within a single convolution kernel: for example, a 3×3 convolution kernel has 9 weight data, so the operations of S101 to S103 are looped 9 times. Notably, the positions of the several input blocks divided within a single input channel group in each cycle depend on the position of the weight data having the convolution relationship. Of course, in S102 all weights may be extracted at once; the weights may then be output only once, to save computing resources, or output cyclically together with S101 and S103.
Taking figs. 3-4 as an example: in the 1st cycle, the first weight data of the single convolution kernel participates in the calculation, so when S101 divides the input feature map into input blocks, the positions of the divided blocks within a single input channel group are as shown in fig. 3 (channel number not shown). In the 2nd cycle, the second weight data of the single convolution kernel participates in the calculation, and the positions of the divided blocks are as shown in fig. 4 (channel number not shown).
For ease of illustration and explanation of the present invention, the following example uses an input feature map of size [6, 6, 2] and 2 convolution kernels of size [3, 3, 2], divided into 2 groups of 1 convolution kernel each; 9 cycles are performed.
In the 1st cycle, the input feature map is divided by input channel group and input block into the 8 input blocks shown in fig. 3; the first input block has size [2, 1, 2] with data a1, a2, b1, b2, and the convolution kernel channel group comprises 1 convolution kernel block of size [2, 1, 2]. After the data of the first input block and the first weight data A, B of the convolution kernel block undergo the convolution calculation, the partial output features corresponding to the first input block, of size [2, 1], are obtained (see fig. 5). These partial output features are filled into the corresponding output feature map, the filling order being along the width direction and then the channel direction; refer to the first output feature filling of the output feature map in fig. 2. The remaining 7 input blocks undergo the convolution calculation with the first weight data of the convolution kernel block according to the same logic, and the resulting partial output features are filled into the output feature map in sequence.
In the 2nd cycle, the input feature map is divided by input channel group and input block into the 8 input blocks shown in fig. 4; after the 8 input blocks undergo the convolution calculation with the second weight data of the convolution kernel block in sequence, the corresponding partial output features are obtained and filled into the output feature map in sequence, along the width direction and then the channel direction.
When calculating the final output feature map, the data of the 2nd cycle may be filled directly, according to the progressive relation, into the output feature map filled in the 1st cycle, the data of the 3rd cycle into the output feature map filled in the 2nd cycle, and so on, giving the final output feature map after the 9th cycle. Alternatively, each single cycle may fill its own output feature map, and after the 9 cycles the data in the 9 corresponding output feature maps are accumulated by corresponding position to obtain the final output feature map.
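The second variant, accumulating the nine per-cycle output feature maps by position, can be sketched as follows (the `cycle_outputs` list is a hypothetical stand-in for the nine filled maps):

```python
import numpy as np

Ux, Vx, Z = 4, 4, 2
cycle_outputs = [np.zeros((Vx, Ux, Z)) for _ in range(9)]  # one map per weight of a 3x3 kernel

final_output = np.zeros((Vx, Ux, Z))
for partial in cycle_outputs:
    final_output += partial          # accumulate by corresponding position
print(final_output.shape)            # (4, 4, 2)
```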
In an alternative embodiment, the calculation of the output feature map size is introduced. The relevant parameters of the input feature map and of the convolution kernel are substituted in advance into a size calculation formula to calculate the size of the output feature map; the dimensions of the output feature map are [width Ux, height Vx, number of channels Z].
The size calculation formula is:

Ux = ⌊(LP + Wx + RP − DX·(Rx − 1) − 1) / SX⌋ + 1
Vx = ⌊(TP + Hx + BP − DY·(Sx − 1) − 1) / SY⌋ + 1

wherein LP represents the left boundary fill size of the input feature map, Wx the input feature map width, RP the right boundary fill size of the input feature map, Sx the convolution kernel height, DX the convolution kernel lateral dilation, SX the convolution lateral step size, TP the top boundary fill size of the input feature map, Hx the input feature map height, BP the bottom boundary fill size of the input feature map, Rx the convolution kernel width, DY the convolution kernel longitudinal dilation, and SY the convolution longitudinal step size.
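A direct transcription of the formula, checked against the 6×6 input / 3×3 kernel / stride-1 example used throughout (variable names follow the formula; the defaults assume no padding and no dilation):

```python
def output_size(Wx, Hx, Rx, Sx, SX=1, SY=1, DX=1, DY=1,
                LP=0, RP=0, TP=0, BP=0):
    Ux = (LP + Wx + RP - DX * (Rx - 1) - 1) // SX + 1  # output width
    Vx = (TP + Hx + BP - DY * (Sx - 1) - 1) // SY + 1  # output height
    return Ux, Vx

print(output_size(Wx=6, Hx=6, Rx=3, Sx=3))  # (4, 4)
```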
The above is the specific scheduling logic of the scheduling method for a convolution data stream provided by the technical solution of the present invention. To map the convolution operation onto the multiplier array more effectively, the circulation order of the data streams (input feature map, convolution kernels) input to the multiplier array is re-optimized and the data streams are grouped and batched. By inputting the input feature map and the convolution kernels participating in the operation into the multiplier array in parallel in a pipelined manner, a large convolution task is converted into small convolution tasks, which relieves data congestion to a great extent, reduces the impact on bandwidth, and also supports the implementation of convolution calculation in low-power multipliers with limited bandwidth and tight computational power.
In addition, since there is no data dependency between cycles of the convolution operation, the technical solution of the present invention adjusts the cyclic output order of the input feature map and of the convolution kernels participating in the operation, effectively dividing the loop into several sub-loops, so as to maximize utilization of the multiplier array, improve its utilization efficiency, relieve the pressure on the data cache, and reduce the impact on bandwidth.
In a second aspect, based on the same inventive concept as the method for scheduling a convolutional data stream provided in the foregoing first aspect, an embodiment of the present disclosure further provides a scheduling apparatus for a convolutional data stream, referring to fig. 6, including:
a first dividing module 601, configured to divide an input feature map into C/M input channel groups in turn according to a channel direction, and divide each channel group into a plurality of input blocks in turn according to a width of a calculation block required by a multiplier array in each input channel group from left to right according to a width; in a single input channel group, all input blocks participate in convolution operation in sequence from left to right and from top to bottom, and the size of the single input block along the width direction is consistent with the width size of the required calculation block; wherein C represents the number of channels of the input feature map, and M represents the number of channels in a single channel group;
the second segmentation module 602 is configured to divide the convolution kernels involved in the operation into Z/N columns of convolution kernel groups, where each column of convolution kernel groups is segmented into C/M convolution kernel channel groups in sequence according to the channel direction; wherein Z represents the number of channels of the output feature map, and N represents the number of convolution kernels in the single-column convolution kernel group;
an output module 603, configured to output the convolution kernel participating in the operation and the input feature map to the multiplier array to perform the convolution operation; the convolution kernel participating in the operation is input in a unit of a channel group, and the input feature map is input in a unit of a block.
It should be noted that, the specific manner in which the respective modules perform the operations in the scheduling apparatus for a convolutional data stream provided in the embodiment of the present disclosure has been described in detail in the embodiment of the method provided in the first aspect, and the specific implementation process may refer to the embodiment of the method provided in the first aspect and will not be described in detail herein.
In a third aspect, based on the same inventive concept as the method for scheduling a convolutional data stream provided in the foregoing first aspect embodiment, an embodiment of the present invention further discloses a computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of any of the foregoing methods.
In a fourth aspect, based on the same inventive concept as the method for scheduling a convolutional data stream provided in the foregoing first aspect, an embodiment of the present invention further discloses a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of any of the foregoing methods when executing the program.
Through one or more embodiments of the present invention, the present invention has the following benefits or advantages:
The invention provides a scheduling method, apparatus, medium and device for a convolution data stream, which batch-process the input feature map and the convolution kernels participating in the operation, converting a large convolution task into small convolution tasks that are output step by step to the multiplier array to perform the convolution operation. This relieves data congestion to a great extent, reduces the impact on bandwidth, and also supports the deployment and implementation of convolution calculation in low-power multipliers with limited bandwidth and tight computational power.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required to construct such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language; it will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided to disclose the enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in a gateway, proxy server, system according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.

Claims (10)

1. A method of scheduling a convolved data stream, the method comprising:
dividing an input feature map in sequence along the channel direction into C/M input channel groups and, within each input channel group, dividing the group from left to right along the width into several input blocks according to the calculation-block width required by a multiplier array; in a single input channel group, all input blocks participate in the convolution operation in order from left to right and from top to bottom, and the width of a single input block matches the required calculation-block width; wherein C represents the number of channels of the input feature map, and M represents the number of channels in a single channel group;
dividing convolution kernels participating in operation into Z/N columns of convolution kernel groups, and sequentially dividing each column of convolution kernel groups into C/M convolution kernel channel groups according to the channel direction; wherein Z represents the number of channels of the output feature map, and N represents the number of convolution kernels in the single-column convolution kernel group;
outputting the convolution kernel participating in the operation and the input feature map to the multiplier array to execute the convolution operation; the convolution kernel participating in the operation is input by taking a channel group as a unit, and the input feature map is input by taking an input block as a unit.
2. The method of claim 1, wherein the number of input blocks in a single input channel group is dependent on an output feature map size and the required calculated block width size.
3. The method as claimed in claim 1, wherein said outputting the convolution kernel and the input signature of the participating operation into the multiplier array performs a convolution operation, specifically comprising:
inside a single input block, outputting the data to the multiplier array in row-first, then column, then channel order to perform the convolution operation;
in a single convolution kernel channel group, outputting N convolution kernel blocks contained in the single convolution kernel channel group to the multiplier array in columns to execute convolution operation; wherein the single convolution kernel channel group comprises N convolution kernel blocks arranged in a column.
4. The method of claim 3, wherein in the single convolution kernel channel group, the convolution operation is performed by outputting N convolution kernel blocks included in columns into the multiplier array, and specifically includes:
and outputting weight data of the corresponding position of each input feature in the single convolution kernel block to the multiplier array for performing convolution operation aiming at the single convolution kernel block, and traversing the output in a way of sequentially carrying out the front-column and rear-column re-channel until the single convolution kernel block is traversed.
5. The method according to claim 3 or 4, wherein weight data corresponding to the convolution kernel participating in the operation and input features corresponding to the input feature map are stored in a cache.
6. The method of claim 1, wherein before the sequentially segmenting the input feature map into the C/M input channel groups according to the channel direction, the method further comprises:
and sequentially dividing the output characteristic diagram into Z/N output channel groups along the channel direction, and sequentially filling the calculation results in the multiplier array in a mode of leading and trailing and re-channel.
7. The method of claim 6, wherein before sequentially slicing the output signature into Z/N output channel groups along the channel direction, the method further comprises:
substituting the relevant parameters of the input feature map and of the convolution kernel into a size calculation formula in advance, and calculating the size of the output feature map;
the size calculation formula is as follows:

Ux = ⌊(LP + Wx + RP − DX·(Rx − 1) − 1) / SX⌋ + 1
Vx = ⌊(TP + Hx + BP − DY·(Sx − 1) − 1) / SY⌋ + 1

wherein Ux and Vx represent the output feature map width and height, LP represents the left boundary fill size, Wx the input feature map width, RP the right boundary fill size, Sx the convolution kernel height, DX the convolution kernel lateral dilation, SX the convolution lateral step size, TP the top boundary fill size, Hx the input feature map height, BP the bottom boundary fill size, Rx the convolution kernel width, DY the convolution kernel longitudinal dilation, and SY the convolution longitudinal step size.
8. A scheduling apparatus for a convolved data stream, comprising:
the first dividing module is used for dividing the input characteristic diagram into C/M input channel groups in turn according to the channel direction, and dividing each channel group into a plurality of input blocks in turn according to the width from left to right in each input channel group according to the calculated block width size required by the multiplier array; in a single input channel group, all input blocks participate in convolution operation in sequence from left to right and from top to bottom, and the size of the single input block along the width direction is consistent with the width size of the required calculation block; wherein C represents the number of channels of the input feature map, and M represents the number of channels in a single channel group;
the second segmentation module is used for dividing the convolution kernels participating in the operation into Z/N columns of convolution kernel groups, and each column of convolution kernel groups is sequentially segmented into C/M convolution kernel channel groups according to the channel direction; wherein Z represents the number of channels of the output feature map, and N represents the number of convolution kernels in the single-column convolution kernel group;
the output module is used for outputting the convolution kernel participating in the operation and the input feature map to the multiplier array to execute the convolution operation; the convolution kernel participating in the operation is input in a unit of a channel group, and the input feature map is input in a unit of a block.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1-7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1-7 when the program is executed by the processor.
CN202311849647.7A 2023-12-29 2023-12-29 Method, device, medium and equipment for scheduling convolution data stream Pending CN117787365A (en)

Priority Applications (1)

Application Number: CN202311849647.7A · Priority/Filing Date: 2023-12-29 · Title: Method, device, medium and equipment for scheduling convolution data stream

Publications (1)

Publication Number: CN117787365A · Publication Date: 2024-03-29

Family

ID=90396186

Country Status (1)

Country: CN · CN117787365A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination