CN111797985A - Convolution operation memory access optimization method based on GPU - Google Patents

Convolution operation memory access optimization method based on GPU

Info

Publication number
CN111797985A
Authority
CN
China
Prior art keywords
data
convolution
thread
row
gpu
Prior art date
Legal status
Granted
Application number
CN202010710031.1A
Other languages
Chinese (zh)
Other versions
CN111797985B (en)
Inventor
张伟哲
鲁刚钊
王峥
李克勤
孙广中
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202010710031.1A
Publication of CN111797985A
Application granted
Publication of CN111797985B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A convolution operation memory access optimization method based on a GPU, relating to convolution memory access optimization technology. The invention overcomes the high memory access overhead of convolution operations in the prior art. The technical points are as follows: loading the convolution kernel data into shared memory; dividing the convolution output into sub-blocks of 32 columns, obtaining several sub-blocks containing 32 columns of data and one sub-block with fewer than 32 columns of data; each thread calculating the index of the first data it requires; each thread obtaining the remaining required input data starting from the index of the first data through a column reuse algorithm and passing the acquired input data to a row reuse algorithm; computing the output results through the row reuse algorithm and storing them in the register array sum; writing sum to global memory; and computing the remaining data to be calculated in the convolution output. The method is used for memory access optimization of convolution operations in the fields of image processing, video processing and machine learning.

Description

Convolution operation memory access optimization method based on GPU
Technical Field
The invention relates to a convolution operation memory access optimization technology, in particular to a convolution operation memory access optimization method based on a GPU.
Background
Convolution has become a core computational pattern in the fields of image processing, video processing and machine learning. 2D convolution is widely used in image filtering and frame differencing, depth-wise convolution is common in mobile neural networks, and multi-channel 2D convolution is the core operation of neural networks. However, convolution operations consume a large amount of computational and memory resources, taking up 90% of the execution time in image processing and machine learning. Many optimization methods for convolution have been proposed, among which methods based on GEMM (matrix multiplication), FFT and Winograd are the most widely used. However, these methods require converting the input and output data into matrices of a specific layout before the operation, which increases the memory access cost. There is therefore a need for an optimization technique that reduces memory accesses to address the deficiencies of the prior art.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention aims to solve the problems that the access and storage cost of the convolution operation in the prior art is high, and the access and storage times of the convolution are large, so that the performance of the convolution operation is reduced.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides a convolution operation memory access optimization method based on a GPU, which comprises the following steps: loading the convolution kernel data into a shared memory; dividing the convolution output into subblocks by 32 columns to obtain a plurality of subblocks containing 32 columns of data and 1 subblock less than 32 columns of data; n threads for processing the sub-blocks are set; each thread calculates the index of the first data required by the thread; each thread acquires the residual required input data from the index of the first data through a column reuse algorithm and transmits the acquired input data to a row reuse algorithm; calculating an output result through a row reuse algorithm and storing the output result in register data sum; writing sum into the global memory; and calculating the rest data to be calculated in the convolution output.
Preferably, the convolution kernel is of arbitrary size.
Preferably, the convolution operation is a 2D convolution, a depth-wise convolution or a multi-channel 2D convolution.
Preferably, the process of the column reuse algorithm is as follows: each thread loads the first and last data it requires from global memory; each thread obtains the required third data from the thread at an interval of 2; each thread obtains the required second and fourth data from the thread at an interval of 1.
Preferably, the method of the present invention further comprises: after each thread has loaded its required first and last data, packing the two values into one 64-bit value and storing it in a first variable array, with the required last data in the upper 32 bits and the required first data in the lower 32 bits; right-shifting the variable by 32 bits in those threads that must provide their upper 32-bit data to other threads, and by 0 bits in the other threads; and splitting the resulting 64-bit variable array, with the upper 32 bits used as the fourth data and the lower 32 bits used as the second data.
Preferably, the process of the row reuse algorithm is as follows: each time a row of the input is loaded, all outputs to which that row can contribute are computed (updated) using it.
Preferably, each thread fetches the required data from a thread with interval 1 or 2 via a CUDA shuffle instruction.
Preferably, all outputs that can be computed from a row are determined by the calculation formula of the convolution algorithm.
Preferably, the convolution kernel size is 3 or 5.
Preferably, the remaining data includes edge data and unprocessed internal data.
The beneficial effects of the invention are as follows: the number of data copies in convolution is reduced and the performance of the convolution operation is improved. The invention reduces the memory access overhead of convolution and, by reducing the number of memory accesses, greatly improves convolution performance. One application of the invention is memory access optimization of convolution operations in the fields of image processing, video processing and machine learning. Since convolution is the core computational pattern in these fields, applying the method can greatly improve their speed and efficiency. The invention can also be applied to other technical fields that use convolution as a core computational pattern.
In one embodiment, a significant speed-up over other algorithms is obtained and the number of memory block transactions is greatly reduced.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of a 2D convolution operation;
FIG. 2 is a schematic diagram of a column reuse algorithm according to one embodiment of the present invention; FIG. 2(a) illustrates a loading method in the prior art; FIG. 2(b) is a diagram illustrating each thread obtaining a third data according to an embodiment of the present invention; FIG. 2(c) is a diagram illustrating the acquisition of second and fourth data by each thread in one embodiment of the invention;
FIG. 3 is a diagram illustrating conversion of a dynamic index into a static index according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a row reuse algorithm according to one embodiment of the present invention; wherein FIG. 4(a) is a schematic illustration of an input; FIG. 4(b) is a schematic diagram of a convolution kernel; FIG. 4(c) is a schematic of the output;
FIG. 5 is a diagram illustrating a thread to output mapping relationship, according to an embodiment;
FIG. 6 is a flow chart of a method of one embodiment of the present invention;
FIG. 7 is a graph comparing performance of 2D convolution on NVIDIA GPU RTX2080 Ti; wherein fig. 7(a) is a comparison of acceleration ratios for 2D 3 x 3 convolution; fig. 7(b) is a comparison of acceleration ratios for 2D 5 x 5 convolution;
FIG. 8 is a graph comparing performance of depth-wise convolution on NVIDIA GPU RTX2080 Ti; wherein fig. 8(a) is a comparison of acceleration ratios for depth-wise 3 x 3 convolution; fig. 8(b) is a comparison of acceleration ratios for depth-wise 5 x 5 convolution.
FIG. 9 is a graph comparing performance of multi-channel 2D convolution on NVIDIA GPU RTX2080 Ti; wherein FIG. 9(a) is a comparison of acceleration ratios for multi-channel 2D convolution with a convolution depth of 1; FIG. 9(b) is a comparison of acceleration ratios for multi-channel 2D convolution with a convolution depth of 3.
FIG. 10 is a graph comparing the memory access performance of depth-wise convolution on NVIDIA GPU RTX2080 Ti; wherein FIG. 10(a) is a comparison of memory access throughput; fig. 10(b) is a comparison of maximum memory access bandwidth.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be considered a part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
It is an object of the present invention to improve the performance of convolution operations by reducing the number of accesses to memory for convolution. The design concept of the present invention is illustrated by an example.
< example >
Fig. 1 is a schematic diagram of this example, which reduces the number of memory accesses in the convolution operation mainly through two algorithms. Fig. 1 shows a simple 2D convolution with a picture size of 6 x 11 and a convolution kernel size of 5 x 5, giving an output size of 2 x 7 (i.e. (6-5+1) x (11-5+1)), with each CUDA thread computing one column of the output.
As can be seen from the figure, the input data processed by thread 0 and thread 1 share 4 repeated columns, and the input data loaded by thread 6 contains 4 repeated rows. For these two forms of repeated data, the invention provides two optimization methods: (1) the column reuse algorithm lets each thread load only the first and last columns it needs, and then obtain the remaining columns from other threads using the CUDA shuffle instruction; (2) the row reuse algorithm lets each thread load each repeated row only once, and then compute multiple outputs by multiplying each row of data with multiple rows of the convolution kernel. One serious performance problem is that when a dynamically indexed array is used in a shuffle instruction, CUDA places the array in the GPU's local memory; since local memory has the same access latency as global memory, this causes a severe performance degradation. We use pack and unpack operations to solve this performance problem.
1. Column reuse algorithm:
Taking thread 0 and thread 1 in Fig. 1 as an example, the column reuse algorithm is illustrated in Fig. 2. Fig. 2(a) shows how the direct convolution algorithm loads data. In the first step, each thread loads the first data it requires from global memory. In the second step, each thread loads the second data it requires from global memory, and so on until each thread has loaded 5 data elements. It can be seen that part of the data loaded in the second step has already been loaded by a neighbouring thread in the first step. This creates a problem of redundant data loading. To solve this problem, the invention proposes a column reuse algorithm, as shown in Fig. 2(b) and (c).
In the first step, each thread loads its first data from global memory. In the second step, each thread loads its last data. In the third step, each thread obtains the required data from its neighbour at an interval of 2 using a shuffle instruction, __shfl_xor_sync(0xffffffff, iTemp[i], 2). Thread 0 and thread 1 load their required third data from thread 2 and thread 3, respectively, and at the same time provide data to thread 2 and thread 3; likewise, thread 2 and thread 3 obtain their required third data from thread 0 and thread 1. In the fourth and fifth steps, each thread obtains the required data from its neighbours at an interval of 1, in a way similar to the third step.
Note that iTemp[i] in the shuffle instruction is a dynamically indexed array whose address the CUDA compiler cannot determine at compile time, so the variable is stored in local memory. However, the access latency of local memory is the same as that of global memory, which degrades the program's performance. Algorithm 1 and Algorithm 2 solve this dynamic indexing problem: Algorithm 1 handles the third step of Fig. 2(b), and Algorithm 2 handles the fourth and fifth steps of Fig. 2(c).
[Algorithm 1 and Algorithm 2 are pseudocode listings reproduced as images in the original publication.]
The following takes Algorithm 1 and Fig. 3 as an example to illustrate how the dynamic index is eliminated. Fig. 3 shows the process of lines 4-7 of Algorithm 1. Each thread first loads the first and last data it must process into registers (lines 2-3). The two 32-bit values are then packed into one 64-bit value and stored in the variable exchange (line 4), with iTemp[4] in the upper 32 bits and iTemp[0] in the lower 32 bits. For thread 2 and thread 3 in Fig. 3, both threads need to provide their first data, i.e. the lower 32 bits of the exchange variable, so the exchange variable in thread 2 and thread 3 is shifted right by 0 bits. After shifting, the exchange variable is split, with the upper 32 bits stored in iTemp[2] and the lower 32 bits stored in iTemp[1] (line 7). At this point, iTemp[1] holds exactly the data that each thread needs to provide. Finally, the shuffle instruction is used to exchange iTemp[1] between threads (line 8).
The third step in Fig. 2(b) and the fourth and fifth steps in Fig. 2(c) follow a similar process, so Algorithm 1 can be modified slightly to obtain Algorithm 2. In Fig. 2(c), adjacent threads need to exchange data, so in Algorithm 2 the shift amount of each thread (line 3) and the exchange pattern (line 6) need to be modified.
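To make this concrete, the following is a minimal CUDA sketch of the idea behind Algorithm 1 (the third-step exchange only). The scalars first and last stand in for iTemp[0] and iTemp[4], the 64-bit variable plays the role of exchange, and the lane-selection rule (lane & 2) generalizes the 4-thread example of Fig. 3; the kernel, its launch configuration and these names are illustrative assumptions, not the patented listing.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch of the Algorithm 1 idea for a 5 x 5 kernel: each lane packs its first
// and last loaded values into one 64-bit register, shifts by 0 or 32 bits
// depending on which half it must hand over, and exchanges the resulting low
// word with the lane at XOR-distance 2 via __shfl_xor_sync. Because the value
// to send is selected by a shift instead of by indexing a register array with
// a runtime value, nothing spills to local memory.
__global__ void column_reuse_step3(const float* in, float* out3rd) {
    int lane = threadIdx.x;               // one warp; lane == output column
    float first = in[lane];               // 1st value needed by this lane
    float last  = in[lane + 4];           // 5th value (kernel width 5, stride 1)

    // Pack: last value in the upper 32 bits, first value in the lower 32 bits.
    unsigned long long exchange =
        ((unsigned long long)__float_as_uint(last) << 32) |
        (unsigned long long)__float_as_uint(first);

    // Lanes with (lane & 2) == 0 must hand their last value to lane^2,
    // the other lanes must hand their first value: shift by 32 or 0 bits.
    int shift = ((lane & 2) == 0) ? 32 : 0;
    exchange >>= shift;

    // Unpack: the low word now holds exactly the value this lane must provide.
    unsigned int provide = (unsigned int)(exchange & 0xffffffffULL);

    // Compile-time constant lane offset: no dynamically indexed array needed.
    unsigned int got = __shfl_xor_sync(0xffffffffu, provide, 2);

    out3rd[lane] = __uint_as_float(got);  // 3rd value (column lane + 2)
}

int main() {
    const int N = 32 + 4;                 // enough input columns for one warp
    float h_in[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, N * sizeof(float));
    cudaMalloc((void**)&d_out, 32 * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    column_reuse_step3<<<1, 32>>>(d_in, d_out);

    float h_out[32];
    cudaMemcpy(h_out, d_out, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    for (int t = 0; t < 4; ++t)           // expect t + 2 for every lane t
        printf("lane %d got %.0f (expected %d)\n", t, h_out[t], t + 2);

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Algorithm 2 follows the same pattern for the fourth and fifth data, changing only the per-thread shift rule and using an XOR distance of 1.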
2. Row reuse algorithm
FIG. 4 illustrates the process of convolving the input with the convolution kernel along the height direction to generate one column of outputs, where row denotes a row of data. The calculation formula of the direct convolution algorithm is as follows:
out1 = row_i1 · row_f1 + row_i2 · row_f2 + row_i3 · row_f3
out2 = row_i2 · row_f1 + row_i3 · row_f2 + row_i4 · row_f3
out3 = row_i3 · row_f1 + row_i4 · row_f2 + row_i5 · row_f3
From the above equations, row_i2 and row_i4 are each loaded twice and row_i3 is loaded three times. To reduce this repeated loading, after each input row is loaded it is used to compute as many different outputs as possible. For example, row_i1 is used to compute out1, and row_i2 is used to compute out1 and out2. The computation is reorganized so that each input row needs to be loaded only once, as follows:
load row_i1: out1 = row_i1 · row_f1
load row_i2: out1 = out1 + row_i2 · row_f2
             out2 = row_i2 · row_f1
load row_i3: out1 = out1 + row_i3 · row_f3
             out2 = out2 + row_i3 · row_f2
             out3 = row_i3 · row_f1
load row_i4: out2 = out2 + row_i4 · row_f3
             out3 = out3 + row_i4 · row_f2
load row_i5: out3 = out3 + row_i5 · row_f3
as can be seen from the above formula, the input of each row needs to be loaded once. One common way of computational loading is shown in algorithm 3.
[Algorithm 3 is a pseudocode listing reproduced as an image in the original publication.]
In Algorithm 3, row represents a row of data loaded from the input, index is the row number of that row within the input, and filter represents the set of rows of the convolution kernel. Lines 1-5 of Algorithm 3 handle the first F_H - 1 rows of the input, which are required by fewer than F_H outputs. Some rows are required by exactly F_H outputs; these rows are processed in lines 6-11 of Algorithm 3. Lines 12-17 of Algorithm 3 process the last F_H - 1 rows of the input.
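The following is a compact sketch of the row reuse pattern that Algorithm 3 expresses, written as plain host-side code for one thread's output column: each input row is loaded once and immediately combined with every kernel row it pairs with. The function name, the fixed MAX_FW and MAX_OUTH bounds and the all-ones test in main are illustrative assumptions, not the patent's listing.

#include <cstdio>

const int MAX_FW   = 8;   // widest kernel handled by this sketch
const int MAX_OUTH = 16;  // tallest output column handled by this sketch

// Row reuse for one output column: every input row is read exactly once and is
// immediately multiplied with every kernel row it pairs with, so all outputs
// that the row contributes to are updated before the next row is loaded.
void row_reuse_column(const float* in, int H, int W,
                      const float* filt, int FH, int FW,
                      int col, float* out) {
    int outH = H - FH + 1;
    float sum[MAX_OUTH] = {0.0f};

    for (int r = 0; r < H; ++r) {
        // Load the FW values of input row r needed by this column once.
        float rowvals[MAX_FW];
        for (int fc = 0; fc < FW; ++fc) rowvals[fc] = in[r * W + col + fc];

        // Row r contributes to outputs r-FH+1 .. r (those that exist):
        // out[r - fr] accumulates rowvals dotted with kernel row fr.
        for (int fr = 0; fr < FH; ++fr) {
            int orow = r - fr;
            if (orow < 0 || orow >= outH) continue;
            float acc = 0.0f;
            for (int fc = 0; fc < FW; ++fc) acc += rowvals[fc] * filt[fr * FW + fc];
            sum[orow] += acc;
        }
    }
    for (int orow = 0; orow < outH; ++orow) out[orow] = sum[orow];
}

int main() {
    // 5 x 4 all-ones input and 3 x 3 all-ones kernel: every output equals 9.
    const int H = 5, W = 4, FH = 3, FW = 3;
    float in[H * W], filt[FH * FW], out[H - FH + 1];
    for (float& v : in) v = 1.0f;
    for (float& v : filt) v = 1.0f;

    row_reuse_column(in, H, W, filt, FH, FW, 0, out);
    for (int i = 0; i < H - FH + 1; ++i) printf("out[%d] = %.0f\n", i, out[i]);
    return 0;
}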
3. Final memory access optimization algorithm
Taking 2D convolution as an example, we explain how to apply the column reuse and row reuse algorithms to the convolution operation. In this example implementation, the output is first partitioned into sub-blocks, each containing 32 columns of data; the last sub-block may contain fewer than 32 columns. Each CUDA thread block processes one or more sub-blocks, and each warp processes one sub-block. The partitioning is shown in Fig. 5.
The edge data and the internal data are processed in different ways. In Fig. 5, the edge data is represented by shaded squares and the internal data by dashed squares. Assume that each warp contains 4 threads. It can be seen that the internal data is split into two sub-blocks: sub-block 0 contains 4 columns, which can be handled by exactly one warp; sub-block 1 contains only two columns, and in order to fully utilize the threads, these two columns are divided equally into 4 parts and distributed to the 4 threads. Algorithm 4 shows the overall algorithmic process.
[Algorithm 4 is a pseudocode listing reproduced as an image in the original publication.]
In Algorithm 4, the convolution kernel data is first loaded into shared memory (lines 1-2). The sub-blocks containing exactly 32 columns are processed in lines 3-13, and the last sub-block is processed in lines 14-17. Each thread computes one column of outputs as follows: first, each thread calculates the address of the first input data it needs (lines 4-6); then, each thread obtains the remaining required input data using Algorithm 1 and Algorithm 2 (line 8); each thread then passes the acquired input data to Algorithm 3, which computes a number of outputs and stores the results in the register array sum (line 9); finally, the sum array is written to global memory (line 12).
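Putting the pieces together, here is a simplified CUDA kernel that sketches the overall structure just described: the filter is staged in shared memory, the output is split into 32-column sub-blocks with one column per thread, each column is accumulated with the row reuse pattern into the register array sum, and sum is then written to global memory. The column-reuse shuffle exchange of Algorithms 1-2 and the special handling of edge data and of the last, narrower sub-block are deliberately omitted, and all names and size bounds are illustrative assumptions rather than the patent's Algorithm 4.

#include <cstdio>
#include <cuda_runtime.h>

const int MAX_FW = 8, MAX_OUTH = 16, MAX_FILT = 64;   // illustrative bounds

// Simplified kernel following the structure of Algorithm 4: stage the filter
// in shared memory, give each warp a 32-column sub-block of the output with
// one column per thread, accumulate that column with the row reuse pattern
// into the register array sum, and write sum to global memory at the end.
__global__ void conv2d_columns(const float* in, const float* filt, float* out,
                               int H, int W, int FH, int FW) {
    int outH = H - FH + 1, outW = W - FW + 1;

    __shared__ float sfilt[MAX_FILT];                  // filter staged in shared memory
    for (int i = threadIdx.x; i < FH * FW; i += blockDim.x) sfilt[i] = filt[i];
    __syncthreads();

    int col = blockIdx.x * 32 + threadIdx.x;           // one output column per thread
    if (col >= outW) return;                           // leftover columns skipped here

    float sum[MAX_OUTH] = {0.0f};
    for (int r = 0; r < H; ++r) {                      // each input row loaded once
        float rowvals[MAX_FW];
        for (int fc = 0; fc < FW; ++fc) rowvals[fc] = in[r * W + col + fc];
        for (int fr = 0; fr < FH; ++fr) {              // row reuse accumulation
            int orow = r - fr;
            if (orow < 0 || orow >= outH) continue;
            float acc = 0.0f;
            for (int fc = 0; fc < FW; ++fc) acc += rowvals[fc] * sfilt[fr * FW + fc];
            sum[orow] += acc;
        }
    }
    for (int orow = 0; orow < outH; ++orow)            // write sum to global memory
        out[orow * outW + col] = sum[orow];
}

int main() {
    const int H = 6, W = 11, FH = 5, FW = 5;           // the example of Fig. 1
    const int outH = H - FH + 1, outW = W - FW + 1;
    float h_in[H * W], h_filt[FH * FW], h_out[outH * outW];
    for (float& v : h_in) v = 1.0f;
    for (float& v : h_filt) v = 1.0f;

    float *d_in, *d_filt, *d_out;
    cudaMalloc((void**)&d_in, sizeof(h_in));
    cudaMalloc((void**)&d_filt, sizeof(h_filt));
    cudaMalloc((void**)&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    cudaMemcpy(d_filt, h_filt, sizeof(h_filt), cudaMemcpyHostToDevice);

    conv2d_columns<<<(outW + 31) / 32, 32>>>(d_in, d_filt, d_out, H, W, FH, FW);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[0][0] = %.0f (expected %d)\n", h_out[0], FH * FW);

    cudaFree(d_in); cudaFree(d_filt); cudaFree(d_out);
    return 0;
}

Launched on the 6 x 11 example of Fig. 1 with an all-ones 5 x 5 kernel, every output element should come out as 25.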
< embodiment >
Embodiments of the present invention are illustrated below based on the foregoing example.
For the purpose of memory access optimization, one embodiment of the present invention is shown in fig. 6, and includes:
s1: and loading the convolution kernel data into the shared memory.
S2: the convolution output is divided into subblocks in units of 32 columns, and a plurality of subblocks containing 32 columns of data and 1 subblock containing less than 32 columns of data are obtained. I.e. the actual division in fig. 5.
S3: n threads for processing the sub-blocks are set; each thread computes an index of the first data that the thread needs. The index of the first data is the first and left and right data required by each thread shown in fig. 2. Other required data can be obtained by the index operation of the first data.
S4: and each thread acquires the residual required input data from the index of the first data through a column reuse algorithm and transmits the acquired input data to a row reuse algorithm. The process of acquiring the remaining required input data is the process of acquiring data from the adjacent threads with the interval of 1 or 2 in algorithm 1 and algorithm 2.
One embodiment of the column reuse algorithm is: each thread loads the first and last data it requires from global memory; each thread obtains the required third data from the thread at an interval of 2; each thread obtains the required second and fourth data from the thread at an interval of 1. Each thread may obtain the required data from the thread at an interval of 1 or 2 via a CUDA shuffle instruction.
Further, in order to solve the dynamic indexing problem, the column reuse algorithm exchanges data in the following way: after each thread has loaded its required first and last data, the two values are packed into one 64-bit value and stored in a first variable array, with the required last data in the upper 32 bits and the required first data in the lower 32 bits; the threads that must provide their upper 32-bit data to other threads right-shift the corresponding variable by 32 bits, while the other threads shift by 0 bits; the resulting 64-bit variable array is then split, with the upper 32 bits used as the fourth data and the lower 32 bits used as the second data.
It should be noted that the above embodiment describes the case where the convolution kernel size is 5; those skilled in the art can derive the specific implementation for a convolution kernel size of 3. Likewise, the process for 2D convolution is similar to that for depth-wise convolution and multi-channel 2D convolution, and those skilled in the art can unambiguously determine the corresponding adaptations for depth-wise and multi-channel 2D convolution from the examples of the present invention.
S5: calculating an output result through a row reuse algorithm and storing the output result in register data sum; and writes sum to global memory.
One embodiment of the row reuse algorithm is: each time a row of the input is loaded, all outputs to which that row can contribute are computed (updated) using it. Which outputs a loaded row can contribute to is determined by the convolution calculation formula, such as the formulas used for Fig. 4. The skilled person can make a specific choice according to the particular picture and convolution kernel, combined with common knowledge.
S6: and calculating the rest data to be calculated in the convolution output. In fig. 5, the remaining data to be calculated includes edge data and unprocessed internal data.
< Experimental Effect >
The method was compared against 5 implementations of 2D convolution: cuDNN, GEMM-im2col, GEMM-im2row, ArrayFire and NPP. The experiments were performed on an NVIDIA GPU RTX2080 Ti.
① 2D convolution experiment
The experimental results of the 2D convolution are shown in Fig. 7, which gives the speed-up ratio of the method of the present invention relative to the other methods for different picture sizes and GPU hardware. It can be seen from the figure that cuDNN, im2col and im2row are not well suited to 2D convolution, while ArrayFire, NPP and the method of the present invention achieve very good results. Relative to cuDNN, im2col and im2row, the method of the invention achieves average acceleration ratios of 5.9, 5.9 and 5.8 times on both platforms. Figs. 7(a) and (b) show the results on the RTX2080 Ti, where the method of the present invention achieves average acceleration ratios of 3.1 and 1.3 times.
To verify that the method of the present invention actually reduces the number of memory accesses, nvprof was used to count the number of memory block transactions for each 2D convolution implementation in the test; the results are shown in Table 1.
TABLE 1 Number of memory block transactions
[Table 1 is reproduced as an image in the original publication.]
It can be seen from the table that the method greatly reduces the number of memory transactions, which yields the performance improvement.
② Depth-wise convolution experiment
The speed-up of the method of the present invention and of cuDNN relative to im2col is shown in Fig. 8. On the RTX2080 Ti, the method of the present invention achieves average acceleration ratios of 1.4 and 4 times on the 3 x 3 and 5 x 5 convolution kernels relative to the fastest cuDNN algorithm. The memory access throughput and maximum memory bandwidth of the method and of the fastest cuDNN algorithm are shown in Figs. 10(a) and 10(b); it can be seen that the method reaches the level of the fastest cuDNN algorithm on both memory access performance metrics.
③ Multi-channel 2D convolution experiment
To test the performance of multi-channel 2D convolution, different convolution configurations were extracted from common neural networks, the convolution depth was set to 1 and 3, and the batch size to 128. The speed-up of the method of the present invention relative to the other methods is shown in Fig. 9. The method of the invention achieves average acceleration ratios of 17.9 and 18.8 times with respect to im2col and im2row. On the RTX2080 Ti, the method of the invention achieves an average acceleration ratio of 1.2 times with respect to the fastest cuDNN algorithm.
The number of memory block transactions for the multi-channel 2D convolution was also counted, as shown in Table 2; only the results for a depth of 1 are shown. It can be seen that the method of the present invention achieves the smallest number of transactions.
TABLE 2 Number of memory block transactions for multi-channel 2D convolution
[Table 2 is reproduced as an image in the original publication.]
Although some specific embodiments of the present invention have been described in detail by way of illustration, it should be understood by those skilled in the art that the above illustration is only for the purpose of illustration and is not intended to limit the scope of the invention. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A convolution operation memory access optimization method based on a GPU is characterized by comprising the following steps:
loading the convolution kernel data into a shared memory;
dividing the convolution output into sub-blocks of 32 columns to obtain a plurality of sub-blocks containing 32 columns of data and 1 sub-block with fewer than 32 columns of data;
n threads for processing the sub-blocks are set; each thread calculates the index of the first data required by the thread;
each thread acquires the remaining required input data starting from the index of the first data through a column reuse algorithm and transmits the acquired input data to a row reuse algorithm;
calculating output results through the row reuse algorithm and storing the output results in a register array sum; writing sum into the global memory;
and calculating the remaining data to be calculated in the convolution output.
2. The GPU-based convolution operation memory access optimization method of claim 1, wherein a column reuse algorithm can be used for convolution kernels of any size.
3. The GPU-based convolution operation memory access optimization method according to claim 1 or 2, wherein the convolution operation is a 2D convolution, a depth-wise convolution or a multi-channel 2D convolution.
4. The GPU-based convolution operation memory access optimization method according to claim 3, wherein the process of the column reuse algorithm is as follows:
each thread loads the first data and the last data required by the thread from the global memory;
each thread acquires the required third data from the thread at an interval of 2;
each thread acquires the required second and fourth data from the thread at an interval of 1.
5. The GPU-based convolution operation memory access optimization method of claim 4, further comprising:
after each thread has finished loading its required first and last data, packing the two values into one 64-bit value and storing it into a first variable array; wherein the required last data is stored in the upper 32 bits and the required first data is stored in the lower 32 bits;
and right-shifting the variable values by 32 bits in those threads among all the threads that need to provide their upper 32-bit data to other threads, right-shifting by 0 bits in the other threads, and splitting the obtained 64-bit variable array, wherein the upper 32 bits are used as the fourth data and the lower 32 bits are used as the second data.
6. The GPU-based convolution operation memory access optimization method of claim 4, wherein a process of a row reuse algorithm is as follows:
each time a row of inputs is loaded, all outputs that can be calculated by the row are calculated using the row inputs.
7. The GPU-based convolution operation memory access optimization method of claim 4, wherein each thread acquires the required data from the thread at an interval of 1 or 2 through a CUDA shuffle instruction.
8. The GPU-based convolution operation memory access optimization method of claim 6, wherein all outputs that can be computed from the row are determined by a computation formula of a convolution algorithm.
9. The GPU-based convolution operation memory access optimization method of claim 1, wherein a column reuse algorithm can be used for convolution kernels with a size of 3 or 5.
10. The GPU-based convolution operation memory access optimization method of claim 1, wherein the remaining data comprises edge data and unprocessed internal data.
CN202010710031.1A 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU Active CN111797985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010710031.1A CN111797985B (en) 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710031.1A CN111797985B (en) 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU

Publications (2)

Publication Number Publication Date
CN111797985A 2020-10-20
CN111797985B CN111797985B (en) 2022-11-22

Family

ID=72827265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710031.1A Active CN111797985B (en) 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU

Country Status (1)

Country Link
CN (1) CN111797985B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091299A (en) * 2023-04-07 2023-05-09 南京砺算科技有限公司 Implicit GEMM convolution calculation method, device, equipment and medium based on GPU
CN116088773A (en) * 2023-04-11 2023-05-09 南京砺算科技有限公司 Data loading method, device, equipment and medium based on implicit GEMM convolution

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846235A (en) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 Convolution optimization method and system that a kind of utilization NVIDIA Kepler GPU assembly instructions accelerate
US20190079764A1 (en) * 2017-09-08 2019-03-14 Oracle International Corporation Efficient direct convolution using simd instructions
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
US20190187963A1 (en) * 2017-12-19 2019-06-20 Canon Kabushiki Kaisha Memory access optimisation using per-layer computational mapping and memory allocation for cnn application
US20190303762A1 (en) * 2018-03-30 2019-10-03 Xilinx, Inc. Methods of optimization of computational graphs of neural networks
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A kind of shared drive multiplexing method and device
CN110348574A (en) * 2019-07-17 2019-10-18 哈尔滨理工大学 A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN110458280A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of convolutional neural networks accelerated method and system suitable for mobile terminal
CN111160534A (en) * 2019-12-31 2020-05-15 中山大学 Binary neural network forward propagation frame suitable for mobile terminal
CN111178519A (en) * 2019-12-27 2020-05-19 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
US20200211262A1 (en) * 2018-12-28 2020-07-02 Intel Corporation Apparatus and method for ray tracing instruction processing and execution

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846235A (en) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 Convolution optimization method and system that a kind of utilization NVIDIA Kepler GPU assembly instructions accelerate
US20190079764A1 (en) * 2017-09-08 2019-03-14 Oracle International Corporation Efficient direct convolution using simd instructions
US20190187963A1 (en) * 2017-12-19 2019-06-20 Canon Kabushiki Kaisha Memory access optimisation using per-layer computational mapping and memory allocation for cnn application
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A kind of shared drive multiplexing method and device
US20190303762A1 (en) * 2018-03-30 2019-10-03 Xilinx, Inc. Methods of optimization of computational graphs of neural networks
US20200211262A1 (en) * 2018-12-28 2020-07-02 Intel Corporation Apparatus and method for ray tracing instruction processing and execution
CN110458280A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of convolutional neural networks accelerated method and system suitable for mobile terminal
CN110348574A (en) * 2019-07-17 2019-10-18 哈尔滨理工大学 A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN111178519A (en) * 2019-12-27 2020-05-19 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN111160534A (en) * 2019-12-31 2020-05-15 中山大学 Binary neural network forward propagation frame suitable for mobile terminal

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
GANGZHAO LU et al.: "Optimizing Depthwise Separable Convolution Operations on GPUs", 《IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS》, vol. 33, no. 01, 31 January 2022 (2022-01-31), pages 70 - 87 *
GOPALAKRISHNAN ELANGO et al.: "Convolutional Neural Network Acceleration on GPU by Exploiting Data Reuse", 《SAN JOSE STATE UNIVERSITY MASTER'S THESES》, 31 March 2017 (2017-03-31), pages 1 - 67 *
LEI LIU et al.: "A GPU-based acceleration method for two-dimensional discrete multi-resolution wavelet transform", 《Journal of Jilin University (Science Edition)》, vol. 53, no. 02, 26 March 2015 (2015-03-26), pages 267 - 272 *
JUNYANG ZHANG et al.: "Parallel computing method for two-dimensional matrix convolution", 《Journal of Zhejiang University (Engineering Science)》, vol. 52, no. 03, 15 March 2018 (2018-03-15), pages 515 - 523 *
KAIYU WANG et al.: "FPGA implementation and optimization of convolutional neural networks", 《Laboratory Science》, vol. 21, no. 04, 28 August 2018 (2018-08-28), pages 79 - 84 *
GENSHUAN XIE et al.: "Research on thread placement optimization strategies for CUDA programs", 《Intelligent Computer and Applications》, vol. 10, no. 02, 29 February 2020 (2020-02-29), pages 341 - 345 *
HONG ZOU et al.: "FPGA-based acceleration of CNN algorithms", 《Electronics World》, no. 2019, 28 February 2019 (2019-02-28), pages 82 - 83 *
PENG CHEN et al.: "An optimization method for FPGA convolutional neural network accelerators based on improved dynamic configuration", 《High Technology Letters》, vol. 30, no. 03, 15 March 2020 (2020-03-15), pages 240 - 247 *
BO HAN et al.: "GPGPU performance model and analysis of application examples", 《Journal of Computer-Aided Design & Computer Graphics》, vol. 21, no. 09, 15 September 2009 (2009-09-15), pages 1219 - 1226 *
LONGFEI MA: "Research on performance optimization of two-dimensional convolution computation on the CUDA GPU architecture", 《Electronics World》, no. 2018, 23 January 2018 (2018-01-23), pages 56 - 57 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091299A (en) * 2023-04-07 2023-05-09 南京砺算科技有限公司 Implicit GEMM convolution calculation method, device, equipment and medium based on GPU
CN116088773A (en) * 2023-04-11 2023-05-09 南京砺算科技有限公司 Data loading method, device, equipment and medium based on implicit GEMM convolution
CN116088773B (en) * 2023-04-11 2023-06-16 南京砺算科技有限公司 Data loading method, device, equipment and medium based on implicit GEMM convolution

Also Published As

Publication number Publication date
CN111797985B (en) 2022-11-22

Similar Documents

Publication Publication Date Title
EP3499428A1 (en) Method and electronic device for convolution calculation in neutral network
EP3499427A1 (en) Method and electronic device for convolution calculation in neutral network
US8539201B2 (en) Transposing array data on SIMD multi-core processor architectures
CN111797985B (en) Convolution operation memory access optimization method based on GPU
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
WO2019205617A1 (en) Calculation method and apparatus for matrix multiplication
CN110555516B (en) Method for realizing low-delay hardware accelerator of YOLOv2-tiny neural network based on FPGA
CN108897716B (en) Data processing device and method for reducing calculation amount through memory read-write operation
US20230068450A1 (en) Method and apparatus for processing sparse data
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
CN115390788A (en) Sparse matrix multiplication distribution system of graph convolution neural network based on FPGA
CN114970849A (en) Hardware accelerator multi-array parallel computing method and system
US20200364289A1 (en) Data processing method and apparatus
KR20230081697A (en) Method and apparatus for accelerating dilatational convolution calculation
CN113222129A (en) Convolution operation processing unit and system based on multi-level cache cyclic utilization
CN111667052A (en) Standard and nonstandard volume consistency transformation method for special neural network accelerator
CN110008436B (en) Fast Fourier transform method, system and storage medium based on data stream architecture
CN110580675A (en) Matrix storage and calculation method suitable for GPU hardware
Lu et al. Optimizing GPU memory transactions for convolution operations
CN108198128A (en) A kind of method and device of alpha channel boundary corrosions
CN115859011A (en) Matrix operation method, device and unit, and electronic equipment
CN115293978A (en) Convolution operation circuit and method, image processing apparatus
CN113705784A (en) Neural network weight coding method based on matrix sharing and hardware system
TWI773783B (en) Apparatus, method, integrated circuit, computer program, and computer-readable storage medium for register-based complex number processing
Honda et al. A warp-synchronous implementation for multiple-length multiplication on the GPU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant