CN105869117B - GPU acceleration method for deep learning super-resolution technology - Google Patents

Info

Publication number
CN105869117B
CN105869117B (application CN201610184129.1A)
Authority
CN
China
Prior art keywords
convolution
gpu
input image
memory
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610184129.1A
Other languages
Chinese (zh)
Other versions
CN105869117A (en)
Inventor
宋利
赵章宗
解蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201610184129.1A
Publication of CN105869117A
Application granted
Publication of CN105869117B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a GPU acceleration method for deep-learning super-resolution, which parallelizes every step of a super-resolution technique based on deep learning and a convolutional neural network and runs it entirely on the GPU. The parallelization divides the convolution of the super-resolution technique into millions of independent micro tasks that can be executed in parallel in any order, so that the massive parallel computing capability of the GPU is exploited. Furthermore, using the characteristics of the GPU memory hierarchy, the convolution kernel data and the input image data are cached in shared memory and registers, greatly accelerating the convolution computation; the convolution and nonlinear layers are fused; and the best optimization method is selected for each convolution size. The invention accelerates a high-quality super-resolution method to meet the speed requirement of video processing, without any loss of image quality.

Description

GPU acceleration method for deep learning super-resolution technology
Technical Field
The invention relates to the field of image super-resolution and GPU (Graphics Processing Unit) acceleration methods, and in particular to a GPU acceleration method for a deep-learning super-resolution technology.
Background
Image super-resolution converts a low-resolution image into a high-resolution image and has wide application in image post-processing and video non-linear editing. Early super-resolution methods (such as bicubic interpolation) are fast, reliable, and easy to integrate into chips; however, the high-resolution images they produce are of poor quality, with noticeable artifacts such as ringing, aliasing, and blurring. Super-resolution of such quality can hardly meet current demands for high-quality video. Today's best-performing super-resolution methods can generate high-quality images, but at a huge computational cost that makes them difficult to use in practical applications. There are some GPU-accelerated super-resolution methods that achieve sufficiently fast running speeds, but they sacrifice the quality of the result.
In recent years, deep learning technology has advanced greatly and markedly improved the accuracy of computer vision recognition; super-resolution techniques based on deep learning and convolutional neural networks have developed alongside it. The Super-Resolution Convolutional Neural Network (Chao Dong, Chen Change Loy, Kaiming He, Xiaoou Tang. Learning a Deep Convolutional Network for Image Super-Resolution. In Proceedings of the European Conference on Computer Vision (ECCV), 2014, pp. 184-199; SRCNN), published at the European Conference on Computer Vision in 2014, is among the best-performing methods. With a well-designed network of 3 convolution layers and 2 ReLU (nonlinear) layers, massive training data, and careful fine-tuning of training parameters, the SRCNN became the best-performing super-resolution method. However, it relies on a huge computational overhead: executing it on a CPU takes 300 seconds per frame (1920 × 1080 to 3840 × 2160, single channel; all tests below are at this resolution), and even with a GEMM-based GPU convolution acceleration method, each frame needs nearly 1 second, which cannot meet the needs of practical applications.
Disclosure of Invention
In order to enable the super-resolution technology based on deep learning and convolutional neural network to meet the requirements of practical application, the invention provides a GPU acceleration method aiming at the deep learning super-resolution technology.
In order to achieve the purpose, the GPU acceleration method aiming at the deep learning super-resolution technology provided by the invention parallelizes all steps of the super-resolution technology based on deep learning and convolutional neural network and runs in a GPU. The parallelization is to perform parallel task division on the convolution of the super-resolution technology based on deep learning and the convolutional neural network, and divide the convolution operation into millions of micro tasks which are not related and can be executed in parallel in any sequence, so that the ultra-strong parallel computing capability of the GPU is exerted.
Further, in the method: and performing task division according to the convolution output pixels, wherein the calculation task of each output pixel is distributed into one micro task, so that the convolution tasks can be executed in parallel on a large scale, and data depended by the calculation micro tasks of adjacent pixels are adjacent, thereby perfectly achieving merging access, and fully utilizing the video memory bit width and bandwidth of the GPU.
Further, in the method: shared memory is utilized as a cache for convolution kernel parameters, thereby reducing global memory I/O and speeding up convolution. Specifically, the convolution kernel parameters are first read by the concurrent thread block into the shared memory of the thread block, and then each thread retrieves the required convolution kernel parameters from the shared memory. The method can reduce the global memory throughput required by the GPU for reading the convolution kernel parameters, thereby greatly optimizing and accelerating the execution speed of convolution.
Further, in the method: shared memory or registers are utilized as a cache for the input image, thereby reducing global memory I/O and speeding up convolution. Specifically, an input image area on which a concurrent thread block depends is found out firstly, then the thread block reads the area data into a shared memory of the thread block, and finally each thread acquires the required input image data from the shared memory; or if the required input image data of each thread is small enough, the required data is directly read into the register in the thread once, and then calculation is carried out. The method can reduce the global memory throughput required by the GPU for reading the input image, thereby greatly optimizing and accelerating the execution speed of convolution.
Further, in the method: the method adopts a deep neural network GPU acceleration technology, combines convolution operation and nonlinear operation, and can reduce the global memory throughput required by convolution and nonlinear layers, thereby accelerating the execution speed of the whole process. Specifically, the deep neural network GPU acceleration technology refers to: the processing process of the nonlinear layer is fused in the convolution calculation, and the nonlinear layer calculation is carried out in the register immediately after the convolution calculation is finished, so that the I/O of a wheel to the global memory is omitted.
Further, in the method: and selecting an optimal optimization method aiming at different convolution sizes by adopting a deep convolution network GPU acceleration technology. The deep convolutional network GPU acceleration technology is as follows: and testing each optimized acceleration technology for the convolution layers with different sizes, and further selecting the fastest acceleration technology to obtain the method with the highest overall operation speed as possible.
Compared with the prior art, the invention has the following remarkable advantages:
the invention carries out parallelization and optimization acceleration aiming at convolution, accelerates a high-quality super-resolution method to meet the speed requirement of video processing, and does not bring any image quality loss.
Further, task division is carried out according to the output pixels, so that parallelization of convolution is realized; all steps of the super-resolution technology based on deep learning and convolutional neural networks are parallelized, so that the ultra-strong parallel computing capability of a GPU can be utilized;
furthermore, the convolution kernel data and the input image data are cached to a shared memory and a register by utilizing the characteristics of a GPU memory, so that the calculation speed of convolution is greatly optimized;
furthermore, convolution and non-linear layers are fused, and the optimal optimization is selected according to different convolution sizes.
The method fully utilizes the hardware and storage characteristics of the GPU, greatly accelerates the calculation speed of convolution, and therefore the super-resolution method with high quality can run at the speed meeting the actual working requirement.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow diagram of SRCNN;
FIG. 2 is a diagram of convolution parallelization in a preferred embodiment of the present invention;
FIG. 3 is a schematic illustration of the improvement from caching convolution kernel parameters in shared memory in a preferred embodiment of the present invention;
FIG. 4 is a schematic illustration of the improvement from caching input image tiles in shared memory in a preferred embodiment of the present invention;
FIG. 5 is a schematic illustration of the improvement by fusing convolution and non-linear calculations in a preferred embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
As shown in FIG. 1, a flow diagram of the SRCNN. As a preferred embodiment of the present invention, the super-resolution GPU acceleration technique of the present invention targets the SRCNN, whose pipeline (FIG. 1) comprises bicubic preprocessing (not shown), three convolution layers, and two ReLU layers (following the first and second convolutions respectively). The three convolution layer sizes are (output channels × kernel width × kernel height × input channels): 64 × 9 × 9 × 1, 32 × 1 × 1 × 64, 1 × 5 × 5 × 32. For one 1080P-to-4K image super-resolution, 66.6 G floating-point multiply-add operations and 800 GBytes of storage I/O are required. Obviously, such a computational load cannot meet the time requirements of actual work and production when run on a CPU. Therefore, the method uses the GPU: each step of the SRCNN pipeline is parallelized and implemented on the GPU, and the hardware characteristics of the GPU are fully exploited for optimization and acceleration.
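The 66.6 G figure can be checked directly from the layer shapes (a sketch assuming each layer produces a full-resolution output, i.e. padded convolutions, as the SRCNN does):

```python
# Multiply-add count for the SRCNN at 3840 x 2160, single channel.
# Layer shapes (out_ch, k_w, k_h, in_ch) as listed in the description.
layers = [(64, 9, 9, 1), (32, 1, 1, 64), (1, 5, 5, 32)]

pixels = 3840 * 2160                  # output pixels per (padded) layer
per_pixel = sum(oc * kw * kh * ic for oc, kw, kh, ic in layers)
total_macs = per_pixel * pixels       # ~66.6e9 multiply-adds
print(per_pixel, total_macs / 1e9)
```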
The method focuses on parallelizing and optimizing the convolution, because the bicubic preprocessing is computationally cheap and easy to implement on the GPU, the parallelization of the nonlinear ReLU layer is straightforward, and more than 95% of the time cost lies in the convolution.
To understand how the SRCNN method is adapted to the GPU and how the GPU parallel program is optimized, the architecture of the GPU is first introduced. Constrained by physical factors, processor clock frequencies have not increased substantially for years; the computer industry has instead grown computing capacity by increasing core counts, with typical products being the multi-core Central Processing Unit (CPU) and the many-core Graphics Processing Unit (GPU). A GPU has thousands of computing units and very high-bandwidth video memory; for example, the Nvidia GTX980TI has 2816 CUDA cores and a global memory bandwidth of 336 GB/s. If a large computing task is divided into tens of thousands or even millions of micro tasks and handed to the GPU, the GPU dispatches the micro tasks to the CUDA cores, which process them concurrently and efficiently, so that the GPU can execute hundreds of times faster than the CPU.
The GPU has a hierarchical storage mechanism; the present invention uses the GPU's global memory, shared memory, and registers. These three kinds of storage differ greatly in access bandwidth, latency, capacity, and which units can access them. Global memory is accessible to all threads and has large capacity (several GB), but the lowest access bandwidth, and is often the bottleneck of the whole pipeline. Shared memory is a programmer-controlled cache: the threads on the GPU are grouped into thread blocks, each block having a certain number of threads and its own shared memory; the shared memory is accessible to all threads within the block and offers high access bandwidth and low latency. Registers reside inside each thread and have the highest bandwidth and lowest latency but small capacity; keeping repeatedly used data in registers greatly reduces memory access overhead.
In the GPU acceleration SRCNN technology described in the invention, firstly, input image data is transferred from a memory to a video memory for bicubic preprocessing, then, a first layer of convolution (conv1), relu, a second layer of convolution (conv2), relu and a third layer of convolution (conv3) are sequentially carried out, and then, the data is transferred from the video memory to the memory. When each layer of convolution is carried out, parallelization of task division is carried out according to output pixels, so that the ultra-strong parallel computing capacity of the GPU can be utilized; in order to further accelerate the calculation speed of convolution, the invention provides that the shared memory is used for caching convolution kernel data, the shared memory or the register is used for caching input image block data, and convolution and nonlinear operation are fused; furthermore, the invention tests the execution speeds of different convolution methods aiming at the convolutions with different sizes, and selects the combination with the fastest speed to ensure that the whole process is as fast as possible. The key technical details of the invention are as follows.
In a preferred embodiment, to parallelize the convolution, the present invention divides the convolution task into millions of micro tasks by output pixel, called the direct convolution computation method (FIG. 2). The whole convolution consists of computing the value of each output pixel, so the computation of each output pixel can be dispatched to one GPU thread as an independent micro task; these micro tasks are independent, can run concurrently, and need no communication with each other. Another advantage of this division is that the input image data accessed by adjacent concurrently executing threads are also adjacent: for example, while thread (x, y) accesses I(a, b), thread (x+1, y) accesses I(a+1, b), so the access requests are automatically merged into one access by the GPU hardware, fully utilizing the video memory bit width and bandwidth of the GPU. The remaining part of the SRCNN (ReLU) is also parallelized, so that the whole SRCNN pipeline executes on the GPU and repeated data transfers between the CPU and the GPU are avoided. Through this parallelization of the convolution, the execution speed of the SRCNN is accelerated from 300 seconds/frame (using the CPU) to 1 second/frame.
By using a GPU hierarchical storage mechanism, the invention caches convolution kernel data and input image data into a shared memory or register, thereby accelerating the convolution by 2 to 10 times.
In a preferred embodiment, the present invention saves the global memory I/O overhead of redundant convolution kernel reads by pre-reading the convolution kernel data into shared memory, called the shared convolution kernel data method (shared kernel), as shown in FIG. 3. In the direct convolution computation method above, every thread reads the identical convolution kernel data, and this redundant repeated reading wastes a large amount of global memory I/O. In the shared convolution kernel data method, a thread block first pre-reads the convolution kernel data into shared memory, and then all threads in the block fetch the kernel data they need from shared memory. The shared memory thus acts as a cache of the convolution kernel data, saving the global memory I/O overhead of reading it repeatedly.
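The saving can be quantified with a short sketch (hypothetical helper name; one thread per output pixel, as in the direct method):

```python
def kernel_param_reads(out_pixels, params, threads_per_block):
    """Global-memory reads of convolution kernel parameters.

    naive:  every thread (one per output pixel) reads every parameter.
    shared: each thread block stages the parameters in shared memory
            once, so only one block-wide read per parameter is needed.
    """
    naive = out_pixels * params
    blocks = -(-out_pixels // threads_per_block)   # ceiling division
    shared = blocks * params
    return naive, shared
```

For conv1 (64 × 9 × 9 × 1 = 5184 parameters) at 3840 × 2160 output pixels with 256-thread blocks, global memory reads of kernel parameters drop by a factor of 256, i.e. the block size.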
In a preferred embodiment, the present invention saves the global memory I/O overhead of redundant input image reads by pre-reading input image block data into shared memory or registers, called the shared input image block method (shared patch) or the register-cached input image block method (registered pixel), as shown in FIG. 4. When performing a convolution whose width or height is greater than 1, adjacent output pixels depend on mutually overlapping input image blocks. The direct convolution computation method ignores this overlap, so input image data is redundantly read by each thread, again wasting global memory I/O; when the convolution width and height are large, this waste becomes very severe. In the shared input image block method and the register-cached input image block method, the input image region on which one thread block depends is first determined; that region is then read into shared memory or registers (register caching is feasible only when the region is small enough to fit in registers), and each thread fetches the input image data it needs from there. The shared memory or registers thus act as a cache of the input image data, saving the global memory I/O overhead of reading it repeatedly.
In a preferred embodiment, the present invention eliminates the I/O overhead of the nonlinear layer by fusing the convolution and nonlinear layers, as shown in FIG. 5. Conventional acceleration of convolutional neural networks focuses on the convolution, because convolution is the bottleneck of computation time and because the nonlinear layer offers little to accelerate on its own. However, once the convolution has been accelerated to this degree, the time overhead of the nonlinear layer is no longer negligible. In convolutional neural networks the nonlinear layer always follows the convolution layer, and each of its output pixels depends only on the corresponding input pixel. Therefore, the invention merges the convolution and the nonlinear layer into one pass: immediately after computing its convolution output, each thread applies the nonlinear operation to the value and only then writes it back to global memory. This eliminates the overhead of the convolution layer writing back to global memory and the nonlinear layer reading from it, which almost completely eliminates the computation time of the nonlinear layer.
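The fusion can be illustrated in serial numpy form (a sketch, not GPU code: the unfused version makes a second full-image pass for ReLU, while the fused version applies it while the value is still local to the computation; on the GPU "local" means a register):

```python
import numpy as np

def conv_then_relu(image, kernel):
    """Unfused: the convolution writes a full intermediate image to
    memory, then a second pass reads it back to apply ReLU."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    conv = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            conv[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return np.maximum(conv, 0.0)        # second full-image pass

def conv_fused_relu(image, kernel):
    """Fused: ReLU is applied to each output value immediately after
    it is computed, before the single write-back."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            v = np.sum(image[y:y + kh, x:x + kw] * kernel)
            out[y, x] = v if v > 0.0 else 0.0   # nonlinearity "in register"
    return out
```

Both functions produce identical results; only the intermediate write and read of the full image are removed by the fusion.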
In a preferred embodiment, the present invention times each optimized acceleration technique on each convolution layer size and selects the fastest one, so that the overall pipeline runs as fast as possible; the timing results are shown in Table 1. Here cuDNN is the convolution operator library provided by Nvidia. As can be seen, when the first convolution layer uses cuDNN, the second uses shared convolution kernel parameters with register-cached input image blocks, and the third uses shared convolution kernel parameters with shared input image blocks, the whole pipeline is fastest, finally reaching 0.15 second/frame, 2000 times the CPU speed.
TABLE 1 run time of each convolutional layer using each optimized acceleration method
(Table 1 is reproduced as an image in the original publication and is not included here.)
In the above table: single-channel super-resolution from 1920 × 1080 to 3840 × 2160 was tested, using an Nvidia GTX980TI and a dual Intel E5-2697 V2 @ 2.7 GHz 12-core processor.
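The per-layer selection amounts to an argmin over measured times. A sketch with illustrative placeholder timings (these are not the measured values of Table 1, though they are chosen so that the winning combination and the 0.15 second/frame total match the text):

```python
# Per-layer selection of the fastest measured technique.
# Times in seconds per frame: hypothetical placeholders, not Table 1.
timings = {
    "conv1": {"cuDNN": 0.050, "shared kernel": 0.080,
              "shared kernel + registered pixel": 0.070},
    "conv2": {"cuDNN": 0.060, "shared kernel": 0.045,
              "shared kernel + registered pixel": 0.040},
    "conv3": {"cuDNN": 0.090, "shared kernel": 0.070,
              "shared kernel + shared patch": 0.060},
}

# Pick, for each layer, the method with the smallest measured time.
best = {layer: min(methods, key=methods.get) for layer, methods in timings.items()}
total = sum(min(methods.values()) for methods in timings.values())
print(best, round(total, 3))
```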
In conclusion, the invention accelerates a high-quality super-resolution method to meet the speed requirement of video processing without any image quality loss.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (3)

1. A GPU acceleration method aiming at a deep learning super-resolution technology is characterized by comprising the following steps: all steps of a super-resolution technology based on deep learning and a convolutional neural network are parallelized and run in a GPU; the parallelization is to perform parallel task division on the convolution of the super-resolution technology based on deep learning and a convolutional neural network, and divide the convolution operation into millions of micro tasks which are not related and can be executed in parallel in any sequence, so that the ultra-strong parallel computing capability of the GPU is exerted;
firstly, transferring input image data from a memory to a video memory, performing bicubic preprocessing, then sequentially performing a first layer of convolution conv1, relu, a second layer of convolution conv2, relu and a third layer of convolution conv3, and then transferring the data from the video memory to the memory; when each layer of convolution is carried out, parallelization of task division is carried out according to output pixels, so that the ultra-strong parallel computing capacity of the GPU can be utilized; in order to further accelerate the calculation speed of convolution, the shared memory is used for caching convolution kernel data, the shared memory or the register is used for caching input image block data, and convolution and nonlinear operation are fused;
the method comprises the following steps: performing task division according to convolution output pixels, wherein a calculation task of each output pixel is distributed into a micro task, so that the convolution tasks can be executed in parallel on a large scale, and data depended by the calculation micro tasks of adjacent pixels are adjacent, thereby perfectly achieving merging access, and fully utilizing the video memory bit width and bandwidth of the GPU; specifically, the convolution task is divided into millions of micro tasks according to output pixels, which is called a convolution direct calculation method, the value of each output pixel is calculated by the whole convolution, so that the calculation task of each output pixel can be distributed to a GPU thread to be executed as an independent micro task, the micro tasks are independent, can be executed concurrently and do not need to communicate with each other, input image data accessed by adjacent concurrently executed threads are also adjacent, when the threads (x, y) access to I (a, b), the threads (x +1, y) access to I (a +1, b), and the access requests are automatically merged into one access by GPU hardware, so that the display memory bit width and bandwidth of the GPU are fully utilized; the rest relu of the SRCNN is also parallelized, so that the whole SRCNN process can be executed on a GPU, and repeated data transfer between the CPU and the GPU is avoided;
the method comprises the following steps: the shared memory is used as a cache of convolution kernel parameters, so that the global memory I/O is reduced and the convolution is accelerated; specifically, by pre-reading the convolution kernel data into the shared memory, the redundant global memory I/O overhead of the convolution kernel data can be saved, which is called a shared convolution kernel data method; in the direct convolution calculation method, each thread reads completely same convolution kernel data, and redundant repeated reading generates a large amount of global memory I/O waste; in the method for sharing the convolution kernel data, a thread block reads the convolution kernel data into a shared memory in advance, then all threads in the thread block acquire the needed convolution kernel data from the shared memory, and the shared memory is a high-speed cache of the convolution kernel data, so that the global memory I/O overhead for reading a large amount of convolution kernel data is saved;
the method comprises the following steps: a shared memory or a register is used as a high-speed cache of an input image, so that the global memory I/O is reduced and the convolution is accelerated; by pre-reading the input image block data into the shared memory or the register, the redundant input image data global memory I/O overhead can be saved, which is called as a shared input image block method or a register-cached input image block method; when the convolution with the width or height larger than 1 is carried out, adjacent output pixels depend on the mutually overlapped input image blocks, the direct convolution calculation method does not consider the overlapping relation, so that input image data are redundantly read into each thread, and the waste of global memory I/O is also brought, and when the convolution with the width or height larger than 1, the waste of the I/O becomes very serious; in the input image block method sharing the input image block method and the register cache, the using the shared memory or the register as the cache of the input image means: firstly, finding out an input image area on which a concurrent thread block depends, reading the area data into a shared memory of the thread block by the thread block, and finally acquiring required input image data from the shared memory by each thread; or if the image data required to be input by each thread is small enough, the required data is directly read into the register in the thread once, and then calculation is carried out, and the shared memory or the register is the cache of the input image data, so that the global memory I/O expense for reading a large amount of input image data is saved;
the method adopts a deep neural network GPU acceleration technology, combines convolution operation and nonlinear operation, reduces the global memory throughput required by convolution and nonlinear layers, and accordingly accelerates the execution speed of the whole process; the convolution and the nonlinear layer are integrated into a process, after the convolution is executed, each thread immediately carries out nonlinear operation on an output value and then writes back the output value to the global memory, the overhead of writing back the convolution layer to the global memory and reading the global memory by the nonlinear layer is avoided, and the process is equivalent to almost completely eliminating the calculation time of the nonlinear layer.
2. The GPU acceleration method for the deep learning super-resolution technique of claim 1, characterized in that: the method comprises the following steps: and selecting an optimal optimization method aiming at different convolution sizes by adopting a deep convolution network GPU acceleration technology.
3. The GPU acceleration method for the deep learning super-resolution technique of claim 2, characterized in that: the deep convolutional network GPU acceleration technology is as follows: and testing each optimized acceleration technology for the convolution layers with different sizes, and further selecting the fastest acceleration technology to obtain the method with the highest overall operation speed as possible.
CN201610184129.1A 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology Expired - Fee Related CN105869117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610184129.1A CN105869117B (en) 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610184129.1A CN105869117B (en) 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology

Publications (2)

Publication Number Publication Date
CN105869117A CN105869117A (en) 2016-08-17
CN105869117B true CN105869117B (en) 2021-04-02

Family

ID=56626131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610184129.1A Expired - Fee Related CN105869117B (en) 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology

Country Status (1)

Country Link
CN (1) CN105869117B (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106447609A (en) * 2016-08-30 2017-02-22 上海交通大学 Image super-resolution method based on depth convolutional neural network
CN106529679B (en) * 2016-10-14 2020-01-14 腾讯科技(上海)有限公司 Machine learning method and system
CN106779057B (en) * 2016-11-11 2020-04-17 北京旷视科技有限公司 Method and device for calculating binary neural network convolution based on GPU
CN108073548B (en) * 2016-11-14 2021-09-10 耐能股份有限公司 Convolution operation device and convolution operation method
TWI634490B (en) * 2016-11-14 2018-09-01 美商耐能股份有限公司 Convolution operation device and convolution operation method
CN108268931B (en) * 2016-12-30 2022-10-25 华为技术有限公司 Data processing method, device and system
US10586148B2 (en) * 2016-12-31 2020-03-10 Via Alliance Semiconductor Co., Ltd. Neural network unit with re-shapeable memory
CN107085827B (en) * 2017-04-27 2020-06-16 中国电子科技集团公司第二十八研究所 Super-resolution image restoration method based on hardware platform
CN108229645B (en) * 2017-04-28 2021-08-06 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing method and device, electronic equipment and storage medium
CN107515736B (en) * 2017-07-01 2021-01-15 广州深域信息科技有限公司 Method for accelerating computation speed of deep convolutional network on embedded equipment
CN107341127B (en) * 2017-07-05 2020-04-14 西安电子科技大学 Convolutional neural network acceleration method based on OpenCL standard
CN107895191B (en) 2017-10-30 2022-02-22 上海寒武纪信息科技有限公司 Information processing method and related product
CN108012156B (en) * 2017-11-17 2020-09-25 深圳市华尊科技股份有限公司 Video processing method and control platform
CN108052891A (en) * 2017-12-08 2018-05-18 触景无限科技(北京)有限公司 Facial contour parallel calculating method and device
CN108062532B (en) * 2017-12-28 2020-07-28 智慧眼科技股份有限公司 Deep learning face recognition network optimization method, device and storage medium
CN110321998B (en) * 2018-03-31 2022-06-14 赛灵思公司 Convolutional neural network implementation method and device, acceleration equipment and storage medium
CN108564524A (en) * 2018-04-24 2018-09-21 开放智能机器(上海)有限公司 A kind of convolutional calculation optimization method of visual pattern
CN110633785B (en) * 2018-06-21 2021-01-05 清华大学 Method and system for calculating convolutional neural network
CN109165723B (en) * 2018-08-03 2021-03-19 北京字节跳动网络技术有限公司 Method and apparatus for processing data
KR20200025200A (en) * 2018-08-29 2020-03-10 삼성전자주식회사 Electronic devices and methods of operating electronic devices
US10497258B1 (en) 2018-09-10 2019-12-03 Sony Corporation Vehicle tracking and license plate recognition based on group of pictures (GOP) structure
US11996105B2 (en) 2018-09-13 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109409513B (en) * 2018-10-10 2021-03-12 广州市百果园信息技术有限公司 Task processing method based on neural network and related equipment
CN109740731B (en) * 2018-12-15 2023-07-18 华南理工大学 Design method of self-adaptive convolution layer hardware accelerator
CN111461296B (en) * 2018-12-29 2023-09-22 中科寒武纪科技股份有限公司 Data processing method, electronic device, and readable storage medium
CN109886407B (en) * 2019-02-27 2021-10-22 上海商汤智能科技有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN110032538B (en) * 2019-03-06 2020-10-02 上海熠知电子科技有限公司 Data reading system and method
CN110009644B (en) * 2019-03-26 2021-02-23 深兰科技(上海)有限公司 Method and device for segmenting line pixels of feature map
CN110188863B (en) * 2019-04-30 2021-04-09 杭州电子科技大学 Convolution kernel compression method of convolution neural network suitable for resource-limited equipment
CN111914985B (en) * 2019-05-10 2023-07-04 杭州海康威视数字技术股份有限公司 Configuration method, device and storage medium of deep learning network model
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN110659119B (en) * 2019-09-12 2022-08-02 浪潮电子信息产业股份有限公司 Picture processing method, device and system
KR102624027B1 (en) * 2019-10-17 2024-01-11 삼성전자주식회사 Image processing apparatus and method
CN112633470B (en) * 2020-12-11 2023-01-06 苏州浪潮智能科技有限公司 Method, system, device and medium for optimizing neural network convolution residual structure
CN113286174B (en) * 2021-05-21 2022-11-08 浙江商汤科技开发有限公司 Video frame extraction method and device, electronic equipment and computer readable storage medium
CN113806044B (en) * 2021-08-31 2023-11-07 天津大学 Heterogeneous platform task bottleneck eliminating method for computer vision application
CN114445687B (en) * 2021-12-31 2024-01-19 苏州浪潮智能科技有限公司 Image recognition reasoning method, system, storage medium and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9466102B2 (en) * 2012-09-26 2016-10-11 Siemens Corporation Multi-GPU FISTA implementation for MR reconstruction with non-uniform K-space sampling
CN104778659A (en) * 2015-04-15 2015-07-15 杭州电子科技大学 Single-frame image super-resolution reconstruction method on basis of deep learning
CN105279741A (en) * 2015-11-17 2016-01-27 集美大学 Image super-resolution reconstruction method and system based on graph-cut algorithm

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"A Speed Optimization of the cuda-convnet Deep Convolutional Neural Network Algorithm"; Li Daxia; China Master's Theses Full-text Database, Information Science & Technology (Monthly); 2016-03-15 (No. 03); Section 4.5 *
"GPU-Based Parallel Optimization of HOTPANTS"; Li Jiajun et al.; Astronomical Research & Technology; April 2014; Vol. 11, No. 2; pp. 184-191 *
"Research on a GPU-Based Surface Topography Measurement System"; Jin Lu; China Master's Theses Full-text Database, Information Science & Technology (Monthly); 2011-08-15 (No. 08); Section 1.3.2 *
"A Parallel Template-Matching Target Recognition Algorithm for CPU+GPU Heterogeneous Platforms"; Ma Yongjun et al.; Journal of Tianjin University of Science and Technology; August 2014; Vol. 29, No. 4; pp. 48-52 *

Also Published As

Publication number Publication date
CN105869117A (en) 2016-08-17

Similar Documents

Publication Publication Date Title
CN105869117B (en) GPU acceleration method for deep learning super-resolution technology
DE102018132069A1 (en) Equivariant landmark transformation for landmark localization
US20230229931A1 (en) Neural processing apparatus and method with neural network pool processing
DE102018133555A1 (en) Computational optimization mechanism for deep neural networks
Du et al. Anchor-based plain net for mobile image super-resolution
US7734442B2 (en) Apparatus and method for a test and measurement instrument
DE102020101814A1 (en) EFFICIENT EXECUTION BASED ON TASK GRAPHS OF DEFINED WORKLOADS
CN111028360B (en) Data reading and writing method and system in 3D image processing, storage medium and terminal
DE102020112826A1 (en) PROCESS FOR EFFICIENT PERFORMANCE OF DATA REDUCTION IN PARALLEL PROCESSING UNITS
DE102019103319A1 (en) STOCHASTIC ROUNDING OF NUMBER VALUES
US9460489B2 (en) Image processing apparatus and image processing method for performing pixel alignment
Rahman et al. Parallel implementation of a spatio-temporal visual saliency model
Zhao et al. GPU accelerated high-quality video/image super-resolution
Cadenas et al. Parallel pipelined array architectures for real-time histogram computation in consumer devices
US10353591B2 (en) Fused shader programs
KR102064581B1 (en) Apparatus and Method for Interpolating Image Autoregressive
DE102020134345A1 (en) TECHNOLOGY FOR LEARNING AND DOWNLOADING FREQUENT MEMORY ACCESS AND CALCULATION PATTERNS
CN110289861A (en) The half precision compressed sensing method of sampling
CN113344765B (en) Frequency domain astronomical image target detection method and system
Guo et al. A novel lightweight multi-dimension feature fusion network for single-image super-resolution reconstruction
CN202093573U (en) Parallel acceleration device used in industrial CT image reconstruction
KR101672539B1 (en) Graphics processing unit and caching method thereof
DE102022112459A1 (en) TECHNIQUES FOR EFFICIENTLY SYNCHRONIZING MULTIPLE PROGRAM THREADS
Fu et al. A CPU-GPU data transfer optimization approach based on code migration and merging
Oo The Improvement of 1D Gaussian Blur Filter using AVX and OpenMP

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210402
