CN105869117A - GPU acceleration method for deep learning super-resolution technology - Google Patents

GPU acceleration method for deep learning super-resolution technology

Info

Publication number
CN105869117A
Authority
CN
China
Prior art keywords
gpu
convolution
super-resolution
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610184129.1A
Other languages
Chinese (zh)
Other versions
CN105869117B (en)
Inventor
宋利
赵章宗
解蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201610184129.1A
Publication of CN105869117A
Application granted
Publication of CN105869117B
Expired - Fee Related
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4053Super resolution, i.e. output image resolution higher than sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The invention discloses a GPU acceleration method for deep learning super-resolution technology. The method parallelizes every step of a super-resolution technique based on deep learning and convolutional neural networks and runs it on a GPU. Parallelization here means partitioning the convolutions of the technique into millions of mutually independent micro-tasks that can be executed concurrently in any order, so that the massive computing power of the GPU is fully exploited. Further, the method uses the characteristics of the GPU memory hierarchy to cache the convolution-kernel data and the input image data in shared memory and registers, which greatly speeds up the convolutions. The method also fuses the convolution and nonlinear layers, and selects the best-performing implementation for each convolution size. The invention thereby accelerates a high-quality super-resolution method to the speed required for video processing without any loss of image quality.

Description

A GPU acceleration method for deep learning super-resolution technology
Technical field
The present invention relates to the field of image super-resolution and to GPU acceleration methods, and more specifically to a GPU acceleration method for deep-learning-based super-resolution.
Background art
Image super-resolution converts a low-resolution image into a high-resolution one, and is widely used in image post-processing and video non-linear editing. Early super-resolution methods (such as bicubic) are usually based on simple interpolation; they work fast and reliably and are easy to integrate into chips, but the high-resolution images they produce are of poor quality, with noticeable artifacts such as ringing, aliasing and blur. Such low-quality methods can hardly meet today's requirements for high-quality video. The best-performing current super-resolution methods can generate high-quality images, but at an enormous computational cost that makes them impractical. Some GPU-accelerated super-resolution methods also exist; they reach sufficiently fast running speeds, but sacrifice output quality.
Deep learning has made enormous progress in recent years, with clear gains in computer-vision recognition accuracy, and super-resolution techniques based on deep learning and convolutional neural networks have emerged accordingly. The convolutional-neural-network super-resolution method published at the European Conference on Computer Vision in 2014 (Chao Dong, Chen Change Loy, Kaiming He, Xiaoou Tang. Learning a Deep Convolutional Network for Image Super-Resolution, in Proceedings of European Conference on Computer Vision (ECCV), 2014, pp. 184-199; hereinafter SRCNN) is one of the best-performing methods. With a carefully designed network of 3 convolutional layers and 2 ReLU (nonlinear) layers, massive training data and meticulous fine-tuning of the parameters, SRCNN once achieved the best super-resolution performance. However, the method carries a huge computational cost: executing it on a CPU takes 300 seconds per frame (1920*1080 to 3840*2160, single channel; all tests below use this resolution), and even with GEMM-based GPU convolution acceleration each frame still needs close to 1 second, which cannot meet the needs of practical applications.
Summary of the invention
To meet the needs of practical applications, the present invention provides a GPU acceleration method for deep learning super-resolution technology, built on super-resolution techniques based on deep learning and convolutional neural networks.
To achieve the above object, in the GPU acceleration method for deep learning super-resolution of the present invention, every step of the super-resolution technique based on deep learning and convolutional neural networks is parallelized and run on a GPU. Parallelization here means dividing the convolutions of the technique into parallel tasks: each convolution operation is split into millions of mutually independent micro-tasks that can be executed concurrently in any order, so that the massive computing power of the GPU is brought into play.
Further, in the method, tasks are divided according to the convolution output pixels: the computation of each output pixel is assigned to one micro-task, so the convolution can be executed in parallel on a large scale. Moreover, the data on which the micro-tasks of neighboring pixels depend are themselves adjacent, so memory accesses coalesce perfectly and the full memory bus width and bandwidth of the GPU are exploited.
Further, in the method, shared memory is used as a cache for the convolution-kernel parameters, reducing global-memory I/O and accelerating the convolution. Specifically, a concurrent thread block first reads the kernel parameters into its shared memory, and each thread then fetches the kernel parameters it needs from that shared memory. This reduces the global-memory throughput the GPU needs to read the kernel parameters, and thus greatly optimizes and accelerates the execution of the convolution.
Further, in the method, shared memory or registers are used as a cache for the input image, reducing global-memory I/O and accelerating the convolution. Specifically, the input-image region on which a concurrent thread block depends is identified first; the thread block then reads this region into its shared memory, and each thread fetches the input data it needs from that shared memory. Alternatively, when the input data required by each thread is small enough, all of it is read at once into the thread's registers before computing. This reduces the global-memory throughput the GPU needs to read the input image, and thus greatly optimizes and accelerates the execution of the convolution.
Further, in the method, a deep-neural-network GPU acceleration technique fuses the convolution operation with the nonlinear operation; this reduces the global-memory throughput required by the convolution and nonlinear layers, speeding up the whole pipeline. Specifically, the deep-neural-network GPU acceleration technique merges the processing of the nonlinear layer into the convolution computation: the nonlinear operation is applied in registers immediately after the convolution completes, eliminating one round of I/O to global memory.
Further, in the method, a deep-convolutional-network GPU acceleration technique selects the best optimization for each convolution size. This technique benchmarks every optimization on convolutional layers of each size, and then selects the fastest one so as to obtain the fastest overall running speed.
Compared with the prior art, the present invention has the following significant advantages:
The present invention parallelizes and optimizes the convolution, accelerating a high-quality super-resolution method to the speed required for video processing without any loss of image quality.
Further, tasks are divided according to output pixels, realizing the parallelization of the convolution; every step of the super-resolution technique based on deep learning and convolutional neural networks is parallelized, so that the massive computing power of the GPU can be exploited;
Further, the characteristics of the GPU memory hierarchy are used to cache the convolution-kernel data and the input image data in shared memory and registers, greatly optimizing the computing speed of the convolution;
Further, the convolution and the nonlinear layer are fused, and the best optimization is chosen for each convolution size.
The present invention makes full use of the hardware and storage characteristics of the GPU and significantly accelerates the convolution computation, so that a high-quality super-resolution method can run at the speed required by practical work.
Brief description of the drawings
Other features, objects and advantages of the present invention will become apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is the flow chart of SRCNN;
Fig. 2 is a schematic diagram of convolution parallelization in a preferred embodiment of the present invention;
Fig. 3 is a schematic diagram of the improvement obtained by caching the convolution-kernel parameters in shared memory, in a preferred embodiment of the present invention;
Fig. 4 is a schematic diagram of the improvement obtained by caching input-image blocks in shared memory, in a preferred embodiment of the present invention;
Fig. 5 is a schematic diagram of the improvement obtained by fusing the convolution and the nonlinear computation, in a preferred embodiment of the present invention.
Detailed description of the invention
The present invention is described in detail below in conjunction with specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit it in any form. It should be noted that those of ordinary skill in the art may make several variations and improvements without departing from the concept of the invention; these all fall within the scope of protection of the present invention.
Fig. 1 shows the flow chart of SRCNN. As one embodiment of the present invention, the super-resolution GPU acceleration technique of the invention is applied to SRCNN, whose flow, shown in Fig. 1, comprises bicubic preprocessing (not shown), three convolutional layers and two ReLU layers (one after the first convolution and one after the second). The sizes of the three convolutional layers (as output channels * kernel width * kernel height * input channels) are: 64*9*9*1, 32*1*1*64 and 1*5*5*32. Super-resolving one 1080p image to 4K requires 66.6G floating-point multiply-add operations and 800 GBytes of memory I/O. Such a computational load obviously cannot be handled by a CPU within the time demanded by real production work. For this situation the present invention therefore uses the GPU: every step of the SRCNN pipeline is parallelized and implemented on the GPU, and the GPU hardware characteristics are fully exploited for optimization and acceleration.
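The 66.6G figure follows directly from the layer sizes listed above; as a worked check:

```latex
% Multiply-adds for one 3840*2160 single-channel output frame
\begin{align*}
\text{MACs per pixel} &= \underbrace{64\cdot9\cdot9\cdot1}_{\text{conv1}}
  + \underbrace{32\cdot1\cdot1\cdot64}_{\text{conv2}}
  + \underbrace{1\cdot5\cdot5\cdot32}_{\text{conv3}}
  = 5184 + 2048 + 800 = 8032,\\
\text{total} &= 8032 \times (3840\times2160) \approx 66.6\times10^{9}.
\end{align*}
```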
The present invention focuses its parallelization and optimization effort on the convolutions, because the bicubic preprocessing is computationally cheap and easy to implement on the GPU, the parallelization of the nonlinear ReLU layers is straightforward, and more than 95% of the running time is spent in the convolutions.
To understand how the SRCNN method is adapted to the GPU, and how the GPU parallel program is optimized, the GPU architecture is introduced first. Owing to physical limits, processor clock frequencies have barely improved over the past few years, and the computer industry has instead raised computing capability by increasing the number of processor cores; typical products are multi-core central processing units (CPUs) and graphics processing units (GPUs) with very many cores. A GPU has thousands of computing units and ultra-high-bandwidth device memory; for example, an Nvidia GTX 980 Ti has 2816 CUDA cores and 336 GB/s of global-memory bandwidth. If a large computing task is divided into tens of thousands or even millions of micro-tasks and handed to the GPU, the GPU schedules these micro-tasks onto its CUDA cores, which process them concurrently and efficiently, so that the GPU can execute hundreds of times faster than a CPU.
The GPU has a hierarchical storage mechanism; this invention uses the GPU's global memory, shared memory and registers. These three kinds of storage differ greatly in access bandwidth, latency, capacity and addressable scope. Global memory is accessible to all threads and has very large capacity (several GB), but its access bandwidth is the lowest, and it often becomes the bottleneck of the whole pipeline. Shared memory is a programmer-controlled cache: the computing units of the GPU are grouped into thread blocks, each containing a number of threads and an independent shared memory that all threads in that block can access, with very high bandwidth and low latency. Registers live inside each thread and have the highest bandwidth and the lowest latency, but very small capacity; keeping frequently reused data in registers greatly reduces memory-access overhead.
In the GPU-accelerated SRCNN technique illustrated by the present invention, the input image data is first transferred from host memory to device memory and bicubic preprocessing is performed; then the first convolutional layer (conv1), ReLU, the second convolutional layer (conv2), ReLU and the third convolutional layer (conv3) are executed in turn, and the data is finally transferred back from device memory to host memory. Every convolutional layer is parallelized by dividing tasks according to output pixels, so that the massive computing power of the GPU can be used. To further accelerate the convolutions, the invention caches the convolution-kernel data in shared memory, caches input-image blocks in shared memory or registers, and fuses the convolution with the nonlinear operation. In addition, for convolutions of different sizes the invention benchmarks the execution speed of the different convolution methods and chooses the fastest combination, so that the whole pipeline runs as fast as possible. A sketch of the per-frame flow appears below; the key technical details then follow.
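This is an illustrative host-side CUDA sketch only; the stage launchers (bicubicUpscale, conv1Relu, conv2Relu, conv3) are hypothetical stand-ins for the kernels discussed below, not function names taken from the patent.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-stage launchers (stand-ins for the kernels sketched below).
void bicubicUpscale(const float*, float*, int, int);
void conv1Relu(const float*, float*, int, int);
void conv2Relu(const float*, float*, int, int);
void conv3(const float*, float*, int, int);

// One frame: upload once, run every stage on the GPU, download once,
// so no intermediate data ever crosses the CPU/GPU boundary.
void srcnnFrame(const float* hostIn, float* hostOut,
                float* dIn, float* dUp, float* dF1, float* dF2, float* dOut,
                size_t inBytes, size_t outBytes, int w, int h)
{
    cudaMemcpy(dIn, hostIn, inBytes, cudaMemcpyHostToDevice);
    bicubicUpscale(dIn, dUp, w, h);   // cheap preprocessing, kept on the GPU
    conv1Relu(dUp, dF1, w, h);        // 64*9*9*1 convolution, ReLU fused
    conv2Relu(dF1, dF2, w, h);        // 32*1*1*64 convolution, ReLU fused
    conv3(dF2, dOut, w, h);           // 1*5*5*32 convolution
    cudaMemcpy(hostOut, dOut, outBytes, cudaMemcpyDeviceToHost);
}
```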
In a preferred embodiment, to parallelize the convolution, the invention divides the convolution task into millions of micro-tasks according to output pixels, referred to as the direct GPU implementation of convolution, as shown in Fig. 2. A convolution simply computes the value of every output pixel, so the computation of each output pixel can be assigned as an independent micro-task to one GPU thread; these micro-tasks are mutually independent and can run concurrently without communicating. An additional advantage of this division is that concurrently executing adjacent threads access adjacent input data: for example, while thread (x, y) accesses I(a, b), thread (x+1, y) accesses I(a+1, b), so the GPU hardware automatically merges these requests into a single coalesced access, making full use of the GPU's memory bus width and bandwidth. The remaining part of SRCNN (ReLU) is parallelized as well, so the entire SRCNN pipeline executes on the GPU, avoiding repeated data transfers between CPU and GPU. Through this parallelization of the convolution, the execution speed of SRCNN improves from 300 seconds per frame (on CPU) to 1 second per frame.
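A minimal CUDA sketch of this direct implementation is given below, assuming a single input and output channel and clamp-to-edge border handling for brevity; the kernel name and signature are illustrative, not the patent's actual code.

```cuda
// Direct convolution: one thread computes one output pixel. Adjacent
// threads along x read adjacent input pixels, so the hardware coalesces
// their global-memory accesses into wide transactions.
__global__ void convDirect(const float* in, const float* kern, float* out,
                           int width, int height, int kw, int kh)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    for (int j = 0; j < kh; ++j)                 // convolution window
        for (int i = 0; i < kw; ++i) {
            int ix = min(max(x + i - kw / 2, 0), width  - 1);  // clamp border
            int iy = min(max(y + j - kh / 2, 0), height - 1);
            acc += in[iy * width + ix] * kern[j * kw + i];
        }
    out[y * width + x] = acc;                    // one independent micro-task
}
```

Launched over a grid covering the whole output image (for example, 16*16 thread blocks), every thread is exactly one of the micro-tasks described above.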
Using the GPU's hierarchical storage mechanism, the present invention caches the convolution-kernel data and the input image data in shared memory or registers, accelerating the convolution by a further factor of 2 to 10.
In a preferred embodiment, the invention prefetches the convolution-kernel data into shared memory, saving the global-memory I/O spent on redundant kernel reads; this is referred to as the shared-kernel method, as shown in Fig. 3. In the direct implementation above, every thread reads the same kernel data, and this redundant reading wastes a large amount of global-memory I/O. In the shared-kernel method, a thread block first prefetches the kernel data into shared memory, and all threads in the block then fetch the kernel data they need from that shared memory. The shared memory thus acts as a cache of the kernel data, saving a large amount of the global-memory I/O spent reading it.
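A sketch of this shared-kernel variant, under the same single-channel assumption and with an illustrative fixed 5*5 kernel (any thread block of at least 25 threads can do the preload):

```cuda
#define KW 5
#define KH 5

// Shared-kernel convolution: the block stages the kernel weights in shared
// memory once, so each weight is fetched from global memory once per block
// instead of once per thread.
__global__ void convSharedKernel(const float* in, const float* kern,
                                 float* out, int width, int height)
{
    __shared__ float sk[KH * KW];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    if (tid < KH * KW) sk[tid] = kern[tid];   // cooperative preload
    __syncthreads();                          // weights now cached on-chip

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    for (int j = 0; j < KH; ++j)
        for (int i = 0; i < KW; ++i) {
            int ix = min(max(x + i - KW / 2, 0), width  - 1);
            int iy = min(max(y + j - KH / 2, 0), height - 1);
            acc += in[iy * width + ix] * sk[j * KW + i];  // shared-memory read
        }
    out[y * width + x] = acc;
}
```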
In a preferred embodiment, the invention prefetches input-image blocks into shared memory or registers, saving redundant global-memory reads of the input image; this is referred to as the shared-patch method or the registered-pixel method, as shown in Fig. 4. When the convolution width or height is greater than 1, adjacent output pixels depend on overlapping input-image blocks. The direct implementation does not exploit this overlap, so each thread reads input data redundantly, again wasting global-memory I/O; the larger the convolution width and height, the more serious the waste. In the shared-patch and registered-pixel methods of the invention, the input-image region on which a thread block depends is identified first, and this region is read into shared memory or into registers (registers are feasible only when the region is small enough to fit); each thread then fetches the input data it needs from the shared memory. The shared memory or registers thus act as a cache of the input image data, saving a large amount of the global-memory I/O spent reading it.
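A sketch of the shared-patch variant, assuming a 16*16 thread block, the same illustrative 5*5 kernel, and clamp-to-edge borders; the block cooperatively loads its output tile plus the surrounding halo before any thread computes:

```cuda
#define TILE 16
#define KW   5
#define KH   5

// Shared-patch convolution: the block stages the input region it depends on
// (its output tile plus a halo of KW/2 and KH/2 pixels) in shared memory,
// so the overlapping reads of neighbouring threads hit shared memory.
__global__ void convSharedPatch(const float* in, const float* kern,
                                float* out, int width, int height)
{
    __shared__ float patch[TILE + KH - 1][TILE + KW - 1];
    int x0 = blockIdx.x * TILE, y0 = blockIdx.y * TILE;

    // Cooperative, strided load of the tile-plus-halo region with clamping.
    for (int j = threadIdx.y; j < TILE + KH - 1; j += blockDim.y)
        for (int i = threadIdx.x; i < TILE + KW - 1; i += blockDim.x) {
            int ix = min(max(x0 + i - KW / 2, 0), width  - 1);
            int iy = min(max(y0 + j - KH / 2, 0), height - 1);
            patch[j][i] = in[iy * width + ix];
        }
    __syncthreads();

    int x = x0 + threadIdx.x, y = y0 + threadIdx.y;
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    for (int j = 0; j < KH; ++j)
        for (int i = 0; i < KW; ++i)
            acc += patch[threadIdx.y + j][threadIdx.x + i] * kern[j * KW + i];
    out[y * width + x] = acc;
}
```

The registered-pixel variant replaces the shared array with per-thread local variables, which the compiler keeps in registers; this is only feasible when the per-thread input window is small.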
In a preferred embodiment, the invention fuses the convolution and the nonlinear layer, eliminating the I/O overhead of the nonlinear layer, as shown in Fig. 5. Traditional acceleration of convolutional neural networks concentrates on the convolution, because the convolution is the bottleneck of the computation and because the nonlinear layer is considered hard to accelerate. However, once the convolution has been accelerated far enough, the time spent in the nonlinear layer can no longer be ignored. In a convolutional neural network, the nonlinear layer always follows a convolutional layer, and each of its output pixels depends on exactly one corresponding input pixel. The invention therefore merges the convolution and the nonlinear layer into a single pass: immediately after computing the convolution, each thread applies the nonlinear operation to the output value and only then writes it back to global memory. This removes the overhead of the convolutional layer writing to global memory and the nonlinear layer reading it back, which is equivalent to almost completely eliminating the computation time of the nonlinear layer.
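Fusing the ReLU costs a single extra instruction on the accumulator while it is still in a register; a sketch under the same single-channel assumptions as the direct kernel above:

```cuda
// Fused convolution + ReLU: the nonlinearity is applied to the register-
// resident accumulator before the single write-back, so the intermediate
// feature map is never written to and re-read from global memory.
__global__ void convRelu(const float* in, const float* kern, float* out,
                         int width, int height, int kw, int kh)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    for (int j = 0; j < kh; ++j)
        for (int i = 0; i < kw; ++i) {
            int ix = min(max(x + i - kw / 2, 0), width  - 1);
            int iy = min(max(y + j - kh / 2, 0), height - 1);
            acc += in[iy * width + ix] * kern[j * kw + i];
        }
    out[y * width + x] = fmaxf(acc, 0.0f);   // ReLU fused in-register
}
```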
In a preferred embodiment, the invention measures the running time of each optimization technique on each convolutional-layer size and then selects the fastest technique, so as to obtain the fastest overall running speed; the measured running times are shown in Table 1. The cuDNN entries use the convolution library provided by Nvidia. The results show that when the first convolutional layer uses cuDNN, the second uses the shared-kernel method combined with registered-pixel input caching, and the third uses the shared-kernel method combined with the shared-patch method, the whole pipeline runs fastest, finally reaching 0.15 seconds per frame, which is 2000 times the CPU speed.
Table 1: running time of each convolutional layer under each optimization method
Setup for the table above: an Nvidia GTX 980 Ti and dual Intel E5-2697 v2 @ 2.7 GHz 12-core processors, testing 1920*1080 to 3840*2160 single-channel super-resolution.
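The per-layer selection itself can be done once at start-up by timing each candidate implementation with CUDA events and keeping the fastest; a host-side sketch, in which the candidate launchers are hypothetical stand-ins for the variants above and the cuDNN path:

```cuda
#include <cuda_runtime.h>

typedef void (*ConvLauncher)(const float*, const float*, float*, int, int);

// Time each candidate convolution launcher once and return the fastest.
ConvLauncher pickFastest(ConvLauncher cands[], int n,
                         const float* dIn, const float* dKern, float* dOut,
                         int w, int h)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    ConvLauncher best = cands[0];
    float bestMs = 1e30f;
    for (int i = 0; i < n; ++i) {
        cudaEventRecord(t0);
        cands[i](dIn, dKern, dOut, w, h);   // one timed trial run
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        if (ms < bestMs) { bestMs = ms; best = cands[i]; }
    }
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return best;
}
```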
In summary, the present invention accelerates a high-quality super-resolution method to the speed required for video processing, without any loss of image quality.
Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to these particular embodiments; those skilled in the art may make various variations or modifications within the scope of the claims, and these do not affect the substance of the invention.

Claims (10)

1. A GPU acceleration method for deep learning super-resolution technology, characterized in that: every step of a super-resolution technique based on deep learning and convolutional neural networks is parallelized and run on a GPU; said parallelization divides the convolutions of the super-resolution technique based on deep learning and convolutional neural networks into parallel tasks, splitting each convolution operation into millions of mutually independent micro-tasks that can be executed concurrently in any order, so that the massive computing power of the GPU is brought into play.
2. The GPU acceleration method for deep learning super-resolution technology according to claim 1, characterized in that: in the method, tasks are divided according to the convolution output pixels, the computation of each output pixel being assigned to one micro-task, so that the convolution is executed in parallel on a large scale; the data on which the micro-tasks of neighboring pixels depend are themselves adjacent, so memory accesses coalesce perfectly, making full use of the memory bus width and bandwidth of the GPU.
3. The GPU acceleration method for deep learning super-resolution technology according to claim 1, characterized in that: in the method, shared memory is used as a cache for the convolution-kernel parameters, thereby reducing global-memory I/O and accelerating the convolution.
4. The GPU acceleration method for deep learning super-resolution technology according to claim 3, characterized in that: using shared memory as a cache for the convolution-kernel parameters means that a concurrent thread block first reads the kernel parameters into the thread block's shared memory, and each thread then obtains the kernel parameters it needs from that shared memory.
5. The GPU acceleration method for deep learning super-resolution technology according to claim 1, characterized in that: in the method, shared memory or registers are used as a cache for the input image, thereby reducing global-memory I/O and accelerating the convolution.
6. The GPU acceleration method for deep learning super-resolution technology according to claim 5, characterized in that: using shared memory or registers as a cache for the input image means that the input-image region on which a concurrent thread block depends is identified first, the thread block then reads this region into its shared memory, and each thread obtains the input data it needs from that shared memory; or alternatively, when the input data required by each thread is small enough, all of it is read at once into the thread's registers before computing.
7. The GPU acceleration method for deep learning super-resolution technology according to claim 1, characterized in that: in the method, a deep-neural-network GPU acceleration technique fuses the convolution operation and the nonlinear operation, reducing the global-memory throughput required by the convolution and nonlinear layers and thereby speeding up the execution of the whole pipeline.
8. The GPU acceleration method for deep learning super-resolution technology according to claim 7, characterized in that: said deep-neural-network GPU acceleration technique merges the processing of the nonlinear layer into the convolution computation, applying the nonlinear operation in registers immediately after the convolution completes, thereby eliminating one round of global-memory I/O.
9. The GPU acceleration method for deep learning super-resolution technology according to any one of claims 1-8, characterized in that: in the method, a deep-convolutional-network GPU acceleration technique selects the best optimization method for each convolution size.
10. The GPU acceleration method for deep learning super-resolution technology according to claim 9, characterized in that: said deep-convolutional-network GPU acceleration technique benchmarks each optimization technique on convolutional layers of different sizes and then selects the fastest technique, so as to obtain the fastest overall running speed.
CN201610184129.1A 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology Expired - Fee Related CN105869117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610184129.1A CN105869117B (en) 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610184129.1A CN105869117B (en) 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology

Publications (2)

Publication Number Publication Date
CN105869117A 2016-08-17
CN105869117B CN105869117B (en) 2021-04-02

Family

ID=56626131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610184129.1A Expired - Fee Related CN105869117B (en) 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology

Country Status (1)

Country Link
CN (1) CN105869117B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140085318A1 (en) * 2012-09-26 2014-03-27 Siemens Corporation Multi-GPU FISTA Implementation for MR Reconstruction with Non-Uniform K-Space Sampling
CN104778659A (en) * 2015-04-15 2015-07-15 杭州电子科技大学 Single-frame image super-resolution reconstruction method on basis of deep learning
CN105279741A (en) * 2015-11-17 2016-01-27 集美大学 Image super-resolution reconstruction method and system based on graph-cut algorithm

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
LINGQI ZHANG et al., "High accuracy digital image correlation powered by GPU-based parallel computing", Optics and Lasers in Engineering *
LIUBOV A. FLORES et al., "Parallel CT image reconstruction based on GPUs", Radiation Physics and Chemistry *
LIU Jinfeng et al., "A concise and efficient method for accelerating convolutional neural networks", Science Technology and Engineering *
LI Jiajun et al., "GPU-based parallel optimization of HOTPANTS", Astronomical Research & Technology *
LI Daxia, "A speed optimization of the cuda-convnet deep convolutional neural network algorithm", China Master's Theses Full-text Database, Information Science and Technology (monthly) *
HU Chuanping et al., "Research on image super-resolution algorithms based on deep learning", Journal of Railway Police College *
JIN Lu, "Research on a GPU-based surface topography measurement system", China Master's Theses Full-text Database, Information Science and Technology (monthly) *
CHEN Xiangji et al., "Real-time video super-resolution reconstruction based on GPU acceleration", Journal of Computer Applications *
MA Yongjun et al., "A parallel template-matching object-recognition algorithm for CPU+GPU heterogeneous platforms", Journal of Tianjin University of Science & Technology *

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106447609A (en) * 2016-08-30 2017-02-22 上海交通大学 Image super-resolution method based on depth convolutional neural network
WO2018068623A1 (en) * 2016-10-14 2018-04-19 腾讯科技(深圳)有限公司 Machine learning method and system
CN106779057A (en) * 2016-11-11 2017-05-31 北京旷视科技有限公司 The method and device of the calculating binary neural network convolution based on GPU
CN106779057B (en) * 2016-11-11 2020-04-17 北京旷视科技有限公司 Method and device for calculating binary neural network convolution based on GPU
CN108073548B (en) * 2016-11-14 2021-09-10 耐能股份有限公司 Convolution operation device and convolution operation method
CN108073548A (en) * 2016-11-14 2018-05-25 耐能股份有限公司 Convolution algorithm device and convolution algorithm method
US10936937B2 (en) 2016-11-14 2021-03-02 Kneron, Inc. Convolution operation device and convolution operation method
TWI634490B (en) * 2016-11-14 2018-09-01 美商耐能股份有限公司 Convolution operation device and convolution operation method
CN108268931A (en) * 2016-12-30 2018-07-10 华为技术有限公司 The methods, devices and systems of data processing
CN108268944A (en) * 2016-12-31 2018-07-10 上海兆芯集成电路有限公司 Neural network unit with the memory that can be remolded
CN108268944B (en) * 2016-12-31 2020-09-11 上海兆芯集成电路有限公司 Neural network unit with remodelable memory
CN107085827A (en) * 2017-04-27 2017-08-22 中国电子科技集团公司第二十八研究所 The super-resolution image recovery method realized based on hardware platform
CN107085827B (en) * 2017-04-27 2020-06-16 中国电子科技集团公司第二十八研究所 Super-resolution image restoration method based on hardware platform
WO2018196863A1 (en) * 2017-04-28 2018-11-01 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing methods and apparatuses, electronic device and storage medium
US11429852B2 (en) 2017-04-28 2022-08-30 Beijing Sensetime Technology Development Co., Ltd. Convolution acceleration and computing processing method and apparatus, electronic device, and storage medium
CN107515736B (en) * 2017-07-01 2021-01-15 广州深域信息科技有限公司 Method for accelerating computation speed of deep convolutional network on embedded equipment
CN107515736A (en) * 2017-07-01 2017-12-26 广州深域信息科技有限公司 A kind of method for accelerating depth convolutional network calculating speed on embedded device
CN107341127A (en) * 2017-07-05 2017-11-10 西安电子科技大学 Convolutional neural networks accelerated method based on OpenCL standards
CN107341127B (en) * 2017-07-05 2020-04-14 西安电子科技大学 Convolutional neural network acceleration method based on OpenCL standard
CN110084361A (en) * 2017-10-30 2019-08-02 上海寒武纪信息科技有限公司 A kind of arithmetic unit and method
CN110084361B (en) * 2017-10-30 2021-03-23 上海寒武纪信息科技有限公司 Arithmetic device and method
US11922132B2 (en) 2017-10-30 2024-03-05 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN108012156A (en) * 2017-11-17 2018-05-08 深圳市华尊科技股份有限公司 A kind of method for processing video frequency and control platform
CN108052891A (en) * 2017-12-08 2018-05-18 触景无限科技(北京)有限公司 Facial contour parallel calculating method and device
CN108062532A (en) * 2017-12-28 2018-05-22 北京智慧眼科技股份有限公司 Deep learning recognition of face network optimized approach, device and storage medium
CN110321998A (en) * 2018-03-31 2019-10-11 北京深鉴智能科技有限公司 Convolutional neural networks implementation method, device, acceleration equipment, storage medium
CN110321998B (en) * 2018-03-31 2022-06-14 赛灵思公司 Convolutional neural network implementation method and device, acceleration equipment and storage medium
CN108564524A (en) * 2018-04-24 2018-09-21 开放智能机器(上海)有限公司 A kind of convolutional calculation optimization method of visual pattern
CN110633785B (en) * 2018-06-21 2021-01-05 清华大学 Method and system for calculating convolutional neural network
CN110633785A (en) * 2018-06-21 2019-12-31 清华大学 Method and system for calculating convolutional neural network
CN109165723B (en) * 2018-08-03 2021-03-19 北京字节跳动网络技术有限公司 Method and apparatus for processing data
CN109165723A (en) * 2018-08-03 2019-01-08 北京字节跳动网络技术有限公司 Method and apparatus for handling data
US10497258B1 (en) 2018-09-10 2019-12-03 Sony Corporation Vehicle tracking and license plate recognition based on group of pictures (GOP) structure
CN109409513A (en) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 A kind of task processing method neural network based and relevant device
CN109740731A (en) * 2018-12-15 2019-05-10 华南理工大学 A kind of adaptive convolutional layer hardware accelerator design method
WO2020119318A1 (en) * 2018-12-15 2020-06-18 华南理工大学 Self-adaptive selection and design method for convolutional-layer hardware accelerator
CN109754073A (en) * 2018-12-29 2019-05-14 北京中科寒武纪科技有限公司 Data processing method, device, electronic equipment and readable storage medium storing program for executing
CN109886407A (en) * 2019-02-27 2019-06-14 上海商汤智能科技有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN109886407B (en) * 2019-02-27 2021-10-22 上海商汤智能科技有限公司 Data processing method and device, electronic equipment and computer readable storage medium
WO2020177250A1 (en) * 2019-03-06 2020-09-10 上海熠知电子科技有限公司 Data reading system and method
CN110009644A (en) * 2019-03-26 2019-07-12 深兰科技(上海)有限公司 A kind of method and apparatus of characteristic pattern row pixel segmentation
CN110188863A (en) * 2019-04-30 2019-08-30 杭州电子科技大学 A kind of convolution kernel and its compression algorithm of convolutional neural networks
CN110188863B (en) * 2019-04-30 2021-04-09 杭州电子科技大学 Convolution kernel compression method of convolution neural network suitable for resource-limited equipment
CN111914985B (en) * 2019-05-10 2023-07-04 杭州海康威视数字技术股份有限公司 Configuration method, device and storage medium of deep learning network model
CN111914985A (en) * 2019-05-10 2020-11-10 杭州海康威视数字技术股份有限公司 Configuration method and device of deep learning network model and storage medium
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
US11614964B2 (en) 2019-09-12 2023-03-28 Inspur Electronic Information Industry Co., Ltd. Deep-learning-based image processing method and system
WO2021047118A1 (en) * 2019-09-12 2021-03-18 浪潮电子信息产业股份有限公司 Image processing method, device and system
US20210118095A1 (en) * 2019-10-17 2021-04-22 Samsung Electronics Co., Ltd. Image processing apparatus and method
US11854159B2 (en) * 2019-10-17 2023-12-26 Samsung Electronics Co., Ltd. Image processing apparatus and method
WO2022121474A1 (en) * 2020-12-11 2022-06-16 苏州浪潮智能科技有限公司 Method and system for optimizing convolutional residual structure of neural network, device, and medium
CN113286174A (en) * 2021-05-21 2021-08-20 浙江商汤科技开发有限公司 Video frame extraction method and device, electronic equipment and computer readable storage medium
CN113806044B (en) * 2021-08-31 2023-11-07 天津大学 Heterogeneous platform task bottleneck eliminating method for computer vision application
CN113806044A (en) * 2021-08-31 2021-12-17 天津大学 Heterogeneous platform task bottleneck elimination method for computer vision application
CN114445687A (en) * 2021-12-31 2022-05-06 苏州浪潮智能科技有限公司 Image identification reasoning method, system, storage medium and equipment
CN114445687B (en) * 2021-12-31 2024-01-19 苏州浪潮智能科技有限公司 Image recognition reasoning method, system, storage medium and device

Also Published As

Publication number Publication date
CN105869117B (en) 2021-04-02

Similar Documents

Publication Publication Date Title
CN105869117A GPU acceleration method for deep learning super-resolution technology
DE102018117813A1 (en) Timely data reconstruction with an external recurrent neural network
CN109389556A Multi-scale dilated convolutional neural network super-resolution reconstruction method and device
CN106991011A Parallel and cooperative optimization method for big-data task processing based on CPU multithreading and multi-granularity GPU
DE112020004237T5 (en) VIDEO UPSAMPLING USING ONE OR MORE NEURAL NETWORKS
DE102020104637A1 (en) TECHNIQUES FOR EFFICIENT PARTITIONING OF MEMORY
Wu et al. How many labeled license plates are needed?
Zhou et al. RSANet: towards real-time object detection with residual semantic-guided attention feature pyramid network
CN105931256A (en) CUDA (compute unified device architecture)-based large-format remote sensing image fast segmentation method
CN113392968A (en) Micro-training for iterative small sample refinement of neural networks
DE112020000865T5 (en) STORAGE MANAGEMENT SYSTEM
JP2019212243A (en) Learning identification device and learning identification method
CN109191392A (en) A kind of image super-resolution reconstructing method of semantic segmentation driving
DE102021205690A1 (en) Training neural networks with limited data using invertible augmentation operators
JP2020077066A (en) Learning device and method for learning
DE102022121509A1 (en) SINGLE FRAME INVERSE RENDERING
CN106802787A MapReduce optimization method based on GPU sorting
CN106484532B GPGPU parallel computing method for SPH fluid simulation
DE112021000303T5 BARRIER-FREE AND BARRIERLESS SYNCHRONIZATION OF SHARED MEMORY
DE102021116231A1 (en) INTERFERENCE-FREE MULTIPLEXER
CN103942095B Two-dimensional phase unwrapping method based on a heterogeneous acceleration platform
CN108648213A Implementation method of the KCF tracking algorithm on TMS320C6657
CN105869105A GPU acceleration method for the A+ super-resolution technology
JP2021015523A (en) Learning device and learning method
Tan et al. PPEDNet: Pyramid pooling encoder-decoder network for real-time semantic segmentation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210402