CN105869117B - GPU acceleration method for deep learning super-resolution technology - Google Patents

Info

Publication number
CN105869117B
CN105869117B (application CN201610184129.1A)
Authority
CN
China
Prior art keywords
convolution
gpu
input image
memory
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610184129.1A
Other languages
Chinese (zh)
Other versions
CN105869117A (en)
Inventor
宋利
赵章宗
解蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201610184129.1A
Publication of CN105869117A
Application granted
Publication of CN105869117B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a GPU acceleration method for deep-learning super-resolution, which parallelizes every step of a super-resolution technique based on deep learning and a convolutional neural network and runs it entirely on the GPU. The parallelization divides the convolution of the super-resolution technique into millions of independent micro tasks that can be executed in parallel in any order, so that the massive parallel computing capability of the GPU is exploited. Furthermore, using the characteristics of the GPU memory hierarchy, the convolution kernel data and the input image data are cached in shared memory and registers, greatly accelerating the convolution computation; the convolution and nonlinear layers are fused; and the best optimization method is selected for each convolution size. The invention accelerates a high-quality super-resolution method to meet the speed requirement of video processing, without any loss of image quality.

Description

GPU acceleration method for deep learning super-resolution technology
Technical Field
The invention relates to the field of image super-resolution and GPU (Graphics Processing Unit) acceleration methods, and in particular to a GPU acceleration method for a deep-learning super-resolution technology.
Background
Image super-resolution converts a low-resolution image into a high-resolution image and has wide application in image post-processing and video non-linear editing. Early super-resolution methods (such as bicubic interpolation) are fast, reliable, and easy to integrate into chips; however, the high-resolution images they produce are of poor quality, with noticeable artifacts such as ringing, aliasing, and blurring. Super-resolution of such quality can hardly meet current demands for high-quality video. Today's best-performing super-resolution methods can generate high-quality images, but at a huge computational cost that makes them difficult to use in practical applications. There are some GPU-accelerated super-resolution methods that achieve sufficiently fast running speeds, but they sacrifice the quality of the result.
In recent years, deep learning technology has advanced greatly and markedly improved the accuracy of computer vision recognition; super-resolution techniques based on deep learning and convolutional neural networks have developed alongside it. The Super-Resolution Convolutional Neural Network (Chao Dong, Chen Change Loy, Kaiming He, Xiaoou Tang. Learning a Deep Convolutional Network for Image Super-Resolution. In Proceedings of the European Conference on Computer Vision (ECCV), 2014, pp. 184-199; SRCNN), published at the European Conference on Computer Vision in 2014, is among the best-performing methods. With a well-designed network of 3 convolution layers and 2 ReLU (nonlinear) layers, massive training data, and careful fine-tuning of training parameters, the SRCNN became the best-performing super-resolution method. However, it relies on a huge computational overhead: executing it on a CPU takes 300 seconds per frame (1920 × 1080 to 3840 × 2160, single channel; all tests below are at this resolution), and even with a GEMM-based GPU convolution acceleration method, each frame needs nearly 1 second, which cannot meet the needs of practical applications.
Disclosure of Invention
In order to enable the super-resolution technology based on deep learning and convolutional neural network to meet the requirements of practical application, the invention provides a GPU acceleration method aiming at the deep learning super-resolution technology.
In order to achieve the purpose, the GPU acceleration method aiming at the deep learning super-resolution technology provided by the invention parallelizes all steps of the super-resolution technology based on deep learning and convolutional neural network and runs in a GPU. The parallelization is to perform parallel task division on the convolution of the super-resolution technology based on deep learning and the convolutional neural network, and divide the convolution operation into millions of micro tasks which are not related and can be executed in parallel in any sequence, so that the ultra-strong parallel computing capability of the GPU is exerted.
Further, in the method: and performing task division according to the convolution output pixels, wherein the calculation task of each output pixel is distributed into one micro task, so that the convolution tasks can be executed in parallel on a large scale, and data depended by the calculation micro tasks of adjacent pixels are adjacent, thereby perfectly achieving merging access, and fully utilizing the video memory bit width and bandwidth of the GPU.
Further, in the method: shared memory is utilized as a cache for convolution kernel parameters, thereby reducing global memory I/O and speeding up convolution. Specifically, the convolution kernel parameters are first read by the concurrent thread block into the shared memory of the thread block, and then each thread retrieves the required convolution kernel parameters from the shared memory. The method can reduce the global memory throughput required by the GPU for reading the convolution kernel parameters, thereby greatly optimizing and accelerating the execution speed of convolution.
Further, in the method: shared memory or registers are utilized as a cache for the input image, thereby reducing global memory I/O and speeding up convolution. Specifically, an input image area on which a concurrent thread block depends is found out firstly, then the thread block reads the area data into a shared memory of the thread block, and finally each thread acquires the required input image data from the shared memory; or if the required input image data of each thread is small enough, the required data is directly read into the register in the thread once, and then calculation is carried out. The method can reduce the global memory throughput required by the GPU for reading the input image, thereby greatly optimizing and accelerating the execution speed of convolution.
Further, in the method: the method adopts a deep neural network GPU acceleration technology, combines convolution operation and nonlinear operation, and can reduce the global memory throughput required by convolution and nonlinear layers, thereby accelerating the execution speed of the whole process. Specifically, the deep neural network GPU acceleration technology refers to: the processing process of the nonlinear layer is fused in the convolution calculation, and the nonlinear layer calculation is carried out in the register immediately after the convolution calculation is finished, so that the I/O of a wheel to the global memory is omitted.
Further, in the method: and selecting an optimal optimization method aiming at different convolution sizes by adopting a deep convolution network GPU acceleration technology. The deep convolutional network GPU acceleration technology is as follows: and testing each optimized acceleration technology for the convolution layers with different sizes, and further selecting the fastest acceleration technology to obtain the method with the highest overall operation speed as possible.
Compared with the prior art, the invention has the following remarkable advantages:
the invention carries out parallelization and optimization acceleration aiming at convolution, accelerates a high-quality super-resolution method to meet the speed requirement of video processing, and does not bring any image quality loss.
Further, task division is carried out according to the output pixels, so that parallelization of convolution is realized; all steps of the super-resolution technology based on deep learning and convolutional neural networks are parallelized, so that the ultra-strong parallel computing capability of a GPU can be utilized;
furthermore, the convolution kernel data and the input image data are cached to a shared memory and a register by utilizing the characteristics of a GPU memory, so that the calculation speed of convolution is greatly optimized;
furthermore, convolution and non-linear layers are fused, and the optimal optimization is selected according to different convolution sizes.
The method fully utilizes the hardware and storage characteristics of the GPU, greatly accelerates the calculation speed of convolution, and therefore the super-resolution method with high quality can run at the speed meeting the actual working requirement.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow diagram of SRCNN;
FIG. 2 is a diagram of convolution parallelization in a preferred embodiment of the present invention;
FIG. 3 is a schematic illustration of the improvement from caching convolution kernel parameters in shared memory in a preferred embodiment of the present invention;
FIG. 4 is a schematic illustration of the improvement from caching input image tiles in shared memory in a preferred embodiment of the present invention;
FIG. 5 is a schematic illustration of the improvement by fusing convolution and non-linear calculations in a preferred embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
As shown in FIG. 1, a flow diagram of the SRCNN. As a preferred embodiment of the present invention, the super-resolution GPU acceleration technique of the present invention targets the SRCNN, whose pipeline (FIG. 1) comprises bicubic preprocessing (not shown), three convolution layers, and two ReLU layers (following the first and second convolutions respectively). The three convolution layer sizes are (output channels × kernel width × kernel height × input channels): 64 × 9 × 9 × 1, 32 × 1 × 1 × 64, 1 × 5 × 5 × 32. For one 1080P-to-4K image super-resolution, 66.6 G floating-point multiply-add operations and 800 GBytes of storage I/O are required. Obviously, such a computational load cannot meet the time requirements of actual work and production when run on a CPU. Therefore, the method uses the GPU: each step of the SRCNN pipeline is parallelized and implemented on the GPU, and the hardware characteristics of the GPU are fully exploited for optimization and acceleration.
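The 66.6 G figure can be checked directly from the layer shapes (a sketch assuming each layer produces a full-resolution output, i.e. padded convolutions, as the SRCNN does):

```python
# Multiply-add count for the SRCNN at 3840 x 2160, single channel.
# Layer shapes (out_ch, k_w, k_h, in_ch) as listed in the description.
layers = [(64, 9, 9, 1), (32, 1, 1, 64), (1, 5, 5, 32)]

pixels = 3840 * 2160                  # output pixels per (padded) layer
per_pixel = sum(oc * kw * kh * ic for oc, kw, kh, ic in layers)
total_macs = per_pixel * pixels       # ~66.6e9 multiply-adds
print(per_pixel, total_macs / 1e9)
```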
The method focuses on parallelizing and optimizing the convolution, because the bicubic preprocessing is computationally cheap and easy to implement on the GPU, the parallelization of the nonlinear ReLU layer is straightforward, and more than 95% of the time cost lies in the convolution.
To understand how the SRCNN method is adapted to the GPU and how the GPU parallel program is optimized, the architecture of the GPU is first introduced. Constrained by physical factors, processor clock frequencies have not increased substantially for years; the computer industry has instead grown computing capacity by increasing core counts, with typical products being the multi-core Central Processing Unit (CPU) and the many-core Graphics Processing Unit (GPU). A GPU has thousands of computing units and very high-bandwidth video memory; for example, the Nvidia GTX980TI has 2816 CUDA cores and a global memory bandwidth of 336 GB/s. If a large computing task is divided into tens of thousands or even millions of micro tasks and handed to the GPU, the GPU dispatches the micro tasks to the CUDA cores, which process them concurrently and efficiently, so that the GPU can execute hundreds of times faster than the CPU.
The GPU has a hierarchical storage mechanism; the present invention uses the GPU's global memory, shared memory, and registers. These three kinds of storage differ greatly in access bandwidth, latency, capacity, and which units can access them. Global memory is accessible to all threads and has large capacity (several GB), but the lowest access bandwidth, and is often the bottleneck of the whole pipeline. Shared memory is a programmer-controlled cache: the threads on the GPU are grouped into thread blocks, each block having a certain number of threads and its own shared memory; the shared memory is accessible to all threads within the block and offers high access bandwidth and low latency. Registers reside inside each thread and have the highest bandwidth and lowest latency but small capacity; keeping repeatedly used data in registers greatly reduces memory access overhead.
In the GPU acceleration SRCNN technology described in the invention, firstly, input image data is transferred from a memory to a video memory for bicubic preprocessing, then, a first layer of convolution (conv1), relu, a second layer of convolution (conv2), relu and a third layer of convolution (conv3) are sequentially carried out, and then, the data is transferred from the video memory to the memory. When each layer of convolution is carried out, parallelization of task division is carried out according to output pixels, so that the ultra-strong parallel computing capacity of the GPU can be utilized; in order to further accelerate the calculation speed of convolution, the invention provides that the shared memory is used for caching convolution kernel data, the shared memory or the register is used for caching input image block data, and convolution and nonlinear operation are fused; furthermore, the invention tests the execution speeds of different convolution methods aiming at the convolutions with different sizes, and selects the combination with the fastest speed to ensure that the whole process is as fast as possible. The key technical details of the invention are as follows.
In a preferred embodiment, to parallelize the convolution, the present invention divides the convolution task into millions of micro tasks by output pixel, called the direct convolution computation method (FIG. 2). The whole convolution consists of computing the value of each output pixel, so the computation of each output pixel can be dispatched to one GPU thread as an independent micro task; these micro tasks are independent, can run concurrently, and need no communication with each other. Another advantage of this division is that the input image data accessed by adjacent concurrently executing threads are also adjacent: for example, while thread (x, y) accesses I(a, b), thread (x+1, y) accesses I(a+1, b), so the access requests are automatically merged into one access by the GPU hardware, fully utilizing the video memory bit width and bandwidth of the GPU. The remaining part of the SRCNN (ReLU) is also parallelized, so that the whole SRCNN pipeline executes on the GPU and repeated data transfers between the CPU and the GPU are avoided. Through this parallelization of the convolution, the execution speed of the SRCNN is accelerated from 300 seconds/frame (using the CPU) to 1 second/frame.
By using a GPU hierarchical storage mechanism, the invention caches convolution kernel data and input image data into a shared memory or register, thereby accelerating the convolution by 2 to 10 times.
In a preferred embodiment, the present invention saves the global memory I/O overhead of redundant convolution kernel reads by pre-reading the convolution kernel data into shared memory, called the shared convolution kernel data method (shared kernel), as shown in FIG. 3. In the direct convolution computation method above, every thread reads the identical convolution kernel data, and this redundant repeated reading wastes a large amount of global memory I/O. In the shared convolution kernel data method, a thread block first pre-reads the convolution kernel data into shared memory, and then all threads in the block fetch the kernel data they need from shared memory. The shared memory thus acts as a cache of the convolution kernel data, saving the global memory I/O overhead of reading it repeatedly.
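The saving can be quantified with a short sketch (hypothetical helper name; one thread per output pixel, as in the direct method):

```python
def kernel_param_reads(out_pixels, params, threads_per_block):
    """Global-memory reads of convolution kernel parameters.

    naive:  every thread (one per output pixel) reads every parameter.
    shared: each thread block stages the parameters in shared memory
            once, so only one block-wide read per parameter is needed.
    """
    naive = out_pixels * params
    blocks = -(-out_pixels // threads_per_block)   # ceiling division
    shared = blocks * params
    return naive, shared
```

For conv1 (64 × 9 × 9 × 1 = 5184 parameters) at 3840 × 2160 output pixels with 256-thread blocks, global memory reads of kernel parameters drop by a factor of 256, i.e. the block size.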
In a preferred embodiment, the present invention saves the global memory I/O overhead of redundant input image reads by pre-reading input image block data into shared memory or registers, called the shared input image block method (shared patch) or the register-cached input image block method (registered pixel), as shown in FIG. 4. When performing a convolution whose width or height is greater than 1, adjacent output pixels depend on mutually overlapping input image blocks. The direct convolution computation method ignores this overlap, so input image data is redundantly read by each thread, again wasting global memory I/O; when the convolution width and height are large, this waste becomes very severe. In the shared input image block method and the register-cached input image block method, the input image region on which one thread block depends is first determined; that region is then read into shared memory or registers (register caching is feasible only when the region is small enough to fit in registers), and each thread fetches the input image data it needs from there. The shared memory or registers thus act as a cache of the input image data, saving the global memory I/O overhead of reading it repeatedly.
In a preferred embodiment, the present invention eliminates the I/O overhead of the nonlinear layer by fusing the convolution and nonlinear layers, as shown in FIG. 5. Conventional acceleration of convolutional neural networks focuses on the convolution, because convolution is the bottleneck of computation time and because the nonlinear layer offers little to accelerate on its own. However, once the convolution has been accelerated to this degree, the time overhead of the nonlinear layer is no longer negligible. In convolutional neural networks the nonlinear layer always follows the convolution layer, and each of its output pixels depends only on the corresponding input pixel. Therefore, the invention merges the convolution and the nonlinear layer into one pass: immediately after computing its convolution output, each thread applies the nonlinear operation to the value and only then writes it back to global memory. This eliminates the overhead of the convolution layer writing back to global memory and the nonlinear layer reading from it, which almost completely eliminates the computation time of the nonlinear layer.
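The fusion can be illustrated in serial numpy form (a sketch, not GPU code: the unfused version makes a second full-image pass for ReLU, while the fused version applies it while the value is still local to the computation; on the GPU "local" means a register):

```python
import numpy as np

def conv_then_relu(image, kernel):
    """Unfused: the convolution writes a full intermediate image to
    memory, then a second pass reads it back to apply ReLU."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    conv = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            conv[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return np.maximum(conv, 0.0)        # second full-image pass

def conv_fused_relu(image, kernel):
    """Fused: ReLU is applied to each output value immediately after
    it is computed, before the single write-back."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            v = np.sum(image[y:y + kh, x:x + kw] * kernel)
            out[y, x] = v if v > 0.0 else 0.0   # nonlinearity "in register"
    return out
```

Both functions produce identical results; only the intermediate write and read of the full image are removed by the fusion.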
In a preferred embodiment, the present invention times each optimized acceleration technique on each convolution layer size and selects the fastest one, so that the overall pipeline runs as fast as possible; the timing results are shown in Table 1. Here cuDNN is the convolution operator library provided by Nvidia. As can be seen, when the first convolution layer uses cuDNN, the second uses shared convolution kernel parameters with register-cached input image blocks, and the third uses shared convolution kernel parameters with shared input image blocks, the whole pipeline is fastest, finally reaching 0.15 second/frame, 2000 times the CPU speed.
TABLE 1 run time of each convolutional layer using each optimized acceleration method
(Table 1 is reproduced as an image in the original publication and is not included here.)
In the above table: single-channel super-resolution from 1920 × 1080 to 3840 × 2160 was tested, using an Nvidia GTX980TI and a dual Intel E5-2697 V2 @ 2.7 GHz 12-core processor.
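The per-layer selection amounts to an argmin over measured times. A sketch with illustrative placeholder timings (these are not the measured values of Table 1, though they are chosen so that the winning combination and the 0.15 second/frame total match the text):

```python
# Per-layer selection of the fastest measured technique.
# Times in seconds per frame: hypothetical placeholders, not Table 1.
timings = {
    "conv1": {"cuDNN": 0.050, "shared kernel": 0.080,
              "shared kernel + registered pixel": 0.070},
    "conv2": {"cuDNN": 0.060, "shared kernel": 0.045,
              "shared kernel + registered pixel": 0.040},
    "conv3": {"cuDNN": 0.090, "shared kernel": 0.070,
              "shared kernel + shared patch": 0.060},
}

# Pick, for each layer, the method with the smallest measured time.
best = {layer: min(methods, key=methods.get) for layer, methods in timings.items()}
total = sum(min(methods.values()) for methods in timings.values())
print(best, round(total, 3))
```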
In conclusion, the invention accelerates a high-quality super-resolution method to meet the speed requirement of video processing without any image quality loss.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (3)

1. A GPU acceleration method aiming at a deep learning super-resolution technology is characterized by comprising the following steps: all steps of a super-resolution technology based on deep learning and a convolutional neural network are parallelized and run in a GPU; the parallelization is to perform parallel task division on the convolution of the super-resolution technology based on deep learning and a convolutional neural network, and divide the convolution operation into millions of micro tasks which are not related and can be executed in parallel in any sequence, so that the ultra-strong parallel computing capability of the GPU is exerted;
firstly, transferring input image data from a memory to a video memory, performing bicubic preprocessing, then sequentially performing a first layer of convolution conv1, relu, a second layer of convolution conv2, relu and a third layer of convolution conv3, and then transferring the data from the video memory to the memory; when each layer of convolution is carried out, parallelization of task division is carried out according to output pixels, so that the ultra-strong parallel computing capacity of the GPU can be utilized; in order to further accelerate the calculation speed of convolution, the shared memory is used for caching convolution kernel data, the shared memory or the register is used for caching input image block data, and convolution and nonlinear operation are fused;
the method comprises the following steps: performing task division according to convolution output pixels, wherein a calculation task of each output pixel is distributed into a micro task, so that the convolution tasks can be executed in parallel on a large scale, and data depended by the calculation micro tasks of adjacent pixels are adjacent, thereby perfectly achieving merging access, and fully utilizing the video memory bit width and bandwidth of the GPU; specifically, the convolution task is divided into millions of micro tasks according to output pixels, which is called a convolution direct calculation method, the value of each output pixel is calculated by the whole convolution, so that the calculation task of each output pixel can be distributed to a GPU thread to be executed as an independent micro task, the micro tasks are independent, can be executed concurrently and do not need to communicate with each other, input image data accessed by adjacent concurrently executed threads are also adjacent, when the threads (x, y) access to I (a, b), the threads (x +1, y) access to I (a +1, b), and the access requests are automatically merged into one access by GPU hardware, so that the display memory bit width and bandwidth of the GPU are fully utilized; the rest relu of the SRCNN is also parallelized, so that the whole SRCNN process can be executed on a GPU, and repeated data transfer between the CPU and the GPU is avoided;
the method comprises the following steps: the shared memory is used as a cache of convolution kernel parameters, so that the global memory I/O is reduced and the convolution is accelerated; specifically, by pre-reading the convolution kernel data into the shared memory, the redundant global memory I/O overhead of the convolution kernel data can be saved, which is called a shared convolution kernel data method; in the direct convolution calculation method, each thread reads completely same convolution kernel data, and redundant repeated reading generates a large amount of global memory I/O waste; in the method for sharing the convolution kernel data, a thread block reads the convolution kernel data into a shared memory in advance, then all threads in the thread block acquire the needed convolution kernel data from the shared memory, and the shared memory is a high-speed cache of the convolution kernel data, so that the global memory I/O overhead for reading a large amount of convolution kernel data is saved;
the method comprises the following steps: a shared memory or a register is used as a high-speed cache of an input image, so that the global memory I/O is reduced and the convolution is accelerated; by pre-reading the input image block data into the shared memory or the register, the redundant input image data global memory I/O overhead can be saved, which is called as a shared input image block method or a register-cached input image block method; when the convolution with the width or height larger than 1 is carried out, adjacent output pixels depend on the mutually overlapped input image blocks, the direct convolution calculation method does not consider the overlapping relation, so that input image data are redundantly read into each thread, and the waste of global memory I/O is also brought, and when the convolution with the width or height larger than 1, the waste of the I/O becomes very serious; in the input image block method sharing the input image block method and the register cache, the using the shared memory or the register as the cache of the input image means: firstly, finding out an input image area on which a concurrent thread block depends, reading the area data into a shared memory of the thread block by the thread block, and finally acquiring required input image data from the shared memory by each thread; or if the image data required to be input by each thread is small enough, the required data is directly read into the register in the thread once, and then calculation is carried out, and the shared memory or the register is the cache of the input image data, so that the global memory I/O expense for reading a large amount of input image data is saved;
the method adopts a deep neural network GPU acceleration technology, combines convolution operation and nonlinear operation, reduces the global memory throughput required by convolution and nonlinear layers, and accordingly accelerates the execution speed of the whole process; the convolution and the nonlinear layer are integrated into a process, after the convolution is executed, each thread immediately carries out nonlinear operation on an output value and then writes back the output value to the global memory, the overhead of writing back the convolution layer to the global memory and reading the global memory by the nonlinear layer is avoided, and the process is equivalent to almost completely eliminating the calculation time of the nonlinear layer.
2. The GPU acceleration method for the deep learning super-resolution technique of claim 1, characterized in that: the method comprises the following steps: and selecting an optimal optimization method aiming at different convolution sizes by adopting a deep convolution network GPU acceleration technology.
3. The GPU acceleration method for the deep learning super-resolution technique of claim 2, characterized in that: the deep convolutional network GPU acceleration technology is as follows: and testing each optimized acceleration technology for the convolution layers with different sizes, and further selecting the fastest acceleration technology to obtain the method with the highest overall operation speed as possible.
CN201610184129.1A 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology Expired - Fee Related CN105869117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610184129.1A CN105869117B (en) 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610184129.1A CN105869117B (en) 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology

Publications (2)

Publication Number Publication Date
CN105869117A CN105869117A (en) 2016-08-17
CN105869117B true CN105869117B (en) 2021-04-02

Family

ID=56626131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610184129.1A Expired - Fee Related CN105869117B (en) 2016-03-28 2016-03-28 GPU acceleration method for deep learning super-resolution technology

Country Status (1)

Country Link
CN (1) CN105869117B (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106447609A (en) * 2016-08-30 2017-02-22 上海交通大学 Image super-resolution method based on depth convolutional neural network
CN106529679B (en) * 2016-10-14 2020-01-14 腾讯科技(上海)有限公司 Machine learning method and system
CN106779057B (en) * 2016-11-11 2020-04-17 北京旷视科技有限公司 Method and device for calculating binary neural network convolution based on GPU
CN108073548B (en) * 2016-11-14 2021-09-10 耐能股份有限公司 Convolution operation device and convolution operation method
TWI634490B (en) * 2016-11-14 2018-09-01 美商耐能股份有限公司 Convolution operation device and convolution operation method
CN108268931B (en) * 2016-12-30 2022-10-25 华为技术有限公司 Data processing method, device and system
US10586148B2 (en) * 2016-12-31 2020-03-10 Via Alliance Semiconductor Co., Ltd. Neural network unit with re-shapeable memory
CN107085827B (en) * 2017-04-27 2020-06-16 中国电子科技集团公司第二十八研究所 Super-resolution image restoration method based on hardware platform
CN108229645B (en) * 2017-04-28 2021-08-06 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing method and device, electronic equipment and storage medium
CN107515736B (en) * 2017-07-01 2021-01-15 广州深域信息科技有限公司 Method for accelerating computation speed of deep convolutional network on embedded equipment
CN107341127B (en) * 2017-07-05 2020-04-14 西安电子科技大学 Convolutional neural network acceleration method based on OpenCL standard
CN107895191B (en) 2017-10-30 2022-02-22 上海寒武纪信息科技有限公司 Information processing method and related product
CN108012156B (en) * 2017-11-17 2020-09-25 深圳市华尊科技股份有限公司 Video processing method and control platform
CN108052891A (en) * 2017-12-08 2018-05-18 触景无限科技(北京)有限公司 Facial contour parallel calculating method and device
CN108062532B (en) * 2017-12-28 2020-07-28 智慧眼科技股份有限公司 Deep learning face recognition network optimization method, device and storage medium
CN110321998B (en) * 2018-03-31 2022-06-14 赛灵思公司 Convolutional neural network implementation method and device, acceleration equipment and storage medium
CN108564524A (en) * 2018-04-24 2018-09-21 开放智能机器(上海)有限公司 A kind of convolutional calculation optimization method of visual pattern
CN110633785B (en) * 2018-06-21 2021-01-05 清华大学 Method and system for calculating convolutional neural network
CN109165723B (en) * 2018-08-03 2021-03-19 北京字节跳动网络技术有限公司 Method and apparatus for processing data
KR20200025200A (en) * 2018-08-29 2020-03-10 삼성전자주식회사 Electronic devices and methods of operating electronic devices
US10497258B1 (en) 2018-09-10 2019-12-03 Sony Corporation Vehicle tracking and license plate recognition based on group of pictures (GOP) structure
US11996105B2 (en) 2018-09-13 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109409513B (en) * 2018-10-10 2021-03-12 广州市百果园信息技术有限公司 Task processing method based on neural network and related equipment
CN109740731B (en) * 2018-12-15 2023-07-18 华南理工大学 Design method of self-adaptive convolution layer hardware accelerator
CN111461296B (en) * 2018-12-29 2023-09-22 中科寒武纪科技股份有限公司 Data processing method, electronic device, and readable storage medium
CN109886407B (en) * 2019-02-27 2021-10-22 上海商汤智能科技有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN110032538B (en) * 2019-03-06 2020-10-02 上海熠知电子科技有限公司 Data reading system and method
CN110009644B (en) * 2019-03-26 2021-02-23 深兰科技(上海)有限公司 Method and device for segmenting line pixels of feature map
CN110188863B (en) * 2019-04-30 2021-04-09 杭州电子科技大学 Convolution kernel compression method of convolution neural network suitable for resource-limited equipment
CN111914985B (en) * 2019-05-10 2023-07-04 杭州海康威视数字技术股份有限公司 Configuration method, device and storage medium of deep learning network model
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN110659119B (en) * 2019-09-12 2022-08-02 浪潮电子信息产业股份有限公司 Picture processing method, device and system
KR102624027B1 (en) * 2019-10-17 2024-01-11 삼성전자주식회사 Image processing apparatus and method
CN112633470B (en) * 2020-12-11 2023-01-06 苏州浪潮智能科技有限公司 Method, system, device and medium for optimizing neural network convolution residual structure
CN113286174B (en) * 2021-05-21 2022-11-08 浙江商汤科技开发有限公司 Video frame extraction method and device, electronic equipment and computer readable storage medium
CN113806044B (en) * 2021-08-31 2023-11-07 天津大学 Heterogeneous platform task bottleneck eliminating method for computer vision application
CN114445687B (en) * 2021-12-31 2024-01-19 苏州浪潮智能科技有限公司 Image recognition reasoning method, system, storage medium and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9466102B2 (en) * 2012-09-26 2016-10-11 Siemens Corporation Multi-GPU FISTA implementation for MR reconstruction with non-uniform K-space sampling
CN104778659A (en) * 2015-04-15 2015-07-15 杭州电子科技大学 Single-frame image super-resolution reconstruction method on basis of deep learning
CN105279741A (en) * 2015-11-17 2016-01-27 集美大学 Image super-resolution reconstruction method and system based on graph-cut algorithm

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"A Speed Optimization of the cuda-convnet Deep Convolutional Neural Network Algorithm"; Li Daxia; China Master's Theses Full-text Database, Information Science & Technology (Monthly); 2016-03-15 (No. 03); Section 4.5 *
"GPU-Based Parallel Optimization of HOTPANTS"; Li Jiajun et al.; Astronomical Research & Technology; April 2014; Vol. 11, No. 2; pp. 184-191 *
"Research on a GPU-Based Surface Topography Measurement System"; Jin Lu; China Master's Theses Full-text Database, Information Science & Technology (Monthly); 2011-08-15 (No. 08); Section 1.3.2 *
"A Parallel Template-Matching Target Recognition Algorithm for CPU+GPU Heterogeneous Platforms"; Ma Yongjun et al.; Journal of Tianjin University of Science and Technology; August 2014; Vol. 29, No. 4; pp. 48-52 *

Also Published As

Publication number Publication date
CN105869117A (en) 2016-08-17

Similar Documents

Publication Publication Date Title
CN105869117B (en) GPU acceleration method for deep learning super-resolution technology
DE102018132069A1 (en) Equivariant landmark transformation for landmark localization
US20230229931A1 (en) Neural processing apparatus and method with neural network pool processing
DE102018133555A1 (en) Computational optimization mechanism for deep neural networks
Du et al. Anchor-based plain net for mobile image super-resolution
US7734442B2 (en) Apparatus and method for a test and measurement instrument
DE102020101814A1 (en) EFFICIENT EXECUTION BASED ON TASK GRAPHS OF DEFINED WORKLOADS
CN111028360B (en) Data reading and writing method and system in 3D image processing, storage medium and terminal
DE102020112826A1 (en) PROCESS FOR EFFICIENT PERFORMANCE OF DATA REDUCTION IN PARALLEL PROCESSING UNITS
DE102019103319A1 (en) STOCHASTIC ROUNDING OF NUMBER VALUES
US9460489B2 (en) Image processing apparatus and image processing method for performing pixel alignment
Rahman et al. Parallel implementation of a spatio-temporal visual saliency model
Zhao et al. GPU accelerated high-quality video/image super-resolution
Cadenas et al. Parallel pipelined array architectures for real-time histogram computation in consumer devices
US10353591B2 (en) Fused shader programs
KR102064581B1 (en) Apparatus and Method for Interpolating Image Autoregressive
DE102020134345A1 (en) TECHNOLOGY FOR LEARNING AND DOWNLOADING FREQUENT MEMORY ACCESS AND CALCULATION PATTERNS
CN110289861A (en) The half precision compressed sensing method of sampling
CN113344765B (en) Frequency domain astronomical image target detection method and system
Guo et al. A novel lightweight multi-dimension feature fusion network for single-image super-resolution reconstruction
CN202093573U (en) Parallel acceleration device used in industrial CT image reconstruction
KR101672539B1 (en) Graphics processing unit and caching method thereof
DE102022112459A1 (en) TECHNIQUES FOR EFFICIENTLY SYNCHRONIZING MULTIPLE PROGRAM THREADS
Fu et al. A CPU-GPU data transfer optimization approach based on code migration and merging
Oo The Improvement of 1D Gaussian Blur Filter using AVX and OpenMP

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210402
